Joining Data¶

Introduction¶

You have the tools to obtain data from a single table in whatever format you want it. But what if the data you want is spread across multiple tables?

That's where JOIN comes in! JOIN is incredibly important in practical SQL workflows. So let's get started.

Example¶

We'll use our imaginary pets table, which has three columns:

ID - ID number for the pet
Name - name of the pet
Animal - type of animal

We'll also add another table, called owners. This table also has three columns:

ID - ID number for the owner (different from the ID number for the pet)
Name - name of the owner
Pet_ID - ID number for the pet that belongs to the owner (which matches the ID number for the pet in the pets table)

To get information that applies to a certain pet, we match the ID column in the pets table to the Pet_ID column in the owners table.

For example,

the pets table shows that Dr. Harris Bonkers is the pet with ID 1.
The owners table shows that Aubrey Little is the owner of the pet with ID 1.

Putting these two facts together, Dr. Harris Bonkers is owned by Aubrey Little.

Fortunately, we don't have to do this by hand to figure out which owner goes with which pet. In the next section, you'll learn how to use JOIN to create a new table combining information from the pets and owners tables.

JOIN¶

Using JOIN, we can write a query to create a table with just two columns: the name of the pet and the name of the owner.

We combine information from both tables by matching rows where the ID column in the pets table matches the Pet_ID column in the owners table.

In the query, ON determines which column in each table to use to combine the tables. Notice that since the ID column exists in both tables, we have to clarify which one to use. We use p.ID to refer to the ID column from the pets table, and o.Pet_ID refers to the Pet_ID column from the owners table.

In general, when you're joining tables, it's a good habit to specify which table each of your columns comes from. That way, you don't have to pull up the schema every time you go back to read the query.

The type of JOIN we're using today is called an INNER JOIN. That means that a row will only be put in the final output table if the value in the columns you're using to combine them shows up in both the tables you're joining. For example, if Tom's ID number of 4 didn't exist in the pets table, we would only get 3 rows back from this query. There are other types of JOIN, but an INNER JOIN is very widely used, so it's a good one to start with.

Example: How many files are covered by each type of software license?¶

GitHub is the most popular place to collaborate on software projects. A GitHub repository (or repo) is a collection of files associated with a specific project.

Most repos on GitHub are shared under a specific legal license, which determines the legal restrictions on how they are used. For our example, we're going to look at how many different files have been released under each license.

We'll work with two tables in the database. The first table is the licenses table, which provides the name of each GitHub repo (in the repo_name column) and its corresponding license. Here's a view of the first five rows.

In [1]:

            
                Copied!
                
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "github_repos" dataset
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "licenses" table
licenses_ref = dataset_ref.table("licenses")

# API request - fetch the table
licenses_table = client.get_table(licenses_ref)

# Preview the first five lines of the "licenses" table
client.list_rows(licenses_table, max_results=5).to_dataframe()

from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "github_repos" dataset
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "licenses" table
licenses_ref = dataset_ref.table("licenses")

# API request - fetch the table
licenses_table = client.get_table(licenses_ref)

# Preview the first five lines of the "licenses" table
client.list_rows(licenses_table, max_results=5).to_dataframe()

Using Kaggle's public dataset BigQuery integration.

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:19: UserWarning: Cannot use bqstorage_client if max_results is set, reverting to fetching data with the tabledata.list endpoint.

Out[1]:

	repo_name	license
0	zoffixznet/perl6-App-Nopaste	artistic-2.0
1	vikt0rs/devstack-plugin-zmq	artistic-2.0
2	LegitTalon/nicolas	artistic-2.0
3	Arabidopsis-Information-Portal/Intern-Hello-World	artistic-2.0
4	andrewdonkin/falcon	artistic-2.0

The second table is the sample_files table, which provides, among other information, the GitHub repo that each file belongs to (in the repo_name column). The first several rows of this table are printed below.

In [2]:

            
                Copied!
                
# Construct a reference to the "sample_files" table
files_ref = dataset_ref.table("sample_files")

# API request - fetch the table
files_table = client.get_table(files_ref)

# Preview the first five lines of the "sample_files" table
client.list_rows(files_table, max_results=5).to_dataframe()

# Construct a reference to the "sample_files" table
files_ref = dataset_ref.table("sample_files")

# API request - fetch the table
files_table = client.get_table(files_ref)

# Preview the first five lines of the "sample_files" table
client.list_rows(files_table, max_results=5).to_dataframe()

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:8: UserWarning: Cannot use bqstorage_client if max_results is set, reverting to fetching data with the tabledata.list endpoint.

Out[2]:

	repo_name	ref	path	mode	id	symlink_target
0	git/git	refs/heads/master	RelNotes	40960	62615ffa4e97803da96aefbc798ab50f949a8db7	Documentation/RelNotes/2.10.0.txt
1	np/ling	refs/heads/master	tests/success/plug_compose.t/plug_compose.ll	40960	0c1605e4b447158085656487dc477f7670c4bac1	../../../fixtures/all/plug_compose.ll
2	np/ling	refs/heads/master	fixtures/strict-par-success/parallel_assoc_lef...	40960	b59bff84ec03d12fabd3b51a27ed7e39a180097e	../all/parallel_assoc_left.ll
3	np/ling	refs/heads/master	fixtures/sequence/parallel_assoc_2tensor2_left.ll	40960	f29523e3fb65702d99478e429eac6f801f32152b	../all/parallel_assoc_2tensor2_left.ll
4	np/ling	refs/heads/master	fixtures/success/my_dual.ll	40960	38a3af095088f90dfc956cb990e893909c3ab286	../all/my_dual.ll

Next, we write a query that uses information in both tables to determine how many files are released in each license.

In [3]:

            
                Copied!
                
                    
                    
                
                

        
# Query to determine the number of files per license, sorted by number of files
query = """
        SELECT L.license, COUNT(1) AS number_of_files
        FROM `bigquery-public-data.github_repos.sample_files` AS sf
        INNER JOIN `bigquery-public-data.github_repos.licenses` AS L 
            ON sf.repo_name = L.repo_name
        GROUP BY L.license
        ORDER BY number_of_files DESC
        """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
file_count_by_license = query_job.to_dataframe()
# Query to determine the number of files per license, sorted by number of files
query = """
        SELECT L.license, COUNT(1) AS number_of_files
        FROM `bigquery-public-data.github_repos.sample_files` AS sf
        INNER JOIN `bigquery-public-data.github_repos.licenses` AS L 
            ON sf.repo_name = L.repo_name
        GROUP BY L.license
        ORDER BY number_of_files DESC
        """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
file_count_by_license = query_job.to_dataframe()

/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/client.py:440: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
  "Cannot create BigQuery Storage client, the dependency "

It's a big query, and so we'll investigate each piece separately.

We'll begin with the JOIN (highlighted in blue above). This specifies the sources of data and how to join them. We use ON to specify that we combine the tables by matching the values in the repo_name columns in the tables.

Next, we'll talk about SELECT and GROUP BY (highlighted in yellow). The GROUP BY breaks the data into a different group for each license, before we COUNT the number of rows in the sample_files table that corresponds to each license. (Remember that you can count the number of rows with COUNT(1).)

Finally, the ORDER BY (highlighted in purple) sorts the results so that licenses with more files appear first.

It was a big query, but it gave us a nice table summarizing how many files have been committed under each license:

In [4]:

            
                Copied!
                
# Print the DataFrame
file_count_by_license
# Print the DataFrame
file_count_by_license

Out[4]:

	license	number_of_files
0	mit	20418035
1	gpl-2.0	16792944
2	apache-2.0	7106098
3	gpl-3.0	4902118
4	bsd-3-clause	2929339
5	agpl-3.0	1292022
6	lgpl-2.1	792881
7	bsd-2-clause	694532
8	lgpl-3.0	563605
9	mpl-2.0	472843
10	cc0-1.0	405920
11	epl-1.0	320146
12	unlicense	208872
13	artistic-2.0	148414
14	isc	117503

You'll use JOIN clauses a lot and get very efficient with them as you get some practice.

Your turn¶

You are on the last step. Finish it by solving these exercises.

Have questions or comments? Visit the course discussion forum to chat with other learners.