Joining two data sets to supplement or filter

Published in

learn data science

7 min readMar 23, 2016

Worst departure delay time by Carrier in Jan, 2014

One of the common operations when you work with data is to bring another data and join or merge it to the current data set you are working on. If we bring additional columns from the new data we call it ‘join’, if we bring additional rows from the new data then we call it ‘merge’ or ‘combine’.

Not surprisingly, dplyr has a very rich functionality in this area. Not only does it provide the essential data joining features like ‘Left join’, ‘Full join’, ‘Inner join’, but also it provides something called ‘filtering join’ operation, which is to use the target data to filter the current data without brining the extra columns from the target data.

In this post and the next post, I am going to talk about how to bring another data to supplement the current data, find out if there is any data that doesn’t have the matching data in the target data, and merge two data sets to create a better data to join.

Let’s start with the basic ‘left_join’ operation first.

Find the worst flight for each airline carrier

First, let’s take a look at the flight data I have been using. You can check this post to see how you can get the data and import.

flight %>% 
 select(FL_DATE, CARRIER, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME)

Now I want to know what are the flights that had the worst departure time delay for each airline carrier (CARRIER). I can quickly set a grouping level at ‘CARRIER’ then use ‘top_n()’ function to get the worst from each ‘CARRIER’ like below.

flight %>% 
 select(FL_DATE, CARRIER, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME) %>%
 group_by(CARRIER) %>%
 top_n(1, DEP_DELAY)

‘top_n()’ function takes the bigger number to be the higher ranking. If you want to do the opposite order then you can use desc() function to wrap around ‘DEP_DELAY’.

Anyway, you can see the worst flight from each of the ‘CARRIER’, but the problem is that most of us including me have no clue with the most of the values presented in ‘CARRIER’ column. These are supposed to be 2 letter abbreviation codes that would correspond to the actual airline company names.

We need to find and get a mapping table data that has a list of the corresponding airline carrier codes and names, then we can join this ‘flight’ data frame with the ‘carrier code’ data.

Scrape data from a web page

This day and age we can find such mapping data easily, just a single click with google search, you’ll find bunch. The only problem is, many of such data can be on those web pages and don’t have ‘download’ option to download it in, say, Excel or CSV. So you want to scrape that data on the web pages and dump that into a data frame.

In the previous post, I have demonstrated how you can quickly scrape data from web pages. If you missed it I’d recommend you check that out to find the detail.

Just for a quick refresh, here’s a web site that has the airline carrier code and name mapping. It is from US Federal Aviation Administration (FAA), so it sounds legit. ;)

http://aspmhelp.faa.gov/index.php/ASQP:_Carrier_Codes_and_Names

And here’s how you can scrape the table data on the page and dump it into a data frame.

library(rvest)faa_mapping_html <- read_html("http://aspmhelp.faa.gov/index.php/ASQP:_Carrier_Codes_and_Names")faa_mapping <- faa_mapping_html %>%
 html_node("table") %>%
 html_table()

Again, please check this post for the detail about how to scrape the data from web pages like above.

Clean up the data from web

Before getting into joining this data, let’s do a few clean ups quickly on this airline carrier mapping data, ‘faa_mapping’.

First, let’s rename the column names to be something that is easy for later use like below.

faa_mapping %>%
 rename(code = `IATA Carrier Code`, name = `Carrier Name`)

And, we don’t really need a column that holds 3 letters abbreviation for this exercise, so let’s drop that with ‘select()’ function. Check this post for the detail of column selection and rename.

faa_mapping %>%
 rename(code = `IATA Carrier Code`, name = `Carrier Name`) %>%
 select(-`ICAO Carrier Code`)

The output looks good, so I’m going to save this data into a data frame like below so that I can join this to ‘flight’ data frame.

faa_mapping <- faa_mapping %>%
 rename(code = `IATA Carrier Code`, name = `Carrier Name`) %>%
 select(-`ICAO Carrier Code`)

Finally, we are ready for the join!

Join functions from dplyr

dplyr provides rich ‘join’ functionality with variety of functions. It has basically two types of join functions. One is to bring the columns from the target data frame to enrich or supplement the original data frame with the added columns values.

These are:

left_join
right_join
full_join
inner_join

And another is to not bring the columns but use the data from the target data frame to filter the data in the original data frame.

These are:

semi_join
anti_join

Instead of explaining the differences among these, let’s start taking a look from ‘left_join’, which is one of the most common operation when it comes to joining data.

Left join

Let’s take a look at the data from ‘flight’ data frame again with the command below.

flight %>% 
 select(FL_DATE, CARRIER, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME) %>%
 group_by(CARRIER) %>%
 top_n(1, DEP_DELAY)

Again, we’re looking at the worst flight from each airline carrier in Jan 2014.

Now, we want to use ‘CARRIER’ column from this data frame and ‘Code’ column from the previously scraped data in ‘faa_mapping’ data frame to join the two data frames. You can use ‘left_join()’ function like below.

flight %>% 
 select(FL_DATE, CARRIER, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME) %>%
 group_by(CARRIER) %>%
 top_n(1, DEP_DELAY) %>%
 left_join(faa_mapping, by = c("CARRIER" = "code"))

‘by’ argument inside ‘left_join()’ function is where you can set which columns from the two data frames you want to use to join the two data frames. In this case, that is, ‘CARRIER’ column from ‘flight’ data frame, and ‘code’ column from ‘faa_mapping’ data frame. Once you run this, you’ll get something like below.

We can see that ‘name’ column has been added to ‘flight’ data frame. Now we know what each of the CARRIER code means, except for two CARRIER codes, ‘FL’ and ‘VX’!

Let’s see what we can do with this with other dplyr’s ‘join’ functions.

Inner Join

Now, if we care only the flights that are listed in the FAA carrier code mapping data then we can forget about these two CARRIER codes that didn’t have the corresponding values in the mapping data, and we can use ‘inner_join()’ function instead of ‘left_join()’ function.

flight %>% 
 select(FL_DATE, CARRIER, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME) %>%
 group_by(CARRIER) %>%
 top_n(1, DEP_DELAY) %>%
 inner_join(faa_mapping, by = c("CARRIER" = "code"))

As you can see, ‘FL’ and ‘VX’ are gone now, that’s because ‘inner_join’ would keep only the rows with the corresponding values in the target data frame.

Filter Join with Anti join and Semi join

Sometimes you are not necessary looking for the values from the target data frame. You’d rather want to use the target data frame to filter the data in the original data frame. There are two ways to do this, one is to keep the rows that have the corresponding values in the target data frame.

flight %>% 
 select(FL_DATE, CARRIER, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME) %>%
 group_by(CARRIER) %>%
 top_n(1, DEP_DELAY) %>%
 semi_join(faa_mapping, by = c("CARRIER" = "code"))

Another is to keep the rows that don’t have the corresponding values in the target data frame. In this particular scenario, this operation can help to see how many rows will end up not having the mapping values after the join operation. So, if you run something like below,

flight %>% 
 select(FL_DATE, CARRIER, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME) %>%
 anti_join(faa_mapping, by = c("CARRIER" = "code"))

you’ll get something like below.

It shows 14,215 rows, that’s about 3% of the data, given this ‘flight’ data frame has 471,949 rows. It’s tempting to simply drop these data for the two carriers and move on.

But we don’t have to give up so easily. Maybe, we can find another web site that has a better mapping data for the airline carrier codes and the names.

And that’s the topic for the next post. I’m going to talk about how to scrape another web site data, how to evaluate the new data set in a comparison with the old data set, and how to create a better mapping table data by using both data sets. Stay tuned!

Joining two data sets to supplement or filter

Find the worst flight for each airline carrier

Scrape data from a web page

Clean up the data from web

Join functions from dplyr

Left join

Inner Join

Filter Join with Anti join and Semi join

If you liked this post, please click the green heart below to share with the world!

hadley/dplyr

dplyr - Plyr specialised for data frames: faster & with remote datastores

Subscribe to my Medium posts

Enter your email to recieve automatic updates.

Written by Kan Nishida