Filtering Data with Aggregate and Window Calculations

Kan Nishida
learn data science
Published in
6 min readMar 14, 2016

--

One of the cool things about dplyr is that we can directly aggregate values inside the filter operation, without having to have a separate step for the aggregation. Let’s take a look how it is and why that is so cool.

Filter with Aggregate functions

What if you want to see only the flights whose arrival delay time (ARR_DELAY) is greater than the overall average arrival delay time ? First, let’s look at the average arrival delay time.

flight %>% 
summarize(average = mean(ARR_DELAY, na.rm = TRUE))

It’s about 12.75 minutes. So we’re interested in seeing any flights whose arrival delay time is greater than about 12.75 minutes. Guess what ? You can actually do this average calculation directly inside the ‘filter()’ function like below. Note that ‘na.rm = TRUE’ argument inside ‘mean()’ function above is to remove any NA values before the calculation. If this is not set TRUE then it will return NA when there is NA values in the data because it doesn’t know what to make of NA. R is precise, it’s the tool that statisticians use and rely on.

flight %>% 
select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
filter(ARR_DELAY > mean(ARR_DELAY, na.rm = TRUE))

I’m so used to the world of SQL and BI tools. Yes, I used to work at Oracle building BI tools for databases. So something like this makes me feel like it’s too good to be true.

But then after I’ve gotten more used to this “dplyr world” a.k.a. “Hadleyverse”, I realize how much I was trying to adopt my analysis questions into the way SQL queries allow, not how we wanted to think or ask questions. “What are the flights whose arrival delay times are greater than the overall average ?” That’s the question we want to ask, and dplyr lets us translate that to the command without reconstucting the question. This is why I sometimes emphasize that dplyr commands are so intuitive. And it turned out if it’s intuitive to write the command and construct the analysis steps it’s also intuitive for other people to follow my commands and steps.

Filter with Aggregate function with Group

Let’s go one step further. What if you want to see the flights whose arrival delay times are greater than the average of each airline carrier, instead of the overall average ? To answer this question, you can simply add ‘group_by()’ function right before the ‘filter’ step like below.

flight %>% 
select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
group_by(CARRIER) %>%
filter(ARR_DELAY > mean(ARR_DELAY, na.rm = TRUE))
flights with arrival delay time greater than average per carrier

Yes, it’s that simple. This ‘group_by()’ function makes the data frame to be ‘grouped’ based on a given column(s) so that any aggregate functions like ‘mean()’ in the following steps would do the aggregate calculations for each group.

Let’s find out the result is really reflecting our intention. We can quickly calculate each airline carrier’s average arrival delay time with ‘summarize()’ function like below.

flight %>% 
group_by(CARRIER) %>%
summarize(average = mean(ARR_DELAY, na.rm = TRUE))
Average arrival delay time per carrier

As you can see the average arrival delay time for carrier ‘EV’, for example, is about 19.69 minutes. Now when you look at the previous result of the filtering, you can see that the arrival delay times for ‘EV” are all greater than 19.69.

Filtering with Window functions

Filtering with aggregate functions is cool, but there is more.

Let’s say you have a question like, “What are the worst 10 flights based on the arrival delay time ?” To answer this question, we can simply use one of the rank functions called ‘min_rank()’ from dplyr and call it directly inside the ‘filter()’ function like below.

flight %>% 
select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
filter(min_rank(ARR_DELAY) <= 10)

By the way, I’m using ‘min_rank()’ function here but there is another rank function called ‘dense_rank()’ from dplyr. Both functions return the ranking number based on a given measure column (e.g. Arrival Delay time), and only the difference is when there are ties like below.

As you can see, ‘flight D’ is considered as #4 with ‘min_rank()’ function because ‘flight B’ and ‘flight C’ are ties at #2, and ‘min_rank()’ function reserves #3 before giving the next rank number. On the other hand, ‘dense_rank()’ function doesn’t reserve the rank #3 and instead give that to whatever the one at the next in order.

Anyway, if I run the command above, we’ll get something like below.

But when you look closer at ‘ARR_DELAY’ column though, it looks that it is showing the best 10, not the worst 10, because the less the arrival time means the flights actually had arrived earlier in this context. So basically we want the order to be flipped. For this, you can use ‘desc()’ function just like you would do with SQL.

flight %>% 
select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
filter(min_rank(desc(ARR_DELAY)) <= 10)

Now we get the worst 10 flights in January 2014. I see a lot of American Airlines there, interesting.

But just like what we did with the aggregate function, what if you want to see the worst 10 flights for each airline carrier, instead of the overall worst 10 ? Yes, you guessed it right, you can simply add ‘group_by()’ function right before the filter step again.

flight %>% 
select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
group_by(CARRIER) %>%
filter(min_rank(desc(ARR_DELAY)) <= 10)

To make it easier to see if the result really reflects our intention, let’s try to see only the worst 1 for each carrier by chaning the condition from ‘<=10’ (less than 10 or equal) to ‘<=1’ (less than 1 or equal).

flight %>% 
select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
group_by(CARRIER) %>%
filter(min_rank(desc(ARR_DELAY)) <= 1)

Now we can see the worst flight based on the arrival delay time for each of the 14 carriers.

What we have done here is often called ‘Window operation’ with other tools like SQL and BI tools. However, with dplyr you don’t have to think about it as ‘Window operation’ problem trying to adopt your question into a complex query structure. Instead, you focus on building your analysis questions in a quick and iterative way. The combination of the ‘pipe (%>%)’ operation, dplyr’s beautiful command interface design, and its well thought and rich functionality, you can perform this type of analysis quickly and in an intuitive way without even realizing how complicated it could have been with other tools. Great win for anybody who want to do a quick, iterative, and solid data analysis.

--

--

CEO / Founder at Exploratory(https://exploratory.io/). Having fun analyzing interesting data and learning something new everyday.