Why learn dplyr for everyday data analysis?
I started learning R relatively recently. I tried to learn it by writing R code, but I was a bit perplexed by R’s own way of doing things, and I didn’t become enthusiastic about it right away. That lasted only until I was introduced to an R package called ‘dplyr’ and saw something like the code below.
flight %>%
  group_by(CARRIER) %>%
  filter(min_rank(desc(ARR_DELAY)) <= 10)
The three lines of code above find the 10 worst flights in terms of arrival delay time for each airline carrier. If you look closely you’ll notice the ‘%>%’ marks; these are called ‘pipes’. A pipe passes the output (or result) of the command on its left-hand side to the command on its right-hand side. So the first line gets the flight data, and the next line groups the data by the CARRIER column values. Finally, the last line filters the data, keeping only the rows ranked among the 10 worst by arrival delay time (ARR_DELAY) within each CARRIER. I found it very simple to construct the analysis and very intuitive to follow what’s happening in the operation.
You’ll appreciate the dplyr version even more after you see the SQL we would have to write to do basically the same thing:
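To make the pipe idea concrete, here is a minimal sketch on a tiny made-up data frame. The name `toy` and its values are assumptions for illustration only, not the real flight data. It shows that `x %>% f(y)` is just another way of writing `f(x, y)`, so the piped and nested forms give the same rows.

```r
library(dplyr)

# Hypothetical toy data standing in for the real flight data
toy <- data.frame(
  CARRIER   = c("AA", "AA", "AA", "UA", "UA"),
  ARR_DELAY = c(5, 40, 12, 3, 25),
  stringsAsFactors = FALSE
)

# Nested form: has to be read from the inside out
nested <- filter(group_by(toy, CARRIER), min_rank(desc(ARR_DELAY)) <= 1)

# Piped form: reads from top to bottom, one step per line
piped <- toy %>%
  group_by(CARRIER) %>%
  filter(min_rank(desc(ARR_DELAY)) <= 1)

# Compare the two results after dropping the grouping information
identical(as.data.frame(nested), as.data.frame(piped))
```

Both versions keep the single worst flight for each carrier. The only difference is the direction in which you read the code.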
SELECT *
FROM (
  SELECT f.*,
         RANK() OVER (PARTITION BY CARRIER
                      ORDER BY ARR_DELAY DESC) AS delay_rank
  FROM flight_table f
) ranked
WHERE delay_rank <= 10
Yikes. I hope I’m not the only one who has trouble following this SQL. Yes, I worked on BI and database products at Oracle for many years, but that didn’t make it any easier.
A question like “What are the worst flights in terms of arrival delay time for each airline carrier?” sounds easy to answer to the analyst side of my brain once the relevant data is at hand. In practice it hasn’t been, because most traditional SQL-based tools, the ones called ‘BI (Business Intelligence)’, force you to think the SQL (Structured Query Language) way. Now dplyr has changed that, and I was hooked.
And the cool thing is, dplyr doesn’t stop there. Let’s say I wanted to see the worst airline carriers by average arrival delay time for each airport they departed from. I can add ‘ORIGIN’ to the ‘group_by’ step, add a ‘summarise’ step to calculate the average arrival delay time, and finally use that value to rank the carriers inside a ‘mutate’ step.
> flight %>%
    group_by(ORIGIN, CARRIER) %>%
    summarise(avg_arr_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
    mutate(rank = min_rank(desc(avg_arr_delay)))
Source: local data frame [9 x 4]
Groups: ORIGIN [1]

   ORIGIN CARRIER avg_arr_delay  rank
    (chr)   (chr)         (dbl) (dbl)
1     ABE      EV    15.7964602     1
2     ABI      EV     5.5789474     1
3     ABI      MQ     0.3019802     2
4     ABQ      AA    -0.3953488     9
5     ABQ      B6     0.6521739     8
6     ABQ      DL    17.4666667     2
7     ABQ      EV    11.7439024     5
8     ABQ      F9    35.8750000     1
9     ABQ      MQ    13.2888889     4
10    ABQ      OO     2.1098266     7
..    ...     ...           ...   ...
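One detail in that output is worth spelling out: the rank restarts for every ORIGIN. As far as I understand it, this is because ‘summarise’ by default drops the last grouping variable (CARRIER here), so the ‘mutate’ step that follows still works within each ORIGIN. Here is a small sketch on made-up data (the `toy` data frame and its values are assumptions, not the real flight data) that checks the remaining grouping with `group_vars()`.

```r
library(dplyr)

# Hypothetical toy data standing in for the real flight data
toy <- data.frame(
  ORIGIN    = c("SFO", "SFO", "SFO", "SFO", "ABQ", "ABQ"),
  CARRIER   = c("UA", "UA", "VX", "VX", "AA", "DL"),
  ARR_DELAY = c(10, 20, 2, NA, -4, 30),
  stringsAsFactors = FALSE
)

# One row per ORIGIN/CARRIER pair, with the average delay (NAs ignored)
by_airport <- toy %>%
  group_by(ORIGIN, CARRIER) %>%
  summarise(avg_arr_delay = mean(ARR_DELAY, na.rm = TRUE))

# Which grouping is left after summarise()?
group_vars(by_airport)

# The rank is therefore computed separately inside each ORIGIN
ranked <- by_airport %>%
  mutate(rank = min_rank(desc(avg_arr_delay)))
ranked
```

On recent dplyr versions ‘summarise’ may also print a message about the grouping it leaves behind; that is informational and not an error.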
From here, I want to see only the flights that left SFO (San Francisco International Airport), exclude the carriers that on average arrived earlier than scheduled (yes, that does happen!), and show the results in order, starting from the worst.
> flight %>%
    group_by(ORIGIN, CARRIER) %>%
    summarise(avg_arr_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
    mutate(rank = min_rank(desc(avg_arr_delay))) %>%
    arrange(desc(avg_arr_delay)) %>%
    filter(avg_arr_delay > 0) %>%
    filter(ORIGIN == "SFO")
Source: local data frame [9 x 4]
Groups: ORIGIN [1]

  ORIGIN CARRIER avg_arr_delay  rank
   (chr)   (chr)         (dbl) (dbl)
1    SFO      F9    17.3486239     1
2    SFO      WN    15.9129264     2
3    SFO      B6    11.3708207     3
4    SFO      OO     7.2768606     4
5    SFO      HA     6.2580645     5
6    SFO      UA     5.6466804     6
7    SFO      AS     2.5698006     7
8    SFO      AA     2.2303173     8
9    SFO      VX     0.3293286     9
The questions I’m asking of the data are getting more complicated, but the code above doesn’t get much more complicated, because each line simply follows my question one step at a time. I can read from top to bottom as if I’m following the steps of the question. Even better, I can change the order of the commands and rearrange them based on my needs. Yes, finally, we have a tool that follows how we want to ask questions and think.
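As a quick illustration of rearranging the steps, here is a sketch on made-up data (again, `toy` and its values are assumptions for illustration). It moves the SFO filter to the top and the sorting to the bottom, then checks that the result matches the original order of steps.

```r
library(dplyr)

# Hypothetical toy data standing in for the real flight data
toy <- data.frame(
  ORIGIN    = c("SFO", "SFO", "SFO", "SFO", "SFO", "ABQ", "ABQ"),
  CARRIER   = c("UA", "UA", "VX", "VX", "AS", "AA", "DL"),
  ARR_DELAY = c(10, 20, 2, NA, -3, -4, 30),
  stringsAsFactors = FALSE
)

# Same order of steps as in the post: the SFO filter comes last
sfo_last <- toy %>%
  group_by(ORIGIN, CARRIER) %>%
  summarise(avg_arr_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
  mutate(rank = min_rank(desc(avg_arr_delay))) %>%
  arrange(desc(avg_arr_delay)) %>%
  filter(avg_arr_delay > 0) %>%
  filter(ORIGIN == "SFO")

# Rearranged: the SFO filter comes first and the sorting comes last
sfo_first <- toy %>%
  filter(ORIGIN == "SFO") %>%
  group_by(ORIGIN, CARRIER) %>%
  summarise(avg_arr_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
  mutate(rank = min_rank(desc(avg_arr_delay))) %>%
  filter(avg_arr_delay > 0) %>%
  arrange(desc(avg_arr_delay))

# Compare the two results after dropping the grouping information
all.equal(as.data.frame(sfo_last), as.data.frame(sfo_first))
```

Not every rearrangement is safe, though. A step that uses a column created earlier, such as the filter on avg_arr_delay, has to stay after the step that creates it, and filtering individual flights before ‘summarise’ would change which rows go into the average.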
dplyr was created by Hadley Wickham, Romain Francois, and others.
And if you’re familiar with R, you are probably already familiar with Hadley Wickham and his world, known as the ‘Hadleyverse’. Over the years he has built many powerful, useful, and beautiful R packages that make data analysis easier and more fun for people around the world.
I have learned a lot about this amazing world since I started this journey, and now I want to share my experience with ‘dplyr’ and the ‘Hadleyverse’ so that even more people can access more data easily and work with it to find answers quickly, in a way that is more advanced and yet more fun.
Working with data includes accessing and extracting data, then transforming and analyzing it. And it’s almost impossible to talk about dplyr without mentioning the other R packages that Hadley and other amazing people have built, such as lubridate, tidyr, stringr, readr, and others. So I will cover those along with dplyr as well.
Enough said. Now let the journey begin!