Why SQL is not for Analysis, but dplyr is

Kan Nishida
Published in learn data science · Aug 5, 2016

I keep meeting people who are just starting to learn SQL. There is still a lot of data sitting in relational databases, and SQL provides a simple grammar for accessing that data in a fairly flexible way. As long as you stick to basic queries like counting rows or calculating a grand total, you can get by for a while. The problem starts when you want to analyze the data beyond what a simple grand total can tell you.

Here is a super simple example. I have the famous flight delay data here, and let’s say we want to calculate the median arrival delay for all the flights. In R + dplyr, this is all you need to do.

flight %>% summarize(median_delay = median(ARR_DELAY))

‘%>%’ is called the ‘pipe’; it passes the data, ‘flight’ in this case, from its left side to its right side. If you work in Exploratory Desktop you don’t even need that part, and the step would simply look like this.
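Roughly, it is just the same step without the ‘flight %>%’ prefix, since the tool supplies the data for you:

summarize(median_delay = median(ARR_DELAY))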

Now, if you want to calculate that for each carrier, you just add a ‘group_by’ command right before the ‘summarize’ command.

flight %>%
  group_by(CARRIER) %>%
  summarize(median_delay = median(ARR_DELAY))

As you probably know, the median is different from the average. The median is the value sitting in the middle of all the entries once they are sorted, while the average is the total of all the values divided by how many values there are (for example, for the values 1, 2, 3, 4, and 100, the average is 22 but the median is 3). Because of this nature of the median, SQL-based databases really suck at this calculation. Let’s take a look at a few examples of SQL queries for some popular databases.

If you are working with Amazon Redshift, the query would look something like this.

select distinct CARRIER,
  median(ARR_DELAY) over (partition by CARRIER)
from flight

It looks a bit weird, because we thought we signed up for a grammar with a ‘GROUP BY’ syntax for grouping the data, yet here we are dealing with this ‘over partition’ thing.

But this is still better than what you actually need to write for PostgreSQL, shown below.

with ordered_flights as (
  select
    ARR_DELAY,
    row_number() over (order by ARR_DELAY) as row_id,
    (select count(1) from flight) as ct
  from flight
)
select avg(ARR_DELAY) as median
from ordered_flights
where row_id between ct/2.0 and ct/2.0 + 1

Hmm, I’m kind of lost here. But this is still better than what we would need to write for MySQL. By the way, MySQL doesn’t even support the window functions used in the examples above, so things get even more complicated. I found the Stack Overflow page ‘Simplest way to calculate median with MySQL’, and here is the answer with the most ‘useful’ votes.

SELECT avg(t1.val) as median_val FROM (
  SELECT @rownum:=@rownum+1 as `row_number`, d.val
  FROM data d, (SELECT @rownum:=0) r
  WHERE 1
  -- put some where clause here
  ORDER BY d.val
) as t1,
(
  SELECT count(*) as total_rows
  FROM data d
  WHERE 1
  -- put same where clause here
) as t2
WHERE 1
AND t1.row_number in ( floor((total_rows+1)/2), floor((total_rows+2)/2) );

Wow, wow, wow…

Again, let’s take a look at how the median can be calculated in R.

flight %>%
  group_by(CARRIER) %>%
  summarize(median = median(ARR_DELAY))

Phew, thank goodness we have R and dplyr! ;)

But ok, let’s say you have somehow managed to hack through this jungle of SQL queries, and you don’t care about sharing the code with others, who would most likely have a hard time understanding SQL like the above. Even then, the performance would be terrible due to the nature of the operation required to find (not really calculate) the median values, and I don’t recommend you even try. ;)

SQL is not designed to do data analysis

This is just the tip of the iceberg. There are many other examples of why SQL is not for analyzing data. It is not SQL’s fault per se, because SQL has never claimed to be designed for analyzing data. It was designed for querying and managing data, including inserting, updating, and deleting rows. I used to work at Oracle, the company that first brought SQL to the commercial world, and we tried many things to make SQL look like a language that could be used even for analyzing data. But at the end of the day, the ways to make it work within SQL always turned out to be so convoluted that many people outside of the core geeks, including myself, couldn’t understand them anymore, and even people like us had a hard time getting the queries to perform in a reasonable time. Again, the relational database is not designed for analyzing data, and SQL is not designed for analyzing data either.

R is designed for data analysis

Instead of going on about why SQL sucks for analyzing data, let me quickly demonstrate how great R and the dplyr package are for analyzing data in a super effective way, not just for statisticians, but for ordinary people, including SQL users, Excel users, and BI tool users.

Before starting, if you are not familiar with ‘dplyr’ yet, it provides a set of grammar-based data wrangling commands (or functions) that can be flexibly combined.

Here are a few of the commands I’m going to use below.

  • group_by — group the data — equivalent to GROUP BY in SQL
  • summarize — aggregate the data — equivalent to the aggregate functions in SELECT in SQL
  • mutate — create new columns with expressions — equivalent to the expression in SELECT in SQL
  • arrange — sort the data — equivalent to ORDER BY in SQL
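
Chained together with the pipe, these verbs read almost like a sentence. Here is a minimal sketch combining all four (assuming the same ‘flight’ data with CARRIER and ARR_DELAY columns; the derived column names are just for illustration):

flight %>%
  group_by(CARRIER) %>%
  summarize(average_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
  mutate(delay_in_hours = average_delay / 60) %>%
  arrange(desc(average_delay))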

Ok, let’s start. I’m going to keep using the same ‘flight’ data and demonstrate why R is 100x better than SQL using the following example use cases.

  • Top 10
  • Filtering based on Aggregated Values
  • Comparing to Previous Values — Daily Growth
  • Cumulative Sum

Top 10

Let’s say we want to see the top 10 flights for each carrier based on the arrival delay time. All we need to do is type something like below.

flight %>%
  group_by(CARRIER) %>%
  top_n(10, ARR_DELAY)

The result can be visualized clearly with a Scatterplot chart like below. You can see the top 10 delayed flights for each carrier.

Now, when I remove the ‘group_by’ command like below, you can spot the difference clearly in the chart.

flight %>% top_n(10, ARR_DELAY)

You can see just the top 10 flights based on their arrival delay times here.

By the way, we can simply flip the top 10 to the bottom 10 by just adding ‘-’ like below.

flight %>%
  group_by(CARRIER) %>%
  top_n(-10, ARR_DELAY)

Just to show you how insanely simple that is, here is what you would have to write with SQL.

SELECT *
FROM (
  SELECT *,
    RANK() OVER (PARTITION BY CARRIER
                 ORDER BY ARR_DELAY DESC) AS delay_rank
  FROM flight_table
) ranked
WHERE delay_rank <= 10

Hmm….

Filtering based on Aggregated Values

Let’s say we want to see all the flights that were delayed more than the average delay of their carrier. All you need to do is add the ‘mean’ function inside the ‘filter’ command like below.

flight %>%
  group_by(CARRIER) %>%
  filter(ARR_DELAY >= mean(ARR_DELAY))

The ‘mean’ (average) calculation is done for each Carrier and the data is filtered based on it. Isn’t this intuitive or at least easy to read and understand?

We can visualize this result clearly by using Boxplot chart. You would notice that the leg of each box starts at the average delay time of each carrier.

Now, when I remove the ‘group_by’ step, the same chart looks like the one below. You can see all the legs starting at the same point, which is the average arrival delay for all the flights.

Comparing to Previous Values

Let’s say we want to see the daily changes, in values or in percentages, of the arrival delay time compared to the day before, for each carrier. This is one of the typical use cases for ‘Window’ functions.

flight %>%
  group_by(CARRIER, FL_DATE) %>%
  summarize(average_delay = mean(ARR_DELAY)) %>%
  arrange(CARRIER, FL_DATE) %>%
  mutate(change = average_delay - lag(average_delay))

The ‘lag’ function returns the previous value within each group boundary.
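If it helps, here is a tiny made-up illustration of what dplyr’s ‘lag’ does on its own:

lag(c(5, 7, 4))   # returns NA 5 7, so the first row in each group has nothing to compare to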

Here is how it looks in a Line chart.

How about tracking the change in percentage? We can add an extra expression in the same ‘mutate’ command.

flight %>%
  group_by(CARRIER, FL_DATE) %>%
  summarize(average_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
  mutate(change = average_delay - lag(average_delay),
         percent_change = change / lag(average_delay) * 100)

Did you notice that I’m already using the ‘change’ column, which was just created in the same ‘mutate’ command, in this new expression? Yes, R is built on ‘vectorized’ operations, basically the same idea as ‘columnar’ operations if you are coming from the BI world, so once an expression has been evaluated its result becomes available to the next expression, even within the same command. This is another reason why I claim R is designed for data analysis while SQL, which is optimized for row-based operations, is designed for data management, not for analysis.
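As a tiny illustration of that behavior (made-up column names, not part of the pipeline above):

flight %>%
  mutate(delay_in_hours = ARR_DELAY / 60,
         long_delay = delay_in_hours > 2)   # uses the column created on the line above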

Anyway, the percent changes now look like this in a Line chart.

Cumulative Sum

Let’s say we want to see the cumulative sum of the daily average flight delay time for each carrier. All you need to do is use the ‘cumsum’ function in a ‘mutate’ command.

flight %>%
  group_by(CARRIER, FL_DATE) %>%
  summarize(average_delay = mean(ARR_DELAY, na.rm = TRUE)) %>%
  arrange(CARRIER, FL_DATE) %>%
  mutate(cumulative_sum = cumsum(average_delay))

And it would look something like below in Line chart.

Conclusion

Lastly, I don’t deny there is room for SQL, even in data analysis, especially when it doesn’t make sense to move all the data out of the database and you just need to answer very simple questions by summarizing the data with SQL queries. But you need a clear strategy for when to use SQL and when not to. The best strategy I keep seeing among R users is to fine-tune the SQL queries to extract the data at a size manageable for your PC’s memory, by filtering, aggregating, or sampling, and then import it into R in memory so that you can quickly and iteratively explore and analyze the data with all of R’s statistical horsepower.
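Here is a minimal sketch of that workflow, assuming a PostgreSQL database reachable with the DBI and RPostgres packages; the table and column names are just for illustration:

library(DBI)
library(dplyr)

# aggregate in the database first, so only a small daily summary comes over the wire
con <- dbConnect(RPostgres::Postgres(), dbname = "airline")
daily <- dbGetQuery(con, "
  select carrier, fl_date, avg(arr_delay) as average_delay
  from flight
  group by carrier, fl_date")
dbDisconnect(con)

# then explore iteratively in memory with dplyr
daily %>%
  group_by(carrier) %>%
  summarize(median_delay = median(average_delay, na.rm = TRUE))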

If you would like to quickly try R’s power for data analysis, I would encourage you to start with the Exploratory Desktop beta by signing up from here.


CEO / Founder at Exploratory (https://exploratory.io/). Having fun analyzing interesting data and learning something new every day.