Why Factor is one of the most amazing things in R & forcats helps you realize it

Kan Nishida
Published in learn data science, Mar 20, 2017 (10 min read)

R is like wine: the more you experience it, the more you appreciate what it does, how it does it, and why. You might struggle with the initial learning curve, but once you get over it, you start to feel how beautifully and practically it is designed to address the very common challenges of everyday data analysis.

The Factor data type is one example. It is designed to address typical challenges we encounter when we work with categorical data. It looks like the Character data type at first glance, but it carries additional information to manage the levels and the order (or sequence) of the categorical values.

Why Factor?

How many categories and how are they all doing?

A simple example would be US State names like ‘California’, ‘New York’, ‘Texas’, etc. We know that there are always 50 of them (or maybe more when including other special districts). And even when we happen to have no data for some of the states, we sometimes still want to see all 50 states listed so that we can tell which states have data and which don’t. In R, by making this ‘US State’ column a Factor we can keep all 50 states as 50 levels regardless of whether each state has data in a given data frame.
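To make this concrete, here is a minimal sketch in base R. The three observed rows are made-up toy data, and I'm assuming the built-in `state.name` vector (the 50 US state names that ship with R) as the full list of levels.

```r
# Toy data (hypothetical): only two distinct states actually appear.
observed <- c("Texas", "California", "Texas")

# Register all 50 states as levels, whether or not they have rows.
state <- factor(observed, levels = state.name)

# table() counts every registered level, including the empty ones.
counts <- table(state)

# States that have no data in this sample.
states_without_data <- names(counts)[counts == 0]

print(nlevels(state))
print(length(states_without_data))
```

A plain character column would only know about ‘Texas’ and ‘California’ here; the factor still remembers the other 48.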

How are categories sorted (ordered)?

Another example would be the names of the days of the week, such as ‘Sunday’, ‘Monday’, ‘Tuesday’, ‘Wednesday’, etc. There are 7 of them (or 7 levels). In this case, though, we care not only about the number of levels but also about the order of the values. For example, when we visualize these, we would expect them to be sorted as ‘Sunday’, ‘Monday’, ‘Tuesday’, ‘Wednesday’, and so on like below, instead of in alphabetical order, such as ‘Friday’, ‘Monday’, ‘Saturday’, ‘Sunday’, etc.

Again, by converting this ‘Day of Week’ column to Factor data type, we can not only register the 7 days of the week as 7 levels but also define the order of how they should be sorted appropriately.
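Here is a small base R sketch of the difference. The observed values are toy data I made up for illustration.

```r
# The order we want, registered as the factor levels.
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")

# Toy data (hypothetical).
observed <- c("Friday", "Monday", "Sunday", "Saturday", "Monday")

# As plain text, sorting is alphabetical.
sorted_as_text <- sort(observed)

# As a factor, sorting follows the order of the levels.
day_of_week <- factor(observed, levels = days)
sorted_as_factor <- as.character(sort(day_of_week))

print(sorted_as_text)
print(sorted_as_factor)
```

Charts and summaries that respect factor levels pick up the same order without any extra configuration.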

Factor data type alone separates R from other BI tools

So basically, with the Factor data type, we can register the levels (the set of categories) and their order natively as part of the columns (or variables), so that the columns themselves dictate how level and sorting information is handled. This is a huge advantage, especially compared to other tools like Excel or typical BI tools.

A gift from God — forcats package

But the only problem is, it was not so straightforward to assemble and manage such levels and order with Factor for many of us. The typical syntax for defining the levels and order would look something like this.

factor(criteria, levels = c(1,2,3), labels = c("low", "medium", "high"), ordered = TRUE)

There’s a lot going on in this one function call, and it’s a bit intimidating.
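To unpack that call, here is a sketch with a made-up `criteria` vector (the values are my assumption; the original doesn't show the data). `levels` lists the raw values in order, `labels` renames them, and `ordered = TRUE` makes comparisons such as `>=` meaningful.

```r
# Hypothetical raw codes: 1 = low, 2 = medium, 3 = high.
criteria <- c(2, 1, 3, 3, 1)

rating <- factor(criteria,
                 levels  = c(1, 2, 3),                 # raw values, in order
                 labels  = c("low", "medium", "high"), # display names
                 ordered = TRUE)                       # levels have a rank

# Ordered factors support comparisons against a level name.
at_least_medium <- rating >= "medium"

print(rating)
print(at_least_medium)
```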

But then, in the middle of last summer, a package called ‘forcats’ was delivered by a god of the ‘tidyverse’, Hadley Wickham. And just like anything else he touches, all of a sudden it not only became much easier to work with Factor, but Factor also became an essential part of my data analysis flow, ranging from visualizing data to building machine learning models.

In this post, I’m going to demonstrate why Factor should be your best friend and how ‘forcats’ package makes the journey of knowing Factor data type super easy and fun. I’m going to use Exploratory as a front end (UI), but obviously, you can do the same things in RStudio or other tools as well.

Set the Order of Categories based on Another Column Values

Take a look at the chart below. It shows the similarities among countries in their United Nations General Assembly voting history, using the data I downloaded from here. Each Scatter chart represents the years each of the past US Presidents served.

Now, you would notice that those Scatter charts are actually sorted by the US President names alphabetically, not by the years they served. It starts with ‘Barack Obama’ and ends with ‘Ronald Reagan’. But obviously, it would be much easier to see them being sorted by the years so that we can see the trend by time.

The US President names were originally from another data frame called ‘presidents’ like below, and it was later joined to the main data frame with ‘left_join’ command from ‘dplyr’ package.

Luckily, there is a column called ‘year’ so we can use this column to define the order of the US President names for ‘PRESIDENT’ column with Factor.

There are two ways to do this.

One is to sort the data by ‘year’ column first. Then use ‘fct_inorder’ function from ‘forcats’ package.

fct_inorder(PRESIDENT)

This function sets the order based on the original order in the data set.
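As a rough sketch of this first approach outside Exploratory, assuming a toy data frame with ‘PRESIDENT’ and ‘YEAR’ columns (the rows are made up for illustration):

```r
library(forcats)

# Toy data (hypothetical rows), deliberately not in year order.
votes <- data.frame(
  PRESIDENT = c("Barack Obama", "Ronald Reagan", "Bill Clinton",
                "Ronald Reagan", "Barack Obama"),
  YEAR = c(2009, 1981, 1993, 1985, 2013),
  stringsAsFactors = FALSE
)

# Step 1: sort the rows by year.
votes <- votes[order(votes$YEAR), ]

# Step 2: levels follow the order of first appearance in the data.
votes$PRESIDENT <- fct_inorder(votes$PRESIDENT)

print(levels(votes$PRESIDENT))
```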

Another way is to use ‘fct_reorder’ function, which can take another column as a reference for the order, so the original data doesn’t need to be sorted beforehand.

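fct_reorder(PRESIDENT, YEAR, .fun = first)

I’m setting an aggregate function called ‘first’ (from the ‘dplyr’ package) to pick the first value of ‘year’ for each president, so that the first year of each president is used to define the order of the President names. Note that the argument is named ‘.fun’ in current versions of ‘forcats’; very early versions called it ‘fun’.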
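Here is the same idea as a self-contained sketch. The rows are toy data I made up, and I'm using the `.fun` argument name from current ‘forcats’ releases together with `dplyr::first`.

```r
library(forcats)

# Toy data (hypothetical rows); no need to sort them first.
president <- c("Barack Obama", "Ronald Reagan", "Bill Clinton",
               "Ronald Reagan", "Barack Obama", "Bill Clinton")
year <- c(2009, 1981, 1993, 1985, 2013, 1997)

# For each president, take the first year seen and order the levels by it.
president_by_year <- fct_reorder(president, year, .fun = dplyr::first)

print(levels(president_by_year))
```

If the rows within a president might not be in year order, swapping in `min` for `first` would be a safer choice, since it doesn't depend on row order.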

You can go to Summary view and confirm the new order as well.

Either way, by going back to Small Multiple, we can now see the scatter charts being sorted based on the order that was set at the US President column level.

As you have seen, once you set the order rule to the column, which is ‘PRESIDENT’ in this case, everything else including the charts will start respecting the order. This means you don’t need to configure such order rules separately for each chart.

Reverse the order

If we want to show the US Presidents from the latest to the oldest, then we can simply use ‘fct_rev’ from ‘forcats’ package to reverse the original order we have set above.

fct_rev(PRESIDENT)
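A quick sketch with a hand-built factor (the level order here is set manually, just for illustration):

```r
library(forcats)

# A factor whose levels are already in chronological order.
president <- factor(
  c("Bill Clinton", "Barack Obama", "Ronald Reagan"),
  levels = c("Ronald Reagan", "Bill Clinton", "Barack Obama")
)

# Reverse the level order; the values themselves are untouched.
latest_first <- fct_rev(president)

print(levels(latest_first))
```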

Create ‘Other’ Bucket When Too Many Categories

When you have a column with many unique categorical values and assign it to Color, you will end up with a chart like below.

Here, I have assigned the US State Code column to Color to show the ratio of each airline carrier’s flights by US state for this particular time period. But obviously, it’s hard to compare the states inside each of the bars. So what we would typically end up doing is moving the states with small ratios into an ‘Other’ bucket so that we can compare the major states.

This is when the ‘fct_lump’ function from the ‘forcats’ package comes to the rescue. It keeps only the top N most frequent categories, in this case US states, and moves everything else into an ‘Other’ bucket.

fct_lump(state, n=5)

Now you can see only CA (California), FL (Florida), GA (Georgia), IL (Illinois), TX (Texas), and Other in colors like below, and it’s much easier to compare those top 5 states in each carrier.
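Here is a small sketch of what that call does, using made-up flight counts per state (the numbers are my assumption, chosen so the same five states come out on top).

```r
library(forcats)

# Toy data (hypothetical counts): one value per flight.
state <- rep(c("CA", "TX", "FL", "GA", "IL", "HI", "NY"),
             times = c(6, 5, 4, 3, 2, 1, 1))

# Keep the 5 most frequent states and lump the rest into "Other".
state_top5 <- fct_lump(state, n = 5)

print(levels(state_top5))
print(table(state_top5))
```

Recent ‘forcats’ releases also offer `fct_lump_n()` for the same ‘top N’ behavior; `fct_lump()` still works and is what this post uses.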

By the way, as I discussed in the following blog post about ‘Anomaly Detection’, being able to create an ‘Other’ bucket is useful when building machine learning models as well, because some categorical values with small ratios might not have enough data to build the models.

Keep Top 5 for Each Group with Group By

We can switch the Y-Axis calculation to ‘% of Total’ for the above chart. This will make it easier to see the ratio of the states like below.

Now when you look at the ‘Hawaiian’ airline, though, you will notice that most of the flights are in the ‘Other’ group (Blue color). This is because ‘fct_lump’ calculated the ‘Other’ group against the entire data set, and the overall top 5 states happen to account for less than 10% of Hawaiian’s flights. This means that we have just lost valuable information, especially for this carrier. 😱

Not to worry. If you are a ‘frequent’ reader of this blog, you know we can simply add ‘group_by’ step to group the data frame before the ‘top 5’ frequent calculation.

Before adding the ‘group_by’ step, click ‘Pin’ button first to pin the chart to the final result step.

Now, go to the step before the step where we used ‘fct_lump’ function, that is ‘Separate’ step in this example. Then, select ‘Group By’ from Add button menu and select ‘CARRIER’ column to group the data by the carrier.

Once the command is run, the ‘Other’ calculation with ‘fct_lump’ function will be done automatically, and you will see the top 5 states and ‘Other’ in each of the carrier bars. 🎉

We can see that ‘HI (Hawaii)’ is the most frequent state for Hawaiian airline, which is kind of expected.
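Outside the UI, the same fix can be sketched with ‘dplyr’ and ‘forcats’. The data frame below is toy data I made up (two carriers, a top 2 instead of a top 5 to keep it short), and I convert the result back to text so each group's buckets are easy to compare.

```r
library(dplyr)
library(forcats)

# Toy data (hypothetical): one row per flight.
flights <- data.frame(
  CARRIER = rep(c("Delta", "Hawaiian"), times = c(12, 8)),
  state = c(rep(c("GA", "CA", "TX", "FL"), times = c(6, 4, 1, 1)),
            rep(c("HI", "CA", "NV", "WA"), times = c(4, 2, 1, 1))),
  stringsAsFactors = FALSE
)

# Top 2 computed over the whole data set: HI is lumped into "Other".
global <- flights %>%
  mutate(state_top = as.character(fct_lump(state, n = 2)))

# Top 2 computed within each carrier: HI survives for Hawaiian.
per_carrier <- flights %>%
  group_by(CARRIER) %>%
  mutate(state_top = as.character(fct_lump(state, n = 2))) %>%
  ungroup()

hawaiian_global  <- table(global$state_top[global$CARRIER == "Hawaiian"])
hawaiian_grouped <- table(per_carrier$state_top[per_carrier$CARRIER == "Hawaiian"])

print(hawaiian_global)
print(hawaiian_grouped)
```

In this toy example, 6 of Hawaiian's 8 flights fall into ‘Other’ without the grouping, and only 2 do with it.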

Set the Order of Categories Manually

Now you might want to control the way the airline carrier names are sorted at X-Axis.

Let’s say we want to show ‘JetBlue’, ‘Southwest’, and ‘Virgin’ first then show the rest as is. For this, we can use the ‘fct_relevel’ function from ‘forcats’ package like below.

fct_relevel(name, "JetBlue", "Southwest", "Virgin")

If you have a ‘Group By’ step in your data wrangling pipeline, make sure that you do this operation before the step because you don’t want to set a different sorting order in each group.
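A minimal sketch of the manual reordering, using a made-up list of carrier names:

```r
library(forcats)

# Toy data (hypothetical carrier names); default levels are alphabetical.
name <- factor(c("Alaska", "Delta", "JetBlue", "Southwest", "United", "Virgin"))

# Move the three named levels to the front; the rest keep their order.
name_reordered <- fct_relevel(name, "JetBlue", "Southwest", "Virgin")

print(levels(name_reordered))
```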

Set Base Level for Categories for Machine Learning Models

Setting the base level of the categorical data is critical for some of the machine learning algorithms. For example, when you run ‘Survival Analysis — Cox Regression Model’ for your customer retention analysis, you will see the result like below.

The ‘Hazard Ratio’ column under Parameter Estimate shows the relative risk of customers quitting your service, for each country where the customers live and each operating system they use. But relative to what, right? Well, these values are relative to the ‘base’ levels, which are shown in the ‘Summary of Fit’ table above.

For example, the first line, ‘India’, shows 1.2534, which means that users from India are about 25% more likely to quit than users from the United States, which is the base level for ‘country’.

1.2534 - 1 = 0.2534

So it’s important to know what the base level is for each of the predictor columns when working with these types of machine learning models.

Now, what if you want to set the base level to something else? For example, let’s say your primary customers are in Japan and on Mac OS, and you want to understand other customers in comparison to that primary customer type. The base level is essentially the first level of the factor column, and you can easily set it manually with the ‘fct_relevel’ function from the ‘forcats’ package.

fct_relevel(country, "Japan")
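To see why the first level matters, here is a sketch with a made-up ‘country’ column. I'm using base R's `model.matrix()` only to illustrate how R's default treatment contrasts leave the first level out as the baseline; it stands in for the Cox model, which isn't fitted here.

```r
library(forcats)

# Toy data (hypothetical), with United States as the first level.
country <- factor(
  c("United States", "India", "Japan", "India", "United States", "Japan"),
  levels = c("United States", "India", "Japan")
)

# Move Japan to the front so it becomes the base level.
country_japan_as_baseline <- fct_relevel(country, "Japan")

base_before <- levels(country)[1]
base_after  <- levels(country_japan_as_baseline)[1]

# With default contrasts, every level except the first gets a column.
dummy_columns <- colnames(model.matrix(~ country_japan_as_baseline))

print(base_before)
print(base_after)
print(dummy_columns)
```

Japan has no column of its own in the design matrix, so every coefficient is read relative to Japan.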

In Summary view, you can see that a new column called ‘country_japan_as_baseline’ has been created with ‘Japan’ as its ‘Base Level’. You can compare it to the original column ‘country’, which shows ‘United States’ as its ‘Base Level’.

And when you use this ‘country_japan_as_baseline’ column for building the Cox Regression model again, you will see the Hazard Ratio being re-calculated based on this new base level.

And now we can understand that the users from India are 44% more likely to quit compared to the users from Japan. 😱

That’s it for this post. As always, there is much more to explore with the Factor data type, and I’m going to introduce other cool things you can do with it. But for this post, I hope that I have demonstrated how the Factor data type can be practically useful for your everyday data analysis. And the functions from ‘forcats’ make it super easy to work with categorical data by taking advantage of the beauty of Factor.

If you’d like to quickly try some of the functions from ‘forcats’, we have exposed them in the column header menu in Exploratory.

You can simply select a column and select the one you are interested in from the menu.

If you don’t have Exploratory yet, sign up for a free trial from here. If you are a student, teacher, or journalist, it’s free!


CEO / Founder at Exploratory (https://exploratory.io/). Having fun analyzing interesting data and learning something new every day.