A Beginner’s Guide to EDA with Linear Regression — Part 8

What difference does it make whether a Predictor Variable is Numerical, Categorical, or Logical?

Kan Nishida
learn data science


This is the eighth and final episode of the series “A Beginner’s Guide to EDA with Linear Regression”, continued from the previous post. If you haven’t read this series yet, I’d strongly recommend that you start from the first post.

In the previous post, we learned that Plurality is a good predictor for explaining the changes in Gestation Week.

Its coefficient is -1.32, which means that a one-unit increase in the Plurality variable makes the predicted Gestation Week 1.32 weeks shorter when all the other variables stay the same.

That is, if you are expecting twins, the gestation period would be 1.32 weeks shorter than for a single baby. And if you are expecting triplets, it would be 1.32 weeks shorter than for twins and 2.64 weeks shorter than for a single baby.

Here is a formula to calculate the change in Gestation Week.

Change in Gestation Week = 1.32 (size of the Plurality coefficient) * 2 (difference between a single baby and triplets) = 2.64 weeks shorter

And here is the visual explanation of how the changes in Gestation Week can be calculated.

The blue line is this linear regression model. The coefficient is really the slope of the line.

Now, this means that with quintuplets (5) the model predicts a gestation period that is 5.28 weeks shorter than for a single baby.

5.28 = 1.32 * (5 - 1)
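The arithmetic above can be sketched in a few lines of R. This is only an illustration: `gestation_change` is a hypothetical helper name, and the only number taken from the model is the coefficient of -1.32 reported above.

```r
# Plurality coefficient reported by the model in this post.
plurality_coef <- -1.32

# Hypothetical helper: change in predicted Gestation Week when Plurality
# moves from `from` to `to`, with all other variables held constant.
gestation_change <- function(from, to, coef = plurality_coef) {
  coef * (to - from)
}

gestation_change(1, 2)  # single -> twins:       -1.32
gestation_change(1, 3)  # single -> triplets:    -2.64
gestation_change(1, 5)  # single -> quintuplets: -5.28
```

Because the model is a straight line, every extra baby is assumed to shorten the gestation period by the same amount.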

But this is based on the assumption that the relationship between Plurality and Gestation Week is linear.

We can check whether the assumption holds by using a boxplot like the one below. The Y-axis shows Gestation Week and the X-axis shows Plurality. Each box represents the range between the 25th percentile and the 75th percentile, and the center line inside each box shows the median value.

We can see a roughly linear relationship between Gestation Week and Plurality. As the value of Plurality increases, the value of Gestation Week decreases, though the trend stops between quadruplets (4) and quintuplets (5).
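If you want to reproduce this kind of check outside Exploratory, here is a minimal sketch in base R. The `births` data frame is simulated purely for illustration (it is not the data set used in this post), and the column names `plurality` and `gestation_week` are assumptions.

```r
set.seed(1)

# Simulated stand-in data, NOT the real birth data.
births <- data.frame(plurality = rep(1:3, times = c(450, 40, 10)))
births$gestation_week <- 39 - 1.3 * (births$plurality - 1) +
  rnorm(nrow(births), sd = 1)

# plot = FALSE returns the numbers behind each box without drawing;
# drop it to draw the actual chart.
bp <- boxplot(gestation_week ~ plurality, data = births, plot = FALSE)

# Rows of bp$stats: lower whisker, lower hinge (about the 25th percentile),
# median, upper hinge (about the 75th percentile), upper whisker.
# There is one column per Plurality value.
medians <- setNames(bp$stats[3, ], bp$names)
medians

# If the relationship is roughly linear, the step from one Plurality value
# to the next should be about the same size each time.
diff(medians)
```

Steps of very different sizes in `diff(medians)` would be a hint that a single slope does not describe the data well.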

Now, do we have enough data for each plurality number? I mean, do we have enough data for quadruplets and quintuplets in this data set?

Here’s a bar chart that shows the share of each Plurality value in this data set.

As you can see, we don’t have many babies that are triplets or more in this data.

So even though the Plurality variable’s P value in the above model is small enough to tell us that Plurality has a statistically significant influence on Gestation Week, I am not sure that this really holds true when the values are 3 (triplets) or more, given that we have a very small amount of data for these babies.

Maybe the coefficient of Plurality is driven mostly by the change from 1 (single) to 2 (twins), and not much by the changes between twins and triplets, or triplets and quadruplets, etc. And maybe we don’t have enough data to prove any relationship between Gestation Week and Plurality for triplets and more.
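The numbers behind such a bar chart are simple shares per value. Here is a minimal sketch in base R; the counts are made up for illustration and are not the real counts from this data set.

```r
# Made-up counts for illustration only, NOT the real data.
plurality <- rep(1:5, times = c(9650, 300, 40, 6, 4))

# Count each Plurality value, then convert the counts to percentages.
counts <- table(plurality)
share_pct <- round(100 * prop.table(counts), 2)
share_pct

# A bar chart of these shares could be drawn with: barplot(share_pct)
```

Looking at the raw counts next to the percentages is a quick way to see how few rows stand behind the higher Plurality values.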

Should a Variable be Numerical or Categorical?

We can actually address these concerns by converting the Plurality variable from numerical to categorical. In R, this is just a matter of converting the data type from Numeric to Character by using the ‘as.character’ function.

as.character(plurality)

Here, I’m creating a new column in Exploratory that holds Plurality as character data.

We can quickly check this new column under the Summary view.

We can clearly see that most of the data is concentrated under a single baby category.

Now, we can re-build the model with this new categorical Plurality variable and see the coefficient for each Plurality category.

Just like we saw in the third post in this series, the coefficients of a categorical variable can be understood by comparing them to the base level of the variable. In Exploratory, we can check what the base level of this Plurality variable is under the Model Summary tab.

Since the Plurality column is now character data, Exploratory automatically sets the most frequent value as the base level. In this case, that happens to be 1.
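For readers who want to see the same idea in plain R, here is a minimal sketch. One caveat: on its own, R’s `lm()` turns a character column into a factor and uses the first level in sorted order as the base, so the sketch orders the levels by frequency explicitly to mimic the behavior described above. The data is simulated and the column names are assumptions.

```r
set.seed(42)

# Simulated stand-in data, NOT the real birth data.
births <- data.frame(plurality = rep(1:3, times = c(450, 40, 10)))
births$gestation_week <- 39 - c(0, 1.3, 3.2)[births$plurality] +
  rnorm(nrow(births), sd = 1)

# Numerical -> categorical, as in the as.character() step above.
births$plurality_chr <- as.character(births$plurality)

# Order the levels by frequency so the most common value is the base level.
freq_order <- names(sort(table(births$plurality_chr), decreasing = TRUE))
births$plurality_cat <- factor(births$plurality_chr, levels = freq_order)

model_cat <- lm(gestation_week ~ plurality_cat, data = births)

levels(births$plurality_cat)[1]  # the base level
coef(model_cat)                  # one coefficient per non-base category
```

With 1 as the base level, each remaining coefficient is read as a difference from a single baby.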

This means if you have twin babies your gestation week tends to be 1.29 weeks shorter compared to a single baby if all the other variables stay the same.

If you have triplet babies the gestation week tends to be 3.16 weeks shorter compared to a single baby.

Now, if you have quadruplets (4), the gestation week tends to be 0.066 weeks shorter, but given that the P value is very high (0.96), we can’t tell whether this difference is real or just due to chance.

And when you look at quintuplets (5), the rightmost category, the coefficient is statistically significant but its confidence interval is pretty wide compared to the other categories.

The 95% confidence interval here runs from -3.278 to -0.025. Roughly speaking, if we repeated the sampling many times, about 95% of the intervals built this way would contain the true coefficient value.

This range tends to become wider when there is not enough data for a given category.

So, this tells us that as the Plurality value becomes bigger, the model becomes less confident about how much it would influence Gestation Week, especially from 4 (quadruplets) onward.
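The link between the amount of data and the width of the interval can be sketched in base R with `confint()`. The data below is simulated for illustration only, with deliberately few rows in the last category; the column names are assumptions.

```r
set.seed(7)

# Simulated stand-in data, NOT the real birth data.
# Category "3" has only 5 rows on purpose.
births <- data.frame(plurality = rep(c("1", "2", "3"), times = c(450, 40, 5)))
effect <- c("1" = 0, "2" = -1.3, "3" = -3.2)
births$gestation_week <- 39 + unname(effect[births$plurality]) +
  rnorm(nrow(births), sd = 2)

# Character predictor: lm() treats it as categorical, with "1" as the base.
model <- lm(gestation_week ~ plurality, data = births)

# 95% confidence interval for each coefficient, and its width.
ci <- confint(model, level = 0.95)
ci_width <- ci[, "97.5 %"] - ci[, "2.5 %"]
ci
ci_width
```

In this setup the interval for category 3 comes out wider than the one for category 2, simply because far fewer rows stand behind it.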

Converting to Logical Variable

The reason the model becomes less confident as the Plurality value becomes bigger is that we don’t have much data for births with more than twins.

So it might be useful to investigate this variable’s influence by transforming it to Logical, instead of Categorical.

A Logical variable takes the value TRUE or FALSE depending on whether each row meets a given condition.

Here, we can write a condition like the one below in R.

plurality > 1

If a given value matches this condition, it means that this baby is one of a multiple birth and the value will be TRUE. If not, this baby is a single baby and the value will be FALSE.

You can do this within a ‘mutate’ step like below in Exploratory.

We can check this newly created column is_Plural under the Summary view. It turns out that only 3.21% of the babies in this data are multiplets.

Now, we can re-build the model with this newly created variable is_Plural instead of the previously created categorical variable of Plurality.

We can see that the coefficient of is_Plural is about -1.38, which means that if a given baby is a multiplet, regardless of whether it is a twin, a triplet, or something else, then the gestation week tends to be about 1.38 weeks shorter than for a single baby when the other variables stay the same.
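Here is a minimal sketch of the same two steps in base R: creating the logical column and re-building the model. The data is simulated for illustration only, and the column names `plurality` and `gestation_week` are assumptions.

```r
set.seed(3)

# Simulated stand-in data, NOT the real birth data.
births <- data.frame(plurality = rep(1:3, times = c(450, 40, 10)))
births$gestation_week <- 39 - 1.3 * (births$plurality > 1) +
  rnorm(nrow(births), sd = 1)

# TRUE for twins and more, FALSE for a single baby.
# dplyr equivalent: births <- dplyr::mutate(births, is_Plural = plurality > 1)
births$is_Plural <- births$plurality > 1

# Share of multiplets in this simulated data (50 of 500 rows).
mean(births$is_Plural)

# With a logical predictor, lm() reports a single coefficient for TRUE.
model_lgl <- lm(gestation_week ~ is_Plural, data = births)
coef(model_lgl)
```

When is_Plural is the only predictor, as in this sketch, its coefficient is simply the difference between the average gestation week of multiplets and that of single babies. With other predictors in the model, as in this post, it is that difference after holding the other variables the same.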

With that, that’s all we have! You finished! Congrats! 🎉🎉🎉

Closing Thoughts…

The recent buzz around Data Science and AI is often about the prediction part of Machine Learning (or Statistical Learning), not so much about the ‘gaining insights’ part. But this insight part is where you find the valuable information that humans need in order to make important business decisions.

The world in which we operate our businesses is complex and filled with a lot of uncertainties, and it is very hard to predict exactly what will happen, how it will happen, and when it will happen.

Machine Learning is pretty good at prediction based on the patterns or correlations it finds in the given data. But when we analyze data, what we really want to know is the ‘why’ part of the prediction rather than the prediction itself. We want to understand what would cause the predicted result before making decisions in the real world.

Unfortunately, even modern Machine Learning algorithms can’t easily help us find answers to this ‘why’ question. And this is where domain knowledge of our businesses and human reasoning skills can be a great help.

This is why I have written this series, which contains nothing fancy or new. All the methods I have introduced in this series have been around and employed by statisticians for many years.

But the more people I talk to, especially those who have recently entered the world of Data Science, the more I realize that many of us are still not familiar with how to use even basic statistical learning algorithms like Linear Regression for gaining useful insights from data.

This series of posts is not necessarily trying to show you the exact way you are supposed to do data analysis. There are many other ways that are not covered in this series. But I hope that it gives you an idea of how you can use statistical learning algorithms like Linear Regression to help answer the questions you will have when you deal with data in the real world.

Thanks for reading, and let’s keep learning together!

Try it for yourself!

If you want to try this out quickly, you can download the data from here, import it in Exploratory Desktop, and follow the steps.

If you don’t have Exploratory Desktop yet, you can sign up from here for a 30-day free trial!

Learn Data Science without Programming

If you are interested in learning various powerful Data Science methods, ranging from Machine Learning and Statistics to Data Visualization and Data Wrangling, without programming, visit our Booster Training home page and enroll today!


CEO / Founder at Exploratory (https://exploratory.io/). Having fun analyzing interesting data and learning something new every day.