Quick Introduction to Logistic Regression in Exploratory

Kan Nishida · Published in learn data science · Jan 24, 2017


We have added an easier way for you to build, predict with, and evaluate some of the well-known regression models such as Linear Regression, Logistic Regression, and GLM in v3.0.

In this post, I’m going to use Logistic Regression as an example to demonstrate how this works at a high level.

Build Logistic Regression Model

Logistic Regression is a regression algorithm that can be used to predict a binary outcome, such as TRUE or FALSE, based on input variables (predictors).

Here, I have US flight delay data and have created a column that indicates whether each flight’s arrival was delayed (TRUE) or not (FALSE).

mutate(is_delay = ARR_DELAY > 0)

Now, let’s build a logistic regression model to see if we can predict whether each flight was delayed for its arrival or not, based on the departure delay time (DEP_DELAY) and how long it was flying in the air (AIR_TIME).

You can click the ‘Add’ button at the top and select the ‘Build Logistic Regression’ model under ‘Build Model’.

This will open the ‘Build Logistic Regression Model’ dialog like below.

Note that this ‘build_lr’ function is an R wrapper function that calls the ‘glm’ function from the ‘stats’ package with the ‘family’ argument set to ‘binomial’ and ‘link’ set to ‘logit’. Unlike the ‘glm’ function, ‘build_lr’ returns a data frame that holds the model(s). All the parameters for the ‘glm’ function are available on the right hand side of the dialog.
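For reference, here is roughly what that wrapped call looks like in plain R. This is just a sketch assuming a hypothetical ‘flight’ data frame with the columns used in this example; ‘build_lr’ itself returns the model wrapped inside a data frame instead.

# A rough plain-R equivalent of what 'build_lr' wraps (a sketch, not the
# exact internals). 'flight' is a hypothetical data frame holding the
# flight delay data prepared above.
model <- glm(is_delay ~ DEP_DELAY + AIR_TIME,
             data = flight,
             family = binomial(link = "logit"))
summary(model)  # coefficient estimates, standard errors, p-values, AIC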

Anyway, we can select the ‘is_delay’ column for the ‘Predict the outcome of’ parameter, and the ‘DEP_DELAY’ and ‘AIR_TIME’ columns as the predictors.

You can also check ‘Split for Training and Test Data Sets’ and set the data split ratio. Once you hit the ‘Run’ button, you will get the model summary information such as AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), along with the parameter estimates, so you can see how each predictor would impact the outcome (Estimate), whether it is statistically significant (P Value), what the confidence intervals for the predictors are, etc.

You can extract these data as data frames or use this model to predict against the training or the test data.
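If you are working in plain R, a common way to get the same kinds of tables from a ‘glm’ object is the ‘broom’ package. This is just for reference and not necessarily what Exploratory does internally.

library(broom)

glance(model)                 # one-row model summary: AIC, BIC, deviance, etc.
tidy(model, conf.int = TRUE)  # per-predictor estimates, p-values, confidence intervals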

Predict

You can select ‘Predict on Test Data’ from the ‘Add’ button menu. This will create a data frame that has the original columns plus the predicted information, such as the predicted value (the log odds of the event), the standard error, ‘predicted_probability’, ‘predicted_label’, etc.

The values for ‘predicted_label’ are decided based on a threshold value that you either set manually or have set automatically to optimize a given metric like ‘Accuracy’, ‘F Score’, etc. in the ‘Predict Data — Binary’ dialog.
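Conceptually, the label is just a cut on the predicted probability. Here is a minimal sketch in plain dplyr terms, assuming a hypothetical ‘predicted’ data frame and a placeholder threshold of 0.5.

library(dplyr)

# Placeholder threshold; Exploratory can instead pick the value that
# optimizes Accuracy, F Score, etc.
threshold <- 0.5
predicted <- predicted %>%
  mutate(predicted_label = predicted_probability >= threshold)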

Confusion Matrix

The predicted data output is just like any other data frame, so you can start visualizing it right away. For example, you can use a Pivot Table to get a ‘confusion matrix’ by assigning the original ‘answer’ column, in this case ‘is_delay’, to Row and the ‘predicted_label’ column to Column. You can keep the default ‘Number of Rows’ for Value, but you want to set the window calculation to ‘% of Total’ so that we can see the percentage of each section, such as True Positive, False Positive, etc.
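If you prefer to compute the same table in code, here is a minimal dplyr/tidyr sketch, again assuming the hypothetical ‘predicted’ data frame from the previous step.

library(dplyr)
library(tidyr)

# Count each (actual, predicted) combination and show each cell as a share of
# all rows, which mirrors the '% of Total' Pivot Table setup described above.
predicted %>%
  count(is_delay, predicted_label) %>%
  mutate(pct_of_total = n / sum(n) * 100) %>%
  select(-n) %>%
  spread(predicted_label, pct_of_total)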

Evaluate with ROC Curve

One of the ways to evaluate your model’s prediction performance is to draw a ROC (Receiver Operating Characteristic) curve. For this, you can select ‘Binary Classification — ROC’ under the ‘Evaluate Quality of Prediction’ menu.

This will generate data with ‘true_positive_rate’ and ‘false_positive_rate’ like below.

And you can go to the Viz tab and quickly visualize this data with a Scatter, Line, or Area chart. Here, I’m using a Line chart. Note that you want to use ‘Average’ or ‘Median’ for the Y-Axis since there can be multiple entries for the same ‘false_positive_rate’ value on the X-Axis.
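Outside of the Viz tab, the same chart can be drawn with ggplot2. This is a sketch assuming a hypothetical ‘roc_data’ data frame with the two columns above.

library(ggplot2)

# ROC curve as a line chart, with a dashed diagonal as the random-classifier reference.
ggplot(roc_data, aes(x = false_positive_rate, y = true_positive_rate)) +
  geom_line() +
  geom_abline(linetype = "dashed") +
  labs(x = "False Positive Rate", y = "True Positive Rate")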

Build a Model for Each Group and Evaluate Them

The cool thing about building a model as part of the grammar-based data wrangling steps is that you can bring the ‘Grouped data frame’ concept into the mix.

You can go to the step before the ‘Logistic Regression’ step and insert a ‘Group’ step to group the data frame by, for example, ‘CARRIER’.

Now, when you click on the ‘Logistic Regression’ step to go back to it, the model is rebuilt automatically, this time with one model for each ‘CARRIER’. This means you can compare the quality of the models right away. The top section of the summary view lists the model summary information for each ‘CARRIER’ model.
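In code terms, this is simply a group_by() placed before the model-building step, as in the full script at the end of this post. Here is a minimal sketch with the hypothetical ‘flight’ data frame.

library(dplyr)
library(exploratory)

# One logistic regression model per CARRIER; build_lr returns a data frame
# with one model row per group.
models_by_carrier <- flight %>%
  group_by(CARRIER) %>%
  build_lr(is_delay ~ DEP_DELAY + AIR_TIME, test_rate = 0.3)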

And by clicking back on the ‘Calculate ROC’ step on the right hand side, the ‘Predict’ step automatically predicts against the test data again and the ROC values are recalculated. So, by going back to the Viz view and simply assigning the ‘CARRIER’ column to Color, you get one ROC curve for each model that was built for each ‘CARRIER’ group.
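In ggplot2 terms, assigning the group to Color is just one more aesthetic on the same chart (again assuming the hypothetical ‘roc_data’, now with a CARRIER column).

library(ggplot2)

# One ROC curve per carrier, distinguished by color.
ggplot(roc_data, aes(x = false_positive_rate, y = true_positive_rate, color = CARRIER)) +
  geom_line() +
  geom_abline(linetype = "dashed")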

Use Branch to Keep Evaluation Data Separate

You can also generate prediction performance measures like AUC, Recall, Precision, etc. based on the data at the ‘Predict’ step. If you want to keep the ‘Calculate ROC’ step, you can create a branch, which is essentially another data frame that branches off of this ‘Predict’ step.

And in the newly created branch, you can select ‘Binary Classification — Metrics’.

This will give you a series of metrics to evaluate the prediction performance for all the models, one for each group, in this case ‘CARRIER’.
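For reference, several of these metrics are easy to compute directly from the predicted labels. Here is a minimal dplyr sketch, assuming the hypothetical ‘predicted’ data frame with is_delay and predicted_label columns for each CARRIER (AUC is the exception, since it comes from the ROC curve rather than the labels).

library(dplyr)

# Accuracy, precision, and recall per carrier from the predicted labels.
predicted %>%
  group_by(CARRIER) %>%
  summarize(
    accuracy  = mean(predicted_label == is_delay),
    precision = sum(predicted_label & is_delay) / sum(predicted_label),
    recall    = sum(predicted_label & is_delay) / sum(is_delay)
  )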

I have shared the data with reproducible steps here.

You can download the EDF and import it to your Exploratory Desktop. You need Exploratory v3.0 to reproduce this.

If you want to reproduce this in a standalone R environment like RStudio, here is an R script.

# Set libPaths.
.libPaths("/Users/kannishida/.exploratory/R/3.3")

# Load required packages.
library(tidyr)
library(dplyr)
library(exploratory)

# Data Analysis Steps
exploratory::read_delim_file("/Users/kannishida/Dropbox/Data/demo/airline_delay_2016-08.csv" , ",", quote = "\"", skip = 0 , col_names = TRUE , na = c("","NA") , locale=locale(encoding = "UTF-8", decimal_mark = "."), trim_ws = FALSE , progress = FALSE) %>%  # import the flight delay CSV
exploratory::clean_data_frame() %>%
select(-CANCELLATION_CODE) %>%  # drop the cancellation code column
separate(ORIGIN_CITY_NAME, into = c("city", "state"), sep = "\\s*\\,\\s*", convert = TRUE) %>%  # split "City, State" into two columns
filter(!is.na(ARR_DELAY)) %>%  # keep only flights that have an arrival delay value
mutate(is_delayed = ARR_DELAY > 0) %>%  # TRUE if the flight arrived late
group_by(CARRIER) %>%  # build one model per carrier
build_lr(is_delayed ~ DEP_DELAY + AIR_TIME, test_rate = 0.3) %>%  # logistic regression with a 70/30 train/test split
prediction_binary(data = "test", threshold = "accuracy") %>%  # predict on the test data, threshold optimized for Accuracy
do_roc(predicted_probability, is_delayed)  # calculate the ROC curve data

If you don’t have Exploratory Desktop yet but want to try this out quickly, you can sign up at our website for a free trial!


CEO / Founder at Exploratory (https://exploratory.io/). Having fun analyzing interesting data and learning something new every day.