Exploratory v4.3 Released!

Kan Nishida · Published in learn data science · 15 min read · Apr 8, 2018


I’m super excited to announce Exploratory v4.3! 🎉

We have made a lot of enhancements and bug fixes over the last few months, especially in the following four areas.

  • Analytics
  • Dashboard
  • Visualization
  • Data Wrangling

Let’s take a look at some of the highlights quickly.

Analytics

We have made many improvements to the Analytics view. Let’s start with the new additions around Hypothesis Testing.

Hypothesis Testing

We have added the following three hypothesis tests.

  • T-Test
  • ANOVA Test
  • Chi-Square Test

Let’s take a look at them one by one.

T-Test

The T-Test tests whether the difference in the average (mean) between two categories is statistically significant. It can be used when we want to know whether the difference we see between two categories is something we can count on or something that could have happened by chance.

To demonstrate, this is a chart that shows the average monthly spending of the customers by gender (female vs. male) for this fictional supermarket.

Now, is the difference between the two bars statistically significant? Or is it just a marginal difference that could have happened by chance?

We can go to Analytics view and select T-Test to test this hypothesis.

Assign Monthly Spending to Target Variable and Gender to Explanatory Variable, and click the Run button.

What we want to pay attention to here is the P-Value, which is about 0.22. This is not small enough to reject the null hypothesis. (Note that I’m using the broadly accepted P-Value threshold (0.05) as a guide here, but that’s not to say I blindly believe in this particular value. That’s a topic for another time.)

What is the null hypothesis then? The null hypothesis here is that the two categories (Female and Male) and the monthly spending are independent; one does not depend on the other. Not being able to reject this null hypothesis means that Gender and Monthly Spending may well be independent, not related. So the difference between Female and Male we saw in the bar chart above could have happened by chance.

We can check the Error Bar tab to see the Monthly Spending values for each gender with 95% confidence intervals.
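Outside of Exploratory, the same comparison can be sketched in a few lines of Python with scipy. The spending numbers below are made up purely for illustration; they are not the data from the chart above.

```python
from scipy import stats

# Hypothetical monthly spending values for each gender (illustrative only).
female = [520, 480, 610, 450, 530, 575, 490, 505]
male = [500, 470, 560, 430, 515, 540, 465, 480]

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(female, male, equal_var=False)

if p_value < 0.05:
    print("Reject the null hypothesis: the difference is significant.")
else:
    print("Cannot reject the null hypothesis: it could have happened by chance.")
```

The P-Value returned here plays exactly the role described above: only when it falls below your threshold (0.05 by default in Exploratory) do you reject the null hypothesis.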

Let’s pick one more example.

Here, I have a bar chart showing the average monthly spending for one group of customers who have churned (right hand side) and another group who are still customers with this store (left hand side).

Is this difference statistically significant?

Again, we can use the T-Test under the Analytics tab to test this.

This time the P-Value is very close to 0. Note that ‘e-13’ at the end of the number means the number is multiplied by 10⁻¹³, so it is actually something like 0.0000000000004433…, which is small enough to reject the null hypothesis. This means we can conclude that the difference in the average monthly spending between the churned and not-churned customers is statistically significant, not by chance.

ANOVA Test

The T-Test tests the difference in the averages of two categories. When you have more than two categories, you want to turn to the ANOVA Test.

To demonstrate, I’m going to use the same superstore customer data.

We are looking at the average customer monthly spending for three regions, North Bay, South Bay, and Wine Counties.

Are these differences statistically significant?

We can use the ANOVA Test under the Analytics tab to test this.

First, what is the null hypothesis?

It is that the three categories and the average customer monthly spending are independent. This means that a difference in region wouldn’t make a difference in monthly spending.

Now, given that the P-Value is very close to zero, we can reject the null hypothesis and conclude that the differences in the average monthly spending among the regions are statistically significant.
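The same kind of test can be sketched with scipy’s one-way ANOVA. The per-region samples below are invented for illustration, with one region deliberately set higher so the test rejects the null hypothesis:

```python
from scipy import stats

# Hypothetical monthly spending samples per region (illustrative only).
north_bay = [480, 510, 495, 520, 505]
south_bay = [450, 470, 460, 455, 475]
wine_counties = [640, 660, 655, 630, 645]

# One-way ANOVA: is at least one group mean different from the others?
f_stat, p_value = stats.f_oneway(north_bay, south_bay, wine_counties)
```

Because Wine Counties is clearly higher than the other two groups in this made-up data, the P-Value comes out tiny and we reject the null hypothesis, just as in the example above.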

Here is how Error Bar shows the average values with the 95% confidence interval ranges by the regions.

Unlike the previous example, we can see that the ranges do not overlap, especially for Wine Counties.

Chi-Square Test

This one is my favorite of this release. The Chi-Square test tests whether two categorical variables are independent. It is also used for A/B Testing, where we want to see if the difference between A’s performance and B’s performance, such as conversion rate or subscription rate, is statistically significant.

To demonstrate, I’m going to use the US Baby data that I have been using for the ‘A beginner’s guide to Linear Regression’ series.

Here, I’m showing the ratio of prematurely born babies for each Mother Race.

The blue is the ratio of the prematurely born babies. We can see the ratios are different among Mother Races. For example, Black has more and Chinese has less than the others. Now the question is: are Mother Race and the ratio (or rate) of prematurely born babies independent? To put it another way, is the difference in the ratios of prematurely born babies among Mother Races statistically significant?

We can use Chi-Square Test under Analytics view to test this.

The P-Value is zero (or almost zero), which means that we can reject the null hypothesis. The null hypothesis here is that Mother Race and whether a baby is born prematurely are independent. Now that we can reject it, we can conclude that the difference in the prematurely born baby ratios among the mother race types is statistically significant.
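This is the classic chi-square test of independence on a contingency table. Here is a sketch with scipy; the counts below are invented for illustration and are not the real US Baby numbers:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of premature vs. full-term births by mother race
# (rows: race, columns: [premature, full_term]; numbers are made up).
observed = [
    [180, 1020],  # race A
    [ 90, 1110],  # race B
    [ 60, 1140],  # race C
]

# chi2_contingency returns the statistic, the P-Value, the degrees of
# freedom, and the table of expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(observed)
```

With premature rates of 15%, 7.5%, and 5% on roughly 1,200 births each, the P-Value is tiny and the null hypothesis of independence is rejected.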

Chi-Square for Exploratory Data Analysis

While Chi-Square test is to test the hypothesis, we can also use it as part of Exploratory Data Analysis.

The Chi-Square value is calculated from the differences between the expected values, which are computed under the assumption that the two categorical variables are independent, and the actual (observed) values. It can therefore help us find which combinations of the two categorical variables have gaps that are larger than expected.

To demonstrate, let’s take a look at the US Baby data again. Here, I’m showing the ratio of each Mother Race type for each state with a bar chart.

Among all these combinations of the races and the states, I want to find out if there are any combinations that are unique. Just by looking at it, we can see most of the states have White (blue) and Black (green) as the top race types. But some states show different patterns than the typical ones. For example, Hawaii (HI) has a lot more ‘Other Asians’, ‘Filipino’, and ‘Japanese’ than other states.

Are there any other unique combinations?

This is when we can turn to Chi-Square Test, too. First, let’s run the test under Analytics tab by assigning Mother Race to Target Variable and State to Explanatory Variable.

Given that the P-Value is zero (or almost zero), Mother Race and State are not independent, meaning that when State changes we would expect the ratio of Mother Race to change as well.

We can also see that the Chi-Square value is 188,822.67, a very high value, which is why the P-Value is so low. Again, the Chi-Square value represents the differences between the expected ratio of Mother Race in each State and the actual ratio.

By going to the Contribution tab, we can see which combinations of Mother Race and State have made the Chi-Square value higher.

For example, as expected, we can see Other Asian, Japanese, and Filipino are the three races in Hawaii that contribute a good portion of the Chi-Square value.

We can also see that the combinations of Black and some states like Louisiana (LA), California (CA), and Georgia (GA) are also contributing a good portion of the Chi-Square value.

While we can see which combinations are contributing to the Chi-Square value, we don’t know the direction, meaning that we don’t know whether the ratio of such a combination is bigger or smaller than expected. This is when we want to go to the Difference tab.

For example, we can see that the combination of Black and California is contributing to the Chi-Square value because the actual ratio of this combination is much lower than the expected ratio.
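The ideas behind the Contribution and Difference tabs are easy to reproduce by hand: each cell’s contribution is (observed − expected)² / expected, and the sign of (observed − expected) gives the direction. A sketch with made-up Mother Race × State counts (not the real data):

```python
from scipy.stats import chi2_contingency

# Hypothetical Mother Race x State counts (illustrative only).
observed = [
    [900, 300],  # race A: [state X, state Y]
    [200,  20],  # race B
    [100, 400],  # race C
]

chi2, p_value, dof, expected = chi2_contingency(observed)

# Per-cell contribution to the Chi-Square value (always non-negative):
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# Signed difference, which tells the direction:
# positive = more than expected, negative = fewer than expected.
differences = [
    [o - e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
```

The contributions across all cells add up to the Chi-Square statistic itself, which is exactly why a few extreme combinations can dominate the total.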

Chi-Square Test is useful when you want to test your hypothesis with two categorical variables (columns), but it can be handy when you are exploring your data as well.

Shapiro-Wilk Test / Normality Test

Many statistical algorithms, such as the T-test and Pearson Correlation, assume that the numerical variable values are normally distributed (Normal Distribution). But how can we be sure that the variables (columns) we are interested in are normally distributed?

One of the most popular ways to test this is something called Shapiro–Wilk test, and we have added it as ‘Normality Test’ under Analytics tab.

Here, I have these 6 numeric variables for the US Baby data.

Which are the ones ‘normally distributed’?

We can run Shapiro-Wilk test by assigning the numeric variables to Columns.

To understand this test, we first need to understand the null hypothesis it is testing. The null hypothesis is that the sample of a numeric variable comes from a normally distributed population. That is, if the P-Value is small enough, we can reject the null hypothesis, meaning that the given variable is not normally distributed. If the P-Value is large, we can’t reject the null hypothesis, meaning that we have no evidence against the variable being normally distributed.

Phew! Sometimes, just thinking about this null hypothesis thing makes my head hurt… ;)

Anyway, by looking at the P-Values, only the Father Age column has a P-Value large enough, which means we can treat it as normally distributed.

It’s hard to wrap our heads around this null hypothesis, so we have created a column ‘Normal Distribution’ that simply tells you whether each variable is normally distributed or not. It uses 0.05 as the P-Value threshold by default.

You can check the Q-Q (Quantile-Quantile) Plot to see which parts of the data are close to or far from the theoretical normal distribution. If it’s a perfect normal distribution, all the data points (blue dots) for a given variable should be on the gray straight line.
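To make the logic concrete, here is a sketch with scipy’s Shapiro-Wilk test on two simulated columns: one drawn from a normal distribution and one clearly skewed (both invented for illustration):

```python
import random
from scipy import stats

random.seed(0)

# Simulated columns: one roughly normal, one clearly skewed (illustrative).
normal_like = [random.gauss(30, 5) for _ in range(200)]
skewed = [random.expovariate(1.0) for _ in range(200)]

# Shapiro-Wilk: small P-Value -> reject the null hypothesis of normality.
w1, p_normal = stats.shapiro(normal_like)
w2, p_skewed = stats.shapiro(skewed)
```

For the skewed column the P-Value is tiny and normality is rejected; for the normal-like column the P-Value is typically large, so we can’t reject normality, matching the ‘Normal Distribution’ column described above.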

Survival Analysis — Log-Rank Test

We have added the Log-Rank Test for survival curves in this release. The Log-Rank Test tests whether the difference in the survival curves between cohorts (or groups) is statistically significant. The null hypothesis is that the survival curves for the cohorts are identical, which means that the difference in the categorical variable would have no influence on the survival curves.

To demonstrate, I have run a cohort analysis with Survival Analysis under the Analytics view. Survival Analysis uses the ‘Kaplan-Meier’ algorithm to draw the survival curves like below.

We are looking at the survival curves that show the user retention rates over time for users from two OS (Operating System) groups: Mac and Windows.

Now, is the difference between Mac’s survival curve and Windows’s survival curve statistically significant? Or is it just a marginal difference that could have happened by chance?

We can go to the Summary tab and see that the P-Value is 0.03, which is small enough to conclude that the difference between the two survival curves of Mac and Windows is statistically significant.
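For the curious, the two-group log-rank statistic is simple enough to sketch by hand: at each event time, compare the observed number of events in one group against the number expected if the curves were identical. This is a minimal illustrative implementation, not Exploratory’s actual code; the durations and event flags in the example are made up.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. times: observed durations; events: True if
    the event (e.g. churn) was observed, False if the subject was censored."""
    pooled = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    event_times = sorted({t for t, e in pooled if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # group A still at risk
        n_b = sum(1 for x in times_b if x >= t)  # group B still at risk
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # events expected in A under the null
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p_value = logrank_test(
    [3, 5, 8, 12], [True, True, False, True],   # group A (e.g. Mac)
    [2, 4, 6, 9], [True, True, True, False],    # group B (e.g. Windows)
)
```

A small P-Value lets you reject the null hypothesis that the two survival curves are identical, just as in the Summary tab above.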

Time Series Forecasting with Prophet

We have added two new tabs for two seasonalities, Yearly and Weekly, for Time Series Forecasting with Prophet under Analytics view.

Under Yearly, you can see the yearly seasonality trend.

We can see that the sales go up in May and between August and December.

Under Weekly, you can see the weekly seasonality trend.

Here, we can see that the sales tend to be lower on Sunday and Monday, but it’s pretty much the same for the rest of the week.

Note that in order to get the weekly seasonality you need to set the Date/Time Column’s aggregation level to Day.

And of course, you can use ‘Repeat By’ to subset the data into multiple groups and build multiple Prophet’s time series forecasting models.

For example, here, we can see that some regions like USCA and LATAM show the sales going up toward the end of the year, while Asia and Europe show a strong sales hike around May.

Filter Support

Just like you could for Charts in the previous releases, now you can create Analytics-level Filters as well. This means that you can have multiple analytics tabs building multiple models, say Linear Regression models, but with different data sets based on different filter configurations.

Analytics Property

We have added Property support to each analytic type under Analytics view so that you can customize the default settings.

One of the properties exposed is the P-Value threshold. For example, Linear Regression Analysis uses this setting to decide whether to show a coefficient bar in gray. The default value is 0.05 (5%), but that doesn’t mean it should always be 0.05; it really depends on your use case.

You can now change that setting like below.

Dashboard

We have re-designed the dashboard layout configuration experience. Now you can have more than 4 charts! ;)

And, you can now drag-and-drop to change the chart positions.

Additionally, you can have a title for Dashboard!

If you are interested in more details, here is a blog post about the Dashboard.

Visualization

Column Name Search

When there are many columns it can be hard to find a column you want to assign to the chart by just scrolling up and down.

You can now search the columns simply by typing part of the column name!

Trend Line — Metrics

The trend line with Linear Regression for Scatter chart now shows the following metrics.

  • P-Value
  • R Squared
  • Coefficient Estimate
  • Correlation

Reference Line — Range

We have added Reference Line support in v4.2, and now we are adding Range support for Reference Line.

Here, I’m showing Monthly Spending of each customer by Region with Scatter chart.

I can add a reference line with Range of 2 Standard Deviations.

This would be even more interesting when I make the reference line and the range to be calculated for each category at X-Axis.

The distributions of the monthly spending for each Region seem to overlap when we just look at the dots. But having the average and the 2 standard deviation range for each group lets us see that the average monthly spending of Wine Counties is actually very high compared to the other two regions, especially when compared to South Bay’s two standard deviation range.
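The per-group reference line and range are just each group’s mean plus and minus two standard deviations, which is easy to compute with Python’s standard library. The per-region numbers below are made up for illustration:

```python
from statistics import mean, stdev

# Hypothetical monthly spending per region (illustrative only).
regions = {
    "North Bay": [480, 510, 495, 520, 505, 470],
    "South Bay": [450, 470, 460, 455, 475, 440],
    "Wine Counties": [640, 660, 655, 630, 645, 670],
}

ranges = {}
for region, values in regions.items():
    m, sd = mean(values), stdev(values)
    # Reference line at the mean, band of +/- 2 standard deviations.
    ranges[region] = (m - 2 * sd, m, m + 2 * sd)
```

In this made-up data, the Wine Counties mean sits entirely above South Bay’s upper band, which is the kind of separation the chart above makes visible at a glance.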

Integer and Integer 10 Aggregation

Sometimes you want to show numeric values as ‘Integer 10’, which rounds the values to every 10, such as 10, 20, 30, etc., on, say, the X-Axis.

Here, I have assigned Age column to X-Axis to see the average of the monthly spending for each age.

By changing the number type to ‘Integer 10’, I can show the average for each 10-year age group like below.
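Assuming ‘Integer 10’ floors each value to the nearest multiple of 10 below it (my reading of the behavior, not an official specification), the aggregation can be mimicked like this, with made-up (age, spending) pairs:

```python
from collections import defaultdict

# Hypothetical (age, monthly_spending) pairs (illustrative only).
customers = [(23, 310), (27, 340), (34, 420), (38, 460), (41, 500), (45, 520)]

buckets = defaultdict(list)
for age, spending in customers:
    buckets[(age // 10) * 10].append(spending)  # 23 -> 20, 34 -> 30, ...

# Average monthly spending per 10-year age group.
averages = {decade: sum(v) / len(v) for decade, v in sorted(buckets.items())}
# averages -> {20: 325.0, 30: 440.0, 40: 510.0}
```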

Data Wrangling

Column Name Search

Just like we have added this to Chart, all the data wrangling UIs also support the column name search. Here is an example of Filter UI.

Data Wrangling Steps

We have made the user experience around the data wrangling steps on the right hand side more consistent with other parts of Exploratory.

First, you can now select multiple steps by pressing the Shift key, or the Command (Mac) / Control (Windows) key. This makes selecting multiple steps very similar to the experience of selecting multiple columns in the Summary and Table views.

And when the multiple steps are selected you can do one of the following operations.

  • Cut Steps
  • Copy Steps
  • Delete Steps

Another thing we did is to support Drag-and-Drop to change the order of the steps.

Column Name Highlight in Steps

When you have many data wrangling steps on the right hand side, it becomes harder to find where the columns you’re investigating have originally been created and how they are defined.

Now you can simply click on the column either in Summary or Table view. This will highlight the column name with light blue color in the data wrangling steps like below.

This looks like a small change, but it can be very useful! ;)

Projects

You can finally duplicate (copy) existing projects on the Projects page.

That’s it for now!

Make sure to check out the release note for all the enhancements and bug fixes and download v4.3 from our download page to start exploring it today!

If you don’t have an Exploratory account yet, sign up for a 30-day free trial, no credit card required! If you are currently a student or teacher, it’s free!

Happy Exploratory v4.3! 🍾

Cheers,

Kan


CEO / Founder at Exploratory(https://exploratory.io/). Having fun analyzing interesting data and learning something new everyday.