Exploratory v3.5 Released!

Published in

learn data science

9 min readJun 5, 2017

I’m excited to announce Exploratory v3.5 today! 🎉

The main things are Market Impact Analysis, which is essentially an UI wrapper for an algorithm called Causal Impact from Google, a set of enhancements for SVD (Singular Value Decomposition) algorithm support, Annotation Support for the steps at the right-hand side, a set of enhancements for Pivot Table, and another Performance Improvement.

Let’s take a look quickly.

Market Impact Analysis

Let’s say you are a marketing person and you run a marketing campaign. You want to know how the campaign has actually helped to increase, say, your website traffic, sign ups, conversions, or whatever you hoped to improve. It’s actually hard to measure such impact because there are many attributes that can influence the outcome (e.g. page views). This is called ‘noise’.

Google had the same problem and Kay Brodersen and the team at Google built this algorithm called Causal Impact to address this very challenge and open-sourced it as an R package a few years back.

I have written a separate post to introduce the Causal Impact algorithm as well as to explain how it’s implemented in Exploratory, but here’s a quick overview.

Basically, it builds a Bayesian structural time series model based on multiple (can be single, too) comparable control groups (or markets) and uses the model to project (or forecast) a synthetic baseline for the time period of after the event.

The chart below is visualizing the result of applying the Causal Impact algorithm to a typical Google Analytics data with Date, Page Views, and Countries. The blue line represents the actual page views. The orange line represents the page views that we would have expected if there was no such ‘Impacting Event’ occurred. And the green line represents the difference between the two, hence it’s considered as the real ‘impact’ by the event. The event itself is represented as the red circle.

We can also accumulate all the impact values since the event occurred and see the total impact values up to any given point. The purple line below represents this.

This solution helps you measure the ROI (Return-on-Investment) of such event (e.g. marketing campaign) in a much better way even when we are not sure if the event was the only source of the impact. Again, I have written a separate post for more details, I’d recommend you take a look and give it a shot.

An Introduction to Causal Impact Analysis

Let’s say you are a marketing person and you run a marketing campaign. You want to know how the campaign has actually…

blog.exploratory.io

SVD / PCA Algorithm

SVD (Singular Value Decomposition) is an algorithm that can be used to reduce dimensionality. This technique is often associated with PCA (Principal Component Analysis).

There are many situations you want to use this algorithm. One of them is to segment (or cluster) customers, locations, products, etc. For example, we have this last year’s California Ballot Measure election result data. We can see what is the ratio of the people who supported for each measure by each county such as San Francisco, Napa, Los Angeles, etc.

Now let’s say we want to segment the counties or understand the similarities among the counties. There are many ways to do this, and one of them is to use this SVD algorithm.

First, we want to make each of the ballot measures to be a column so that it will look like we have multiple measures for each county. We can use ‘Pivot’ command to do this quickly.

Then, we can select all the measure columns while pressing Shift or Control/Command key and select ‘Run SVD to Reduce Dimensionality’ from the column header menu.

This will create two new columns like below.

These two columns represent the counties based on all the original measures in a two-dimensional space. And this means we can easily visualize the relationship among the counties by using these two new measures with Scatter chart like below.

Now, based on the values of these two measures we can create clusters (or segments) by applying K-means algorithm. Once we assign the newly generated cluster ID to Color on the same Scatter chart, it makes it easier to see which counties are similar to one another.

Again, there are many ways to segment this kind of data. I have used a technique of mixing Euclidian distance and MDS algorithms to cluster the counties for this same data in this blog post before, in case you’re interested.

Finding Similarities Among California Counties based on 2016 Election Result with dist, mds, and…

In the last two posts, we have looked at how Californians voted for 2016 US Election including the Presidential…

blog.exploratory.io

The SVD support is not entirely new, but we have refreshed the UI and fixed some of the critical bugs, so it’s much easier to use it now.

Annotating Steps

When you analyze data it’s often that you will end up with many steps at the right hand side. And when you come back to this analysis after a while it can be hard to remember what you did and why you did.

And of course, this can become even worse when you collaborate with others by sharing EDF files, which includes all the steps to reproduce the data, because those steps are not necessarily self-explanatory.

This is why we are adding ‘Annotating Steps’ feature. As the name indicates, you can set the title text for each step and type annotation text as notes. Adding the annotation is super simple. When you mouse over on any steps you will see this ‘comment’ looking icon.

Simply clicking on it will open a dialog where you can type both the title and the annotation text.

When you see the green colored annotation icon you can simply mouse over on it, which will show the annotation text.

Performance Improvements

As some of you know, we have been working on the performance improvements over the last few releases. In this release, we have reviewed Summary, Table, and Chart view rendering areas and did a lot of fine-tunes. The result turns out to be pretty good, or at least that’s how I’m feeling it now. You will notice the improvements especially when you move between the data transformation steps (the right-hand side), between the views (Summary, Table, and Chart), between the chart tabs, and between the data frames.

Being able to move between these views quickly makes you want to explore your data more, or at least it won’t stop you from being experimental when analyzing data. Which is very important for us because at the end of the day we want Exploratory to provide the best exploratory data analysis environment! 💪

Pivot Table Enhancements

We have made a few enhancements for the Pivot Table.

Multiple Measures Support

Finally, you can add multiple measure columns (Value) into Pivot Table. 😎

And the ‘% of Total’ calculation is supported for each measure, which means you can show the original values of one measure column and the % of Total values of the same column next to each other!

Group Level Color

When you have multiple columns in Row, then you might want to apply the color scale at the group level instead of each row or column level. Now you can select ‘Row (Group Level)’ for Color Scale setting.

Here are some differences.

This is when I set ‘All’ to Color Scale. We can see high numbers are shown with Red color in Japan and United States. No cells in the UK are not highlighted with Red color because the numbers in the UK are much lower than the other two countries in general.

Now, when I set it to ‘Row (Detail)’, then we can see a lot of the red colored cells. This is because the color is scaled for each row. This is convenient if you want to see what are the top values for each group at the detail level, in this case, that is ‘source’ column.

But it would be more useful if we can find out which groups have higher values in each country. This is when you want to set the Color Scale to ‘Row (Group Level).

Here, we can see ‘t.co (Twitter)’, ‘quora.com (Quora)’, and ‘github.com (Github)’ are the top sources in the UK. Something that could not have been obvious with the other two configurations.

% of Total at Row Group Level

The Row Group Level is also supported for ‘% of Total’ calculation.

Export Pivot Table Data

Now you can directly export the data presented in Pivot Table in a series of formats like CSV, JSON (more on this later), Clipboard, and EDF.

Other Enhancements

We have also managed to deliver a bunch of other incremental enhancements. Here are some of them worth mentioning.

Separating a Single Column to Multiples by the Position

Not only by setting a separating character, you can also specify the position(s) to separate the text.

Here, I’m trying to separate the departure time data into the hour and the minute by separating the first two digits and the last two digits into two separate columns.