Introduction to K-means Clustering in Exploratory

Kan Nishida · Published in learn data science · Jan 12, 2017


Note

This post was written a few years back, and the tool I used in it has evolved a lot since then. I’d suggest you take a look at this recent post about K-means clustering to get the latest information.

The original post starts from here...

Clustering splits data into a set of groups based on the underlying characteristics or patterns in the data. One of the popular clustering algorithms is ‘k-means clustering’, which splits the data into a set of clusters (groups) based on the distance between each data point and the center location of each cluster. One of the easiest ways to understand this concept is to use a scatterplot to visualize the clustered data.

Let’s say we have data that looks something like the scatterplot below.

Now, if we use the ‘k-means clustering’ algorithm to split this data into a set of groups, say 5 groups, it will look something like below.

In this chart, each color represents its own cluster. You can see that the data is grouped into 5 clusters based on the points’ proximity to one another.
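To make this concrete, here is a minimal sketch in R (the language behind Exploratory’s data wrangling steps, as the mutate_at step later in this post shows) that generates some made-up 2D data, runs base R’s kmeans() with 5 clusters, and colors each point by its assigned cluster. The data and seed are purely illustrative.

# Toy illustration of k-means: made-up 2D data, split into 5 clusters.
set.seed(7)
toy <- data.frame(
  x = rnorm(250, mean = rep(c(1, 4, 8, 2, 6), each = 50)),
  y = rnorm(250, mean = rep(c(2, 7, 3, 8, 5), each = 50))
)

km <- kmeans(toy, centers = 5)      # assign each point to the nearest of 5 centers

plot(toy, col = km$cluster, pch = 19)            # one color per cluster
points(km$centers, col = 1:5, pch = 4, cex = 2)  # mark each cluster's center location

Each color is one cluster, and the crosses mark the center locations the algorithm finds.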

There are two ways to cluster your data.

Let’s take a look at them one by one.

Cluster Data based on Variables (Columns)

Here, I have US flight delay data like below. (Download link)

I’ve assigned DEP_DELAY (Departure Delay) column to X-Axis and ARR_DELAY (Arrival Delay) column to Y-Axis.

Each dot represents a flight that left a certain city on a certain date. Now, let’s say we want to group these flights into a given number of clusters based on the arrival delay and departure delay times.

This is when we want to use the ‘k-means clustering’ algorithm. In Exploratory, we can select ‘Cluster with K-means’ under the Others button in the Analytics tab to run the algorithm quickly.

In this scenario, we want to cluster all the flights, so we can keep the first parameter at its default, ‘Rows into clusters’.

Then select the ARR_DELAY (Arrival Delay) and DEP_DELAY (Departure Delay) columns so that the distances are calculated based on the values from these columns to create the clusters. We can keep all the other parameters at their defaults and hit the Run button.

By assigning the newly created ‘cluster’ column, which holds the cluster ID values, to the chart, we can see that the data is now split into 3 groups.
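For those who prefer to see this step as code, the following is a rough R equivalent of what this run does; the data frame name flights_delay is just a placeholder for the downloaded flight delay data, not something the tool creates.

library(dplyr)

# kmeans() cannot handle missing values, so drop rows with missing delays first.
flights_complete <- flights_delay %>%
  filter(!is.na(ARR_DELAY), !is.na(DEP_DELAY))

set.seed(1)
km <- kmeans(select(flights_complete, ARR_DELAY, DEP_DELAY), centers = 3)

# Attach the cluster ids as a new 'cluster' column, like the one created above.
flights_clustered <- flights_complete %>%
  mutate(cluster = factor(km$cluster))

Changing centers = 3 to centers = 5 here corresponds to raising the number of clusters in the dialog, which is what the next step does.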

We can change the number of clusters to, say, 5,

and we can see the result right away.

Now, instead of using just those two columns, how about adding another column to the cluster calculation?

We can bring in the ‘DISTANCE’ (Distance) column,

and hit the Run button to re-calculate the clusters.

Now, the red colored cluster and the orange colored cluster are on top of each other, and it doesn’t appear that the data is clearly split. However, by assigning the ‘DISTANCE’ column to the Z-Axis, we can see that the data is indeed clearly clustered into 5 groups, though in this case it’s clustered mostly along the ‘DISTANCE’ values.

This is because the ‘DISTANCE’ values are much larger than the values of the other two columns. We can quickly normalize those three columns like below.

mutate_at(vars(ARR_DELAY, DEP_DELAY, DISTANCE), scale) %>%      # standardize each column to mean 0 and standard deviation 1
mutate_at(vars(ARR_DELAY, DEP_DELAY, DISTANCE), as.numeric) %>% # scale() returns one-column matrices, so convert them back to plain numeric vectors
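Putting the whole thing together as one pipeline, a sketch of the normalized version might look like this in R (again assuming the placeholder flights_delay data frame from the earlier sketch):

library(dplyr)

set.seed(1)
flights_scaled <- flights_delay %>%
  filter(!is.na(ARR_DELAY), !is.na(DEP_DELAY), !is.na(DISTANCE)) %>%
  mutate_at(vars(ARR_DELAY, DEP_DELAY, DISTANCE), scale) %>%
  mutate_at(vars(ARR_DELAY, DEP_DELAY, DISTANCE), as.numeric)

km <- kmeans(select(flights_scaled, ARR_DELAY, DEP_DELAY, DISTANCE), centers = 5)

flights_clustered <- flights_scaled %>%
  mutate(cluster = factor(km$cluster))

Because all three columns are now on the same scale, no single column dominates the distance calculation.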

Back in Exploratory, we can then apply the same ‘K-means clustering’ step and get a chart like the one below.

You can see that the data is now split into 5 clusters along the three dimensions.

As you can see, you can quickly run the ‘K-means clustering’ algorithm in Exploratory and find similarities in the data with just a few clicks. If you don’t have Exploratory yet, you can sign up here for free!


CEO / Founder at Exploratory (https://exploratory.io/). Having fun analyzing interesting data and learning something new every day.