Tidy untidy United Nations’ migration data with tidyr

Kan Nishida
learn data science
Published in
7 min readMay 24, 2016

--

I got another data suggested by Jeremy last Wednesday morning with his Data Is Plural. This time it was about estimates of the number of people who have migrated across the countries all over the world and it was prepared and published by United Nation. As is often the case with this type of data, which has a list of the countries or years from this type of organizations like United Nations, World Bank, IMF, etc, the data was not tidy. Here’s how it looks in Excel.

I’m not talking about the commentary information occupying the first 14 rows. It is that each of the origin countries — where the migrants are coming from — is presented in each column and each of the destination countries — where the migrants are going to — is represented in each row. Hence, you can find how many people migrated from the original country to the destination country at the each intersection of the columns and rows .

We call this type of data representation ‘untidy’ or ‘wide’, as I posted before, and it makes it harder to perform data analysis and visualize the data. But we can ‘tidy’ this data quickly with ‘gather’ command from ‘tidyr’ package of ‘Hadleyverse’ to turn this around.

Let’s take a look how we can do that quickly in Exploratory and find something interesting from this data set.

Import Data

You can download this data from here on this United Nation’s web page.

The Excel file contains a bunch of worksheets to include different years and data broken down by total / male / female. But I’m going to import ‘Table 16’ which contains the total migrants data for 2015 for this post.

Select ‘Table 16’ from Sheet dropdown and click ‘Get Data’ button. You will notice that we can skip the first 15 rows to get the actual data, type 15 in ‘Skip First N Rows’ and click ‘Get Data’ button again.

Once confirmed that the data in Preview Table look ok, then click ‘Save’ button.

You would notice that there are 240 columns and 265 rows. There are so many columns, and again, this is because each ‘origin’ country is presented in its own column.

Remove Regions and keep only Countries

We can quickly turn to Table view to see the raw data.

First thing you would notice is that the first 8 rows are not countries, they are regions. This time we want to see how people are migrating from countries to countries so we can remove these rows for the regions. When you look at ‘V5’ column, it looks that those ‘region’ rows don’t have any value there.

By running the command like below to keep only the rows whose V5 column have NA,

filter(is.na(V5))

we get something like below.

These are all regions, not countries, which means that we can safely remove these rows by adding an exclamation mark ‘!’ right before ‘is.na()’ function like below.

filter(!is.na(V5))

Once we run it we get 232 rows with all countries data.

Remove unnecessary columns

When you look at the columns you would notice that there are unnecessary columns like ‘Total’, ‘Other South’, etc, because we are interested in estimates of the migrants only for countries to countries. We can remove those unnecessary columns with ‘select’ command along with other unnecessary columns like below.

select(-V1, -V3, -V5, -Total, -starts_with("Other"))

I’m using minus ‘-’ to delete columns and using ‘start_with’ function inside ‘select’ command to delete multiple columns whose names matche the text pattern of “Other”.

Rename column names

Let’s make V2 and V4 column names to be something meaningful. V4 column actually represents a country code with 3 digits. We can use ‘rename’ command like below.

rename(destination_country = V2, country_code = V4)

Gather 232 columns to make it tidy

Now, it’s ready to tidy this ‘matrix’-ish data form by using ‘gather’ command from tidyr package of ‘Hadleyverse’ like below.

gather(origin_country, migrants, Afghanistan:Zimbabwe)

The first argument is setting a name for a new column that will hold the country names, the second is setting a name for a new column that will hold the numbers, and the last ‘Afghanistan:Zimbabwe’ is selecting all the columns between ‘Afghanistan’ column and ‘Zimbabwe’ column.

This column selection is another amazing part of this ‘gather’ command because you can select columns the same way you would with ‘select’ command. That means, you can use all the column selecting functions like ‘starts_with’, ‘contains’ or even use the column positions like ‘3:252’. However, using the column names or the text pattern matching with the selecting functions is recommended over using the column positions because it is more robust especially when columns being newly added or removed, as Hadley suggested. Anyway, this is pretty awesome.

After running the command, now we can see the estimated number of people who migrated for each pair of the countries. Of course, there are many pair of the countries people didn’t migrate from / to, and there is no reason for keeping those pairs, so we can add ‘na.rm’ argument to remove them as part of the ‘gather’ command operation like below.

gather(origin_country, migrants, Afghanistan:Zimbabwe, na.rm=TRUE)

Now we have only the pairs with the values, and the data is tidy!

Visualize the tidy data

We can quickly visualize this data with Heatmap chart by assigning ‘destination_country’ to X-axis, ‘origin_country’ to Y-axis, and ‘migrants’ to Color.

Because there are too many combinations of the origin and destination countries, it’s almost impossible to see anything meaningful from this chart. Instead, we can focus on the pairs of the countries with greater than 100,000 people by using ‘filter’ command like below.

filter(migrants > 1000000)

Now it’s much easier to find the intersections of the desition and origin countires who have higher than others. For example, we can see many light blue colored cells and one red cell on United States as ‘destination counry’ column. And the red colored intersection is about 12 million people who migrated from Mexico to United States in 2015.

Who are coming to United States?

With this tidy data, we can find some other interesting insights easily. For example, filtering ‘destination_country’ to only ‘United States’ will give you a view to answer a question of ‘Where are people, who migrated to United States, coming from?’

filter(destination_country == "United States of America")

Where are Americans migrating to?

Instead of filtering ‘desitination_country’ to ‘United States’, how about filtering ‘origin_country’ to ‘United States’? This will give you a view to answer a question of ‘Where are Americans migrating to?’

filter(origin_country == "United States of America")

Tidying ‘untidy’ data will help you understand the world a lot better

The reason we could filter and visualize the data easily and quickly as above was because the data was already in ‘tidy’ format. In a nice and clean ‘relational database’ world we don’t really need to worry about this ‘untidy’ data. But today, more than 80 or 90% of the data can be found outside of ‘relational database’. Having a tool that can help you tidy untidy data opens doors for accessing and working with a lot of interesting data all over the world!

If you want to try ‘tidy-ing’ quickly with Exploratory Desktop, the beta trial is open. Please sign up from here. As always, please let me know if you have any feedback!

--

--

CEO / Founder at Exploratory(https://exploratory.io/). Having fun analyzing interesting data and learning something new everyday.