Scraping US Presidents List from Web and Transforming it to be Useful

Kan Nishida
learn data science
Published in
7 min readJul 25, 2016

--

I was working on United Nations voting records data just the other day and wanted to have a mapping table that maps all the years starting from 1945 to 2016 to corresponding names of the US Presidents, something like this.

With data like this I can map each of the resolutions from the United Nations voting recode data to US Presidents based on the year the resolutions were voted in.

Well, long story short, I have managed to get this mapping table, but it was not so straightforward so I wanted to share how I did. I’m going to walk you through using Exploratory Desktop, but you can also find an R script at the end to run it in standalone R environments. Also, for those who just want the data set you can use in Exploratory Desktop right away you can find a link to the shared data at the end as well.

Scrape List of US Presidents Data

Typing “List of US Presidents’ get you a bunch of web pages that have similar information. But what I wanted to get is the one that has the name of the Presidents and the years during which they served in a table format, and I found this page from ‘The Libary of Congress’.

So I quickly tried this page in Scrape Web Page dialog by typing the URL and clicking ‘Get Data’ button.

There are nine html tables in this web page, but the fourth one is the one we are looking for.

Once we click ‘Save’ button, then the data gets imported and shows up in Summary view quickly.

Filter data to keep only necessary ones

First, let’s get rid of unnecessary information from this data. Since we are looking for only from 1945 to the current day we can keep only after the 50th row of the data.

We can do this quickly with ‘slice’ command from ‘dplyr’ package like below.

slice(50:n())

n() would return the total number of the rows, which is 64 with this data, so ‘slice’ command essentially will keep the rows only from row 50 to row 64.

Also, we can use ‘select’ command to keep only YEAR and PRESIDENT columns.

select(YEAR, PRESIDENT)

Separate Year column values into multiple rows

YEAR column has values like ‘2001–2009’, which means the period started at 2001 and ended at 2009. Now, what we want to get at end of the day is to repeat the president names as many as the corresponding years like below.

In order to get to this, first we want to separate these YEAR values into separate rows as Start Year and End Year so that we can expand and fill the years between the Start and the End later.

It’s actually very simple to do this by using ‘separate_rows’ command from ‘tidyr’ package like below.

separate_rows(YEAR, sep="-")

I have specified ‘-’ as a separating character, now we have two rows for each of the Presidents term.

Clean up Year column values

From here we want to expand the YEAR values to fill the years between the start and the end for each of the Presidency. To do that, first we need to convert YEAR column from ‘character’ type to ‘numeric’ type like below.

mutate(YEAR = extract_numeric(YEAR))

I’m using ‘extract_numeric’ command from ‘tidyr’ package because it is just so convenient and powerful, but you can do this with other functions like ‘as.numeric’ as well.

Now you would notice that there is NA for the President Obama.

This is because he is still serving as the President as of now so the original data did not have the end year. We can assume that he is going to finish his term as expected on January 20th, 2017, so we can fill it with the command like below.

mutate(YEAR = coalesce(YEAR, 2017))

‘coalesce’ function from ‘dplyr’ package works as similar to the function with the same name in SQL or ‘nvl’ function of Oracle SQL. It simply replaces NA with a specified value, in this case that is ‘2017’

Expand Years for Each of the Presidents

From here, we want to expand YEAR column values so that each president will be repeated as many times as number of years for which they served as the President.

We can use ‘expand’ command to do this, but we want to expand only within each President era, not across the Presidents. So first, we want to group the data by PRESIDENT column so that ‘expand’ command can work within each of the President.

group_by(PRESIDENT)

Next, we can use ‘expand’ command like below.

expand(YEAR = min(YEAR, na.rm=TRUE):max(YEAR, na.rm=TRUE))

Basically what I’m doing here is to set the min and max values for YEAR within each of the President names and asking the command to expand the data so that YEAR column will have continuous numeric values between the min and max values as the result like below.

Filter out the duplicated transitional years

Now, US Presidency period starts on January 20th of the beginning of the term year and ends on January 20th of the end year. So, obviously there will be duplicated years for the transitional years when the old Presidents pass the office to the new Presidents, as you can see below.

Since there should not be much of the actions during those 20 days at United Nations and these 20 days should not impact on the analysis I was doing, I decided to make the transitional year belong to only the new Presidents.

Since the grouping is still set at PRESIDENT as you can see below,

we can use ‘last’ function inside ‘filter’ command like below to get rid of the last year row for each of the Presidents.

filter(YEAR != last(YEAR))

After running ‘arrange’ command to sort the data on YEAR column, it would look something like below.

As you can see we have all the years from 1945 to 2016 mapped to corresponding US Presidents names.

I have shared this data at exploratory.io so that you can simply download and import it into your Exploratory Desktop, if you already have, to see how it is done exactly or start using it by joining to your data with one of the ‘join’ commands right away.

Reproduce it in R

Here is an R script you can use to reproduce what I have demonstrated above in R console, RStudio, etc.

# Set libPaths.
.libPaths("/Users/kannishida/.exploratory/R/3.3")

# Load required packages.
library(tidyr)
library(dplyr)
library(exploratory)

# Data Analysis Steps
exploratory::scrape_html_table("https://www.loc.gov/rr/print/list/057_chron.html", 4, "TRUE") %>%
exploratory::clean_data_frame() %>%
slice(50:n()) %>%
select(YEAR, PRESIDENT) %>%
separate_rows(YEAR, sep="-") %>%
mutate(YEAR = extract_numeric(YEAR)) %>%
mutate(YEAR = coalesce(YEAR, 2017)) %>%
group_by(PRESIDENT) %>%
expand(YEAR = min(YEAR, na.rm=TRUE):max(YEAR, na.rm=TRUE)) %>%
filter(YEAR != last(YEAR)) %>%
arrange(YEAR)

The exploratory package is not on CRAN, but on Github, so you want to install it with the devtools package from Github like below.

install.packages("devtools")
devtools::install_github("exploratory-io/exploratory_func")
library(exploratory)

If you have installed Exploratory Desktop already, then the above script is setting the library path to the Exploratory’s package installation location so that you don’t need to install the dependent packages including ‘exploratory’ separately. It will just work.

If you are interested in trying this out with Exploratory Desktop, download the latest on our download page.

--

--

CEO / Founder at Exploratory(https://exploratory.io/). Having fun analyzing interesting data and learning something new everyday.