Extract data from Private Github Repository with REST API

Kan Nishida
Published in learn data science · 6 min read · Apr 12, 2016

After publishing this post, Jenny Bryan kindly pointed out that there is another way to get authenticated: using a ‘Personal Access Token’. And this actually works like a charm! Given that this is a bit more secure than Basic Authentication with a username and password, I’d recommend the ‘Personal Access Token’ option over the ‘Basic Authentication’ option if the ‘OAuth’ option doesn’t work for you. Also, Hadley has written an amazingly informative note, “Best practices for writing an API package”, which includes authentication-related information. I highly recommend you read it.

I have updated this post with this new information. Again, thanks Jenny for the feedback!

Updated on 4/12/2016

This is a follow-up to the previous post about how to analyze Github Issue data with the Github REST API.

If your repository at Github is private, you need to authenticate before accessing your project data. The good news is that the ‘httr’ package provides a few ways to authenticate, including OAuth 2.0, Personal Access Token, and Basic Authentication. Authenticating with OAuth 2.0 is the most secure option, and I would recommend it if possible. But if you don’t own the repository you’re participating in, this might not work for you; at least, it didn’t for me for some reason. There may be some configuration that makes the OAuth option work. So I ended up trying the Personal Access Token and Basic Authentication options, and both worked great.

Anyway, I’ll talk about these three options in this post so that you can pick whichever one works best for you.

Authenticate with OAuth 2.0

Hadley has written up a sample script for using Github OAuth 2.0, so I’d suggest you take a look at the original. Here, I’m going to walk you through, step by step, how to set up your app credentials at Github and how to call the Github API with those credentials through OAuth.

Get Client ID and Client Secret

First, in order to use the OAuth option, you need a Client ID and a Client Secret. If you don’t have them yet, go to Github and open the Settings page.

Then go to the ‘OAuth applications’ section.

Select ‘Developer applications’ and create a developer application if you don’t have one yet.

Once you have filled out the form, you will get the Client ID and Client Secret. You will need these to authenticate with OAuth.

Use httr functions to set up OAuth and authenticate with the Github API

The ‘httr’ package has the Github endpoint pre-configured, so all we need to do is use the ‘oauth_endpoints()’ function to get the Github endpoint, use ‘oauth_app()’ to set the Client ID and Client Secret, and then get a token with the ‘oauth2.0_token()’ function.

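library(httr)

# Look up the pre-configured Github OAuth endpoints
oauth_endpoints("github")

# Register your app; replace the placeholders with your own values
myapp <- oauth_app("github",
                   key = "<your_key>",
                   secret = "<your_secret>")
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)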

Once you get the token, you can wrap it with the ‘config()’ function and pass the result to the ‘GET()’ function call like below.

library(httr)
library(jsonlite)

oauth_endpoints("github")
myapp <- oauth_app("github",
                   key = "<your_key>",
                   secret = "<your_secret>")
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)

# Wrap the token in a config object and pass it to GET()
gtoken <- config(token = github_token)
req <- GET("https://api.github.com/repos/hadley/dplyr/issues",
           query = list(state = "all", per_page = 100, page = 1), gtoken)

# Read the response body as text and parse the JSON into a data frame
github_df <- fromJSON(content(req, as = "text"))

And once your authentication with Github OAuth succeeds, you will get the data back. The rest is the same: you can quickly get the data into a data frame like we did in the previous post.
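The query used in this post asks only for page = 1 with per_page = 100, so a repository with more issues than that needs more than one request. The following is a minimal sketch of one way to loop over pages; it is not part of the original script. The names `fetch_all_pages`, `fetch_page`, `fetch_issue_page`, and `max_pages` are hypothetical, and stopping at the first empty page is an assumption about how this paged listing behaves.

```r
# Hypothetical helper: call fetch_page(1), fetch_page(2), ... and collect
# the results until a page comes back empty (or max_pages is reached).
fetch_all_pages <- function(fetch_page, max_pages = 50) {
  pages <- list()
  for (page in seq_len(max_pages)) {
    df <- fetch_page(page)
    # An empty page means there is nothing left to read
    if (is.null(df) || NROW(df) == 0) break
    pages[[length(pages) + 1]] <- df
  }
  pages
}

# Assumed wiring to the request used in this post; 'gtoken' is the config
# object created earlier. Defined here but not called.
fetch_issue_page <- function(page, gtoken) {
  req <- httr::GET("https://api.github.com/repos/hadley/dplyr/issues",
                   query = list(state = "all", per_page = 100, page = page),
                   gtoken)
  jsonlite::fromJSON(httr::content(req, as = "text"))
}

# Offline demonstration with a fake fetcher that serves two pages
fake_pages <- list(data.frame(number = 1:3), data.frame(number = 4:5))
fake_fetch <- function(page) {
  if (page <= length(fake_pages)) fake_pages[[page]] else data.frame()
}
pages <- fetch_all_pages(fake_fetch)
all_issues <- do.call(rbind, pages)
```

With real data, `fetch_all_pages(function(p) fetch_issue_page(p, gtoken))` would return one data frame per page. Because Github issue records contain nested fields, `jsonlite::rbind_pages()` is probably a better fit than `rbind()` for combining them.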


Now, you might get an error in the returned data saying ‘404 Not Found’ for some reason. That’s when you might want to go with a Personal Access Token or Basic Authentication instead.
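Before switching methods, it can help to look at the HTTP status code of the response. As far as I know, Github answers with 404 rather than 403 for a private repository when the request is unauthenticated or lacks access, so a 404 here often points at authentication rather than a wrong URL. The helper below is a minimal sketch with a hypothetical name (`explain_github_status`); the hint messages are my own wording, not Github's.

```r
# Hypothetical helper: turn an HTTP status code into a hint about what to
# check next. The wording of the hints is illustrative only.
explain_github_status <- function(code) {
  code <- as.integer(code)
  if (code >= 200 && code < 300) {
    "OK: the request succeeded"
  } else if (code == 401) {
    "Unauthorized: the credentials were rejected"
  } else if (code == 403) {
    "Forbidden: access denied, or a rate limit was hit"
  } else if (code == 404) {
    "Not Found: wrong URL, or a private repository you cannot see"
  } else {
    paste("Unexpected status:", code)
  }
}

hint <- explain_github_status(404)
```

With a real response, the code would come from `httr::status_code(req)`, and `httr::stop_for_status(req)` turns any error status into an R error.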

Authenticate with Personal Access Token

A Personal Access Token is a key that Github generates for your personal use. You can generate multiple tokens and configure the scope of access for each one. You can always delete a token at Github to revoke its access. Because of this, it’s a bit more secure and more convenient than Basic Authentication, for which you need to type your Github username and password.

Still, a Personal Access Token is basically a passcode, so you want to treat it securely. I would recommend storing it in an environment variable defined in a file on your file system and using the ‘Sys.getenv()’ function to read the value in your R code. I’m going to explain how to do this, but let’s get a Personal Access Token from Github first.

Obtain Personal Access Token from Github

Go to the Github website and select ‘Settings’.

Click on ‘Personal access tokens’ menu.

Click ‘Generate new token’ button to generate the token.

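Enter a description under ‘Token description’ and check at least ‘repo:status’. If requests against a private repository still come back as ‘404 Not Found’, the token may need the broader ‘repo’ scope.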

Once you save it, you will get the token text. Make sure you copy this text right away because this is the only time you will see it.

Once you have copied this text, you can register it in an environment variable.

Create an environment variable to store ‘Personal access token’

You can register and store the token in the ‘.Renviron’ file under your home directory. If you are on a Mac, it should be something like:

/Users/<your_userid>/.Renviron

If you don’t have ‘.Renviron’ file then you can create a new one. Make sure you have ‘.’ (dot) at the beginning of the file name.

Now, add an entry for ‘GITHUB_PAT’ like below in the ‘.Renviron’ file.

GITHUB_PAT=<your_personal_access_token>

Make sure there is one empty line at the end by hitting the return key at the end of the variable line.

Once you save this file then you need to restart R (or RStudio). This file gets loaded automatically when R starts, and it sets all the environment variables registered inside this file. Once it’s registered by R, then you can access the variable by using ‘Sys.getenv()’ function like below.

Sys.getenv("GITHUB_PAT")
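Two small additions can make this step less error-prone; both are sketches rather than part of the original workflow. Base R's `readRenviron()` re-reads an environment file in the current session, which may save you a restart, and a small wrapper around `Sys.getenv()` can fail early when the variable is missing. The helper name `require_env` and the demo variable `DEMO_GITHUB_PAT` are hypothetical, and the temporary file only stands in for your real ‘.Renviron’.

```r
# Hypothetical helper: read an environment variable and stop with a clear
# message if it is unset or empty.
require_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", name, "' is not set. ",
         "Add it to .Renviron and restart R.", call. = FALSE)
  }
  value
}

# Demo with a temporary file standing in for ~/.Renviron
env_file <- tempfile(fileext = ".Renviron")
writeLines("DEMO_GITHUB_PAT=not-a-real-token", env_file)

# Load the file into the current session without restarting R
readRenviron(env_file)

pat <- require_env("DEMO_GITHUB_PAT")
```

In your own session, the equivalent calls would be `readRenviron("~/.Renviron")` followed by `require_env("GITHUB_PAT")`.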

And you can call this inside the ‘authenticate()’ function and pass the result as one of the arguments to the ‘GET()’ function like below.

library(httr)
req <- GET("https://api.github.com/repos/exploratory-io/tam/issues",
           query = list(state = "all", per_page = 100, page = 1),
           authenticate(Sys.getenv("GITHUB_PAT"), ""))
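As an alternative sketch, the token can be sent in an `Authorization` header instead of going through `authenticate()`. Github accepts a header of the form `Authorization: token <your_personal_access_token>`. The function names below are hypothetical, and `fetch_issues()` is only defined, not called, because it needs a real token and network access.

```r
# Hypothetical helper: build the Authorization header for a token
github_auth_header <- function(pat) {
  if (!nzchar(pat)) stop("The token is empty", call. = FALSE)
  c(Authorization = paste("token", pat))
}

# Assumed usage with the request from this post (defined, not called)
fetch_issues <- function(pat = Sys.getenv("GITHUB_PAT")) {
  httr::GET("https://api.github.com/repos/exploratory-io/tam/issues",
            query = list(state = "all", per_page = 100, page = 1),
            httr::add_headers(.headers = github_auth_header(pat)))
}

# Build the header with a placeholder value to see its shape
hdr <- github_auth_header("not-a-real-token")
```

Checking for an empty token up front avoids sending a request that can only fail, which is easy to do by accident when the environment variable has not been loaded yet.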

Authenticate with Basic Authentication

For Basic Authentication, you can simply type your Github username and password inside the ‘authenticate()’ function.

library(httr)
req <- GET("https://api.github.com/repos/exploratory-io/tam/issues",
           query = list(state = "all", per_page = 100, page = 1),
           authenticate("kanaugust", "<your_password>"))

It’s super simple and it works. But you need to type your password in clear text, which might make you uncomfortable, especially when you work somewhere people are constantly walking behind you. ;)

So I strongly recommend you store your Github password in the ‘.Renviron’ file and use the ‘Sys.getenv()’ function to retrieve it from your R code, as I have shown in the Personal Access Token section above.

Then your code would look something like below.

library(httr)
req <- GET("https://api.github.com/repos/exploratory-io/tam/issues",
           query = list(state = "all", per_page = 100, page = 1),
           authenticate("kanaugust", Sys.getenv("GITHUB_PASS")))

Once the authentication is successful, you will get the data back from Github, and the rest is the same. Now you can start wrangling and analyzing Github Issue data!


CEO / Founder at Exploratory (https://exploratory.io/). Having fun analyzing interesting data and learning something new every day.