Managing OAuth Tokens

Jenny Bryan

2018-06-28

Who should read this

This vignette explains Google auth token management for anyone who wants to use googlesheets in code that runs non-interactively. Examples:

Since googlesheets gets its authorization functionality from httr, some of the content here may be relevant to other API-wrapping R packages that use httr.

Bonus content: The Making of googlesheets for CRAN. At the end are my notes on CRAN submission when a package makes extensive use of OAuth.

How to completely avoid reading this document

Which Google Sheets activities require authorization? And which do not?

Reading from a Sheet that is “published to the web” does not require authorization, if and only if you identify the Sheet via key or URL:

library(googlesheets)
suppressPackageStartupMessages(library(dplyr))
gs_gap_key() %>%
  gs_key(lookup = FALSE) %>% 
  gs_read() %>% 
  head(3)
#> Worksheets feed constructed with public visibility
#> Accessing worksheet titled 'Africa'.
#> Parsed with column specification:
#> cols(
#>   country = col_character(),
#>   continent = col_character(),
#>   year = col_double(),
#>   lifeExp = col_double(),
#>   pop = col_double(),
#>   gdpPercap = col_double()
#> )
#> # A tibble: 3 x 6
#>   country continent  year lifeExp      pop gdpPercap
#>   <chr>   <chr>     <dbl>   <dbl>    <dbl>     <dbl>
#> 1 Algeria Africa     1952    43.1  9279525     2449.
#> 2 Algeria Africa     1957    45.7 10270856     3014.
#> 3 Algeria Africa     1962    48.3 11000948     2551.

On the other hand, if you identify a Sheet by its name, googlesheets will require authorization, because we must list of all your Sheets on Google Drive in order to look up the Sheet’s key. This will be true even if the Sheet you seek is “published to the web”. It’s the key look up that requires auth, not reading the Sheet.

Implication: if your non-interactive googlesheets code only needs to read a published Sheet, you can eliminate the need for authorization by using Sheet key for access, as in the above example. And you can stop reading this now!

Of course, many other activities do require authorization. For example, creating a new Sheet:

iris_ss <- gs_new("iris_bit", input = head(iris, 3), trim = TRUE, verbose = FALSE)
iris_ss %>% 
  gs_read()
#> Accessing worksheet titled 'Sheet1'.
#> Parsed with column specification:
#> cols(
#>   Sepal.Length = col_double(),
#>   Sepal.Width = col_double(),
#>   Petal.Length = col_double(),
#>   Petal.Width = col_double(),
#>   Species = col_character()
#> )
#> # A tibble: 3 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa

Where do tokens come from? The OAuth 2.0 flow

googlesheets uses Google’s OAuth 2.0 flow for Installed Applications to work with the Drive and Sheets APIs.

The googlesheets package plays the role of “Your App” in this figure and you are the User.

The first time you do something that requires authorization, googlesheets must request a token on your behalf. You can also trigger this manually with gs_auth(). You, the user, will be taken to the browser for “User login & consent”:

This is where you authenticate yourself, so that googlesheets can subsequently place authorized requests on your behalf.

Behind the scenes, googlesheets uses httr::oauth2.0_token() (and ultimately httr::init_oauth2.0()) to complete the “authorization code, exchange code for token, token response” ping pong and store a token. This token is stored in an environment within googlesheets and is attached to subsequent API requests as necessary.

You can use gs_user() to see if there is currently a valid token in force, who the associated Google user is, etc.:

gs_user()
#>           displayName: google sheets
#>          emailAddress: gspreadr@gmail.com
#>                  date: 2018-06-28 23:20:35 GMT
#>          permissionId: 14497944239034869033
#>          rootFolderId: 0AOdw-qi1jh3fUk9PVA

Where do tokens live in between R sessions?

By default, when googlesheets gets a token for you, it’s stored in memory for use in the current R session AND it’s cached to a file named .httr-oauth in current working directory. This caching behavior comes from httr.

It’s a good idea to inform yourself about the presence/absence/location of .httr-oauth, especially if you’re having trouble with non-interactive authorization. Not all methods of file browsing will reveal dotfiles, so be aware of that. Recent versions of RStudio will show .httr-oauth, but older versions will not. From R itself, you can use list.files(all.files = TRUE) to get a list of files in current working directory, including dotfiles.

In subsequent R sessions, at the first need for authorization, googlesheets looks for a cached token in .httr-oauth before initiating the entire OAuth 2.0 flow. Many APIs limit the number of active tokens per account, so it’s better to refresh existing tokens than to request completely new ones. More on refreshing later.

Another chance to stop reading this document

If your usage is pretty simple, you may only need to make sure that the token cached in .httr-oauth is the one you want (e.g., associated with the correct Google user) and make sure this file lives alongside your R script or R Markdown file. If you are relying on automatic loading from cache in .httr-oauth, this error message is highly suggestive that .httr-oauth cannot be found at runtime: “oauth_listener() needs an interactive environment”.

Here is a workflow that is doomed to fail:

You must make sure that .httr-oauth will be found in working directory when your script runs or your .Rmd is rendered.

Don’t publish your tokens

Tokens, stored in .httr-oauth or elsewhere, grant whoever’s got them the power to deal on your behalf with an API, in our case Sheets and Drive. So protect them as you would your username and password. In particular, if you’re using a version control system, you should exclude files that contain tokens. For example, you want to list .httr-oauth in your .gitignore file.

How do I store and retrieve a token?

In googlesheets, we’ve built some functionality into gs_auth() so the user can retrieve the current token for explicit storage to file and can load such a stored token from file. To be clear, most users should just enjoy the automagic token management offered by httr and the .httr-oauth cache file. But for non-interactive work and testing/developing googlesheets itself, we found it helpful to take more control.

In an interactive session, create and store a token. Caching properties are baked into a token, so if you never want this token to be cached to .httr-oauth, such as when it gets refreshed, specify that at creation time via cache = FALSE. Use gd_token() at any time to see some info on the current token.

library(googlesheets)
token <- gs_auth(cache = FALSE)
gd_token()
saveRDS(token, file = "googlesheets_token.rds")

Things to think about:

Let’s focus on the R script or Rmd file you are preparing for non-interactive execution. Put these lines in it:

library(googlesheets)
gs_auth(token = "googlesheets_token.rds")
## and you're back in business, using the same old token
## if you want silence re: token loading, use this instead
suppressMessages(gs_auth(token = "googlesheets_token.rds", verbose = FALSE))

Things to think about:

What’s the difference between token storage in .httr-oauth and what we do above? They are both .rds files. But the .httr-oauth file is conceived to hold multiple credentials. Therefore tokens are stored in a list, where each is identified by an MD5 hash created from the associated endpoint + app + scope. In contrast, the token stored in the example above is a single object, which is simpler. The explicit process of writing the token to file also makes it more likely that your token gets created with the intended combination of key, secret, and Google account.

Token expiration and refresh

There are actually two different kinds of tokens in the OAuth 2.0 flow used by googlesheets: a refresh token and an access token. Refresh tokens are quite durable, whereas access tokens are highly perishable. Access tokens are what’s actually attached to requests. Part of the beauty of httr is that it automatically uses a valid refresh token to obtain a new access token. That’s what’s happening whenever you see this message: “Auto-refreshing stale OAuth token.”

If your access token has expired (or doesn’t exist) and your refresh token is invalid (or no where to be found), then any token-requiring request will trigger the entire OAuth 2.0 flow. In particular, you’ll need to redo “User login & Consent” in the browser. If this happens in a non-interactive setting, this will therefore lead to some sort of failure.

You should design your workflow to reuse existing refresh tokens whenever possible. Don’t just take my word for it, here’s the official Google advice:

Save refresh tokens in secure long-term storage and continue to use them as long as they remain valid. Limits apply to the number of refresh tokens that are issued per client-user combination, and per user across all clients, and these limits are different. If your application requests enough refresh tokens to go over one of the limits, older refresh tokens stop working.

Specific facts about Google tokens:

The latter point is the most relevant to an active project. If you’re developing around a Google API, it is very easy to burn through 25 refresh tokens if you aren’t careful, which causes earlier ones to silently fall off the end and become invalid. If those are the tokens you have placed on a server or on Travis CI, then you will start to get failures there.

Tokens for testing

We use testthat to run automated unit tests on the googlesheets package itself. Since most of the interesting functionality requires authorization, we have to make authorized API requests, if we want to have acceptable test coverage. Therefore we use the code given earlier to create and store a refresh token:

library(googlesheets)
token <- gs_auth()
saveRDS(token, file = "tests/testthat/googlesheets_token.rds")

Pro tip: start with a fresh token or one near the beginning of the current 25-token sequence.

In affected testing files, we explicitly put the token into force:

suppressMessages(gs_auth(token = "googlesheets_token.rds", verbose = FALSE))

run the tests that require authorization and then suspend token usage (but do NOT revoke the refresh token):

gs_deauth(verbose = FALSE)

Running the googlesheets tests yourself

If you want to check the googlesheets package, you will need to store a valid token in tests/testthat/googlesheets_token.rds.

Note to self: things I still need to do to make testing by others possible:

Encrypting tokens for hosted continuous integration

If you want to use googlesheets with hosted continuous integration, such as Travis CI, you need to secure your token on the host, e.g., the googlesheets_token.rds file described above. I have only done this in the context of testthat and Travis CI, but I imagine something very similar would apply to other testing approaches and CI services. I describe this here as a template for testing other packages that wrap an API and that make authorized requests in the unit tests. This has evolved from instructions originally worked out by Noam Ross. These instructions assume you’ve set up continuous integration and, in particular, that you have an existing .travis.yml file.

OAuth 2.0 tokens are, sadly, too large to be stored as environment variables, so we must instead follow the instructions for encrypting files. This requires the Travis command line client which, in turn, requires a Ruby installation.

Install the Travis command line client (will probably require sudo):

gem install travis

Log into your Travis account using your GitHub username and password.

travis login

Encrypt the token and send to Travis:

travis encrypt-file tests/testthat/googlesheets_token.rds --add

The --add option should add a decrypt command to your pre-existing .travis.yml file, along these lines:

before_install:
- openssl aes-256-cbc -K $encrypted_xyz_key -iv $encrypted_xyz_iv -in
tests/testthat/googlesheets_token.rds.enc -out tests/testthat/googlesheets_token.rds -d

Double check that the token and encrypted token live in tests/testthat/ and that .travis.yml reflects the correct path. You will probably need to move the encrypted token into the correct directory and edit the path(s) in .travis.yml.

Carefully ignore, commit, and push:

Do not get mixed up re: what gets ignored where.

At this point, if you blindly bundle the package and send it to win-builder or CRAN, the unencrypted token will be included. See the next section for how to handle that.

The Making of googlesheets.tar.gz for CRAN

Both CRAN and win-builder take a package bundle as their input. And then they run R CMD check on it, which calls R CMD build among other things.

Here’s my problem: to compile googlesheets’s vignettes and run tests, you need an OAuth2 token. But I can’t send that.

Observation: Getting something onto CRAN requires me to short-circuit machinery that is meant to insure quality (tests) and usability/documentation (vignettes). The problem is that you can secure an encrypted token file on Travis but cannot do so on CRAN or win-builder. This is also why all of the googlesheets examples are inside \dontrun{}. Sometimes you hear unqualified claims that “CRAN >> GitHub”. But if your package wraps an API, reality is more complicated.

Two main principles:

Prepare the source package

Embrace the NOT_CRAN environment variable. It’s not just for testthat::skip_on_cran() anymore! Commit fully and with your whole heart. I have this line in ~/.Renviron:

where true is the default, development state. You’ll temporarily toggle it to false when making your final .tar.gz for CRAN.

Testing. Determine which tests can run on CRAN and which can not.

This way all tests run all the time locally and on Travis, but never on CRAN.

Vignettes. Recall that knitr chunk options can take values from R an expression. In an early setup chunk, consult the NOT_CRAN environment variable to create an eponymous R object:

You will use the NOT_CRAN object to set chunk options that control code execution.

Vignette double jeopardy. There are two vignette checks that hit our OAuth pain point. They correspond to these two lines from a log file:

    * checking running R code from vignettes ... OK
    * checking re-building of vignette outputs ... OK

Why does the vignette code get executed twice? Even though the vignettes used are the ones you build? Good question! See this thread on R-devel. Bottom line: you have to deal with both of these and they require separate solutions.

Recap:

Create the .tar.gz

I use Git and create a branch to document the process of CRAN submission.

Start in the default development state:

Delete build/ and inst/doc if they happen to exist in your source package. They probably don’t. But delete and make a commit if they do.

Build the package, pass 1 of 2. In the parent directory of your package, confirm that NOT_CRAN is “true” and build the package:

Unpack the bundle. Move googlesheets_VERSION.tar.gz somewhere safe and unpack it.

What are the main differences between your original source package and this unpacked bundle?

Preserve everything about the vignettes for inclusion in final build. If you’re tracking the CRAN submission process in a git branch, copy these new/modified directories over to your source package:

Verify that you’re NOT gitignoring all html – you want the html for your vignettes to be in this commit! If you use devtools, you may also need to remove inst/doc from .gitignore. devtools::use_vignette() puts it there. Commit.

Leave your development state and enter the CRAN state. Toggle the value of NOT_CRAN in ~/.Renviron:

Rbuildignore any sensitive files that were necessary for vignette building but that shouldn’t go out on the internet. In my case, I add this line to .Rbuildignore to ignore my unencrypted OAuth2 token (note it is always gitignored):

Build the package, pass 2 of 2. If you copied the post-build state of build/, inst/doc/, and vignettes back into your source package, build from there. Otherwise, do this on your unpacked package bundle from the first R CMD build pass. Make sure NOT_CRAN is false now!

For me this is extremely fast compared to the first build, because so much testing and vignette code is suppressed when NOT_CRAN is false.

This will produce googlesheets_VERSION.tar.gz. Again. This is what’s going to CRAN.

Check your work. Unpack this new .tar.gz and have a look around.

The only difference I see from the previously unpacked bundle is that my token file is gone. Which is perfect. This is the .tar.gz I want to send:

You could make a final commit in your CRAN submission branch. The only change it will show is the change to .Rbuildignore. Presumably you are already gitignoring any sensitive files.

You have ended in the CRAN state:

Get back to normal development state:

If you’ve been working in a git branch, just reset the NOT_CRAN env var and checkout master!

Recap:

Observation: It sure would be nice if there was a concept of .tar.gz-ignore. The token file could be used during R CMD build, but would be excluded from the tarball.