Piwowar et al. (2017a) investigated the Open Access (OA) status of scholarly literature. In this notebook, I re-examine the articles categorized as hybrid open access in the Crossref and WoS samples using the raw data (Piwowar et al., 2017b). In particular, I ask how many articles with delayed open licenses are included in the subset "hybrid (via crossref license)". To this end, I queried the Crossref API for information about licensing delays using the rcrossref package (Chamberlain et al., 2017).
library(tidyverse)
readr::read_csv("wos_100k.csv") %>%
  # only use publications from 2009 to 2015
  filter(year %in% 2009:2015) -> wos
The first dataset contains random DOIs from journal articles indexed in the Web of Science together with data about their open access availability from oaDOI (wos_100k.csv) (Piwowar et al., 2017b). The WoS sample includes 102731 unique DOIs.
Here is a breakdown by open access type:
wos %>%
group_by(oa_color_long) %>%
summarize(n = n()) %>%
mutate(perc_prop = n / sum(n) *100)
According to Piwowar et al. (2017a), articles were categorized as hybrid if they were “Free under an open license in a toll-access journal”. What licenses were used to identify hybrid open access?
wos %>%
filter(oa_color_long == "hybrid") %>%
group_by(license) %>%
summarize(n = n()) %>%
mutate(perc_prop = n / sum(n) * 100)
Not every hybrid open access article has a license recorded in the dataset; missing licenses appear as NA values.
Now, let’s obtain a breakdown of hybrid open access by publisher:
wos %>%
filter(oa_color_long == "hybrid") %>%
group_by(publisher) %>%
summarize(n = n()) %>%
mutate(perc_prop = n / sum(n) * 100) %>%
arrange(desc(n))
How did oaDOI determine the hybrid OA evidence?
wos %>%
filter(oa_color_long == "hybrid") %>%
group_by(evidence) %>%
summarize(n = n()) %>%
mutate(perc_prop = n / sum(n) * 100) %>%
arrange(desc(n))
License metadata in Crossref can also be used to identify delayed open access content. To check, let us write a helper function that a) retrieves Crossref metadata for a single DOI using the rcrossref package, and b) extracts the licensing metadata. The field delay-in-days must be zero to indicate immediate free access.
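licensing_check <- function(doi = NULL) {
  # fetch the raw JSON record for one DOI and parse it
  cr_tmp <- rcrossref::cr_works_(doi, parse = FALSE) %>%
    jsonlite::fromJSON()
  # return the license element (URL, start date, delay-in-days, content-version)
  cr_tmp$message$license
}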
# example of a delayed http://www.elsevier.com/open-access/userlicense/1.0/ license
licensing_check("10.1016/j.dam.2008.06.028")
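To make the zero-delay rule concrete, here is a minimal sketch on made-up license records. The data frame mimics the shape of the Crossref `license` element (columns `URL`, `delay-in-days`, `content-version`); the values and the name `license_md` are assumptions for illustration, not output from the API.

```r
# Hypothetical license records shaped like the `license` element of a
# Crossref work (values are invented for illustration).
license_md <- data.frame(
  URL = c("http://www.elsevier.com/tdm/userlicense/1.0/",
          "http://www.elsevier.com/open-access/userlicense/1.0/"),
  `delay-in-days` = c(0L, 1461L),
  `content-version` = c("tdm", "vor"),
  check.names = FALSE,        # keep the hyphenated column names as they are
  stringsAsFactors = FALSE
)

# A license grants immediate access only if its delay is zero.
license_md$immediate <- license_md[["delay-in-days"]] == 0

license_md[, c("URL", "delay-in-days", "immediate")]
```

In this toy record, the open access user license only starts after a delay, so it would not count as immediate hybrid open access.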
Now, let us apply the function to every hybrid open access article in the subset "hybrid (via crossref license)".
wos %>%
filter(oa_color_long == "hybrid") %>%
filter(evidence == "hybrid (via crossref license)") -> cr_dois
cr_dois
cr_df <- purrr::map(cr_dois$doi, purrr::safely(licensing_check))
tt <- purrr::map(cr_df, "result")
names(tt) <- cr_dois$doi
license_dates <- map_df(tt, `[`, c("URL", "delay-in-days"), .id = "doi")
# backup
readr::write_csv(license_dates, "wos_cr_licensing_md.csv")
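The calls above wrap the helper in `purrr::safely()`, so a failed request yields a `NULL` result instead of aborting the whole loop. The sketch below illustrates that pattern without network access; `mock_check()` and the DOIs are hypothetical stand-ins, and recording the failed DOIs is an addition that the notebook itself does not do.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for licensing_check(): returns a data frame for
# known DOIs and raises an error otherwise (no network access needed).
mock_check <- function(doi) {
  if (doi == "10.9999/missing") stop("resource not found")
  data.frame(URL = "https://creativecommons.org/licenses/by/4.0/",
             `delay-in-days` = 0L,
             check.names = FALSE, stringsAsFactors = FALSE)
}

dois <- c("10.9999/a", "10.9999/missing", "10.9999/b")

# safely() returns a list with `result` and `error` for every call
res <- map(dois, safely(mock_check))
names(res) <- dois

# successful results (NULL results are dropped by compact())
ok <- compact(map(res, "result"))
# DOIs for which the call raised an error
failed <- names(keep(res, ~ !is.null(.x$error)))

license_tbl <- bind_rows(ok, .id = "doi")
license_tbl
failed
```

Keeping track of `failed` makes it possible to retry those DOIs later, for example after a temporary API outage.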
To identify open licenses, we use the license URL patterns that oaDOI itself checks; they can be found in the oaDOI source code.
license_patterns <- tolower(c("creativecommons.org/licenses/",
"http://koreanjpathol.org/authors/access.php",
"http://olabout.wiley.com/WileyCDA/Section/id-815641.html",
"http://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html",
"http://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html",
"http://pubs.acs.org/page/policy/authorchoice_termsofuse.html",
"http://www.elsevier.com/open-access/userlicense/1.0/",
"http://www.ieee.org/publications_standards/publications/rights/oapa.pdf"))
license_dates %>%
  mutate(URL = tolower(URL)) %>%
  # flag license URLs that match one of the open license patterns
  mutate(hybrid_license = grepl(paste(license_patterns, collapse = "|"), URL)) %>%
  filter(hybrid_license) %>%
  # keep licenses that only take effect after publication
  filter(`delay-in-days` > 0) -> wos_delay
The following table shows freely available articles with delayed licenses.
wos_delay
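The filter above joins the URL patterns into one regular expression, in which characters such as "." act as wildcards. As a possible alternative, the following sketch matches each pattern as a literal substring with `fixed = TRUE`. The helper name `matches_any()` and the example URLs are assumptions for illustration; this is not how oaDOI implements the check.

```r
# Two of the patterns listed above, lower-cased as in the notebook
patterns <- tolower(c("creativecommons.org/licenses/",
                      "http://www.elsevier.com/open-access/userlicense/1.0/"))

# Hypothetical license URLs, including a missing value
urls <- tolower(c("http://creativecommons.org/licenses/by/4.0/",
                  "http://www.elsevier.com/tdm/userlicense/1.0/",
                  "http://www.elsevier.com/open-access/userlicense/1.0/",
                  NA))

# TRUE if a URL contains at least one pattern as a literal substring
matches_any <- function(url, patterns) {
  hits <- lapply(patterns, function(p) grepl(p, url, fixed = TRUE))
  Reduce(`|`, hits)
}

is_open <- matches_any(urls, patterns)
is_open
```

Missing URLs are treated as non-matching, which mirrors the behaviour of `grepl()` in the original filter.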
The same steps are now applied to the Crossref sample (crossref_100k.csv) (Piwowar et al., 2017b).
cr_sample <- readr::read_csv("crossref_100k.csv")
The Crossref sample includes 100000 unique DOIs.
Breakdown by access type:
cr_sample %>%
group_by(oa_color_long) %>%
summarize(n = n()) %>%
mutate(perc_prop = n / sum(n) *100)
cr_sample %>%
filter(oa_color_long == "hybrid") %>%
group_by(license) %>%
summarize(n = n()) %>%
mutate(perc_prop = n / sum(n) * 100)
Let’s see how hybrid OA is distributed over publishers.
cr_sample %>%
filter(oa_color_long == "hybrid") %>%
group_by(publisher) %>%
summarize(n = n()) %>%
mutate(perc_prop = n / sum(n) * 100) %>%
arrange(desc(n))
How did oaDOI determine the hybrid open access evidence?
cr_sample %>%
filter(oa_color_long == "hybrid") %>%
group_by(evidence) %>%
summarize(n = n()) %>%
mutate(perc_prop = n / sum(n) * 100) %>%
arrange(desc(n))
Now obtain Crossref licensing metadata for the hybrid open access articles identified by oaDOI:
cr_sample %>%
filter(oa_color_long == "hybrid") %>%
filter(evidence == "hybrid (via crossref license)") -> cr_sample_dois
cr_df <- purrr::map(cr_sample_dois$doi, purrr::safely(licensing_check))
tt <- purrr::map(cr_df, "result")
names(tt) <- cr_sample_dois$doi
license_dates <- map_df(tt, `[`, c("URL", "delay-in-days"), .id = "doi")
# backup
readr::write_csv(license_dates, "cr_sample_cr_licensing_md.csv")
license_dates %>%
  mutate(URL = tolower(URL)) %>%
  # flag license URLs that match one of the open license patterns
  mutate(hybrid_license = grepl(paste(license_patterns, collapse = "|"), URL)) %>%
  filter(hybrid_license) %>%
  # keep licenses that only take effect after publication
  filter(`delay-in-days` > 0) -> cr_delay
cr_delay
Finally, some validation: can an article with more than one license have both a delayed and an immediate open license? If not, the check below returns only FALSE.
# some validation
cr_not_delayed <- license_dates %>%
  mutate(URL = tolower(URL)) %>%
  mutate(hybrid_license = grepl(paste(license_patterns, collapse = "|"), URL)) %>%
  filter(hybrid_license) %>%
  # open licenses that apply immediately (delay of zero days)
  filter(`delay-in-days` < 1)
table(cr_delay$doi %in% cr_not_delayed$doi)
##
## FALSE
## 1205
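The check above relies on DOI set membership. A minimal sketch on invented records shows what a positive case would look like; the data frame `toy` and its DOIs are hypothetical.

```r
# Hypothetical open license records: one row per license, several rows
# per DOI are possible.
toy <- data.frame(
  doi   = c("10.9999/a", "10.9999/a", "10.9999/b", "10.9999/c"),
  delay = c(0, 365, 730, 0),
  stringsAsFactors = FALSE
)

delayed   <- unique(toy$doi[toy$delay > 0])   # at least one delayed license
immediate <- unique(toy$doi[toy$delay < 1])   # at least one immediate license

# articles that carry both kinds of open license
both <- intersect(delayed, immediate)
# articles whose open licenses are all delayed
delayed_only <- setdiff(delayed, immediate)

both
delayed_only
```

An article like the first one in this toy data would appear as TRUE in the membership table; in the Crossref sample no such case was found.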
The following table shows that, across both DOI samples, around 70% of the articles identified as hybrid open access via Crossref licensing information were tagged with open licenses that only came into effect some time after publication.
tibble(Sample = c("Crossref", "WoS"),
  `Articles in hybrid (via crossref license) subset` = c(n_distinct(cr_sample_dois$doi), n_distinct(cr_dois$doi)),
  `Delayed licenses found` = c(n_distinct(cr_delay$doi), n_distinct(wos_delay$doi))) %>%
  # express the share of delayed licenses as a percentage
  mutate(`Proportion (in %)` = `Delayed licenses found` / `Articles in hybrid (via crossref license) subset` * 100)
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2 rcrossref_0.8.0 dplyr_0.7.4 purrr_0.2.4
## [5] readr_1.1.1 tidyr_0.7.2 tibble_1.3.4 ggplot2_2.2.1
## [9] tidyverse_1.1.1
##
## loaded via a namespace (and not attached):
## [1] reshape2_1.4.2 haven_1.1.0 lattice_0.20-35 colorspace_1.3-2
## [5] miniUI_0.1.1 htmltools_0.3.6 yaml_2.1.15 rlang_0.1.4.9000
## [9] foreign_0.8-69 glue_1.1.1 modelr_0.1.0 readxl_1.0.0
## [13] bindr_0.1 plyr_1.8.4 stringr_1.2.0 munsell_0.4.3
## [17] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2 codetools_0.2-15
## [21] psych_1.7.5 evaluate_0.10.1 knitr_1.17.20 forcats_0.2.0
## [25] httpuv_1.3.5 curl_3.0 parallel_3.4.3 triebeard_0.3.0
## [29] urltools_1.6.0 broom_0.4.2 Rcpp_0.12.14 xtable_1.8-2
## [33] scales_0.5.0 backports_1.1.0 jsonlite_1.5 mime_0.5
## [37] mnormt_1.5-5 hms_0.3 digest_0.6.12 stringi_1.1.5
## [41] shiny_1.0.5 grid_3.4.3 rprojroot_1.2 bibtex_0.4.2
## [45] tools_3.4.3 magrittr_1.5 lazyeval_0.2.1 crul_0.4.0
## [49] pkgconfig_2.0.1 xml2_1.1.1 lubridate_1.6.0 assertthat_0.2.0
## [53] rmarkdown_1.8.3 httr_1.3.1 R6_2.2.2 nlme_3.1-131
## [57] compiler_3.4.3
To the extent possible under law, Najko Jahn has waived all copyright and related or neighboring rights to this work. This work is published from: Germany.
Chamberlain, S., Boettiger, C., Hart, T., and Ram, K. (2017). rcrossref: Client for various 'Crossref' 'APIs'. Available at: https://github.com/ropensci/rcrossref.
Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., Norlander, B., et al. (2017a). The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles. PeerJ Preprints. doi:10.7287/peerj.preprints.3119v1.
Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., Norlander, B., et al. (2017b). Data from: The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles. doi:10.5281/zenodo.837902.