citecorp: working with open citations
citecorp is a new (hit CRAN in late August) R package for working with data from the OpenCitations Corpus (OCC). OpenCitations, run by David Shotton and Silvio Peroni, houses the OCC, an open repository of scholarly citation data under the very open CC0 license. The I4OC (Initiative for Open Citations) is a collaboration between many parties, with the aim of promoting “unrestricted availability of scholarly citation data”. Citation data is available through Crossref, and available in R via our packages rcrossref, fulltext and crminer. Citation data is also available via the OCC; and this OCC data is now available in R through the new package citecorp.
How much citation data does the OCC have?
Quoting the OpenCitations website (as of today):
the OCC has ingested the references from 326,743 citing bibliographic resources, and contains information about 13,964,148 citation links to 7,565,367 cited resources
Why citation data? Citations are the links among scholarly works (articles, books, etc.), leading to many important uses including finding related articles, calculating article impact, and even use in academic hiring decisions.
Why open citation data? Until recently most citation data has been locked behind publisher walls. Unfortunately, many publishers see giving away citation data as losing potential profits. Through the I4OC, many publishers have made their citation metadata public, but some of the largest publishers still have not done so: Elsevier, American Chemical Society, IEEE. Without all citation data being open, any work that builds on citation data only has a sub-sample of all citations; you are drawing conclusions about citations from a rather small subset of all existing citations. Nonetheless, the currently available open citation data is an important resource; and can be amended with citation data behind paywalls for those that have access.
About citecorp and the OCC
OpenCitations created their own identifiers called Open Citation Identifiers (oci), e.g.,
020010009033611182421271436182433010601-02001030701361924302723102137251614233701000005090307
You are probably not going to be using oci identifiers, but rather DOIs and/or PMIDs
(PubMed identifier) and/or PMCIDs (PubMed Central identifier) (see the PubMed Wikipedia entry
for more). See ?citecorp::oc_lookup
for methods for cross-walking among identifier types.
OpenCitations has a Sparql endpoint for querying their data; you can find that at https://opencitations.net/sparql; we do interface with the OCC Sparql endpoint in citecorp, but we don’t provide a user interface directly to it in citecorp.
The OCC is also available as data dumps.
Links
citecorp source code: https://github.com/ropenscilabs/citecorp
citecorp on CRAN: https://cloud.r-project.org/web/packages/citecorp/
Installation
Install from CRAN
install.packages("citecorp")
Development version
remotes::install_github("ropenscilabs/citecorp")
Load citecorp
library(citecorp)
Converting among identifiers
Three functions are provided for converting among different identifier types; each function gives back a data.frame containing the url for the article, PMID, PMCID and DOI:
oc_doi2ids
oc_pmid2ids
oc_pmcid2ids
oc_doi2ids("10.1097/igc.0000000000000609")
#> type value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2 pmid 26645990
#> 3 pmcid PMC4679344
#> 4 doi 10.1097/igc.0000000000000609
oc_pmid2ids("26645990")
#> type value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2 doi 10.1097/igc.0000000000000609
#> 3 pmcid PMC4679344
#> 4 pmid 26645990
oc_pmcid2ids("PMC4679344")
#> type value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2 doi 10.1097/igc.0000000000000609
#> 3 pmid 26645990
#> 4 pmcid PMC4679344
Under the hood we interact with their Sparql endpoint to do these queries.
COCI methods
A series of three more functions are meant for fetching references of, citations to, or metadata for individual scholarly works.
Here, we look for data for the DOI 10.1108/jd-12-2013-0166
Peroni, S., Dutton, A., Gray, T. and Shotton, D. (2015), “Setting our bibliographic references free: towards open citation data”, Journal of Documentation, Vol. 71 No. 2, pp. 253-277.
Note: If you don’t load tibble
you get normal data.frame’s
library(tibble)
doi <- "10.1108/jd-12-2013-0166"
references: the works cited within the paper
oc_coci_refs(doi)
#> # A tibble: 36 x 7
#> oci timespan citing creation author_sc journal_sc cited
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 020010100083… P9Y2M5D 10.1108… 2015-03… no no 10.1001/j…
#> 2 020010100083… P41Y8M 10.1108… 2015-03… no no 10.1002/a…
#> 3 020010100083… P25Y6M 10.1108… 2015-03… no no 10.1002/(…
#> 4 020010100083… P17Y2M 10.1108… 2015-03… no no 10.1007/b…
#> 5 020010100083… P2Y2M3D 10.1108… 2015-03… no no 10.1007/s…
#> 6 020010100083… P5Y8M27D 10.1108… 2015-03… no no 10.1007/s…
#> 7 020010100083… P2Y3M 10.1108… 2015-03… no no 10.1016/j…
#> 8 020010100083… P1Y10M 10.1108… 2015-03… no no 10.1016/j…
#> 9 020010100083… P12Y 10.1108… 2015-03… no no 10.1023/a…
#> 10 020010100083… P13Y10M 10.1108… 2015-03… no no 10.1038/3…
#> # … with 26 more rows
citations: the works that cite the paper
oc_coci_cites(doi)
#> # A tibble: 13 x 7
#> oci timespan citing creation author_sc journal_sc cited
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 02001010707360… P1Y4M1D 10.1177/… 2016-07… no no 10.110…
#> 2 02007050504361… P2Y11M2… 10.7554/… 2018-03… no no 10.110…
#> 3 02001010405360… P2Y 10.1145/… 2018 no no 10.110…
#> 4 02001000903361… P2Y3M4D 10.1093/… 2017-06… no no 10.110…
#> 5 02001000007360… P1Y 10.1007/… 2017 no no 10.110…
#> 6 02003030406361… P0Y 10.3346/… 2015 no no 10.110…
#> 7 02001000007360… P2Y9M12D 10.1007/… 2017-12… no no 10.110…
#> 8 02003020303362… P1Y14D 10.3233/… 2016-03… no no 10.110…
#> 9 02003020303362… P3Y5M4D 10.3233/… 2018-08… no no 10.110…
#> 10 02001000007360… P2Y 10.1007/… 2018 no no 10.110…
#> 11 02001010402362… P3Y4M21D 10.1142/… 2018-07… no no 10.110…
#> 12 02001000007360… P1Y 10.1007/… 2017 no no 10.110…
#> 13 02001000507362… P2Y4M 10.1057/… 2017-08 no no 10.110…
metadata: including the ISSN, volumne, title, authors, etc.
oc_coci_meta(doi)
#> # A tibble: 1 x 13
#> reference source_id volume title author source_title oa_link citation
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 10.1001/… issn:002… 71 Sett… Peron… Journal Of … "" 10.1177…
#> # … with 5 more variables: page <chr>, citation_count <chr>, issue <chr>,
#> # year <chr>, doi <chr>
Use cases
There are many example use cases using the OCC already in the literature. Here are a few of those, not necessarily using R:
- Zhu et al. 2019. Nine Million Books and Eleven Million Citations: A Study of Book-Based Scholarly Communication Using OpenCitations. arXiv preprint arXiv:1906.06039
- Kaminska 2019. PLOS ONE – a case study of citation analysis of research papers based on the data in an open citation index (The OpenCitations Corpus)
- Simon et al. 2019. BioReader: a text mining tool for performing classification of biomedical literature. BMC Bioinformatics, 19(S13). doi:10.1186/s12859-019-2607-x (OpenCitations data used in Fig. 1)
- Di Iorio et al. 2019. Open data to evaluate academic researchers: an experiment with the Italian Scientific Habilitation. arXiv preprint arXiv:1902.03287
To do
- I’d like to vectorize the functions for converting among IDs (e.g.,
oc_doi2ids
) to make them more user friendly. See https://github.com/ropenscilabs/citecorp/issues/1 - Contributors! If you’d like to contribute, head on over to the repo and get started
Get in touch
Get in touch if you have any citecorp questions in the issue tracker or the rOpenSci discussion forum.