The SPAR Citation Indexer

The SPAR Citation Indexer, a.k.a. SPACIN, is a script and a series of Python classes that process JSON files containing the bibliographic reference lists of papers. These files are produced by another script included in the OpenCitations GitHub repository, which retrieves such lists by querying the Europe PubMed Central API. An excerpt of an input JSON file used by SPACIN is shown below:

            
{
  "localid": "MED-27193261", "doi": "10.1038/ncomms11627", 
  "curator": "BEE EuropeanPubMedCentralProcessor", 
  "source_provider": "Europe PubMed Central", 
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4874038/fullTextXML", 
  "references": [ {
    "bibentry": "Weaver C. T, Elson C. O, Fouser L. A. & Kolls J. K. The Th17 pathway and inflammatory diseases of the intestines, lungs, and skin. Annu. Rev. Pathol. 8, 477\u2013512 (2013). PMID: 23157335", 
    "doi": "10.1146/annurev-pathol-011110-130318", … }, … ], … 
}
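As a minimal sketch of how a record of this kind could be consumed (this is not SPACIN's own code: the function name iter_references and the second, DOI-less reference are illustrative assumptions, and only keys visible in the excerpt above are used), the following Python lists the textual entry and, when present, the DOI of each reference:

```python
import json

# Sample record restricted to keys visible in the excerpt above.
# The first bibentry is shortened and the second reference is hypothetical.
SAMPLE = """
{
  "localid": "MED-27193261", "doi": "10.1038/ncomms11627",
  "source_provider": "Europe PubMed Central",
  "references": [
    {"bibentry": "Weaver C. T, Elson C. O, Fouser L. A. & Kolls J. K. The Th17 pathway",
     "doi": "10.1146/annurev-pathol-011110-130318"},
    {"bibentry": "A hypothetical entry without a DOI"}
  ]
}
"""


def iter_references(record):
    """Yield (bibentry, doi) pairs; doi is None when the reference has none."""
    for ref in record.get("references", []):
        yield ref.get("bibentry"), ref.get("doi")


record = json.loads(SAMPLE)
pairs = list(iter_references(record))
for bibentry, doi in pairs:
    print(doi or "no DOI", "|", bibentry[:40])
```

References without a DOI are kept rather than dropped, since the textual entry alone may still be usable for later matching.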

SPACIN processes such JSON files and retrieves additional metadata about all the citing and cited articles by querying the Crossref API and the ORCID API. These APIs are also used to disambiguate bibliographic resources and agents by means of the identifiers retrieved (e.g. DOI, ISSN, ISBN, ORCID, URL, and Crossref member URL). Once SPACIN has retrieved all these metadata, RDF resources are created (or reused, if they have already been added in the past) and stored in the file system in JSON-LD format. They are also uploaded, via the SPARQL UPDATE protocol, to the triplestore specified by the variable triplestore_url in the file conf_spacin.py.
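The "create or reuse" behaviour can be illustrated with a toy in-memory registry. This is only a sketch under stated assumptions: the class ResourceRegistry, its method get_or_create, and the normalisation by lower-casing are all hypothetical and do not reproduce how SPACIN actually disambiguates entities.

```python
from itertools import count


class ResourceRegistry:
    """Toy registry: reuse a resource when one of its identifiers is already known."""

    def __init__(self):
        self._by_identifier = {}       # (scheme, normalised value) -> resource number
        self._next_number = count(1)   # assumed 1-based numbering

    def get_or_create(self, identifiers):
        """identifiers: iterable of (scheme, value) pairs, e.g. ("doi", "10.1038/...")."""
        # Simplification: lower-case everything (reasonable for DOIs, crude for URLs).
        keys = [(scheme.lower(), value.strip().lower()) for scheme, value in identifiers]
        # Reuse the first resource that already carries one of these identifiers.
        for key in keys:
            if key in self._by_identifier:
                number = self._by_identifier[key]
                break
        else:
            number = next(self._next_number)
        # Record every identifier so that later look-ups by any of them succeed.
        for key in keys:
            self._by_identifier.setdefault(key, number)
        return number


registry = ResourceRegistry()
first = registry.get_or_create([("doi", "10.1038/ncomms11627")])
again = registry.get_or_create(
    [("DOI", "10.1038/NCOMMS11627"), ("url", "http://example.org/a")]
)
other = registry.get_or_create([("doi", "10.1146/annurev-pathol-011110-130318")])
print(first, again, other)
```

The second call reuses the first resource because the DOI matches after normalisation, and it also attaches the new URL to that resource.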

SPACIN stores all the metadata relevant to bibliographic entities by using the OCC metadata model summarised in the figure below. The ontological terms of this metadata model are collected in an ontology called the OpenCitations Ontology (OCO), which includes several terms from the SPAR Ontologies and other vocabularies. In particular, the following six bibliographic entity types occur in the datasets created by SPACIN:

bibliographic resources (br), class fabio:Expression – resources that either cite or are cited by other bibliographic resources (e.g. journal articles), or that contain such citing/cited resources (e.g. journals);
resource embodiments (re), class fabio:Manifestation – details of the physical or digital forms in which the bibliographic resources are made available by their publishers;
bibliographic entries (be), class biro:BibliographicReference – literal textual bibliographic entries occurring in the reference lists of bibliographic resources;
responsible agents (ra), class foaf:Agent – names of agents having certain roles with respect to the bibliographic resources (i.e. names of authors, editors, publishers, etc.);
agent roles (ar), class pro:RoleInTime – roles held by agents with respect to the bibliographic resources (e.g. author, editor, publisher);
identifiers (id), class datacite:Identifier – external identifiers (e.g. DOI, ORCID, PubMedID) associated with bibliographic resources and agents.

The Graffoo diagram [4] of the main ontological entities created by SPACIN.

Upon initial curation by SPACIN, a URL is assigned to each entity, according to the following schema:

            
[corpus URL] + [sub-dataset ID] + / + [entity ID]

The corpus URL (https://w3id.org/oc/corpus/ for the OCC) identifies the entire dataset being created. It can be set by modifying the variable base_iri in the conf_spacin.py file in the GitHub repository. The sub-dataset ID is the two-letter short name for the class of items under consideration (e.g. br for bibliographic resources), while the entity ID is a number that uniquely identifies a particular item among resources of the same type. For instance, the entity https://w3id.org/oc/corpus/br/525205 in the OCC identifies the 525,205th bibliographic resource stored by SPACIN. The two-letter short name followed by / and the number (br/525205 in the example) is called the corpus identifier, since it uniquely identifies any entity within the dataset being built with SPACIN.
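The URL schema can be sketched in a few lines of Python. The helper names entity_url and corpus_identifier are illustrative rather than part of SPACIN, and the check that entity numbers start at 1 is an assumption suggested by the ordinal reading of the example.

```python
import re

BASE_IRI = "https://w3id.org/oc/corpus/"  # the OCC value of base_iri

# Two-letter short names of the six entity types listed earlier.
SHORT_NAMES = {"br", "re", "be", "ra", "ar", "id"}

_CORPUS_ID = re.compile(r"^(?P<short>[a-z]{2})/(?P<number>[1-9][0-9]*)$")


def entity_url(short_name, number, base_iri=BASE_IRI):
    """Build [corpus URL] + [sub-dataset ID] + / + [entity ID]."""
    if short_name not in SHORT_NAMES:
        raise ValueError(f"unknown sub-dataset ID: {short_name!r}")
    if number < 1:  # assumption: numbering is 1-based
        raise ValueError("entity ID must be a positive integer")
    return f"{base_iri}{short_name}/{number}"


def corpus_identifier(url, base_iri=BASE_IRI):
    """Return the corpus identifier (e.g. 'br/525205') of an entity URL."""
    if not url.startswith(base_iri):
        raise ValueError("URL does not belong to this corpus")
    candidate = url[len(base_iri):]
    match = _CORPUS_ID.match(candidate)
    if match is None or match.group("short") not in SHORT_NAMES:
        raise ValueError(f"not an entity URL: {url!r}")
    return candidate


url = entity_url("br", 525205)
print(url)
print(corpus_identifier(url))
```

Longer URLs that merely start with an entity URL, such as those with further path segments, are deliberately rejected by corpus_identifier.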

Each entity created by SPACIN is associated with metadata describing its provenance, expressed using the PROV-O ontology and its PROV-DC extension (e.g. https://w3id.org/oc/corpus/br/525205/prov/se/1 in the OCC, which describes br/525205). In particular, SPACIN stores information about the curatorial activities related to each entity, the curatorial agents involved, and their roles.

All these RDF data are stored in BibJSON encoded as JSON-LD, defined through an appropriate JSON-LD context which hides the complexity of the model (shown in the figure above) behind natural language keywords. For instance, the following excerpt is the JSON-LD linearisation of the aforementioned br/525205 entity:

            
{
  "iri": "gbr:525205", "a": [ "article", "document" ], 
  "label": "bibliographic resource 525205 [br/525205]", 
  "title": "The Electronic Patient Reported Outcome Tool: Testing Usability and Feasibility of a Mobile App and Portal to Support Care for Patients With Complex Chronic Disease and Disability in Primary Care Settings", 
  "year": "2016", "part_of": "gbr:476045", "format": "gre:217773", 
  "reference": [ "gbe:727463", "gbe:727473", … ], 
  "citation": [ "gbr:1095446", "gbr:525190", … ], 
  "contributor": [ "gar:1591192", "gar:1591193", … ], 
  "identifier": [ "gid:816999", "gid:816997", … ]
}

In this excerpt, iri defines the URL of the resource under consideration, while a, label, title, year, part_of, reference, citation, format, contributor, and identifier stand for rdf:type, rdfs:label, dcterms:title, fabio:hasPublicationYear, frbr:partOf, biro:references, cito:cites, frbr:embodiment, pro:isDocumentContextFor, and datacite:hasIdentifier respectively. The prefixes gbr:, gbe:, gre:, gar:, and gid: abbreviate https://w3id.org/oc/corpus/br/, https://w3id.org/oc/corpus/be/, https://w3id.org/oc/corpus/re/, https://w3id.org/oc/corpus/ar/, and https://w3id.org/oc/corpus/id/ respectively.
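To make the role of the prefixes concrete, the sketch below expands the prefixed values of such a record into full URLs. It is a hand-written illustration, not how SPACIN or a JSON-LD processor resolves a context: the function expand is hypothetical, and the gar entry is assumed by analogy with the URL schema described earlier.

```python
PREFIXES = {
    "gbr": "https://w3id.org/oc/corpus/br/",
    "gbe": "https://w3id.org/oc/corpus/be/",
    "gre": "https://w3id.org/oc/corpus/re/",
    "gid": "https://w3id.org/oc/corpus/id/",
    # Assumed by analogy with the URL schema (agent roles, short name "ar").
    "gar": "https://w3id.org/oc/corpus/ar/",
}


def expand(value):
    """Expand 'prefix:local' strings (also inside lists); leave anything else as is."""
    if isinstance(value, list):
        return [expand(item) for item in value]
    if isinstance(value, str) and ":" in value:
        prefix, local = value.split(":", 1)
        if prefix in PREFIXES:
            return PREFIXES[prefix] + local
    return value


# A few fields taken from the excerpt above; elided list items are omitted.
record = {
    "iri": "gbr:525205",
    "year": "2016",
    "part_of": "gbr:476045",
    "format": "gre:217773",
    "reference": ["gbe:727463", "gbe:727473"],
}
expanded = {key: expand(value) for key, value in record.items()}
for key, value in expanded.items():
    print(key, "->", value)
```

Strings that contain a colon but no known prefix, such as the title in the excerpt, are returned unchanged. In practice a JSON-LD library would perform this expansion from the context itself.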

Additional information about how SPACIN stores citation data, and the way they are represented in RDF, is detailed in the OCC Metadata Document.
