Abstract

There are several distinct ways to represent data drift in the Linked Open Data world. In this paper we introduce an approach for tracking data changes that has been used in the context of the OpenCitations Project. This approach is inspired by existing work on change-tracking mechanisms in documents created with word processors such as Microsoft Word and OpenOffice Writer.

RASH: https://w3id.org/oc/paper/occ-driftalod2016.html

Introduction

Data change over time, and the reasons for such change are manifold. On the one hand, data can contain mistakes that are corrected once they are identified, even after the publication date. On the other hand, information (or, better, the representation of an actual situation, such as the composition of the government of a country) naturally evolves over time.

RDF technologies (RDF, OWL, SPARQL, etc.) were not originally designed to keep track of such changes natively. Thus, alternative approaches have been proposed to extend these formalisms with mechanisms for recording them. Named Graphs and the Provenance Ontology (PROV-O) are among the most used and appropriate means for describing time-dependent (or, more generally, context-dependent) data. However, there are still different ways of keeping track of such changes over time.

In this paper we introduce an approach for tracking changes in RDF data by means of RDF provenance statements, which has been concretely used in the context of the OpenCitations Project. The main aim of OpenCitations is the creation of an open repository of scholarly citation data, the OpenCitations Corpus (OCC), made available under a Creative Commons public domain dedication, which provides in RDF accurate citation information (bibliographic references) harvested from the scholarly literature. All the entities in the OCC have metadata describing their provenance, so as to keep track of the curatorial activities related to each OCC entity, the curatorial agents involved, their roles, and the sources used for retrieving the data. One of the contributions of this work is an extension to the Provenance Ontology (PROV-O) for handling such provenance data. By means of this extension, we show how it is possible to reconstruct a particular status (or snapshot) of an entity in the OCC at a specified time, using a mechanism inspired by existing work on change tracking in documents created with word processors such as Microsoft Word and OpenOffice Writer.

The rest of the paper is organised as follows. In Section "Approaches to changes" we briefly introduce some possible approaches to keeping track of changes in RDF data. In Section "A document-inspired approach to data drift" we describe our approach to this issue, while in Section "A real application: the OpenCitations Corpus" we discuss its application in the context of the OCC. Finally, in Section "Conclusions" we conclude the paper by sketching out some future work.

Approaches to changes

In the past, several works in the Semantic Web domain have addressed theoretical and practical aspects of change tracking in ontologies and RDF data. However, the main focus of this paper is neither on expanding the theoretical notion of delta (i.e. the function that defines the changes) nor on discussing algorithms that identify changes between two versions of the same object (e.g. an ontology) a posteriori. Rather, we are interested in RDF-based mechanisms that explicitly keep track of changes when they happen, so as to reconstruct the whole history of an entity at a given time.

Two kinds of approach can be used for representing how a particular dataset has evolved over time. On the one hand, there are statement-centric approaches, which provide mechanisms to record how the set of statements in a dataset has evolved by means of simple operations such as addition and deletion. On the other hand, there are resource-centric approaches, which mainly allow one to say when an instance of a time-dependent class or property (traditionally called an anti-rigid concept) changes its status.

There are at least two possible approaches belonging to the first of the aforementioned categories: physical snapshots and massive statement reification.

A physical snapshot of a given LOD dataset is a record of all the statements in that dataset at a given time. With this technique, a full copy of the dataset is stored whenever it is deemed appropriate, e.g. every time a statement is added or modified, after a certain number of modifications to the dataset, or after a particular time interval (every week, every month, etc.). This is quite a common strategy for several LOD datasets available online (such as DBpedia, which makes available versioned datasets as described at http://wiki.dbpedia.org/datasets). It is quite easy to implement, but it requires large amounts of space and time to keep track of how a dataset has changed, since every snapshot records the entire dataset at a certain date.

The massive statement reification mechanism requires the creation of additional identifiers (one for each statement), each of which is annotated with when the statement was created or removed and by whom. This kind of approach can easily be coupled with existing models, such as PROV-O, so as to keep track of how a statement has been modified over time, similarly to what Wikidata implements. In this case, the size of the dataset continuously increases, since deleted statements are not actually removed from the dataset but only marked as deleted. However, this mechanism also allows one to track changes and to index them when they actually happen. This is a considerable advantage, since it allows one to restore any past status of the dataset by discarding all the modifications that happened after a certain date.
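
The restoration step described above can be sketched as follows. This is a minimal illustration rather than an account of how any existing system implements it: the `ReifiedStatement` class, its field names, the triples, and the dates are all hypothetical, and statements are modelled as plain tuples instead of real RDF reification.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ReifiedStatement:
    """A statement plus bookkeeping marks (hypothetical field names)."""
    subject: str
    predicate: str
    obj: str
    created: datetime
    deleted: Optional[datetime] = None  # None means "never deleted"


def state_at(statements, when):
    """Return the triples valid at `when`, ignoring all later modifications."""
    return {
        (s.subject, s.predicate, s.obj)
        for s in statements
        if s.created <= when and (s.deleted is None or s.deleted > when)
    }


# Hypothetical log: one statement is marked as deleted, never physically removed.
log = [
    ReifiedStatement(":e", ":label", "old value",
                     created=datetime(2016, 1, 1), deleted=datetime(2016, 6, 1)),
    ReifiedStatement(":e", ":label", "new value",
                     created=datetime(2016, 6, 1)),
]

before = state_at(log, datetime(2016, 3, 1))
after = state_at(log, datetime(2016, 7, 1))
```

Note that the log only ever grows, which mirrors the size issue mentioned above: the deleted statement stays in `log` and is merely filtered out when a later state is requested.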

Among the resource-centric mechanisms, it is worth mentioning the provenance-centric and the by-design approaches, which allow one to record changes to a certain resource, e.g. a particular class or an individual, by reusing existing models and without explicitly referring to the set of statements the resource is involved in.

An ontology that can be used for the former category is PROV-DC, which enables one to express how entities change over time by means of additional classes and properties added to PROV-O, allowing the specification of activities such as prov:Create, prov:Modify, etc. While this is a valuable and simple approach, it does not make it easy to understand formally which particular aspect of an entity has actually changed.

The alternative approaches, i.e. those compliant with the by-design mechanism, oblige the dataset creator to include, from the very beginning, a fine-grained conceptualisation of the (anti-rigid) entities that can change over time in the ontology she is using for representing the data. A good option here is to use particular ontology design patterns, such as the time-indexed situation pattern or the 4D Fluent OWL ontology. However, if something that can now be modified was not considered as such at the very beginning, part of the ontology used for representing the data (and consequently the data themselves) may have to be modified accordingly. This wastes time and potentially changes the current organisation of the data, thus limiting their reusability in the long term.

Neither of the aforementioned resource-centric mechanisms deletes any information permanently. Rather, they require the entire history of each entity to be included in the dataset, since they use particular ontological constructs to tell the user when an entity has been created or invalidated, by whom, and so on.

A document-inspired approach to data drift

The approach we propose reuses techniques from both statement-centric and resource-centric approaches, taking inspiration from a well-known structure for keeping track of changes in word-processor documents, in particular OpenOffice Writer (OOW herein). When an author activates the change tracking plugin in OOW, every insertion into and deletion from the document is tracked by using two different mechanisms from overlapping markup theories, called milestone (for insertions) and stand-off markup (for deletions). Milestones allow one to add the new content directly within the existing text, marking it in a recognisable way. In contrast, stand-off markup explicitly removes a piece of text from the actual content of the document and places it in an auxiliary space for easy retrieval and, if needed, restoration.

Following the same principles, we developed a mechanism that allows us to add statements to, or remove statements from, the current set of data related to an entity (i.e. the RDF triples that have such entity as subject, readapting some of the aspects of the approach introduced in ), while preserving provenance information about such addition/deletion actions in an appropriate contextual space, i.e. the provenance graph associated with the entity (as also suggested in ). To do that we leverage the PROV-O ontology and extend it with an additional data property called hasUpdateQuery, which allows us to record insertions and deletions as SPARQL INSERT and SPARQL DELETE queries. SPARQL variables are not allowed in these update queries.

The main idea of our approach is that each entity in a dataset (i.e. an instance e of the class prov:Entity) is represented by one or more snapshots (other instances e1, e2, e3, … of prov:Entity, each a specialisation of e via prov:specializationOf). Each snapshot records the composition of the entity e (i.e. the set of statements using e as subject) at a fixed point in time. In addition, each snapshot is linked to the previous one, according to the temporal order of their creation and invalidation, by means of the property prov:wasDerivedFrom.
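
To make the snapshot chain concrete, the following sketch walks the prov:wasDerivedFrom links from the most recent snapshot back to the first one. It assumes, for illustration only, that the links have already been extracted into a plain dictionary; the function name `history` and the identifiers `:e1`, `:e2`, `:e3` are hypothetical.

```python
def history(latest, derived_from):
    """Return the snapshots of an entity, oldest first.

    `derived_from` maps each snapshot to the snapshot it was derived
    from (the object of prov:wasDerivedFrom); the first snapshot has
    no entry in the mapping.
    """
    chain = [latest]
    seen = {latest}
    while chain[-1] in derived_from:
        previous = derived_from[chain[-1]]
        if previous in seen:
            # A well-formed chain is linear; guard against malformed data.
            raise ValueError("cycle in the derivation chain")
        chain.append(previous)
        seen.add(previous)
    chain.reverse()
    return chain


# Hypothetical chain: :e3 was derived from :e2, which was derived from :e1.
links = {":e3": ":e2", ":e2": ":e1"}
ordered = history(":e3", links)
```

Under these assumptions, `ordered` lists the snapshots in chronological order, which is the order in which their recorded changes would be replayed (or undone in reverse).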

Let us introduce a working example to discuss the proposed approach. Consider the entity sp, composed of the following two statements:

            
:sp a foaf:Person ; 
  foaf:name "Silvio Peroni" .
         

The addition of these statements also generates, at least, the following provenance statements, so as to set sp as a provenance entity, where its statements are implicitly encoded in a specific snapshot:

            
:sp a prov:Entity .

:sp-snapshot-1 a prov:Entity ;
  prov:specializationOf :sp .
         

Now suppose the curator of these data decides to split the full name of sp into two distinct properties, i.e. foaf:givenName and foaf:familyName, and to remove the more generic foaf:name:

            
:sp a foaf:Person ;
  foaf:givenName "Silvio" ;
  foaf:familyName "Peroni" .
         

In this case, a new snapshot of the entity will be generated, which specifies which statements have been added/deleted (by means of the property new:hasUpdateQuery) starting from the previous snapshot linked through the property prov:wasDerivedFrom, as follows:

            
:sp-snapshot-2 a prov:Entity ;
  prov:specializationOf :sp ;
  prov:wasDerivedFrom :sp-snapshot-1 ;
  new:hasUpdateQuery "INSERT DATA { :sp foaf:givenName 'Silvio' ; foaf:familyName 'Peroni' } ; DELETE DATA { :sp foaf:name 'Silvio Peroni' }" .
         

Using such a snapshot-oriented structure, which clearly indicates how a previous snapshot of an entity has been modified to reach the set of statements currently available, makes it easier to:

For instance, to get back to the status recorded by the first snapshot of the aforementioned example, we can run all the inverse operations of the update query specified in the second snapshot, i.e.:

            
INSERT DATA { :sp foaf:name 'Silvio Peroni' } ;
DELETE DATA { :sp foaf:givenName 'Silvio' ; foaf:familyName 'Peroni' }
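
The inversion shown above can be automated because the update queries contain no SPARQL variables: it is enough to swap INSERT DATA with DELETE DATA and to reverse the order of the operations. The sketch below is one possible way to do it with the Python standard library only; the function name `invert_update` is hypothetical, and the brace matching assumes that no `{` or `}` characters appear inside string literals.

```python
import re

# Matches the start of an "INSERT DATA {" or "DELETE DATA {" operation.
_OPERATION = re.compile(r"\b(INSERT|DELETE)\s+DATA\s*\{")


def invert_update(query):
    """Return the update query that undoes `query`.

    Assumes variable-free INSERT DATA / DELETE DATA operations and
    no braces inside literals (a simplification for illustration).
    """
    inverted = []
    position = 0
    while True:
        match = _OPERATION.search(query, position)
        if match is None:
            break
        # Find the closing brace of this operation, allowing nested
        # blocks such as GRAPH <...> { ... }.
        depth, index = 1, match.end()
        while depth and index < len(query):
            if query[index] == "{":
                depth += 1
            elif query[index] == "}":
                depth -= 1
            index += 1
        if depth:
            raise ValueError("unbalanced braces in update query")
        body = query[match.end():index - 1].strip()
        keyword = "DELETE" if match.group(1) == "INSERT" else "INSERT"
        inverted.append(f"{keyword} DATA {{ {body} }}")
        position = index
    # Undo the operations in the opposite order to that in which they ran.
    return " ; ".join(reversed(inverted))


forward = ("INSERT DATA { :sp foaf:givenName 'Silvio' ; foaf:familyName 'Peroni' } ; "
           "DELETE DATA { :sp foaf:name 'Silvio Peroni' }")
backward = invert_update(forward)
```

Applying `invert_update` twice returns the original query, which is a quick sanity check that no statement is lost in the inversion.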
         

A real application: the OpenCitations Corpus

The OCC is accompanied by a formal metadata model, which is strictly followed by all the data in the corpus. The metadata model is explicitly aligned with the SPAR Ontologies for expressing the data, and with other standard vocabularies, e.g. PROV-O and PROV-DC, for expressing contextual information about entities, such as provenance information. All the ontological entities introduced by the metadata model are conveniently grouped together in the OpenCitations Ontology (OCO), which also implements the property oco:hasUpdateQuery for keeping track of changes as described in Section "A document-inspired approach to data drift". The entities included in the corpus can have one of the following types:

Each OCC entity is identified by a URL (e.g. https://w3id.org/oc/corpus/br/525205) that includes a two-letter short name for the class of the entity (e.g. br for bibliographic resources) and the number (e.g. 525205) that uniquely identifies it among the resources of the same type. Regardless of their type, all entities have associated provenance information like that introduced in Section "A document-inspired approach to data drift". In particular, we record four different kinds of provenance entities: snapshots of entity metadata (se), curatorial activities (ca), curatorial roles (cr), and provenance agents (pa).

All this information is stored in the provenance graph related to the OCC entity under consideration. The URL of this provenance graph is the URL of the entity followed by /prov/. The URL of each of the aforementioned provenance entities (e.g. https://w3id.org/oc/corpus/br/525205/prov/se/1) is built by using the provenance graph URL as base and appending the two-letter short name for the class of the provenance entity (e.g. se for snapshot of entity metadata), then /, then the number (e.g. 1) that uniquely identifies it among the resources of the same type within that particular provenance graph. The exception to this URL template is the provenance agents, which are shared across the whole corpus and thus have https://w3id.org/oc/corpus/prov/pa/ as base URL (e.g. https://w3id.org/oc/corpus/prov/pa/1).
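
The URL template just described can be summarised in a few lines of code. The sketch below only restates the rules given in the text; the function and constant names are hypothetical and are not part of the OpenCitations software.

```python
CORPUS_BASE = "https://w3id.org/oc/corpus/"


def entity_url(short_name, number):
    """URL of an OCC entity, e.g. short name 'br' and number 525205."""
    return f"{CORPUS_BASE}{short_name}/{number}"


def provenance_graph_url(entity):
    """URL of the provenance graph: the entity URL plus /prov/."""
    return f"{entity}/prov/"


def provenance_entity_url(entity, short_name, number):
    """URL of a provenance entity related to `entity`.

    Provenance agents ('pa') are shared by the whole corpus, so they
    use a corpus-level base URL instead of the entity's provenance graph.
    """
    if short_name == "pa":
        return f"{CORPUS_BASE}prov/pa/{number}"
    return f"{provenance_graph_url(entity)}{short_name}/{number}"


resource = entity_url("br", 525205)
first_snapshot = provenance_entity_url(resource, "se", 1)
first_agent = provenance_entity_url(resource, "pa", 1)
```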

As an example, let us discuss the provenance statements added during the creation and modification of https://w3id.org/oc/corpus/br/525205, all of which are available online. After the creation, the following statements are added to the corpus:

            
# Snapshot of entity metadata
<https://w3id.org/oc/corpus/br/525205/prov/se/1> a prov:Entity ;
  rdfs:label "snapshot of entity metadata 1 related to bibliographic resource 525205 [se/1 -> br/525205]" ;
  prov:generatedAtTime "2016-08-08T22:25:48"^^xsd:dateTime ;
  prov:hadPrimarySource <http://api.crossref.org/works/10.2196/mhealth.5331> ;
  prov:specializationOf <https://w3id.org/oc/corpus/br/525205> ;
  prov:wasGeneratedBy <https://w3id.org/oc/corpus/br/525205/prov/ca/1> .

# Curatorial activity
<https://w3id.org/oc/corpus/br/525205/prov/ca/1> a prov:Activity, prov:Create ;
  rdfs:label "curatorial activity 1 related to bibliographic resource 525205 [ca/1 -> br/525205]" ;
  dcterms:description "The entity 'https://w3id.org/oc/corpus/br/525205' has been created." ;
  prov:qualifiedAssociation 
    <https://w3id.org/oc/corpus/br/525205/prov/cr/1> ,
    <https://w3id.org/oc/corpus/br/525205/prov/cr/2> .

# Curatorial roles
<https://w3id.org/oc/corpus/br/525205/prov/cr/1> a prov:Association ;
  rdfs:label "curatorial role 1 related to bibliographic resource 525205 [cr/1 -> br/525205]" ;
  prov:agent <https://w3id.org/oc/corpus/prov/pa/1> ;
  prov:hadRole oco:occ-curator .

<https://w3id.org/oc/corpus/br/525205/prov/cr/2> a prov:Association ;
  rdfs:label "curatorial role 2 related to bibliographic resource 525205 [cr/2 -> br/525205]" ;
  prov:agent <https://w3id.org/oc/corpus/prov/pa/2> ;
  prov:hadRole oco:source-metadata-provider .

# Provenance agents
<https://w3id.org/oc/corpus/prov/pa/1> a prov:Agent ;
  rdfs:label "provenance agent 1 [pa/1]" ;
  foaf:name "SPACIN CrossrefProcessor" .

<https://w3id.org/oc/corpus/prov/pa/2> a prov:Agent ;
  rdfs:label "provenance agent 2 [pa/2]" ;
  foaf:name "Crossref" .
         

Basically, the first snapshot of the resource br/525205 was created on August 8, 2016, at 22:25:48 (property prov:generatedAtTime), starting from the data contained in the source document http://api.crossref.org/works/10.2196/mhealth.5331 (property prov:hadPrimarySource). The activity that generated the data of br/525205 (property prov:wasGeneratedBy) was a creation (class prov:Create) that involved (property prov:qualifiedAssociation) two agents (referred to by the property prov:agent), i.e. SPACIN CrossrefProcessor (one of the automatic scripts of OpenCitations responsible for the creation of RDF data) and Crossref, as OCC curator and source metadata provider respectively.

Then, about three weeks after its creation (on August 29, 2016), the resource br/525205 was extended with additional data concerning its citation links to other bibliographic resources, as well as the full text of the references it includes. The following provenance statements were thus generated:

            
# The old snapshot has been invalidated...
<https://w3id.org/oc/corpus/br/525205/prov/se/1> 
  prov:invalidatedAtTime "2016-08-29T22:42:06"^^xsd:dateTime ;
  prov:wasInvalidatedBy <https://w3id.org/oc/corpus/br/525205/prov/ca/2> .

# ... and it has been substituted by a new one
<https://w3id.org/oc/corpus/br/525205/prov/se/2> a prov:Entity ;
  rdfs:label "snapshot of entity metadata 2 related to bibliographic resource 525205 [se/2 -> br/525205]" ;
  prov:generatedAtTime "2016-08-29T22:42:06"^^xsd:dateTime ;
  prov:hadPrimarySource <http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4911509/fullTextXML> ;
  prov:specializationOf <https://w3id.org/oc/corpus/br/525205> ;
  prov:wasDerivedFrom <https://w3id.org/oc/corpus/br/525205/prov/se/1> ;
  prov:wasGeneratedBy <https://w3id.org/oc/corpus/br/525205/prov/ca/2> ;
  oco:hasUpdateQuery "INSERT DATA { GRAPH <https://w3id.org/oc/corpus/br/> { <https://w3id.org/oc/corpus/br/525205> <http://purl.org/spar/cito/cites> <https://w3id.org/oc/corpus/br/1095459> . <https://w3id.org/oc/corpus/br/525205> <http://purl.org/vocab/frbr/core#part> <https://w3id.org/oc/corpus/be/727491> . <https://w3id.org/oc/corpus/br/525205> <http://purl.org/vocab/frbr/core#part> <https://w3id.org/oc/corpus/be/727452> ... } }" .

# Curatorial activity
<https://w3id.org/oc/corpus/br/525205/prov/ca/2> a prov:Activity, prov:Modify ;
  rdfs:label "curatorial activity 2 related to bibliographic resource 525205 [ca/2 -> br/525205]" ;
  dcterms:description "The entity 'https://w3id.org/oc/corpus/br/525205' has been extended with citation data." ;
  prov:qualifiedAssociation 
    <https://w3id.org/oc/corpus/br/525205/prov/cr/3> ,
    <https://w3id.org/oc/corpus/br/525205/prov/cr/4> .

# Curatorial roles
<https://w3id.org/oc/corpus/br/525205/prov/cr/3> a prov:Association ;
  rdfs:label "curatorial role 3 related to bibliographic resource 525205 [cr/3 -> br/525205]" ;
  prov:agent <https://w3id.org/oc/corpus/prov/pa/1> ;
  prov:hadRole oco:occ-curator .
  
<https://w3id.org/oc/corpus/br/525205/prov/cr/4> a prov:Association ;
  rdfs:label "curatorial role 4 related to bibliographic resource 525205 [cr/4 -> br/525205]" ;
  prov:agent <https://w3id.org/oc/corpus/prov/pa/2> ;
  prov:hadRole oco:source-metadata-provider .
         

The new snapshot has substituted the previous one (properties prov:invalidatedAtTime and prov:wasInvalidatedBy) by updating the information about the resource br/525205 with the update query specified (property oco:hasUpdateQuery). The new snapshot has been created by a particular modification activity (class prov:Modify) that involved the same agents with the same roles as before.
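
The generation and invalidation times recorded above are enough to decide which snapshot was current at any given moment. The following sketch shows one way to do this, using the two timestamps from the example; the dictionary layout and the function name `snapshot_at` are hypothetical, and treating each validity interval as half-open (generation time included, invalidation time excluded) is an assumption made here so that the two snapshots never overlap.

```python
from datetime import datetime

# Times taken from the provenance statements of br/525205 shown above.
snapshots = [
    {"id": "se/1",
     "generated": datetime(2016, 8, 8, 22, 25, 48),
     "invalidated": datetime(2016, 8, 29, 22, 42, 6)},
    {"id": "se/2",
     "generated": datetime(2016, 8, 29, 22, 42, 6),
     "invalidated": None},  # still valid
]


def snapshot_at(snapshots, when):
    """Return the id of the snapshot valid at `when`, or None if none was."""
    for snapshot in snapshots:
        started = snapshot["generated"] <= when
        ended = (snapshot["invalidated"] is not None
                 and when >= snapshot["invalidated"])
        if started and not ended:
            return snapshot["id"]
    return None


mid_august = snapshot_at(snapshots, datetime(2016, 8, 15))
september = snapshot_at(snapshots, datetime(2016, 9, 1))
```

Once the relevant snapshot is identified, the state it records could be rebuilt by undoing the update queries of all the snapshots generated after it.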

Conclusions

In this paper we have introduced an approach for keeping track of changes in RDF data and, consequently, in LOD datasets. The method proposed is actually derived from existing techniques applied to the Document Engineering domain for addressing similar issues. We have also described the use of this approach within the OpenCitations Project as the main mechanism for providing a complete history of how the entities in the OpenCitations Corpus have evolved in time. In the future, we plan to develop automatic tools that allow us to restore a particular snapshot of an entity by looking at its provenance information only, so as to facilitate the restoration of entities at a particular time.

References

  1. Peroni, S., Shotton, D. (2016). Metadata for the OpenCitations Corpus. Figshare. https://dx.doi.org/10.6084/m9.figshare.3443876

  2. Peroni, S., Dutton, A., Gray, T., Shotton, D. (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71 (2): 253–277. http://dx.doi.org/10.1108/JD-12-2013-0166

  3. Peroni, S., Shotton, D., Vitali, F. (2016). Freedom for bibliographic references: OpenCitations arise. To appear in Proceedings of 2016 International Workshop on Linked Data for Information Extraction (LD4IE 2016). https://w3id.org/oc/paper/occ-lisc2016.html

  4. Lebo, T., Sahoo, S., McGuinness, D. (2013). PROV-O: The PROV Ontology. W3C Recommendation, 30 April 2013. World Wide Web Consortium. http://www.w3.org/TR/prov-o/

  5. Carroll, J. J., Bizer, C., Hayes, P., Stickler, P. (2005). Named graphs. Web Semantics: Science, Services and Agents on the World Wide Web, 3 (4): 247–267. http://dx.doi.org/10.1016/j.websem.2005.09.001

  6. Vrandecic, D., Krötzsch, M. (2014). Wikidata: a free collaborative knowledge base. Communications of the ACM, 57 (10): 78–85. http://dx.doi.org/10.1145/2629489

  7. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C. (2015). DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6 (2): 167–195. http://dx.doi.org/10.3233/SW-140134

  8. Garijo, D., Eckert, K. (2013). Dublin Core to PROV Mapping. W3C Working Group Note, 30 April 2013. https://www.w3.org/TR/prov-dc/

  9. Welty, C. A., Fikes, R. (2006). A Reusable Ontology for Fluents in OWL. In Proceedings of FOIS 2006: 226–236.

  10. Peroni, S., Poggi, F., Vitali, F. (2014). Overlapproaches in documents: a definitive classification (in OWL, 2!). In Proceedings of Balisage 2014. http://dx.doi.org/10.4242/BalisageVol13.Peroni01

  11. Peroni, S. (2014). The Semantic Publishing and Referencing Ontologies. In Semantic Web Technologies and Legal Scholarly Publishing: 121–193. http://dx.doi.org/10.1007/978-3-319-04777-5_5

  12. Guarino, N., Welty, C. A. (2009). An Overview of OntoClean. In Handbook on Ontologies: 201–220. Berlin, Germany: Springer. ISBN: 978-3-540-70999-2

  13. Noy, N. F., Kunnatur, S., Klein, M. C. A., Musen, M. A. (2004). Tracking Changes During Ontology Evolution. In Proceedings of ISWC 2004: 259–273. http://dx.doi.org/10.1007/978-3-540-30475-3_19

  14. Zeginis, D., Tzitzikas, Y., Christophides, V. (2007). On the Foundations of Computing Deltas Between RDF Models. In Proceedings of ISWC/ASWC 2007: 637–651. http://dx.doi.org/10.1007/978-3-540-76298-0_46

  15. Völkel, M., Groza, T. (2006). SemVersion: RDF-based ontology versioning system. In Proceedings of the IADIS WWW/Internet 2006.

  16. Ding, L., Peng, Y., da Silva, P. P., McGuinness, D. L. (2005). Tracking RDF Graph Provenance using RDF Molecules. Technical report. http://ebiquity.umbc.edu/get/a/publication/178.pdf

  17. Berners-Lee, T., Connolly, D. (2015). Delta: An Ontology for the Distribution of Differences Between RDF Graphs. https://www.w3.org/DesignIssues/Diff

We have not specified any formal domain or range for this property, so as to foster its reuse in different contexts. However, in the OpenCitations Corpus it has been used only on prov:Entity individuals, each referring to a particular snapshot of a certain OCC bibliographic entity.