Overview

StringToURI is a simple link generation framework that helps link two related data sets on the web of data. Given one predicate in each of two data sets, it matches the values of the two predicates and, in the updated data set, replaces the matching string values with URIs.

This module was built to work within Datalift, a linked data publishing platform. However, it can also be used as a stand-alone application or in other environments.

Use cases

Its classes can be used in several ways:

Workflow

The StringToURI workflow with its input, output and three components

Inner working

To generate links, StringToURI is built around the Sesame Java API and relies mainly on Repository, RepositoryConnection, Statement, TupleQuery and Update.

StringToURI's source code is divided into three components:

Data sets management

StringToURI manages three kinds of data sources:

These data sets are based on the Repository class and managed through RepositoryConnection, which is used to handle a data set's namespaces, to send SELECT and UPDATE queries, to retrieve tuples according to different criteria, and to control data modifications through its commit mechanism.

Generating links

An interlinking process can have different levels of customization:

For now, the matching process only retains strictly equal values, without any similarity measure. When values match, a list of new tuples is created in which the target predicate has a link as its object; this link is the URI of the entity that holds the same value in the reference data set.
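The exact-match step described above can be sketched in plain Java, without Sesame. This is only an illustration under stated assumptions, not StringToURI's actual code: the class ExactMatchSketch, the Triple type, the method matchExactly and the example.org URIs are all hypothetical, and each subject is assumed to carry a single value for the target predicate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch: none of these names come from StringToURI itself.
final class ExactMatchSketch {

    // Minimal triple kept as plain strings; the object holds the generated link.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return "<" + subject + "> <" + predicate + "> <" + object + ">";
        }
    }

    /**
     * Builds one new triple per target subject whose literal is strictly equal
     * to a literal of the reference data set.
     *
     * @param targetValues    subject URI -> string literal of the target predicate
     * @param targetPredicate URI of the predicate whose object is to be replaced
     * @param referenceValues string literal -> URI of the matching reference entity
     */
    static List<Triple> matchExactly(Map<String, String> targetValues,
                                     String targetPredicate,
                                     Map<String, String> referenceValues) {
        List<Triple> links = new ArrayList<>();
        for (Map.Entry<String, String> entry : targetValues.entrySet()) {
            // Strict equality only: no normalization and no similarity measure.
            String referenceUri = referenceValues.get(entry.getValue());
            if (referenceUri != null) {
                links.add(new Triple(entry.getKey(), targetPredicate, referenceUri));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<String, String> target = new LinkedHashMap<>();
        target.put("http://example.org/passim/service/1", "Ain");
        target.put("http://example.org/passim/service/2", "Unknown place");

        Map<String, String> reference = new LinkedHashMap<>();
        reference.put("Ain", "http://example.org/geo/departement/01");

        for (Triple link : matchExactly(target, "http://example.org/passim/department", reference)) {
            System.out.println(link);
        }
    }
}
```

Values without an equal counterpart in the reference data are simply skipped, so the corresponding tuples keep their string values.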

Export the output

Once the new links are generated, they can be used in two ways: by retrieving and storing them for later use, or by directly updating the data.

These output types are designed for different use cases, and it is fairly easy to create new ones to satisfy new needs.

An example

The PASSIM data set is a French public transport directory. Its data is available as a CSV file describing 1400 transportation services along with related data: coverage of the service, main city, area, website, etc.

PASSIM data has been converted to raw RDFXML with Datalift using a custom-made PASSIM ontology. We now want to link it with the Geo INSEE data set, which is managed by INSEE, the French official statistics and census office.

The data state

The passim:region predicate is already linked to French regions in DBpedia, but we still have to link passim:department to the geo:Departement class, and passim:centerTown and passim:cityThrough to the geo:Commune class, both defined in the Geo INSEE ontology:

Creating DataSets

To create our new links, we first set up two DataSets. The kind of DataSet to use depends on your specific needs, but it seems logical to use a SesameDataSet if you are using a Sesame server and an RDFDataSet otherwise. If the data isn't already stored in the data set, we can add it by calling addRDFXMLTuples with the paths to our RDFXML files as parameters.

Linking them together

To create our links, we create a new Linkage with our two DataSets as parameters. In our specific case, we have to find the values of the given predicates for specific types, so we use a TypedLinkage. A StandardLinkage would also work, because no cities are named after departments or regions, but selecting less data is faster. Once our Linkage is ready, we call generateLinks to retrieve the URIs of the Geo INSEE entities.
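To make the difference between an untyped and a typed selection concrete, here is a small Java sketch that builds the kind of SPARQL SELECT query each approach could send. The queries StringToURI really issues are not shown in this document, so the class LinkageQuerySketch, the method selectValues and the example.org URIs are assumptions made for illustration.

```java
// Hypothetical sketch: shows how restricting a selection to one type narrows the query.
final class LinkageQuerySketch {

    /**
     * Builds a SELECT query returning every subject and value of a predicate.
     *
     * @param predicate URI of the predicate whose values are wanted
     * @param type      URI of the class to restrict subjects to, or null for no restriction
     */
    static String selectValues(String predicate, String type) {
        StringBuilder query = new StringBuilder("SELECT ?s ?o WHERE { ");
        if (type != null) {
            // Typed variant: only subjects declared as instances of the given class.
            query.append("?s a <").append(type).append("> . ");
        }
        query.append("?s <").append(predicate).append("> ?o }");
        return query.toString();
    }

    public static void main(String[] args) {
        // Placeholder URIs, not the real PASSIM or Geo INSEE namespaces.
        String predicate = "http://example.org/geo/name";
        System.out.println(selectValues(predicate, null));
        System.out.println(selectValues(predicate, "http://example.org/geo/Departement"));
    }
}
```

The typed query adds one triple pattern, which lets the store discard entities of other types before their values are compared.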

Processing the results

We now have a collection of Statements ordered by subject. We can process these new statements using the Output class. The type of Output to choose depends directly on the storage solution of the data set to be updated: Sesame or RDFXML files. Once the Output is chosen, the getOutput method returns the modifications. If the data set is made from an RDFXML file, an RDFOutput lets you retrieve a new file with updated values. If the data is stored inside a Sesame repository, a SPARQLOutput gives you the DELETE/INSERT queries and lets you update the data directly by calling updateDataSet.

PASSIM is now linked to Geo INSEE!
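As an illustration of what a DELETE/INSERT pair for one generated link might look like, the following Java sketch builds a SPARQL 1.1 Update string by hand. It is not the code of SPARQLOutput: the class SparqlUpdateSketch, its methods and the example.org URIs are hypothetical, and the old value is assumed to be a plain literal without language tag or datatype.

```java
// Hypothetical sketch: builds the update that swaps a string value for a URI.
final class SparqlUpdateSketch {

    /** Escapes backslashes and double quotes so the literal can be quoted in SPARQL. */
    static String escape(String literal) {
        return literal.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Returns two SPARQL 1.1 Update operations separated by ';':
     * the first removes the old string value, the second inserts the link.
     */
    static String replaceLiteralWithUri(String subject, String predicate,
                                        String oldLiteral, String newUri) {
        String prefix = "<" + subject + "> <" + predicate + "> ";
        return "DELETE DATA { " + prefix + "\"" + escape(oldLiteral) + "\" } ;\n"
             + "INSERT DATA { " + prefix + "<" + newUri + "> }";
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        System.out.println(replaceLiteralWithUri(
                "http://example.org/passim/service/1",
                "http://example.org/passim/department",
                "Ain",
                "http://example.org/geo/departement/01"));
    }
}
```

With the Sesame API, a string like this could be passed to RepositoryConnection.prepareUpdate and executed; whether updateDataSet does exactly that is not stated in this document.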

System requirements

To develop StringToURI:

To install a Sesame server:

Used during development:

Javadoc
http://stringtouri-javadoc.assembla.me
SVN Repository
http://subversion.assembla.com/svn/stringtouri/
Wiki
http://www.assembla.com/spaces/stringtouri/wiki/StringToURI
Related files
http://www.assembla.com/spaces/stringtouri/documents/
Sesame API Javadoc
http://www.openrdf.org/doc/sesame2/api/
Sesame user guide
http://www.openrdf.org/doc/sesame2/users/
Interesting data sets
http://telegraphis.net/data/
Datalift project
http://datalift.org/
Datalift installation tutorial
https://www.youtube.com/watch?v=l-hvHT7ZrfY
StringToURI demonstration
https://www.youtube.com/watch?v=idzSEpPswTc