Multisensor RDF Application Profile

Table of Contents

HTML Rendered version

1 Intro

The Multisensor project analyzes and extracts data from mass- and social media documents (so-called SIMMOs), including text, images and video, speech recognition and translationn, across several languages. It also handles social network data, statistical data, etc.

Early on the project made the decision that all data exchanged between project partners (between modules inside and outside the processing pipeline) will be in RDF JSONLD format. The final data is stored in a semantic repository and is used by various User Interface components for end-user interaction. This final data forms a corpus of semantic data over SIMMOs and is an important outcome of the project.

The flexibility of the semantic web model has allowed us to accommodate a huge variety of data in the same extensible model. We use a number of ontologies for representing that data: NIF and OLIA for linguistic info, ITSRDF for NER, DBpedia and Babelnet for entities and concepts, MARL for sentiment, OA for image and cross-article annotations, W3C CUBE for statistical indicators, etc. In addition to applying existing ontologies, we extended them by the Multisensor ontology, and introduced some innovations like embedding FrameNet in NIF.

The documentation of this data has been an important ongoing task. It is even more important towards the end of the project, in order to enable the efficient use of MS data by external consumers.

This document describes the different RDF patterns used by Multisensor, and how the data fits together. Thus it represents an "RDF Application Profile" for Multisensor. We use an example-based approach, rather than the more formal and labourious approach being standardized by the W3C RDF Shapes working group (still in development).

We cover the following areas:

  • Linguistic Linked Data in NLP Interchange Format (NIF), including Part of Speech (POS), dependency parsing, sentiment, Named Entity Recognition (NER), etc
  • Speech recognition, translation
  • Multimedia binding and image annotation
  • Statistical indicators and similar data
  • Social network popularity and influence, etc

A few technical aspects (eg Graph Normalization) are also covered.

The document is developed in Emacs Org-mode using a Literate Programming style. Illustrative code and ontology fragments use RDF Turtle, and are assembled ("tangled") using Org Babel support to the ./img folder.

Figures are made with the rdfpuml tool from actual Turtle. This not only saves time, but also guarantees consistency of the figures with the examples and the ontology.

1.1 SPARQL Queries

Two companion documents describe various queries used by the system:

The queries can be tried against the SPARQL endpoint http://multisensor.ontotext.com/sparql:

  • most data is in repository ms-final-prototype
  • statistical data is in repository multisensor-dbpedia

1.2 Links

1.3 References

2 Linguistic Linked Data

There's been a huge drive in recent years to represent NLP data as RDF. NLP data is usually large, so does it make sense to represent it as RDF? What's the benefit?

  • Ontologies, schemas and groups include: GRaF ITS2 FISE LAF LD4LT LEMON LIME LMF MARL NERD NIF NLP2RDF OLIA OntoLex OntoLing OntoTag Penn Stanford… my oh my!
  • There are a lot of linguistic resources available that can be used profitably: BabelNet FrameNet GOLD ISOcat LemonUBY Multitext OmegaNet UBY VerbNet Wiktionary WordNet.

The benefit is that RDF offers a lot of flexibility for combining data on many different topics in one graph.

The NLP Interchange Format (NIF) is the pivot ontology that allows binding to text, and integrates many other aspects

2.1 NIF Issues

As any new technology, NIF has various issues. A new version NIF 2.1 is in development and has raised its own issues. Issues that we have posted about NIF:

3 Prefixes

Multisensor uses the following prefixes. We define the prefixes once, and then can use them in Turtle examples without redefining them, guaranteeting consistency. When this file is loaded in GraphDB, we can also make queries without worrying about the prefixes, and the ontologies are used for auto-completion of class and property names.

./img/prefixes.ttl

A lot of the prefixes are registered in prefix.cc and can be obtained with a URL like this:

3.1 Multisensor Ontologies

UPF has defined a number of subsidiary ontologies related to Dependency Parsing (dep) and FrameNet (frame and fe).

# Multisensor ontologies
@prefix ms:             <http://data.multisensorproject.eu/ontology#> .
@prefix upf-deep:       <http://taln.upf.edu/upf-deep#> .
@prefix upf-dep-syn:    <http://taln.upf.edu/olia/penn-dep-syntax#> .
@prefix upf-dep-deu:    <http://taln.upf.edu/upf-dep-deu#> .
@prefix upf-dep-spa:    <http://taln.upf.edu/upf-dep-spa#> .
@prefix upf-pos-deu:    <http://taln.upf.edu/upf-pos-deu#> .
@prefix upf-pos-spa:    <http://taln.upf.edu/upf-pos-spa#> .
@prefix fe-upf:         <http://taln.upf.edu/frame-element#> . 
@prefix frame-upf:      <http://taln.upf.edu/frame#> .

The Multisensor ontology gathers various classes and properties. Here we define its metadata (header), while classes and properties are defined in later sections on as-needed basis.

./img/ontology.ttl

ms: a owl:Ontology;
  rdfs:label "Multisensor Ontology";
  rdfs:comment "Defines various classes and properties used by the FP7 Multisensor project";
  rdfs:seeAlso <http://multisensorproject.eu>, <http://github.com/VladimirAlexiev/multisensor/>;
  dct:creator  <http://multisensorproject.eu>, <mailto:Vladimir.Alexiev@ontotext.com>;
  dct:created  "2016-06-20"^^xsd:date;
  dct:modified "2016-10-20"^^xsd:date;
  owl:versionInfo "1.0".

3.2 Multisensor Datasets

The main Multisensor dataset is ms-content: which includes the annotated news content (SIMMOs). Additional data for social networks, annotations and locally defined concepts is also kept.

# Multisensor datasets
@prefix ms-annot:       <http://data.multisensorproject.eu/annot/>.
@prefix ms-content:     <http://data.multisensorproject.eu/content/>.
@prefix ms-concept:     <http://data.multisensorproject.eu/concept/>.
@prefix ms-soc:         <http://data.multisensorproject.eu/social/> .

3.3 External Datasets

We use two well-known LOD datasets for NER references (Babelnet bn and DBpedia dbr). Yago is used only in examples. In addition, we make up a namespace for Twitter tags and users.

# External datasets
@prefix bn:             <http://babelnet.org/rdf/> .
@prefix dbr:            <http://dbpedia.org/resource/> .
@prefix dbc:            <http://dbpedia.org/resource/Category:> .
@prefix yago:           <http://yago-knowledge.org/resource/> .
@prefix twitter:        <http://twitter.com/> .
@prefix twitter_tag:    <http://twitter.com/hashtag/> .
@prefix twitter_user:   <http://twitter.com/intent/user?user_id=> .

3.4 Common Ontologies

The following ontologies are commonly used

# Commonly used ontologies
@prefix bibo:           <http://purl.org/ontology/bibo/>.
@prefix dbo:            <http://dbpedia.org/ontology/> .
@prefix dbp:            <http://dbpedia.org/property/> .
@prefix dc:             <http://purl.org/dc/elements/1.1/> .
@prefix dct:            <http://purl.org/dc/terms/> .
@prefix dctype:         <http://purl.org/dc/dcmitype/>.
@prefix dqv:            <http://www.w3.org/ns/dqv#> .
@prefix foaf:           <http://xmlns.com/foaf/0.1/> .
@prefix ldqd:           <http://www.w3.org/2016/05/ldqd#> .
@prefix prov:           <http://www.w3.org/ns/prov#>.
@prefix schema:         <http://schema.org/> .
@prefix sioc:           <http://rdfs.org/sioc/ns#>.
@prefix skos:           <http://www.w3.org/2004/02/skos/core#>.

# System ontologies
@prefix rdf:            <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:           <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:            <http://www.w3.org/2002/07/owl#> .
@prefix xsd:            <http://www.w3.org/2001/XMLSchema#> .

3.5 Linguistic and Annotation ontologies

These ontologies are the main "work-horse" in Multisensor, i.e. used to represent the majority of the data. NIF 2.1 RC1 (see section 2.1) defined a nif-ann: namespace that is used in MS. Later NIF 2.1 abandoned the use of this prefix (instead falling back to the original nif: namespace), but unfortunately MS data could not be updated to use nif: only.

# Linguistic and Annotation ontologies
@prefix fam:            <http://vocab.fusepool.info/fam#>.
@prefix fise:           <http://fise.iks-project.eu/ontology/>.
@prefix its:            <http://www.w3.org/2005/11/its/rdf#> .
@prefix marl:           <http://purl.org/marl/ns#> .
@prefix nerd:           <http://nerd.eurecom.fr/ontology#> .
@prefix nif:            <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix nif-ann:        <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-annotation#>.
@prefix oa:             <http://www.w3.org/ns/oa#>.
@prefix olia:           <http://purl.org/olia/olia.owl#> .
@prefix penn:           <http://purl.org/olia/penn.owl#> .
@prefix stanford:       <http://purl.org/olia/stanford.owl#>.

# FrameNet ontologies
@prefix fn:             <http://www.ontologydesignpatterns.org/ont/framenet/tbox/> .
@prefix frame:          <http://www.ontologydesignpatterns.org/ont/framenet/abox/frame/> .
@prefix fe:             <http://www.ontologydesignpatterns.org/ont/framenet/abox/fe/> .
@prefix lu:             <http://www.ontologydesignpatterns.org/ont/framenet/abox/lu/> .
@prefix st:             <http://www.ontologydesignpatterns.org/ont/framenet/abox/semType/> .

3.6 Statistical Ontologies

Multisensor uses statistical ontologies for representing Decision Support data.

# Common statistical ontologies (CUBE, SDMX)
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-code:      <http://purl.org/linked-data/sdmx/2009/code#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure:   <http://purl.org/linked-data/sdmx/2009/measure#> .

3.7 Eurostat Statistical Parameters

One of the main sources of statistical data is Eurostat. We also use some of their parameters (eg eugeo:) to represent our own stats datasets.

# Eurostat statistics
@prefix eu:             <http://eurostat.linked-statistics.org/dic/>.
@prefix eu_age:         <http://eurostat.linked-statistics.org/dic/age#>.
@prefix eu_adj:         <http://eurostat.linked-statistics.org/dic/s_adj#>.
@prefix eu_flow:        <http://eurostat.linked-statistics.org/dic/stk_flow#>.
@prefix eu_partner:     <http://eurostat.linked-statistics.org/dic/partner#>.
@prefix eudata:         <http://eurostat.linked-statistics.org/data/> .
@prefix eugeo:          <http://eurostat.linked-statistics.org/dic/geo#>.
@prefix indic:          <http://eurostat.linked-statistics.org/dic/indic#>.
@prefix indic_bp:       <http://eurostat.linked-statistics.org/dic/indic_bp#>.
@prefix indic_et:       <http://eurostat.linked-statistics.org/dic/indic_et#>.
@prefix indic_is:       <http://eurostat.linked-statistics.org/dic/indic_is#>.
@prefix indic_na:       <http://eurostat.linked-statistics.org/dic/indic_na#>.
@prefix prop:           <http://eurostat.linked-statistics.org/property#> .
@prefix unit:           <http://eurostat.linked-statistics.org/dic/unit#> .

3.8 WorldBank Statistical Parameters

WorldBank stats (as represented at Cerven Capadisli's site 270a) is also used by the project.

# WorldBank statistics
@prefix country:        <http://worldbank.270a.info/classification/country/> .
@prefix indicator:      <http://worldbank.270a.info/classification/indicator/> .
@prefix property:       <http://worldbank.270a.info/property/> .

3.9 ComTrade and Distance Statistics

@prefix comtrade:        <http://comtrade.un.org/data/>.
@prefix ms-indic:        <http://data.multisensorproject.eu/indicators/> .
@prefix ms-comtr:        <http://data.multisensorproject.eu/comtrade/prop/> .
@prefix ms-comtr-commod: <http://data.multisensorproject.eu/comtrade/commodity/> .
@prefix ms-comtr-indic:  <http://data.multisensorproject.eu/comtrade/indicator/> .
@prefix ms-comtr-dat:    <http://data.multisensorproject.eu/comtrade/data/> .
@prefix ms-distance:     <http://data.multisensorproject.eu/distance/> .
@prefix ms-distance-dat: <http://data.multisensorproject.eu/distance/data/> .

3.10 Auxiliary Prefixes

puml is used in example files to give rdfpuml instructions that improve diagram appearance. We always put such instructions after a row of ####### to distinguish them from actual RDF data.

# Auxiliary prefixes
@prefix puml:           <http://plantuml.com/ontology#>.

3.11 JSONLD Context

Multisensor uses JSONLD for communication of RDF data between the various pipeline modules. We use a JSONLD Context that reflects the prefixes to shorten the representation. We make the context using the following command:

riot --formatted=jsonld prefixes.ttl > multisensor.jsonld

./img/multisensor.jsonld

4 NIF Example

This first example shows NLP data in RDF Turtle. It covers:

  • NIF (text binding)
  • OLIA (linguistic properties)
  • Penn (POS tagging)
  • Stanford (dependency parsing)
  • ITS20 (NER or semantic annotation)
  • NERD (NER classes)
  • Stanbol/FISE (multiple NLP tools/annotations per word/phrase)
  • MARL (opinion/sentiment)

It uses NE (entities) from DBpedia, WordNet, YAGO

4.1 JSONLD vs Turtle

While JSONLD is used by the MS partners for communicating data, it is harder to read than Turtle. Compare the example in Turtle to its JSONLD equivalent:

4.2 Example Context

Assume that http://example.com/blog/1 is a blog post with the text "Germany is the work horse of the European Union". First we represent the text as a whole. isString means this Context is considered equivalent to its string value. sourceUrl points to Where the text came from, same as the @base

@base <http://example.com/blog/1/> .
<#char=0,47> a nif:Context; # the complete text
  nif:isString "Germany is the work horse of the European Union";
  nif:sourceUrl <>.

4.3 String Position URLs

The recommended NIF URLs are position-based (following RFC 5147): <#char=x,y> . The count is 0-based and spaces are counted as 1 (NIF 2.0 Core Spec, String Counting and Determination of Length). Here are the positions of each word:

Germany is   the   work  horse of    the   European Union
0,7     8,10 11,14 15,19 20,25 26,28 29,32 33,41    42,47

All string URLs must refer to the context using referenceContext. ~beginIndex/endIndex are counted within the context's isString.

We indicate the datatype of beginIndex/endIndex explicitly, as specified in NIF, and unlike examples which omit it. Please note that in Turtle a number like 7 means xsd:integer, not xsd:nonNegativeInteger (see Turtle spec)

<#char=0,7>   a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "0"^^xsd:nonNegativeInteger;  nif:endIndex "7"^^xsd:nonNegativeInteger.
<#char=8,10>  a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "8"^^xsd:nonNegativeInteger;  nif:endIndex "10"^^xsd:nonNegativeInteger.
<#char=11,14> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "11"^^xsd:nonNegativeInteger; nif:endIndex "14"^^xsd:nonNegativeInteger.
<#char=15,19> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "15"^^xsd:nonNegativeInteger; nif:endIndex "19"^^xsd:nonNegativeInteger.
<#char=20,25> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "20"^^xsd:nonNegativeInteger; nif:endIndex "25"^^xsd:nonNegativeInteger.
<#char=26,28> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "26"^^xsd:nonNegativeInteger; nif:endIndex "28"^^xsd:nonNegativeInteger.
<#char=29,32> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "29"^^xsd:nonNegativeInteger; nif:endIndex "32"^^xsd:nonNegativeInteger.
<#char=33,41> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "33"^^xsd:nonNegativeInteger; nif:endIndex "41"^^xsd:nonNegativeInteger.
<#char=42,47> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "42"^^xsd:nonNegativeInteger; nif:endIndex "47"^^xsd:nonNegativeInteger.

We also introduce URLs for a couple of phrases.

<#char=15,25> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "15"^^xsd:nonNegativeInteger; nif:endIndex "25"^^xsd:nonNegativeInteger.
<#char=33,47> a nif:RFC5147String; nif:referenceContext <#char=0,47>;
  nif:beginIndex "33"^^xsd:nonNegativeInteger; nif:endIndex "47"^^xsd:nonNegativeInteger.

4.4 Position URLs to Word URLs

In this example we introduce word-based URLs, which make the following statements more clear (especially for Stanford Dependency Parse). owl:sameAs makes two resources equivalent, so all their statements are "smushed" between each other. The URLs don't make any semantic difference, and in the actual implementation we use only position-based URLs. We also introduce URLs for the text as a whole, and a couple of phrases.

<#char=0,47>  owl:sameAs <#ROOT-0>.
<#char=0,7>   owl:sameAs <#Germany-1>.
<#char=8,10>  owl:sameAs <#is-2>.
<#char=11,14> owl:sameAs <#the-3>.
<#char=15,19> owl:sameAs <#work-4>.
<#char=20,25> owl:sameAs <#horse-5>.
<#char=26,28> owl:sameAs <#of-6>.
<#char=29,32> owl:sameAs <#the-7>.
<#char=33,41> owl:sameAs <#European-8>.
<#char=42,47> owl:sameAs <#Union-9>.
<#char=15,25> owl:sameAs <#work-horse>.
<#char=33,47> owl:sameAs <#European-Union>.

4.5 Basic Text Structure

NLP tools would usually record whether each URL is a Word, Phrase, Sentence… URLs may state the corresponding word with nif:anchorOf. This is redundant, since it can be inferred from the context's isString and indexes, but is very useful for debugging, and Multisensor records it in production.

We can also record Lemma and Stemming

<#ROOT-0> a nif:Sentence. # no nif:anchorOf because it already has the mandatory subprop nif:isString
<#Germany-1>      nif:anchorOf "Germany";        a nif:Word.
<#is-2>           nif:anchorOf "is";             a nif:Word.
<#the-3>          nif:anchorOf "the";            a nif:Word.
<#work-4>         nif:anchorOf "work";           a nif:Word.
<#horse-5>        nif:anchorOf "horse";          a nif:Word.
<#of-6>           nif:anchorOf "of";             a nif:Word.
<#the-7>          nif:anchorOf "the";            a nif:Word.
<#European-8>     nif:anchorOf "European";       a nif:Word.
<#Union-9>        nif:anchorOf "Union";          a nif:Word.
<#work-horse>     nif:anchorOf "work horse";     a nif:Phrase.
<#European-Union> nif:anchorOf "European Union"; a nif:Phrase.

############################## Stemming/Lemmatization
<#Germany-1>    nif:lemma "Germany". # same for all words, except:
<#European-8>   nif:lemma "Europe".

# For a more interesting example, let's assume there's a 10th word "favourite".
<#favourite-10> nif:stem  "favourit". # Snowball Stemmer
<#favourite-10> nif:lemma "favorite". # Stanford Core NLP

4.6 Part of Speech

Let's represent some POS info using the Penn tagset. It is part of the OLIA Ontologies, and below we refer to files from that page.

The Penn parser produces his parse:

Germany/NNP is/VBZ the/DT work/NN horse/NN of/IN the/DT European/NNP Union/NNP

We represent it below using two properties:

  • nif:oliaLink is an owl:Individual representing the individual tag
  • nif:oliaCategory is an owl:Class representing the same

Since in penn.owl the individuals have the same rdf:type as the given class, you only need one, the other is redundant.

<#Germany-1>   nif:oliaLink penn:NNP; nif:oliaCategory penn:ProperNoun.
<#is-2>        nif:oliaLink penn:VBZ; nif:oliaCategory penn:BePresentTense.
<#the-3>       nif:oliaLink penn:DT;  nif:oliaCategory penn:Determiner.
<#work-4>      nif:oliaLink penn:NN;  nif:oliaCategory penn:CommonNoun. # POS is NN, but the syntactic role is Adjective
<#horse-5>     nif:oliaLink penn:NN;  nif:oliaCategory penn:CommonNoun.
<#of-6>        nif:oliaLink penn:IN;  nif:oliaCategory penn:PrepositionOrSubordinatingConjunction.
<#the-7>       nif:oliaLink penn:DT;  nif:oliaCategory penn:Determiner.
<#European-8>  nif:oliaLink penn:NNP; nif:oliaCategory penn:ProperNoun.
<#Union-9>     nif:oliaLink penn:NNP; nif:oliaCategory penn:ProperNoun.

One could consume POS at a higher level of abstraction: OLIA abstracts the particular POS tagset. penn-link.owl defines Penn classes as subclasses of OLIA classes, so one could consume OLIA only. If you produce a reduced tagset (eg only ProperNouns), use the OLIA class directly. Please note that nif:oliaLink (the individual) doesn't apply here, only nif:oliaCategory (the class).

<#Germany-1>   nif:oliaCategory olia:ProperNoun.
<#is-2>        nif:oliaCategory [owl:unionOf (olia:FiniteVerb olia:StrictAuxiliaryVerb)],
                             [a owl:Restriction; owl:onProperty olia:hasTense; owl:allValuesFrom olia:Present].
<#the-3>       nif:oliaCategory penn:Determiner. # "Not clear whether this corresponds to OLiA/EAGLES determiners"
<#work-4>      nif:oliaCategory olia:CommonNoun.
<#horse-5>     nif:oliaCategory olia:CommonNoun.
<#of-6>        nif:oliaCategory [owl:unionOf (olia:Preposition olia:SubordinatingConjunction)].
<#the-7>       nif:oliaCategory penn:Determiner. # "Not clear whether this corresponds to OLiA/EAGLES determiners"
<#European-8>  nif:oliaCategory olia:ProperNoun.
<#Union-9>     nif:oliaCategory olia:ProperNoun.

As you see above, the OLIA abstraction doesn't work perfectly in all cases:

  • penn:Determiner doesn't have an OLIA mapping
  • penn:PrepositionOrSubordinatingConjunction maps to a unionOf (disjunction), but you can't query by such class
  • penn:BePresentTense is worse: it's also a unionOf; further a reasoner will restrict any olia:hasTense property to have type olia:Present. But neither OLIA nor Penn define any values for that property!

Rather than using the PENN-OLIA mapping, we could attach NLP features directly to words, eg

<#is-2> a olia:Verb;
  olia:hasTense ms:VerySpecialPresentTense.
ms:VerySpecialPresentTense a olia:Present, olia:Tense.

4.7 Dependency Parse

The Stanford Dependency Parser produces the following parse:

(ROOT
  (S
    (NP (NNP Germany))
    (VP (VBZ is)
      (NP
        (NP (DT the) (NN work) (NN horse))
        (PP (IN of)
          (NP (DT the) (NNP European) (NNP Union)))))))

A while ago the details of the parse were (currently it's a bit different):

individual(gov,dep) class<superclass<superclass
nsubj(horse-5,Germany-1) NominalSubject<Subject<Argument<Dependent<DependencyLabel
cop(horse-5,is-2) Copula<Auxiliary<Dependent<DependencyLabel
det(horse-5,the-3) Determiner<Modifier<Dependent<DependencyLabel
nn(horse-5,work-4) NounCompoundModifier<Modifier<Dependent<DependencyLabel
root(ROOT-0,horse-5) Root<DependencyLabel
prep(horse-5,of-6) PrepositionalModifier<Modifier<Dependent<DependencyLabel
det(Union-9,the-7) Determiner<Modifier<Dependent<DependencyLabel
amod(Union-9,European-8) AdjectivalModifier<Modifier<Dependent<DependencyLabel
pobj(of-6,Union-9) ObjectOfPreposition<Object<Complement<Argument<Dependent<DependencyLabel

individual is an owl:Individual having the class (and all superclasses) as its type

There are two ways to represent this:

  • The easy way: use a single property (nif:dependency), attach the Stanford Dependency class to target
<#horse-5> nif:dependency <#Germany-1>.  <#Germany-1>  a stanford:NominalSubject.
<#horse-5> nif:dependency <#is-2>.       <#is-2>       a stanford:Copula.
<#horse-5> nif:dependency <#the-3>.      <#the-3>      a stanford:Determiner.
<#horse-5> nif:dependency <#work-4>.     <#work-4>     a stanford:NounCompoundModifier.
<#ROOT-0>  nif:dependency <#horse-5>.    <#horse-5>    a stanford:Root.
<#horse-5> nif:dependency <#of-6>.       <#of-6>       a stanford:PrepositionalModifier.
<#Union-9> nif:dependency <#the-7>.      <#the-7>      a stanford:Determiner.
<#Union-9> nif:dependency <#European-8>. <#European-8> a stanford:AdjectivalModifier.
<#of-6>    nif:dependency <#Union-9>.    <#Union-9>    a stanford:ObjectOfPreposition.
  • The hard way: make separate DependencyLabel nodes

We use <#individual(gov,dep)> as URL: that includes numbered words, so is guaranteed to be unique.

<#nsubj(horse-5,Germany-1)> a olia:Relation, stanford:NominalSubject;        olia:hasSource <#horse-5>; olia:hasTarget <#Germany-1>.
<#cop(horse-5,is-2)>        a olia:Relation, stanford:Copula;                olia:hasSource <#horse-5>; olia:hasTarget <#is-2>.
<#det(horse-5,the-3)>       a olia:Relation, stanford:Determiner;            olia:hasSource <#horse-5>; olia:hasTarget <#the-3>.
<#nn(horse-5,work-4)>       a olia:Relation, stanford:NounCompoundModifier;  olia:hasSource <#horse-5>; olia:hasTarget <#work-4>.
<#root(ROOT-0,horse-5)>     a olia:Relation, stanford:Root;                  olia:hasSource <#ROOT-0>;  olia:hasTarget <#horse-5>.
<#prep(horse-5,of-6)>       a olia:Relation, stanford:PrepositionalModifier; olia:hasSource <#horse-5>; olia:hasTarget <#of-6>.
<#det(Union-9,the-7)>       a olia:Relation, stanford:Determiner;            olia:hasSource <#Union-9>; olia:hasTarget <#the-7>.
<#amod(Union-9,European-8)> a olia:Relation, stanford:AdjectivalModifier;    olia:hasSource <#Union-9>; olia:hasTarget <#European-8>.
<#pobj(of-6,Union-9)>       a olia:Relation, stanford:ObjectOfPreposition;   olia:hasSource <#of-6>;    olia:hasTarget <#Union-9>.

We use the following class hierarchy: stanford:DependencyLabel<olia_sys:Feature<LinguisticAnnotation The latter is kind of like Relation, which has properties hasSource, hasTarget

4.8 NER Classes

There are two mechanisms to represent Named Entity Recognition.

If you can recognize only the entity type:

<#Germany-1>      its:taClassRef nerd:Country.
<#European-Union> its:taClassRef nerd:Country. # or AdministrativeRegion or Location

This uses the NERD ontology, which includes:

  • NERD Core (top-level) classes: Thing Amount Animal Event Function Location Organization Person Product Time
  • NERD specific classes: AdministrativeRegion Aircraft Airline Airport Album Ambassador Architect Artist Astronaut Athlete Automobile Band Bird Book Bridge Broadcast Canal Celebrity City ComicsCharacter Company Continent Country Criminal Drug EducationalInstitution EmailAddress FictionalCharacter Holiday Hospital Insect Island Lake Legislature Lighthouse Magazine Mayor MilitaryConflict Mountain Movie Museum MusicalArtist Newspaper NonProfitOrganization OperatingSystem Park PhoneNumber PoliticalEvent Politician ProgrammingLanguage RadioProgram RadioStation Restaurant River Road SchoolNewspaper ShoppingMall SoccerClub SoccerPlayer Software Song Spacecraft SportEvent SportsLeague SportsTeam Stadium Station TVStation TennisPlayer URL University Valley VideoGame Weapon Website

4.9 NER Individuals

If you can recognize specific entities in LOD datasets, you can capture such annotations, eg:

  • Wordnet RDF for phrases: search Wordnet, then pick the correct sense 104608649-n
  • DBpedia for real-word entities
  • Babelnet for phrases or real-word entities: search Babelnet, then pick the correct sense 00081596n
<#work-horse>     its:taIdentRef
  <http://wordnet-rdf.princeton.edu/wn31/104608649-n>, bn:s00081596n.
<#Germany-1>      its:taIdentRef dbr:Germany.
<#European-Union> its:taIdentRef dbr:European_union.

dbr:European_union a dbo:Country, dbo:Place, dbo:PopulatedPlace, yago:G20Nations, yago:InternationalOrganizationsOfEurope. # etc
dbr:Germany        a dbo:Country, dbo:Place, dbo:PopulatedPlace, yago:FederalCountries, yago:EuropeanUnionMemberEconomies. # etc

These LOD sources include useful info, eg:

  • Wordnet's 104608649-n.ttl has:
  • DBpedia has info about population, area, etc. It also has extensive class info as shown above, so there's no need to use its:taClassRef nerd:Country. But other NERD classes may be useful, eg Phone, Email: for those you can't refer to DBpedia and must use its:taClassRef
  • Babelnet has links to Wordnet, DBpedia, Wikipedia categories, skos:broader exracted from Wordnet/DBpedia, etc. The Multisensor Entity Lookup service uses Babelnet since it's a more modern resource integrating the two above, plus more

4.10 NER Provenance

We can record the tool that created the NER annotation and its confidence. MS uses up to two tools for NER: Linguatec and Babelnet.

  • The Linguatec annotation is always emitted first, directly over the nif:Word or nif:Phrase
  • The Babelnet is emitted second. If there is no Linguatec annotation for the same nif:Phrase, it's emitted directly over the nif:Phrase. If there are two annotations, the Babelnet annotation is emitted indirectly, over a nif:AnnotationUnit.

Here we show the more complex latter case.

<#Germany-1> a nif:Word;
  its:taIdentRef dbr:Germany;
  nif-ann:provenance <http://linguatec.com>;
  nif-ann:confidence "0.9"^^xsd:double;
  nif-ann:annotationUnit <#Germany-1-annot-Babelnet>.

<#Germany-1-annot-Babelnet> a nif-ann:AnnotationUnit;
  its:taIdentRef bn:sTODO;
  nif-ann:provenance <http://babelfy.org/>;
  nif-ann:confidence "1.0"^^xsd:double.

Note: previously we used the NIF Stanbol profile (FISE) instead of nif-ann, eg:

<#Germany-1-enrichment-1>
  a fise:EntityAnnotation;
  fise:extracted-from <#Germany-1>;
  fise:entity-type nerd:Country;
  fise:entity-reference dbr:Germany;
  dct:creator <http://babelnet.org>;
  fise:confidence "1.0"^^xsd:float.

But it is a bad practice to use two completely different property sets for two similar situations. Just because in the second case there's an intermediate node for the annotation, doesn't mean the properties should be completely different. We posted that as a NIF issue: specification/issues/2

4.11 Sentiment Analysis with MARL

Assume there are some comments about our blog, which we represent using SIOC. Comments are a sort of sioc:Post, since there is no separate sioc:Comment class

<comment/1> a sioc:Post;
  sioc:reply_of <>;
  sioc:has_creator <http://example.com/users/Hans>;
  sioc:content "Yes, we Germans are the hardest-working people in the world".
<comment/2> a sioc:Post;
  sioc:reply_of <>;
  sioc:has_creator <http://example.com/users/Dimitrios>;
  sioc:content "Bullshit! We Greeks are harder-working".

Now assume a sentiment analysis algorithm detects the sentiment of the comment posts. We represent them using MARL.

<opinion/1> a marl:Opinion;
  marl:extractedFrom <comment/1>;
  marl:describesObject <>;
  marl:opinionText "Yes";
  marl:polarityValue 0.9;
  marl:minPolarityValue -1;
  marl:maxPolarityValue 1;
  marl:hasPolarity marl:Positive.
<opinion/2> a marl:Opinion;
  marl:extractedFrom <comment/2>;
  marl:describesObject <>;
  marl:opinionText "Bullshit!";
  marl:polarityValue -1;
  marl:minPolarityValue -1;
  marl:maxPolarityValue 1;
  marl:hasPolarity marl:Negative.

Note: the following properties are useful for sentiment about vendors (eg AEG) or products (eg appliances):

  • marl:describesObject (eg laptop)
  • marl:describesObjectPart (eg battery, screen)
  • marl:describesFeature (eg for battery: battery life, weight)

Often it's desirable to aggregate opinions, so one doesn't have to deal with individual opinions (marl:aggregatesOpinion is optional)

<opinions> a marl:AggregatedOpinion;
  marl:describesObject <>;
  marl:aggregatesOpinion <opinion/1>, <opinion/2>; # can skip
  marl:opinionCount 2;
  marl:positiveOpinionsCount 1; # sic, this property is spelled in plural
  marl:negativeOpinionCount 1;
  marl:polarityValue -0.05; # simple average
  marl:minPolarityValue -1;
  marl:maxPolarityValue 1;
  marl:hasPolarity marl:Neutral.

4.12 Sentiment Analysis in NIF

NIF integrates MARL using property nif:opinion from nif:String to marl:Opinion. But that's declared inverseOf marl:extractedFrom, which in the MARL example points to sioc:Post (not the nif:String content of the post). So something doesn't mesh here (specification/issues/1). We could mix SIOC and NIF properties on <comment/1>, but then nif:sourceUrl would point to itself…

<comment/1> a nif:Context;
  nif:sourceUrl <comment/1>;
  nif:isString "Yes, we Germans are the hardest-working people in the world";
  nif:opinion <opinion/1>.
<comment/2> a nif:Context;
  nif:sourceUrl <comment/2>;
  nif:isString "Bullshit! We Greeks are harder-working";
  nif:opinion <opinion/2>.

It may be more meaningful to use NIF to express which word carries the opinion (like marl:opinionText)

<comment/1#char=0,> a nif:Context;
  nif:sourceUrl <comment/1>;
  nif:isString "Yes, we Germans are the hardest-working people in the world".
<comment/1#char=0,3> a nif:String;
  nif:referenceContext <comment/1#char=0,>;
  nif:anchorOf "Yes";
  nif:opinion <opinion/1>.

<comment/2#char=0,> a nif:Context;
  nif:sourceUrl <comment/2>;
  nif:isString "Bullshit! We Greeks are harder-working".
<comment/2#char=0,9> a nif:String;
  nif:referenceContext <comment/2#char=0,>;
  nif:anchorOf "Bullhshit!";
  nif:opinion <opinion/2>.

5 RDF Validation

All generated NIF files should be validated, to avoid mistakes propagating between the pipeline modules. We first tried the NIF Validator, but quickly switched to RDFUnit validation.

5.1 NIF Validator

The basic NIF validator is part of the NIF distribution (doc, software, tests). Unfortunately there are only 11 tests, so it's not very useful

  • You can understand the tests just by reading the error messages, e.g.
    nif:anchorOf must match the substring of nif:isString calculated with begin and end index
    
  • It says "json-ld not implemented yet", so we need to convert to ttl first (I use apache-jena-2.12.1)
    rdfcat -out ttl test-out.jsonld | java -jar validate.jar -i - -o text
    

5.2 RDFUnit Validation

A much better validator is RDFUnit (home, demo, source, paper NLP data cleansing based on Linguistic Ontology constraints) This is implemented in the Multisensor RDF_Validation_Service

I tried their demo site with some examples.

  1. Data Selection> Direct Input> Turtle> Load
    Data loaded successfully! (162 statements)
    
  2. Constraints Selection> Automatic> Load
    Constraints loaded successfully: (foaf, nif, itsrdf, dcterms)
    
  3. Test Generation
    Completed! Generated 514 tests                 
    

    (That's a lot of tests!)

  4. Testing> Report Type> Status (all)> Run Tests
    Total test cases 514, Succeeded 507, Failed 7  
    

    (Those "Succeeded" also in many cases mean errors)

5.2.1 Generated Tests per Ontology

RDFunit-NIF-tests.png

URI Automatic Manual
http://xmlns.com/foaf/0.1/ 174 -
http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core# 199 10
http://www.w3.org/2005/11/its/rdf# 75 -
http://purl.org/dc/terms/ 56 -
http://www.w3.org/2006/time# 183 -
http://dbpedia.org/ontology/ 9281 14

(Even though I canceled dbo generation prematurely.)

This is too much for us, we don't want the DBO tests. In particular, the Status (all) report includes a lot of "violations" that come from ontologies not from our data.

5.2.2 RDFUnit Test Results

Here are the results. "Resources" is a simple tabular format (basically URL & error), "Annotated Resources" provides more detail (about the errors pertaining to each URL)

Source File Status Annotated Resources
./img/NIF-test1.ttl ./img/NIF-test1-out.xls ./img/NIF-test1-annotated.ttl
./img/NIF-test2.ttl ./img/NIF-test2-out.xls ./img/NIF-test2-annotated.ttl

5.3 Manual Validation

In addition to RDFUnit validation, we used a lot of manual validation to check for semantic (as opposed to syntactic) errors. The RDF_Validation page describes a workflow for preparing pipeline results for validation.

Please post only Turtle files, not JSON files since they are impossible to check manually.

  • Get Jena (eg apache-jena-3.0.0.tar.gz), unzip it somewhere and add the bin directory to your path. We'll use RIOT (RDF I/O Tool).
  • Get Turtle: You can get a Turtle representation of the SIMMO in one of two ways

5.3.1 Get Turtle from Store

  • Store the SIMMO using the RDF Storing Service
  • Get the SIMMO out using a query like this (saved as "a SIMMO graph"), and then save the result as file-noprefix.ttl (Turtle).
<pre>construct {?s ?p ?o} 
where {graph <http://data.multisensor.org/content/8006dcd60b292feaaef24abc9ec09e2230aab83e> 
  {?s ?p ?o}}
  • There's also a REST call to get the SIMMO out that's easier to use from the command line

5.3.2 Get Turtle from SIMMO JSON

  • get the content of the "rdf" key out of the SIMMO JSON. Unescape quotes. Save as file.jsonld So instead of this:
    "rdf":["[{\"@id\":\"http://data.multisensor...[{\"@value\":\"Germany\"}]}]"],"category":""}</pre>
    

    You need this:

    [{"@id":"http://data.multisensor...[{"@value":"Germany"}]}]
    
  • You can do this manually, or with RIOT that can convert the stringified RDF field into more readable JSONLD format:
    riot --output=jsonld rdf_output_string.jsonld > new_readable_file.jsonld
    

    Instead of a single string, the results will be displayed as:

    "@graph" : [ {
      "@id" : "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#Amount=10000_Euro",
      "@type" : [ "http://schema.org/QuantitativeValue", "http://nerd.eurecom.fr/ontology#Amount" ],
      "name" : "10000 Euro"
    }, {
      "@id" : "http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304#Amount=2000_Euro",
      "@type" : [ "http://schema.org/QuantitativeValue", "http://nerd.eurecom.fr/ontology#Amount" ],
      "name" : "2000 Euro"
    }, {...
    

No matter which of the two methods you used, the rest is the same

  • Validate it with RIOT: this is optional but recommended
    riot --validate file.jsonld
    
  • Convert to Turtle. Omit "WARN riot" lines which would make the Turtle invalid
    riot --output turtle file.jsonld | grep -v "WARN  riot" > file-noprefix.ttl
    

5.3.3 Prettify Turtle

Unfortunately this file doesn't use prefixes, so the URLs are long and ugly

  • Download ./img/prefixes.ttl (this file is updated about once a month)
  • Concat the two:
    cat prefixes.ttl file-noprefix.ttl > file-withprefix.ttl
    
  • Prettify the Turtle to make use of the prefixes and to group all statements of the same subject together:
    riot --formatted=turtle file-withprefix.ttl > file.ttl
    

Optional manual edits:

  • Add on top a base, using the actual SIMMO base, eg
    @base <http://data.multisensorproject.eu/content/53a0938bc4770c6ba0e7d7b9ca88a637f9e9c304>.
    
  • Replace this string with "" (I don't know why RIOT doesn't use the base, even if I specify the –base option)
  • Sort paragraphs (i.e. statement clusters)

Post in Jira that last prettified file.ttl. Thanks!

6 SIMMO

Multisensor crawls and analyzes news items and social network posts, collectively called SIMMOs. Each SIMMO has a GUID URL in the ms-content: namespace. Below we assume the @base is set to the SIMMO URL, eg

@base <http://data.multisensorproject.eu/content/04858f1e0cbc73ab672b1f6acab05afe2c18b0ae>.

Therefore <> and ms-content:<GUID> mean the same URL.

6.1 Graph Handling

The RDF_Storing_Service saves all data about a SIMMO in a named graph having the same URL as the SIMMO base URL. This makes it easy to get or overwrite all data about the SIMMO.

6.1.1 PUT vs POST

The RDF_Storing_Service supports the SPARQL Graph Store Protocol:

  • GET to get a graph
  • PUT to write or overwrite a graph (see HTTP PUT in the above specification)
  • POST to add data to he graph

In addition to NLP and NER results over the SIMMO (article), the same graph accommodates image annotations, and NLP/NER of video transcripts/ASR. Therefore one should use a sequence like this to write all of the data:

  • PUT SIMMO
  • POST ASR0
  • POST ASR1 …
  • POST image0 annotation
  • POST image1 annotation …

6.1.2 Graph Normalization

Submitting all SIMMO info in one graph makes storing it easier, but it also leads to duplication of common triples. Eg consider this:

<#char=100,107> its:taIdentRef dbr:Germany.
dbr:Germany a nerd:Location; foaf:name "Germany". # Common triples

If dbr:Germany appears 1000 times in SIMMOs, these common triples will be duplicated 1000 times in different named graphs. This leads to extreme slowness of ElasticSearch indexing: when adding the 1000th occurrence of dbr:Germany it indexes (the same) foaf:name "Germany" 1000 times, i.e. storing time grows potentially quadratically with the number of SIMMOs.

The fix we implemented is graph normalization: the storing service examines every triple <s,p,o>.

  • If s starts with one of these prefixes the triple is stored in the default graph:
    http://dbpedia.org
    http://babelnet.org
    
  • Otherwise the triple is stored in the SIMMO graph.

This still writes common triples 1000 times, but there is no duplication since a triple can exist only once in a given graph.

  • Note: some SIMMOs contain subjects that don't have the SIMMO base URL as prefix, namely embedded videos and images. It's not correct to move them to the default graph, so we work with an explicit list of common prefixes.
6.1.2.1 Query Changes

The tradeoff is that you won't be able to get all SIMMO data by simply asking for a graph. Eg query 2.3 Retrieve NEs (Select) was a bit sloppy, since it asked for certain types (and foaf:name) by graph, without looking for any relation:

SELECT DISTINCT ?ne ?type ?name {
  GRAPH <> {
    ?ne a ?type; foaf:name ?name
    FILTER (?type IN (dbo:Person, dbo:Organization, nerd:Amount, nerd:Location, nerd:Time))}}

After graph normalization is applied, we need to find the NEs by relation its:taIdentRef, and get their common triples from outside the SIMMO graph:

SELECT distinct ?ne ?type ?name {
  GRAPH <> {
    [] its:taIdentRef ?ne.
    ?ne a ?type}
    FILTER (?type IN (dbo:Person, dbo:Organization, nerd:Amount, nerd:Location, nerd:Time))}
  ?ne foaf:name ?name
}

(This query works with or without graph normalization, since the part outside GRAPH {..} looks in all graphs, both SIMMO and default).

6.1.2.2 Normalization Problems

Moving common triples outside of the SIMMO graph raises two problems:

  • If you examine the results of the query above, you'll see that some entities (eg dbr:Facebook) have several labels, eg
    "Facebook, Inc."@en
    "Facebook"^^xsd:string
    

    The reason is that different SIMMOs have different versions of the label, and different versions of the pipeline emit different literals ("en" language vs xsd:string). Both of these labels will be indexed in ElasticSearch for all occurrences of this NE. The pipeline has emitted the labels globally (as foaf:name of dbr:Facebook) rather than locally (eg as nif:anchorOf), in effect asserting that both are globally valid labels of Facebook. So that's a correct consequence of the data as stored.

  • If the last SIMMO referring to a global NE is deleted, that NE will remain as "garbage" in the common graph. But I don't think that is a significant problem, since the amount of such "garbage" won't be large, and since it is harmless.

6.2 SIMMO Context

The basic structure of a SIMMO consists of:

  • a foaf:Document describing the source document. We use dc: for literals and dct: for resources (URLs)
  • a nif:Context describing the full text of the article (the full text of any video transcriptions is separate).
graph <> {
  <> a foaf:Document;
    dc:type "article";
    dc:language "en";
    dbp:countryCode "GB";
    dc:source "Guardian";
    dct:source <https://www.theguardian.com/uk-news/2016/jun/20/zane-gbangbola-inquest-neighbour-hydrogen-cyanide>;
    dc:creator "Caroline Davies";
    dc:date    "2016-06-20T18:45:07.000+02:00"^^xsd:dateTime;
    dct:issued "2016-06-30T12:34:56.000+02:00"^^xsd:dateTime; 
    dc:title "I was told I might have 20 minutes to live, neighbour tells Zane inquest";
    dc:description "Zane’s parents Kye Gbangbola (front centre) and Nicole Lawler (right) at a protest...";
    dc:subject "Lifestyle & Leisure";
    schema:keywords "UK news, Zane Gbangbola, Hydrogen Cyanide".

<#char=0,5307> a nif:RFC5147String, nif:Context;
  nif:sourceUrl <> .
  nif:beginIndex "0"^^xsd:nonNegativeInteger;
  nif:endIndex "5307"^^xsd:nonNegativeInteger;
  nif:isString """I was told I might have 20 minutes to live, neighbour tells Zane inquest
Zane’s parents Kye Gbangbola (front centre) and Nicole Lawler (right) at a protest in 2014.
Photograph: Lauren Hurley/PA ...""".
}

Explanation:

element meaning
foaf:Document Basic SIMMO metadata
dc:type kind of SIMMO
dc:language Language of content
dbp:countryCode Code of originaing country
dc:source Literal identifying the source (newspaper or social network)
dct:source URL of source article
dc:creator Author: journalist, blogger, etc
dc:date Timestamp when crawled
dct:issued Timestamp when processed by pipeline and ingested to GraphDB
dc:title Short title
dc:description Longer description
dc:subject Article subject, roughly corresponding to IPTC Subject Codes
schema:keywords Free keywords
nif:Context "Reference Context": holds the full text, root of all NIF data.
  Each word/sentence points to it using nif:referenceContext.
  The URL #char=<beg,end> follows RFC 5147
nif:sourceUrl Points to the SIMMO
nif:beginIndex Always 0 for this node. A xsd:nonNegativeInteger
nif:endIndex Length of the text
nif:isString The full text

7 Named Entity Recognition

This section describes the representation of NER in Multisensor

7.1 NER Mapping

Multisensor recognizes a number of Named Entity types. The following table specifies potential NE properties and what they are mapped to.

Class /Property Type/enum Mapping Notes
all   nif:Word or nif:Phrase  
text string n/a nif:anchorOf omitted
onset number nif:beginIndex start
offset number nif:endIndex end
Person   dbo:Person, foaf:Person; nerd:Person  
test string foaf:name  
firstname string foaf:firstName  
lastname string foaf:lastName  
gender male, female dbo:gender dbp:Male, dbp:Female
occupation string rdau:professionOrOccupation dbo:occupation and dbo:profession are object props
Location type=other nerd:Location No need to use dbo:Location if you can't identify the type
Location type=country dbo:Country; nerd:Country  
Location type=region dbo:Region; nerd:AdministrativeRegion  
Location type=city dbo:City; nerd:City  
Location type=street schema:PostalAddress; nerd:Location Put text in schema:streetAddress
Organisation type=institution dbo:Organisation, foaf:Organization; nerd:Organization  
Organisation type=company dbo:Company, foaf:Company; nerd:Company  
Product   nerd:Product  
type string not yet don't know yet what makes sense here
Time   time:Instant; nerd:Time TODO: can you parse to XSD datetime components?
year string time:Instant; nerd:Time  
month string time:Instant OR yago:Months; nerd:Time if yago:Months then dbp:January…
day string time:Instant; nerd:Time  
time string time:Instant; nerd:Time  
weekday string yago:DaysOfTheWeek; nerd:Time dbp:Sunday,… Put text in rdfs:label
rel string nerd:Time relative expression, eg "the last three days"
other string nerd:Time any other time expression, eg "Valentine's day"
Amount type=price schema:PriceSpecification; nerd:Amount  
unit string schema:priceCurrency 3-letter ISO 4217 format
amount number schema:price "." as decimal separator
Amount type=unit schema:QuantitativeValue; nerd:Amount How about percentage??
unit string schema:unitCode Strictly speaking, UN/CEFACT Common Code (eg GRM for grams)
amount number schema:value  
Name   nerd:Thing  
type string dc:type a type if anything can be identified, otherwise empty

Notes

  • Classes are uppercase, Properties are lowercase
  • A recognized Named Entity is attached to the word using its:taIdentRef
  • NERD classes are attached to the word using its:taClassRef
  • Other classes are attached to the NE using rdf:type.
  • The Amount mapping uses schema.org classes/properties, which were borrowed from GoodRelations
  • dbo:gender is an object property, though it doesn't specify the values to use
  • dc:type is a literal. We attach it to the word directly
  • We also include Provenance and Confidence for each annotation (see section 4.10)

7.2 Named Entity URLs

We use two kinds of Named Entity URLs:

  • Global: if a NE can be identified in DBpedia or Babelnet, use its global URL, eg dbr:Angela_Merkel
  • Local: if a NE is only recognized, but not globally identified, use per-document URL consisting of the type and label (replacing punctuation with "_"), eg <#Person=Angela_Merkel>. This does not allow two different John_Smiths in one document, but the chance of this to happen is small.

Note on slash vs Hash: everyting after a # stripped by the client before it makes a HTTP request.

  • So hash is used for "sub-nodes" that will typically be served with one HTTP request
  • In contrast, slash is used with large collections

7.3 NER Examples

Examples of various kinds of Named Entities as per the above mapping.

  • I made up some word/phrase occurrences. I use nif:anchorOf to illustrate the word/phrase, and omit nif:beginIndex and nif:endIndex
  • In a couple cases I've embedded rdfs:comment and rdfs:seeAlso to illustrate a point. Of course, thse won't be present in the actual RDF.

./img/MS-NER.ttl

# Various kinds of Named Entities as per Multisensor-NER-Mapping

@base <http://data.multisensorproject.eu/content/12486u3968u39>.

<#char=1,2> nif:anchorOf "Angela Merkel";
  its:taClassRef nerd:Person;
  its:taIdentRef <#person=Angela_Merkel>.
<#person=Angela_Merkel> a dbo:Person, foaf:Person;
  foaf:name "Angela Merkel";
  foaf:firstName "Angela"; foaf:lastName "Merkel";
  dbo:gender dbp:Female;
  rdau:professionOrOccupation "Bundeskanzlerin"@de.

<#char=3,4> nif:anchorOf "Germany";
  its:taClassRef nerd:Country;
  its:taIdentRef <#location=Germany>.
<#location=Germany> a dbo:Country;
  foaf:name "Germany".

<#char=5,6> nif:anchorOf "Hesse region";
  its:taClassRef nerd:AdministrativeRegion;
  its:taIdentRef <#location=Hesse>.
<#location=Hesse> a dbo:Region;
  foaf:name "Hesse".

<#char=7,8> nif:anchorOf "Darmstadt";
  its:taClassRef nerd:City;
  its:taIdentRef <#location=Darmstadt>.
<#location=Darmstadt> a dbo:City;
  foaf:name "Darmstadt".

<#char=9,10> nif:anchorOf "135 Tsarigradsko Shosse Blvd.";
  its:taClassRef nerd:Location;
  its:taIdentRef <#location=135_Tsarigradsko_Shosse_Blvd>.
<#location=135_Tsarigradsko_Shosse_Blvd> a schema:PostalAddress;
  schema:streetAddress "135 Tsarigradsko Shosse Blvd.".

<#char=11,12> nif:anchorOf "the dark side of the Moon";
  its:taClassRef nerd:Location.

<#char=13,14> nif:anchorOf "The United Nations";
  its:taClassRef nerd:Organization;
  its:taIdentRef <#organisation=The_United_Nations>.
<#organisation=The_United_Nations> a dbo:Organisation, foaf:Organization;
  foaf:name "The United Nations".

<#char=15,16> nif:anchorOf "Ontotext Corp";
  its:taClassRef nerd:Company;
  its:taIdentRef <#organisation=Ontotext_Corp>.
<#organisation=Ontotext_Corp> a dbo:Company, foaf:Company;
  foaf:name "Ontotext Corp".

<#char=17,18> nif:anchorOf "AEG Smart-Freeze Refrigerator";
  its:taClassRef nerd:Product.

<#char=19,20> nif:anchorOf "2050";
  its:taClassRef nerd:Time;
  its:taIdentRef <#time=2050>.
<#time=2050> a time:Instant;
  time:inXSDDateTime "2050"^^xsd:gYear.

<#char=21,22> nif:anchorOf "May 2050";
  its:taClassRef nerd:Time;
  its:taIdentRef <#time=May_2050>.
<#time=May_2050> a time:Instant;
  time:inXSDDateTime "2050-05-01"^^xsd:gYearMonth;
  rdfs:comment "The correct value is 2050-05 but my JSONLD convertor throws exception";
  rdfs:seeAlso <https://github.com/jsonld-java/jsonld-java/issues/130>.

<#char=41,42> nif:anchorOf "Mei";
  its:taClassRef nerd:Time;
  its:taIdentRef dbp:May.
dbp:May a yago:Months;
  rdfs:label "May"@en, "Mei"@de.

<#char=23,24> nif:anchorOf "15 May 2050";
  its:taClassRef nerd:Time;
  its:taIdentRef <#time=15_May_2050>.
<#time=15_May_2050> a time:Instant;
  time:inXSDDateTime "2050-05-15"^^xsd:date.

<#char=25,26> nif:anchorOf "1:34pm";
  its:taClassRef nerd:Time;
  its:taIdentRef <#time=1_34pm>.
<#time=1_34pm> a time:Instant;
  rdfs:comment "Convert to xsd:time, which means complete it to minutes";
  time:inXSDDateTime "13:34:00"^^xsd:time.

<#char=39,40> nif:anchorOf "15 May 2050 1:34pm";
  its:taClassRef nerd:Time;
  its:taIdentRef <#time=15_May_2050_1_34pm>.
<#time=15_May_2050_1_34pm> a time:Instant;
  time:inXSDDateTime "2050-05-15T13:34:00"^^xsd:datetime.

<#char=27,28> nif:anchorOf "Zondag";
  its:taClassRef nerd:Time;
  its:taIdentRef dbp:Sunday.
dbp:Sunday a yago:DaysOfTheWeek;
  rdfs:label "Sunday"@en, "Zondag"@de.

<#char=29,30> nif:anchorOf "the last three days";
  its:taClassRef nerd:Time.

<#char=31,32> nif:anchorOf "Valentine's day";
  its:taClassRef nerd:Time.

<#char=33,34> nif:anchorOf "123,40 EUR";
  its:taClassRef nerd:Amount;
  its:taIdentRef <#amount=123_40_EUR>.
<#amount=123_40_EUR> a schema:PriceSpecification;
  schema:priceCurrency "EUR";
  schema:price 123.40.

<#char=35,36> nif:anchorOf "123,40 meters";
  its:taClassRef nerd:Amount;
  its:taIdentRef <#amount=123_40_meters>.
<#amount=123_40_meters> a schema:QuantitativeValue;
  schema:unitCode "MTR";
  schema:value 123.40.

<#char=37,38> nif:anchorOf "Dodo";
  its:taClassRef nerd:Thing;
  dc:type "mythical creature".

7.4 Babelnet Concepts

The MS Entity Linking service uses Babelfy to annotate SIMMOs with Babelnet concepts. Babelnet is a large-scale linguistic resource, integrating WordNet, different language versions of Wikipedia, Geonames, etc.

After annotation with Babelnet concepts, we wanted to obtain more details about them then just the label (see below). The whole Babelnet dataset is not available for download, but one can get the entities one by one. We fetched all Babelnet entities found by MS and their broader concepts, as documented here and next section. MS found 324k occurrences of 31k Babelnet entities, which grows to 46k when we get their broaders (recursively).

We recorded various info, including EN, ES, DE, BG labels (where available); DBpedia, Wordnet and Geonames links; DBpedia categories. These will be put in the default graph, not in per-SIMMO graphs, see sec 6.1.2. Note: Babelnet uses lemon:isReferenceOf and lemon:LexicalSense to express the labels, but we use a simpler representation with skos:prefLabel. E.g. for The Hague:

bn:s00000002n a skos:Concept;
  skos:prefLabel               "The Hague"@en, "Den Haag"@de, "Хага"@bg;
  bn-lemon:dbpediaCategory     dbc:Populated_coastal_places_in_the_Netherlands, dbc:1248_establishments, 
                               dbc:Provincial_capitals_of_the_Netherlands, dbc:The_Hague, dbc:Populated_places_in_South_Holland, dbc:Populated_places_established_in_the_13th_century, dbc:Cities_in_the_Netherlands, 
                               dbc:Port_cities_and_towns_of_the_North_Sea;
  bn-lemon:synsetID            "bn:00000002n";
  bn-lemon:synsetType          "NE";
  bn-lemon:wiktionaryPageLink  wiktionary:The_Hague;
  dct:license                  <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
  lexinfo:partHolonym          bn:s00044423n;
  skos:broader                 bn:s00064917n, bn:s00015498n, bn:s15898622n, bn:s03335997n, bn:s10245001n, bn:s00056922n, bn:s00019319n;
  skos:exactMatch              freebase:m.07g0_, lemon-Omega:OW_eng_Synset_22362, lemon-WordNet31:108970180-n, 
                               dbr:The_Hague, yago:The_Hague, geonames:2747373 .

7.4.1 Generic vs Specific Concepts

The Concept Extraction Service makes a distinction between Generic vs Specific Babelnet concepts, which is used by the Summarization service.

  • Generic concept
  • Specific concept: specific to the Multisensor domain, which is recognized by statistical analysis over the MS SIMMO corpus

Consider the following example: "Wind turbines are complex engineering systems":

  • bn:s00081274n "wind turbine" is a specific concept (since MS includes a lot of energy-related articles)
  • bn:s00075759n "system" is a generic concept
<#char=0,45> a nif:Context;
  nif:isString "Wind turbines are complex engineering systems".

<#char=0,13> a nif:Phrase;
  nif:referenceContext <#char=0,45>;
  nif:beginIndex 0;
  nif:endIndex 13;
  nif:anchorOf "Wind turbines";
  nif:taIdentRef bn:s00081274n;
  nif:taClassRef ms:SpecificConcept.
bn:s00081274n a skos:Concept; skos:prefLabel "wind turbine"@en, "aerogenerador"@es.

<#char=38,45> a nif:Word;
  nif:referenceContext <#char=0,45>;
  nif:beginIndex 38;
  nif:endIndex 45;
  nif:anchorOf "system";
  nif:taIdentRef bn:s00075759n;
  nif:taClassRef ms:GenericConcept.
bn:s00075759n a skos:Concept; skos:prefLabel "system"@en, "sistema"@es.

The two new classes that we use are defined in the MS ontology:

ms:GenericConcept a rdfs:Class;
  rdfs:subClassOf skos:Concept;
  rdfs:label "GenericConcept";
  rdfs:comment "Generic concept that doesn't belong to a specific domain";
  rdfs:isDefinedBy ms: .

ms:SpecificConcept a rdfs:Class;
  rdfs:subClassOf skos:Concept;
  rdfs:label "SpecificConcept";
  rdfs:comment "Concept that is specific to a Multisensor domain, determined by statistical analysis over the Multisensor SIMMO corpus";
  rdfs:isDefinedBy ms: .

8 Relation Extraction

We represent Relation Extraction information using FrameNet. This more complex topic is developed in its own folder ./FrameNet/ and a paper:

9 Complex NLP Annotations

The example below shows realistic annotations for the word "East", including:

  • binding to the text (nif:referenceContext)
  • word and lemma (nif:anchorOf, nif:lemma)
  • demarkating the substring (nif:beginIndex, nif:endIndex)
  • part of speech tagging (nif:oliaLink penn:NNP)
  • surface and deep dependency parsing (nif:dependency, upf-deep:deepDependency)
  • FrameNet (nif:oliaLink <#char=0,4_fe> and linked structures)
  • Babelnet concept (the <#char=0,4-annot-BabelNet> node)
  • GenericConcept vs SpecificConcept annotation

The UPF NLP pipeline step first produces nif:literalAnnotation, and from that makes appropriate structured properties.

<#char=0,4>                a nif:Word;
  nif:anchorOf             "East";
  nif:lemma                "east";
  nif:beginIndex           "0"^^xsd:nonNegativeInteger;
  nif:endIndex             "4"^^xsd:nonNegativeInteger;
  nif:referenceContext     <#char=0,12793>;
  nif:oliaLink             upf-deep:NAME, upf-dep-syn:NAME, <#char=0,4_fe>, penn:NNP;
  nif:dependency           <#char=5,11>;
  upf-deep:deepDependency  <#char=5,11>;
  nif-ann:annotationUnit   <#char=0,4-annot-BabelNet>;
  nif:literalAnnotation    "deep=spos=NN", "surf=spos=NN",
                           "rel==dpos=NN|end_string=4|id0=1|start_string=0|number=SG|word=east|connect_check=OK|vn=east".

<#char=0,4-annot-BabelNet> a nif-ann:AnnotationUnit;
  nif-ann:confidence       "0.9125974876"^^xsd:double;
  nif-ann:provenance       <http://babelfy.org/>;
  its:taClassRef           ms:GenericConcept;
  its:taIdentRef           bn:s00029050n .

10 Multimedia Annotation

MS includes 2 multimedia services that are integrated in RDF:

  • Automatic Speech Recognition (ASR) that provides raw text extracted from the video; followed by NLP and NER
  • Concept and Event Detection that provides a list of the concepts appearing in images/videos, with a degree of confidence.

Being able to search for concepts detected in images, videos, and/or audio (speech recognition) is a useful multimedia search feature.

The basic NIF representation is like this:

  • SIMMO
    • referenceContext
      • Sentences
        • Words/Phrases
          • its:taIdentRef = list of recognized Concepts / Named Entities

We extend it for multimedia content as follows:

  • SIMMO
    • dct:hasPart dctype:StillImage = images present in the article
      • oa:Annotation = list of Concepts/Events detected per image, with confidence score
    • dct:hasPart dctype:MovingImage = videos present in the article
      • oa:Annotation = 3 to 5 most confident Concepts/Events detected in the video, with confidence score
      • ms:hasCaption = text extracted by Automatic Speech recognition
        • its:taIdentRef = recognized Concepts / Named Entities
      • dct:hasPart dctype:StillImage = some frames (images) extracted from the video
        • oa:Annotation = Concepts/Events detected per image, with confidence score

10.1 Automatic Speech Recognition

The audio track of videos embedded in articles (SIMMOs) is passed through Automatic Speech Recognition (ASR). This results in two products:

  • Plain text Transcript that is passed through text analysis (NER and other NIF annotations). The transcript is analyzed same as the main article text. So it has similar structure to the SIMMO, with the following differences
    • The transcript doesn't have sentence boundaries thus no NIF sentence structure.
    • The transcript doesn't have context properties such as author, publication date, etc
    • The transcript is subsidiary to the article, following this nesting structure:
      • Article -dct:hasPart-> Video -ms:hasCaption-> Caption <-nif:sourceUrl- Transcript
      • Note: I considered inserting Video - Audio - Caption but decided against it since we don't have any statements about the Audio
  • Structured Captions in Web Video Text Tracks (WebVTT) format (MIME type "text/vtt"). The Caption file is not stored in RDF, only a link to it is in RDF

Notes:

  • Assume that http://blog.hgtv.com/terror/2014/09/08/video is the 0th video in http://blog.hgtv.com/terror/2014/09/08/article
  • Both the article and video mention "Germany" which is recognized as a named entity. This is just for the sake of illustration and comparison, and we don't show any other NIF statements
  • The video is accessed from the source URL and not copied to an MS server We make statements against the video URL, rather than making a MS URL (same as for Images). If copied to an MS server, it's better to make statements against that URL
  • The Caption is stored on a MS server in the indicated directory.
  • The Transcript (bottom nif:Context) uses the Caption as nif:sourceUrl.
  • The Transcript's URL is subsidiary to (has as prefix) the SIMMO URL. Since we can't use two # in a URL, we use - before the transcript part and # after it. The number 0 is the sequential count (0th video)

./img/NIF-ASR.ttl

@base <http://data.multisensorproject.eu/content/fb086c>.
<> a foaf:Document ;
  dc:creator "John Smith" ;
  dc:date "2014-09-08T17:15:34.000+02:00"^^xsd:dateTime;
  dct:source <http://blog.hgtv.com/terror/2014/09/08/article>.

<#char=0,24> a nif:Context;
  nif:beginIndex "0"^^xsd:nonNegativeInteger ;
  nif:endIndex "24"^^xsd:nonNegativeInteger ;
  nif:isString "Article mentions Germany";
  nif:sourceUrl <>.

<#char=17,24> a nif:Word;
  nif:referenceContext <#char=0,24>;
  nif:beginIndex "17"^^xsd:nonNegativeInteger ;
  nif:endIndex "24"^^xsd:nonNegativeInteger ;
  nif:anchorOf "Germany";
  its:taIdentRef dbr:Germany.

<> dct:hasPart <http://blog.hgtv.com/terror/2014/09/08/video>.

<http://blog.hgtv.com/terror/2014/09/08/video> a dctype:MovingImage;
  dc:format "video/mp4";
  ms:hasCaption <-transcript0>.

<-transcript0> a dctype:Text;
  dc:format "text/vtt".

<-transcript0#char=0,27> a nif:Context;
  nif:beginIndex "0"^^xsd:nonNegativeInteger ;
  nif:endIndex "27"^^xsd:nonNegativeInteger ;
  nif:isString "Transcript mentions Germany";
  nif:sourceUrl <-transcript0>.

<-transcript0#char=20,27> a nif:Word;
  nif:referenceContext <-transcript0#char=0,27>;
  nif:beginIndex "20"^^xsd:nonNegativeInteger ;
  nif:endIndex "27"^^xsd:nonNegativeInteger ;
  nif:anchorOf "Germany";
  its:taIdentRef dbr:Germany.

dbr:Germany a nerd:Location, dbo:Country; foaf:name "Germany".

####################
nif:sourceUrl puml:arrow puml:up.
nif:referenceContext puml:arrow puml:up.
<http://blog.hgtv.com/terror/2014/09/08/article> a puml:Inline.

10.1.1 ASR Diagram

NIF-ASR.png

10.1.2 hasCaption Property

I was hoping that I can find a property to express "ASR transcript of an audio" in the ISOcat register or GOLD. There's nothing appropriate in GOLD but I found an entry in http://www.isocat.org/rest/profile/19:

This is also available as RDF at http://www.isocat.org/datcat/DC-4064.rdf (which redirects to http://www.isocat.org/rest/dc/4064.rdf), but the info is minimal:

<http://www.isocat.org/datcat/DC-4064>
  rdfs:comment  "The conversion of the spoken word to a text format in the same language."@en;
  rdfs:label    "audio transcription"@en .

The datahub entry for ISOcat https://datahub.io/dataset/isocat claims that full profiles are available as RDF at https://catalog.clarin.eu/isocat/rest/profile/19.rdf, but this link is broken. I found an (unofficial?) RDF dump of profile 5 at http://www.sfs.uni-tuebingen.de/nalida/images/isocat/profile-5-full.rdf but not of profile 19.

What is worse, there is no property name defined (eg isocat:audioTranscription), no domain and range. We'll certainly won't use something like isocat:DC-4064 to name our properties.

After this disappointment, we defined our own property:

ms:hasCaption a owl:ObjectProperty;
  rdfs:domain dctype:MovingImage;
  rdfs:range dctype:Text;
  rdfs:label "hasCaption";
  rdfs:comment "Transcript or ASR of a video (dctype:MovingImage) expressed as dctype:Text";
  rdfs:isDefinedBy ms: .

10.2 Basic Image Annotation

Before describing how an image in SIMMO is annotated, let's consider how to annotate (enrich) a single image. Since images are not text, NIF mechanisms are completely inappropriate: there are no nif:Strings to be found in images.

Look at this image:
drage-vermisste-540x304.jpg

NOTE: It's recommended to copy the images to an internal server, to ensure that they will be available in the future. If the above image disappears, statements about its URL will be useless.

The CED annotates images with heuristic tags and confidence, eg like this (many more tags are produced for this image):

Concepts3_Or_More_People # 0.731893
Amateur_Video            # 0.884379
Armed_Person             # 0.35975

We represent this in RDF using:

  • OpenAnnotation (basic representation)
  • Stanbol FISE (confidence)

Unfortunately OA has no standard way to express confidence, which is essential for this use case. I have raised this as https://github.com/restful-open-annotation/spec/issues/3.

10.2.1 Basic Representation with Open Annotation

The Web Annotation Data Model (also known as Open Annotation, OA) is widely used for all kinds of associating two or several resources: bookmarking, tagging, commenting, annotating, transcription (associating the image of eg handwritten text with the deciphered textual resources), compositing pieces of a manuscript (SharedCanvas), etc.

The OA ontology has gone through a huge number of revisions at various sites. To avoid confusion:

We represent image annotations as oa:SemanticTag:

  • The image is the target, tags are (linked to) bodies
  • The tags are expressed as oa:SemanticTag.
  • OA asks us to describe the nature of the relation as a specific oa:motivatedBy. In this case I picked oa:tagging.
  • We state the nature of the resource as rdf:type dctype:Image, and its mime type as dc:format.
  • We record basic creation (provenance) information.
  • In this example we use a custom property ms:confidence but in production we use Stanbol FISE for confidence.

./img/annot-image-oa.ttl

ms-annot:1234153426
  a oa:Annotation;
  oa:hasTarget <http://images.zeit.de/...-540x304.jpg>;
  oa:hasBody
    ms-annot:1234153426-Concepts3_Or_More_People,
    ms-annot:1234153426-Amateur_Video,
    ms-annot:1234153426-Armed_Person;
  oa:motivatedBy oa:tagging;
  oa:annotatedBy <http://data.multisensorproject.eu/agent/imageAnnotator>;
  oa:annotatedAt "2015-10-01T12:34:56"^^xsd:dateTime.

<http://images.zeit.de/...-540x304.jpg>
  a dctype:Image;
  dc:format "image/jpeg".

<http://data.multisensorproject.eu/agent/imageAnnotator>
  a prov:SoftwareAgent;
  foaf:name "CERTH Image Annotator v1.0".

ms-annot:1234153426-Concepts3_Or_More_People
  a oa:SemanticTag;
  skos:related ms-concept:Concepts3_Or_More_People;
  ms:confidence 0.731893.
ms-annot:1234153426-Amateur_Video
  a oa:SemanticTag;
  skos:related ms-concept:Amateur_Video;
  ms:confidence 0.884379.
ms-annot:1234153426-Armed_Person
  a oa:SemanticTag;
  skos:related ms-concept:Armed_Person;
  ms:confidence 0.35975.

ms-concept:Concepts3_Or_More_People
  a skos:Concept;
  skos:inScheme ms-concept: ;
  skos:prefLabel "Concepts: 3 or More People".
ms-concept:Amateur_Video
  a skos:Concept, oa:SemanticTag;
  skos:inScheme ms-concept: ;
  skos:prefLabel "Amateur Video".
ms-concept:Armed_Person
  a skos:Concept, oa:SemanticTag;
  skos:inScheme ms-concept: ;
  skos:prefLabel "Armed Person".

####################
oa:tagging a puml:Inline.
ms-annot:1234153426 puml:right <http://data.multisensorproject.eu/agent/imageAnnotator>.
ms-annot:1234153426 puml:left <http://images.zeit.de/...-540x304.jpg>.
10.2.1.1 Open Annotation Diagram

annot-image-oa.png

10.2.2 Representing Confidence with Stanbol FISE

Apache Stanbol defines an "enhancement structure" using the FISE ontology, which amongst other things defines fise:confidence. We want to use fise:TopicAnnotation that goes like this:
es_topicannotation.png

As you see, it points to fise:TextAnnotation using dc:relation; if you scroll to the top, you'll see that points further to the (textual) annotated resource (ContentItem): we don't want that since we have image not text. But there are also fise:extracted-from (dashed arrows) pointing directly to the resource. The NIF+Stanbol profile shows the same idea of using fise:extracted-from directly:
NIF-profiles.png

We bastardize the ontology a bit:

  • skip dc:relation, as we don't have fise:TextAnnotation
  • skip fise:entity-label, as it just repeats skos:prefLabel of the concept
  • skip fise:entity-type, as it just repeats rdf:type of the concept

Removing redundancy:

  • The construct of using skos:related is doubtful and will likely be removed, but for now we'll use it
  • The direct link fise:extracted-from to the image is redundant since oa:hasTarget already points there. So we can skip it

./img/annot-image-fise.ttl

ms-annot:1234153426
  a oa:Annotation;
  oa:hasTarget <http://images.zeit.de/...-540x304.jpg>;
  oa:hasBody
    ms-annot:1234153426-Concepts3_Or_More_People,
    ms-annot:1234153426-Amateur_Video,
    ms-annot:1234153426-Armed_Person;
  oa:motivatedBy oa:tagging;
  oa:annotatedBy <http://data.multisensorproject.eu/agent/imageAnnotator>;
  oa:annotatedAt "2015-10-01T12:34:56"^^xsd:dateTime.

ms-annot:1234153426-Concepts3_Or_More_People
  a oa:SemanticTag, fise:TopicAnnotation;
  skos:related ms-concept:Concepts3_Or_More_People; fise:entity-reference ms-concept:Concepts3_Or_More_People;
  fise:extracted-from <http://images.zeit.de/...-540x304.jpg>;
  fise:confidence 0.731893.
ms-annot:1234153426-Amateur_Video
  a oa:SemanticTag, fise:TopicAnnotation;
  skos:related ms-concept:Amateur_Video; fise:entity-reference ms-concept:Amateur_Video;
  fise:extracted-from <http://images.zeit.de/...-540x304.jpg>;
  fise:confidence 0.884379.
ms-annot:1234153426-Armed_Person
  a oa:SemanticTag, fise:TopicAnnotation;
  skos:related ms-concept:Armed_Person; fise:entity-reference ms-concept:Armed_Person;
  fise:extracted-from <http://images.zeit.de/...-540x304.jpg>;
  fise:confidence 0.35975.

<http://images.zeit.de/...-540x304.jpg>
  a dctype:Image;
  dc:format "image/jpeg".

<http://data.multisensorproject.eu/agent/imageAnnotator>
  a prov:SoftwareAgent;
  foaf:name "CERTH Image Annotator v1.0".

ms-concept:Concepts3_Or_More_People
  a skos:Concept;
  skos:inScheme ms-concept: ;
  skos:prefLabel "Concepts: 3 or More People".
ms-concept:Amateur_Video
  a skos:Concept, oa:SemanticTag;
  skos:inScheme ms-concept: ;
  skos:prefLabel "Amateur Video".
ms-concept:Armed_Person
  a skos:Concept, oa:SemanticTag;
  skos:inScheme ms-concept: ;
  skos:prefLabel "Armed Person".

####################
oa:tagging a puml:Inline.
skos:inScheme a puml:InlineProperty.
ms-annot:1234153426 puml:right <http://data.multisensorproject.eu/agent/imageAnnotator>.
ms-annot:1234153426 puml:left <http://images.zeit.de/...-540x304.jpg>.
10.2.2.1 Stanbol FISE Diagram

annot-image-fise.png

10.3 Annotating Images

Assume that http://blog.hgtv.com/terror/2014/09/08/image.jpg is an 0th image in http://blog.hgtv.com/terror/2014/09/08/article and:

  • The article mentions SWAT, which is coreferenced to dbr:SWAT
  • CED has recognized in the image the same concept dbr:SWAT with confidence 0.9
  • CED has recognized a local concept ms-concept:Concepts3_Or_More_People with lower confidence 0.3

We follow the approach in sec 10.2.2, but remove the redundant link fise:entity-reference

./img/SIMMO-annot-image.ttl

@base <http://data.multisensorproject.eu/content/fb086c>.

<> a foaf:Document ;
  dc:creator "John Smith" ;
  dc:date "2014-09-08T17:15:34.000+02:00"^^xsd:dateTime;
  dct:source <http://blog.hgtv.com/terror/2014/09/08/article>.

<#char=0,24> a nif:Context;
  nif:beginIndex "0"^^xsd:nonNegativeInteger ;
  nif:endIndex "24"^^xsd:nonNegativeInteger ;
  nif:sourceUrl <http://data.multisensorproject.eu/content/fb086c>.

<#char=17,21> a nif:Word;
  nif:referenceContext <#char=0,24>;
  nif:beginIndex "17"^^xsd:nonNegativeInteger ;
  nif:endIndex "21"^^xsd:nonNegativeInteger ;
  nif:anchorOf "SWAT";
  its:taIdentRef dbr:SWAT.

<> dct:hasPart <http://blog.hgtv.com/terror/2014/09/08/image.jpg>.

<http://blog.hgtv.com/terror/2014/09/08/image.jpg> a dctype:StillImage;
  dc:format "image/jpeg".

ms-annot:1234153426 a oa:Annotation;
  oa:hasTarget <http://blog.hgtv.com/terror/2014/09/08/image.jpg>;
  oa:hasBody ms-annot:1234153426-Concepts3_Or_More_People, ms-annot:1234153426-SWAT;
  oa:motivatedBy oa:tagging;
  oa:annotatedBy <http://data.multisensorproject.eu/agent/imageAnnotator>;
  oa:annotatedAt "2015-10-01T12:34:56"^^xsd:dateTime.

ms-annot:1234153426-Concepts3_Or_More_People a oa:SemanticTag, fise:TopicAnnotation;
  skos:related ms-concept:Concepts3_Or_More_People;
  fise:confidence 0.3.

ms-annot:1234153426-SWAT a oa:SemanticTag, fise:TopicAnnotation;
  skos:related dbr:SWAT;
  fise:confidence 0.9.

<http://data.multisensorproject.eu/agent/imageAnnotator>
  a prov:SoftwareAgent;
  foaf:name "CERTH Image Annotator v1.0".

ms-concept:Concepts3_Or_More_People
  a skos:Concept;
  skos:inScheme ms-concept: ;
  skos:prefLabel "Concepts: 3 or More People".

dbr:SWAT a skos:Concept; skos:prefLabel "SWAT".

####################
oa:tagging a puml:Inline.
<http://blog.hgtv.com/terror/2014/09/08/article> a puml:Inline.
skos:inScheme a puml:InlineProperty.
nif:referenceContext puml:arrow puml:up.
nif:sourceUrl puml:arrow puml:up.
oa:hasTarget puml:arrow puml:up.
ms-annot:1234153426 puml:right <http://data.multisensorproject.eu/agent/imageAnnotator>.

10.3.1 Annotated Image Diagram

SIMMO-annot-image.png

10.4 Annotating Videos

CED extracts the 3..5 most confident Concepts/Events detected in a video, with confidence score. We represent this exactly the same as in the previous sec 10.3, just using the appropriate rdf:type (dctype:MovingImage) and dc:format ("video/mp4") for the video.

./img/SIMMO-annot-video.ttl

@base <http://data.multisensorproject.eu/content/fb086c>.

<> a foaf:Document ;
  dc:creator "John Smith" ;
  dc:date "2014-09-08T17:15:34.000+02:00"^^xsd:dateTime;
  dct:source <http://blog.hgtv.com/terror/2014/09/08/article>.

<#char=0,24> a nif:Context;
  nif:beginIndex "0"^^xsd:nonNegativeInteger ;
  nif:endIndex "24"^^xsd:nonNegativeInteger ;
  nif:sourceUrl <>.

<#char=17,21> a nif:Word;
  nif:referenceContext <#char=0,24>;
  nif:beginIndex "17"^^xsd:nonNegativeInteger ;
  nif:endIndex "21"^^xsd:nonNegativeInteger ;
  nif:anchorOf "SWAT";
  its:taIdentRef dbr:SWAT.

<> dct:hasPart <http://blog.hgtv.com/terror/2014/09/08/video.mp4>.

<http://blog.hgtv.com/terror/2014/09/08/video.mp4> a dctype:MovingImage;
  dc:format "video/mp4".

ms-annot:1234153426
  a oa:Annotation;
  oa:hasTarget <http://blog.hgtv.com/terror/2014/09/08/video.mp4>;
  oa:hasBody
    ms-annot:1234153426-Concepts3_Or_More_People,
    ms-annot:1234153426-SWAT;
  oa:motivatedBy oa:tagging;
  oa:annotatedBy <http://data.multisensorproject.eu/agent/imageAnnotator>;
  oa:annotatedAt "2015-10-01T12:34:56"^^xsd:dateTime.

ms-annot:1234153426-Concepts3_Or_More_People
  a oa:SemanticTag, fise:TopicAnnotation;
  skos:related ms-concept:Concepts3_Or_More_People;
  fise:confidence 0.3.
ms-annot:1234153426-SWAT
  a oa:SemanticTag, fise:TopicAnnotation;
  skos:related dbr:SWAT;
  fise:confidence 0.9.

<http://data.multisensorproject.eu/agent/imageAnnotator>
  a prov:SoftwareAgent;
  foaf:name "CERTH Image Annotator v1.0".

ms-concept:Concepts3_Or_More_People
  a skos:Concept;
  skos:inScheme ms-concept: ;
  skos:prefLabel "Concepts: 3 or More People".

dbr:SWAT a skos:Concept; skos:prefLabel "SWAT".

####################
oa:tagging a puml:Inline.
<http://blog.hgtv.com/terror/2014/09/08/article> a puml:Inline.
skos:inScheme a puml:InlineProperty.
nif:referenceContext puml:arrow puml:up.
nif:sourceUrl puml:arrow puml:up.
oa:hasTarget puml:arrow puml:up.
ms-annot:1234153426 puml:right <http://data.multisensorproject.eu/agent/imageAnnotator>.

10.4.1 Annotated Video Diagram

SIMMO-annot-video.png

10.5 Annotating Video Frames

To annotate a video frame, we use Web Annotation's Specific Resources and a Fragment Selector that conforms to the Media Fragments specification. Assume the same video as in the previous section, and that frame(s) from second 30 to second 31 are annotated. This corresponds to a selector #t=30,31 (see Temporal Dimension for more details including NPT vs SMPTE vs real-world clock). We show only one annotated concept for simplicity.

./img/SIMMO-annot-frame.ttl

@base <http://data.multisensorproject.eu/content/fb086c>.

<> a foaf:Document ;
  dc:creator "John Smith" ;
  dc:date "2014-09-08T17:15:34.000+02:00"^^xsd:dateTime;
  dct:source <http://blog.hgtv.com/terror/2014/09/08/article>.

<#char=0,24> a nif:Context;
  nif:beginIndex "0"^^xsd:nonNegativeInteger ;
  nif:endIndex "24"^^xsd:nonNegativeInteger ;
  nif:sourceUrl <>.

<#char=17,21> a nif:Word;
  nif:referenceContext <#char=0,24>;
  nif:beginIndex "17"^^xsd:nonNegativeInteger ;
  nif:endIndex "21"^^xsd:nonNegativeInteger ;
  nif:anchorOf "SWAT";
  its:taIdentRef dbr:SWAT.

<> dct:hasPart <http://blog.hgtv.com/terror/2014/09/08/video.mp4>.

<http://blog.hgtv.com/terror/2014/09/08/video.mp4> a dctype:MovingImage;
  dc:format "video/mp4".

ms-annot:1234153426
  a oa:Annotation;
  oa:hasTarget ms-annot:1234153426-target;
  oa:hasBody ms-annot:1234153426-SWAT;
  oa:motivatedBy oa:tagging;
  oa:annotatedBy <http://data.multisensorproject.eu/agent/imageAnnotator>;
  oa:annotatedAt "2015-10-01T12:34:56"^^xsd:dateTime.

ms-annot:1234153426-target a oa:SpecificResource;
  oa:hasSource  <http://blog.hgtv.com/terror/2014/09/08/video.mp4>;
  oa:hasSelector <http://blog.hgtv.com/terror/2014/09/08/video.mp4#t=30,31>.

<http://blog.hgtv.com/terror/2014/09/08/video.mp4#t=30,31> a oa:FragmentSelector;
  rdf:value "t=30,31";
  dct:conformsTo <http://www.w3.org/TR/media-frags/>.

ms-annot:1234153426-SWAT
  a oa:SemanticTag, fise:TopicAnnotation;
  skos:related dbr:SWAT;
  fise:confidence 0.9.

<http://data.multisensorproject.eu/agent/imageAnnotator>
  a prov:SoftwareAgent;
  foaf:name "CERTH Image Annotator v1.0".

dbr:SWAT a skos:Concept; skos:prefLabel "SWAT".

<http://blog.hgtv.com/terror/2014/09/08/video.mp4> dct:hasPart <http://blog.hgtv.com/terror/2014/09/08/video.mp4#t=30,31>.

####################
oa:tagging a puml:Inline.
<http://blog.hgtv.com/terror/2014/09/08/article> a puml:Inline.
<http://www.w3.org/TR/media-frags/> a puml:Inline.
skos:inScheme a puml:InlineProperty.
nif:referenceContext puml:arrow puml:up.
nif:sourceUrl puml:arrow puml:up.
oa:hasTarget puml:arrow puml:up.
oa:hasSource puml:arrow puml:up.
<http://blog.hgtv.com/terror/2014/09/08/video.mp4> puml:dashed <http://blog.hgtv.com/terror/2014/09/08/video.mp4#t=30,31>.

10.5.1 Annotated Frame Diagram

SIMMO-annot-frame.png

The dashed arrow dct:hasPart says that the frame (fragment) is part of the video. It is optional: it allows direct access to the annotated frames, but is redundant.

11 Content Translation

Scenario: we have a SIMMO in original language ES that is machine-translated to EN.

  • All textual elements are translated: title, description, body, subject & keywords
  • However, video ASR text is not translated.
  • Both original and translations are annotated with NIF (NLP and NER).

We want to record all NIF information against original and translated separately, so there's no confusion. If the article includes multimedia, we want to attach it only to the original, to avoid data duplication.

Solution: we need separate roots (foaf:Document), so we store the original and translation(s) in separate named graphs. The translated content has link bibo:translationOf to the original

11.1 Aside: Translation Properties

Before we settled on bibo:translationOf, we considered some other options

FREME NIF Tutorial:

  • slide 16 uses its:target to point to target (translated) text of a nif:String, but we need to make further statements about the translated text
  • slide 18 shows an idea how to represent translated text as an independent document, but uses a made-up property its:translatedAs

The OntoLex vartrans module suggests 5 ways to represent translation. But all of them put us firmly in OntoLex land:

  • the senses in source and target language share a reference to a shared concept
  • class vartrans:Translation with properties vartrans:source and vartrans:target pointing the source and target sense
  • property vartrans:translation that points from source to target sense
  • property vartrans:translatableAs that points from source to target lexical entry
  • class vartrans:TranslationSet that points to a number of vartrans:member vartrans:Translation instances

We could try to use PROV, but there are no specific enough properties for "translation"

  • prov:hadPrimarySource is the only property that mentions "translation"
  • nif:wasConvertedFrom is a subprop of prov:wasDerivedFrom

11.2 Representing Translation

We represent translation as follows. The statements of the original and translated SIMMO are stored in separate named graphs, which is not shown in the Turtle below

./img/translation.ttl

@base <http://data.multisensorproject.eu/content/>.

# ES original: foaf:Document
ms-content:156e0d a foaf:Document ;
    dbp:countryCode  "ES" ;
    dc:creator       "Alberto Iglesias Fraga" ;
    dc:date          "2016-07-28T23:45:07.000+02:00"^^xsd:dateTime ;
    dc:description   "SONY ha iniciado negociaciones con Murata Manufacturing...";
    dc:language      "es" ;
    dc:source        "cloud.ticbeat.com" ;
    dc:subject       "Economía, Negocios y Finanzas" ;
    dc:title         "SONY se desprenderá de su negocio de baterías" ;
    dc:type          "article" ;
    schema:keywords  "Sony, baterías, Murata Manufacturing";
    dct:source       <http://feedproxy.google.com/...> .

# EN translation: foaf:Document 
ms-content:156e0d-en a foaf:Document ;
  bibo:translationOf ms-content:156e0d; # IMPORTANT!
  dbp:countryCode   "ES" ;
  dc:creator        "Alberto Iglesias Fraga" ;
  dc:date           "2016-07-28T23:45:07.000+02:00"^^xsd:dateTime ;
  dc:description    "SONY has begun negotiations with Murata Manufacturing..." ;
  dc:language       "en" ;
  dc:source         "cloud.ticbeat.com" ;
  dc:subject        "Economy, Business & Finance" ;
  dc:title          "SONY is clear from its battery business" ;
  dc:type           "article" ;
  schema:keywords   "Sony, batteries, Murata Manufacturing";
  dct:source        <http://feedproxy.google.com/...> .

# ES original: nif:Context
<156e0d#char=0,2131> a nif:Context ;
    ms:fluency       "1.22"^^xsd:double ;
    ms:richness      "1.86"^^xsd:double ;
    ms:technicality  "2.78"^^xsd:double ;
    nif:beginIndex   "0"^^xsd:nonNegativeInteger ;
    nif:endIndex     "2131"^^xsd:nonNegativeInteger ;
    nif:isString     "SONY se desprenderá de su negocio de baterías..." ;
    nif:sourceUrl    ms-content:156e0d .

# EN translation: nif:Context
<156e0d-en#char=0,1800> a nif:Context ;
  ms:fluency       "1.25"^^xsd:double ; # hopefully will be similar to original, but won't be identical
  ms:richness      "1.81"^^xsd:double ;
  ms:technicality  "2.70"^^xsd:double ;
  nif:beginIndex   "0"^^xsd:nonNegativeInteger ;
  nif:endIndex     "1800"^^xsd:nonNegativeInteger ; # Assuming EN comes out shorter than ES
  nif:isString     "SONY is clear from its battery business..." ;
  nif:sourceUrl    ms-content:156e0d-en .

# ES original: some NLP
<156e0d#char=1199,1224> a nif:Phrase ;
  nif:anchorOf           "batería de iones de litio" ;
  nif:beginIndex         "1199"^^xsd:nonNegativeInteger ;
  nif:endIndex           "1224"^^xsd:nonNegativeInteger ;
  nif:referenceContext   <156e0d#char=0,2131> ;
  nif-ann:taIdentConf    "1.0"^^xsd:double ;
  nif-ann:taIdentProv    <http://babelfy.org/> ;
  its:taClassRef         ms:GenericConcept ;
  its:taIdentRef         bn:s01289274n .

# EN translation: some NLP
<156e0d-en#char=1100,1119> a nif:Phrase ;
  nif:anchorOf           "lithium ion battery" ;
  nif:beginIndex         "1100"^^xsd:nonNegativeInteger ;
  nif:endIndex           "1119"^^xsd:nonNegativeInteger ;
  nif:referenceContext   <156e0d-en#char=0,1800> ;
  nif-ann:taIdentConf    "1.0"^^xsd:double ;
  nif-ann:taIdentProv    <http://babelfy.org/> ;
  its:taClassRef         ms:GenericConcept ;
  its:taIdentRef         bn:s01289274n .

# Babelnet labels: stored in the default graph
bn:s01289274n skos:prefLabel "LiIon"@de, "Li-ion cell"@en, "Batteries lithium-ion"@fr, "Литиево-йонна батерия"@bg .

# Multimedia is only present with the original-content
ms-content:156e0d dct:hasPart 
  <http://cloud.ticbeat.com/2016/07/sony-baterias-explosion.mp4>,
  <http://cloud.ticbeat.com/2016/07/sony-bateria.jpg>.

<http://cloud.ticbeat.com/2016/07/sony-baterias-explosion.mp4> 
  a dctype:MovingImage;
  dc:format "video/mp4".

<http://cloud.ticbeat.com/2016/07/sony-bateria.jpg> 
  a dctype:StillImage;
  dc:format "image/jpeg".

####################
bibo:translationOf         puml:arrow puml:left.
dct:source                 puml:arrow puml:up.
nif:sourceUrl              puml:arrow puml:up.
nif:referenceContext       puml:arrow puml:up.
<http://babelfy.org/>      a puml:Inline.
ms:GenericConcept          a puml:Inline.
ms-content:156e0d          puml:stereotype "<<(S,green)Spanish>>".
<156e0d#char=1199,1224>    puml:stereotype "<<(S,green)Spanish>>".
<156e0d#char=0,2131>       puml:stereotype "<<(S,green)Spanish>>".
ms-content:156e0d-en       puml:stereotype "<<(E,red)English>>".
<156e0d-en#char=0,1800>    puml:stereotype "<<(E,red)English>>".
<156e0d-en#char=1100,1119> puml:stereotype "<<(E,red)English>>".

11.2.1 Translation Diagram

translation.png

11.2.2 Content Alignment

The Content Alignment Pipeline (CAP) is a service that finds SIMMOs similar or contradictory to a source SIMMO. It is not part of the main pipeline that produces SIMMO annotations, rather it is called periodically, searches globally in the SIMMO DB, and exploits SIMMO annotations to compute alignment.

We store CAP statements as oa:Annotation in their own graph http://data.multisensorproject.eu/CAP (outside of any SIMMO graph). We use custom motivations (oa:motivatedBy): ms:linking-similar vs ms:linking-contradictory to express similarity vs contradiction, and a property ms:score to express the strength of the judgement:

<http://data.multisensorproject.eu/agent/CAP> a prov:SoftwareAgent;
  foaf:name "Content Alignment Pipeline v1.0".

ms:linking-similar a owl:NamedIndividual, oa:Motivation;
  skos:inScheme oa:motivationScheme;
  skos:broader oa:linking;
  skos:prefLabel "linking-similar"@en;
  rdfs:comment "Motivation that represents a symmetric link between two *similar* articles"@en;
  rdfs:isDefinedBy ms: .

ms:linking-contradictory a owl:NamedIndividual, oa:Motivation;
  skos:inScheme oa:motivationScheme;
  skos:broader oa:linking;
  skos:prefLabel "linking-contradictory"@en;
  rdfs:comment "Motivation that represents a symmetric link between two *contradictory* articles"@en;
  rdfs:isDefinedBy ms: .

ms:score a owl:DatatypeProperty;
  rdfs:domain oa:Annotation;
  rdfs:range  xsd:decimal;
  rdfs:label "score"@en;
  rdfs:comment "Strength of an Annotation, eg the link between two entities"@en;
  rdfs:isDefinedBy ms: .

11.3 Content Alignment Representation

Assume there is a pair of similar SIMMOs, and another pair of contradictory SIMMOs. We restructure it as follows, where CAP/123 and CAP/124 are sequential numbers or some other way to generate unique URLs. Please note that the representation is completely symmetric regarding the two SIMMOs involved.

<http://data.multisensorproject.eu/CAP/123> a oa:Annotation;
  oa:hasBody ms-content:e9c304, ms-content:611047;
  ms:score 1.645;
  oa:motivatedBy ms:linking-similar;
  oa:annotatedBy <http://data.multisensorproject.eu/agent/CAP>;
  oa:annotatedAt "2016-01-11T12:00:00"^^xsd:dateTime .

<http://data.multisensorproject.eu/CAP/124> a oa:Annotation;
  oa:hasBody ms-content:e9c304, ms-content:f5c719;
  ms:score 1.987;
  oa:motivatedBy ms:linking-contradictory;
  oa:annotatedBy <http://data.multisensorproject.eu/agent/CAP>;
  oa:annotatedAt "2016-01-12T12:00:00"^^xsd:dateTime .

#############
oa:hasBody puml:arrow puml:up.
#oa:annotatedBy puml:arrow puml:up.

11.4 Content Alignment Diagram

CAP.png

12 Multisensor Social Data

SMAP is a Multisensor module that does network analysis over social networks. It gets some tweets about a topic (based on keywords or hashtags), and then determines the importance of various posters on these topics. We use the SIOC ontology to represent posters and topics, and then some custom properties:

ms:has_page_rank a owl:DatatypeProperty;
  rdfs:domain sioc:Role;
  rdfs:label "Has page rank"@en;
  rdfs:comment "Centrality and importance of a poster on particular topic"@en;
  rdfs:isDefinedBy ms: .

ms:has_reachability a owl:DatatypeProperty;
  rdfs:domain sioc:Role;
  rdfs:label "Has reachability"@en;
  rdfs:comment "Level of linking of a poster on particular topic"@en;
  rdfs:isDefinedBy ms: .

ms:has_global_influence: a owl:DatatypeProperty;
  rdfs:domain sioc:Role;
  rdfs:label "Has global influence"@en;
  rdfs:comment "Global influence of a poster on particular topic. A combination of page rank and reachability"@en;
  rdfs:isDefinedBy ms: .

12.1 Topic Based on Single Keywords

Assume the following:

  • We crawled two sets of tweets based on two keywords: "cars" and "RDF"
  • The first guy (valexiev1) has posted on both topics. He knows a bit about "cars" but a lot about "RDF"
  • The second guy (johnSmith) has posted only on the topic "cars", and he knows a lot about it

We represent it as follows:

  • We use a namespace ms-soc where we put Social Network data
  • We represent topics as sioc:Forum, posters as sioc:UserAccount, and poster on a topic as sioc:Role
  • We form node URLs from keywords and poster account names. If these include spaces or other bad chars, replace with "_"
  • Keywords are strings, so we use dc:subject to express them

./img/SMAP-example.ttl

ms-soc:cars a sioc:Forum;
  sioc:has_host twitter:;
  dc:subject "cars".

ms-soc:RDF a sioc:Forum;
  sioc:has_host twitter:;
  dc:subject "RDF".

# The first guy has posted on both topics. He knows a bit about "cars" but a lot about "RDF"
twitter:valexiev1 a sioc:UserAccount;
  sioc:has_function ms-soc:cars_valexiev1, ms-soc:RDF_valexiev1.

ms-soc:cars_valexiev1 a sioc:Role;
  sioc:has_scope ms-soc:cars;
  ms:has_page_rank 0.75;
  ms:has_reachability 0.70;
  ms:has_global_influence 0.72.

ms-soc:RDF_valexiev1 a sioc:Role;
  sioc:has_scope ms-soc:RDF;
  ms:has_page_rank 7500.0;
  ms:has_reachability 7000.0;
  ms:has_global_influence 7200.0.

# The second guy has posted only on the "cars" topic. He knows a lot about "cars"
twitter:johnSmith a sioc:UserAccount;
  sioc:has_function ms-soc:cars_johnSmith.

ms-soc:cars_johnSmith a sioc:Role;
  sioc:has_scope ms-soc:cars;
  ms:has_page_rank 8.5;
  ms:has_reachability 8.0;
  ms:has_global_influence 8.2.

12.1.1 Single Keywords Diagram

SMAP-example.png

The graph allows a journalist to compare the importance of the same poster across keywords.

12.2 Topic Based on Multiple Hashtags

Assume the following:

  • We crawled one set of tweets based on multiple hashtags. We assume no special chars are present in the hashtags.
  • We don't have the user names, only user IDs. Note: the user name (eg @UNFCCC) can be extracted from the ID (eg user_id=17463923)

We use the following representation

  • We represent topics as sioc:Forum, posters as sioc:UserAccount, and poster on a topic as sioc:Role (same as before)
  • We make the topic URLs by sorting and concatenating tags, separated with "." (a bit too long but works)
  • We make poster on topic URLs by concatenating the tags and account user ID, separated with "_"
  • Hashtags are resources (separately addressable), so we use dct:subject to express them
  • We put each hashtag in a separate dct:subject. This would allow someone to analyze topic intersection
  • For now we use just one named graph, with URL ms-soc:

./img/SMAP-example2.ttl

# topic URL made from sorted concatenated tags
ms-soc:civilengineering.dishwasher.energy_crisis.energy_policy.renewable.foodmanufacturing.homeappliances a sioc:Forum;
  sioc:has_host twitter:;
  dct:subject twitter_tag:civilengineering, twitter_tag:dishwasher, 
    twitter_tag:energy_crisis, twitter_tag:policy_energy, twitter_tag:renewable,
    twitter_tag:foodmanufacturing, twitter_tag:homeappliances.

# poster on topic URL made from topic concatenated with user id
twitter_user:17463923 a sioc:UserAccount;
  sioc:has_function ms-soc:civilengineering.dishwasher.energy_crisis.energy_policy.renewable.foodmanufacturing.homeappliances_17463923.

ms-soc:civilengineering.dishwasher.energy_crisis.energy_policy.renewable.foodmanufacturing.homeappliances_17463923 a sioc:Role;
  sioc:has_scope ms-soc:civilengineering.dishwasher.energy_crisis.energy_policy.renewable.foodmanufacturing.homeappliances;
  ms:has_pagerank 0.0206942;
  ms:has_reachability 386.05;
  ms:has_overall_influence 0.143948.

twitter_user:243236419 a sioc:UserAccount;
  sioc:has_function ms-soc:civilengineering.dishwasher.energy_crisis.energy_policy.renewable.foodmanufacturing.homeappliances_243236419.

ms-soc:civilengineering.dishwasher.energy_crisis.energy_policy.renewable.foodmanufacturing.homeappliances_243236419 a sioc:Role;
  sioc:has_scope ms-soc:civilengineering.dishwasher.energy_crisis.energy_policy.renewable.foodmanufacturing.homeappliances;
  ms:has_pagerank 0.020222;
  ms:has_reachability 434.15;
  ms:has_overall_influence 0.143038.

##################
dct:subject a puml:InlineProperty.

As an enhancement, we could split into more coherent hashtag groups, and do separate analyses. Eg:

  • energy_crisis.energy_policy.renewable vs
  • dishwasher.homeappliances

12.2.1 Multiple Hashtags Diagram

SMAP-example2.png

12.3 Tweets Related to Article

Assume we can collect tweets related to a crawled article (SIMMO):

We can express the tweet as sioc:Post. We'll express just basic data:

  • sioc:content: tweet text
  • sioc:has_creator: http://twitter.com/MSR_Future (or if we don't have access to the user name, we can use the user id just like above).
  • dct:date: date posted

./img/SMAP-tweet.ttl

<http://twitter.com/MSR_Future/status/605786079153627136> a sioc:Post;
  sioc:has_creator <http://twitter.com/MSR_Future>;
  sioc:content """@UNFCCC @EnergiewendeGER
   That's great, just a shame it does not
   translate into lower CO2. #Energiewende""";
  dct:date "2015-06-02T20:20:00"^^xsd:dateTime;
  sioc:about ms-content:939468.

<http://twitter.com/MSR_Future> a sioc:UserAccount.

####################
<http://twitter.com/MSR_Future> a puml:Inline.

Possible extensions:

  • Extract the hashtags and mentions from the tweet
  • If we start sourcing Posts from other places (eg Facebook), we should link the Post and UserAccount to twitter: as a sioc:Forum or sioc:Site.
  • If we want to express more diverse relations than a general sioc:about, we can use OA (see sec *Open Annotation) and oa:motivatedBy. The SIMMO will be the target of annotation, and the tweet will be the body.

12.3.1 Tweets Diagram

SMAP-tweet.png

13 Sentiment Analysis

MS uses MARL for representing sentiments, and a few more extension properties:

ms:negativePolarityValue a owl:DatatypeProperty;
  rdfs:domain marl:Opinion;
  rdfs:label "Most negative polarity value"@en;
  rdfs:comment "A sentence may include both negative and positive polarity words: this records the most negative polarity"@en;
  rdfs:isDefinedBy ms: .

ms:positivePolarityValue a owl:DatatypeProperty;
  rdfs:domain marl:Opinion;
  rdfs:label "Most positive polarity value"@en;
  rdfs:comment "A sentence may include both negative and positive polarity words: this records the most positive polarity"@en;
  rdfs:isDefinedBy ms: .

ms:sentimentalityValue a owl:DatatypeProperty;
  rdfs:domain marl:Opinion;
  rdfs:label "Sentimentality value"@en;
  rdfs:comment """How strong is the sentiment.
Sum of the absolute values of the most positive and most negative polarity of an opinion (i.e. the difference between them)"""@en;
  rdfs:isDefinedBy ms: .

13.1 Sentiment of SIMMO and Sentence

The sentiment of the whole SIMMO is represented as follows.

./img/MS-sentiment.ttl

@base <http://data.multisensorproject.eu/content/9e9c304>.

<#char=0,2000>
  nif:opinion <#char=0,2000-sentiment>.
<#char=100,200-sentiment>
  a marl:Opinion;
  marl:minPolarityValue    -5.0; # sentiment range: minimum
  marl:maxPolarityValue     5.0; # sentiment range: maximum
  ms:negativePolarityValue -3.5; # most negative sentiment
  ms:positivePolarityValue  2.0; # most positive sentiment
  marl:polarityValue       -1.5; # sum of the two polar opinions (what's the average sentiment)
  ms:sentimentalityValue    5.5. # difference of the two polar opinions (how strong is the sentiment)

The sentiment of a sentence is structured exactly as above.

<#char=100,200> a nif:Sentence;
  nif:referenceContext <#char=0,2000>;
  nif:opinion <#char=100,200-sentiment>.

<#char=100,200-sentiment>
  a marl:Opinion;
  marl:minPolarityValue    -5.0;
  marl:maxPolarityValue     5.0;
  ms:negativePolarityValue -3.5;
  ms:positivePolarityValue  2.0;
  marl:polarityValue       -1.5;
  ms:sentimentalityValue    5.5.

A sentence without any sentiment-bearing words still carries the range (minPolarityValue and marl:maxPolarityValue), and a neutral sentimentalityValue. negativePolarityValue, ms:positivePolarityValue and marl:polarityValue are omitted:

<#char=200,300>
  a <#char=200,300-sentiment>.

<#char=200,300-sentiment>
  a marl:Opinion;
  marl:minPolarityValue    -5.0;
  marl:maxPolarityValue     5.0;
  ms:sentimentalityValue    0.0.

13.2 Sentiment About Named Entity

We can represent the sentiment about a NE as follows

./img/MS-sentiment-NE.ttl

@base <http://data.multisensorproject.eu/content/9e9c304>.

<#char=0,50> a nif:Context;
  nif:isString "Consumers are very unhappy with the new Porsche".

<#char=43,50> a nif:Word;
  nif:referenceContext <#char=0,50>;
  nif:beginOffset 43;
  nif:endOffset 50;
  nif:anchorOf "Porsche";
  nif:taIdentRef dbr:Porsche.

<#sentiment=dbr_Porsche> a marl:Opinion;
  marl:extractedFrom <#char=0,50>;
  marl:describesObject dbr:Porsche;
  marl:minPolarityValue -5;
  marl:maxPolarityValue 5;
  ms:negativePolarityValue -5;
  ms:positivePolarityValue 0;
  marl:polarityValue -5;
  ms:sentimentalityValue -5.

# If we can pin-point the sentiment-bearing words (in this case "very unhappy"):
<#char=29,38> a nif:Word;
  nif:referenceContext <#char=0,50>;
  nif:beginOffset 29;
  nif:endOffset 38;
  nif:anchorOf "very unhappy".

<#sentiment=dbr_Porsche> 
  marl:opinionText <#char=29,38>.

14 Text Characteristics

MS has statistical algorithms that can compute a few general characteristics of the text:

ms:fluency a owl:DatatypeProperty;
  rdfs:domain nif:Context;
  rdfs:label "Text fluency"@en;
  rdfs:comment """How fluent is the text.
The text should build from sentence to sentence to a coherent body of information about a topic. 
Consecutive sentences should be meaningfully connected. Paragraphs should be written in a logical sequence. 
The best way to achieve fluency is to have at least one object or subject repeatedly mentioned in consecutive units of text. """@en;
  rdfs:comment "Range: 1.0 ... 5.0"@en;
  rdfs:isDefinedBy ms: .

ms:richness a owl:DatatypeProperty;
  rdfs:domain nif:Context;
  rdfs:label "Text richness"@en;
  rdfs:comment """How rich is the text. 
Is the writing style and vocabulary used rich and interesting, as opposed to plain and straightforward?"""@en;
  rdfs:comment "Range: 1.0 ... 5.0"@en;
  rdfs:isDefinedBy ms: .

ms:technicality a owl:DatatypeProperty;
  rdfs:domain nif:Context;
  rdfs:label "Text technicality"@en;
  rdfs:comment """How technical is the text. 
Are the topic and used language technical? 
Does it require effort and previous knowledge to understand the text?"""@en;
  rdfs:comment "Range: 1.0 ... 5.0"@en;
  rdfs:isDefinedBy ms: .

The representation for a an example SIMMO is shown below This particular text is fluent, but is neither rich nor technical, so we set values 5, 1, 1 respectively.

./img/MS-text-characteristics.ttl

@base <http://data.multisensorproject.eu/content/9e9c304>.

<#char=0,2000> a nif:Context;
  nif:isString "This is the whole text of the SIMMO.\n It should continue for 2000 chars but I'll stop here"@en;
  ms:fluency      5.0;
  ms:richness     1.0;
  ms:technicality 1.0.

15 SIMMO Quality

Some of the SIMMOs have manual quality judgements. We use the W3C Data Quality Vocabulary (dqv:QualityAnnotation) and the Linked Data Quality Dimensions ldqd: by Zaveri to represent this.

First we define some MS nominal quality values:

ms:Accuracy a skos:ConceptScheme;
  rdfs:label "Accuracy values"@en.
ms:accuracy-low a skos:Concept; skos:inScheme ms:Accuracy;
  skos:prefLabel "Low accuracy"@en.
ms:accuracy-medium a skos:Concept; skos:inScheme ms:Accuracy;
  skos:prefLabel "Medium accuracy"@en.
ms:accuracy-high a skos:Concept; skos:inScheme ms:Accuracy;
  skos:prefLabel "High accuracy"@en.
ms:accuracy-curated a skos:Concept; skos:inScheme ms:Accuracy;
  skos:prefLabel "Manually curated"@en;
  skos:note "Highest accuracy"@en.

A SIMMO with quality rating is represented like this:

./img/quality.ttl

ms-content:b3f35 dqv:hasQualityAnnotation ms-content:b3f35-quality.
ms-content:b3f35-quality a dqv:QualityAnnotation;
  dqv:inDimension ldqd:semanticAccuracy;
  oa:motivatedBy dqv:qualityAssessment;
  oa:hasTarget ms-content:b3f35;
  oa:hasBody ms:accuracy-curated.

Note: initially we tried to use dqv:QualityMeasurement but it's for quantitative not qualitative (nominal) values:

ms-content:b3f36 dqv:hasQualityMeasurement ms-content:b3f36-quality.
ms-content:b3f36-quality a dqv:QualityMeasurement ;
   dqv:isMeasurementOf ms:accuracy; dqv:value ms:accuracy-curated.

####################
oa:hasTarget puml:arrow puml:up.
ms-content:b3f36-quality puml:stereotype "<<(W,red)Wrong>>".

15.1 SIMMO Quality Diagram

quality.png

16 Statistical Data for Decision Support

MS Use Case 3 is about decision support for international export. It employs various statistical data, including:

  • EuroStat and WorldBank statistics on demography and consumer behavior
  • UN ComTrade statistics on inter-country trade
  • Google info on distances between country capitals

This data is used to rank potential export countries. The full set of indicators is described in MULTISENSOR Export trading Decison Support System specification.

We use the W3C CUBE ontology to describe all of these.

16.1 EuroStat Statistics

We obtained EuroStat data from http://eurostat.linked-statistics.org/.

Eg below is the representation of a data point about indicator teibs10 "Economic Sentiment" for Cyprus for Oct 2014.

<http://eurostat.linked-statistics.org/data/teibs010#M,CY,2014M10>
  a qb:Observation ;
  qb:dataSet eudata:teibs010 ;
  sdmx-dimension:freq sdmx-code:freq-M ;
  sdmx-dimension:timePeriod "2014-10-01"^^xsd:date ;
  property:geo eugeo:CY ;
  sdmx-measure:obsValue 98.8 ;
  ms:normalized 0.45 .

ms:normalized represents a normalized value from 0 to 1, taking into account the maximum value for that indicator. It is used to make different indicators comparable when displayed on a single chart.

16.2 WorldBank Statistics

We obtained WorldBank data from http://worldbank.270a.info.

Eg below is the representation of a data point about indicator IC.BUS.EASE.XQ "Ease of doing business". It ranks economies from 1 to 189, with first place being the best. A low numerical rank means that the regulatory environment is conducive to business operation. The value below means that Spain was in 44th place for 2012.

<http://worldbank.270a.info/dataset/world-bank-indicators/IC.BUS.EASE.XQ/ES/2012>
  a qb:Observation ;
  qb:dataSet <http://worldbank.270a.info/dataset/IC.BUS.EASE.XQ> ;
  property:indicator indicator:IC.BUS.EASE.XQ .
  sdmx-dimension:refArea country:ES ;
  sdmx-dimension:refPeriod <http://reference.data.gov.uk/id/year/2012> ;
  property:decimal 0.0 ;
  sdmx-measure:obsValue 44.0 ;
  ms:normalized 0.233 .

Compared to EuroStat, WorldBank uses:

  • Different property names for geographic area and time period
  • Different URLs for geographic areas
  • URLs instead of XSD dates/years for time periods (as defined by data.gov.uk), which necessitates extracting the value from the URL

These differences are accommodated easily with appropriate queries.

16.3 ComTrade Statistics

We extracted ComTrade data as numerous CSV from http://comtrade.un.org/data/ (that API imposes restrictions on the number of records that can be fetched at one time).

We obtain commodity codes from the ComTrade data API and convert them to skos:Concepts using the following pattern:

ms-comtr-commod: a skos:ConceptScheme;
  rdfs:label "ComTrade HS Commodity Codes at level 1".
ms-comtr-commod:2 a skos:Concept; skos:inScheme ms-comtr-commod:;
  skos:prefLabel "Meat and edible meat offal".

We describe relevant trade indicators as skos:Concepts:

ms-comtr-indic: a skos:ConceptScheme;
  rdfs:label "UN ComTrade indicators: import, export, total, balance".

ms-comtr-indic:IMP a skos:Concept; skos:inScheme ms-comtr-indic:; skos:prefLabel "Import".
ms-comtr-indic:EXP a skos:Concept; skos:inScheme ms-comtr-indic:; skos:prefLabel "Export".

We model statistical data similar to EuroStat. Eg the observations below describe Import/Export volume between Austria and Belgium for 2012 for commodity code 2 "Meat and edible meat offal", in Kilograms and USD:

./img/stat-comtrade.ttl

# Import
ms-comtr-dat:2012-IMP-AU-BE-2-USD a qb:Observation;
  qb:dataSet ms-comtr-dat: ;
  sdmx-dimension:timePeriod "2012"^^xsd:gYear;
  sdmx-dimension:freq sdmx-code:freq-A;
  ms-comtr:indic ms-comtr-indic:IMP;
  ms-comtr:commod ms-comtr-commod:2;
  prop:geo eugeo:AT;
  prop:partner eugeo:BE;
  prop:unit unit:USD;
  sdmx-measure:obsValue 123.

# Export
ms-comtr-dat:2012-EXP-AU-BE-2-USD a qb:Observation;
  qb:dataSet ms-comtr-dat: ;
  sdmx-dimension:timePeriod "2012"^^xsd:gYear;
  sdmx-dimension:freq sdmx-code:freq-A;
  ms-comtr:indic ms-comtr-indic:EXP;
  ms-comtr:commod ms-comtr-commod:2;
  prop:geo eugeo:AT;
  prop:partner eugeo:BE;
  prop:unit unit:USD;
  sdmx-measure:obsValue 456.

# If there is a value for Kilograms: Import
ms-comtr-dat:2012-IMP-AU-BE-2-KG a qb:Observation;
  qb:dataSet ms-comtr-dat: ;
  sdmx-dimension:timePeriod "2012"^^xsd:gYear;
  sdmx-dimension:freq sdmx-code:freq-A;
  ms-comtr:indic ms-comtr-indic:IMP;
  ms-comtr:commod ms-comtr-commod:2;
  prop:geo eugeo:AT;
  prop:partner eugeo:BE;
  prop:unit unit:KG;
  sdmx-measure:obsValue 12.3.

#########################
ms-comtr-dat:2012-IMP-AU-BE-2-USD puml:stereotype "Import USD".
ms-comtr-dat:2012-EXP-AU-BE-2-USD puml:stereotype "Export USD".
ms-comtr-dat:2012-IMP-AU-BE-2-KG  puml:stereotype "Import KG".
qb:dataSet puml:arrow puml:up.
sdmx-code:freq-A a puml:Inline.
eugeo:AT a puml:Inline.
eugeo:BE a puml:Inline.
unit:KG a puml:Inline.
unit:USD a puml:Inline.

16.3.1 ComTrade Diagram

stat-comtrade.png

16.3.2 ComTrade Derived Indicators

We define two derived indicators:

ms-comtr-indic:TOT a skos:Concept; skos:inScheme ms-comtr-indic:; skos:prefLabel "Total";  
  skos:scopeNote "Total trade volume = Export + Import".
ms-comtr-indic:BAL a skos:Concept; skos:inScheme ms-comtr-indic:; skos:prefLabel "Balance";
  skos:scopeNote "Balance of trade   = Export - Import".

We calculate them from source ComTrade indicators using simple SPARQL Update queries like this:

insert {
  ?total a qb:Observation;
    qb:dataSet ms-comtr-dat: ;
    sdmx-dimension:timePeriod ?time;
    sdmx-dimension:freq ?freq;
    ms-comtr:indic ms-comtr-indic:TOT;
    ms-comtr:commod ?commod;
    prop:geo ?geo;
    prop:partner ?partner;
    prop:unit ?unit;
    sdmx-measure:obsValue ?tot.
  ?balance a qb:Observation;
    qb:dataSet ms-comtr-dat: ;
    sdmx-dimension:timePeriod ?time;
    sdmx-dimension:freq ?freq;
    ms-comtr:indic ms-comtr-indic:BAL;
    ms-comtr:commod ?commod;
    prop:geo ?geo;
    prop:partner ?partner;
    prop:unit ?unit;
    sdmx-measure:obsValue ?bal
} where {
  ?export a qb:Observation;
    qb:dataSet ms-comtr-dat: ;
    sdmx-dimension:timePeriod ?time;
    sdmx-dimension:freq ?freq;
    ms-comtr:indic ms-comtr-indic:EXP;
    ms-comtr:commod ?commod;
    prop:geo ?geo;
    prop:partner ?partner;
    prop:unit ?unit;
    sdmx-measure:obsValue ?exp.
  ?import a qb:Observation;
    qb:dataSet ms-comtr-dat: ;
    sdmx-dimension:timePeriod ?time;
    sdmx-dimension:freq ?freq;
    ms-comtr:indic ms-comtr-indic:IMP;
    ms-comtr:commod ?commod;
    prop:geo ?geo;
    prop:partner ?partner;
    prop:unit ?unit;
    sdmx-measure:obsValue ?imp.
  bind (replace(?export,"-EXP-","-TOT-") as ?total)
  bind (replace(?export,"-EXP-","-BAL-") as ?balance)
  bind (?exp + ?imp as ?tot).
  bind (?exp - ?imp as ?bal).
}

16.4 Distance Data

We obtained distance and travel time data from the Google Maps API, then converted it according to the model below. Eg the first record shows that the distance by road from Spain to Austria (in fact between the capitals: from Madrid to Vienna) is 2401km, and the travel time is 22:09h (79722sec):

./img/stat-distance.ttl

ms-distance-dat:ES-AT a qb:Observation;
  qb:dataSet ms-distance:google-distance;
  prop:geo eugeo:ES;
  prop:partner eugeo:AT;
  ms:distance ms-distance:ES-AT-2401185.                 

ms-distance:ES-AT-2401185 
  ms:distance 2401185;
  rdfs:label "2,401 km";
  ms:duration 79722;
  ms:durationLabel "22 hours 9 mins".

ms-distance-dat:ES-BE a qb:Observation;
  qb:dataSet ms-distance:google-distance;
  prop:geo eugeo:ES;
  prop:partner eugeo:BE;
  ms:distance ms-distance:ES-BE-1580489.         

ms-distance:ES-BE-1580489 
  ms:distance 1580489;
  rdfs:label "1,580 km";
  ms:duration 52262;
  ms:durationLabel "14 hours 31 mins".

##################
qb:dataSet puml:arrow puml:up.

16.4.1 Distance Data Diagram

stat-distance.png

Date: <2016-11-02>

Author: Vladimir Alexiev

Created: 2016-11-02 Wed 13:30

Emacs 25.0.50.1 (Org mode 8.2.10)

Validate