Vladimir Alexiev, Ontotext Corp
Multisensor meeting, 2014-10-08, Bonn, Germany
(HTML,slideshare)
Proudly made in plain text with reveal.js, org-reveal, org-mode and emacs.
There's been a flurry of activity in recent years to represent NLP data as RDF.
NLP data is usually large, so why represent it in RDF?
Intro: Christian Chiarcos, John McCrae, Philipp Cimiano, and Christiane Fellbaum. Towards Open Data for Linguistics: Linguistic Linked Data. In New Trends of Research in Ontologies and Lexical Resources. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg, 2013.
Collaborative bibliography on Linguistic LOD: representing language resources and text annotations as RDF.
Detailed example of annotating one sentence: Turtle, highlighted.
Areas covered include:
Example based on Guardian's article "Goodbye Nuclear Power" with LinguaTec NER: Turtle, highlighted.
Compare to JSONLD or JSONLD without prefixes
We describe briefly the following linguistic ontologies
OLIA includes 34 annotation models (tagsets) for 69 languages
<#Germany-1> nif:oliaLink penn:NNP; nif:oliaClass penn:ProperNoun.
<#is-2> nif:oliaLink penn:VBZ; nif:oliaClass penn:BePresentTense.
<#the-3> nif:oliaLink penn:DT; nif:oliaClass penn:Determiner.
X-link.owl abstracts over X.owl by providing OLIA subclasses/subproperties, eg
<#Germany-1> nif:oliaClass olia:ProperNoun.
OLIA abstraction doesn't work perfectly in all cases, eg
penn:Determiner doesn't have an OLIA mapping: "Not clear whether this corresponds to OLiA/EAGLES determiners"
penn:BePresentTense is mapped to a unionOf (olia:FiniteVerb, olia:StrictAuxiliaryVerb) plus a restriction of olia:hasTense to type olia:Present
<#is-2> nif:oliaClass
  [a owl:Class;
   rdfs:subClassOf
     [a owl:Restriction; owl:onProperty olia:hasTense; owl:allValuesFrom olia:Present],
     [owl:unionOf (olia:FiniteVerb olia:StrictAuxiliaryVerb)]].
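The abstraction step can be sketched in Python. This is only an illustration: the table `PENN_TO_OLIA` is a hypothetical, hand-written stand-in for what the linking model provides and lists only the one mapping named above (penn:ProperNoun to olia:ProperNoun). The function names are assumptions, and keeping the annotation-model class when no mapping exists is one possible design choice, not something OLIA prescribes.

```python
# Hypothetical stand-in for the Penn linking model. Only the mapping shown
# in the slides is listed; a real table would be read from the OWL file.
PENN_TO_OLIA = {
    "penn:ProperNoun": "olia:ProperNoun",
}


def abstract_class(model_class, linking=PENN_TO_OLIA):
    """Return the OLIA reference class, or the original class if unmapped."""
    return linking.get(model_class, model_class)


def olia_class_statement(token_id, model_class):
    """Build one Turtle statement for a token, abstracted where possible."""
    return "<#%s> nif:oliaClass %s." % (token_id, abstract_class(model_class))


if __name__ == "__main__":
    print(olia_class_statement("Germany-1", "penn:ProperNoun"))
    print(olia_class_statement("the-3", "penn:Determiner"))
```

With this fallback, tokens such as the-3 keep their Penn class, so no annotation is lost when the mapping is incomplete.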
Ontology | Class | ObjProp | DataProp | Description |
olia_system | 6 | 3 | 6 | Feature, LinguisticAnnotation, Relation, UnitOfAnnotation, hasTag, hasTier |
olia_top | 62 | | | Top categories of the OLiA model |
olia | 857 | 50 | Full OLiA model |
Class: penn:BePresentTense
  SubClassOf: olia:hasTense only olia:Present,
    (olia:FiniteVerb or olia:StrictAuxiliaryVerb)
<#Germany-1> nif:oliaLink penn:NNP; nif:oliaClass penn:ProperNoun.
<#is-2> nif:oliaLink penn:VBZ; nif:oliaClass penn:BePresentTense.
<#the-3> nif:oliaLink penn:DT; nif:oliaClass penn:Determiner.
<#work-4> nif:oliaLink penn:NN; nif:oliaClass penn:CommonNoun.
<#horse-5> nif:oliaLink penn:NN; nif:oliaClass penn:CommonNoun.
<#of-6> nif:oliaLink penn:IN; nif:oliaClass penn:PrepositionOrSubordinatingConjunction.
<#the-7> nif:oliaLink penn:DT; nif:oliaClass penn:Determiner.
<#European-8> nif:oliaLink penn:NNP; nif:oliaClass penn:ProperNoun.
<#Union-9> nif:oliaLink penn:NNP; nif:oliaClass penn:ProperNoun.
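Statements of this shape are easy to generate from tagger output. A minimal Python sketch, assuming the tagger already delivers (word, Penn tag, Penn class) triples in sentence order; the function name `pos_turtle` and the input format are assumptions for illustration, and the fragment identifiers follow the word-position pattern used above.

```python
def pos_turtle(tokens):
    """Build one Turtle statement per token.

    tokens: list of (word, penn_tag, penn_class) in sentence order.
    Token identifiers are <#word-position>, counting from 1.
    """
    lines = []
    for position, (word, tag, penn_class) in enumerate(tokens, start=1):
        lines.append(
            "<#%s-%d> nif:oliaLink penn:%s; nif:oliaClass penn:%s."
            % (word, position, tag, penn_class)
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # First three tokens of the example sentence.
    sample = [
        ("Germany", "NNP", "ProperNoun"),
        ("is", "VBZ", "BePresentTense"),
        ("the", "DT", "Determiner"),
    ]
    print(pos_turtle(sample))
```

The class is passed in rather than derived from the tag, because the example assigns penn:BePresentTense to a VBZ token, which depends on the word and not on the tag alone.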
Represent as nif:dependency. All are subclasses of stanford:DependencyLabel
nsubj(horse-5,Germany-1): a NominalSubject<Subject<Argument<Dependent
cop(horse-5,is-2): a Copula<Auxiliary<Dependent
det(horse-5,the-3): a Determiner<Modifier<Dependent
nn(horse-5,work-4): a NounCompoundModifier<Modifier<Dependent
root(ROOT-0,horse-5): a Root
prep(horse-5,of-6): a PrepositionalModifier<Modifier<Dependent
det(Union-9,the-7): a Determiner<Modifier<Dependent
amod(Union-9,European-8): a AdjectivalModifier<Modifier<Dependent
pobj(of-6,Union-9): a ObjectOfPreposition<Object<Complement<Argument<Dependent
stanford:nsubj a stanford:NominalSubject.
stanford:NominalSubject rdfs:subClassOf* stanford:DependencyLabel.
stanford:DependencyLabel rdfs:subClassOf olia_system:Feature.
<#horse-5> nif:dependency <#Germany-1>. <#Germany-1> a stanford:NominalSubject.
<#horse-5> nif:dependency <#is-2>. <#is-2> a stanford:Copula.
<#horse-5> nif:dependency <#the-3>. <#the-3> a stanford:Determiner.
<#horse-5> nif:dependency <#work-4>. <#work-4> a stanford:NounCompoundModifier.
<#ROOT-0> nif:dependency <#horse-5>. <#horse-5> a stanford:Root.
<#horse-5> nif:dependency <#of-6>. <#of-6> a stanford:PrepositionalModifier.
<#Union-9> nif:dependency <#the-7>. <#the-7> a stanford:Determiner.
<#Union-9> nif:dependency <#European-8>. <#European-8> a stanford:AdjectivalModifier.
<#of-6> nif:dependency <#Union-9>. <#Union-9> a stanford:ObjectOfPreposition.
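The conversion from the parser's label(head,dependent) notation to these statements is mechanical. A Python sketch, assuming each dependency arrives as one string in that notation; the table `LABEL_CLASS` covers only the labels used in the example, and the names `DEP_PATTERN` and `dependency_to_turtle` are assumptions for illustration.

```python
import re

# Labels and classes copied from the example above; not a complete list.
LABEL_CLASS = {
    "nsubj": "NominalSubject",
    "cop": "Copula",
    "det": "Determiner",
    "nn": "NounCompoundModifier",
    "root": "Root",
    "prep": "PrepositionalModifier",
    "amod": "AdjectivalModifier",
    "pobj": "ObjectOfPreposition",
}

# label(head,dependent), eg nsubj(horse-5,Germany-1)
DEP_PATTERN = re.compile(r"^(\w+)\(([^,()]+),\s*([^,()]+)\)$")


def dependency_to_turtle(dep):
    """Turn one dependency string into the two statements shown above."""
    match = DEP_PATTERN.match(dep.strip())
    if match is None:
        raise ValueError("not a dependency: %r" % dep)
    label, head, dependent = match.groups()
    # An unknown label raises KeyError; extend LABEL_CLASS as needed.
    return "<#%s> nif:dependency <#%s>. <#%s> a stanford:%s." % (
        head, dependent, dependent, LABEL_CLASS[label])


if __name__ == "__main__":
    for dep in ["nsubj(horse-5,Germany-1)", "root(ROOT-0,horse-5)"]:
        print(dependency_to_turtle(dep))
```

Note that the head carries nif:dependency and the dependent carries the label class, as in the example.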
Internationalization Tag Set (ITS) Version 2.0 is a fairly big W3C spec
We use only the Text Analysis itsrdf: props
taAnnotatorsRef, taConfidence: which software and what confidence
taClassRef: class of annotated text/entity (eg nerd:Company, nerd:PhoneNumber, nerd:Time)
taIdentRef: URL of annotated entity
taSource (eg "Wordnet3.0"), taIdent (eg "301467919"): for entities that are not yet in RDF/resolvable
Common NER types across semantic annotators
IKS (FISE) was the starting point for Apache Stanbol, a framework for semantic content annotation and management.
Stanbol is only as good as the underlying engines
Analogs (but the properties are in different nodes!)
fise:extracted-from | n/a. Points to the word occurrence |
fise:start | nif:beginIndex |
fise:end | nif:endIndex |
fise:selected-text | nif:anchorOf |
fise:entity-type | itsrdf:taClassRef |
fise:entity-reference | itsrdf:taIdentRef |
fise:confidence | itsrdf:taConfidence: number |
fise:confidence-level | none. owl:Individual: suggestion, uncertain, ambiguous, certain |
fise:entity-label | eg rdfs:label on the referenced entity |
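A subset of these analogs can be applied as a simple key renaming. The Python sketch below is an illustration only: it treats an enhancement as a flat dict, which glosses over the fact that the properties sit on different nodes, and it covers only analogs that map one property to one property. The function name `fise_to_nif` and the sample values are hypothetical.

```python
# Subset of the analogs: only one-to-one property renamings are listed.
FISE_TO_NIF = {
    "fise:start": "nif:beginIndex",
    "fise:end": "nif:endIndex",
    "fise:entity-type": "itsrdf:taClassRef",
    "fise:entity-reference": "itsrdf:taIdentRef",
    "fise:confidence": "itsrdf:taConfidence",
}


def fise_to_nif(enhancement):
    """Rename known FISE keys; keys without a listed analog are dropped."""
    return {
        FISE_TO_NIF[key]: value
        for key, value in enhancement.items()
        if key in FISE_TO_NIF
    }


if __name__ == "__main__":
    # Hypothetical enhancement, flattened into one dict for illustration.
    sample = {
        "fise:start": 0,
        "fise:end": 7,
        "fise:entity-reference": "http://dbpedia.org/resource/Germany",
        "fise:confidence": 0.9,
        "fise:confidence-level": "certain",
    }
    print(fise_to_nif(sample))
```

A real converter would also have to merge the nodes that carry the text span and the entity suggestion into one annotated string.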
Sentiment/opinion. Aggregates many opinions (with count), about thing/part/feature
Compare to schema.org Review, Rating, AggregateRating
Representation of websites, folders, pages, forums, postings, users
Ontologies
Thesauri (lists of NLP terms):
Lexicon Model for Ontologies: for representing Wordnets, dictionaries, lexica. See Quick Guide
Extend LEMON with additional features. See Cookbook
LemonGrass (formerly lemon2gf): converter from Lemon lexicon+ontology
W3C community group. Spec draft (wiki, github, html preview).
Modules:
Best practices:
ISO TC37 Data Category Registry (DCR)
curl -L -Haccept:application/rdf+xml http://www.isocat.org/datcat/DC-65
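The same content negotiation can be sketched with the Python standard library. The helper name `rdf_request` is an assumption for illustration. The sketch only builds the request; whether the server still answers with RDF/XML is not checked here.

```python
import urllib.request


def rdf_request(url, media_type="application/rdf+xml"):
    """Build a GET request that asks for RDF via the Accept header."""
    return urllib.request.Request(url, headers={"Accept": media_type})


if __name__ == "__main__":
    request = rdf_request("http://www.isocat.org/datcat/DC-65")
    print(request.full_url, request.get_header("Accept"))
    # To actually fetch (urlopen follows redirects, like curl -L):
    # with urllib.request.urlopen(request) as response:
    #     rdf_xml = response.read()
```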
ontology, extends LEMON
lexinfo:abbreviationFor a owl:ObjectProperty ;
  dcr:datcat <http://www.isocat.org/datcat/DC-65> ;
  rdfs:subPropertyOf lexinfo:contractionFor .
Defines 592 entities:
verb, thirdPerson, vulgarRegister
Verb, VerbPOS, VerbPhrase, Tense
substanceHolonym, synonym, translation, tense, voice
pronunciation, romanization, transliteration
languageSpecific, example
gold.ttl
Defines
OrthographicSystem, ReferentialVoice, Vowel
geneticallyRelated (HumanLanguageVariety), literalTranslation, writtenRealization
abbreviation, phoneticRep, hasExample
Old UI, new UI at DANS (supports Chrome)
In the following slides we describe large-scale Linguistic resources.
Datasets already integrated in FactForge (but old versions):
WordNet: well-known and prototypical lexical resource
http://www.image-net.org: sample images for WordNet
Crowdsourced dictionaries of >300 languages. Eg ancora#Latin at http://en.wiktionary.org:
Dataset that integrates in LEMON format:
Integrates WordNet, Open Multilingual WordNet, Wikipedia, OmegaWiki, Wikidata, Wiktionary
http://babelnet.org/2.0/data/banca_n_IT
bn:banca_n_IT a lemon:LexicalEntry ;
  rdfs:label "banca"@it ;
  lemon:canonicalForm bn:banca_n_IT/canonicalForm ;
  lemon:language "IT" ;
  lemon:sense bn:banca_IT/s03802146n, bn:banca_IT/s00008371n, bn:banca_IT/s00008364n ;
  lexinfo:partOfSpeech lexinfo:noun .
http://babelnet.org/2.0/data/banca_IT/s03802146n
bn:Bank_%28topography%29_EN/s03802146n lexinfo:translation bn:banca_IT/s03802146n .
bn:Bank_%28sea_floor%29_EN/s03802146n lexinfo:translation bn:banca_IT/s03802146n .
bn:banca_IT/s03802146n a lemon:LexicalSense ;
  bn-lemon:byTrans 1 ;
  dc:source <http://wikipedia.org/> ;
  dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
  lemon:reference bn:s03802146n .
Eg ancora#lat at http://babelnet.org (3.0 just came out)
Babelfy: annotation API based on BabelNet
Another NER/annotation service; based on DBpedia labels. Too eager, low precision: