LexVars tutorial

LexVars is a Python package for generating lexical predictors for psycholinguistic studies, with a focus on morphological and contextual variables. This tutorial showcases much of the package's functionality; for comprehensive documentation, please read the docstrings of the individual functions.

CELEX

CELEX is a lexical database that provides rich morphological and syntactic information (Baayen & Piepenbrock, 1995). LexVars provides a convenient interface to two CELEX databases (for English only): the lemma database, which collapses different inflected forms (e.g., running and runs) into a single entry, and the wordform database, which has a separate entry for each form. Let's start by looking at the lemmas. A particular string, e.g. wind, is often mapped to multiple lemmas, which correspond to the word's possible parts of speech or meanings:

In [118]:
import lexvars.celex
c = lexvars.celex.Celex(celex_path)
wind = c.lemma_lookup('wind')
wind
Out[118]:
[<CelexLemma 51690 "wind" (noun)>,
 <CelexLemma 51691 "wind" (noun)>,
 <CelexLemma 51692 "wind" (verb)>,
 <CelexLemma 51693 "wind" (verb)>,
 <CelexLemma 51694 "wind" (verb)>]

To see a complex word's morphological decomposition:

In [119]:
lightbulb = c.lemma_lookup('lightbulb')
lightbulb[0].Parses
Out[119]:
[<CelexMorphParse ['light', 'bulb']>]

CELEX includes a wide range of annotations for each word. Let's list the first three (ordered alphabetically), and check out the description of one of them:

In [120]:
dir(lightbulb[0])[:3]
Out[120]:
['Attr_A', 'Attr_N', 'C_N']
In [121]:
lightbulb[0].help('Attr_A')
Is this lemma an adjective which in some contexts can only be used attributively? (e.g. "sheer" in "sheer nonsense"

Clearly, lightbulb is not an adjective that can only be used attributively:

In [122]:
lightbulb[0].Attr_A
Out[122]:
False

In contrast with lemmas, wordforms are annotated for their inflectional status:

In [123]:
windows = c.wordform_lookup('windows')
windows[0].FlectType
Out[123]:
['plural']

CELEX includes the frequency of each wordform in the COBUILD corpus:

In [124]:
windows[0].Cob
Out[124]:
1212

Let's list all of the wordforms that are associated with the verb lemma for build. The same string often has multiple inflectional analyses, and therefore multiple wordforms; for example, built can be a past form or a participle.

In [125]:
wfs = c.lemma_to_wordforms(c.lemma_lookup('build')[1])
wfs
Out[125]:
[<CelexWordform 10550 "build">,
 <CelexWordform 10555 "building">,
 <CelexWordform 10570 "builds">,
 <CelexWordform 10581 "built">,
 <CelexWordform 102411 "build">,
 <CelexWordform 102418 "built">,
 <CelexWordform 119690 "build">,
 <CelexWordform 119697 "built">,
 <CelexWordform 136600 "build">,
 <CelexWordform 136607 "built">,
 <CelexWordform 152608 "built">]
In [126]:
wfs[-1].FlectType
Out[126]:
['participle', 'past_tense']
In [127]:
wfs[-2].FlectType
Out[127]:
['past_tense', 'plural']
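
Because one string can correspond to several wordform entries, we sometimes want a single frequency per string. The following is a minimal sketch of that bookkeeping, not part of LexVars: ToyWordform is a stand-in class, its orth field is a hypothetical name for the spelling, and the counts and some of the inflection labels are made up for illustration. Whether summing is appropriate depends on how the corpus counts were assigned to each analysis.

```python
from collections import defaultdict, namedtuple

# Stand-in for a CELEX wordform. `orth` is a hypothetical field name, and the
# counts and some inflection labels below are invented toy values.
ToyWordform = namedtuple("ToyWordform", ["orth", "FlectType", "Cob"])

toy_wordforms = [
    ToyWordform("build", ["infinitive"], 40),
    ToyWordform("build", ["present_tense", "plural"], 10),
    ToyWordform("built", ["past_tense"], 25),
    ToyWordform("built", ["participle", "past_tense"], 30),
    ToyWordform("builds", ["present_tense", "singular"], 5),
]


def frequency_by_string(wordforms):
    """Sum the frequencies of all analyses that share the same string."""
    totals = defaultdict(int)
    for wf in wordforms:
        totals[wf.orth] += wf.Cob
    return dict(totals)


string_freqs = frequency_by_string(toy_wordforms)
print(string_freqs)
```

With real data, the same loop could run over the list returned by c.lemma_to_wordforms, provided the attribute that holds the spelling is substituted for orth.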

Morphological families

The derivational family of a lemma consists of all of the words that include that lemma as a morpheme. There are multiple ways to define this family, and each is useful for a different task. Adding the flag right=True accesses the word's right derivational family, which only includes words that have the lemma as their leftmost morpheme. By default, multiword lemmas such as think up are not included in the morphological family, but they can be added to it using the flag include_multiword=True.

In [128]:
import lexvars.lexvars
lv = lexvars.lexvars.LexVars(c)
think_family = lv.derivational_family('think')
think_family
Out[128]:
[<CelexLemma 13412 "doublethink" (noun)>,
 <CelexLemma 38606 "rethink" (verb)>,
 <CelexLemma 47061 "think" (noun)>,
 <CelexLemma 47062 "think" (verb)>,
 <CelexLemma 47063 "thinkable" (adjective)>,
 <CelexLemma 47064 "thinker" (noun)>,
 <CelexLemma 3805 "bethink" (verb)>]
In [129]:
lv.derivational_family('think', right=True)
Out[129]:
[<CelexLemma 47064 "thinker" (noun)>,
 <CelexLemma 47061 "think" (noun)>,
 <CelexLemma 47062 "think" (verb)>,
 <CelexLemma 47063 "thinkable" (adjective)>]
In [130]:
lv.derivational_family('think', right=True, include_multiword=True)
Out[130]:
[<CelexLemma 47072 "think up" (verb)>,
 <CelexLemma 47061 "think" (noun)>,
 <CelexLemma 47062 "think" (verb)>,
 <CelexLemma 47063 "thinkable" (adjective)>,
 <CelexLemma 47064 "thinker" (noun)>,
 <CelexLemma 47067 "think of" (verb)>,
 <CelexLemma 47068 "think out" (verb)>,
 <CelexLemma 47069 "think over" (verb)>,
 <CelexLemma 47070 "think-tank" (noun)>,
 <CelexLemma 47071 "think through" (verb)>]
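
To make the difference between the full family and the right family concrete, here is a toy sketch that works on hand-written parses rather than on CELEX. The toy_parses dictionary and the toy_family function are invented for illustration and are not how LexVars implements derivational_family.

```python
# Toy morphological parses, invented for illustration; with CELEX these
# would come from the morphological parses of each lemma.
toy_parses = {
    "think": ["think"],
    "thinker": ["think", "er"],
    "thinkable": ["think", "able"],
    "rethink": ["re", "think"],
    "doublethink": ["double", "think"],
    "window": ["window"],
}


def toy_family(stem, parses, right=False):
    """Return the words whose parse contains `stem`.

    With right=True, keep only words whose leftmost morpheme is `stem`.
    """
    family = []
    for word, morphemes in parses.items():
        if right:
            if morphemes[0] == stem:
                family.append(word)
        elif stem in morphemes:
            family.append(word)
    return sorted(family)


full_family = toy_family("think", toy_parses)
right_family = toy_family("think", toy_parses, right=True)
print(full_family)
print(right_family)
```

In this toy data, rethink and doublethink belong to the full family but drop out of the right family, because think is not their leftmost morpheme.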

We can now extract the lemma frequencies of each of the lemmas in the word's family, and calculate the entropy of the probability distribution defined by those frequencies (Moscoso del Prado Martín et al., 2004). The lemma think is much more frequent than all of the lemmas derived from it; its derivational entropy is therefore fairly low:

In [131]:
think_family[0].help('Cob')
Frequency in the COBUILD corpus (17.9m words)
In [132]:
[x.Cob for x in think_family]
Out[132]:
[2, 32, 0, 35874, 2, 136, 5]
In [133]:
lv.derivational_entropy('think')
Out[133]:
0.17897018795918829
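
For readers who want to see the underlying quantity, the sketch below computes the plain Shannon entropy (in bits) of a list of frequencies. It is an illustration written for Python 3, not the LexVars implementation: derivational_entropy may define the family differently or treat the counts differently, so this calculation is not guaranteed to reproduce the value above.

```python
import math


def shannon_entropy(frequencies):
    """Entropy in bits of the distribution obtained by normalizing counts.

    Zero counts contribute nothing, following the convention 0 * log(0) = 0.
    """
    total = sum(frequencies)
    entropy = 0.0
    for f in frequencies:
        if f > 0:
            p = f / total
            entropy -= p * math.log2(p)
    return entropy


# Frequencies of the lemmas in the family of think, as listed above.
family_freqs = [2, 32, 0, 35874, 2, 136, 5]
h_family = shannon_entropy(family_freqs)

# For comparison: four equally frequent family members.
h_uniform = shannon_entropy([10, 10, 10, 10])
print(h_family, h_uniform)
```

A family dominated by one very frequent member yields an entropy close to zero, whereas a family with evenly distributed frequencies yields the maximum, the base-2 logarithm of the family size.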

Finally, we can calculate the entropy of the distribution of inflected forms of think, collapsing over all of the lemmas for think. There are multiple possible ways to group together inflected forms of the lemma for the purposes of calculating entropy; see the function's documentation for details.

In [ ]:
lv.inflectional_entropy('think')
Out[ ]:
1.5716274042735443

Verb subcategorization family

By analogy to the derivational family of a stem, we can define a verb's subcategorization family as the set of frames that a verb can occur in (Linzen et al., 2013). LexVars provides an interface to VALEX (Korhonen et al., 2006). Let's load their lexicon 5 and examine three of the frames for squash:

In [ ]:
import lexvars.valex
vlx = lexvars.valex.Valex(valex_lex5_path)
vlx.load_all_verbs(progress=False)
In [ ]:
vlx.verbs['squash'][:3]
Out[ ]:
[{'class': '24',
  'classfreq': '5281',
  'frame': 'NP',
  'freqcnt': 484,
  'relfreq': 0.397531},
 {'class': '49',
  'classfreq': '2010',
  'frame': 'NP_PP',
  'freqcnt': 181,
  'relfreq': 0.210909},
 {'class': '22',
  'classfreq': '2985',
  'frame': 'NONE',
  'freqcnt': 131,
  'relfreq': 0.103202}]

We can also calculate the Kullback-Leibler divergence between an individual verb's subcategorization distribution and the average subcategorization distribution in the language (i.e., averaged across all verbs, weighted by each verb's frequency; see again Linzen et al., 2013).

In [ ]:
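# Assumption: ValexRelativeEntropy is defined in lexvars.valex (imported above);
# adjust the module path if it is defined elsewhere.
vre = lexvars.valex.ValexRelativeEntropy(c, vlx)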
vre.build_reference_distribution()
vre.calculate_relative_entropies()
vre.relative_entropies['squash']
Out[ ]:
0.5697249078143103
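
The divergence itself is simple to compute once both distributions are available. The sketch below is a generic illustration in Python 3, not the code behind ValexRelativeEntropy; the two distributions are invented toy values, and the frame labels are borrowed from the squash example only to make them readable.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits, for distributions given as {frame: probability}.

    Assumes q assigns nonzero probability to every frame that p uses.
    """
    divergence = 0.0
    for frame, p_x in p.items():
        if p_x > 0:
            divergence += p_x * math.log2(p_x / q[frame])
    return divergence


# Toy distributions over three frames (made-up probabilities).
toy_verb = {"NP": 0.5, "NP_PP": 0.5, "NONE": 0.0}
toy_reference = {"NP": 0.25, "NP_PP": 0.25, "NONE": 0.5}

d_verb = kl_divergence(toy_verb, toy_reference)
d_self = kl_divergence(toy_reference, toy_reference)
print(d_verb, d_self)
```

A verb whose frame distribution matches the reference distribution has a divergence of zero; the more its preferences depart from the language-wide average, the larger the value.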

References

Baayen, R. H., & Piepenbrock, R. (1995). The CELEX lexical database (Release 2) [CD-ROM]. Philadelphia, PA: Linguistic Data Consortium, University of Pennsylvania [Distributor].

Korhonen, A., Krymolowski, Y., & Briscoe, T. (2006). A large subcategorization lexicon for natural language processing applications. In Proceedings of the 5th international conference on language resources and evaluation. Genova, Italy.

Linzen, T., Marantz, A., & Pylkkänen, L. (2013). Syntactic context effects in visual word recognition: An MEG study. The Mental Lexicon, 8(2), 117-139.

Moscoso del Prado Martín, F. M., Kostić, A., & Baayen, R. H. (2004). Putting the bits together: An information theoretical perspective on morphological processing. Cognition, 94(1), 1-18.