Text corpus search API

This document is part of the Royal Danish Library's APIs, and in particular of the documentation on how to use our texts. See also Licences & Legalese and Caveats.

Try out the API here

The form has the following fields: Search for, Filter query, Field list, Result format, Start record, Number of records, Sort by and Query parser. Use Reset form! to clear them.
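The form fields above look like the usual Solr request parameters. The following Python sketch assumes that mapping (q, fq, fl, wt, start, rows, sort, defType) and uses a placeholder endpoint URL. Neither the mapping nor the URL is stated in this document, so check both against the form before relying on them.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint, not the real address of the service.
BASE_URL = "https://example.org/solr/select"

# ASSUMPTION: mapping from the form's labels to standard Solr parameter names.
FORM_TO_SOLR = {
    "search_for": "q",
    "filter_query": "fq",
    "field_list": "fl",
    "result_format": "wt",
    "start_record": "start",
    "number_of_records": "rows",
    "sort_by": "sort",
    "query_parser": "defType",
}


def build_url(**form_values):
    """Translate form-style keyword arguments into a Solr query URL."""
    params = []
    for name, value in form_values.items():
        if value is None or value == "":
            continue  # skip form fields that were left empty
        params.append((FORM_TO_SOLR[name], str(value)))
    return BASE_URL + "?" + urlencode(params)


url = build_url(
    search_for="type_ssi:work AND is_editorial_ssi:no",
    result_format="json",
    start_record=0,
    number_of_records=10,
)
print(url)
```

The function only builds the URL; fetching it is left out on purpose, since the real endpoint is not given here.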
 
      

Properties of the search index

All texts that can be searched through the API are encoded in Text Encoding Initiative (TEI) markup.

The software used for indexing is described in the documentation of the project SOLR and Snippets.

Note that this document does not define or describe all fields in the index. The index is far too rich for that, but I believe that this document contains what it takes to use the index. What I have left out is basically more of the same.

Finally, not all fields are available for all editions, because of the heterogeneity of the data or the wishes of the projects contributing data.

ID and Relations fields

label description values
id
The ID of the record. It identifies the collection and the TEI file, and is constructed by concatenating the file's basename with the xml:id of the indexed content and some additional components.
string
volume_id_ssi
The ID of the volume that contains the node
part_of_ssim
Array of IDs of the trunk nodes that contain the node at hand. Typically these are
  • One or more works as parents. Works may contain works.
  • A volume as an ancestor
Some works are monographs (i.e., they are contained in a volume with only one work), and for those the part_of_ssim field becomes meaningless.

Filter fields

label description values
cat_ssi
Category of a text. Use it to limit searches to works, or to find volumes or author portraits (biographies); omit it otherwise.
work
author
period
	    
is_editorial_ssi
The content's originator is someone other than the author. In this service, such content is typically forewords, prefaces, comments etc. in a scholarly edition.
yes
no
            
type_ssi
Node type in document. A trunk node can be a whole work, a chapter etc., whereas a leaf could be a paragraph of prose, a stanza (or strophe) of poetry or a speech in a dialogue in a scenic work. For historical reasons, whole texts have type_ssi:work. A query for type_ssi:trunk will yield a result set comprising chapters or sections of some kind.
work
trunk
leaf
volume
	    
is_monograph_ssi
A monograph in the text service is perhaps not what you expect (on the other hand, what you expect is a monograph in the text service). A monograph is a volume with only one work.
yes
no
	    
genre_ssi
Genre of a leaf node. Note that this is not the genre of a work, but the structure of the paragraph-level markup. If there is a song in a scenic work, the speech in question might be classified as containing mostly poetry. Available for all editions except GV.
prose
poetry
play
	    
subcollection_ssi
Filter with respect to collection. public-index.kb.dk contains all these editions.
              adl
              gv
              jura
              letters
              lh
              sks
              tfs
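Since the filter fields take a closed set of values, a client can check them before sending a query. Below is a minimal Python sketch that does so, using only the field names and values listed in the table above. The function names are made up for the illustration.

```python
# Allowed values, copied from the filter field table above.
ALLOWED = {
    "cat_ssi": {"work", "author", "period"},
    "is_editorial_ssi": {"yes", "no"},
    "type_ssi": {"work", "trunk", "leaf", "volume"},
    "is_monograph_ssi": {"yes", "no"},
    "genre_ssi": {"prose", "poetry", "play"},
    "subcollection_ssi": {"adl", "gv", "jura", "letters", "lh", "sks", "tfs"},
}


def filter_clause(field, value):
    """Return 'field:value' after checking both against the table."""
    if field not in ALLOWED:
        raise KeyError(f"unknown filter field: {field}")
    if value not in ALLOWED[field]:
        raise ValueError(f"{value!r} is not a listed value of {field}")
    return f"{field}:{value}"


def build_filter(**filters):
    """Combine several filter clauses with AND."""
    return " AND ".join(filter_clause(f, v) for f, v in filters.items())


print(build_filter(type_ssi="leaf", genre_ssi="poetry", subcollection_ssi="adl"))
```

A misspelled value such as genre_ssi="poem" raises ValueError instead of silently returning zero hits.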
	    

Sort fields

position_isi
	  
The position of the current node along the sibling xpath axis in the document. Sorting with respect to this field will guarantee that the result is presented in document order. (We cannot use page number, which might be a roman numeral or an arabic one. Also, we need to take into account leaf nodes within pages.)
integer
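To illustrate why page numbers are a poor sort key, the following Python sketch sorts a handful of made-up records first by page_ssi as a string and then by position_isi. The record values are invented for the illustration; only the field names come from this document.

```python
# Made-up records: front matter paged in roman numerals, body in arabic.
docs = [
    {"page_ssi": "1", "position_isi": 3},
    {"page_ssi": "ix", "position_isi": 1},
    {"page_ssi": "10", "position_isi": 5},
    {"page_ssi": "x", "position_isi": 2},
    {"page_ssi": "2", "position_isi": 4},
]

# Sorting on the page string mixes up the order ("10" sorts before "2",
# and the roman-numbered front matter ends up last).
by_page = sorted(docs, key=lambda d: d["page_ssi"])

# Sorting on position_isi restores document order.
by_position = sorted(docs, key=lambda d: d["position_isi"])

print([d["page_ssi"] for d in by_page])
print([d["page_ssi"] for d in by_position])
```

In practice the sorting is better left to the index by passing position_isi in the sort clause; the client-side sort here only makes the difference visible.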
	    

Search fields

label description values
work_title_tesim
volume_title_tesim
Miscellaneous metadata fields. There are more of them, but they should be self-explanatory.
just plain text
author_name_tesim
The author(s) of a document. For messages, the author is assumed to be the sender.
text_tesim
The text
just plain text
prose_extract_tesim
verse_extract_tesim
performance_extract_tesim        
The text, as text_tesim, split up into fields according to its form. The three fields get their content from <p> ... </p>, <lg> ... </lg> and <sp> ... </sp> respectively.
just plain text
contains_ssi
We measure the length of the texts in prose_extract_tesim, verse_extract_tesim and performance_extract_tesim; whichever is the longest determines the value of this field.
prose
poetry
play
speaker_tesim
The name of a character uttering something in a dialogue
just plain text
page_ssi
The page number where a leaf node (paragraph, speech or strophe) starts.
string (either an integer or a roman numeral)
person_name_ssim
person_name_tesim
Names of persons mentioned in works, or, in case of letters, the name of the recipient. The field can be accessed both as text (tesim) and string (ssim). The names in these fields are normalized to last name first (LNF) format. The normalized form usually hits variants as well: Shakespeare, William hits William Shakespeare, and Jesus hits Kristus (Danish for Christ). This applies only to these fields; there is no query expansion for the full text.
other_location_ssim
other_location_tesim
sender_location_tesim
Names of places mentioned in works, or, in case of letters, the residence of the sender. The field can be accessed both as text (tesim) and string (ssim). The place names are usually normalized. For instance, a search in these fields for Danmark hits Dannemark as well. The reverse is not true: a search for Dannemark hits only the word Dannemark in the full text (see text_tesim above). sender_location_tesim applies to letters only.
bible_ref_ssim
bible_ref_tesim
References to the Bible mentioned in works. The field can be accessed both as text (tesim) and string (ssim). The references use standard Danish abbreviations, like 1 Mos; 1 Kor 13,12; 1 Mos 2,7; Matt 16,18; Sl; Åb; ApG; Joh 1,14; Jak; Job. In many cases it is best to use bible_ref_ssim and search for the exact string, e.g. "1 Kor 13,12". The references are standardized annotations, whereas the full texts (of Grundtvig and Kierkegaard) may merely allude to a place in the Bible.
year_itsi
Year of release, publication or, in case of a message, the year it was sent.
long int
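Two of the fields above follow rules that are easy to mirror on the client side. The Python sketch below shows one possible reading of the contains_ssi rule (the longest of the three extracts wins) and a converter for page_ssi values, which are either integers or roman numerals. Both functions are illustrations and not the indexer's own code. The tie-breaking order and the assumption that roman page numbers use only the letters i, v, x, l, c, d and m are mine.

```python
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def contains_label(prose, verse, performance):
    """Return the contains_ssi value for the longest of the three extracts."""
    lengths = {
        "prose": len(prose),
        "poetry": len(verse),
        "play": len(performance),
    }
    # ASSUMPTION: on a tie, the first label in the order above wins.
    return max(lengths, key=lengths.get)


def page_to_int(page):
    """Convert a page_ssi string (arabic or roman) to an integer."""
    text = page.strip().lower()
    if text.isdecimal():
        return int(text)
    if not text or any(ch not in ROMAN for ch in text):
        raise ValueError(f"not a page number: {page!r}")
    total = 0
    for i, ch in enumerate(text):
        value = ROMAN[ch]
        following = ROMAN[text[i + 1]] if i + 1 < len(text) else 0
        # A smaller numeral before a larger one is subtracted (iv, ix, xl).
        total += -value if following > value else value
    return total


print(contains_label("En lang prosatekst", "kort", ""))
print(page_to_int("xiv"), page_to_int("42"))
```

Note that a converted roman page number and an arabic page number can coincide (page x and page 10), so the integer alone does not identify a page.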

Examples

Find all works try it! (Clicking on "try it" fills in the form to the left. You may then submit the search or customize it for your purposes. You might need to reset the form before a new search.)
type_ssi:work AND is_editorial_ssi:no
	  
Find all works by Gustaf Munch-Petersen try it!
author_name_tesim:munch
AND
type_ssi:work
	  
Find all speeches in dialogues (TEI <sp> elements) in Archive for Danish Literature (ADL), written by someone called Jeppe try it!
genre_ssi:play
AND
subcollection_ssi:adl
AND
author_name_tesim:jeppe
	  
Find all speeches in dialogues (<sp> elements) in ADL, spoken by a character named Jeppe try it!
genre_ssi:play
AND
subcollection_ssi:adl
AND
speaker_tesim:jeppe
	  
Find all strophes of poetry by N.F.S. Grundtvig containing the words hjerte and smerte (the two words rhyme, which heart and agony do not) in subcollection ADL. The query only makes sense for leaf nodes; both words will most likely appear in any 19th-century text of significant length. try it!
type_ssi:leaf
AND
genre_ssi:poetry
AND
subcollection_ssi:adl
AND
author_name_tesim:grundtvig
AND
text_tesim:hjerte
AND  
text_tesim:smerte
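Queries like the one above are plain conjunctions of field:value clauses, so they are easy to assemble programmatically. The following Python sketch builds the same query from a list of pairs; the helper name all_of is made up for the illustration. A list of pairs is used rather than a dictionary because the same field (here text_tesim) may occur more than once.

```python
def all_of(pairs):
    """Join (field, value) pairs into one query with AND between clauses."""
    return " AND ".join(f"{field}:{value}" for field, value in pairs)


query = all_of([
    ("type_ssi", "leaf"),
    ("genre_ssi", "poetry"),
    ("subcollection_ssi", "adl"),
    ("author_name_tesim", "grundtvig"),
    ("text_tesim", "hjerte"),
    ("text_tesim", "smerte"),
])
print(query)
```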
	  
Find all dialogue (all TEI speeches, <sp> ... </sp>) in the plays by Holberg where someone is talking about Mester Erich try it!
genre_ssi:play
AND
subcollection_ssi:adl
AND
text_tesim:mester erich
AND
author_name_tesim:holberg
	  
Find all letters sent from Berlin by Georg Brandes
Filter by letters, search by author and sender location try it!
subcollection_ssi:letters
AND
author_name_tesim:georg brandes
AND
sender_location_tesim:berlin
          
Find all letters sent from Paris up to and including 1850
Filter by letters, search by year_itsi and sender location try it!
subcollection_ssi:letters
AND
sender_location_tesim:paris
AND
year_itsi:[1000 TO 1850]
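Square brackets in a Solr range query include both end points, so the range above also matches letters from 1850 itself. The Python sketch below builds such range clauses; the helper name is made up, and the use of * for an open end relies on standard Solr range syntax rather than on anything stated in this document.

```python
def year_range(start=None, end=None, field="year_itsi"):
    """Build an inclusive range clause; None means an open end (*)."""
    lower = "*" if start is None else str(int(start))
    upper = "*" if end is None else str(int(end))
    return f"{field}:[{lower} TO {upper}]"


print(year_range(1000, 1850))  # the clause used in the example above
print(year_range(end=1849))    # strictly before 1850
```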
          

Filter, join and sort examples

Find all works by Holberg containing poetry try it! Steps in the search:
Search for author
author_name_tesim:holberg
	
Filter by genre_ssi:poetry, but return the record corresponding to the containing work rather than to the leaf node corresponding to a piece of poetry. Requires a database join:
{!join to=id from=part_of_ssim}genre_ssi:poetry
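The join works as follows: Solr first finds the records matching genre_ssi:poetry, collects the values of their part_of_ssim fields, and then returns the records whose id equals one of those values. The Python sketch below assembles the two parts of the request. It assumes that the search goes into the q parameter and the join into the fq parameter, which matches the form labels but is not spelled out in this document.

```python
from urllib.parse import urlencode


def join_filter(inner_query, from_field="part_of_ssim", to_field="id"):
    """Wrap a query in Solr's join local-parameter syntax."""
    return f"{{!join to={to_field} from={from_field}}}{inner_query}"


params = {
    "q": "author_name_tesim:holberg",       # search for author
    "fq": join_filter("genre_ssi:poetry"),  # keep works that contain poetry
}
print(params["fq"])
print(urlencode(params))
```

urlencode takes care of escaping the braces, the exclamation mark and the spaces, which would otherwise have to be percent-encoded by hand.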
	
Find all letters sent from Berlin by Georg Brandes as above, but sort descending by date (year)
I.e., filter by letters, search by author and sender location try it!
Add sort by clause
year_itsi desc          
Find all years when Grundtvig mentions hell (in Danish helvede). try it! You can limit the retrieval to document id and year only (by entering id year_itsi into the field list field in the form) and get all records by setting the number of records to (say) 500.
query
subcollection_ssi:gv
AND
verse_extract_tesim:helvede
AND
type_ssi:work
          
field list
id year_itsi            
          
sort by ascending
year_itsi asc            
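Once the records have been retrieved, turning them into a list of years is a small post-processing step. The Python sketch below assumes the result format is JSON with Solr's usual response.docs layout. The sample data is made up to keep the example self-contained and says nothing about what the index actually returns.

```python
from collections import Counter


def hits_per_year(result):
    """Count records per year_itsi in a Solr-style JSON result."""
    docs = result["response"]["docs"]
    return Counter(doc["year_itsi"] for doc in docs if "year_itsi" in doc)


# Made-up sample in the shape of a Solr JSON response (not real data).
sample = {
    "response": {
        "numFound": 3,
        "docs": [
            {"id": "hypothetical-1", "year_itsi": 1817},
            {"id": "hypothetical-2", "year_itsi": 1837},
            {"id": "hypothetical-3", "year_itsi": 1817},
        ],
    }
}

counts = hits_per_year(sample)
for year in sorted(counts):
    print(year, counts[year])
```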
          
Note the difference between *_extract_tesim and genre_ssi. The former limit the search to text of the specified form within a document. The latter specifies the form of a record, and is only applicable to paragraph-level records.
subcollection_ssi:gv
AND
text_tesim:helvede
AND
type_ssi:work
AND
genre_ssi:poetry
        
will give zero hits whereas
subcollection_ssi:gv
AND
text_tesim:helvede
AND
type_ssi:leaf
AND
genre_ssi:poetry
        
will give a lot of hits, one for each strophe.
          
An interesting exercise we leave to the reader is to repeat the search for paradise (the same in Danish) or heaven. Do Grundtvig's mentions of hell and paradise (or heaven) in any way correlate temporally?
Poetry often consists of strophes containing lines (which may or may not contain rhymes and rhythm). In TEI, a strophe is a set of lines in a line group element (<lg>). Find all strophes containing "regn" (i.e., rain) in poetry in volume 1 of Gustaf Munch Petersen's collected works.
Sort the result set in inverse document order Try it!
The actual search
volume_id_ssi:adl-texts-munp1-root
AND
text_tesim:regn
AND
genre_ssi:poetry
	  
The sort
position_isi desc
	  
A poem is, technically in TEI, a sequence of line groups (see above). Find all poems (i.e., works) containing strophes with "regn" (i.e., rain) in volume 1 of Gustaf Munch Petersen's collected works.
Sort the result set in the actual document order Try it!
The actual search
volume_id_ssi:adl-texts-munp1-root
AND
text_tesim:regn
	  
The join
{!join to=id from=part_of_ssim}genre_ssi:poetry
	
The sort
position_isi asc
	  
Find paragraphs or strophes where there are references to 1 Corinthians 13:12 (1 Kor 13,12: For now we see only a reflection as in a mirror; then we shall see face to face.) in the works of N.F.S. Grundtvig. try it!
The query
bible_ref_ssim:"1 Kor 13,12"
AND
subcollection_ssi:gv
AND
is_editorial_ssi:no
Sort chronologically
 year_itsi asc
 
Join with volume parent to return works. For paragraphs of prose.
{!join to=volume_id_ssi from=part_of_ssim}genre_ssi:prose
Join with volume parent to return works. Same thing as the join above but for strophes of poetry. Try it again for poetry!
{!join to=volume_id_ssi from=part_of_ssim}genre_ssi:poetry
I believe 1 Corinthians 13:12 is the passage of scripture most quoted by Grundtvig, but he does so more in prose than in poetry. On the other hand, he wrote more prose, in spite of the fact that he is one of the most prolific hymn authors in not only Denmark but the whole of Scandinavia.
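The Bible reference in the query above is quoted because string (ssim) fields are matched as a whole and the value contains spaces. A small Python helper can add the quotes and escape any quote or backslash inside the value. The helper name is made up, and the escaping follows standard Solr query syntax rather than anything specific to this index.

```python
def phrase(field, value):
    """Build an exact-match clause with the value in double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


print(phrase("bible_ref_ssim", "1 Kor 13,12"))
print(phrase("person_name_ssim", "Shakespeare, William"))
```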

Choose index instance

You cannot use the index-test instance outside our network. Ignore this if you are not a developer at kb.dk.

Colophon

This document was authored by

Sigfrid Lundberg
The Royal Danish Library
Denmark

who also wrote the indexer. However, a large number of people have contributed to this by coding services on top of the index. That process has required clarifications of this document and modifications of the index. This is the fruit of teamwork.