3. The Document instance

The Document class encapsulates the original source document, plus the various metadata that can and should be extracted: short name, dated URI, editors, document type, etc. These data are extracted from the file, usually by interpreting its content as well as the referenced CSS files. The metadata also records whether the document uses scripting and whether it contains SVG or MathML; these facts must be added to the book’s package file (per the EPUB specification).

The class instance collects the various external references that must eventually be added to the final book (images, CSS files, etc.).

Finally, the HTML content (i.e., the DOM tree) is also modified on the fly: the HTML namespace is added, some metadata is adjusted to fit the HTML5 requirements, the HTML is output as XHTML, etc.

The class is invoked (and “controlled”) by a DocWrapper instance.

3.1. Module content

class rp2epub.document.Document(driver)[source]

Encapsulation of the top level document.

Parameters: driver (DocWrapper) – the caller instance

_collect_downloads()[source]

Process a document looking for (and possibly copying) external references and making some minor modifications on the fly. (Element, attribute) pairs are added on the fly to the internal array of downloads (see download_targets).

Returns: a cssurls.CSSList instance, with all the CSS references

_get_CSS_TR_version()[source]

Set the CSS TR version based on the document. Note: at the moment this is a very ugly heuristic, because the path of the CSS URL is checked for a date. Hopefully, there will eventually be a more ‘standard’ way of doing this.
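The date check described above could be sketched roughly as follows. This is only an illustration, not the module's actual code: the helper name `guess_css_tr_version`, the regular expression, and the fallback value of 2015 are all assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: a four-digit year that forms a whole path segment.
_YEAR_IN_PATH = re.compile(r"/(20\d{2})/")


def guess_css_tr_version(css_url, default=2015):
    """Return the year found in the path of a TR style sheet URL.

    Falls back to `default` (an assumed value) when the path has no year.
    """
    path = urlparse(css_url).path
    match = _YEAR_IN_PATH.search(path)
    return int(match.group(1)) if match else default
```

For instance, a URL such as `https://www.w3.org/StyleSheets/TR/2016/W3C-REC` would yield 2016 under this sketch, while a URL without a year segment would yield the fallback.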

_get_document_metadata()[source]

Extract metadata (date, title, editors, etc.)

_get_metadata_from_respec(dict_config)[source]

Extract metadata (date, title, editors, etc.) making use of the stored ReSpec configuration structure (this structure includes the data set by the user plus some data added by the ReSpec process itself).

Returns: True or False, depending on whether the right keys are available or not

_get_metadata_from_source()[source]

Extract metadata (date, title, editors, etc.) ‘scraping’ the source, i.e., by extracting the data based on class names, URI patterns, etc.

Raises: R2EError – if the content is not recognized as one of the W3C document types (WD, ED, CR, PR, PER, REC, or Note)

add_additional_resource(local_name, media_type)[source]

Add a pair of local name and media type to the list of additional resources. Appends to the additional_resources list.

Parameters:
  • local_name – name of the resource within the final book
  • media_type – media type (used when the resource is added to the package file)

additional_resources

List of additional resources that must eventually be added to the book. A list of tuples, each containing the internal reference to the resource and its media type. Built up during processing, it is used when creating the manifest file of the book.
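To illustrate how such a list of (local name, media type) tuples might feed the manifest, here is a minimal sketch. The `ResourceRegistry` class, the `manifest_items` helper, the `res<N>` identifiers, and the sample file names are hypothetical and are not taken from rp2epub.

```python
from xml.sax.saxutils import quoteattr


class ResourceRegistry:
    """Hypothetical stand-in for the resource-tracking part of Document."""

    def __init__(self):
        self.additional_resources = []

    def add_additional_resource(self, local_name, media_type):
        # Each entry is a (local name, media type) tuple.
        self.additional_resources.append((local_name, media_type))


def manifest_items(resources):
    """Render one package-file <item> element per registered resource."""
    return [
        "<item id=\"res%d\" href=%s media-type=%s/>"
        % (index, quoteattr(local_name), quoteattr(media_type))
        for index, (local_name, media_type) in enumerate(resources)
    ]


registry = ResourceRegistry()
registry.add_additional_resource("Assets/logo.png", "image/png")
registry.add_additional_resource("Assets/base.css", "text/css")
items = manifest_items(registry.additional_resources)
```

Keeping the media type next to the local name means the manifest can be written in a single pass, without inspecting the files again.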

authors

List of authors (name + affiliation per element)

css_change_patterns

List of (from, to) pairs that must be used to replace strings in the CSS files on the fly. Typically used to adjust the values used in url statements.
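A minimal sketch of how such (from, to) pairs could be applied to the text of a CSS file follows. The function name `apply_css_change_patterns` and the sample pattern are assumptions for illustration; the module's own replacement logic may differ.

```python
def apply_css_change_patterns(css_text, patterns):
    """Apply each (from, to) pair in order as a plain string replacement."""
    for old, new in patterns:
        css_text = css_text.replace(old, new)
    return css_text


# Hypothetical pattern: point an absolute logo URL at a local folder.
patterns = [("//www.w3.org/StyleSheets/TR/2016/logos/", "logos/")]
css = "body { background: url(//www.w3.org/StyleSheets/TR/2016/logos/REC); }"
rewritten = apply_css_change_patterns(css, patterns)
```

Plain string replacement is used here rather than regular expressions, so characters such as dots and parentheses in URLs need no escaping.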

css_references

Set of (local_name, absolute_url) pairs for resources gathered recursively from CSS files. These are CSS files themselves, or other media like logos, background images, etc., referred to via a url statement in CSS.

css_tr_version

Version of the CSS TR style sheets, as an integer denoting the year. The value is 2015 or higher.

date

Date of publication

dated_uri

‘Dated URI’, in W3C jargon. As a fallback, this may be set to the top URI of the document if the dated URI has not been set.

doc_type

Document type, e.g., one of REC, NOTE, PR, PER, CR, WD, or ED, or the values set in ReSpec

doc_type_info

Structure reflecting the various aspects of documents by doc type. This is just a shorthand for config.DOCTYPE_INFO[self.doc_type]

download_targets

Array of resources to be downloaded and added to the final book. Entries of the array are (xml.etree.ElementTree.Element, attribute) pairs, referring to the element and the attribute that identifies the URL of the resources to be downloaded.
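The (Element, attribute) pairing can be illustrated with a short sketch. The `collect_download_targets` helper and the list of tag/attribute combinations below are assumptions made for this example; the real module may inspect a different set of elements.

```python
import xml.etree.ElementTree as ET

# Assumed (tag, attribute) combinations that may carry a resource URL.
URL_ATTRIBUTES = [("img", "src"), ("script", "src"), ("link", "href")]


def collect_download_targets(root):
    """Return (Element, attribute name) pairs for elements carrying a URL."""
    targets = []
    for tag, attr in URL_ATTRIBUTES:
        for element in root.iter(tag):
            if element.get(attr):  # skip elements lacking the attribute
                targets.append((element, attr))
    return targets


html = ET.fromstring(
    "<html><head><link rel='stylesheet' href='base.css'/></head>"
    "<body><img src='fig1.png'/><img alt='no source'/></body></html>"
)
targets = collect_download_targets(html)

# Because the element itself is stored, its URL can be rewritten in place.
for element, attr in targets:
    element.set(attr, "Assets/" + element.get(attr))
```

Storing the element rather than the URL string is what makes the later in-place rewrite of the reference possible.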

driver

The caller: a doc2epub.DocToEpub instance.

editors

List of editors (name + affiliation per element)

extract_external_references()[source]

Handle the external references (images, etc.) in the core file, and copy them to the book. If the content referred to

  • has a URL that is relative, begins with the same base, or refers to the www.w3.org domain (the latter is for official CSS files and logos), and
  • is one of the ‘accepted’ media types for EPUB

then the file is copied and stored in the book, the reference is changed in the document, and the resource is marked to be added to the manifest file. HTML files are copied as XHTML files, with a .xhtml suffix.
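The two conditions above, plus the .xhtml renaming, might be expressed along these lines. Everything in this sketch is an assumption made for illustration: the helper names, the suffix-to-media-type table, and the `example.org` base URL are hypothetical, and the real list of accepted media types is defined elsewhere in the package.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Assumed subset of accepted media types, keyed by file suffix.
MEDIA_TYPES_BY_SUFFIX = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".gif": "image/gif",
    ".svg": "image/svg+xml",
    ".css": "text/css",
    ".html": "text/html",
}


def should_copy(ref, base_url):
    """Check the URL condition and the media type condition for a reference."""
    parsed = urlparse(ref)
    is_relative = not parsed.scheme and not parsed.netloc
    absolute = urljoin(base_url, ref)
    same_base = absolute.startswith(base_url)
    on_w3c = urlparse(absolute).netloc == "www.w3.org"
    suffix = posixpath.splitext(urlparse(absolute).path)[1].lower()
    return (is_relative or same_base or on_w3c) and suffix in MEDIA_TYPES_BY_SUFFIX


def local_name_for(ref):
    """Name of the copy inside the book; HTML files get a .xhtml suffix."""
    stem, suffix = posixpath.splitext(posixpath.basename(urlparse(ref).path))
    return stem + (".xhtml" if suffix.lower() == ".html" else suffix)


BASE = "https://example.org/specs/demo/"  # hypothetical document base
```

With this base, `should_copy("diagrams/fig1.png", BASE)` would accept the relative image, while a reference to a different site or to an unlisted file type would be left untouched in the document.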

html

The parsed version of the top level HTML element; an xml.etree.ElementTree.Element instance

nav_toc

Table of contents extracted from a <nav> element (if any), which is copied almost verbatim into the EPUB3 navigation document. It may be empty, though, because the source does not contain the required TOC structure, in which case the simple TOC structure is used (see toc).

properties

The properties of the document, to be added to the manifest entry

respec_config

The full ReSpec configuration as a Python mapping type. This is available for newer releases of ReSpec, but not for older ones, and, of course, not for Bikeshed sources. The value is None if the configuration was not made available.

Note that the common properties (e.g., short_name) are retrieved into their own attributes, i.e., the rest of the code does not make use of this property directly. But it may be used in the future.

short_name

‘Short Name’, in W3C jargon

subtitle

“W3C Note/Recommendation/Draft/etc.”: the text to be reused as a subtitle on the cover page.

title

The title element content.

toc

Table of contents, an array of utils.TOC_Item instances. It contains only the top-level TOC structure; it is used for the old-school TOC file as well as for the EPUB3 navigation document in case the original document does not have the appropriate structures in its TOC.