3. The Document instance¶
The Document
class encapsulates the original source document, plus the various metadata that can and should be
extracted: short name, dated URI, editors, document type, etc. These data are extracted from the file,
usually by interpreting the content of the file as well as the referenced CSS files. The metadata also records whether the document uses
scripting and whether it contains SVG or MathML: these should be added to the book’s package file (per the EPUB specification).
The class instance collects the various external references that must be, eventually, added to the final book (images, CSS files, etc.).
Finally, the HTML content (i.e., the DOM tree) is also modified on the fly: the HTML namespace is added, some metadata is adjusted to fit the HTML5 requirements, the HTML is output as XHTML, etc.
The class is invoked (and “controlled”) by a DocWrapper instance.
3.1. Module content¶
-
class
rp2epub.document.
Document
(driver)[source]¶ Encapsulation of the top level document.
Parameters: driver ( DocWrapper
) – the caller instance
-
_collect_downloads
()[source]¶ Process a document looking for (and possibly copying) external references and making some minor modifications on the fly.
(Element, attribute)
pairs are added on the fly to the internal array of downloads (see download_targets).
Returns: a cssurls.CSSList
instance, with all the CSS references
-
_get_CSS_TR_version
()[source]¶ Set the CSS TR version based on the document. Note: at the moment this is very ugly: the path of the CSS URL is checked for a date. Hopefully, there will be some more ‘standard’ way of doing this, eventually.
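The date check described above can be illustrated with a minimal sketch. It assumes the year appears as a standalone path segment of the CSS URL (as in `https://www.w3.org/StyleSheets/TR/2016/W3C-WD`); the function name, the fallback value, and the regular expression are hypothetical and are not taken from the module itself.

```python
import re
from urllib.parse import urlparse

# Hypothetical fallback; picked only because css_tr_version is
# documented as "2015 or higher".
DEFAULT_TR_VERSION = 2015


def guess_css_tr_version(css_url, default=DEFAULT_TR_VERSION):
    """Return the first four-digit year found as a path segment of css_url.

    Falls back to `default` when no path segment looks like a year.
    """
    path = urlparse(css_url).path
    for segment in path.split("/"):
        # A segment counts as a year only if it is exactly four digits
        # starting with 19 or 20.
        if re.fullmatch(r"(19|20)\d\d", segment):
            return int(segment)
    return default
```

A URL without a dated segment, such as `https://www.w3.org/StyleSheets/TR/base.css`, yields the fallback value in this sketch.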
-
_get_metadata_from_respec
(dict_config)[source]¶ Extract metadata (date, title, editors, etc.) making use of the stored ReSpec configuration structure (this structure includes the data set by the user plus some data added by the ReSpec process itself).
Returns: True or False, depending on whether the right keys are available or not
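The True/False contract can be sketched as follows. This is only an illustration: `shortName`, `specStatus`, and `editors` are ReSpec configuration options, but which keys the method actually requires is an assumption here, as are the function name and the `target` dictionary standing in for the instance attributes.

```python
# Assumed set of required keys; the real method may check different ones.
REQUIRED_KEYS = ("shortName", "specStatus", "editors")


def metadata_from_respec(dict_config, target):
    """Copy metadata from a ReSpec configuration mapping into `target`.

    Returns True if all required keys are present, False otherwise
    (in which case `target` is left untouched).
    """
    if not all(key in dict_config for key in REQUIRED_KEYS):
        return False
    target["short_name"] = dict_config["shortName"]
    target["doc_type"] = dict_config["specStatus"]
    # One (name, affiliation) pair per editor.
    target["editors"] = [
        (editor.get("name", ""), editor.get("company", ""))
        for editor in dict_config["editors"]
    ]
    return True
```

A caller could use the False return value as the signal to fall back to scraping the source instead.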
-
_get_metadata_from_source
()[source]¶ Extract metadata (date, title, editors, etc.) ‘scraping’ the source, i.e., by extracting the data based on class names, URI patterns, etc.
Raises R2EError: if the content is not recognized as one of the W3C document types (WD, CR, PR, PER, REC, Note, or ED)
-
add_additional_resource
(local_name, media_type)[source]¶ Add a pair of local name and media type to the list of additional resources. Appends to the
additional_resources
list.
Parameters: local_name – name of the resource within the final book
media_type – media type (used when the resource is added to the package file)
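As a rough sketch of how such (local name, media type) pairs might later feed the manifest, consider the following. The `ResourceCollector` class and the `manifest_items` helper are hypothetical stand-ins; only the `<item id href media-type>` shape comes from the EPUB package format.

```python
from xml.sax.saxutils import quoteattr


class ResourceCollector:
    """Hypothetical stand-in for the part of Document that tracks resources."""

    def __init__(self):
        self.additional_resources = []

    def add_additional_resource(self, local_name, media_type):
        # Store the pair as a tuple, in insertion order.
        self.additional_resources.append((local_name, media_type))


def manifest_items(resources):
    """Render one EPUB manifest <item> element per (name, media type) pair."""
    return [
        # quoteattr escapes the value and adds the surrounding quotes.
        "<item id=\"res{}\" href={} media-type={}/>".format(
            index, quoteattr(name), quoteattr(media_type)
        )
        for index, (name, media_type) in enumerate(resources)
    ]


collector = ResourceCollector()
collector.add_additional_resource("Assets/logo.png", "image/png")
collector.add_additional_resource("Assets/base.css", "text/css")
items = manifest_items(collector.additional_resources)
```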
-
additional_resources
¶ List of additional resources that must be added to the book eventually. A list of tuples, containing the internal reference to the resource and the media type. Built up during processing, it is used when creating the manifest file of the book.
List of authors (name + affiliation per element)
-
css_change_patterns
¶ List of (from, to) pairs that must be used to replace strings in the CSS files on the fly. Typically used to adjust the values used in url statements.
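A minimal sketch of applying such pairs, assuming plain substring replacement applied in list order. The helper name and the sample pattern are illustrative only.

```python
def apply_change_patterns(css_text, change_patterns):
    """Apply each (from, to) replacement to css_text, in list order."""
    for old, new in change_patterns:
        css_text = css_text.replace(old, new)
    return css_text


# Illustrative pattern: turn a protocol-relative W3C URL into a path
# that is local to the book.
patterns = [("//www.w3.org/StyleSheets/TR/2016/", "StyleSheets/TR/2016/")]
css = "background: url(//www.w3.org/StyleSheets/TR/2016/logos/WD.svg);"
result = apply_change_patterns(css, patterns)
```

Because the pairs are applied in order, a later pattern sees the output of an earlier one; whether the real code relies on that ordering is not stated in the source.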
-
css_references
¶ Set of (local_name, absolute_url) pairs for resources gathered recursively from CSS files. These are CSS files themselves, or other media like logos, background images, etc, referred to via a url statement in CSS.
-
css_tr_version
¶ Version (as an integer number denoting the year) of the CSS TR version. The value is 2015 or higher
-
date
¶ Date of publication
-
dated_uri
¶ ‘Dated URI’, in the W3C jargon. As a fallback, this may be set to the top URI of the document if the dated URI has not been set
-
doc_type
¶ Document type, e.g., one of
REC, NOTE, PR, PER, CR, WD, or ED,
or the values set in ReSpec
-
doc_type_info
¶ Structure reflecting the various aspects of documents by doc type. This is just a shorthand for
config.DOCTYPE_INFO[self.doc_type]
-
download_targets
¶ Array of resources to be downloaded and added to the final book. Entries of the array are (
xml.etree.ElementTree.Element
, attribute) pairs, referring to the element and the attribute that identifies the URL of the resources to be downloaded.
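The (element, attribute) pairing can be sketched with the standard library. The table of tags and attributes below is an assumption, and the sample document carries no XML namespace, whereas the real DOM tree has the HTML namespace added, so tag matching would differ there.

```python
import xml.etree.ElementTree as ET

# Hypothetical table of (tag, attribute) combinations that carry URLs.
URL_ATTRIBUTES = [("img", "src"), ("link", "href"), ("script", "src")]


def collect_targets(root):
    """Return (Element, attribute) pairs for every URL-bearing attribute."""
    targets = []
    for tag, attr in URL_ATTRIBUTES:
        for element in root.iter(tag):
            # Skip elements that lack the attribute or have it empty.
            if element.get(attr):
                targets.append((element, attr))
    return targets


doc = ET.fromstring(
    "<html><head><link href='style.css'/></head>"
    "<body><img src='a.png'/><img alt='no source'/></body></html>"
)
targets = collect_targets(doc)
urls = [element.get(attr) for element, attr in targets]
```

Keeping the element itself (rather than just the URL) lets a later step rewrite the attribute in place once the resource has been copied into the book.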
-
driver
¶ The caller: a
doc2epub.DocToEpub
instance.
-
editors
¶ List of editors (name + affiliation per element)
-
extract_external_references
()[source]¶ Handle the external references (images, etc.) in the core file, and copy them to the book. If the content referred to
- has a URL that is relative, begins with the same base, or refers to the www.w3.org domain (the latter is for official CSS files and logos)
- is one of the ‘accepted’ media types for EPUB
then the file is copied and stored in the book, the reference is changed in the document, and the resource is marked to be added to the manifest file. HTML files are copied as XHTML files, with a
.xhtml
suffix.
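The two conditions can be sketched as a single predicate. This sketch reads them as both being required, guesses the media type from the file extension, and uses a hypothetical subset of accepted media types; none of these choices is confirmed by the module.

```python
import mimetypes
from urllib.parse import urljoin, urlparse

# Hypothetical subset of the media types accepted for EPUB.
ACCEPTED_MEDIA_TYPES = {
    "image/png", "image/jpeg", "image/gif", "text/css", "text/html",
}


def should_copy(ref, base_url):
    """Decide whether the resource referenced by `ref` belongs in the book."""
    parsed = urlparse(ref)
    is_relative = not parsed.scheme and not parsed.netloc
    absolute = urljoin(base_url, ref)
    same_base = absolute.startswith(base_url)
    is_w3c = urlparse(absolute).netloc == "www.w3.org"
    # Guess the media type from the URL's file extension.
    media_type, _encoding = mimetypes.guess_type(absolute)
    location_ok = is_relative or same_base or is_w3c
    return location_ok and media_type in ACCEPTED_MEDIA_TYPES
```

For example, with `https://example.org/spec/` as the base, `images/diagram.png` qualifies, while a PNG hosted on an unrelated domain does not.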
-
html
¶ The parsed version of the top level HTML element; an
xml.etree.ElementTree.Element
instance
Table of contents extracted from a
<nav>
element (if any), that is copied almost verbatim into the EPUB3 navigation document. It may be empty, though, because the source does not contain the required TOC structure, in which case the simple TOC structure is used (see toc).
-
properties
¶ The properties of the document, to be added to the manifest entry
-
respec_config
¶ The full ReSpec configuration as a Python mapping type. This is available for newer releases of ReSpec, but not for older ones, and, of course, it is not available for Bikeshed sources. The value is None if the configuration was not made available.
Note that the rest of the code retrieves some of the common properties (e.g., short_name) on its own, i.e., it does not currently make use of this property. It may be used in the future, though.
-
short_name
¶ ‘Short Name’, in W3C jargon
-
subtitle
¶ “W3C Note/Recommendation/Draft/ etc.”: the text to be reused as a subtitle on the cover page.
-
title
¶ The
title
element content.
-
toc
¶ Table of contents, an array of
utils.TOC_Item
instances. It contains only the top level TOC structures; used for the old-school TOC file as well as for the EPUB3 navigation document in case the original document does not have the appropriate structures in its TOC.
-