6. Common Utilities

Various utility classes and methods.

6.1. Module Content

rp2epub.utils.logger: A Python logger instance (see the Python logging library for details). May be overwritten by the DocWrapper instance. Defaults to None.

rp2epub.utils.TOC_PAIRS: Array of tuples that help select the (top-level) TOC entries; the strings are used in an XPath find. Because there have been several versions over time, including Bikeshed and ReSpec versions, the array contains quite a number of variants. A tuple may contain a third string, denoting a specific class name on the target element that can be used to narrow the filter.

class rp2epub.utils.Book(book_name, folder_name, package=True, folder=False)

Abstraction for a book; it encapsulates a zip file and can also save the content into a directory.

Parameters:
	book_name – file name of the book
	folder_name – name of the directory
	package – whether a real zip file should be created
	folder – whether the directory structure should be created separately

_path(path)

Expand the path with the name of the package, check whether the resulting path (filename) includes intermediate directories and create those on the fly if necessary.

Parameters:	path – path to be checked
Returns:	expanded, full path
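
The path handling described above can be sketched as follows. This is only an illustration, not the module's code: `expand_path` and its `name` argument are hypothetical, and the sketch assumes that the prefix is a plain directory name and that missing directories are created with `os.makedirs`.

```python
import os
import tempfile


def expand_path(name, path):
    """Hypothetical stand-in for Book._path: prefix `path` with `name`
    and create any missing intermediate directories."""
    full_path = os.path.join(name, path)
    directory = os.path.dirname(full_path)
    if directory:
        # exist_ok avoids an error when the directory is already there
        os.makedirs(directory, exist_ok=True)
    return full_path


# Usage: expand a relative path below a temporary "book" folder.
book_root = os.path.join(tempfile.mkdtemp(), "my-book")
full_path = expand_path(book_root, "Assets/base.css")
```

Only the directories are created; writing the file itself is left to the caller.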

close(): Close the book (i.e., the archive).

folder: Flag whether a folder should be generated or not

name: Prefix that should be added to all names when storing a folder (set to the short name of the document)

package: Flag whether an EPUB package should be generated or not

write_HTTP(target, url)

Retrieve the content of a URI and store it in the book. (This is a wrapper around the write_session method.)

Parameters:
	target (str) – path for the target file; this is always a relative URI
	url – URL to be retrieved and written into the book
Returns:	whether the HTTP session was successful or not
Return type:	boolean

write_element(target, element)

An ElementTree object is added to the book.

Parameters:
	target (str) – path for the target file
	element (`xml.etree.ElementTree`) – the XML tree to be stored

write_session(target, session, css_change_patterns=None)

The content returned by an HttpSession is added to the book. If the content is an HTML file, it is converted into XHTML on the fly.

Parameters:
	target (str) – path for the target file
	session – an `HttpSession` instance whose data must be retrieved and written into the book
	css_change_patterns – a list of `(from, to)` replace patterns to be applied to CSS files before storage
Returns:	the value of session.success
Return type:	boolean

writestr(target, content, compress=8)

Write the content of a string.

Parameters:
	target – path for the target file
	content – string/bytes to be written to the file
	compress – either `zipfile.ZIP_DEFLATED` (the content is compressed) or `zipfile.ZIP_STORED` (the content is not compressed); the default value 8 is `zipfile.ZIP_DEFLATED`
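
To illustrate the two values of compress, here is a small sketch using the standard zipfile module directly rather than the Book class. The file names and contents are made up for the example, and the archive is kept in memory.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # Stored as-is, without compression
    archive.writestr("mimetype", "application/epub+zip",
                     compress_type=zipfile.ZIP_STORED)
    # Compressed with the deflate method
    archive.writestr("Overview.xhtml", "<p>text</p>" * 100,
                     compress_type=zipfile.ZIP_DEFLATED)

# Read the archive back to see which method was used for each entry
with zipfile.ZipFile(buffer) as archive:
    methods = {item.filename: item.compress_type for item in archive.infolist()}
```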

zip: The package (book) file itself

class rp2epub.utils.HttpSession(url, check_media_type=False, raise_exception=False, is_respec=False)

Wrapper around an HTTP session; the returned media type is compared against accepted media types.

Parameters:
	url (str) – the URL to be retrieved
	check_media_type (boolean) – whether the media type of the resource should be checked to see if it is acceptable
	raise_exception (boolean) – whether an exception should be raised if the document cannot be retrieved (either because the HTTP return is not 200, or because the media type is not acceptable)
	is_respec (boolean) – if True, the URL is a callout to the spec generator service; in that case, a different error message is used if there is a problem
Raises R2EError:
	in case the file is not of an acceptable media type, or the HTTP return is not 200

data: The returned resource, as a file-like object

media_type: Media type of the resource

success: True if the HTTP retrieval was successful, False otherwise

url: The request URL for this session

class rp2epub.utils.Logger: Wrapper around the logger calls; it simply checks whether the logger in the configuration file has been set to a real value or is None (in the latter case nothing happens). This saves a repeated set of checks elsewhere in the code.
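
The idea of such a wrapper can be sketched as follows. `SafeLogger` and its two methods are hypothetical names chosen for this illustration; the sketch assumes a module-level `logger` variable that is either None or a standard `logging.Logger`.

```python
import logging

# Module-level logger; None until a real logging.Logger is assigned
logger = None


class SafeLogger:
    """Hypothetical wrapper: forward a call only when a logger has been set."""

    @staticmethod
    def info(message):
        if logger is not None:
            logger.info(message)

    @staticmethod
    def warning(message):
        if logger is not None:
            logger.warning(message)


# With no logger configured, the calls are silently ignored
SafeLogger.info("Processing started")
SafeLogger.warning("No TOC found")
```

Callers can then log unconditionally, without testing for None at every call site.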

class rp2epub.utils.TOC_Item(href, label, short_label)

A single table of contents (TOC) item.

Parameters:
	href (str) – reference in the TOC
	label (str) – long label, i.e., including the chapter numbering
	short_label (str) – short label, i.e., without the chapter numbering

class rp2epub.utils.Utils

Generic utility functions to extract information from a W3C TR document.

static change_DOM(html)

Change the DOM to ensure proper interoperability of the display among EPUB readers. At the moment, the following actions are performed:

1. Due to the rigidity of the iBooks reader, the DOM tree has to change: all children of the <body> are encapsulated into a top-level block element (we use <div role="main">). This is because iBooks imposes a zero padding on the body element that cannot be controlled by the user; the top-level block element allows for suitable CSS adjustments.

The CSS adjustment is done as follows: templates.BOOK_CSS is completed with the exact padding values; these are retrieved (depending on the TR version and the document) from the config.PADDING_NEW_STYLE and, if applicable, the config.PADDING_OLD_STYLE dictionaries. The expansion of templates.BOOK_CSS itself happens in the doc2epub.DocWrapper.process() method.

Note that simply using a <main> element as the top-level encapsulation is not a good approach, because some files (e.g., those generated by Bikeshed) already use that element, and there can be only one of those.

2. If a <pre> element has the class name highlight, the Readium extension to Chrome goes wild. That class name is used only for internal processing by ReSpec and is unused in the various default CSS content, so, as an emergency measure, it is simply removed from the code, although this is clearly not the optimal way.

Note: this is an acknowledged bug in Readium. When a newer release of Readium is deployed, this hack should be removed from the code.

3. Some readers require a type="text/css" attribute on the link element for a CSS file; otherwise the CSS is ignored. The attribute is added (though not needed in HTML5, it does no harm either).

4. Add the toc-inline value to the class of the body element, to ensure that the TOC stays inline and is not floated on the left-hand side. In reality, this is needed only for the post-2016 versions of the TR documents, but it does no harm for earlier versions; this step is therefore not complicated by a check of the document’s TR version.

5. Similarly to step 4, remove the reference to the fixup.js script (which sets some initial values for the sidebar handling, which is to be removed altogether anyway).

Parameters:	html (`xml.etree.ElementTree.ElementTree`) – the object for the whole document
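
Steps 1 to 4 above can be sketched with the standard ElementTree API as follows. This is an illustration under simplifying assumptions, not the actual method: `adjust_dom` is a hypothetical name, element names are assumed to carry no namespace prefix, and step 5 is left out.

```python
import xml.etree.ElementTree as ET


def adjust_dom(html):
    """Hypothetical sketch of steps 1-4 on an un-namespaced ElementTree."""
    body = html.find(".//body")

    # Step 1: move all children of <body> into a <div role="main">
    wrapper = ET.Element("div", {"role": "main"})
    wrapper.text, body.text = body.text, None
    for child in list(body):
        body.remove(child)
        wrapper.append(child)
    body.append(wrapper)

    # Step 2: drop the "highlight" class name from <pre> elements
    for pre in html.iter("pre"):
        classes = pre.get("class", "").split()
        if "highlight" in classes:
            classes = [c for c in classes if c != "highlight"]
            if classes:
                pre.set("class", " ".join(classes))
            else:
                del pre.attrib["class"]

    # Step 3: make sure stylesheet links carry type="text/css"
    for link in html.iter("link"):
        if "stylesheet" in link.get("rel", "").split():
            link.set("type", "text/css")

    # Step 4: add "toc-inline" to the class of <body>
    body.set("class", (body.get("class", "") + " toc-inline").strip())
    return html


source = ('<html><head><link rel="stylesheet" href="base.css"/></head>'
          '<body class="h-entry"><h1>Title</h1>'
          '<pre class="highlight example">code</pre></body></html>')
document = adjust_dom(ET.ElementTree(ET.fromstring(source)))
```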

static create_shortname(name)

Create the short name, in W3C jargon, based on the dated name.

Parameters:	name (str) – dated name
Returns:	tuple with the category of the publication (`REC`, `NOTE`, `PR`, `WD`, `CR`, `ED`, `RSCND`, or `PER`) and the short name itself
Return type:	tuple
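
As a rough illustration of the idea, the sketch below splits a dated name on its first and last hyphens. It assumes that a dated name has the form category-shortname-YYYYMMDD; `split_dated_name` is a hypothetical helper, and the real method may handle more variants than this.

```python
def split_dated_name(name):
    """Hypothetical helper: split 'CATEGORY-shortname-YYYYMMDD'
    into (category, short name)."""
    # The first hyphen separates the category, the last one the date
    category, remainder = name.split("-", 1)
    short_name, date_part = remainder.rsplit("-", 1)
    if not (len(date_part) == 8 and date_part.isdigit()):
        raise ValueError("unexpected dated name: {}".format(name))
    return category, short_name


result = split_dated_name("REC-rdf11-concepts-20140225")
```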

static editors_to_string(names, editor=True)

Return a string of names generated from a list of names, with correct punctuation and a suffix denoting whether these are editors or authors.

Parameters:
	names – list of strings, each entry a name to be used in the final output
	editor – if True, the string ‘(editor)’ or ‘(editors)’ is appended to the list (depending on cardinality); otherwise ‘(author)’ or ‘(authors)’
Returns:	a string that can be used as a final display for the names of editors/authors.
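
A minimal sketch of this behaviour follows. The exact punctuation used by the real method is not documented here, so the comma-separated format is an assumption, and `names_to_string` is a hypothetical name.

```python
def names_to_string(names, editor=True):
    """Hypothetical sketch: join names and append a role suffix."""
    if not names:
        return ""
    role = "editor" if editor else "author"
    if len(names) > 1:
        role += "s"  # plural suffix for more than one name
    return "{} ({})".format(", ".join(names), role)


several = names_to_string(["Ada Lovelace", "Alan Turing"])
single = names_to_string(["Grace Hopper"], editor=False)
```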

static extract_editors(html)

Extract the editors’ names from a document, following the ReSpec conventions (@class=p-author for <dd> including <a> or <span> with @class=p-name).

Note that this is used only for older documents. Current ReSpec reproduces its configuration in the target HTML file, and the data can be extracted from there directly.

Parameters:	html (`xml.etree.ElementTree.ElementTree`) – the object for the whole document
Returns:	list of editors

static extract_toc(html, short_name)

Extract the table of contents from the document.

Parameters:
	html (`xml.etree.ElementTree.ElementTree`) – the object for the whole document
	short_name (str) – short name of the document as a whole (used in possible warnings)
Returns:	array of `TOC_Item` instances

static get_document_properties(html)

Find the extra manifest properties that must be added to the HTML resource in the opf file.

See the IDPF documentation for details.

Parameters:	html (`xml.etree.ElementTree.ElementTree`) – the object for the whole document
Returns:	set collecting all possible property values
Return type:	set

static html_to_xhtml(html)

Make the minimum changes necessary in the DOM tree so that the XHTML5 output is valid and accepted by EPUB readers. These are:

1. The http://www.w3.org/1999/xhtml namespace is required in EPUB, but not generated by the XML serialization of Python’s ElementTree (or the HTML5Lib implementation thereof?). It is therefore added explicitly.

2. XHTML5 does not work with <script src="..."/>, i.e., with a self-closing element. Such elements are modified by adding a space to the content of the element.

Parameters:	html (`xml.etree.ElementTree.ElementTree`) – the object for the whole document
Returns:	the input object
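
The two changes can be sketched as follows, assuming an un-namespaced tree; `to_xhtml` is a hypothetical name and the snippet is an illustration rather than the actual method.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def to_xhtml(html):
    """Hypothetical sketch of the two changes on an un-namespaced tree."""
    root = html.getroot()
    # 1. Declare the XHTML namespace explicitly on the root element
    root.set("xmlns", XHTML_NS)
    # 2. Give empty <script src="..."> elements a single space as content,
    #    so that they are not serialized as self-closing elements
    for script in root.iter("script"):
        if script.get("src") and not script.text:
            script.text = " "
    return html


source = '<html><head><script src="respec.js"/></head><body/></html>'
document = to_xhtml(ET.ElementTree(ET.fromstring(source)))
serialized = ET.tostring(document.getroot(), encoding="unicode")
```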

static retrieve_date(duri)

Retrieve the (publication) date from the dated URI.

Parameters:	duri (str) – dated URI
Returns:	date
Return type:	`datetime.date`
Raises R2EError:
	the dated URI is not of an expected format
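
A sketch of such a date extraction, assuming that the dated URI ends with an eight-digit YYYYMMDD date (optionally followed by a slash). `date_from_dated_uri` is a hypothetical name, and ValueError stands in for the package's own R2EError.

```python
import re
from datetime import date


def date_from_dated_uri(duri):
    """Hypothetical sketch: read the trailing YYYYMMDD of a dated URI."""
    match = re.search(r"-(\d{4})(\d{2})(\d{2})/?$", duri)
    if match is None:
        # The real method raises R2EError here
        raise ValueError("unexpected dated URI: {}".format(duri))
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


published = date_from_dated_uri(
    "https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/")
```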

static set_html_meta(html, head)

Change the meta elements so that:

1. any @http-equiv=content-type is removed;

2. there is an extra meta element setting the character set.

Parameters:
	html (`xml.etree.ElementTree.ElementTree`) – the object for the whole document
	head (`xml.etree.ElementTree.Element`) – the object for the <head> element
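
The two changes can be sketched on the <head> element alone. `normalize_meta` is a hypothetical name, the utf-8 value is an assumption, and element names are assumed to carry no namespace prefix.

```python
import xml.etree.ElementTree as ET


def normalize_meta(head, charset="utf-8"):
    """Hypothetical sketch: drop http-equiv=content-type, add a charset meta."""
    # findall returns a list, so removing while looping is safe
    for meta in head.findall("meta"):
        if meta.get("http-equiv", "").lower() == "content-type":
            head.remove(meta)
    # Add the character set declaration unless one is already present
    if not any(meta.get("charset") for meta in head.findall("meta")):
        head.insert(0, ET.Element("meta", {"charset": charset}))
    return head


head = normalize_meta(ET.fromstring(
    '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
    '<title>Example</title></head>'))
```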