6. Common Utilities

Various utility classes and methods.

6.1. Module Content

rp2epub.utils.logger

A Python logger instance (see the Python logging library for details). May be overwritten by the DocWrapper instance. Defaults to None.

rp2epub.utils.TOC_PAIRS

Array of tuples that help select the (top level) TOC entries; these are strings to be used in an XPath find. Because there have been several versions in the past, including Bikeshed and ReSpec versions, the array contains quite a number of variants. The tuple may contain a third string, denoting a specific class name on the target element that can be used to narrow the filter.
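As a rough illustration of how such tuples could drive an XPath lookup, the sketch below uses a hypothetical two-entry list with ElementTree's limited XPath support. The names SAMPLE_TOC_PAIRS and find_toc_entries, the XPath strings, and the reading of the first two strings as "container path" and "entry path" are all assumptions made for this example; the real TOC_PAIRS values are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for TOC_PAIRS:
# (XPath of the TOC container, XPath of the entries[, class name on the entry])
SAMPLE_TOC_PAIRS = [
    (".//nav[@id='toc']", "ol/li", "tocline"),
    (".//div[@id='toc']", "ul/li"),
]

def find_toc_entries(root, pairs=SAMPLE_TOC_PAIRS):
    """Return the top-level TOC entries for the first pair that matches."""
    for pair in pairs:
        container_path, entry_path = pair[0], pair[1]
        class_name = pair[2] if len(pair) > 2 else None
        container = root.find(container_path)
        if container is None:
            continue
        entries = container.findall(entry_path)
        if class_name is not None:
            # Narrow the filter with the optional third string
            entries = [e for e in entries if class_name in e.get("class", "").split()]
        if entries:
            return entries
    return []

doc = ET.fromstring(
    "<html><body><div id='toc'><ul>"
    "<li><a href='#intro'>1. Introduction</a></li>"
    "<li><a href='#model'>2. Model</a></li>"
    "</ul></div></body></html>"
)
toc_hrefs = [li.find("a").get("href") for li in find_toc_entries(doc)]
```

Trying the variants in order and stopping at the first match keeps the list easy to extend when a new document generator appears.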

class rp2epub.utils.Book(book_name, folder_name, package=True, folder=False)[source]

Abstraction for a book; it encapsulates a zip file as well as saving the content into a directory.

Parameters:
  • book_name – file name of the book
  • folder_name – name of the directory
  • package – whether a real zip file should be created or not
  • folder – whether the directory structure should be created separately or not
_path(path)[source]

Expand the path with the name of the package, check whether the resulting path (filename) includes intermediate directories and create those on the fly if necessary.

Parameters:path – path to be checked
Returns:expanded, full path
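The path expansion described above might look roughly like the following. This is a minimal sketch, not the library's code: expand_path is a hypothetical helper, and the folder and file names are invented for the example.

```python
import os
import tempfile

def expand_path(folder_name, path):
    """Prefix path with the folder name and create missing parent directories."""
    full_path = os.path.join(folder_name, path)
    parent = os.path.dirname(full_path)
    if parent:
        # exist_ok avoids an error when the directory was created earlier
        os.makedirs(parent, exist_ok=True)
    return full_path

base = tempfile.mkdtemp()
full = expand_path(os.path.join(base, "my-spec"), "Assets/css/book.css")
parent_exists = os.path.isdir(os.path.dirname(full))
```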
close()[source]

Close the book (i.e., the archive).

folder

Flag whether a folder should be generated or not

name

Prefix that should be added to all names when storing a folder (set to the short name of the document)

package

Flag whether an EPUB package should be generated or not

write_HTTP(target, url)[source]

Retrieve the content of a URI and store it in the book. (This is a wrapper around the write_session method.)

Parameters:
  • target (str) – path for the target file, this is always a relative URI
  • url – URL that has to be retrieved to be written into the book
Boolean return:

whether the HTTP session was successful or not

write_element(target, element)[source]

An ElementTree object is added to the book.

Parameters:
  • target (str) – path for the target file
  • element (xml.etree.ElementTree) – the XML tree to be stored
write_session(target, session, css_change_patterns=None)[source]

The returned content of an HttpSession is added to the book. If the content is an HTML file, it will be converted into XHTML on the fly.

Parameters:
  • target (str) – path for the target file
  • session – an HttpSession instance whose data must be retrieved and written into the book
  • css_change_patterns – a list of (from,to) replace patterns to be applied on CSS files before storage
Return boolean:

the value of session.success
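To make the css_change_patterns parameter more concrete, here is a small sketch of applying (from, to) pairs to CSS text. It assumes plain substring replacement (the documentation does not say whether regular expressions are involved); apply_css_patterns and the example URL are hypothetical.

```python
def apply_css_patterns(css_text, css_change_patterns=None):
    """Apply each (from, to) pair in order; None means no changes."""
    for old, new in css_change_patterns or []:
        css_text = css_text.replace(old, new)
    return css_text

patched = apply_css_patterns(
    "body { background: url(https://example.org/img/logo.png); }",
    [("https://example.org/img/", "Assets/")],
)
unchanged = apply_css_patterns("p { margin: 0; }")
```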

writestr(target, content, compress=8)[source]

Write the content of a string.

Parameters:
  • target – path for the target file
  • content – string/bytes to be written on the file
  • compress – either zipfile.ZIP_DEFLATED or zipfile.ZIP_STORED, whether the content should be compressed, resp. not compressed
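The two compress values map directly onto the standard zipfile module, where ZIP_DEFLATED equals 8 (the default shown in the signature). The sketch below writes one stored and one deflated entry into an in-memory archive; the entry name OEBPS/Overview.xhtml is only an illustrative choice. EPUB requires the mimetype entry to be stored uncompressed, which is a typical reason to pass ZIP_STORED.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as book:
    # The EPUB mimetype entry must come first and must not be compressed
    book.writestr("mimetype", "application/epub+zip", zipfile.ZIP_STORED)
    # Regular content can be deflated
    book.writestr("OEBPS/Overview.xhtml", "<html/>" * 100, zipfile.ZIP_DEFLATED)

# Re-open the archive to inspect how each entry was stored
with zipfile.ZipFile(buffer) as book:
    compress_types = {info.filename: info.compress_type for info in book.infolist()}
```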
zip

The package (book) file itself

class rp2epub.utils.HttpSession(url, check_media_type=False, raise_exception=False, is_respec=False)[source]

Wrapper around an HTTP session; the returned media type is compared against accepted media types.

Parameters:
  • url (str) – the URL to be retrieved
  • check_media_type (boolean) – whether the media type of the returned resource should be checked against the accepted media types
  • raise_exception (boolean) – whether an exception should be raised if the document cannot be retrieved (either because the HTTP return is not 200, or not of an acceptable media type)
  • is_respec (boolean) – if True, the URL is a callout to the spec generator service; if so, and there is a problem, the corresponding error message is different
Raises R2EError:
 

in case the file is not of an acceptable media type, or the HTTP return is not 200

data

The returned resource, as a file-like object

media_type

Media type of the resource

success

True if the HTTP retrieval was successful, False otherwise

url

The request URL for this session

class rp2epub.utils.Logger[source]

Wrapper around the logger calls, simply checking whether the logger in the configuration file has been set to a real value or whether it is None (in the latter case nothing happens). Saves a repeated set of checks elsewhere in the code.
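The idea of such a wrapper can be sketched as follows. This is an assumed shape, not the actual class: QuietLogger, ListHandler, and the logger name are hypothetical, and the module-level logger variable stands in for rp2epub.utils.logger.

```python
import logging

logger = None  # stand-in for the module-level logger; None disables logging

class QuietLogger:
    """Forward calls to the module-level logger only when one is set."""

    @staticmethod
    def info(message):
        if logger is not None:
            logger.info(message)

    @staticmethod
    def warning(message):
        if logger is not None:
            logger.warning(message)

class ListHandler(logging.Handler):
    """Collect messages in a list so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

QuietLogger.warning("dropped silently")  # no logger yet: nothing happens

handler = ListHandler()
logger = logging.getLogger("rp2epub-sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
QuietLogger.warning("TOC not found")
```

Callers can then log unconditionally instead of repeating the None check at every call site.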

class rp2epub.utils.TOC_Item(href, label, short_label)[source]

A single Table of Contents (TOC) item.

Parameters:
  • href (str) – reference in the TOC
  • label (str) – long label, i.e., including the chapter numbering
  • short_label (str) – short label, i.e., without the chapter numbering
class rp2epub.utils.Utils[source]

Generic utility functions to extract information from a W3C TR document.

static change_DOM(html)[source]

Changes on the DOM to ensure a proper interoperability of the display among EPUB readers. At the moment, the following actions are done:

1. Due to the rigidity of the iBook reader, the DOM tree has to change: all children of the <body> should be encapsulated into a top level block element (we use <div role="main">). This is because iBook imposes a zero padding on the body element, and that cannot be controlled by the user; the introduction of the top level block element allows for suitable CSS adjustments.

The CSS adjustment is done as follows: the templates.BOOK_CSS is completed with the exact padding values; these are retrieved (depending on the TR version and the document) from the config.PADDING_NEW_STYLE and, if applicable, the config.PADDING_OLD_STYLE dictionaries. The expansion of templates.BOOK_CSS itself happens in the doc2epub.DocWrapper.process() method.

Note that using simply a “main” element as a top level encapsulation is not a good approach, because some files (e.g., generated by Bikeshed) already use that element, and there can be only one of those…

2. If a <pre> element has the class name highlight, the Readium extension to Chrome goes wild. However, that class name is used only for internal processing by ReSpec and is unused in the various default CSS files. As an emergency measure this class name is simply removed from the code, although, clearly, this is not the optimal way:-( But hopefully this bug will disappear from Readium and this hack can be removed, eventually.

Note: this is an acknowledged bug in Readium. When a newer release of Readium is deployed, this hack should be removed from the code.

3. Some readers require a type="text/css" attribute on the link element for a CSS file; otherwise the CSS is ignored. It is added (though not needed in HTML5, it doesn’t do any harm either…)

4. Add the toc-inline value to the class of the body element, to ensure that the TOC stays inline and is not floated on the left hand side. In reality, this is needed only for the post-2016 versions of the TR documents, but it does no harm for earlier versions. I.e., this step is not made more complicated by a check of the document’s TR version.

5. Also like 4., remove the reference to the fixup.js script (which sets some initial values to the sidebar handling which is to be removed altogether anyway...)

Parameters:html (xml.etree.ElementTree.ElementTree) – the object for the whole document
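Steps 1 to 4 above can be sketched on a small ElementTree document as shown below. This is an illustration under simplifying assumptions (a namespace-free tree, a root <html> Element rather than an ElementTree, and a hypothetical function name adjust_dom); it is not the library's implementation.

```python
import xml.etree.ElementTree as ET

def adjust_dom(root):
    """Apply steps 1 to 4 to a namespace-free <html> element."""
    body = root.find("body")

    # Step 1: move everything inside <body> into a <div role="main"> wrapper
    wrapper = ET.Element("div", {"role": "main"})
    wrapper.text, body.text = body.text, None
    for child in list(body):
        body.remove(child)
        wrapper.append(child)
    body.append(wrapper)

    # Step 2: drop the "highlight" class from <pre> elements
    for pre in root.iter("pre"):
        classes = pre.get("class", "").split()
        if "highlight" in classes:
            classes.remove("highlight")
            if classes:
                pre.set("class", " ".join(classes))
            else:
                del pre.attrib["class"]

    # Step 3: make the stylesheet type explicit
    for link in root.iter("link"):
        if link.get("rel") == "stylesheet":
            link.set("type", "text/css")

    # Step 4: keep the TOC inline
    classes = body.get("class", "").split()
    if "toc-inline" not in classes:
        classes.append("toc-inline")
    body.set("class", " ".join(classes))
    return root

root = adjust_dom(ET.fromstring(
    "<html><head><link rel='stylesheet' href='base.css'/></head>"
    "<body class='h-entry'><p>Text</p><pre class='highlight idl'>x</pre></body></html>"
))
body = root.find("body")
```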
static create_shortname(name)[source]

Create the short name, in W3C jargon, based on the dated name. Returns a tuple with the category of the publication (REC, NOTE, PR, WD, CR, ED, “RSCND”, or PER), and the short name itself.

Parameters:name (str) – dated name
Returns:tuple with the category of the publication (REC, NOTE, PR, WD, CR, ED, “RSCND”, or PER), and the short name itself.
Return type:tuple
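One possible way to split a dated name is a regular expression, as in the sketch below. It assumes the usual CATEGORY-shortname-YYYYMMDD pattern of W3C dated names; split_dated_name is a hypothetical helper, the sample name is illustrative, and ValueError stands in for whatever error the real method reports.

```python
import re

CATEGORIES = ("REC", "NOTE", "PR", "WD", "CR", "ED", "RSCND", "PER")
DATED_NAME = re.compile(r"^(?P<category>[A-Z]+)-(?P<short>.+)-(?P<date>\d{8})$")

def split_dated_name(name):
    """Return (category, short name) for a name such as 'WD-foo-20160519'."""
    match = DATED_NAME.match(name)
    if match is None or match.group("category") not in CATEGORIES:
        raise ValueError("unexpected dated name: %s" % name)
    return match.group("category"), match.group("short")

category, short_name = split_dated_name("WD-css-grid-1-20160519")
```

The greedy middle group keeps hyphens that belong to the short name itself, since only the final eight digits are treated as the date.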
static editors_to_string(names, editor=True)[source]

Return a string of names generated from a list of names, with correct punctuation, and a suffix denoting whether these are editors or authors

Parameters:
  • names – list of strings, each entry a name to be used in the final output
  • editor – if True, the string ‘(editor)’ or ‘(editors)’ is appended to the list (depending on cardinality), ‘(author)’, resp. ‘(authors)’ otherwise
Returns:

a string that can be used as a final display for the names of editors/authors.
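A sketch of this kind of formatting follows. The exact punctuation used by the real method is not documented here, so the comma-and-"and" style below is an assumption, as is the helper name names_to_string.

```python
def names_to_string(names, editor=True):
    """Join names and append '(editor[s])' or '(author[s])'."""
    if not names:
        return ""
    role = "editor" if editor else "author"
    if len(names) > 1:
        role += "s"  # plural suffix depends on cardinality
    if len(names) <= 2:
        joined = " and ".join(names)
    else:
        joined = ", ".join(names[:-1]) + ", and " + names[-1]
    return "%s (%s)" % (joined, role)

one = names_to_string(["A. Author"])
two = names_to_string(["A. Author", "B. Writer"], editor=False)
three = names_to_string(["A", "B", "C"])
```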

static extract_editors(html)[source]

Extract the editors’ names from a document, following the respec conventions (@class=p-author for <dd> including <a> or <span> with @class=p-name)

Note that this is used only for older documents. Current respec reproduces the configuration in the target HTML file that can be used to extract the data directly.

Parameters:html (xml.etree.ElementTree.ElementTree) – the object for the whole document
Returns:list of editors
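The convention described above (a <dd> with class p-author containing an <a> or <span> with class p-name) can be followed with ElementTree roughly as below. This is a sketch on a namespace-free fragment with invented names; find_editors is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def has_class(element, name):
    """True if name is one of the space-separated class values."""
    return name in element.get("class", "").split()

def find_editors(root):
    editors = []
    for dd in root.iter("dd"):
        if not has_class(dd, "p-author"):
            continue
        for el in dd.iter():
            if el.tag in ("a", "span") and has_class(el, "p-name"):
                editors.append("".join(el.itertext()).strip())
                break  # one name per <dd>
    return editors

fragment = ET.fromstring(
    "<dl>"
    "<dd class='p-author h-card vcard'><a class='u-url p-name fn' href='#'>Jane Doe</a>, Example Org</dd>"
    "<dd class='p-author h-card vcard'><span class='p-name fn'>John Roe</span></dd>"
    "<dd>Not an editor</dd>"
    "</dl>"
)
editors = find_editors(fragment)
```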
static extract_toc(html, short_name)[source]

Extract the table of contents from the document. html is the object for the full document, and short_name is the short name for the document as a whole (used in possible warnings). The result is an array of TOC_Item objects.

Parameters:
  • html (xml.etree.ElementTree.ElementTree) – the object for the whole document
  • short_name (str) – short name of the document as a whole (used in possible warning)
Returns:

array of TOC_Item instances

static get_document_properties(html)[source]

Find the extra manifest properties that must be added to the HTML resource in the opf file.

See the IDPF documentation for details

Parameters:html (xml.etree.ElementTree.ElementTree) – the object for the whole document
Returns:set collecting all possible property values
Return type:set
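As an illustration, the sketch below derives three of the EPUB 3 manifest properties (scripted, mathml, svg) from the elements present in a document. Which properties the real method checks is not stated here, so this selection and the helper name manifest_properties are assumptions.

```python
import xml.etree.ElementTree as ET

# Element local name -> EPUB 3 manifest property
PROPERTY_FOR_TAG = {"script": "scripted", "math": "mathml", "svg": "svg"}

def manifest_properties(root):
    """Collect manifest properties implied by the elements in the tree."""
    properties = set()
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        local_name = element.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
        if local_name in PROPERTY_FOR_TAG:
            properties.add(PROPERTY_FOR_TAG[local_name])
    return properties

doc = ET.fromstring(
    "<html><head><script src='x.js'> </script></head>"
    "<body><svg xmlns='http://www.w3.org/2000/svg'/></body></html>"
)
properties = manifest_properties(doc)
```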
static html_to_xhtml(html)[source]

Make the minimum changes necessary in the DOM tree so that the XHTML5 output is valid and accepted by epub readers. These are:

1. The http://www.w3.org/1999/xhtml namespace is required in EPUB, but not generated by the XML serialization of Python’s ElementTree (or the HTML5Lib implementation thereof?). It is therefore added explicitly.

2. XHTML5 does not work with <script src="..."/>, ie, with a self-closing element. Such elements are modified by adding a space to the content of the element.

Parameters:html (xml.etree.ElementTree.ElementTree) – the object for the whole document
Returns:the input object
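The two changes can be sketched with ElementTree as follows. This is a simplified illustration (to_xhtml is a hypothetical name, and the namespace is set as a plain xmlns attribute on a namespace-free tree), not the library's implementation.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def to_xhtml(root):
    """Add the XHTML namespace and avoid self-closing <script src> elements."""
    # Change 1: declare the XHTML namespace on the root element
    root.set("xmlns", XHTML_NS)
    # Change 2: a single space as content forces <script ...> </script>
    for script in root.iter("script"):
        if script.get("src") is not None and not script.text:
            script.text = " "
    return root

root = ET.fromstring("<html><head><script src='a.js'/></head><body/></html>")
serialized = ET.tostring(to_xhtml(root), encoding="unicode")
```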
static retrieve_date(duri)[source]

Retrieve the (publication) date from the dated URI.

Parameters:duri (str) – dated URI
Returns:date
Return type:datetime.date
Raises R2EError:
 the dated URI is not of an expected format
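A minimal sketch of such a date extraction is shown below, assuming the dated URI ends in -YYYYMMDD with an optional trailing slash. The helper name date_from_dated_uri and the sample URI are hypothetical, and ValueError stands in for R2EError.

```python
import re
from datetime import date

def date_from_dated_uri(duri):
    """Return the date encoded in the final '-YYYYMMDD' part of the URI."""
    match = re.search(r"-(\d{4})(\d{2})(\d{2})/?$", duri)
    if match is None:
        raise ValueError("dated URI is not of the expected format: %s" % duri)
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)

published = date_from_dated_uri("https://www.w3.org/TR/2016/WD-example-20160519/")
```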
static set_html_meta(html, head)[source]

Change the meta elements so that:

  • any @http-equiv=content-type is removed
  • there should be an extra meta setting the character set
Parameters:
  • html (xml.etree.ElementTree.ElementTree) – the object for the whole document
  • head (xml.etree.ElementTree.Element) – the object for the <head> element
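The two rules can be sketched on a <head> element as follows. The helper name normalize_meta, the choice of utf-8, and placing the new element first are assumptions for this example.

```python
import xml.etree.ElementTree as ET

def normalize_meta(head):
    """Remove http-equiv=content-type metas and ensure a charset meta exists."""
    for meta in list(head.findall("meta")):
        if meta.get("http-equiv", "").lower() == "content-type":
            head.remove(meta)
    if not any(meta.get("charset") for meta in head.findall("meta")):
        # Assumed encoding for the sketch
        head.insert(0, ET.Element("meta", {"charset": "utf-8"}))
    return head

head = normalize_meta(ET.fromstring(
    "<head>"
    "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>"
    "<title>Sample</title>"
    "</head>"
))
metas = head.findall("meta")
```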