6. Common Utilities¶
Various utility classes and methods.
6.1. Module Content¶
-
rp2epub.utils.
logger
¶ A Python logger instance (see the Python logging library for details). May be overwritten by the DocWrapper instance. Defaults to None.
-
rp2epub.utils.
TOC_PAIRS
¶ Array of tuples that help select the (top level) TOC entries; these are strings to be used in an XPath find. Because there have been several document versions in the past, including Bikeshed and ReSpec versions, the array contains quite a number of variants. A tuple may contain a third string, denoting a specific class name on the target element that can be used to narrow the filter.
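A minimal sketch of how such an array of tuples might drive an XPath lookup. The entries of `EXAMPLE_TOC_PAIRS`, the helper `find_toc_entries`, and the reading of each tuple as (container path, entry path, optional class name) are assumptions made for illustration; they are not the actual content of `TOC_PAIRS`.

```python
import xml.etree.ElementTree as ET

# Hypothetical entries: (XPath to the TOC container, XPath to the entries
# relative to that container[, class name required on the container]).
EXAMPLE_TOC_PAIRS = [
    (".//nav[@id='toc']", "ol/li"),
    (".//div[@id='toc']", "ul/li", "toc"),
]


def find_toc_entries(root, pairs=EXAMPLE_TOC_PAIRS):
    """Return the top-level TOC elements for the first pair that matches."""
    for pair in pairs:
        container_path, entry_path = pair[0], pair[1]
        required_class = pair[2] if len(pair) > 2 else None
        for container in root.findall(container_path):
            classes = container.get("class", "").split()
            if required_class is not None and required_class not in classes:
                continue  # the optional third string narrows the filter
            entries = container.findall(entry_path)
            if entries:
                return entries
    return []


doc = ET.fromstring(
    "<html><body><div id='toc' class='toc'>"
    "<ul><li>1. Intro</li><li>2. Model</li></ul>"
    "</div></body></html>"
)
toc_labels = [li.text for li in find_toc_entries(doc)]
```

Trying the variants in order and stopping at the first match keeps the lookup tolerant of the different document generations.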
-
class
rp2epub.utils.
Book
(book_name, folder_name, package=True, folder=False)[source]¶ Abstraction for a book; it encapsulates a zip file as well as saving the content into a directory.
Parameters: - book_name – file name of the book
- folder_name – name of the directory
- package – whether a real zip file should be created or not
- folder – whether the directory structure should be created separately or not
-
_path
(path)[source]¶ Expand the path with the name of the package, check whether the resulting path (filename) includes intermediate directories and create those on the fly if necessary.
Parameters: path – path to be checked
Returns: expanded, full path
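The behaviour described for `_path` can be approximated as follows. This is only a sketch: the function name `expand_path` and its two-argument signature are hypothetical, and the real method works on the `Book` instance rather than on an explicit prefix.

```python
import os
import tempfile


def expand_path(book_name, path):
    """Prefix `path` with the book name and create missing parent directories."""
    full_path = os.path.join(book_name, path)
    parent = os.path.dirname(full_path)
    if parent:
        # exist_ok avoids an error when the directory is already there
        os.makedirs(parent, exist_ok=True)
    return full_path


with tempfile.TemporaryDirectory() as tmp:
    book_root = os.path.join(tmp, "mybook")
    result = expand_path(book_root, "Assets/css/base.css")
    parent_created = os.path.isdir(os.path.join(book_root, "Assets", "css"))
```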
-
folder
¶ Flag whether a folder should be generated or not
-
name
¶ Prefix that should be added to all names when storing a folder (set to the short name of the document)
-
package
¶ Flag whether an EPUB package should be generated or not
-
write_HTTP
(target, url)[source]¶ Retrieve the content of a URI and store it in the book. (This is a wrapper around the write_session method.)
Parameters: - target (str) – path for the target file, this is always a relative URI
- url – URL that has to be retrieved to be written into the book
Boolean return: whether the HTTP session was successful or not
-
write_element
(target, element)[source]¶ An ElementTree object is added to the book.
Parameters: - target (str) – path for the target file
- element (
xml.etree.ElementTree
) – the XML tree to be stored
-
write_session
(target, session, css_change_patterns=None)[source]¶ The returned content of an HttpSession is added to the book. If the content is an HTML file, it will be converted into XHTML on the fly.
Parameters: - target (str) – path for the target file
- session – an HttpSession instance whose data must be retrieved and written into the book
- css_change_patterns – a list of (from, to) replace patterns to be applied to CSS files before storage
Return boolean: the value of session.success
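As an illustration of the `css_change_patterns` argument, the sketch below applies a list of (from, to) pairs to a CSS string. It assumes plain substring replacement; whether the real method uses substrings or regular expressions is not stated here, and the helper name `apply_css_patterns` is hypothetical.

```python
def apply_css_patterns(css_text, css_change_patterns=None):
    """Apply each (from, to) pair in order; None or an empty list is a no-op."""
    if css_change_patterns:
        for old, new in css_change_patterns:
            css_text = css_text.replace(old, new)
    return css_text


patched = apply_css_patterns(
    "body { background: url(//www.w3.org/StyleSheets/TR/logo.png); }",
    [("//www.w3.org/StyleSheets/TR/", "")],
)
untouched = apply_css_patterns("p { margin: 0; }")
```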
-
writestr
(target, content, compress=8)[source]¶ Write the content of a string.
Parameters: - target – path for the target file
- content – string/bytes to be written to the file
- compress – either zipfile.ZIP_DEFLATED or zipfile.ZIP_STORED, depending on whether the content should be compressed or not
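The two compression values map directly onto the standard `zipfile` module, where the default `compress=8` equals `zipfile.ZIP_DEFLATED`. The following sketch uses `zipfile.ZipFile.writestr` on an in-memory archive; the file names are illustrative only and this is not the `Book.writestr` implementation.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as book:
    # The EPUB 'mimetype' entry has to be stored uncompressed.
    book.writestr("mimetype", "application/epub+zip",
                  compress_type=zipfile.ZIP_STORED)
    # Ordinary content is usually deflated.
    book.writestr("Overview.xhtml", "<html/>" * 100,
                  compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(buffer) as book:
    stored = book.getinfo("mimetype").compress_type
    deflated = book.getinfo("Overview.xhtml").compress_type
```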
-
zip
¶ The package (book) file itself
-
class
rp2epub.utils.
HttpSession
(url, check_media_type=False, raise_exception=False, is_respec=False)[source]¶ Wrapper around an HTTP session; the returned media type is compared against accepted media types.
Parameters: - url (str) – the URL to be retrieved
- check_media_type (boolean) – whether the media type of the returned resource should be checked against the accepted media types
- raise_exception (boolean) – whether an exception should be raised if the document cannot be retrieved (either because the HTTP return is not 200, or not of an acceptable media type)
- is_respec (boolean) – if True, the URL is a callout to the spec generator service; if so, and there is a problem, the corresponding error message is different
Raises R2EError: in case the file is not of an acceptable media type, or the HTTP return is not 200
-
data
¶ The returned resource, as a file-like object
-
media_type
¶ Media type of the resource
-
success
¶ True if the HTTP retrieval was successful, False otherwise
-
url
¶ The request URL for this session
-
class
rp2epub.utils.
Logger
[source]¶ Wrapper around the logger calls, simply checking whether the logger in the configuration file has been set to a real value or whether it is None (in the latter case nothing happens). Saves a repeated set of checks elsewhere in the code.
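A minimal sketch of the "check for None before logging" idea. The class name `SafeLogger`, its class attribute, and the boolean return values are hypothetical additions for illustration; the real `Logger` consults the logger held elsewhere in the package.

```python
import logging


class SafeLogger:
    """Forward messages to a logger only when one has been configured."""

    logger = None  # stand-in for the configured logger; None means "do nothing"

    @classmethod
    def _emit(cls, level, message):
        if cls.logger is None:
            return False  # nothing configured: silently skip the message
        cls.logger.log(level, message)
        return True

    @classmethod
    def info(cls, message):
        return cls._emit(logging.INFO, message)

    @classmethod
    def warning(cls, message):
        return cls._emit(logging.WARNING, message)


dropped = SafeLogger.warning("no logger yet")
SafeLogger.logger = logging.getLogger("rp2epub.sketch")
forwarded = SafeLogger.warning("logger is now set")
```

Callers can then log unconditionally, without repeating the `None` check at every call site.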
-
class
rp2epub.utils.
TOC_Item
(href, label, short_label)[source]¶ A single Table of Contents (TOC) item.
Parameters: - href (str) – reference in the TOC
- label (str) – long label, i.e., including the chapter numbering
- short_label (str) – short label, i.e., without the chapter numbering
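A record with these three fields could be modelled as a named tuple. This is an assumption about the shape only; the name `TocItem` and the sample values are made up, and the actual class may be implemented differently.

```python
from collections import namedtuple

# Hypothetical stand-in for TOC_Item with the three documented fields.
TocItem = namedtuple("TocItem", ["href", "label", "short_label"])

item = TocItem(
    href="Overview.xhtml#introduction",
    label="1. Introduction",        # with the chapter numbering
    short_label="Introduction",     # without the chapter numbering
)
```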
-
class
rp2epub.utils.
Utils
[source]¶ Generic utility functions to extract information from a W3C TR document.
-
static
change_DOM
(html)[source]¶ Changes on the DOM to ensure proper interoperability of the display among EPUB readers. At the moment, the following actions are performed:
1. Due to the rigidity of the iBooks reader, the DOM tree has to change: all children of the <body> should be encapsulated into a top level block element (we use <div role="main">). This is because iBooks imposes a zero padding on the body element, and that cannot be controlled by the user; the introduction of the top level block element allows for suitable CSS adjustments.
The CSS adjustment is done as follows: the templates.BOOK_CSS is completed with the exact padding values; these are retrieved (depending on the TR version and the document) from the config.PADDING_NEW_STYLE and, if applicable, the config.PADDING_OLD_STYLE dictionaries. The expansion of templates.BOOK_CSS itself happens in the doc2epub.DocWrapper.process() method.
Note that using simply a “main” element as a top level encapsulation is not a good approach, because some files (e.g., generated by Bikeshed) already use that element, and there can be only one of those…
2. If a <pre> element has the class name highlight, the Readium extension to Chrome goes wild. However, that class name is used only for internal processing by ReSpec and is unused in the various default CSS content. As an emergency measure this class name is simply removed from the code, although, clearly, this is not the optimal way :-( Hopefully this bug will disappear from Readium and this hack can eventually be removed.
Note: this is an acknowledged bug in Readium. When a newer release of Readium is deployed, this hack should be removed from the code.
3. Some readers require a type="text/css" attribute on the link element for a CSS file; otherwise the CSS is ignored. It is added (though not needed in HTML5, it doesn’t do any harm either…)
4. Add the toc-inline value to the class of the body element, to ensure that the TOC stays inline and is not floated on the left hand side. In reality, this is needed only for the post-2016 versions of the TR documents, but it does no harm for earlier versions; i.e., this step is not made more complicated by a check of the document’s TR version.
5. As in 4., remove the reference to the fixup.js script (which sets some initial values for the sidebar handling, which is to be removed altogether anyway...)
Parameters: html ( xml.etree.ElementTree.ElementTree
) – the object for the whole document
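Steps 1 to 4 above can be sketched with the standard ElementTree API. The helper names (`wrap_body_children`, `add_class`, `remove_class`) are hypothetical, the sample document carries no XHTML namespace for brevity, and the CSS padding adjustment and the fixup.js removal are left out.

```python
import xml.etree.ElementTree as ET


def wrap_body_children(body):
    """Step 1: move all children of <body> into a <div role="main">."""
    wrapper = ET.Element("div", {"role": "main"})
    wrapper.text, body.text = body.text, None  # keep any leading text
    for child in list(body):
        body.remove(child)
        wrapper.append(child)
    body.append(wrapper)
    return wrapper


def add_class(element, name):
    classes = element.get("class", "").split()
    if name not in classes:
        classes.append(name)
    element.set("class", " ".join(classes))


def remove_class(element, name):
    classes = [c for c in element.get("class", "").split() if c != name]
    if classes:
        element.set("class", " ".join(classes))
    elif "class" in element.attrib:
        del element.attrib["class"]


html = ET.fromstring(
    "<html><head><link rel='stylesheet' href='base.css'/></head>"
    "<body><h1>Title</h1><pre class='highlight example'>x</pre></body></html>"
)
body = html.find("body")
wrap_body_children(body)                      # step 1
for pre in html.iter("pre"):                  # step 2
    remove_class(pre, "highlight")
for link in html.iter("link"):                # step 3
    if link.get("rel") == "stylesheet":
        link.set("type", "text/css")
add_class(body, "toc-inline")                 # step 4
```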
-
static
create_shortname
(name)[source]¶ Create the short name, in W3C jargon, based on the dated name. Returns a tuple with the category of the publication (REC, NOTE, PR, WD, CR, ED, “RSCND”, or PER), and the short name itself.
Parameters: name (str) – dated name
Returns: tuple with the category of the publication (REC, NOTE, PR, WD, CR, ED, “RSCND”, or PER), and the short name itself.
Return type: tuple
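A sketch of the split, assuming the dated name has the usual `CATEGORY-shortname-YYYYMMDD` shape. The function name `split_dated_name` and the use of `ValueError` are illustrative choices, not the behaviour of the actual method.

```python
import re

CATEGORIES = ("REC", "NOTE", "PR", "WD", "CR", "ED", "RSCND", "PER")
# Assumed shape of a dated name: <category>-<short name>-<8 digit date>
_DATED_NAME = re.compile(r"^(%s)-(.+)-(\d{8})$" % "|".join(CATEGORIES))


def split_dated_name(name):
    """Return (category, short name) for a dated name."""
    match = _DATED_NAME.match(name)
    if match is None:
        raise ValueError("unexpected dated name: %r" % name)
    return match.group(1), match.group(2)


category, short_name = split_dated_name("REC-rdf11-concepts-20140225")
```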
-
static
editors_to_string
(names, editor=True)[source]¶ Return a string of names generated from a list of names, with correct punctuation, and a suffix denoting whether these are editors or authors
Parameters: - names – list of strings, each entry a name to be used in the final output
- editor – if True, the string ‘(editor)’ or ‘(editors)’ is appended to the list (depending on cardinality), ‘(author)’, resp. ‘(authors)’ otherwise
Returns: a string that can be used as a final display for the names of editors/authors.
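One possible rendering of this behaviour is sketched below. The exact punctuation (commas plus a final "and") is an assumption, as is the function name `names_to_string`; only the singular/plural suffix logic follows the description above.

```python
def names_to_string(names, editor=True):
    """Join names and append '(editor)', '(editors)', '(author)' or '(authors)'."""
    if not names:
        return ""
    role = "editor" if editor else "author"
    if len(names) == 1:
        return "%s (%s)" % (names[0], role)
    joined = ", ".join(names[:-1]) + " and " + names[-1]
    return "%s (%ss)" % (joined, role)


single = names_to_string(["Jane Doe"])
several = names_to_string(["Jane Doe", "John Roe", "Ann Poe"], editor=False)
```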
-
static
extract_editors
(html)[source]¶ Extract the editors’ names from a document, following the ReSpec conventions (@class=p-author for <dd> including <a> or <span> with @class=p-name).
Note that this is used only for older documents. Current ReSpec reproduces the configuration in the target HTML file, and that can be used to extract the data directly.
Parameters: html (xml.etree.ElementTree.ElementTree) – the object for the whole document
Returns: list of editors
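The markup convention can be illustrated with a small ElementTree traversal. The function name `find_editors` and the sample markup are hypothetical, and the sketch ignores XML namespaces for brevity.

```python
import xml.etree.ElementTree as ET


def find_editors(root):
    """Collect the text of the first p-name <a>/<span> inside each p-author <dd>."""
    editors = []
    for dd in root.iter("dd"):
        if "p-author" not in dd.get("class", "").split():
            continue
        for child in dd.iter():
            is_name = "p-name" in child.get("class", "").split()
            if child.tag in ("a", "span") and is_name:
                editors.append("".join(child.itertext()).strip())
                break
    return editors


sample = ET.fromstring(
    "<dl><dt>Editors:</dt>"
    "<dd class='p-author h-card vcard'>"
    "<a class='u-url url p-name fn' href='http://example.org/'>Jane Doe</a>, Example Org</dd>"
    "<dd class='p-author h-card vcard'><span class='p-name fn'>John Roe</span></dd>"
    "</dl>"
)
editors = find_editors(sample)
```

Splitting the class attribute on whitespace matters here, because ElementTree's XPath subset can only test an attribute for an exact value.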
-
static
extract_toc
(html, short_name)[source]¶ Extract the table of contents from the document.
html is the Element object for the full document. toc_tuples is an array of TOC_Item objects where the items should be put, short_name is the short name for the document as a whole (used in possible warnings).
Parameters: - html (xml.etree.ElementTree.ElementTree) – the object for the whole document
- short_name (str) – short name of the document as a whole (used in possible warnings)
Returns: array of TOC_Item instances
-
static
get_document_properties
(html)[source]¶ Find the extra manifest properties that must be added to the HTML resource in the opf file.
See the IDPF documentation for details
Parameters: html (xml.etree.ElementTree.ElementTree) – the object for the whole document
Returns: set collecting all possible property values
Return type: set
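As a rough illustration, the sketch below derives three of the EPUB 3 manifest properties (`scripted`, `mathml`, `svg`) from the elements present in a document. Which properties the real method checks, and how, is not stated here; the function name `manifest_properties` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Element name -> EPUB 3 manifest property it implies
_PROPERTY_FOR_TAG = {"script": "scripted", "math": "mathml", "svg": "svg"}


def manifest_properties(root):
    """Return the set of manifest properties implied by the document content."""
    properties = set()
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        local_name = element.tag.rsplit("}", 1)[-1]  # drop a namespace, if any
        if local_name in _PROPERTY_FOR_TAG:
            properties.add(_PROPERTY_FOR_TAG[local_name])
    return properties


sample = ET.fromstring(
    "<html><body><p>x</p><math><mi>a</mi></math>"
    "<script src='a.js'> </script></body></html>"
)
found = manifest_properties(sample)
```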
-
static
html_to_xhtml
(html)[source]¶ Make the minimum changes necessary in the DOM tree so that the XHTML5 output is valid and accepted by EPUB readers. These are:
1. The http://www.w3.org/1999/xhtml namespace is required in EPUB, but not generated by the XML serialization of Python’s ElementTree (or the HTML5Lib implementation thereof?). It is therefore added explicitly.
2. XHTML5 does not work with <script src="..."/>, i.e., with a self-closing element. Such elements are modified by adding a space to the content of the element.
Parameters: html (xml.etree.ElementTree.ElementTree) – the object for the whole document
Returns: the input object
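Both changes are easy to reproduce with the standard ElementTree API. The snippet below is a sketch on a toy document, not the method's code: it sets the namespace as a plain `xmlns` attribute and gives empty `<script src="...">` elements a single space as content, so that the serializer emits an explicit closing tag.

```python
import xml.etree.ElementTree as ET

html = ET.fromstring(
    "<html><head><script src='respec.js'/></head><body/></html>"
)

# 1. Add the XHTML namespace declaration explicitly.
html.set("xmlns", "http://www.w3.org/1999/xhtml")

# 2. Avoid self-closing <script src="..."/> by adding a space as content.
for script in html.iter("script"):
    if script.get("src") is not None and not script.text:
        script.text = " "

serialized = ET.tostring(html, encoding="unicode")
```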
-
static
retrieve_date
(duri)[source]¶ Retrieve the (publication) date from the dated URI.
Parameters: duri (str) – dated URI
Returns: date
Return type: datetime.date
Raises R2EError: the dated URI is not of an expected format
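A sketch of the date extraction, assuming the dated URI ends in an eight digit `YYYYMMDD` date, optionally followed by a slash. The function name `date_from_dated_uri` is hypothetical, and `ValueError` stands in for the `R2EError` raised by the real method.

```python
import re
from datetime import date


def date_from_dated_uri(duri):
    """Return the publication date encoded at the end of a dated URI."""
    match = re.search(r"-(\d{4})(\d{2})(\d{2})/?$", duri)
    if match is None:
        raise ValueError("unexpected dated URI: %r" % duri)
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


published = date_from_dated_uri(
    "http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/"
)
```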
-
static
set_html_meta
(html, head)[source]¶ Change the meta elements so that:
- any @http-equiv=content-type is removed
- there should be an extra meta setting the character set
Parameters: - html (xml.etree.ElementTree.ElementTree) – the object for the whole document
- head (xml.etree.ElementTree.Element) – the object for the <head> element
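The two rules can be sketched as follows. The function name `normalize_meta`, the choice of `utf-8`, and the decision to drop any pre-existing `charset` meta before adding a new one are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def normalize_meta(head, charset="utf-8"):
    """Drop http-equiv=content-type metas and add a single charset meta."""
    for meta in list(head.findall("meta")):
        is_content_type = meta.get("http-equiv", "").lower() == "content-type"
        if is_content_type or meta.get("charset") is not None:
            head.remove(meta)
    # Put the charset declaration first so readers see it early.
    head.insert(0, ET.Element("meta", {"charset": charset}))
    return head


head = ET.fromstring(
    "<head>"
    "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>"
    "<title>T</title>"
    "</head>"
)
normalize_meta(head)
```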
-
static