This specification defines the innerText IDL attribute for HTML elements. The attribute was originally an extension to the DOM introduced by Microsoft in Internet Explorer sometime in the mists of history, and was eventually copied by other browser rendering engines, albeit somewhat inconsistently. This document is (or rather will be) the result of reverse-engineering the behavior of major browsers.
The innerText attribute essentially returns a plaintext version of the element's contents, so this specification just lays out an algorithm to convert HTML to plaintext. The same algorithm will be reused for Selection stringification. Although Gecko (Firefox) doesn't implement innerText, it does implement Selection stringification, so Gecko's behavior was considered in writing the algorithm.
Where the reasoning behind the specification is of interest, such as when major preexisting rendering engines are known not to match it, the reasoning is included in HTML comments so as not to distract the reader.
An ignored node is a Node that is one of the following:
Comment
ProcessingInstruction
Element whose "display" property computes to "none"
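The definition above can be read as a simple predicate. The following is a minimal sketch in Python, not part of the specification: the Node class and its node_type and display fields are hypothetical stand-ins for a DOM node and its computed style, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical stand-in for a DOM node; not a real DOM API.
    node_type: str                 # e.g. "Comment", "ProcessingInstruction", "Element", "Text"
    display: Optional[str] = None  # computed "display" value; only meaningful for elements


def is_ignored_node(node: Node) -> bool:
    """Return True if node matches the definition of an ignored node."""
    # Comments and processing instructions are always ignored.
    if node.node_type in ("Comment", "ProcessingInstruction"):
        return True
    # Elements are ignored only when "display" computes to "none".
    return node.node_type == "Element" and node.display == "none"
```

Under this model, a Text node is never an ignored node, and an Element is ignored only because of its own computed "display" value.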
The innerText IDL attribute is a DOMString on the HTMLElement interface. On setting, it must behave identically to textContent. On getting, the user agent must append the plaintext of the context node to the empty string and return the string portion of the output (discarding the boolean portion).
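The getter and setter contract might be sketched as follows. This is an illustration under stated assumptions, not an implementation: the plaintext algorithm is passed in as a callable, and FakeElement, fake_append_plaintext, get_inner_text and set_inner_text are hypothetical names that do not come from this specification or any DOM API. The placeholder algorithm does not perform the real conversion; it exists only so the sketch runs.

```python
from typing import Callable, Tuple

# Assumed shape of the "append the plaintext" algorithm:
# (node, s, trailing_space) -> (string, boolean)
AppendPlaintext = Callable[[object, str, bool], Tuple[str, bool]]


class FakeElement:
    """Hypothetical element that only stores a textContent-like value."""

    def __init__(self, text_content: str = "") -> None:
        self.text_content = text_content


def get_inner_text(node: FakeElement, append_plaintext: AppendPlaintext) -> str:
    # Append the plaintext of the context node to the empty string,
    # with trailing space at its default value of false.
    text, _trailing_space = append_plaintext(node, "", False)
    # Only the string portion is returned; the boolean is discarded.
    return text


def set_inner_text(node: FakeElement, value: str) -> None:
    # On setting, behave identically to textContent.
    node.text_content = value


def fake_append_plaintext(node, s, trailing_space=False):
    # Placeholder only: NOT the algorithm this specification defines.
    return s + node.text_content, trailing_space
```

The point of the sketch is the shape of the call: the getter starts from the empty string and keeps only the first of the two outputs.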
To append the plaintext of a Node node to a string s with boolean flag trailing space, a user agent must run the following algorithm, which returns two outputs: a string and a boolean. If the algorithm is invoked with only two arguments, trailing space defaults to false. The user agent must execute these steps:
Either generated content needs to be included in the plaintext (which makes sense to me but no browser does it), or <br> has to be special-cased.
The "returning two outputs" thing is awkward. Can we do it more nicely? The problem is that since the algorithm is defined recursively, we can't easily tell whether trailing spaces were produced as collapsed whitespace (and can be elided) or are part of something with "white-space: pre(-wrap)" (and cannot be elided). We have a similar problem with leading/trailing whitespace for inlines, where we cheat by defining a separate algorithm, which isn't ideal. Maybe we need to respec this non-recursively and track more state within the algorithm? Probably that's how implementers will implement it anyway.
Text
node
data
.
At the time of this writing, there is no precise definition of how text-transform is supposed to work in CSS. User agents should apply the same transformation here as they do in CSS.
Element that is not an ignored node
The leading whitespace for an Element node consists of the following, depending on the computed value of node's "display" property:
The trailing whitespace for an Element node consists of the following, depending on the computed value of node's "display" property:
innerText is empty, the empty string. Otherwise, if the "margin-bottom" property of node has computed value at least half that of its "font-size" property, the string consisting of two newlines (U+000A). Otherwise, the string consisting of one newline (U+000A).
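The rule in the preceding sentence can be sketched as a small function. This is a hedged illustration only: the function name and parameters are hypothetical, both computed values are assumed to be expressed in the same unit (pixels here), and the sketch does not say which "display" values the rule applies to.

```python
def trailing_newlines(inner_text: str, margin_bottom_px: float, font_size_px: float) -> str:
    """Trailing whitespace for the case described above (assumed inputs in px)."""
    # An element whose innerText is empty contributes nothing.
    if inner_text == "":
        return ""
    # A bottom margin of at least half the font size yields two newlines.
    if margin_bottom_px >= font_size_px / 2:
        return "\n\n"
    # Otherwise a single newline (U+000A).
    return "\n"
```

For example, with a 16px font size, an 8px bottom margin sits exactly on the threshold and yields two newlines, while a zero bottom margin yields one.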
All references are normative unless marked "Non-normative".