innerText

Work in Progress — Last Update 3 February 2011

Editor: Aryeh Gregor <ayg+spec@aryeh.name>
Version history: http://aryeh.name/gitweb.cgi?p=innertext

Introduction

This specification defines the innerText IDL attribute for HTML elements. This was originally an extension to the DOM introduced by Microsoft in Internet Explorer sometime in the mists of history, which was eventually copied by other browser rendering engines, albeit somewhat inconsistently. This document is (or rather will be) the result of reverse-engineering the behavior of major browsers.

The innerText attribute essentially returns a plaintext version of the element's contents, so this specification just lays out an algorithm to convert HTML to plaintext. The same algorithm will be reused for Selection stringification. Although Gecko (Firefox) doesn't implement innerText, it does implement Selection stringification, so Gecko's behavior was considered in writing the algorithm.

Where the reasoning behind the specification is of interest, such as when major preexisting rendering engines are known not to match it, the reasoning is included in HTML comments so as not to distract the reader.

Definitions

An ignored node is a Node that is one of the following:

a Comment
a ProcessingInstruction
an Element whose "display" property computes to "none"
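
As a rough illustration of the definition above, the check can be sketched as a small predicate. The Node class here is a hypothetical stand-in for a DOM node (its field names kind and display are assumptions made for this sketch, not part of any real DOM API), and display is assumed to already hold the computed value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node, for illustration only."""
    kind: str                # "element", "text", "comment", or "pi"
    display: str = "inline"  # assumed computed "display"; only meaningful for elements
    children: List["Node"] = field(default_factory=list)


def is_ignored(node: Node) -> bool:
    """Return True if node is an ignored node as defined above."""
    # Comments and processing instructions are always ignored.
    if node.kind in ("comment", "pi"):
        return True
    # Elements are ignored only when "display" computes to "none".
    return node.kind == "element" and node.display == "none"
```

Text nodes and rendered elements are never ignored under this definition, so is_ignored(Node("text")) is false.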

The innerText attribute

The innerText IDL attribute is a DOMString on the HTMLElement interface. On setting, it must behave identically to textContent. On getting, the user agent must append the plaintext of the context node to the empty string and return the string portion of the output (discarding the boolean portion).

Plaintext conversion algorithm

To append the plaintext of a Node node to a string s with boolean flag trailing space, a user agent must run the following algorithm. It returns two outputs: a string and a boolean. If the algorithm is invoked with only two arguments, trailing space defaults to false. To run the algorithm, the user agent must execute these steps:

Either generated content needs to be included in the plaintext (which makes sense to me but no browser does it), or <br> has to be special-cased.

The "returning two outputs" thing is awkward. Can we do it more nicely? The problem is that since the algorithm is defined recursively, we can't easily tell whether trailing spaces were produced as collapsed whitespace (and can be elided) or are part of something with "white-space: pre(-wrap)" (and cannot be elided). We have a similar problem with leading/trailing whitespace for inlines, where we cheat by defining a separate algorithm, which isn't ideal. Maybe we need to respec this non-recursively and track more state within the algorithm? Probably that's how implementers will implement it anyway.

For each child child of node, in order, if child is . . .
an ignored node
Do nothing.
a Text node
1. If node's "visibility" property computes to "hidden", do nothing and abort these substeps (proceeding to the next child).
2. Let data be child's data.
3. Let whitespace be the computed value of the "white-space" property of node.
4. If whitespace is "normal", "nowrap", or "pre-line":
  1. If whitespace is "normal" or "nowrap", let set be the set of space characters. If it's "pre-line", let set be the set of space characters other than line feed (U+000A).
  2. Let position be a pointer into data, initially pointing at the start of the string.
  3. Let newdata be the empty string.
  4. While position doesn't point past the end of data:
    
    If the character at position is from set, append a single space (U+0020) to newdata and advance position until the character at position is not from set.
    Otherwise, if the character at position is a line feed (U+000A), delete the last character of newdata if it's a space (U+0020), then append a line feed (U+000A) to newdata, then increment position, then advance position until it points past the end of data or the character at position is not from set.
    Otherwise, append the character at position to newdata and increment position.
  5. Set data to newdata.
5. If trailing space is true and data does not begin with a space character, append a space to the end of s.
6. If s is empty or ends with a space character, and data begins with a space (U+0020), and whitespace is "normal", "nowrap", or "pre-line", delete the space from the beginning of data.
7. If whitespace is "normal", "nowrap", or "pre-line", and the last character of data is a space (U+0020), delete the last character of data and set trailing space to true. Otherwise, set trailing space to false.
8. If the computed value of node's "text-transform" property is not "none", apply the appropriate transformation to data.
  At the time of this writing, there is no precise definition of how text-transform is supposed to work in CSS. User agents should apply the same transformation here as they do in CSS.
9. Append data to s.
an Element that is not an ignored node
1. If the last character of s is not a newline (U+000A), and the leading whitespace for child is not the empty string, append the leading whitespace for child to s and set trailing space to false.
2. Append the plaintext of child to s with flag trailing space, and assign the result to (s, trailing space).
3. If node has another child after child that is not an ignored node, and the trailing whitespace for child is not the empty string, append the trailing whitespace for child to s and set trailing space to false.
Return (s, trailing space).
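
The Text node substeps above can be sketched roughly as follows. This is a simplified illustration, not a reference implementation: the function names are hypothetical, the "visibility" check and "text-transform" step are left out, and the "space characters" are assumed to be the HTML set (space, tab, line feed, form feed, carriage return). The sketch also assumes that position steps past a line feed after it is emitted, so that the loop terminates for "pre-line".

```python
SPACE_CHARACTERS = " \t\n\f\r"  # assumed: HTML's "space characters"
COLLAPSING = ("normal", "nowrap", "pre-line")


def collapse_whitespace(data: str, white_space: str) -> str:
    """Substep 4: collapse runs of whitespace unless it is preserved."""
    if white_space not in COLLAPSING:
        return data
    collapsible = SPACE_CHARACTERS
    if white_space == "pre-line":
        # Line feeds are preserved under "pre-line".
        collapsible = collapsible.replace("\n", "")
    out = []
    pos = 0
    while pos < len(data):
        ch = data[pos]
        if ch in collapsible:
            # A whole run of collapsible characters becomes one space.
            out.append(" ")
            while pos < len(data) and data[pos] in collapsible:
                pos += 1
        elif ch == "\n":
            # Only reachable for "pre-line": drop a space before the
            # line feed, keep the line feed, skip whitespace after it.
            if out and out[-1] == " ":
                out.pop()
            out.append("\n")
            pos += 1  # assumption: move past the line feed itself
            while pos < len(data) and data[pos] in collapsible:
                pos += 1
        else:
            out.append(ch)
            pos += 1
    return "".join(out)


def append_text(s: str, trailing_space: bool, data: str, white_space: str):
    """Substeps 4 to 7 and 9 for one Text node; returns (s, trailing space)."""
    collapsing = white_space in COLLAPSING
    data = collapse_whitespace(data, white_space)
    # Substep 5: restore a space that an earlier Text node deferred.
    if trailing_space and not (data and data[0] in SPACE_CHARACTERS):
        s += " "
    # Substep 6: avoid doubled or leading spaces.
    if (s == "" or s[-1] in SPACE_CHARACTERS) and data.startswith(" ") and collapsing:
        data = data[1:]
    # Substep 7: defer a trailing collapsed space to the next Text node.
    if collapsing and data.endswith(" "):
        data = data[:-1]
        trailing_space = True
    else:
        trailing_space = False
    # Substep 9.
    return s + data, trailing_space
```

For example, appending "  foo  " with "white-space: normal" to the empty string gives ("foo", True) in this sketch, and a following Text node "bar" then yields ("foo bar", False). Returning the pair mirrors the two outputs of the algorithm: the deferred space is only written out if more text actually follows.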

The leading whitespace for an Element node consists of the following, depending on the computed value of node's "display" property:

inline: If node has a child that is not an ignored node, and the first child of node that is not an ignored node is an Element, the leading whitespace for that child. Otherwise, the empty string.
inline-block, inline-table, none, table-cell, table-column, table-column-group: The empty string.
Any other value: The string consisting of a single newline.
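
A minimal sketch of the leading whitespace rules above, assuming a hypothetical Node model whose display field already holds the computed value. The class, its field names, and the helper is_ignored are illustrative assumptions, not a real DOM API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node, for illustration only."""
    kind: str                # "element", "text", "comment", or "pi"
    display: str = "inline"  # assumed computed "display" for elements
    children: List["Node"] = field(default_factory=list)


def is_ignored(node: Node) -> bool:
    if node.kind in ("comment", "pi"):
        return True
    return node.kind == "element" and node.display == "none"


# "display" values whose leading whitespace is the empty string.
EMPTY_LEADING = {
    "inline-block", "inline-table", "none",
    "table-cell", "table-column", "table-column-group",
}


def leading_whitespace(node: Node) -> str:
    """Return the leading whitespace for an element node."""
    if node.display == "inline":
        for child in node.children:
            if is_ignored(child):
                continue
            # The first child that is not ignored decides the result:
            # recurse into an Element, otherwise contribute nothing.
            if child.kind == "element":
                return leading_whitespace(child)
            return ""
        return ""  # no child that is not an ignored node
    if node.display in EMPTY_LEADING:
        return ""
    return "\n"
```

Under these rules an inline element that starts with a block-level child inherits that child's newline, while one that starts with text contributes nothing.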

The trailing whitespace for an Element node consists of the following, depending on the computed value of node's "display" property:

inline: If node has a child that is not an ignored node, and the last child of node that is not an ignored node is an Element, the trailing whitespace for that child. Otherwise, the empty string.
inline-block, inline-table, none, table-column, table-column-group: The empty string.
table-cell: The string consisting of one tab (U+0009).
Any other value: If node's innerText is empty, the empty string. Otherwise, if the "margin-bottom" property of node has computed value at least half that of its "font-size" property, the string consisting of two newlines (U+000A). Otherwise, the string consisting of one newline (U+000A).
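
The trailing whitespace rules can be sketched in the same style. The Node class below is again a hypothetical model: inner_text is assumed to hold the element's already-computed innerText, and margin_bottom_px and font_size_px are assumed to hold computed values in pixels. None of these names come from a real API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node, for illustration only."""
    kind: str                      # "element", "text", "comment", or "pi"
    display: str = "inline"        # assumed computed "display"
    children: List["Node"] = field(default_factory=list)
    inner_text: str = ""           # assumed precomputed innerText
    margin_bottom_px: float = 0.0  # assumed computed "margin-bottom"
    font_size_px: float = 16.0     # assumed computed "font-size"


def is_ignored(node: Node) -> bool:
    if node.kind in ("comment", "pi"):
        return True
    return node.kind == "element" and node.display == "none"


# "display" values whose trailing whitespace is the empty string.
EMPTY_TRAILING = {
    "inline-block", "inline-table", "none",
    "table-column", "table-column-group",
}


def trailing_whitespace(node: Node) -> str:
    """Return the trailing whitespace for an element node."""
    if node.display == "inline":
        for child in reversed(node.children):
            if is_ignored(child):
                continue
            # The last child that is not ignored decides the result.
            if child.kind == "element":
                return trailing_whitespace(child)
            return ""
        return ""  # no child that is not an ignored node
    if node.display in EMPTY_TRAILING:
        return ""
    if node.display == "table-cell":
        return "\t"
    if node.inner_text == "":
        return ""
    # A bottom margin of at least half the font size counts as a
    # paragraph break and produces a blank line.
    if node.margin_bottom_px >= node.font_size_px / 2:
        return "\n\n"
    return "\n"
```

With these assumptions, a paragraph with a 16px bottom margin and 16px font size is followed by two newlines, while a block with no bottom margin is followed by one.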

References

All references are normative unless marked "Non-normative".