Mike Samuel <msamuel@google.com>, Prateek Saxena <prateeks@eecs.berkeley.edu>
Scripting vulnerabilities plague web applications today. To streamline the output generation from application code, numerous web templating frameworks have recently emerged and are gaining widespread adoption. However, existing web frameworks fall short in providing mechanisms to automatically and context-sensitively sanitize untrusted data.
For example, a naive web template might look like
<div>{$name}</div>
but this template is vulnerable to cross-site scripting (XSS). An attacker who controls the value of name could pass in
<script>document.location = 'http://phishing.com/';</script>
to redirect users to a malicious site, steal the user's credentials or personal data, or initiate a download of malware.
The template author might manually encode name:
<div>{$name |escapeHTML}</div>
making sure that the user sees exactly the value of name as per spec, and defeating this particular attack. A better web templating system might automatically insert the |escape*** directives, relieving the template author of the burden.
This paper argues that correct sanitization is too important to leave to chance and that manual sanitization is an unreasonable burden to place on template authors (and especially maintainers); it defines goals that any automatic approach should satisfy, and introduces an automatic approach that is particularly suitable for bolting onto existing web templating languages.
In particular, we introduce the new notion of "context" type qualifiers to represent the contexts in which untrusted data can be embedded. We propose a new type system that refines the base type system of a web templating language with the context type qualifier. Based on the new type system, we design and develop a context-sensitive auto-sanitization (CSAS) engine which runs during the compilation stage of a web templating framework to add proper sanitization and runtime checks to ensure correct sanitization. We implement our system in Google Closure Templates, a commercially used open-source templating framework that is used in Gmail, Google Docs, and other applications. We evaluate our type system on 1035 real-world Closure templates. We demonstrate that our approach achieves both better security and performance than previous approaches.
This system is in the process of being bolted onto jQuery templates, but that work has not yet been evaluated on production code.
expression: and comment parsing and error-recovery quirks so that our sanitization function definitions survive a worst-case analysis. This paper assumes a basic familiarity with CSS.
text/plain) and produces content in an output language. E.g. the function escapeHTML is an escaper that takes plain text, 'I <3 Ponies', and transforms that to semantically equivalent HTML by turning HTML special characters into entities: 'I &lt;3 Ponies'. (Escapers may, in the process, break hearts.) See also OWASP's definition.
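The gist of an escaper can be sketched in a few lines (a sketch only: escape_html and its substitution table are illustrative, not the production escapeHTML, which handles further corner cases):

```python
# Minimal sketch of an HTML escaper: maps text/plain to semantically
# equivalent HTML by replacing special characters with entities.
# Per-character substitution avoids double-escaping '&'.
HTML_ESCAPES = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
}

def escape_html(plain_text):
    return ''.join(HTML_ESCAPES.get(ch, ch) for ch in plain_text)
```

So escape_html('I <3 Ponies') yields 'I &lt;3 Ponies', which a browser renders as the original plain text.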
javascript:. A filter can ensure that an untrusted value at the beginning of a URL either contains no protocol or contains one in a whitelist (http, https, or mailto) and, if it finds an untrusted value that violates this rule, might return an innocuous value such as '#' which defangs the URL.
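A whitelist filter of that shape might look like the sketch below (filter_url and its whitelist are illustrative; a production filter must also consider entity- and percent-encoded protocol letters):

```python
# Sketch of a URL protocol filter: allow protocol-less and
# whitelisted-protocol URLs through unchanged, and defang the rest.
SAFE_PROTOCOLS = ('http', 'https', 'mailto')
INNOCUOUS_URL = '#'

def filter_url(untrusted_url):
    protocol, colon, _ = untrusted_url.partition(':')
    if not colon:
        return untrusted_url  # no protocol at all
    if any(ch in protocol for ch in '/?#'):
        return untrusted_url  # the ':' is not in protocol position
    if protocol.lower() in SAFE_PROTOCOLS:
        return untrusted_url  # whitelisted protocol
    return INNOCUOUS_URL      # e.g. javascript:..., livescript:...
```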
normalizeURI might make sure that quotes are encoded so that a URI path can be embedded in an HTML attribute unchanged:
'mailto:<Mohammed%20"The%20Greatest"%20Ali>%20ali@gmail.com'
→
'mailto:%3cMohammed%20%22The%20Greatest%22%20Ali%3e%20ali@gmail.com'
and a function that strips tags from valid HTML allows the tagless HTML to be included in an HTML attribute context.
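For instance, a normalizer along these lines (normalize_uri is an illustrative name, and its substitution table is deliberately minimal) reproduces the transformation above:

```python
# Sketch of a URI normalizer: percent-encode the few characters that
# could terminate an HTML attribute or confuse its parsing, while
# leaving existing %-escapes untouched.
UNSAFE_IN_ATTR = {'<': '%3c', '>': '%3e', '"': '%22', "'": '%27'}

def normalize_uri(uri):
    return ''.join(UNSAFE_IN_ATTR.get(ch, ch) for ch in uri)
```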
text/plain) before concatenating it with content in another language such as text/html in the case of XSS. Other examples of quoting confusion include SQL injection, shell injection, and HTTP header splitting.
typeof in C++, C#, and JavaScript; instanceof in Java and JavaScript; Object#instance_of? in Ruby; type in Python; Object.getClass() in Java; and Object.GetType() in C#.
The template below specifies a form whose action depends on two values, $name and $tgt, which may come from untrusted sources. The {if …}…{else}…{/if} branches define a dynamic URL.
<form action="{if $tgt}/{$name}/handle?tgt={$tgt}{else}/{$name}/default{/if}">Hello {$name}…
First, we parse the template to find trusted static content, dynamic "data holes" that may be filled by untrusted data, and flow control constructs: if, for, etc. The solid black portions are the data holes, and the green portions are trusted static content.
<form action="{if $tgt}/███████/handle?tgt=██████{else}/███████/default{/if}">Hello ███████…
Next we do a flow-sensitive analysis, propagating types to determine the context in which each data hole appears.
<form action="{if $tgt}/███████/handle?tgt=██████{else}/███████/default{/if}">Hello ███████…
↑PCDATA | ↑URL start | ↑URL path | ↑URL query | ↑URL path | ↑PCDATA |
Based on those contexts, we determine the type of content that is expected for each hole.
<form action="{if $tgt}/ URL /handle?tgt=Query {else}/ URL /default{/if}">Hello HTML …
Finally we insert calls to sanitizer functions into the template.
<form action="{if $tgt} /{escapeHTML($name)}/handle?tgt={encodeURIComponent($tgt)} {else} /{escapeHTML($name)}/default {/if} ">Hello {escapeHTML($name)}…
That is the gist of the solution, though the above example glosses over issues with re-entrant templates, templates that are invoked in multiple start contexts, and joining branches that end in different contexts; and the exact sanitization functions chosen differ from those shown in this simplified example.
The example only shows HTML and URL encoding, but our solution deals with data holes that occur inside embedded JavaScript and CSS as any solution for AJAX applications must.
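The inference pass over static chunks and holes can be caricatured as follows (a toy: the string-matching "transitions" stand in for the real combined HTML/CSS/JS parser, and the context names are simplified):

```python
# Toy context propagation: walk a template's chunks, tracking a coarse
# context through the static text and assigning a context to each hole.
def propagate(chunks):
    """chunks: list of ('text', s) or ('hole', name) pairs.
    Returns a list of (name, context) pairs, one per hole."""
    context = 'HTML_PCDATA'
    assignments = []
    for kind, value in chunks:
        if kind == 'text':
            # Stand-ins for real parser transitions.
            if 'action="' in value:
                context = 'URL_START'
            elif context.startswith('URL') and '?' in value:
                context = 'URL_QUERY'
            elif context.startswith('URL') and '">' in value:
                context = 'HTML_PCDATA'
        else:
            assignments.append((value, context))
            if context == 'URL_START':
                context = 'URL_PATH'  # a hole filled the start of the path
    return assignments
```

Each assigned context then indexes into a table of sanitizers, as in the example above.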
In this section we present several metrics on which any competing sanitization scheme should be judged, along with a definition of a safe template that can be used to prove or disprove the soundness of a sanitization scheme; we believe this definition is relevant to the security properties that web applications commonly want to enforce.
A sanitization scheme should be judged on several performance metrics:
Run-time analysis overhead (proportional to overall template runtime) often differs substantially by platform. High-quality parser generators exist for C and Java, so the overhead may be much lower there than in the browser, since iterating character by character over a string is slow in JavaScript.
Our proposal has a modest compile-/load-time cost taking slightly less than 1 second to do static inference for 1035 templates comprising 782kB of source code or about 1ms per template. The runtime analysis for our proposal is zero. The runtime sanitization overhead on a benchmark is between 3% and 10% of the total template execution time, and is indistinguishable from the overhead when non-contextual auto-sanitization is used (all data holes sanitized using HTML entity escaping).
Development overhead is hard to measure, but the 1035 templates were migrated by an application group in a matter of weeks, without stopping application development and with little coordination, so the one-time overhead (the overhead to learn the system) is lower than that of learning and adopting a new templating language. Since the system works by inserting function calls, we provided debugging tools that diffed templates before and after inference was run, to show developers what the system was doing and to aid in debugging. Since templates written under any approach need debugging, the continual development overhead can never be zero, but tool support like diffing can make the system transparent and ease debugging.
Finally, once a bug has been identified, we try to make sure there are simple bugfixing recipes.
SanitizedContent wrapper object of the appropriate type as close to where it is sanitized as possible. Ideally, an HTML tag whitelisting sanitizer would return a value of type SanitizedContent.

What kind of changes, if any, do developers have to make to take an existing codebase of templates and have them properly sanitized? For example, adding sanitization functions manually is time-consuming and error-prone. Making sure that all static content is valid XHTML requires repetitive, time-consuming changes, but would not be as error-prone.
Our proposal allows contextual auto-sanitization to be turned on for some templates and not for others; most templating languages allow templates to be composed, i.e. templates can call other templates, and standard practice seems to be to have a few large templates that call out to many smaller templates. Since this can be done per template, a codebase can be migrated piecemeal, starting with complicated templates that have known problems.
Our proposal does not impose an element structure on template boundaries. Many top level templates look like:
{include "common-header.foo"} <!-- Body content --> {include "common-footer.foo"}
where the common header opens elements that are closed in the common footer:
<html><head> <!-- Common style and script definitions --> ... </head><body> <!-- Common menus -->
Approaches that require template code to be well-formed XML, such as XSLT, cannot support this idiom. Our proposal works for templating languages that allow this idiom because we propagate types as they flow across template calls, rather than inferring types of content based on a DOM derived from a template.
If a development team adopts a sanitization scheme, and finds that it does not meet their needs, how easily can they switch it off, and how much of the effort they invested in deploying it can they recover?
Since our solution works by inserting calls to sanitization functions into templates, a development team having second thoughts can simply run the type inference engine to insert the calls, print out the resulting templates to generate a patch to their codebase, and then remove whatever directives turned on auto-sanitization. We argued above that the cost of adoption is low, and most of the work put into verifying that the sanitization functions chosen were reasonable is recoverable.
Security measures tend to be removed from code under maintenance. Imagine a template that is not auto-sanitized:
<div>Your friend, {escapeHTML($name)}, thinks you'll like this.</div>
that is passed a plain text name. While merging two applications, developers add a call to this code, passing in a rich HTML signature that has been proven safe by a tag whitelister, e.g. "Alan <font color=green>Green</font>span".
Eventually, Mr. Greenspan notices that his name is misrendered and files a bug. A developer might check that the rich text signature is sanitized properly before being passed in, but not notice the other caller that doesn't do any sanitization. They resolve the bug by removing the call to escapeHTML, which fixes the bug but opens a vulnerability.
Over-encoding is more likely to be noticed by end-users than XSS vulnerabilities, so a project under maintenance is more likely to lose manual sanitization directives than to gain them.
Our proposal addresses this by introducing sanitized content types as a principled solution to over-encoding problems.
We define a safe template as one that has several properties: the structure preservation property described here, and the code effect and least surprise properties defined in later sections.
Intuitively, this property holds that when a template author writes an HTML tag in a safe templating language, the browser will interpret the corresponding portion of the output as a tag regardless of the values of untrusted data, and similarly for other structures such as attribute boundaries and JS and CSS string boundaries.
This property can be violated in a number of ways. E.g. in the following JavaScript, the author is composing a string that they expect will contain a single top-level bold element surrounded by text.
document.write(greeting + ', <b>' + planet + '</b>!');
and if greeting is "Hello" and planet is "World" then this holds, as the output written is "Hello, <b>World</b>!"; but if greeting is "<script>alert('pwned');//" and planet is "</script>" then this does not hold, since the structure has changed: the <b> should have started a bold element but the browser interprets it as part of a JavaScript comment in "<script>alert('pwned');//, <b></script></b>!".
Lower level encoding attacks, such as UTF-7 attacks, may also violate this property.
More formally, given any template, e.g.
<div id="{$id}" onclick="alert('{$message}')">{$message}</div>
we can derive an innocuous template by replacing every untrusted variable with an innocuous string: a string that is not empty, is not a keyword in any programming language, and does not contain special characters in any of the languages we're dealing with. We choose our innocuous string so that it is not a substring of the concatenation of literal string parts. Using the innocuous string "zzz", an innocuous template derived from the above is:
<div id="zzz" onclick="alert('zzz')">zzz</div>
Parsing this, we can derive a tree structure where each inner node has a type and children, and each leaf has a type and a string value.
Element
╠Name : "div"
╠Attribute
║ ╠Name : "id"
║ ╚Text : "zzz"
╠Attribute
║ ╠Name : "onclick"
║ ╚JsProgram
║   ╚FunctionCall
║     ╠Identifier : "alert"
║     ╚String : "zzz"
╚Text : "zzz"
A template has the structure preservation property when for all possible branch decisions through a template, and for all possible data table inputs, a template either produces no output (fails with an exception) or produces an output that can be parsed to a tree that is structurally the same as that produced by the innocuous template derived from it for the same set of branch decisions.
∀ branch-decisions ∀ data, areEquivalent(
    parse(innocuousTemplate(T)(branch-decisions, data)),
    parse(T(branch-decisions, data)))
where parse parses using a combined HTML/JavaScript/CSS grammar to the tree structure described above, branch-decisions is a path through flow control constructs (the conditions in for loops and if conditions), and areEquivalent is defined thus:
def areEquivalent(innocuous_tree, actual_tree):
  if innocuous_tree.is_leaf:
    # innocuous_string was 'zzz' in the example above.
    if innocuous_string in innocuous_tree.leaf_value:
      # Ignore the contents of actual since it was generated by
      # a hole.  We only care that it does not interfere with
      # the structure in which it was embedded.
      return True
    # Leaves structurally the same.
    # Assumes same node type implies actual is leafy.
    return (innocuous_tree.node_type is actual_tree.node_type
            and innocuous_tree.leaf_value == actual_tree.leaf_value)
  # Require type equivalence for inner nodes.
  if innocuous_tree.node_type is not actual_tree.node_type:
    return False
  # Zip below will silently drop extras.
  if len(innocuous_tree.children) != len(actual_tree.children):
    return False
  # Recurse to children.
  for innocuous_child, actual_child in zip(
      innocuous_tree.children, actual_tree.children):
    if not areEquivalent(innocuous_child, actual_child):
      return False
  return True  # All grounds on which they could be inequivalent disproven.
This definition is not computationally tractable, but it can serve as a basis for correctness proofs; and since, in practice, branch decisions that go through loops more than twice or recurse more than twice can be ignored, we can gain confidence in an implementation by using fuzzers to generate bad data inputs.
This property is essential to capturing developer intent. When the developer writes a tag, the browser should interpret that as a tag, and when the developer writes paired start and end tags, the browser should interpret those as a matched pair. It is also important to applications that want to embed sanitized data while preserving a trusted path since the structure preservation property is a prerequisite for visual containment.
Web clients may specify data values in code (strings, booleans, numbers, JSON), but only code specified by the template author should run as a result of injecting the template output into a page, and all code specified by the template author should run as a result of the same. There are a dizzyingly large number of ways this property can fail to hold for a template. A non-exhaustive sample of ways to cause extra code to run:
- A <script> element.
- An onclick attribute.
- src or href could specify javascript, livescript, etc. as the protocol in myriad ways.
- expression or -moz-binding.
- <object> might load flash cross origin and AllowScriptAccess.
- eval, setTimeout, etc.
There are also many ways to cause security-critical code to not run. In general, it is not wise to rely on JavaScript running in a browser, but many developers, not unreasonably, rely on some code having run if other code is running at a later time. A non-exhaustive sample of ways to stop code running via XSS:
- A <base> element disabling src='ed <script>s with relative URLs.
- A <script> element that fails to parse, e.g. due to codepoints U+2028 or U+2029 in a string body. (A violation of the Structure Preservation Property.)
- A <script> tag that is not interpreted as such. (A violation of the Structure Preservation Property.)
- Object.prototype.toString = function() {throw new Error}
- A <noscript> element around the <head>. (A violation of the Structure Preservation Property.)
Our proposal enforces this property by filtering URLs to prevent any data hole from specifying an exotic protocol, by filtering CSS keywords, and by only allowing data holes in JavaScript contexts to specify simple boolean, numeric, and string values, or complex JSON values which cannot have free variables. We assume that the JavaScript interpreter will work on arbitrarily large inputs. "Defining Code-Injection Attacks" by Ray & Ligatti defines a similar property: a CIAO (code injection attack on outputs) occurs when an interpolation causes the parse tree to include an expression that is not in its normal form, one consequence of which is that it has no free variables.
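The JavaScript-context rule can be approximated with a JSON round-trip (js_value is a hypothetical helper sketched here, not our implementation's API):

```python
import json

def js_value(untrusted):
    """Serialize a value so it can only denote a JS literal, never code
    with free variables or side effects."""
    if not isinstance(untrusted, (bool, int, float, str, list, dict, type(None))):
        raise TypeError('not a JSON-serializable value')
    # json.dumps with the default ensure_ascii=True also escapes the
    # U+2028/U+2029 line terminators that are legal in JSON strings but
    # break JavaScript string bodies.
    out = json.dumps(untrusted)
    # Keep an embedded "</script>" from closing the script element.
    return out.replace('</', '<\\/')
```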
Finally, our escapers are designed to produce output that avoids grammatical hazards such as semicolon insertion, non-ASCII newline characters, and regular-expression/division-operator/line-comment confusion.
Identifying all places in which a URL might appear in HTML (incl. MathML and SVG) is relatively easy compared to CSS. In CSS, it is difficult. For example, in <div style="background: {$bg}">, $bg might specify a URL, a color name, a color value like #000, a function-like color rgb(0,0,0), a keyword value like transparent, or a combination of the above. Given how hard it is to reliably black-list URLs when you know the content is a URL, we took the rather drastic approach of forbidding anything that might specify a colon in CSS data holes.
This seems to affect very little in practice, and we could relax this constraint to allow colons preceded by a safe word like the name of an element, pseudo-element, or innocuous property. Even if we did, it is possible that existing code uses colons in data holes to specify list separators a la semantic HTML, and we would break that use case:
ul.inline li { list-style: none; display: inline }
ul.inline li:before { content: ': ' }  /* ', ' here would give a normal looking list. */
ul.inline li:first-child:before { content: '' }
This property is a prerequisite for many application privacy goals. If a third-party can cause script to run with the privileges of the origin, it can steal user data and phone home. Even if credentials are unavailable to JavaScript (HTTPOnly cookies), scripts with same-origin privileges can screen scrape (using DOM APIs) user names and identifiers and associated page content and phone home.
This property is also a prerequisite for many informed consent goals.
If a third-party script can install onsubmit handlers, it can rewrite form data before it is submitted with the XSRF tokens that are meant to ensure that the data submitted was specified by the user.
The last of the security properties that any auto-sanitization scheme should preserve is the property of least surprise. The authors do not know how to formalize this property.
Developer intuition is important.
A developer (or code reviewer) familiar with HTML, CSS, and
JavaScript; who knows that auto-sanitization is happening should be
able to look at a template and correctly infer what happens to dynamic
values without having to read a complex specification document.
Simple rules-of-thumb should be sufficient to understand the system.
E.g. if a mythical average developer sees
<script>var msg = '{$msg}';</script>
and their intuition is that $msg should be escaped using JavaScript-style \ sequences, and that is sufficient to preserve the other security properties, then that is what the system should do.
Templates should be both easy to write and to code review.
Exceptions to the system should be easily audited.
SQL prepared statements are great, but there's no way to have exceptions to the rule without giving up the whole safety net, so sometimes developers work around them by concatenating strings. It's hard to grep (or craft presubmit triggers) for all the places where concatenated strings are passed to SQL APIs, so it's hard for a more senior developer to find these after the fact and explain how naive developers can achieve their goals working within the system, notice a trend that points to a systemic problem with schemas, or agree that the exception to the rule is warranted and document it for future security auditors.
Our proposal was designed with this goal in mind, but we have not managed to quantify our success. We can note that 1035 templates were converted within a matter of weeks without a flood of questions to the mailing lists we monitor, so we infer that most of the parts of the system that were heavily exercised were non-controversial. Different communities of developers may have different expectations. We worked with a group of developers most of whom knew Java, C++, or both before starting web application development, and among whom a high proportion have at least a bachelor's degree in CS or a related field. They may differ, intuition-wise, from developers who came to web development from a Ruby, Perl, or PHP background.
In this section we introduce a number of alternative proposals and explain why they perform worse on the metrics above. We cite real systems as examples of some of these alternatives. Many of these systems are well-thought-out, reasonable solutions to particular problems their authors faced. We merely argue that they do not extend well to the criteria we outlined above, and explicitly label these sections "strawmen" to clarify the difference between our design criteria and the contexts in which these systems arose. We do claim, though, that any comprehensive solution to XSS, at a tools level, should meet the criteria above.
Manual sanitization is the current state of the art. Developers use a suite of functions, such as OWASP's open-source ESAPI encoders, and every developer must learn when and how to apply them correctly. They must apply sanitizers either before data reaches a template or within the template by inserting function calls into code.
This places a significant burden on developers and does not guarantee any of the security properties listed above. One lapse can undo all the work put into hardening a website because of the all-or-nothing nature of the same-origin policy.
There is a tradeoff between correctness and simplicity of API that works in the attacker's favor. Manual sanitization is particularly error-prone because developers learn the good parts of the languages they work in, but attackers have available to them the bad parts as well. The syntax of HTML, CSS, and JavaScript is much gnarlier than most developers imagine, and it is an unreasonable burden to expect them to learn and remember obscure syntactic corner cases. These corner cases mean that the typical suite of 4-6 escaping functions is the most that many developers can reliably choose from, yet it is insufficient to handle corner cases or nested contexts.
Changes in language syntax or vendor-specific extensions (e.g. XML4J and embedded SVG) may invalidate developers' previously valid assumptions. Code that was safe before may no longer be safe. With an automated system, a security patch and recompile may suffice, but a patch to code that took a team of developers years to write will take a team of developers to fix.
XSS Scanners (e.g. lemon) can mitigate some of manual sanitization's cons (though they work with any of the other solutions here as well to provide defense-in-depth), but there are no good scanners for AJAX applications, and, with manual sanitization, scanners impose a continual burden on developers to respond to the reported errors.
Non-contextual auto-sanitization is a great improvement over manual sanitization. Django templates and others use it.
It works by assuming that every data hole should be sanitized the same way,
usually by HTML entity encoding. As such, it is prone to over-escaping and
mis-escaping.
To understand mis-escaping, consider what happens when the following template is called with ', alert('XSS'), ':
<button onclick="setName('{$name}')">
The template produces
<button onclick="setName('&#39;, alert(&#39;XSS&#39;), &#39;')">
which is exactly the same, to the browser, as
<button onclick="setName('', alert('XSS'), '')">
because the browser HTML-entity-decodes the attribute value before invoking the JavaScript parser on it.
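This is easy to check mechanically with Python's html module (standing in here for the browser's attribute-value decoder):

```python
# The HTML-escaped payload survives entity decoding as the original
# attack string: entity encoding and the browser's attribute-value
# decoding cancel out before the JavaScript parser ever runs.
import html

payload = "', alert('XSS'), '"
attribute_value = "setName('" + html.escape(payload) + "')"
seen_by_js_parser = html.unescape(attribute_value)
# seen_by_js_parser is now "setName('', alert('XSS'), '')"
```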
Non-contextual auto-sanitization cannot preserve the structure preservation property for JavaScript, CSS, or URLs because it is unaware of those languages. It also fails to preserve the code effect property.
Bolting filters on non-contextual auto-sanitization will not help it to preserve the code effect property. It is possible to write bizarre JavaScript that does not even need alphanumerics. Since JavaScript has no regular lexical grammar, regular expressions that are less than draconian are insufficient to filter out attacks.
Non-contextual auto-sanitization, with auditable exceptions like Django's, does preserve the least surprise property in a sense. With very little training, a developer can predict exactly what it will do, and empirically, 74% of the time it does what they want (our system chose some kind of HTML entity encoding for 992 out of 1348 data holes).
Examples of strict structural containment languages are XSLT, GXP, Yesod, and possibly XHP. For all of these, the input is (or is coercible, via fancy tricks, to) a tree structure like XML. So for every data hole, it is obvious to the system which element and attribute context the hole appears in†. A similar structural constraint could be applied in principle to embedded JS, CSS, and URIs.
Strict structural containment is a sound, principled approach to building safe templates and a great choice for anyone planning a new template language, but it cannot be bolted onto existing languages because it requires that every element and attribute start and end in the same template. This assumption is violated by several very common idioms, such as the header-footer idiom above, in ways that often require drastic changes to repair.
Since it cannot be bolted onto existing languages, limiting ourselves to it would doom to insecurity most of the template code existing today. Most project managers who know their teams have trouble writing XSS-free code, know this because they have existing code written in a language that does not have this property.
† - modulo mechanisms like <xsl:element name="...">, which can, in principle, be repaired using equivalence classes of elements and attributes. I.e. one could define an equivalence class of elements all of whose attributes have the same meaning and which have the same content type: (TBODY, THEAD, TFOOT), (OL, UL), (TD, TH), (SPAN, I, B, U), (H1, H2, H3, …), and allow a dynamic element mechanism to switch between element types within the same equivalence class. Similar approaches can allow selecting among equivalent dynamic attribute types: all event handlers are equivalent (modulo perhaps those that imply user interaction for some applications).
Prior to this work, the best auto-sanitization scheme was a runtime scheme.
A runtime contextual auto-sanitizer plugs into a template runtime at a low level. Instead of writing content to an output buffer, the template runtime passes trusted and untrusted chunks to the auto-sanitizer. The template:
<ul>{for $item in $items}<li onclick="alert('{$item}')">{$item}{/for}</ul>
might produce the output on the left, and by propagating context at runtime, infer the context in the middle and choose to apply the escaping directives on the right before writing to the output buffer.
Content | Trusted | Context | Sanitization function |
---|---|---|---|
<ul> | Yes | PCDATA | none |
<li onclick="alert('> | Yes | PCDATA | none |
foo | No | JS string | escapeJSString |
')"> | Yes | JS string | none |
foo | No | PCDATA | escapeHTML |
<li onclick="alert('> | Yes | PCDATA | none |
<script>doEvil()</script> | No | JS string | escapeJSString |
')"> | Yes | JS string | none |
<script>doEvil()</script> | No | PCDATA | escapeHTML |
</ul> | Yes | PCDATA | none |
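A toy version of such a runtime makes the plumbing concrete (the two-state "parser" and its ad hoc transitions below are drastic simplifications of the hand-tuned parsers discussed next):

```python
# Toy runtime contextual auto-sanitizer: the template runtime hands
# (text, trusted) chunks to this function instead of writing them
# directly; untrusted chunks are escaped per the current context.
def escape_js_string(s):
    return s.replace('\\', '\\\\').replace("'", "\\'").replace('<', '\\x3c')

def escape_html(s):
    return s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

def run(chunks):
    out, context = [], 'PCDATA'
    for text, trusted in chunks:
        if trusted:
            out.append(text)
            # Naive context tracking in lieu of a real parser.
            if "alert('" in text:
                context = 'JS_STRING'
            elif "')" in text:
                context = 'PCDATA'
        else:
            out.append(escape_js_string(text) if context == 'JS_STRING'
                       else escape_html(text))
    return ''.join(out)
```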
This works, and with a hand-tuned C parser has been deployed successfully on CTemplates and ClearSilver.
Writing a highly tuned parser in JavaScript, though, is difficult, so implementing this scheme requires a hard trade-off between flexibility and correctness on one hand and download size and speed on the other.
Our proposal is a factor of 4 faster than a runtime scheme implemented in JavaScript and has no download size cost above and beyond the code for the sanitization functions and the calls to them.
Even in languages for which there are efficient parser generators, runtime approaches might suffer performance-wise. The overhead for the static approach is independent of the number of times a loop is re-entered, so templates that take large array inputs might perform worse with even a highly efficient runtime scheme.
Runtime sanitization does do better in at least one area, though: dynamic tag and attribute names pose no problems to a runtime sanitizer. Whereas our scheme has to filter attribute names so that $aname cannot be "onclick" in <button {$aname}=…> (because a static approach must decide whether the beginning of the attribute value is a JavaScript context or some other context), a runtime approach can take into account the actual value of $aname. This is not a common problem, and our approach does handle many dynamic attribute situations, including <button on{$handlerType}=…>.
We know of no purely static approaches, though they are possible. A purely static approach is one that, like our proposal, infers contexts at compile or load time, but does not take into account the runtime type of the values that fill the data holes.
This approach has problems with over-escaping. Existing systems often use a mix of sanitization in-template and sanitization outside the template in the front-end code that calls the template.
Our solution takes into account the runtime type of the values that fill a hole. If the runtime type marks the value as a known-safe string of HTML, then a sanitization function can choose not to re-escape, and instead normalize or do nothing.
See caveats for other problems that are equally applicable to purely static systems and to our proposal.
This section is only relevant to implementors, testers, and others who want to understand the implementation. Everyone else, including web application developers, can ignore it.
At a high level, the type system defines four things, which are expanded upon below:

1. An initial context: `HTML_PCDATA`.
2. A context propagation operator: `(context * string) → context`.
3. A sanitization function chooser: `context → ((α → string) * context)`. If data holes have statically available type info, then the type could be taken into account: `(context * type) → ((α → string) * context)`.
4. A context join operator: `context list → context`. This is used with `{if}` by joining the context at the end of the then-branch with the context at the end of the else-branch. It is also used with loops, where (unless proven otherwise) we have to join the context at the start (loop never entered) with a context once through, with a steady-state context for many repetitions.

By contrast, the runtime auto-sanitization scheme described in strawman III has the same initial context, the same context propagation operator, no context join operator, and uses a slightly differently shaped sanitization function chooser: `context → (α → (string * context))`.
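A minimal sketch of these four pieces, with a toy two-state context standing in for the much richer real context, might look like the following; the `Context` values and propagation rules here are illustrative assumptions, not the production implementation:

```python
from enum import Enum
from typing import Callable, List, Tuple

class Context(Enum):
    HTML_PCDATA = "HTML_PCDATA"
    JS = "JS"
    ERROR = "ERROR"  # stands in for an ambiguous/failed join

# 1. The initial context for a public template.
START_CONTEXT = Context.HTML_PCDATA

# 2. Context propagation: (context * string) -> context.
def propagate(context: Context, literal_text: str) -> Context:
    # Toy rule: a <script> tag switches into a JS context and back.
    if context is Context.HTML_PCDATA and "<script>" in literal_text:
        return Context.JS
    if context is Context.JS and "</script>" in literal_text:
        return Context.HTML_PCDATA
    return context

# 3. Sanitization function chooser: context -> ((alpha -> string) * context).
def choose_sanitizer(context: Context) -> Tuple[Callable[[object], str], Context]:
    if context is Context.HTML_PCDATA:
        return (lambda v: str(v).replace("&", "&amp;").replace("<", "&lt;"),
                context)
    if context is Context.JS:
        return (lambda v: repr(str(v)), context)
    raise ValueError("no sanitizer for %s" % context)

# 4. Context join: context list -> context.
def context_join(contexts: List[Context]) -> Context:
    distinct = set(contexts)
    return distinct.pop() if len(distinct) == 1 else Context.ERROR
```

The `ERROR` result from the join mirrors the "fail with an error message due to ambiguity from context joining" behavior described later.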
A context captures the state of the parser in a combined HTML/CSS/JS lexical grammar. It is composed of a number of fields which pack into 2 bytes with room to spare:

- An HTML state (are we between `<` and `>`?), which keeps track of whether the tag body is PCDATA, RCDATA, or CDATA; and once in an RCDATA or CDATA tag body, is used to keep track of the expected end tag, e.g. inside a `<script>` body we have to find a `</script>` tag, but should ignore any apparent `</style>` tags.
- An attribute type, distinguishing script handler attributes (`onclick`, etc.), `style` attributes, URL attributes (`href`, etc.), and others.
- A JS state used to decide what to do with a `/` that does not start a comment: enter a regular expression literal, or a division operator, or fail with an error message due to ambiguity from context joining.

Contexts support two operators: join and ε-commit.
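The claim that these fields pack into 2 bytes can be illustrated with a hypothetical bitfield layout; the field names, value sets, and bit widths below are guesses for illustration, not the real encoding:

```python
# Hypothetical field layout packing a context into under 16 bits.
HTML_STATES = ["HTML_PCDATA", "HTML_TAG_NAME", "HTML_TAG",
               "HTML_BEFORE_ATTRIBUTE_VALUE", "HTML_ATTRIBUTE_VALUE",
               "HTML_RCDATA", "HTML_CDATA"]                            # 4 bits
ELEMENT_TYPES = ["NONE", "SCRIPT", "STYLE", "TITLE_TEXTAREA"]          # 2 bits
ATTR_TYPES = ["NONE", "SCRIPT", "STYLE", "URI", "PLAIN"]               # 3 bits
JS_SLASH = ["UNKNOWN", "REGEX", "DIV_OP"]                              # 2 bits
DELIMS = ["NONE", "DOUBLE_QUOTE", "SINGLE_QUOTE", "SPACE_OR_TAG_END"]  # 2 bits

def pack(html_state, element, attr, js_slash, delim):
    bits = HTML_STATES.index(html_state)
    bits |= ELEMENT_TYPES.index(element) << 4
    bits |= ATTR_TYPES.index(attr) << 6
    bits |= JS_SLASH.index(js_slash) << 9
    bits |= DELIMS.index(delim) << 11
    return bits  # 13 bits used: 2 bytes with room to spare

def unpack(bits):
    return (HTML_STATES[bits & 0xF],
            ELEMENT_TYPES[(bits >> 4) & 0x3],
            ATTR_TYPES[(bits >> 6) & 0x7],
            JS_SLASH[(bits >> 9) & 0x3],
            DELIMS[(bits >> 11) & 0x3])
```

A compact integer encoding like this makes contexts cheap to compare, hash, and store in the inference maps used during propagation.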
The join operator produces the context at the end of a condition, loop, switch, or other flow control construct. This sometimes introduces an ambiguity. In the template:

`<form action="{if $tgt}/{$name}/handle?tgt={$tgt}{else}/{$name}/default{/if}↑">Hello {$name}…`

one branch ends in the query portion of a URI, and one ends outside it. If there were a data hole at the ↑, then we would not be able to determine an appropriate sanitization function for it†. So context joining often introduces just enough ambiguity, by using do-not-know values for fields, and in the common case we later reach a point where we can discard that info. In the URI case, if there were a `#` character at the ↑ we could reliably transition into a URI fragment context, and in any case, the end of the attribute moots the question.
The ε-commit operator is used when we see a data hole. In some cases, we introduce parser states to delay decision making. In the template fragment `<a href=`, we could see a quote character next, or a space, or the start of an unquoted value, or the end of the tag (implying an empty href), or a data hole specifying the start of an unquoted attribute value. If the next construct is a data hole, we need to commit to it being an unquoted attribute. The ε-commit operator in this case goes from an HTML_BEFORE_ATTRIBUTE_VALUE state with an attribute end delimiter of NONE to a state appropriate to the value type (e.g. JS for an `onclick` attribute) with an attribute end delimiter of SPACE_OR_TAG_END.
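The transition just described can be sketched as a small function; the state names follow the text, but the transition table itself is a simplified assumption:

```python
# Sketch of the epsilon-commit step: on seeing a data hole right after
# `=`, commit to an unquoted attribute value whose inner context depends
# on the attribute type.  Returns (state, attribute_end_delimiter).
def epsilon_commit(state, attr_type, delim):
    if state == "HTML_BEFORE_ATTRIBUTE_VALUE" and delim == "NONE":
        inner = {"SCRIPT": "JS", "STYLE": "CSS", "URI": "URI"}.get(
            attr_type, "HTML_ATTRIBUTE_VALUE")
        return (inner, "SPACE_OR_TAG_END")
    return (state, delim)  # no commit needed elsewhere in this sketch
```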
The precise details of both these operators were determined empirically to come up with the simplest semantics that handles cases found in real code that web developers do not consider to be badly written or confusing.
† — This could be fixed by migrating the problematic data hole and the code leading up to it into each branch, but this is tricky to do across template boundaries and has not proven to be necessary for the codebase we migrated.
The context propagation algorithm uses a combined HTML/CSS and JS lexical grammar described below.
`new SanitizedHtml('<b>Hello, World!</b>')` → `<b>Hello, World!</b>`

The first case is handled by encoding all PCDATA special characters (<, >, and &) as HTML entities (`&lt;`, `&gt;`, and `&amp;`). Other code points may be escaped, but need not be.

In the second case, the safe HTML is emitted as-is. It must be a mixed group of complete tags and text nodes such that there exists a safe template that could have produced it starting from an HTML PCDATA context and ending in the same context, or there exists a safe HTML sanitizer that could have produced it.
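A sketch of the two cases in Python, assuming a `SanitizedHtml` wrapper like the one in the example above:

```python
class SanitizedHtml:
    """Wraps a chunk of HTML its creator asserts is safe."""
    def __init__(self, html):
        self.html = html

def escape_html(value):
    # Second case: known-safe HTML is emitted as-is.
    if isinstance(value, SanitizedHtml):
        return value.html
    # First case: plain text has its PCDATA specials entity-encoded.
    # Ampersand first, so the other entities are not double-escaped.
    return (str(value).replace("&", "&amp;")
                      .replace("<", "&lt;")
                      .replace(">", "&gt;"))
```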
`new SanitizedHtml('<b>Hello, World!</b>')` → `&lt;b&gt;Hello, World!&lt;/b&gt;`

The first case is handled by encoding all RCDATA special characters (<, >, and &) as HTML entities (`&lt;`, `&gt;`, and `&amp;`). Other code points may be escaped, but need not be.
In the second case, the safe HTML is normalized: all the HTML special characters are escaped except for ampersands (&), which are left as-is. Since all RCDATA end tags contain `<`, and `<` is escaped to a string that does not contain it, and no other code unit is escaped to a string that contains it, no safe HTML chunk can cause premature ending of an RCDATA tag. This means that the odd but valid Soy template

`<textarea>{$foo}<script>alert('Keystone kop');</script></textarea>`

will not violate the structure security goal or the unauthored code security goal even when a chunk of safe HTML contains an RCDATA end tag like `</textarea>`.
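The normalization step for safe HTML in RCDATA can be sketched as follows; the escape set here is reduced to just the characters the argument above depends on:

```python
def normalize_html_rcdata(safe_html):
    # Escape < and > but leave ampersands alone, so existing entities
    # survive.  Every end tag contains `<`, and `<` never survives
    # unescaped, so no `</textarea>` (or any end tag) can slip through.
    return safe_html.replace("<", "&lt;").replace(">", "&gt;")
```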
`<h{$headerLevel}>` can be used to generate `<h1>`, `<h2>`, …

To avoid problems where a dynamic tag name might be combined with a static part to form `script`, `style`, or another CDATA or RCDATA tag, we impose the following restrictions:
TODO: scheme to avoid concatenation from producing `on*`, `style`, `href`, etc.
If the result is known safe HTML, strips tags so that the Soy template

`<abbr title="{$longDesc}">{$shortDesc}</abbr>`

works even when both `$longDesc` and `$shortDesc` are snippets of sanitized HTML.

`new SanitizedHtml('<b>Hello, World!</b>')` → `Hello, World!`
The first case is handled by encoding all HTML special characters, including quotes (<, >, &, ", ', and =), as HTML entities (`&lt;`, `&gt;`, `&amp;`, `&quot;`, `&#39;`, and `&#61;`).

The second case is handled by stripping HTML tags and comments from the safe HTML, and then normalizing it by applying the same escaping scheme as for the first case, but without encoding ampersands (&).

For both cases, when the HTML attribute is not quoted, we additionally have to escape all code points that would signal the end of an HTML attribute, including a number of space and control characters. This set was derived empirically, and includes the backtick (`) which can be used as a quoting character on some versions of IE.
We escape dynamic JS strings using the following table:

| Codepoint | Glyph | Escape |
|---|---|---|
| U+000A | (line feed) | \n |
| U+000D | (carriage return) | \r |
| U+0022 | " | \u0022 |
| U+0027 | ' | \u0027 |
| U+002F | / | \/ |
| U+003C | < | \u003C |
| U+003E | > | \u003E |
| U+005C | \ | \\ |
| U+2028 | (line separator) | \u2028 |
| U+2029 | (paragraph separator) | \u2029 |
These escapes prevent premature string closing, since all JS quote characters are encoded to a sequence that does not contain a quote character, and no other codepoint is encoded to a sequence containing a quote character. They prevent additional JS syntax errors by properly encoding all JS newline codepoints. They preserve structure by encoding any sequences that would end a CDATA tag, CDATA section, escaping text span, or quoted HTML attribute value. The output can be embedded in an HTML attribute value by additionally escaping & to \u0026. In the case of unquoted HTML attribute values, just escaping ampersands is not sufficient; the output needs to be HTML entity escaped per DynamicAttrValue.
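The table translates directly into a lookup-based escaper; in this sketch, code points outside the table pass through unchanged (the production escaper handles more cases):

```python
# The JS string escape table above, as a Python dict.
JS_STRING_ESCAPES = {
    "\n": "\\n", "\r": "\\r",
    '"': "\\u0022", "'": "\\u0027",
    "/": "\\/",
    "<": "\\u003C", ">": "\\u003E",
    "\\": "\\\\",
    "\u2028": "\\u2028", "\u2029": "\\u2029",
}

def escape_js_string(s):
    return "".join(JS_STRING_ESCAPES.get(ch, ch) for ch in s)
```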
Like DynamicJsString, but additionally escapes characters special in regexp like ? and *.
Putting spaces around non-string values makes sure that they will be separate tokens but will not introduce a function call, as in the Soy template

```
var f = function () {}  // Missing semicolon.
{$myBoolean} && sideEffect();
```

where, due to semicolon insertion, adding parentheses instead would cause the template to produce the equivalent of `var f = ((function () {})(false)) && sideEffect();` given `{ myBoolean: false }`.
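A sketch of that behavior; the quoting branch is simplified to the two escapes needed for the example:

```python
def escape_js_value(value):
    # Booleans and numbers are wrapped in spaces, never parenthesized,
    # to avoid the semicolon-insertion pitfall described above.
    if isinstance(value, bool):  # bool first: bool is a subclass of int
        return " %s " % ("true" if value else "false")
    if isinstance(value, (int, float)):
        return " %s " % value
    # Everything else is quoted and string-escaped (simplified here).
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\u0027")
    return "'%s'" % escaped
```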
We encode all CSS special characters using CSS hex escaping. CSS hex escaping allows an escape to be terminated by an optional space or tab character, so that an escaped character may be followed by a literal, unescaped hex digit without ambiguity. We always emit the following space.

We aggressively encode all CSS special characters to prevent unspecified CSS error recovery from restarting parsing inside quoted strings.
> 9.2.1. Error conditions
>
> In general, this document does not specify error handling behavior for user agents (e.g., how they behave when they cannot find a resource designated by a URI).
> However, user agents must observe the rules for handling parsing errors.
> Since user agents may vary in how they handle error conditions, authors and users must not rely on specific error recovery behavior.
We also escape both angle brackets (< and >), one of which is already a CSS special, so that HTML escaping text spans, CDATA sections, CDATA end tags, etc. cannot be introduced into the middle of CSS strings.
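CSS hex escaping with the always-emitted trailing space can be sketched like this; the particular set of specials below is an assumption for illustration:

```python
# Characters treated as CSS specials in this sketch; the real set is
# derived from the CSS grammar and the HTML-embedding concerns above.
CSS_SPECIALS = set("\"'\\<>&{};:()@/=*")

def escape_css_string(s):
    # "\XX " hex escape with a trailing space, so a literal hex digit
    # after the escape is never absorbed into it.
    return "".join(
        "\\%X " % ord(ch) if ch in CSS_SPECIALS or ord(ch) < 0x20 else ch
        for ch in s)
```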
- `color: #{$hashColor}`
- `color: {$colorName}`
- `border-{$rtlLeft}: …  /* left for English, right for Arabic */`
- `div.{$className} { … }`
- `width: {$width}{$widthUnits}`
TODO: explain the allowed set and its derivation.
Whitelists the protocol, if present, to prevent code execution via `javascript:…`, and normalizes the URI (encoding all unencoded HTML special characters, quotes, spaces, and parentheses) so it can be embedded, e.g. `"` → `%22`. URI normalization percent escapes all codepoints escaped by DynamicQueryPart except for the percent character (%).
TODO: Explain the filter details and their derivation.
Encodes all characters that are special or disallowed in a URI. We encode all codepoints encoded by `encodeURIComponent`, making the same assumption that the URL is UTF-8 encoded. Beyond `encodeURIComponent`, we additionally encode single quotes (') and parentheses (( and )) so that the result can be safely embedded in single-quoted HTML attributes and in single-quoted and unquoted CSS `url(…)` constructs. Note that applying an extra level of CSS escaping using `\27`-style escapes is not an option, since IE (for interoperability with DOS file paths?) does not interpret `\` as the beginning of an escape when it appears inside a `url(…)`.
Each of these characters is significant in a URI as specified in RFC 3986:

> 2.2 Reserved Characters
>
> sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

so escaping them is technically not semantics preserving, but encoding them is safe for all schemes that commonly appear in HTML because those codepoints only appear in the obsolete mark productions:

> D.2 Modifications
>
> The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of [RFC2234]. This change required all rule names that formerly included underscore characters to be renamed with a dash instead. In addition, a number of syntax rules have been eliminated or simplified to make the overall grammar more comprehensible. Specifications that refer to the obsolete grammar rules may be understood by replacing those rules according to the following table: …
>
> mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")"
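The extra layer over `encodeURIComponent` can be sketched in Python, with `urllib.parse.quote` standing in for `encodeURIComponent`; the safe set below mirrors `encodeURIComponent`'s unescaped punctuation minus the single quote and parentheses:

```python
from urllib.parse import quote

def escape_uri_component(s):
    # encodeURIComponent leaves A-Za-z0-9 - _ . ! ~ * ' ( ) unescaped;
    # we shrink that set by ' ( ) as described above.  quote() percent-
    # encodes assuming UTF-8, matching the assumption in the text.
    return quote(s, safe="-_.!~*")
```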
Normalizes the URI in the same way as URI normalization, so that an already encoded path or fragment can be emitted inline, but does not filter, since a protocol part cannot appear here.
The context propagation algorithm uniquely determines the context at every data hole so that a later pass may choose a sanitization function for each hole. The algorithm operates at two levels: one on the graph of templates, and another individually within templates. The first deals with identifying the minimal set of templates that need to be processed, and might clone templates to deal with templates that are called in multiple different contexts.
The template context propagation algorithm uses an inference object, implemented as a set of nested maps and a pointer to a parent inference object. This allows us to speculatively type a template sub-graph and, once we have a consistent view of types, collapse our conclusions into the parent by simply copying maps from children to parent. The maps include one from data holes to their start contexts, and one from templates to end contexts, used to type calls.
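A minimal sketch of such an inference object, assuming the map and method names used by the pseudocode in this section:

```python
class Inferences:
    def __init__(self, parent=None):
        self.parent = parent
        # Speculative conclusions live here until committed.
        self.template_end_contexts = {}
        self.context_for_data_hole = {}

    def get_end_context(self, template):
        # Lookups fall back to the parent inference object.
        if template in self.template_end_contexts:
            return self.template_end_contexts[template]
        return self.parent.get_end_context(template) if self.parent else None

    def commit_into_parent(self):
        # A consistent speculative typing is folded into the parent by
        # simply copying the child maps up.
        self.parent.template_end_contexts.update(self.template_end_contexts)
        self.parent.context_for_data_hole.update(self.context_for_data_hole)
```

Because an uncommitted child simply gets garbage-collected, a failed speculative typing leaves the parent untouched.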
```
def autosanitize(templates):
  inferences = Inferences()
  for template in templates:
    if inferences.getEndContext(template) is not None:
      continue  # already done
    if template.is_public() or template.is_contextually_autosanitized():
      # By exploring the call graph from only public templates, ones that
      # can be invoked by front-end code, we do not trigger error checks
      # for parts of the code-base that don't yet use contextual
      # auto-sanitization, easing migration.
      compute_end_context(template, inferences, start_context=HTML_PCDATA)  # †
  return inferences
```
That algorithm delegates all the hard work to another algorithm below that examines the template graph reachable from one particular top-level template.
```
def compute_end_context(template, inferences, start_context):
  # We need to choose an end context before typing the body to avoid
  # infinite regression for recursive templates.
  # Start with the optimistic assumption that the template ends in the
  # same context in which it starts.
  # Empirically, less than 0.2% of templates in our sample violate this
  # assumption.  The ones that do tend to be some of the gnarliest code
  # that template authors would rather not refactor.
  optimistic_assumption_1 = Inferences(parent=inferences)
  optimistic_assumption_1.template_end_contexts[template] = start_context
  end_context = propagate_context(
      template.body, start_context, optimistic_assumption_1)
  if start_context == end_context:
    # Our optimistic assumption was warranted.
    optimistic_assumption_1.commit_into_parent()
    return end_context
  # Otherwise, assume that the end_context above is the end context and
  # check that we have reached a fixed point.
  optimistic_assumption_2 = Inferences(parent=inferences)
  optimistic_assumption_2.template_end_contexts[template] = end_context
  end_context_fixed_point = propagate_context(
      template.body, start_context, optimistic_assumption_2)
  if end_context_fixed_point == end_context:
    # We have a fixed point.  Phew!
    optimistic_assumption_2.commit_into_parent()
    return end_context_fixed_point
  # We could try other strategies to generate optimistic assumptions, but
  # we have not seen a need in real template code.
  raise Error(...)
```
Thus far, we have done nothing that is particular to the syntax of the templating language itself. Different languages have different semantics around parameter passing, and provide different flow control constructs. The algorithm below is an example for a simple template language that provides calls, conditions, chunks of static template text, and expression interpolations which fill data holes. On a call, it may recurse to the compute-end-context algorithm above, which is how we lazily explore the portion of the template call graph needed.
```
def propagate_context(parse_tree_nodes, context, inferences):
  for parse_tree_node in parse_tree_nodes:
    if is_safe_text_node(parse_tree_node):
      context = apply_html_grammar(parse_tree_node.safe_text, context)
    elif is_data_hole(parse_tree_node):
      context = epsilon_commit(context)  # see definition above
      inferences.context_for_data_hole[parse_tree_node] = context
      context = …  # compute context after hole.
    elif is_conditional(parse_tree_node):
      if_context = propagate_context(
          parse_tree_node.if_branch, context, inferences)
      else_context = propagate_context(
          parse_tree_node.else_branch, context, inferences)
      context = context_join(if_context, else_context)
    elif is_call_node(parse_tree_node):
      output_context = None
      # possible_callees comes up with the templates this might be calling,
      # and may clone templates if they are called in multiple different
      # contexts.  Most template languages have static call graphs, so in
      # practice there is exactly one possible callee.
      for possible_callee in possible_callees_of(parse_tree_node, context):
        if possible_callee not in inferences.template_end_contexts:
          context_after_call = compute_end_context(
              possible_callee, inferences, context)
        else:
          context_after_call = inferences.template_end_contexts[possible_callee]
        if output_context is None:
          output_context = context_after_call
        else:
          # Since 99% of templates end in their start context, in practice
          # this join does little.
          output_context = context_join(output_context, context_after_call)
      context = output_context
  return context
```
† — We make the simplifying assumption that the start context for all public templates is HTML_PCDATA. Some templating languages may be used in different contexts, and so this assumption might not prove valid. We could choose the starting context for public templates based on some kind of annotation or naming convention particular to the templating language.
We define a suite of sanitization functions. The table below describes them briefly and the contexts in which they are used. There are significantly more of them than in most manual escaping schemes. As noted above, most developers who don't work on parsers for HTML/CSS/JS have a simplified mental model of the grammar, which makes it difficult to choose between this many options. We have many sanitization functions because we want to minimize template output size, and thus network latency; having more sanitization functions lets us avoid escaping common characters like spaces when that is safe. The naming convention for sanitization functions reflects the escaper, filter, and normalizer definitions from the glossary. By convention, sanitization functions are split into broad groups: escaping functions transform an input language (usually plain text) to the output language; filters either pass a string in any input language through unchanged or replace it with an innocuous string; and normalizers transform a string in the input language to an output in the same language that is easier to embed in other languages.
| Function | Description |
|---|---|
| escapeHTML | HTML entity escapes plain text, and allows pre-sanitized HTML content through unchanged. |
| normalizeHTML | Normalizes HTML. Same as escapeHTML, but does not encode ampersands. |
| {escape,normalize}HTMLRcdata | Like escapeHTML but does not exempt pre-sanitized content, since RCDATA (`<title>` and `<textarea>`) can't contain tags. |
| {escape,normalize}HTMLAttribute | Like escapeHTML but strips tags from pre-sanitized content. |
| filterHtmlElementName | Rejects any invalid element name or non-PCDATA element. |
| filterHtmlAttribName | Rejects any invalid attribute name or attribute name that has JS, CSS, or URI content. |
| {escape,normalize}URI | Percent-encodes (assuming UTF-8) URI, HTML, JS, and CSS special characters to allow safe embedding. This means encoding parentheses and single quotes, which should not be normalized according to RFC 3986 and is not valid for all non-hierarchical URI schemes; but the only productions using single quotes or parentheses are obsolete mark productions, and normalizing these characters is essential to safely embedding URIs in unquoted CSS `url(…)` and to making sure that CSS error recovery mode doesn't jump into the middle of a quoted string. |
| filterNormalizeUri | Like normalizeUri but rejects any input that has a protocol other than `http`, `https`, or `mailto`. |
| {escape,normalize}JSStringChars | Uses `\uABCD`-style escapes for code units special in HTML, JS, or conditional compilation. |
| {escape,normalize}JSRegexChars | Like {escape,normalize}JSStringChars but also escapes regular expression specials such as `$`. |
| {escape,normalize}JSValue | Encodes booleans and numbers wrapped in spaces; otherwise quotes and escapes. |
| escapeCSSStringChars | Uses `\ABCD`-style escapes to escape HTML and CSS special characters. |
| filterCssIdentOrValue | Allows classes, ids, property-name parts for bidi, CSS keyword values, colors, and quantities. |
| noAutoescape | Passes its input through unchanged. This is an auditable exception to auto-sanitization. |
Sanitized content allows template users to pre-sanitize some content and to pass through approved structured content. `new SanitizedContent('<b>Hello, World!</b>')` specifies a chunk of HTML that its creator asserts is safe to embed in HTML PCDATA.

It is possible for misuse of this feature to violate all the safety properties contextual auto-sanitization provides. We assert that allowing it makes it easier to migrate code that has no XSS safety net to a better place, and that it satisfies some compelling use cases, including HTML translated into foreign languages by trusted translators, and HTML from tag whitelisters, wiki-text-to-HTML converters, and rich text editors. But it needs to be used carefully. Developers should:
As noted above (in the runtime contextual auto-sanitization strawman), static approaches (including ours) cannot handle all possible uses of dynamic attribute and element names. These seem rare in real code and relatively easy to fix, but if necessary, a hybrid runtime/static approach could address this problem.
Static approaches get into corner cases around zero-length untrusted values. For example, to preserve the code effect property, we need to make sure that no untrusted value specifies a `javascript:` or similar URL protocol. In template code like `<img src="{$x}{$y}">` we might naively decide that it is sufficient to filter `$x` to make sure that it specifies no protocol or an approved one. But if `$x` is the empty string, then `$y` might still specify a dangerous protocol. Alternatively, `$x` might specify `"javascript"` and `$y` start with a colon. This hole can be closed in a number of ways, e.g. by percent-encoding so that `java%73cript:alert(1337)` is not a dangerous URL.

Similar problems arise with JavaScript regular expressions: in `var myPattern = /{$x}/` an empty `$x` would turn the regular expression literal into a line comment, and there are similar special-case fixes (`/(?:)/` is not a comment).

But a general solution to empty strings would be a source of considerable complexity. Simply making sanitizer functions variadic (`{$x}{$y}` → `{filterNormalizeUri($x, $y)}`) will not suffice, because the two interpolations might cross template boundaries.
Our JavaScript parser is unsound. JavaScript does not have a regular lexical grammar (even ignoring conditional compilation) because of the way it specifies whether a `/` starts a regular expression or a division operator. We use a scheme based on a draft JavaScript 1.9 grammar devised by Waldemar Horwat that makes that decision based on the last non-comment token. This works well for all the code we've seen that people actually write, and makes our approach feasible, but there is a known case where it fails: `x++ /a/i` vs `x = ++/a/i`. The second code snippet, while nonsensical, is valid JavaScript that our scheme fails to handle correctly.
Our parser does not currently recognize HTML5 escaping text spans, the regions inside `<script>` and `<style>` bodies delimited by `<!--` and `-->` that suppress end-tag processing. This can be fixed if a codebase turns out to use them. Our sanitization function choices are designed not to produce content containing escaping text span boundaries.

Our parser does not descend into HTML, CSS, or JS in `data:` URLs. We could, but we have not encountered the need in existing code.
We studied 1035 templates that were migrated from an existing codebase to use contextually sanitized templates. Most of the templates were relatively small, but they totaled 21098 LOC and 783 kB. The compilation load-time cost for these 1035 templates was 998,339,279 ns (about 1 s) on a platform with 2 GB of RAM and an Intel 2.6 GHz dual-core processor running Linux 2.6.31.
LOC | # templates |
---|---|
1- 18 | ######################################## (685) |
19- 36 | ############ (210) |
37- 55 | #### (78) |
56- 73 | # (33) |
74- 91 | (10) |
92- 110 | (7) |
111- 128 | (4) |
129- 147 | (3) |
148- 220 | (1) x 4 |
221- 294 | (0) x 4 |
295- 312 | (1) |
Most of the sanitization functions chosen were plain text→HTML. Non-contextual auto-sanitization is correct 63% of the time, assuming the auto-sanitizer is sufficient in the Html, HtmlAttribute, and HtmlRcdata contexts. If values were aggressively filtered to prevent dangerous URLs from appearing in the template input, then non-contextual auto-sanitization would be sufficient in 77% of cases. The rates might be higher for a codebase written for non-contextual sanitization by developers aware of its limitations.
| Directive | Count |
|---|---|
| escapeHtml | 602 |
| escapeHtmlAttribute | 380 |
| filterNormalizeUri, escapeHtmlAttribute | 231 |
| escapeJsValue | 39 |
| filterCssValue | 33 |
| escapeJsString | 27 |
| escapeUri | 15 |
| escapeHtmlRcdata | 10 |
| escapeHtmlAttributeNospace | 7 |
| filterHtmlIdent | 3 |
| filterNormalizeUri | 1 |
268 out of 1348 interpolation sites require runtime filtering (19.9%), mostly filterNormalizeUri.
The benchmark runs over a large template with dummy data that is meant to be representative of the application using it. The benchmarks range from 15.2 to 16.8 ms and the standard deviation is roughly 0.6 ms, which puts the runtime cost of the sanitization functions in the noise.
| Scenario | 50% scenario time | σ | Trials |
|---|---|---|---|
| No sanitization | 16,709,334.99 ns | 615,548.54 ns | 10 |
| Non-contextual auto-sanitization | 16,835,324.39 ns | 6,030,836.03 ns | 10 |
| Full contextual auto-sanitization | 15,227,861.39 ns | 616,193.00 ns | 10 |
In JavaScript, a state-machine-based runtime contextual auto-sanitization approach shows a 3–4× slowdown over string concatenation.
# rows | string += | Array.join | open(Template(…)) | DOM | render time |
---|---|---|---|---|---|
1000 | 54 ms | 68 ms | 204 ms | 508 ms | 586 ms |
5000 | 267 ms | 332 ms | 1159 ms | 2528 ms | 1458 ms |
We ran the same benchmark against a runtime contextual auto-sanitizer we wrote for JavaScript. The "noEscape" case simply appends all the strings to a buffer; it does no context inference. The "parseOnly" case appends to a buffer and does context inference, but no escaping. The "dynEscape" case does context propagation and chooses one of three escaping methods by looking at the context from the parser. The cost of applying the escaping directive is about the same as a string copy, and the cost of parsing and propagating context at runtime is about 6 times that cost. This benchmark is a good comparison for templates where the logic that computes the values filling data holes is simple, so the cost of executing the template should approach that of string concatenation.
| For 1000 runs | noEscape | parseOnly | dynEscape |
|---|---|---|---|
| Time (ratio vs. noEscape) | 491,316,000 ns (1.0) | 2,979,672,000 ns (6.1) | 3,531,971,000 ns (7.2) |