Mike Samuel <msamuel@google.com>, Prateek Saxena <prateeks@eecs.berkeley.edu>
Scripting vulnerabilities plague web applications today. To streamline the output generation from application code, numerous web templating frameworks have recently emerged and are gaining widespread adoption. However, existing web frameworks fall short in providing mechanisms to automatically and context-sensitively sanitize untrusted data.
For example, a naive web template might look like
<div>{$name}</div>
but this template is vulnerable to cross-site scripting (XSS). An attacker who controls the value of name could pass in
<script>document.location = 'http://phishing.com/';</script>
to redirect users to a malicious site, steal the user's credentials or personal data, or initiate a download of malware.
The template author might manually encode name:
<div>{$name |escapeHTML}</div>
making sure that the user sees exactly the value of name as per spec, and defeating this particular attack. A better web templating system might automatically insert the |escape*** directives, relieving the template author of the burden.
This paper argues that correct sanitization is too important to leave to chance and that manual sanitization is an unreasonable burden to place on template authors (and especially maintainers); it defines goals that any automatic approach should satisfy, and introduces an automatic approach that is particularly suitable for bolting onto existing web templating languages.
In particular, we introduce the new notion of "context" type qualifiers to represent the contexts in which untrusted data can be embedded. We propose a new type system that refines the base type system of a web templating language with the context type qualifier. Based on the new type system, we design and develop a context-sensitive auto-sanitization (CSAS) engine which runs during the compilation stage of a web templating framework to add proper sanitization and runtime checks to ensure correct sanitization. We implement our system in Google Closure Templates, a commercially used open-source templating framework that is used in Gmail, Google Docs, and other applications. We evaluate our type system on 1035 real-world Closure templates. We demonstrate that our approach achieves both better security and performance than previous approaches.
This system is in the process of being bolted onto jQuery templates, but that work has not yet been evaluated on production code.
expression: and comment parsing and error-recovery quirks so that our sanitization function definitions survive a worst-case analysis. This paper assumes a basic familiarity with CSS.
text/plain) and produces content in an output language. E.g. the function escapeHTML is an escaper that takes plain text, 'I <3 Ponies', and transforms that to semantically equivalent HTML by turning HTML special characters into entities: 'I &lt;3 Ponies'. (Escapers may, in the process, break hearts.) See also OWASP's definition.
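The gist of an escaper can be sketched in a few lines (a sketch only: escape_html and its substitution table are illustrative, not the production escapeHTML, which handles further corner cases):

```python
# Minimal sketch of an HTML escaper: maps text/plain to semantically
# equivalent HTML by replacing special characters with entities.
# Per-character substitution avoids double-escaping '&'.
HTML_ESCAPES = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
}

def escape_html(plain_text):
    return ''.join(HTML_ESCAPES.get(ch, ch) for ch in plain_text)
```

So escape_html('I <3 Ponies') yields 'I &lt;3 Ponies', which a browser renders as the original plain text.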
javascript:. A filter can ensure that an untrusted value at the beginning of a URL either contains no protocol or contains one in a whitelist (http, https, or mailto) and, if it finds an untrusted value that violates this rule, might return an innocuous value such as '#' which defangs the URL.
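A whitelist filter of that shape might look like the sketch below (filter_url and its whitelist are illustrative; a production filter must also consider entity- and percent-encoded protocol letters):

```python
# Sketch of a URL protocol filter: allow protocol-less and
# whitelisted-protocol URLs through unchanged, and defang the rest.
SAFE_PROTOCOLS = ('http', 'https', 'mailto')
INNOCUOUS_URL = '#'

def filter_url(untrusted_url):
    protocol, colon, _ = untrusted_url.partition(':')
    if not colon:
        return untrusted_url  # no protocol at all
    if any(ch in protocol for ch in '/?#'):
        return untrusted_url  # the ':' is not in protocol position
    if protocol.lower() in SAFE_PROTOCOLS:
        return untrusted_url  # whitelisted protocol
    return INNOCUOUS_URL      # e.g. javascript:..., livescript:...
```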
normalizeURI might make sure that quotes are encoded so that a URI path can be embedded in an HTML attribute unchanged:
'mailto:<Mohammed%20"The%20Greatest"%20Ali>%20ali@gmail.com'
→
'mailto:%3cMohammed%20%22The%20Greatest%22%20Ali%3e%20ali@gmail.com'
and a function that strips tags from valid HTML allows the tagless HTML to be included in an HTML attribute context.
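For instance, a normalizer along these lines (normalize_uri is an illustrative name, and its substitution table is deliberately minimal) reproduces the transformation above:

```python
# Sketch of a URI normalizer: percent-encode the few characters that
# could terminate an HTML attribute or confuse its parsing, while
# leaving existing %-escapes untouched.
UNSAFE_IN_ATTR = {'<': '%3c', '>': '%3e', '"': '%22', "'": '%27'}

def normalize_uri(uri):
    return ''.join(UNSAFE_IN_ATTR.get(ch, ch) for ch in uri)
```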
text/plain) before concatenating it with content in another language such as text/html in the case of XSS. Other examples of quoting confusion include SQL injection, shell injection, and HTTP header splitting.
typeof in C++, C#, and JavaScript; instanceof in Java and JavaScript; Object#instance_of? in Ruby; type in Python; Object.getClass() in Java; and Object.GetType() in C#.
The template below specifies a form whose action depends on two values, $name and $tgt, which may come from untrusted sources. The {if …}…{else}…{/if} branches define a dynamic URL.
<form action="{if $tgt}/{$name}/handle?tgt={$tgt}{else}/{$name}/default{/if}">Hello {$name}…
First, we parse the template to find trusted static content, dynamic "data holes" that may be filled by untrusted data, and flow control constructs: if, for, etc. The solid black portions are the data holes, and the green portions are trusted static content.
<form action="{if $tgt}/███████/handle?tgt=██████{else}/███████/default{/if}">Hello ███████…
Next we do a flow-sensitive analysis, propagating types to determine the context in which each data hole appears.
<form action="{if $tgt}/███████/handle?tgt=██████{else}/███████/default{/if}">Hello ███████…
↑PCDATA | ↑URL start | ↑URL path | ↑URL query | ↑URL path | ↑PCDATA |
Based on those contexts, we determine the type of content that is expected for each hole.
<form action="{if $tgt}/ URL /handle?tgt=Query {else}/ URL /default{/if}">Hello HTML …
Finally we insert calls to sanitizer functions into the template.
<form action="{if $tgt} /{escapeHTML($name)}/handle?tgt={encodeURIComponent($tgt)} {else} /{escapeHTML($name)}/default {/if} ">Hello {escapeHTML($name)}…
That is the gist of the solution, though the above example glosses over issues with re-entrant templates, templates that are invoked in multiple start contexts, and joining branches that end in different contexts; and the exact sanitization functions chosen differ from those shown in this simplified example.
The example only shows HTML and URL encoding, but our solution deals with data holes that occur inside embedded JavaScript and CSS as any solution for AJAX applications must.
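The inference pass over static chunks and holes can be caricatured as follows (a toy: the string-matching "transitions" stand in for the real combined HTML/CSS/JS parser, and the context names are simplified):

```python
# Toy context propagation: walk a template's chunks, tracking a coarse
# context through the static text and assigning a context to each hole.
def propagate(chunks):
    """chunks: list of ('text', s) or ('hole', name) pairs.
    Returns a list of (name, context) pairs, one per hole."""
    context = 'HTML_PCDATA'
    assignments = []
    for kind, value in chunks:
        if kind == 'text':
            # Stand-ins for real parser transitions.
            if 'action="' in value:
                context = 'URL_START'
            elif context.startswith('URL') and '?' in value:
                context = 'URL_QUERY'
            elif context.startswith('URL') and '">' in value:
                context = 'HTML_PCDATA'
        else:
            assignments.append((value, context))
            if context == 'URL_START':
                context = 'URL_PATH'  # a hole filled the start of the path
    return assignments
```

Each assigned context then indexes into a table of sanitizers, as in the example above.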
In this section we present several metrics on which any competing sanitization scheme should be judged, along with a definition of a safe template that can be used to prove or disprove the soundness of a sanitization scheme; we believe this definition is relevant to the security properties that web applications commonly want to enforce.
A sanitization scheme should be judged on several performance metrics:
Run-time analysis overhead (proportional to overall template runtime) often differs substantially by platform. High-quality parser generators exist for C and Java, so the overhead may be much lower there than in the browser, since iterating character by character over a string is slow in JavaScript.
Our proposal has a modest compile-/load-time cost taking slightly less than 1 second to do static inference for 1035 templates comprising 782kB of source code or about 1ms per template. The runtime analysis for our proposal is zero. The runtime sanitization overhead on a benchmark is between 3% and 10% of the total template execution time, and is indistinguishable from the overhead when non-contextual auto-sanitization is used (all data holes sanitized using HTML entity escaping).
Development overhead is hard to measure, but the 1035 templates were migrated by an application group in a matter of weeks, without stopping application development and with little coordination, so the one-time overhead (the overhead to learn the system) is lower than that of learning and adopting a new templating language. Since the system works by inserting function calls, we provided debugging tools that diffed templates before and after inference was run, to show developers what the system was doing and to aid in debugging. Since templates written under any approach need debugging, the continual development overhead can never be zero, but tool support like diffing can make the system transparent and ease debugging.
Finally, once a bug has been identified, we try to make sure there are simple bugfixing recipes.
SanitizedContent wrapper object of the appropriate type as close to where it is sanitized as possible. Ideally, an HTML tag whitelisting sanitizer would return a value of type SanitizedContent.

What kind of changes, if any, do developers have to make to take an existing codebase of templates and have them properly sanitized? For example, adding sanitization functions manually is time-consuming and error-prone. Making sure that all static content is valid XHTML requires repetitive, time-consuming changes, but would not be as error-prone.
Our proposal allows contextual auto-sanitization to be turned on for some templates and not for others; most templating languages allow templates to be composed, i.e. templates can call other templates, and standard practice seems to be to have a few large templates that call out to many smaller templates. Since this can be done per template, a codebase can be migrated piecemeal, starting with complicated templates that have known problems.
Our proposal does not impose an element structure on template boundaries. Many top level templates look like:
{include "common-header.foo"} <!-- Body content --> {include "common-footer.foo"}
where the common header opens elements that are closed in the common footer:
<html><head> <!-- Common style and script definitions --> ... </head><body> <!-- Common menus -->
Approaches that require template code to be well-formed XML, such as XSLT, cannot support this idiom. Our proposal works for templating languages that allow this idiom because we propagate types as they flow across template calls, rather than inferring types of content based on a DOM derived from a template.
If a development team adopts a sanitization scheme, and finds that it does not meet their needs, how easily can they switch it off, and how much of the effort they invested in deploying it can they recover?
Since our solution works by inserting calls to sanitization functions into templates, a development team having second thoughts can simply run the type inference engine to insert the calls, print out the resulting templates to generate a patch to their codebase, and then remove whatever directives turned on auto-sanitization. We argued above that the cost of adoption is low, and most of the work put into verifying that the sanitization functions chosen were reasonable is recoverable.
Security measures tend to be removed from code under maintenance. Imagine a template that is not auto-sanitized:
<div>Your friend, {escapeHTML($name)}, thinks you'll like this.</div>
that is passed a plain text name. While merging two applications, developers add a call to this code, passing in a rich HTML signature that has been proven safe by a tag whitelister, e.g. "Alan <font color=green>Green</font>span".
Eventually, Mr. Greenspan notices that his name is misrendered and files a bug. A developer might check that the rich text signature is sanitized properly before being passed in, but not notice the other caller that doesn't do any sanitization. They resolve the bug by removing the call to escapeHTML, which fixes the bug but opens a vulnerability.
Over-encoding is more likely to be noticed by end-users than XSS vulnerabilities, so a project under maintenance is more likely to lose manual sanitization directives than to gain them.
Our proposal addresses this by introducing sanitized content types as a principled solution to over-encoding problems.
We define a safe template as one that has several properties: the structure preservation property described here, and the code effect and least surprise properties defined in later sections.
Intuitively, this property holds that when a template author writes an HTML tag in a safe templating language, the browser will interpret the corresponding portion of the output as a tag regardless of the values of untrusted data, and similarly for other structures such as attribute boundaries and JS and CSS string boundaries.
This property can be violated in a number of ways. E.g. in the following JavaScript, the author is composing a string that they expect will contain a single top-level bold element surrounded by text.
document.write(greeting + ', <b>' + planet + '</b>!');
and if greeting is "Hello" and planet is "World" then this holds, as the output written is "Hello, <b>World</b>!"; but if greeting is "<script>alert('pwned');//" and planet is "</script>" then this does not hold, since the structure has changed: the <b> should have started a bold element but the browser interprets it as part of a JavaScript comment in "<script>alert('pwned');//, <b></script></b>!".
Lower level encoding attacks, such as UTF-7 attacks, may also violate this property.
More formally, given any template, e.g.
<div id="{$id}" onclick="alert('{$message}')">{$message}</div>
we can derive an innocuous template by replacing every untrusted variable with an innocuous string: a string that is not empty, is not a keyword in any programming language, and does not contain special characters in any of the languages we're dealing with. We choose our innocuous string so that it is not a substring of the concatenation of literal string parts. Using the innocuous string "zzz", an innocuous template derived from the above is:
<div id="zzz" onclick="alert('zzz')">zzz</div>
Parsing this, we can derive a tree structure where each inner node has a type and children, and each leaf has a type and a string value.
Element
╠Name : "div"
╠Attribute
║ ╠Name : "id"
║ ╚Text : "zzz"
╠Attribute
║ ╠Name : "onclick"
║ ╚JsProgram
║   ╚FunctionCall
║     ╠Identifier : "alert"
║     ╚String : "zzz"
╚Text : "zzz"
A template has the structure preservation property when for all possible branch decisions through a template, and for all possible data table inputs, a template either produces no output (fails with an exception) or produces an output that can be parsed to a tree that is structurally the same as that produced by the innocuous template derived from it for the same set of branch decisions.
∀ branch-decisions ∀ data, areEquivalent(
    parse(innocuousTemplate(T)(branch-decisions, data)),
    parse(T(branch-decisions, data)))
where parse parses using a combined HTML/JavaScript/CSS grammar to the tree structure described above, branch-decisions is a path through flow control constructs (the conditions in for loops and if conditions), and areEquivalent is defined thus:
def areEquivalent(innocuous_tree, actual_tree):
  if innocuous_tree.is_leaf:
    # innocuous_string was 'zzz' in the example above.
    if innocuous_string in innocuous_tree.leaf_value:
      # Ignore the contents of actual since it was generated by
      # a hole.  We only care that it does not interfere with
      # the structure in which it was embedded.
      return True
    # Leaves structurally the same.
    # Assumes same node type implies actual is leafy.
    return (innocuous_tree.node_type is actual_tree.node_type
            and innocuous_tree.leaf_value == actual_tree.leaf_value)
  # Require type equivalence for inner nodes.
  if innocuous_tree.node_type is not actual_tree.node_type:
    return False
  # Zip below will silently drop extras.
  if len(innocuous_tree.children) != len(actual_tree.children):
    return False
  # Recurse to children.
  for innocuous_child, actual_child in zip(
      innocuous_tree.children, actual_tree.children):
    if not areEquivalent(innocuous_child, actual_child):
      return False
  return True  # All grounds on which they could be inequivalent disproven.
This definition is not computationally tractable, but it can serve as a basis for correctness proofs; and since, in practice, branch decisions that go through loops more than twice or recurse more than twice can be ignored, we can gain confidence in an implementation by using fuzzers to generate bad data inputs.
This property is essential to capturing developer intent. When the developer writes a tag, the browser should interpret that as a tag, and when the developer writes paired start and end tags, the browser should interpret those as a matched pair. It is also important to applications that want to embed sanitized data while preserving a trusted path since the structure preservation property is a prerequisite for visual containment.
Web clients may specify data values in code (strings, booleans, numbers, JSON), but only code specified by the template author should run as a result of injecting the template output into a page, and all code specified by the template author should run as a result of the same. There are a dizzyingly large number of ways this property can fail to hold for a template. A non-exhaustive sample of ways to cause extra code to run:
- A <script> element.
- An onclick attribute.
- src or href could specify javascript, livescript, etc. as the protocol in myriad ways.
- expression or -moz-binding.
- <object> might load flash cross origin and AllowScriptAccess.
- eval, setTimeout, etc.
There are also many ways to cause security-critical code to not run. In general, it is not wise to rely on JavaScript running in a browser, but many developers, not unreasonably, rely on some code having run if other code is running at a later time. A non-exhaustive sample of ways to stop code running via XSS:
- A <base> element disabling src='ed <script>s with relative URLs.
- A <script> element that fails to parse, e.g. due to codepoints U+2028 or U+2029 in a string body. (A violation of the Structure Preservation Property.)
- A <script> tag that is not interpreted as such. (A violation of the Structure Preservation Property.)
- Object.prototype.toString = function() {throw new Error}
- A <noscript> element around the <head>. (A violation of the Structure Preservation Property.)
Our proposal enforces this property by filtering URLs to prevent any data hole from specifying an exotic protocol, by filtering CSS keywords, and by only allowing data holes in JavaScript contexts to specify simple boolean, numeric, and string values, or complex JSON values which cannot have free variables. We assume that the JavaScript interpreter will work on arbitrarily large inputs. "Defining Code-Injection Attacks" by Ray & Ligatti defines a similar property: a CIAO (code injection attack on outputs) occurs when an interpolation causes the parse tree to include an expression that is not in its normal form, one consequence of which is that it has no free variables.
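The JavaScript-context rule can be approximated with a JSON round-trip (js_value is a hypothetical helper sketched here, not our implementation's API):

```python
import json

def js_value(untrusted):
    """Serialize a value so it can only denote a JS literal, never code
    with free variables or side effects."""
    if not isinstance(untrusted, (bool, int, float, str, list, dict, type(None))):
        raise TypeError('not a JSON-serializable value')
    # json.dumps with the default ensure_ascii=True also escapes the
    # U+2028/U+2029 line terminators that are legal in JSON strings but
    # break JavaScript string bodies.
    out = json.dumps(untrusted)
    # Keep an embedded "</script>" from closing the script element.
    return out.replace('</', '<\\/')
```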
Finally, our escapers are designed to produce output that avoids grammatical hazards such as semicolon insertion, non-ASCII newline characters, and regular-expression/division-operator/line-comment confusion.
Identifying all places in which a URL might appear in HTML (incl. MathML and SVG) is relatively easy compared to CSS. In CSS, it is difficult. For example, in <div style="background: {$bg}">, $bg might specify a URL, a color name, a color value like #000, a function-like color rgb(0,0,0), a keyword value like transparent, or a combination of the above. Given how hard it is to reliably black-list URLs when you know the content is a URL, we took the rather drastic approach of forbidding anything that might specify a colon in CSS data holes.
This seems to affect very little in practice, and we could relax this constraint to allow colons preceded by a safe word like the name of an element, pseudo-element, or innocuous property. Even if we did, it is possible that existing code uses colons in data holes to specify list separators a la semantic HTML, and we would break that use case:
ul.inline li { list-style: none; display: inline }
ul.inline li:before { content: ': ' }  /* ', ' here would give a normal looking list. */
ul.inline li:first-child:before { content: '' }
This property is a prerequisite for many application privacy goals. If a third-party can cause script to run with the privileges of the origin, it can steal user data and phone home. Even if credentials are unavailable to JavaScript (HTTPOnly cookies), scripts with same-origin privileges can screen scrape (using DOM APIs) user names and identifiers and associated page content and phone home.
This property is also a prerequisite for many informed consent goals.
If a third-party script can install onsubmit handlers, it can rewrite form data before it is submitted with the XSRF tokens that are meant to ensure that the data submitted was specified by the user.
The last of the security properties that any auto-sanitization scheme should preserve is the property of least surprise. The authors do not know how to formalize this property.
Developer intuition is important.
A developer (or code reviewer) familiar with HTML, CSS, and
JavaScript; who knows that auto-sanitization is happening should be
able to look at a template and correctly infer what happens to dynamic
values without having to read a complex specification document.
Simple rules-of-thumb should be sufficient to understand the system.
E.g. if a mythical average developer sees
<script>var msg = '{$msg}';</script>
and their intuition is that $msg should be escaped using JavaScript-style \ sequences, and that is sufficient to preserve the other security properties, then that is what the system should do.
Templates should be both easy to write and to code review.
Exceptions to the system should be easily audited.
SQL prepared statements are great, but there's no way to have exceptions to the rule without giving up the whole safety net, so sometimes developers work around them by concatenating strings. It's hard to grep (or craft presubmit triggers) for all the places where concatenated strings are passed to SQL APIs, so it's hard for a more senior developer to find these after the fact and explain how naive developers can achieve their goals working within the system, notice a trend that points to a systemic problem with schemas, or agree that the exception to the rule is warranted and document it for future security auditors.
Our proposal was designed with this goal in mind, but we have not managed to quantify our success. We can note that 1035 templates were converted within a matter of weeks without a flood of questions to the mailing lists we monitor, so we infer that most of the parts of the system that were heavily exercised were non-controversial. Different communities of developers may have different expectations. We worked with a group of developers most of whom knew Java, C++, or both before starting web application development, and among whom a high proportion have at least a bachelor's degree in CS or a related field. They may differ, intuition-wise, from developers who came to web development from a Ruby, Perl, or PHP background.
In this section we introduce a number of alternative proposals and explain why they perform worse on the metrics above. We cite real systems as examples of some of these alternatives. Many of these systems are well-thought-out, reasonable solutions to particular problems their authors faced. We merely argue that they do not extend well to the criteria we outlined above, and explicitly label these sections "strawmen" to clarify the difference between our design criteria and the contexts in which these systems arose. We do claim, though, that any comprehensive solution to XSS, at a tools level, should meet the criteria above.
Manual sanitization is the current state of the art. Developers use a suite of functions, such as OWASP's open-source ESAPI encoders, and every developer must learn when and how to apply them correctly. They must apply sanitizers either before data reaches a template or within the template by inserting function calls into code.
This places a significant burden on developers and does not guarantee any of the security properties listed above. One lapse can undo all the work put into hardening a website because of the all-or-nothing nature of the same-origin policy.
There is a tradeoff between correctness and simplicity of API that works in the attacker's favor. Manual sanitization is particularly error-prone because developers learn the good parts of the languages they work in, but attackers have available to them the bad parts as well. The syntax of HTML, CSS, and JavaScript is much gnarlier than most developers imagine, and it is an unreasonable burden to expect them to learn and remember obscure syntactic corner cases. These corner cases mean that the typical suite of 4-6 escaping functions is the most that many developers can reliably choose from, yet it is insufficient to handle corner cases or nested contexts.
Changes in language syntax or vendor-specific extensions (e.g. XML4J and embedded SVG) may invalidate developers' previously valid assumptions. Code that was safe before may no longer be safe. With an automated system, a security patch and recompile may suffice, but a patch to code that took a team of developers years to write will take a team of developers to fix.
XSS Scanners (e.g. lemon) can mitigate some of manual sanitization's cons (though they work with any of the other solutions here as well to provide defense-in-depth), but there are no good scanners for AJAX applications, and, with manual sanitization, scanners impose a continual burden on developers to respond to the reported errors.
Non-contextual auto-sanitization is a great improvement over manual sanitization. Django templates and others use it.
It works by assuming that every data hole should be sanitized the same way,
usually by HTML entity encoding. As such, it is prone to over-escaping and
mis-escaping.
To understand mis-escaping, consider what happens when the following template is called with ', alert('XSS'), ':
<button onclick="setName('{$name}')">
The template produces
<button onclick="setName('&#39;, alert(&#39;XSS&#39;), &#39;')">
which is exactly the same, to the browser, as
<button onclick="setName('', alert('XSS'), '')">
because the browser HTML-entity-decodes the attribute value before invoking the JavaScript parser on it.
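This is easy to check mechanically with Python's html module (standing in here for the browser's attribute-value decoder):

```python
# The HTML-escaped payload survives entity decoding as the original
# attack string: entity encoding and the browser's attribute-value
# decoding cancel out before the JavaScript parser ever runs.
import html

payload = "', alert('XSS'), '"
attribute_value = "setName('" + html.escape(payload) + "')"
seen_by_js_parser = html.unescape(attribute_value)
# seen_by_js_parser is now "setName('', alert('XSS'), '')"
```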
Non-contextual auto-sanitization cannot preserve the structure preservation property for JavaScript, CSS, or URLs because it is unaware of those languages. It also fails to preserve the code effect property.
Bolting filters on non-contextual auto-sanitization will not help it to preserve the code effect property. It is possible to write bizarre JavaScript that does not even need alphanumerics. Since JavaScript has no regular lexical grammar, regular expressions that are less than draconian are insufficient to filter out attacks.
Non-contextual auto-sanitization, with auditable exceptions like Django's, does preserve the least surprise property in a sense. With very little training, a developer can predict exactly what it will do, and empirically, 74% of the time it does what they want (our system chose some kind of HTML entity encoding for 992 out of 1348 data holes).
Examples of strict structural containment languages are XSLT, GXP, Yesod, and possibly XHP. For all of these, the input is (or is coercible, via fancy tricks, to) a tree structure like XML. So for every data hole, it is obvious to the system which element and attribute context the hole appears in†. A similar structural constraint could be applied in principle to embedded JS, CSS, and URIs.
Strict structural containment is a sound, principled approach to building safe templates and a great choice for anyone planning a new template language, but it cannot be bolted onto existing languages because it requires that every element and attribute start and end in the same template. This assumption is violated by several very common idioms, such as the header-footer idiom above, in ways that often require drastic changes to repair.
Since it cannot be bolted onto existing languages, limiting ourselves to it would doom to insecurity most of the template code existing today. Most project managers who know their teams have trouble writing XSS-free code, know this because they have existing code written in a language that does not have this property.
† - modulo mechanisms like <xsl:element name="...">, which can, in principle, be repaired using equivalence classes of elements and attributes. I.e. one could define an equivalence class of elements all of whose attributes have the same meaning and which have the same content type: (TBODY, THEAD, TFOOT), (OL, UL), (TD, TH), (SPAN, I, B, U), (H1, H2, H3, …), and allow a dynamic element mechanism to switch between element types within the same equivalence class. Similar approaches can allow selecting among equivalent dynamic attribute types: all event handlers are equivalent (modulo perhaps those that imply user interaction for some applications).
Prior to this work, the best auto-sanitization scheme was a runtime scheme.
A runtime contextual auto-sanitizer plugs into a template runtime at a low level. Instead of writing content to an output buffer, the template runtime passes trusted and untrusted chunks to the auto-sanitizer. The template:
<ul>{for $item in $items}<li onclick="alert('{$item}')">{$item}{/for}</ul>
might produce the output on the left, and by propagating context at runtime, infer the context in the middle and choose to apply the escaping directives on the right before writing to the output buffer.
Content | Trusted | Context | Sanitization function |
---|---|---|---|
<ul> | Yes | PCDATA | none |
<li onclick="alert('> | Yes | PCDATA | none |
foo | No | JS string | escapeJSString |
')"> | Yes | JS string | none |
foo | No | PCDATA | escapeHTML |
<li onclick="alert('> | Yes | PCDATA | none |
<script>doEvil()</script> | No | JS string | escapeJSString |
')"> | Yes | JS string | none |
<script>doEvil()</script> | No | PCDATA | escapeHTML |
</ul> | Yes | PCDATA | none |
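A toy version of such a runtime makes the plumbing concrete (the two-state "parser" and its ad hoc transitions below are drastic simplifications of the hand-tuned parsers discussed next):

```python
# Toy runtime contextual auto-sanitizer: the template runtime hands
# (text, trusted) chunks to this function instead of writing them
# directly; untrusted chunks are escaped per the current context.
def escape_js_string(s):
    return s.replace('\\', '\\\\').replace("'", "\\'").replace('<', '\\x3c')

def escape_html(s):
    return s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

def run(chunks):
    out, context = [], 'PCDATA'
    for text, trusted in chunks:
        if trusted:
            out.append(text)
            # Naive context tracking in lieu of a real parser.
            if "alert('" in text:
                context = 'JS_STRING'
            elif "')" in text:
                context = 'PCDATA'
        else:
            out.append(escape_js_string(text) if context == 'JS_STRING'
                       else escape_html(text))
    return ''.join(out)
```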
This works, and with a hand-tuned C parser has been deployed successfully on CTemplates and ClearSilver.
Writing a highly tuned parser in JavaScript, though, is difficult, so implementing this scheme requires a hard trade-off between flexibility and correctness on one hand and download size and speed on the other.
Our proposal is a factor of 4 faster than a runtime scheme implemented in JavaScript and has no download size cost above and beyond the code for the sanitization functions and the calls to them.
Even in languages for which there are efficient parser generators, runtime approaches might suffer performance-wise. The overhead for the static approach is independent of the number of times a loop is re-entered, so templates that take large array inputs might perform worse with even a highly efficient runtime scheme.
Runtime sanitization does do better in at least one area, though: dynamic tag and attribute names pose no problems to a runtime sanitizer. Whereas our scheme has to filter attribute names so that $aname cannot be "onclick" in <button {$aname}=…> (because a static approach must decide whether the beginning of the attribute value is a JavaScript context or some other context), a runtime approach can take into account the actual value of $aname. This is not a common problem, and our approach does handle many dynamic attribute situations, including <button on{$handlerType}=…>.
We know of no purely static approaches, though they are possible. A purely static approach is one that, like our proposal, infers contexts at compile or load time, but does not take into account the runtime type of the values that fill the data holes.
This approach has problems with over-escaping. Existing systems often use a mix of sanitization in-template and sanitization outside the template in the front-end code that calls the template.
Our solution takes into account the runtime type of the values that fill a hole. If the runtime type marks the value as a known-safe string of HTML, then a sanitization function can choose not to re-escape, and instead normalize or do nothing.
See caveats for other problems that are equally applicable to purely static systems and to our proposal.
This section is only relevant to implementors, testers, and others who want to understand the implementation. Everyone else, including web application developers, can ignore it.
At a high level, the type system defines four things, which are expanded upon below:

1. An initial context: `HTML_PCDATA`.
2. A context propagation operator: `(context * string) → context`.
3. A sanitization function chooser: `context → ((α → string) * context)`. If data holes have statically available type info, then the type could be taken into account: `(context * type) → ((α → string) * context)`.
4. A context join operator: `context list → context`. This is used with `{if}` by joining the context at the end of the then-branch with the context at the end of the else-branch. It is also used with loops, where (unless proven otherwise) we have to join the context at the start (loop never entered) with a context once through, with a steady-state context for many repetitions.

By contrast, the runtime auto-sanitization scheme described in strawman III has the same initial context, the same context propagation operator, no context join operator, and uses a slightly differently shaped sanitization function chooser: `context → (α → (string * context))`.
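A minimal sketch of these four pieces, with a toy two-state context standing in for the much richer real context, might look like the following; the `Context` values and propagation rules here are illustrative assumptions, not the production implementation:

```python
from enum import Enum
from typing import Callable, List, Tuple

class Context(Enum):
    HTML_PCDATA = "HTML_PCDATA"
    JS = "JS"
    ERROR = "ERROR"  # stands in for an ambiguous/failed join

# 1. The initial context for a public template.
START_CONTEXT = Context.HTML_PCDATA

# 2. Context propagation: (context * string) -> context.
def propagate(context: Context, literal_text: str) -> Context:
    # Toy rule: a <script> tag switches into a JS context and back.
    if context is Context.HTML_PCDATA and "<script>" in literal_text:
        return Context.JS
    if context is Context.JS and "</script>" in literal_text:
        return Context.HTML_PCDATA
    return context

# 3. Sanitization function chooser: context -> ((alpha -> string) * context).
def choose_sanitizer(context: Context) -> Tuple[Callable[[object], str], Context]:
    if context is Context.HTML_PCDATA:
        return (lambda v: str(v).replace("&", "&amp;").replace("<", "&lt;"),
                context)
    if context is Context.JS:
        return (lambda v: repr(str(v)), context)
    raise ValueError("no sanitizer for %s" % context)

# 4. Context join: context list -> context.
def context_join(contexts: List[Context]) -> Context:
    distinct = set(contexts)
    return distinct.pop() if len(distinct) == 1 else Context.ERROR
```

The `ERROR` result from the join mirrors the "fail with an error message due to ambiguity from context joining" behavior described later.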
A context captures the state of the parser in a combined HTML/CSS/JS lexical grammar. It is composed of a number of fields which pack into 2 bytes with room to spare:

- An HTML state (are we between `<` and `>`?), which keeps track of whether the tag body is PCDATA, RCDATA, or CDATA; and once in an RCDATA or CDATA tag body, is used to keep track of the expected end tag, e.g. inside a `<script>` body we have to find a `</script>` tag, but should ignore any apparent `</style>` tags.
- An attribute type, distinguishing script handler attributes (`onclick`, etc.), `style` attributes, URL attributes (`href`, etc.), and others.
- A JS state used to decide what to do with a `/` that does not start a comment: enter a regular expression literal, or a division operator, or fail with an error message due to ambiguity from context joining.

Contexts support two operators: join and ε-commit.
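The claim that these fields pack into 2 bytes can be illustrated with a hypothetical bitfield layout; the field names, value sets, and bit widths below are guesses for illustration, not the real encoding:

```python
# Hypothetical field layout packing a context into under 16 bits.
HTML_STATES = ["HTML_PCDATA", "HTML_TAG_NAME", "HTML_TAG",
               "HTML_BEFORE_ATTRIBUTE_VALUE", "HTML_ATTRIBUTE_VALUE",
               "HTML_RCDATA", "HTML_CDATA"]                            # 4 bits
ELEMENT_TYPES = ["NONE", "SCRIPT", "STYLE", "TITLE_TEXTAREA"]          # 2 bits
ATTR_TYPES = ["NONE", "SCRIPT", "STYLE", "URI", "PLAIN"]               # 3 bits
JS_SLASH = ["UNKNOWN", "REGEX", "DIV_OP"]                              # 2 bits
DELIMS = ["NONE", "DOUBLE_QUOTE", "SINGLE_QUOTE", "SPACE_OR_TAG_END"]  # 2 bits

def pack(html_state, element, attr, js_slash, delim):
    bits = HTML_STATES.index(html_state)
    bits |= ELEMENT_TYPES.index(element) << 4
    bits |= ATTR_TYPES.index(attr) << 6
    bits |= JS_SLASH.index(js_slash) << 9
    bits |= DELIMS.index(delim) << 11
    return bits  # 13 bits used: 2 bytes with room to spare

def unpack(bits):
    return (HTML_STATES[bits & 0xF],
            ELEMENT_TYPES[(bits >> 4) & 0x3],
            ATTR_TYPES[(bits >> 6) & 0x7],
            JS_SLASH[(bits >> 9) & 0x3],
            DELIMS[(bits >> 11) & 0x3])
```

A compact integer encoding like this makes contexts cheap to compare, hash, and store in the inference maps used during propagation.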
The join operator produces the context at the end of a condition, loop, switch, or other flow control construct. This sometimes introduces an ambiguity. In the template:

`<form action="{if $tgt}/{$name}/handle?tgt={$tgt}{else}/{$name}/default{/if}↑">Hello {$name}…`

one branch ends in the query portion of a URI, and one ends outside it. If there were a data hole at the ↑, then we would not be able to determine an appropriate sanitization function for it†. So context joining often introduces just enough ambiguity, by using do-not-know values for fields, and in the common case we later reach a point where we can discard that info. In the URI case, if there were a `#` character at the ↑ we could reliably transition into a URI fragment context, and in any case, the end of the attribute moots the question.
The ε-commit operator is used when we see a data hole. In some cases, we introduce parser states to delay decision making. In the template fragment `<a href=`, we could see a quote character next, or a space, or the start of an unquoted value, or the end of the tag (implying an empty href), or a data hole specifying the start of an unquoted attribute value. If the next construct is a data hole, we need to commit to it being an unquoted attribute. The ε-commit operator in this case goes from an HTML_BEFORE_ATTRIBUTE_VALUE state with an attribute end delimiter of NONE to a state appropriate to the value type (e.g. JS for an `onclick` attribute) with an attribute end delimiter of SPACE_OR_TAG_END.
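The transition just described can be sketched as a small function; the state names follow the text, but the transition table itself is a simplified assumption:

```python
# Sketch of the epsilon-commit step: on seeing a data hole right after
# `=`, commit to an unquoted attribute value whose inner context depends
# on the attribute type.  Returns (state, attribute_end_delimiter).
def epsilon_commit(state, attr_type, delim):
    if state == "HTML_BEFORE_ATTRIBUTE_VALUE" and delim == "NONE":
        inner = {"SCRIPT": "JS", "STYLE": "CSS", "URI": "URI"}.get(
            attr_type, "HTML_ATTRIBUTE_VALUE")
        return (inner, "SPACE_OR_TAG_END")
    return (state, delim)  # no commit needed elsewhere in this sketch
```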
The precise details of both these operators were determined empirically to come up with the simplest semantics that handles cases found in real code that web developers do not consider to be badly written or confusing.
† — This could be fixed by migrating the problematic data hole and the code leading up to it into each branch, but this is tricky to do across template boundaries and has not proven to be necessary for the codebase we migrated.
The context propagation algorithm uses a combined HTML/CSS and JS lexical grammar described below.
`new SanitizedHtml('<b>Hello, World!</b>')` → `<b>Hello, World!</b>`

The first case is handled by encoding all PCDATA special characters (<, >, and &) as HTML entities (`&lt;`, `&gt;`, and `&amp;`). Other code points may be escaped, but need not be.

In the second case, the safe HTML is emitted as-is. It must be a mixed group of complete tags and text nodes such that there exists a safe template that could have produced it starting from an HTML PCDATA context and ending in the same context, or there exists a safe HTML sanitizer that could have produced it.
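A sketch of the two cases in Python, assuming a `SanitizedHtml` wrapper like the one in the example above:

```python
class SanitizedHtml:
    """Wraps a chunk of HTML its creator asserts is safe."""
    def __init__(self, html):
        self.html = html

def escape_html(value):
    # Second case: known-safe HTML is emitted as-is.
    if isinstance(value, SanitizedHtml):
        return value.html
    # First case: plain text has its PCDATA specials entity-encoded.
    # Ampersand first, so the other entities are not double-escaped.
    return (str(value).replace("&", "&amp;")
                      .replace("<", "&lt;")
                      .replace(">", "&gt;"))
```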
`new SanitizedHtml('<b>Hello, World!</b>')` → `&lt;b&gt;Hello, World!&lt;/b&gt;`

The first case is handled by encoding all RCDATA special characters (<, >, and &) as HTML entities (`&lt;`, `&gt;`, and `&amp;`). Other code points may be escaped, but need not be.
In the second case, the safe HTML is normalized: all the HTML special characters are escaped except for ampersands (&), which are left as-is. Since all RCDATA end tags contain `<`, and `<` is escaped to a string that does not contain it, and no other code unit is escaped to a string that contains it, no safe HTML chunk can cause premature ending of an RCDATA tag. This means that the odd but valid Soy template

`<textarea>{$foo}<script>alert('Keystone kop');</script></textarea>`

will not violate the structure security goal or the unauthored code security goal even when a chunk of safe HTML contains an RCDATA end tag like `</textarea>`.
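The normalization step for safe HTML in RCDATA can be sketched as follows; the escape set here is reduced to just the characters the argument above depends on:

```python
def normalize_html_rcdata(safe_html):
    # Escape < and > but leave ampersands alone, so existing entities
    # survive.  Every end tag contains `<`, and `<` never survives
    # unescaped, so no `</textarea>` (or any end tag) can slip through.
    return safe_html.replace("<", "&lt;").replace(">", "&gt;")
```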
`<h{$headerLevel}>` can be used to generate `<h1>`, `<h2>`, …

To avoid problems where a dynamic tag name might be combined with a static part to form `script`, `style`, or another CDATA or RCDATA tag, we impose the following restrictions:
TODO: scheme to avoid concatenation from producing `on*`, `style`, `href`, etc.
If the result is known safe HTML, strips tags so that the Soy template

`<abbr title="{$longDesc}">{$shortDesc}</abbr>`

works even when both `$longDesc` and `$shortDesc` are snippets of sanitized HTML.

`new SanitizedHtml('<b>Hello, World!</b>')` → `Hello, World!`
The first case is handled by encoding all HTML special characters, including quotes (<, >, &, ", ', and =), as HTML entities (`&lt;`, `&gt;`, `&amp;`, `&quot;`, `&#39;`, and `&#61;`).

The second case is handled by stripping HTML tags and comments from the safe HTML, and then normalizing it by applying the same escaping scheme as for the first case, but without encoding ampersands (&).

For both cases, when the HTML attribute is not quoted, we additionally have to escape all code points that would signal the end of an HTML attribute, including a number of space and control characters. This set was derived empirically, and includes the backtick (`) which can be used as a quoting character on some versions of IE.
We escape dynamic JS strings using the following table:

| Codepoint | Glyph | Escape |
|---|---|---|
| U+000A | (line feed) | \n |
| U+000D | (carriage return) | \r |
| U+0022 | " | \u0022 |
| U+0027 | ' | \u0027 |
| U+002F | / | \/ |
| U+003C | < | \u003C |
| U+003E | > | \u003E |
| U+005C | \ | \\ |
| U+2028 | (line separator) | \u2028 |
| U+2029 | (paragraph separator) | \u2029 |
These escapes prevent premature string closing, since all JS quote characters are encoded to a sequence that does not contain a quote character, and no other codepoint is encoded to a sequence containing a quote character. They prevent additional JS syntax errors by properly encoding all JS newline codepoints. They preserve structure by encoding any sequences that would end a CDATA tag, CDATA section, escaping text span, or quoted HTML attribute value. The output can be embedded in an HTML attribute value by additionally escaping & to \u0026. In the case of unquoted HTML attribute values, just escaping ampersands is not sufficient; the output needs to be HTML entity escaped per DynamicAttrValue.
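The table translates directly into a lookup-based escaper; in this sketch, code points outside the table pass through unchanged (the production escaper handles more cases):

```python
# The JS string escape table above, as a Python dict.
JS_STRING_ESCAPES = {
    "\n": "\\n", "\r": "\\r",
    '"': "\\u0022", "'": "\\u0027",
    "/": "\\/",
    "<": "\\u003C", ">": "\\u003E",
    "\\": "\\\\",
    "\u2028": "\\u2028", "\u2029": "\\u2029",
}

def escape_js_string(s):
    return "".join(JS_STRING_ESCAPES.get(ch, ch) for ch in s)
```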
Like DynamicJsString, but additionally escapes characters special in regexp like ? and *.
Putting spaces around non-string values makes sure that they will be separate tokens but will not introduce a function call, as in the Soy template

```
var f = function () {}  // Missing semicolon.
{$myBoolean} && sideEffect();
```

where, due to semicolon insertion, adding parentheses instead would cause the template to produce the equivalent of `var f = ((function () {})(false)) && sideEffect();` given `{ myBoolean: false }`.
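A sketch of that behavior; the quoting branch is simplified to the two escapes needed for the example:

```python
def escape_js_value(value):
    # Booleans and numbers are wrapped in spaces, never parenthesized,
    # to avoid the semicolon-insertion pitfall described above.
    if isinstance(value, bool):  # bool first: bool is a subclass of int
        return " %s " % ("true" if value else "false")
    if isinstance(value, (int, float)):
        return " %s " % value
    # Everything else is quoted and string-escaped (simplified here).
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\u0027")
    return "'%s'" % escaped
```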
We encode all CSS special characters using CSS hex escaping. CSS hex escaping allows an escape to be terminated by an optional space or tab character, so that an escaped character may be followed by a literal, unescaped hex digit without ambiguity. We always emit the following space.

We aggressively encode all CSS special characters to prevent unspecified CSS error recovery from restarting parsing inside quoted strings.
> 9.2.1. Error conditions
>
> In general, this document does not specify error handling behavior for user agents (e.g., how they behave when they cannot find a resource designated by a URI).
> However, user agents must observe the rules for handling parsing errors.
> Since user agents may vary in how they handle error conditions, authors and users must not rely on specific error recovery behavior.
We also escape both angle brackets (< and >), one of which is already a CSS special, so that HTML escaping text spans, CDATA sections, CDATA end tags, etc. cannot be introduced into the middle of CSS strings.
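CSS hex escaping with the always-emitted trailing space can be sketched like this; the particular set of specials below is an assumption for illustration:

```python
# Characters treated as CSS specials in this sketch; the real set is
# derived from the CSS grammar and the HTML-embedding concerns above.
CSS_SPECIALS = set("\"'\\<>&{};:()@/=*")

def escape_css_string(s):
    # "\XX " hex escape with a trailing space, so a literal hex digit
    # after the escape is never absorbed into it.
    return "".join(
        "\\%X " % ord(ch) if ch in CSS_SPECIALS or ord(ch) < 0x20 else ch
        for ch in s)
```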
- `color: #{$hashColor}`
- `color: {$colorName}`
- `border-{$rtlLeft}: …  /* left for English, right for Arabic */`
- `div.{$className} { … }`
- `width: {$width}{$widthUnits}`
TODO: explain the allowed set and its derivation.
Whitelists the protocol, if present, to prevent code execution via `javascript:…`, and normalizes the URI (encoding all unencoded HTML special characters, quotes, spaces, and parentheses) so it can be embedded, e.g. `"` → `%22`. URI normalization percent escapes all codepoints escaped by DynamicQueryPart except for the percent character (%).
TODO: Explain the filter details and their derivation.
Encodes all characters that are special or disallowed in a URI. We encode all codepoints encoded by `encodeURIComponent`, making the same assumption that the URL is UTF-8 encoded. Beyond `encodeURIComponent`, we additionally encode single quotes (') and parentheses (( and )) so that the result can be safely embedded in single-quoted HTML attributes and in single-quoted and unquoted CSS `url(…)` constructs. Note that applying an extra level of CSS escaping using `\27`-style escapes is not an option, since IE (for interoperability with DOS file paths?) does not interpret `\` as the beginning of an escape when it appears inside a `url(…)`.
Each of these characters is significant in a URI as specified in RFC 3986:

> 2.2 Reserved Characters
>
> sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

so escaping them is technically not semantics preserving, but encoding them is safe for all schemes that commonly appear in HTML because those codepoints only appear in the obsolete mark productions:

> D.2 Modifications
>
> The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of [RFC2234]. This change required all rule names that formerly included underscore characters to be renamed with a dash instead. In addition, a number of syntax rules have been eliminated or simplified to make the overall grammar more comprehensible. Specifications that refer to the obsolete grammar rules may be understood by replacing those rules according to the following table: …
>
> mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")"
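The extra layer over `encodeURIComponent` can be sketched in Python, with `urllib.parse.quote` standing in for `encodeURIComponent`; the safe set below mirrors `encodeURIComponent`'s unescaped punctuation minus the single quote and parentheses:

```python
from urllib.parse import quote

def escape_uri_component(s):
    # encodeURIComponent leaves A-Za-z0-9 - _ . ! ~ * ' ( ) unescaped;
    # we shrink that set by ' ( ) as described above.  quote() percent-
    # encodes assuming UTF-8, matching the assumption in the text.
    return quote(s, safe="-_.!~*")
```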
Normalizes the URI in the same way as URI normalization, so that an already encoded path or fragment can be emitted inline, but does not filter, since a protocol part cannot appear here.
The context propagation algorithm uniquely determines the context at every data hole so that a later pass may choose a sanitization function for each hole. The algorithm operates at two levels: one on the graph of templates, and another individually within templates. The first deals with identifying the minimal set of templates that need to be processed, and might clone templates to deal with templates that are called in multiple different contexts.
The template context propagation algorithm uses an inference object, implemented as a set of nested maps and a pointer to a parent inference object. This allows us to speculatively type a template sub-graph and, once we have a consistent view of types, collapse our conclusions into the parent by simply copying maps from children to parent. The maps include one from data holes to their start contexts, and one from templates to end contexts, used to type calls.
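A minimal sketch of such an inference object, assuming the map and method names used by the pseudocode in this section:

```python
class Inferences:
    def __init__(self, parent=None):
        self.parent = parent
        # Speculative conclusions live here until committed.
        self.template_end_contexts = {}
        self.context_for_data_hole = {}

    def get_end_context(self, template):
        # Lookups fall back to the parent inference object.
        if template in self.template_end_contexts:
            return self.template_end_contexts[template]
        return self.parent.get_end_context(template) if self.parent else None

    def commit_into_parent(self):
        # A consistent speculative typing is folded into the parent by
        # simply copying the child maps up.
        self.parent.template_end_contexts.update(self.template_end_contexts)
        self.parent.context_for_data_hole.update(self.context_for_data_hole)
```

Because an uncommitted child simply gets garbage-collected, a failed speculative typing leaves the parent untouched.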
```
def autosanitize(templates):
  inferences = Inferences()
  for template in templates:
    if inferences.getEndContext(template) is not None:
      continue  # already done
    if template.is_public() or template.is_contextually_autosanitized():
      # By exploring the call graph from only public templates, ones that
      # can be invoked by front-end code, we do not trigger error checks
      # for parts of the code-base that don't yet use contextual
      # auto-sanitization, easing migration.
      compute_end_context(template, inferences, start_context=HTML_PCDATA)  # †
  return inferences
```
That algorithm delegates all the hard work to another algorithm below that examines the template graph reachable from one particular top-level template.
```
def compute_end_context(template, inferences, start_context):
  # We need to choose an end context before typing the body to avoid
  # infinite regression for recursive templates.
  # Start with the optimistic assumption that the template ends in the
  # same context in which it starts.
  # Empirically, less than 0.2% of templates in our sample violate this
  # assumption.  The ones that do tend to be some of the gnarliest code
  # that template authors would rather not refactor.
  optimistic_assumption_1 = Inferences(parent=inferences)
  optimistic_assumption_1.template_end_contexts[template] = start_context
  end_context = propagate_context(
      template.body, start_context, optimistic_assumption_1)
  if start_context == end_context:
    # Our optimistic assumption was warranted.
    optimistic_assumption_1.commit_into_parent()
    return end_context
  # Otherwise, assume that the end_context above is the end context and
  # check that we have reached a fixed point.
  optimistic_assumption_2 = Inferences(parent=inferences)
  optimistic_assumption_2.template_end_contexts[template] = end_context
  end_context_fixed_point = propagate_context(
      template.body, start_context, optimistic_assumption_2)
  if end_context_fixed_point == end_context:
    # We have a fixed point.  Phew!
    optimistic_assumption_2.commit_into_parent()
    return end_context_fixed_point
  # We could try other strategies to generate optimistic assumptions, but
  # we have not seen a need in real template code.
  raise Error(...)
```
Thus far, we have done nothing that is particular to the syntax of the templating language itself. Different languages have different semantics around parameter passing, and provide different flow control constructs. The algorithm below is an example for a simple template language that provides calls, conditions, chunks of static template text, and expression interpolations which fill data holes. On a call, it may recurse to the compute-end-context algorithm above, which is how we lazily explore the portion of the template call graph needed.
```
def propagate_context(parse_tree_nodes, context, inferences):
  for parse_tree_node in parse_tree_nodes:
    if is_safe_text_node(parse_tree_node):
      context = apply_html_grammar(parse_tree_node.safe_text, context)
    elif is_data_hole(parse_tree_node):
      context = epsilon_commit(context)  # see definition above
      inferences.context_for_data_hole[parse_tree_node] = context
      context = …  # compute context after hole.
    elif is_conditional(parse_tree_node):
      if_context = propagate_context(
          parse_tree_node.if_branch, context, inferences)
      else_context = propagate_context(
          parse_tree_node.else_branch, context, inferences)
      context = context_join(if_context, else_context)
    elif is_call_node(parse_tree_node):
      output_context = None
      # possible_callees comes up with the templates this might be calling,
      # and may clone templates if they are called in multiple different
      # contexts.  Most template languages have static call graphs, so in
      # practice there is exactly one possible callee.
      for possible_callee in possible_callees_of(parse_tree_node, context):
        if possible_callee not in inferences.template_end_contexts:
          context_after_call = compute_end_context(
              possible_callee, inferences, context)
        else:
          context_after_call = inferences.template_end_contexts[possible_callee]
        if output_context is None:
          output_context = context_after_call
        else:
          # Since 99% of templates end in their start context, in practice
          # this join does little.
          output_context = context_join(output_context, context_after_call)
      context = output_context
  return context
```
† — We make the simplifying assumption that the start context for all public templates is HTML_PCDATA. Some templating languages may be used in different contexts, and so this assumption might not prove valid. We could choose the starting context for public templates based on some kind of annotation or naming convention particular to the templating language.
We define a suite of sanitization functions. The table below describes them briefly and the contexts in which they are used. There are significantly more of them than in most manual escaping schemes. As noted above, most developers who don't work on parsers for HTML/CSS/JS have a simplified mental model of the grammar, which makes it difficult to choose between this many options. We have many sanitization functions because we want to minimize template output size, and thus network latency; having more sanitization functions lets us avoid escaping common characters like spaces when that is safe. The naming convention for sanitization functions reflects the escaper, filter, and normalizer definitions from the glossary. By convention, sanitization functions are split into broad groups: escaping functions transform an input language (usually plain text) to the output language; filters either pass a string in any input language through unchanged or replace it with an innocuous string; and normalizers transform a string in the input language to an output in the same language that is easier to embed in other languages.
| Function | Description |
|---|---|
| escapeHTML | HTML entity escapes plain text, and allows pre-sanitized HTML content through unchanged. |
| normalizeHTML | Normalizes HTML. Same as escapeHTML, but does not encode ampersands. |
| {escape,normalize}HTMLRcdata | Like escapeHTML but does not exempt pre-sanitized content, since RCDATA (`<title>` and `<textarea>`) can't contain tags. |
| {escape,normalize}HTMLAttribute | Like escapeHTML but strips tags from pre-sanitized content. |
| filterHtmlElementName | Rejects any invalid element name or non-PCDATA element. |
| filterHtmlAttribName | Rejects any invalid attribute name or attribute name that has JS, CSS, or URI content. |
| {escape,normalize}URI | Percent-encodes (assuming UTF-8) URI, HTML, JS, and CSS special characters to allow safe embedding. This means encoding parentheses and single quotes, which should not be normalized according to RFC 3986 and is not valid for all non-hierarchical URI schemes; but the only productions using single quotes or parentheses are obsolete mark productions, and normalizing these characters is essential to safely embedding URIs in unquoted CSS `url(…)` and to making sure that CSS error recovery mode doesn't jump into the middle of a quoted string. |
| filterNormalizeUri | Like normalizeUri but rejects any input that has a protocol other than `http`, `https`, or `mailto`. |
| {escape,normalize}JSStringChars | Uses `\uABCD`-style escapes for code units special in HTML, JS, or conditional compilation. |
| {escape,normalize}JSRegexChars | Like {escape,normalize}JSStringChars but also escapes regular expression specials such as `$`. |
| {escape,normalize}JSValue | Encodes booleans and numbers wrapped in spaces; otherwise quotes and escapes. |
| escapeCSSStringChars | Uses `\ABCD`-style escapes to escape HTML and CSS special characters. |
| filterCssIdentOrValue | Allows classes, ids, property-name parts for bidi, CSS keyword values, colors, and quantities. |
| noAutoescape | Passes its input through unchanged. This is an auditable exception to auto-sanitization. |
Sanitized content allows template users to pre-sanitize some content and to pass through approved structured content. `new SanitizedContent('<b>Hello, World!</b>')` specifies a chunk of HTML that its creator asserts is safe to embed in HTML PCDATA.

It is possible for misuse of this feature to violate all the safety properties contextual auto-sanitization provides. We assert that allowing it makes it easier to migrate code that has no XSS safety net to a better place, and that it satisfies some compelling use cases, including HTML translated into foreign languages by trusted translators, and HTML from tag whitelisters, wiki-text-to-HTML converters, and rich text editors. But it needs to be used carefully. Developers should:
As noted above (in the runtime contextual auto-sanitization strawman), static approaches (including ours) cannot handle all possible uses of dynamic attribute and element names. These seem rare in real code and relatively easy to fix, but if necessary, a hybrid runtime/static approach could address this problem.
Static approaches get into corner cases around zero-length untrusted values. For example, to preserve the code effect property, we need to make sure that no untrusted value specifies a `javascript:` or similar URL protocol. In template code like `<img src="{$x}{$y}">` we might naively decide that it is sufficient to filter `$x` to make sure that it specifies no protocol or an approved one. But if `$x` is the empty string, then `$y` might still specify a dangerous protocol. Alternatively, `$x` might specify `"javascript"` and `$y` start with a colon. This hole can be closed in a number of ways, e.g. by percent-encoding so that `java%73cript:alert(1337)` is not a dangerous URL.

Similar problems arise with JavaScript regular expressions: in `var myPattern = /{$x}/` an empty `$x` would turn the regular expression literal into a line comment, and there are similar special-case fixes (`/(?:)/` is not a comment).

But a general solution to empty strings would be a source of considerable complexity. Simply making sanitizer functions variadic (`{$x}{$y}` → `{filterNormalizeUri($x, $y)}`) will not suffice, because the two interpolations might cross template boundaries.
Our JavaScript parser is unsound. JavaScript does not have a regular lexical grammar (even ignoring conditional compilation) because of the way it specifies whether a `/` starts a regular expression or a division operator. We use a scheme based on a draft JavaScript 1.9 grammar devised by Waldemar Horwat that makes that decision based on the last non-comment token. This works well for all the code we've seen that people actually write, and makes our approach feasible, but there is a known case where it fails: `x++ /a/i` vs `x = ++/a/i`. The second code snippet, while nonsensical, is valid JavaScript that our scheme fails to handle correctly.
Our parser does not currently recognize HTML5 escaping text spans, the regions inside `<script>` and `<style>` bodies delimited by `<!--` and `-->` that suppress end-tag processing. This can be fixed if a codebase turns out to use them. Our sanitization function choices are designed not to produce content containing escaping text span boundaries.

Our parser does not descend into HTML, CSS, or JS in `data:` URLs. We could, but we have not encountered the need in existing code.
We studied 1035 templates that were migrated from an existing codebase to use contextually sanitized templates. Most of the templates were relatively small, but they totaled 21098 LOC and 783 kB. The compilation load-time cost for these 1035 templates was 998,339,279 ns (about 1 s) on a platform with 2 GB of RAM and an Intel 2.6 GHz dual-core processor running Linux 2.6.31.
LOC | # templates |
---|---|
1- 18 | ######################################## (685) |
19- 36 | ############ (210) |
37- 55 | #### (78) |
56- 73 | # (33) |
74- 91 | (10) |
92- 110 | (7) |
111- 128 | (4) |
129- 147 | (3) |
148- 220 | (1) x 4 |
221- 294 | (0) x 4 |
295- 312 | (1) |
Most of the sanitization functions chosen were plain text→HTML. Non-contextual auto-sanitization is correct 63% of the time, assuming the auto-sanitizer is sufficient in the Html, HtmlAttribute, and HtmlRcdata contexts. If values were aggressively filtered to prevent dangerous URLs from appearing in the template input, then non-contextual auto-sanitization would be sufficient in 77% of cases. The rates might be higher for a codebase written for non-contextual sanitization by developers aware of its limitations.
| Directive | Count |
|---|---|
| escapeHtml | 602 |
| escapeHtmlAttribute | 380 |
| filterNormalizeUri, escapeHtmlAttribute | 231 |
| escapeJsValue | 39 |
| filterCssValue | 33 |
| escapeJsString | 27 |
| escapeUri | 15 |
| escapeHtmlRcdata | 10 |
| escapeHtmlAttributeNospace | 7 |
| filterHtmlIdent | 3 |
| filterNormalizeUri | 1 |
268 out of 1348 interpolation sites require runtime filtering (19.9%), mostly filterNormalizeUri.
The benchmark runs over a large template with dummy data that is meant to be representative of the application using it. The benchmarks range from 15.2 to 16.8 ms and the standard deviation is roughly 0.6 ms, which puts the runtime cost of the sanitization functions in the noise.
| Scenario | 50% scenario time | σ | Trials |
|---|---|---|---|
| No sanitization | 16,709,334.99 ns | 615,548.54 ns | 10 |
| Non-contextual auto-sanitization | 16,835,324.39 ns | 6,030,836.03 ns | 10 |
| Full contextual auto-sanitization | 15,227,861.39 ns | 616,193.00 ns | 10 |
In JavaScript, a state-machine-based runtime contextual auto-sanitization approach shows a 3–4× slowdown over string concatenation.
# rows | string += | Array.join | open(Template(…)) | DOM | render time |
---|---|---|---|---|---|
1000 | 54 ms | 68 ms | 204 ms | 508 ms | 586 ms |
5000 | 267 ms | 332 ms | 1159 ms | 2528 ms | 1458 ms |
We ran the same benchmark against a runtime contextual auto-sanitizer we wrote for JavaScript. The "noEscape" case simply appends all the strings to a buffer; it does no context inference. The "parseOnly" case appends to a buffer and does context inference, but no escaping. The "dynEscape" case does context propagation and chooses one of three escaping methods by looking at the context from the parser. The cost of applying the escaping directive is about the same as a string copy, and the cost of parsing and propagating context at runtime is about 6 times that cost. This benchmark is a good comparison for templates where the logic that computes the values filling data holes is simple, so the cost of executing the template should approach that of string concatenation.
| For 1000 runs | noEscape | parseOnly | dynEscape |
|---|---|---|---|
| Time (ratio vs. noEscape) | 491,316,000 ns (1.0) | 2,979,672,000 ns (6.1) | 3,531,971,000 ns (7.2) |