URL

Living Standard — Last Updated 28 December 2012

This Version:
http://url.spec.whatwg.org/
Participate:
Send feedback to whatwg@whatwg.org (archives) or file a bug (open bugs)
IRC: #whatwg on Freenode
Version History:
https://github.com/whatwg/url/commits
@urlstandard
Editor:
Anne van Kesteren

Table of Contents

  Goals
  1 Conformance
  2 Terminology
  3 Percent-encoded bytes
  4 Hosts and IP addresses
    4.1 Writing
    4.2 Parsing
    4.3 Serializing
  5 URLs
    5.1 Writing
    5.2 Parsing
    5.3 Serializing
  6 application/x-www-form-urlencoded
  7 API
    7.1 Constructors
    7.2 Interface URLUtils
    7.3 Interface URLQuery
  References
  Acknowledgments

Goals

The URL standard sets out to make URLs fully predictable and interoperable. This is the plan:

As the editor learns more about the subject matter the goals might increase in scope somewhat.

1 Conformance

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this specification are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

2 Terminology

Some terms used in this specification are defined in the Encoding Standard. [ENCODING]

The EOF code point signifies the end of a string or code point stream.

The ASCII digits are code points in the range U+0030 to U+0039.

The ASCII hex digits are ASCII digits or are code points in the range U+0041 to U+0046 or in the range U+0061 to U+0066.

The ASCII alpha are code points in the range U+0041 to U+005A or in the range U+0061 to U+007A.

The ASCII alphanumeric are ASCII digits or ASCII alpha.

The domain label separators are the code points U+002E, U+3002, U+FF0E, and U+FF61.

3 Percent-encoded bytes

A percent-encoded byte is "%", followed by two ASCII hex digits. Sequences of percent-encoded bytes, after conversion to bytes, should not cause utf-8 decode to emit any decoder errors.

To percent encode a byte into a percent-encoded byte, return a string consisting of "%", followed by a double-digit, uppercase, hexadecimal representation of byte.

To percent decode a string into a byte sequence, run these steps:

  1. Let pointer be a pointer into string, initially zero (pointing to the first code point), and let c be the code point it points to.

  2. Let remaining be the substring after pointer in string.

  3. Let bytes be an empty byte sequence.

  4. While c is not the EOF code point, run these substeps:

    1. While c is neither "%" nor the EOF code point, append to bytes a byte whose value is c's code point and increase pointer by one.

    2. If c is "%" and remaining does not start with two ASCII hex digits, append to bytes a byte whose value is c's code point, increase pointer by one, and run these substeps again.

    3. While c is "%" and remaining starts with two ASCII hex digits, append to bytes a byte whose value is remaining's two leading code points, interpreted as a hexadecimal number, and increase pointer by three.

  5. Return bytes.
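
The percent encode and percent decode steps above can be sketched roughly as follows. This is an illustrative reading, not a normative implementation: the function names are hypothetical, and rejecting code points above U+00FF is an assumption, since the steps do not say how such a code point would become a single byte.

```python
HEX_DIGITS = set("0123456789ABCDEFabcdef")  # the ASCII hex digits


def percent_encode(byte: int) -> str:
    """Return "%" followed by the two-digit, uppercase hex form of byte."""
    return "%{:02X}".format(byte)


def percent_decode(string: str) -> bytes:
    """Turn a string into a byte sequence, decoding valid %XX sequences."""
    out = bytearray()
    pointer = 0
    while pointer < len(string):
        c = string[pointer]
        remaining = string[pointer + 1:pointer + 3]
        if c == "%" and len(remaining) == 2 and all(h in HEX_DIGITS for h in remaining):
            # A percent-encoded byte: consume "%" plus two hex digits.
            out.append(int(remaining, 16))
            pointer += 3
            continue
        if ord(c) > 0xFF:
            # Assumption: the steps do not cover this case, so the sketch rejects it.
            raise ValueError("code point does not fit in a single byte")
        # Any other code point (including a stray "%") is copied as one byte.
        out.append(ord(c))
        pointer += 1
    return bytes(out)
```

A "%" that is not followed by two ASCII hex digits is passed through unchanged, so decoding never fails on malformed ASCII input.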

The simple encode set consists of all code points less than U+0020 (i.e. excluding U+0020) and all code points greater than U+007E.

The default encode set is the simple encode set and code points U+0020, '"', "#", "<", ">", "?", and "`".

The password encode set is the default encode set and code points "/", "@", and "\".

The username encode set is the password encode set and code point ":".

To utf-8 percent encode a code point, using an encode set, run these steps:

  1. If code point is not in encode set, return code point.

  2. Let bytes be the result of running utf-8 encode on code point.

  3. Percent encode each byte in bytes, and then return them concatenated, in the same order.
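
As a rough illustration of how the encode sets nest and how utf-8 percent encode uses them, the following sketch models each encode set as a predicate over a single code point. All function names are hypothetical.

```python
def in_simple_encode_set(cp: str) -> bool:
    # Code points below U+0020 or above U+007E.
    return ord(cp) < 0x20 or ord(cp) > 0x7E


def in_default_encode_set(cp: str) -> bool:
    # Simple encode set plus U+0020, '"', "#", "<", ">", "?", and "`".
    return in_simple_encode_set(cp) or cp in ' "#<>?`'


def in_password_encode_set(cp: str) -> bool:
    # Default encode set plus "/", "@", and "\".
    return in_default_encode_set(cp) or cp in "/@\\"


def in_username_encode_set(cp: str) -> bool:
    # Password encode set plus ":".
    return in_password_encode_set(cp) or cp == ":"


def utf8_percent_encode(cp: str, in_encode_set) -> str:
    """Return cp unchanged, or its UTF-8 bytes percent encoded."""
    if not in_encode_set(cp):
        return cp
    return "".join("%{:02X}".format(b) for b in cp.encode("utf-8"))
```

For example, "@" is left alone under the default encode set but becomes "%40" under the password encode set.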

4 Hosts and IP addresses

A host is a string that represents a network address, either in the form of a domain or an IPv6 address.

This is a slightly more generic definition of host than its traditional meaning for the sake of convenience.

A domain is an ordered list of one or more domain labels.

An IPv6 address is a 128-bit identifier and for the purposes of this specification represented as an ordered list of eight 16-bit pieces. [IPV6]

4.1 Writing

A host must be either a domain or "[", followed by an IPv6 address, followed by "]".

A domain is one or more domain labels separated from each other by a domain label separator, optionally followed by a domain label separator.

A trailing domain label separator signifies an empty domain label.

A domain label is ...

An IPv6 address is defined in the "Text Representation of Addresses" chapter of IP Version 6 Addressing Architecture. [IPV6]

4.2 Parsing

The host parser takes a string input and then runs these steps:

  1. If input starts with "[", run these substeps:

    1. If input does not end with "]", return failure.

    2. Return the result of running the IPv6 parser on input with its leading "[" and trailing "]" removed.

  2. Let host be the result of running utf-8's decoder on the percent decoding of input.

  3. IDNA hell

The IPv6 parser takes a string input and then runs these steps:

  1. Let address be a new IPv6 address with its 16-bit pieces initialized to 0.

  2. Let piece pointer be a pointer into address's 16-bit pieces, initially zero (pointing to the first 16-bit piece), and let piece be the 16-bit piece it points to.

  3. Let compress pointer be another pointer into address's 16-bit pieces, initially null and pointing to nothing.

  4. Let pointer be a pointer into input, initially zero (pointing to the first code point), and let c be the code point it points to.

  5. Let remaining be the substring after pointer in input.

  6. If c is ":", run these substeps:

    1. If remaining does not start with ":", return failure.

    2. Increase pointer by two.

    3. Increase piece pointer by one and then set compress pointer to piece pointer.

  7. Main: While c is not the EOF code point, run these substeps:

    1. If piece pointer is eight, return failure.

    2. If c is ":", run these inner substeps:

      1. If compress pointer is not null, return failure.
      2. Increase piece pointer by one, set compress pointer to piece pointer, and then jump to Main.
    3. Let value and length be 0.

    4. While length is less than 4 and c is an ASCII hex digit, set value to value × 0x10 + c interpreted as hexadecimal number, and increase pointer and length by one.

    5. Based on c:

      "."

      If length is 0, return failure.

      Decrease pointer by length.

      Jump to IPv4.

      ":"

      Increase pointer by one.

      If c is the EOF code point, return failure.

      Anything but the EOF code point

      Return failure.

    6. Set piece to value.

    7. Increase piece pointer by one.

  8. If c is the EOF code point, jump to Finale.

  9. IPv4: If piece pointer is greater than six, return failure.

  10. Let dots seen be 0.

  11. While c is not the EOF code point, run these substeps:

    1. Let value be 0.

    2. While c is an ASCII digit, set value to value × 10 + c interpreted as decimal number and increase pointer by one.

    3. If value is greater than 255, return failure.

    4. If dots seen is less than 3 and c is not a ".", return failure.

    5. If dots seen is 3 and c is not the EOF code point, return failure.

    6. Set piece to piece × 0x100 + value.

    7. If dots seen is 1 or 3, increase piece pointer by one.

    8. Increase dots seen by one.

  12. Finale: If compress pointer is not null, run these substeps:

    1. Let swaps be piece pointer − compress pointer.

    2. Set piece pointer to seven.

    3. While neither piece pointer nor swaps is zero, replace piece with the piece at pointer compress pointer + swaps and then decrease piece pointer and swaps by one.

  13. Otherwise, if compress pointer is null and piece pointer is not eight, return failure.

  14. Return address.
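
To get a feel for the data model the IPv6 parser produces (an ordered list of eight 16-bit pieces, or failure), here is a small sketch that leans on Python's standard `ipaddress` module. It is a stand-in, not a transcription of the steps above, and it may accept or reject some inputs differently from this algorithm. The function name is hypothetical and failure is modelled as `None`.

```python
import ipaddress


def ipv6_pieces(text: str):
    """Return the eight 16-bit pieces of an IPv6 address, or None on failure."""
    try:
        packed = ipaddress.IPv6Address(text).packed  # 16 bytes, network order
    except ValueError:
        return None
    # Group the 16 bytes into eight big-endian 16-bit pieces.
    return [int.from_bytes(packed[i:i + 2], "big") for i in range(0, 16, 2)]
```

Each dotted-decimal pair at the end of an address contributes one piece, with the first number in the high 8 bits and the second in the low 8 bits.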

4.3 Serializing

The host serializer takes a host host and then runs these steps:

  1. If host is null, return the empty string.

  2. If host is an IPv6 address, return "[", followed by the result of running the IPv6 serializer on host, followed by "]".

  3. If host is a domain ...

The IPv6 serializer takes an IPv6 address address and then runs these steps:

  1. Let output be the empty string.

  2. Let compress pointer be a pointer to the first 16-bit piece in the first longest sequence of address's 16-bit pieces that are 0.

    In 0:f:0:0:f:f:0:0 it would point to the second 0.

  3. If there is no sequence of address's 16-bit pieces that are 0 longer than one, set compress pointer to null.

  4. For each piece in address's pieces, run these substeps:

    1. If compress pointer points to piece, append "::" to output and then run these substeps again with all subsequent pieces in address's pieces that are 0 skipped, or go to the next step in the overall set of steps if that leaves no pieces.

    2. Append piece, represented as the shortest possible lowercase hexadecimal number, to output.

    3. If piece is not the last of address's pieces, append ":" to output.

  5. Return output.

This algorithm requires the recommendation from A Recommendation for IPv6 Address Text Representation. [IPV6TEXT]
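
The IPv6 serializer can be sketched along these lines. The function name is hypothetical, and one detail is this sketch's own reading rather than something the steps state: when the compressed run does not start the address, only a single ":" is appended for it, because the preceding piece has already contributed a ":".

```python
def serialize_ipv6(pieces):
    """Serialize eight 16-bit pieces, compressing the first longest zero run."""
    # Locate the first longest run of zero pieces.
    best_start, best_len = None, 0
    i = 0
    while i < 8:
        if pieces[i] == 0:
            j = i
            while j < 8 and pieces[j] == 0:
                j += 1
            if j - i > best_len:  # strictly longer, so the first run wins ties
                best_start, best_len = i, j - i
            i = j
        else:
            i += 1
    if best_len < 2:
        best_start = None  # a single zero piece is never compressed

    output = ""
    i = 0
    while i < 8:
        if i == best_start:
            # Assumption: "::" at the start, otherwise one extra ":".
            output += "::" if i == 0 else ":"
            i += best_len
            continue
        output += format(pieces[i], "x")  # shortest lowercase hexadecimal
        if i != 7:
            output += ":"
        i += 1
    return output
```

With the pieces of 0:f:0:0:f:f:0:0, the run starting at the second 0 is compressed and the later run of equal length is left as is.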

5 URLs

A URL is a string that represents an identifier.

A URL is either a relative URL or an absolute URL. Either form can be followed by a fragment.

A relative URL is a URL that is relative to a parsed URL. Such a parsed URL is a base URL. If the base URL has no relative scheme, parsing the relative URL results in failure.

An absolute URL stands on its own and is therefore a potential base URL.

Parsing (provided it does not return failure) and serializing a URL will turn it into an absolute URL. The intermediate form is named a parsed URL. The components a URL can consist of and parsed URL consists of are scheme, scheme data (not used if scheme is a relative scheme), username, password, host, port, path, query, and fragment.
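
The relationship between a relative URL and its base URL can be illustrated with Python's standard `urllib.parse.urljoin`. Note that `urljoin` follows the older RFC-based resolution rules rather than the parser defined in this specification, so it serves only as an approximation for simple inputs. The example URLs and the helper name are made up for illustration.

```python
from urllib.parse import urljoin

BASE = "http://example.org/a/b?x=1"  # an absolute URL used as base URL


def resolve(relative: str) -> str:
    """Resolve a relative URL against BASE, yielding an absolute URL."""
    return urljoin(BASE, relative)


print(resolve("c"))                    # path-relative
print(resolve("/d"))                   # absolute-path-relative
print(resolve("//other.example/e"))    # scheme-relative
print(resolve("https://example.com/")) # already absolute, base is ignored
```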

A relative scheme is a scheme listed in the first column of the following table. A default port is a relative scheme's optional corresponding port and is listed in the second column on the same row.

scheme    port
"ftp"     "21"
"file"
"gopher"  "70"
"http"    "80"
"https"   "443"
"ws"      "80"
"wss"     "443"
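
One possible way to hold this table in code is a mapping from scheme to default port, with `None` standing for "file", which has no default port. The names below are hypothetical.

```python
# Relative schemes and their default ports, as listed in the table above.
DEFAULT_PORTS = {
    "ftp": "21",
    "file": None,  # a relative scheme without a default port
    "gopher": "70",
    "http": "80",
    "https": "443",
    "ws": "80",
    "wss": "443",
}


def is_relative_scheme(scheme: str) -> bool:
    """True if scheme appears in the first column of the table."""
    return scheme in DEFAULT_PORTS


def default_port(scheme: str):
    """Return the default port string, or None if there is none."""
    return DEFAULT_PORTS.get(scheme)
```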

5.1 Writing

A URL must be either a relative URL or an absolute URL, optionally followed by "#" and a fragment.

An absolute URL is a scheme, followed by ":", followed by scheme data, optionally followed by "?" and a query.

A scheme is one ASCII alpha, followed by zero or more of ASCII alphanumeric, "+", "-", and ".". A scheme must be registered ....

The syntax of scheme data depends on the scheme and is typically defined alongside it. For a relative scheme, scheme data is a scheme-relative URL. For other schemes, specifications or standards must define scheme data within the constraints of zero or more URL units.

A relative URL is either a scheme-relative URL, an absolute-path-relative URL, or a path-relative URL that does not start with a scheme and ":", optionally followed by a "?" and a query.

A relative URL must be relative to a base URL with a relative scheme.

A scheme-relative URL is "//", optionally followed by userinfo and "@", followed by a host, optionally followed by ":" and a port, optionally followed by an absolute-path-relative URL.

Userinfo is a username, optionally followed by a ":" and a password.

A username is zero or more URL units, excluding "/", ":", "?", and "@".

A password is zero or more URL units, excluding "/", "?", and "@".

A port is zero or more ASCII digits.

An absolute-path-relative URL is "/", followed by a path-relative URL that does not start with "/".

A path-relative URL is zero or more path segments separated from each other by a "/".

A path segment is zero or more URL units, excluding "/" and "?".

A query is zero or more URL units.

A fragment is zero or more URL units.

The URL code points are ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFEF, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.

Code points higher than U+009F will be converted to percent-encoded bytes by the URL parser.

The URL units are URL code points and percent-encoded bytes.

5.2 Parsing

Aside from the components mentioned earlier, a parsed URL also has an associated relative flag.

To clear a parsed URL, set its scheme, scheme data, username, and port to the empty string, password, host, query, and fragment to null, and its path to the empty list.

Add the ability to halt on the first conformance error.

The URL parser takes a string input, optionally with a base URL base, optionally with an encoding encoding override, optionally with a parsed URL url, and if url is given, optionally with a state override state override, and then runs these steps:

The url and state override arguments can be used for API manipulation of a parsed URL.

  1. If url is not given:

    1. Set url to a new parsed URL.

    2. Clear url.

    3. Remove any leading and trailing ASCII whitespace from input.

  2. Let state be state override if given, or scheme start state otherwise.

  3. If base is not given, set it to null.

  4. If encoding override is not given, set it to utf-8.

  5. Let buffer be the empty string.

  6. Let the @ flag and the [] flag be unset.

  7. Let pointer be a pointer to the first code point in input.

  8. Keep running the following state machine by switching on state, increasing pointer by one after each time it is run, as long as pointer does not point past the end of input.

    Let c be the code point to which pointer points.

    Let remaining be the substring starting after pointer in input.

    If input is "mailto:example@example" and pointer points to "@", remaining is "example".

    scheme start state
    1. If c is an ASCII alpha, append c, lowercased, to buffer, and set state to scheme state.

    2. Otherwise, if state override is not given, set state to no scheme state, and decrease pointer by one.

    3. Otherwise, parse error, terminate this algorithm.

    scheme state
    1. If c is an ASCII alphanumeric, "+", "-", or ".", append c, lowercased, to buffer.

    2. Otherwise, if c is ":", set url's scheme to buffer, buffer to the empty string, and then run these substeps:

      1. If state override is given, terminate this algorithm.

      2. If url's scheme is a relative scheme, set url's relative flag.

      3. If url's scheme is "file", set state to relative state.

      4. Otherwise, if url's relative flag is set, base is not null and base's scheme is equal to url's scheme, set state to relative state.

      5. Otherwise, if url's relative flag is set, set state to authority start state.

      6. Otherwise, set state to scheme data state.

    3. Otherwise, if state override is not given, set buffer to the empty string, state to no scheme state, and start over (from the first code point in input).

    4. Otherwise, if c is the EOF code point, terminate this algorithm.

    5. Otherwise, parse error, terminate this algorithm.

    scheme data state
    1. If c is "?", set url's query to the empty string and state to query state.

    2. Otherwise, if c is "#", set url's fragment to the empty string and state to fragment state.

    3. Otherwise, run these substeps:

      1. If c is not the EOF code point, not a URL code point, and not "%" while remaining starts with two ASCII hex digits, parse error.

      2. If c is none of EOF code point, U+0009, U+000A, and U+000D, utf-8 percent encode c using the simple encode set, and append the result to url's scheme data.

    no scheme state

    If base is null, or base's scheme is not a relative scheme, return failure.

    You do not want to check base's relative flag here, as the scheme itself can have been changed to something nonsensical through the protocol attribute.

    Otherwise, set state to relative state, and decrease pointer by one.

    relative state

    Set url's relative flag, set url's scheme to base's scheme, and then, based on c:

    EOF code point

    Set url's host to base's host, url's port to base's port, url's path to base's path, and url's query to base's query.

    "/"
    "\"

    If remaining starts with either "/" or "\", increase pointer by one, and run these steps:

    1. If url's scheme is "file", set state to file host state.

    2. Otherwise set state to authority start state.

    Otherwise, set url's host to base's host, url's port to base's port, state to relative path start state, and decrease pointer by one.

    "?"

    Set url's host to base's host, url's port to base's port, url's path to base's path, url's query to the empty string, and state to query state.

    "#"

    Set url's host to base's host, url's port to base's port, url's path to base's path, url's query to base's query, url's fragment to the empty string, and state to fragment state.

    Otherwise

    Set url's host to base's host, url's port to base's port, url's path to base's path, then remove url's path's last string, set state to relative path start state, and decrease pointer by one.

    authority start state

    If c is neither "/" nor "\", set state to authority state, and decrease pointer by one.

    authority state
    1. If c is "@", run these substeps:

      1. If the @ flag is set, prepend "%40" to buffer.

      2. Set the @ flag.

      3. For each code point in buffer, run these substeps:

        1. If code point is ":" and url's password is null, set url's password to the empty string and continue.

        2. utf-8 percent encode code point using the default encode set and append the result to url's password if url's password is non-null, and to url's username otherwise.

      4. Set buffer to the empty string.

    2. If c is one of EOF code point, "/", "\", "?", and "#", decrease pointer by the number of code points in buffer, set buffer to the empty string, and state to host state.

    3. Otherwise, if c is none of U+0009, U+000A, and U+000D, append c to buffer.

    file host state
    1. If c is one of EOF code point, "/", "\", "?", and "#", decrease pointer by one, and run these substeps:

      1. If buffer consists of two code points, of which the first is an ASCII alpha and the second is either ":" or "|", set state to relative path state.

        This is a quirk for parsing Windows drive letters and therefore buffer is not reset here.

      2. Otherwise, run these steps:

        1. Let host be the result of host parsing buffer.

        2. If host is failure, return failure.

        3. Set url's host to host, buffer to the empty string, and state to relative path start state.

    2. Otherwise, if c is none of U+0009, U+000A, and U+000D, append c to buffer.

    host state
    hostname state
    1. If c is ":" and the [] flag is unset, run these substeps:

      1. Let host be the result of host parsing buffer.

      2. If host is failure, return failure.

      3. Set url's host to host, buffer to the empty string, and state to port state.

      4. If state override is hostname state, terminate this algorithm.

    2. Otherwise, if c is one of EOF code point, "/", "\", "?", and "#", decrease pointer by one, and run these substeps:

      1. Let host be the result of host parsing buffer.

      2. If host is failure, return failure.

      3. Set url's host to host, buffer to the empty string, and state to relative path start state.

      4. If state override is given, terminate this algorithm.

    3. Otherwise, if c is none of U+0009, U+000A, and U+000D, run these substeps:

      1. If c is "[", set the [] flag.

      2. If c is "]", unset the [] flag.

      3. Append c to buffer.

    port state
    1. If c is an ASCII digit, append c to buffer.

    2. Otherwise, if c is one of EOF code point, "/", "\", "?", and "#", or state override is given, run these substeps:

      1. Remove leading U+0030 code points from buffer until either the leading code point is not U+0030 or buffer is one code point.

        Input     Output
        "42"      "42"
        "031"     "31"
        "080"     "80"
        "0000"    "0"
      2. If buffer is equal to url's scheme's default port, set buffer to the empty string.

      3. Set url's port to buffer.

      4. If state override is given, terminate this algorithm.

      5. Set buffer to the empty string, state to relative path start state, and decrease pointer by one.

    3. Otherwise, return failure.

    relative path start state

    Set state to relative path state and if c is neither "/" nor "\", decrease pointer by one.

    relative path state
    1. If either c is one of EOF code point, "/", and "\", or state override is not given and c is one of "?" and "#", run these substeps:

      1. If buffer is ".." and c is one of EOF code point, "/", and "\", set the last string in url's path to the empty string.

      2. Otherwise, if buffer is "..", remove the last string from url's path.

      3. Otherwise, if buffer is "." and c is one of EOF code point, "/", and "\", append an empty string to url's path.

      4. Otherwise, if buffer is not ".", run these subsubsteps:

        1. If url's scheme is "file", url's path is the empty list, buffer consists of two code points, of which the first is an ASCII alpha, and the second is either ":" or "|", replace the second code point in buffer with ":".

          Windows drive letters are beautiful, no?

        2. Append buffer to url's path.

      5. Set buffer to the empty string.

      6. If c is "?", set url's query to the empty string, and state to query state.

      7. If c is "#", set url's fragment to the empty string, and state to fragment state.

    2. Otherwise, if c is "%" and remaining starts with either "2E" or "2e", increase pointer by two, and append "." to buffer.

    3. Otherwise, if c is none of U+0009, U+000A, and U+000D, utf-8 percent encode c using the default encode set, and append the result to buffer.

    query state
    1. If c is the EOF code point or state override is not given and c is "#", run these substeps:

      1. If url's relative flag is set, set encoding override to utf-8.

      2. Set buffer to the result of running encoding override's encoder on buffer. Whenever the encoder algorithm emits an encoder error, emit a 0x3F byte instead and do not terminate the algorithm.

      3. For each byte in buffer run these subsubsteps:

        1. If byte is less than 0x21, greater than 0x7E, or is one of 0x22, 0x23, 0x3C, 0x3E, and 0x60, append byte, percent encoded, to url's query.

        2. Otherwise, append a code point whose value is byte to url's query.

      4. Set buffer to the empty string.

      5. If c is "#", set url's fragment to the empty string, and state to fragment state.

    2. Otherwise, if c is none of U+0009, U+000A, and U+000D, append c to buffer.

    fragment state

    If c is none of EOF code point, U+0009, U+000A, and U+000D, utf-8 percent encode c using the simple encode set, and append the result to url's fragment.

  9. Return url.
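
As one small piece of the state machine, the byte-level encoding performed by the query state can be sketched as below. This covers only the substeps that turn buffer into the query string. The function name is hypothetical, and the encoding argument is a Python codec name standing in for the encoding override. Python's `errors="replace"` substitutes "?" (0x3F) for unencodable code points, which matches the encoder error handling described in that state.

```python
# Bytes that are percent encoded in addition to those below 0x21 or above 0x7E.
QUERY_EXTRA = {0x22, 0x23, 0x3C, 0x3E, 0x60}  # '"', "#", "<", ">", "`"


def encode_query(buffer: str, encoding: str = "utf-8") -> str:
    """Encode buffer with the given encoding and percent encode unsafe bytes."""
    data = buffer.encode(encoding, errors="replace")  # encoder errors become 0x3F
    out = []
    for byte in data:
        if byte < 0x21 or byte > 0x7E or byte in QUERY_EXTRA:
            out.append("%{:02X}".format(byte))
        else:
            out.append(chr(byte))
    return "".join(out)
```

The same input can therefore yield different queries depending on the encoding: "é" becomes "%C3%A9" in utf-8 but "%E9" in a single-byte Latin encoding.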

5.3 Serializing

The URL serializer takes a parsed URL url, optionally an exclude fragment flag, and then runs these steps:

  1. Let output be url's scheme and ":" concatenated.

  2. If url's relative flag is set:

    1. Append "//" to output.

    2. If url's username is not the empty string or url's password is non-null, run these substeps:

      1. Append url's username to output.

      2. If url's password is non-null, append ":" concatenated with url's password to output.

      3. Append "@" to output.

    3. Append url's host, serialized, to output.

    4. If url's port is not the empty string, append ":" concatenated with url's port to output.

    5. Append "/" concatenated with the strings in url's path (including empty strings), separated from each other by "/" to output.

    6. If url's query is non-null, append "?" concatenated with url's query to output.

  3. Otherwise, if url's relative flag is unset, append url's scheme data to output.

  4. If the exclude fragment flag is unset and url's fragment is non-null, append "#" concatenated with url's fragment to output.

  5. Return output.
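
The serializer steps map fairly directly onto code. The sketch below uses a hypothetical `ParsedURL` record whose fields mirror the components listed earlier. As a simplifying assumption, host is stored as an already-serialized string rather than as a domain or IPv6 address.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedURL:
    scheme: str = ""
    scheme_data: str = ""
    username: str = ""
    password: Optional[str] = None
    host: Optional[str] = None  # assumption: already serialized
    port: str = ""
    path: List[str] = field(default_factory=list)
    query: Optional[str] = None
    fragment: Optional[str] = None
    relative: bool = False  # the relative flag


def serialize_url(url: ParsedURL, exclude_fragment: bool = False) -> str:
    output = url.scheme + ":"
    if url.relative:
        output += "//"
        if url.username != "" or url.password is not None:
            output += url.username
            if url.password is not None:
                output += ":" + url.password
            output += "@"
        output += url.host or ""  # a null host serializes to the empty string
        if url.port != "":
            output += ":" + url.port
        output += "/" + "/".join(url.path)
        # As written in the steps, the query is appended only in this branch.
        if url.query is not None:
            output += "?" + url.query
    else:
        output += url.scheme_data
    if not exclude_fragment and url.fragment is not None:
        output += "#" + url.fragment
    return output
```

Note the distinction between a null and an empty password: an empty, non-null password still produces a ":" before the "@".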

6 application/x-www-form-urlencoded

The application/x-www-form-urlencoded parser takes a string input, optionally with an encoding encoding override, optionally with a use "_charset_" flag, and optionally with an isindex flag, and then runs these steps:

  1. Let strings be the result of splitting input on "&".

  2. If the isindex flag is set and the first string in strings does not contain a "=", prepend "=" to the first string in strings.

  3. If encoding override is not given, set it to utf-8.

  4. Let pairs be an empty list of name-value pairs.

  5. For each string string in strings, run these substeps:

    1. If string is the empty string, run these substeps again for the next string.

    2. If string contains a "=", then let name be the substring of string from the start of string up to but excluding its first "=", and let value be the substring from the first code point, if any, after the first "=" up to the end of string. If "=" is the first code point, then name will be the empty string. If it is the last, then value will be the empty string.

    3. Otherwise, let name have the value of string and let value be the empty string.

    4. Replace any "+" in name and value with U+0020.

    5. If the use "_charset_" flag is set, name is "_charset_", and get an encoding for value does not return failure, unset the use "_charset_" flag and set encoding override to the result of getting an encoding for value.

    6. Add a pair consisting of name and value to pairs.

  6. Replace each name-value pair in pairs with the result of running encoding override's decoder on the percent decoding of the name-value pair.

  7. Return pairs.
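
A simplified sketch of the parser follows. It omits the use "_charset_" flag entirely, uses Python's `urllib.parse.unquote_to_bytes` as a stand-in for percent decoding, and replaces undecodable bytes instead of modelling decoder errors precisely. The function and parameter names are hypothetical.

```python
from urllib.parse import unquote_to_bytes


def parse_urlencoded(input_: str, encoding: str = "utf-8", isindex: bool = False):
    """Parse an application/x-www-form-urlencoded string into (name, value) pairs."""
    strings = input_.split("&")
    if isindex and "=" not in strings[0]:
        strings[0] = "=" + strings[0]
    pairs = []
    for string in strings:
        if string == "":
            continue  # empty strings between "&" are skipped
        # partition() yields an empty value when there is no "=".
        name, _, value = string.partition("=")
        pairs.append((name.replace("+", " "), value.replace("+", " ")))

    def decode(text: str) -> str:
        # Percent decode, then run the decoder for the chosen encoding.
        return unquote_to_bytes(text).decode(encoding, errors="replace")

    return [(decode(name), decode(value)) for name, value in pairs]
```

Because "+" is replaced before percent decoding, a literal plus sign survives only when it was written as "%2B".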

The application/x-www-form-urlencoded byte serializer takes a byte sequence input, an encoding override, and then runs these steps:

  1. Let output be the empty string.

  2. For each byte in input, depending on byte:

    0x20

    Append U+002B to output.

    0x2A
    0x2D
    0x2E
    0x30 to 0x39
    0x41 to 0x5A
    0x5F
    0x61 to 0x7A

    Append a code point whose value is byte to output.

    Otherwise

    Append byte, percent encoded, to output.

  3. Return output.

The application/x-www-form-urlencoded serializer takes a list of name-value pairs pairs, optionally with an encoding encoding override, and then runs these steps:

  1. If encoding override is not given, set it to utf-8.

  2. Let output be the empty string.

  3. For each pair in pairs, run these substeps:

    1. Replace pair's name and value with the result of running encoding override's encoder on them, respectively. Whenever the encoder algorithm emits an encoder error, emit the result of running utf-8 encode on U+0026, U+0023, followed by one or more ASCII digits representing the code point that caused the encoder error in base ten, followed by U+003B.

    2. Replace pair's name and value with their serialization.

    3. If this is not the first pair, append "&" to output.

    4. Append pair's name, followed by "=", followed by pair's value to output.

  4. Return output.
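
The byte serializer and the serializer can be sketched together as shown below. Function names are hypothetical, and the encoding argument is a Python codec name standing in for the encoding override. Python's `errors="xmlcharrefreplace"` happens to produce the same "&#", decimal code point, ";" substitution that the steps describe for encoder errors.

```python
def serialize_bytes(data: bytes) -> str:
    """The byte serializer: space becomes "+", safe bytes pass, the rest are %XX."""
    out = []
    for byte in data:
        if byte == 0x20:
            out.append("+")
        elif (byte in (0x2A, 0x2D, 0x2E, 0x5F)
              or 0x30 <= byte <= 0x39
              or 0x41 <= byte <= 0x5A
              or 0x61 <= byte <= 0x7A):
            out.append(chr(byte))
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


def serialize_urlencoded(pairs, encoding: str = "utf-8") -> str:
    """Serialize (name, value) pairs, joined by "&"."""
    parts = []
    for name, value in pairs:
        # Unencodable code points become "&#NNNN;" before byte serialization.
        n = serialize_bytes(name.encode(encoding, errors="xmlcharrefreplace"))
        v = serialize_bytes(value.encode(encoding, errors="xmlcharrefreplace"))
        parts.append(n + "=" + v)
    return "&".join(parts)
```

The character reference produced for an encoder error is itself byte serialized, so its "&", "#", and ";" end up percent encoded in the output.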

7 API

[Constructor(DOMString url, optional (URL or DOMString) base)]
interface URL {
};
URL implements URLUtils;

[NoInterfaceObject]
interface URLUtils {
  stringifier attribute DOMString href;
  readonly attribute DOMString origin;

           attribute DOMString protocol;
           attribute DOMString username;
           attribute DOMString password;
           attribute DOMString host;
           attribute DOMString hostname;
           attribute DOMString port;
           attribute DOMString pathname;
           attribute DOMString search;
           attribute URLQuery? query;
           attribute DOMString hash;
};

Any object implementing URLUtils has an associated base URL base, input, query encoding, query object, and a parsed URL url. Unless stated otherwise, query encoding is utf-8, and query object and url are null. The others must be set on creation by the specification using URLUtils.

The associated query encoding is a legacy concept only relevant for HTML. [HTML]

When an object implementing URLUtils is created with a non-null url whose relative flag is set, query object must be set to a new URLQuery object using url's query.

Specifications defining objects implementing URLUtils may define update steps to make it possible for an underlying object (such as an attribute value) to be updated.

The update steps are always invoked after each potential modification. Specifications need to keep track themselves if an actual modification is made, if they wish to make that distinction.

7.1 Constructors

The URL(url, base) constructor must run these steps:

  1. If base is not given, set it to "about:blank".

  2. If base is a string, parse base and set base to the result of that algorithm.

  3. If base is failure, throw a "SyntaxError" exception.

  4. Let parsed URL be the result of parsing url with base URL base.

  5. If parsed URL is failure, throw a "SyntaxError" exception.

  6. Create a new URL object, set its url to parsed URL, base to base, and input to url, and then return the new object.

7.2 Interface URLUtils

The URLUtils interface is not exposed on the global object. It augments other interfaces, such as URL.

The href attribute must run these steps:

  1. If url is null, return input.

  2. Return the serialization of url.

Setting the href attribute must run these steps:

  1. Set url and query object to null.

  2. Set input to the given value.

  3. Let parsed URL be the result of parsing input with base URL base and query encoding as encoding override.

  4. If parsed URL is not failure, set url to parsed URL.

  5. If url is non-null and its relative flag is set, set query object to a new URLQuery object using url's query.

  6. Run the update steps.

The origin attribute must run these steps:

  1. If url is null, return the empty string.

  2. Return the Unicode serialization of url's origin. [ORIGIN]

It returns the Unicode rather than the ASCII serialization for compatibility with HTML's MessageEvent feature. [HTML]

The protocol attribute must run these steps:

  1. If url is null, return ":".

  2. Return scheme and ":" concatenated.

Setting the protocol attribute must run these steps:

  1. If url is null, terminate these steps.

  2. Parse the given value and ":" concatenated with url as url and scheme start state as state override.

  3. Run the update steps.

The username attribute must run these steps:

  1. If url is null, return the empty string.

  2. Return username.

Setting the username attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. Set username to the empty string.

  3. For each code point in the given value, utf-8 percent encode it using the username encode set, and append the result to username.

  4. Run the update steps.

The password attribute must run these steps:

  1. If url is null or its password is null, return the empty string.

  2. Return password.

Setting the password attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. If the given value is the empty string, set password to null, run the update steps, and terminate these steps.

  3. Set password to the empty string.

  4. For each code point in the given value, utf-8 percent encode it using the password encode set, and append the result to password.

  5. Run the update steps.

The host attribute must run these steps:

  1. If url is null, return the empty string.

  2. If port is the empty string, return host, serialized.

  3. Return host, serialized, ":", and port concatenated.
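
A minimal, non-normative sketch of the host getter follows. HostParts is a hypothetical shape in which serializedHost stands for "host, serialized" and port for the URL's port.

```typescript
// Hypothetical shape holding only the fields the host getter reads.
interface HostParts {
  serializedHost: string;
  port: string;
}

function getHost(url: HostParts | null): string {
  if (url === null) return "";                      // step 1
  if (url.port === "") return url.serializedHost;   // step 2
  return url.serializedHost + ":" + url.port;       // step 3
}

const hostWithoutPort = getHost({ serializedHost: "example.org", port: "" });
const hostWithPort = getHost({ serializedHost: "example.org", port: "8080" });
```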

Setting the host attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. Parse the given value with url as url, and host state as state override.

  3. Run the update steps.

The hostname attribute must run these steps:

  1. If url is null, return the empty string.

  2. Return host, serialized.

Setting the hostname attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. Parse the given value with url as url, and hostname state as state override.

  3. Run the update steps.

The port attribute must run these steps:

  1. If url is null, return the empty string.

  2. Return port.

Setting the port attribute must run these steps:

  1. If url is null, its relative flag is unset, or its scheme is "file", terminate these steps.

  2. Parse the given value with url as url, and port state as state override.

  3. Run the update steps.

The pathname attribute must run these steps:

  1. If url is null, return the empty string.

  2. If the relative flag is unset, return scheme data.

  3. Return "/" concatenated with the strings in path (including empty strings), separated from each other by "/".
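
The pathname getter can be sketched, non-normatively, as follows. PathParts is a hypothetical shape; relative models the relative flag, schemeData the scheme data, and path the list of path segments.

```typescript
// Hypothetical shape holding only the fields the pathname getter reads.
interface PathParts {
  relative: boolean;
  schemeData: string;
  path: string[];
}

function getPathname(url: PathParts | null): string {
  if (url === null) return "";               // step 1
  if (!url.relative) return url.schemeData;  // step 2
  return "/" + url.path.join("/");           // step 3, empty segments are kept
}

const nestedPath = getPathname({ relative: true, schemeData: "", path: ["a", "", "b"] });
const emptyPath = getPathname({ relative: true, schemeData: "", path: [] });
const nonRelative = getPathname({ relative: false, schemeData: "user@example.org", path: [] });
```

Because empty strings in path are preserved, a path of "a", the empty string, and "b" yields two consecutive "/" separators.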

Setting the pathname attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. Set path to the empty list.

  3. Parse the given value with url as url, and relative path start state as state override.

  4. Run the update steps.

The search attribute must run these steps:

  1. If url is null, or its query is either null or the empty string, return the empty string.

  2. Return "?" concatenated with query.

Setting the search attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. If the given value is the empty string, set query to null, set query object's associated list of name-value pairs to the empty list, run the update steps, and terminate these steps.

  3. Let input be the given value with a single leading "?" removed, if any.

  4. Set query to the empty string.

  5. Parse input with url as url, query state as state override, and the associated query encoding as encoding override.

  6. Set query object's associated list of name-value pairs to the result of parsing input as application/x-www-form-urlencoded.

  7. Run the update steps.
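
Two details of the search attribute are easy to get wrong: the getter treats a null query and an empty query alike, and the setter strips at most one leading "?". The non-normative sketch below illustrates both; getSearch and stripSingleLeading are hypothetical helper names.

```typescript
// Getter: a null query and an empty query both yield the empty string.
function getSearch(query: string | null): string {
  return query === null || query === "" ? "" : "?" + query;
}

// Setter, step 3: remove a single leading occurrence of the given
// one-character prefix, if present.
function stripSingleLeading(value: string, prefix: string): string {
  return value.charAt(0) === prefix ? value.substring(1) : value;
}

const searchFromQuery = getSearch("a=b");
const strippedOnce = stripSingleLeading("??a=b", "?");
```

The hash attribute follows the same pattern with "#" and fragment.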

The query attribute must return the query object.

Setting the query attribute must run these steps:

  1. Let object be the given value.

  2. If query object or object is null, terminate these steps.

  3. If object's url object is not null, set object to a new URLQuery object using object.

  4. Set query object to object.

  5. Run object's update steps.

The hash attribute must run these steps:

  1. If url is null, or its fragment is either null or the empty string, return the empty string.

  2. Return "#" concatenated with fragment.

Setting the hash attribute must run these steps:

  1. If url is null, or its scheme is "javascript", terminate these steps.

  2. If the given value is the empty string, set fragment to null, run the update steps, and terminate these steps.

  3. Let input be the given value with a single leading "#" removed, if any.

  4. Set fragment to the empty string.

  5. Parse input with url as url, and fragment state as state override.

  6. Run the update steps.

7.3 Interface URLQuery

JavaScript does not have a MultiMap type, but the intent is for URLQuery to be implemented in terms of an underlying multimap.

[Constructor(optional (DOMString or URLQuery or ...) init)]
interface URLQuery {
  DOMString? get(DOMString name);
  sequence<DOMString> getAll(DOMString name);
  void set(DOMString name, DOMString value);
  void append(DOMString name, DOMString value);
  boolean has(DOMString name);
  void delete(DOMString name);
  readonly attribute unsigned long size;
};

A URLQuery object has an associated list of name-value pairs, which is initially empty.

A URLQuery object has an associated url object which is an object implementing URLUtils whose query object is the URLQuery object, and null if there is no such object.

URLQuery objects always use utf-8 as encoding, despite the existence of concepts such as query encoding. This is to encourage developers to migrate towards utf-8, which they really ought to have done a long time ago now.

To create a new URLQuery object, optionally using init, run these steps:

  1. Let query be a new URLQuery object.

  2. If init is not given or is null, return query.

  3. If init is a string and is not the empty string, set query's associated list of name-value pairs to the result of parsing init.

  4. If init is a URLQuery object, set query's associated list of name-value pairs to a copy of init's associated list of name-value pairs.

  5. If init is a dictionary...

  6. Return query.
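
The creation steps can be sketched, non-normatively, as below. QueryModel, createQuery, and toyParsePairs are hypothetical names. The parser is passed in as a parameter, and the toy parser used here does no decoding. The dictionary case is omitted because it is not yet defined above.

```typescript
type Pair = { name: string; value: string };

// Hypothetical model of a URLQuery object: just its list of pairs.
interface QueryModel {
  pairs: Pair[];
}

type Init = string | QueryModel | null | undefined;

function createQuery(init: Init, parsePairs: (input: string) => Pair[]): QueryModel {
  const query: QueryModel = { pairs: [] };
  if (init === undefined || init === null) return query;
  if (typeof init === "string") {
    // A non-empty string is parsed; an empty string leaves the list empty.
    if (init !== "") query.pairs = parsePairs(init);
    return query;
  }
  // Copy the pairs so later changes to init do not affect the new object.
  query.pairs = init.pairs.map((p) => ({ name: p.name, value: p.value }));
  return query;
}

// Toy parser, for illustration only: splits on "&" and the first "=".
const toyParsePairs = (input: string): Pair[] =>
  input.split("&").map((part) => {
    const eq = part.indexOf("=");
    return eq === -1
      ? { name: part, value: "" }
      : { name: part.substring(0, eq), value: part.substring(eq + 1) };
  });

const fromString = createQuery("a=1&b=2", toyParsePairs);
const fromCopy = createQuery(fromString, toyParsePairs);
fromCopy.pairs.push({ name: "c", value: "3" });
```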

A URLQuery object's update steps are:

  1. If url object is null, terminate these steps.

  2. Set url object's url's query to the serialization of the URLQuery object's associated list of name-value pairs.

  3. Run url object's update steps.
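
A non-normative sketch of these update steps follows. UrlObjectModel and QueryModel are hypothetical shapes, and the serializer is passed in as a parameter; toySerialize does no percent-encoding and is for illustration only.

```typescript
type Pair = { name: string; value: string };

// Hypothetical shapes standing in for an object implementing URLUtils
// and for a URLQuery object.
interface UrlObjectModel {
  url: { query: string | null };
  runUpdateSteps(): void;
}
interface QueryModel {
  pairs: Pair[];
  urlObject: UrlObjectModel | null;
}

function runQueryUpdateSteps(
  query: QueryModel,
  serialize: (pairs: Pair[]) => string
): void {
  if (query.urlObject === null) return;                // step 1
  query.urlObject.url.query = serialize(query.pairs);  // step 2
  query.urlObject.runUpdateSteps();                    // step 3
}

// Toy serializer, for illustration only.
const toySerialize = (pairs: Pair[]): string =>
  pairs.map((p) => p.name + "=" + p.value).join("&");

let updateCount = 0;
const owner: UrlObjectModel = {
  url: { query: null },
  runUpdateSteps: () => {
    updateCount++;
  },
};
const attached: QueryModel = {
  pairs: [{ name: "a", value: "1" }, { name: "b", value: "2" }],
  urlObject: owner,
};
const detached: QueryModel = { pairs: [{ name: "x", value: "y" }], urlObject: null };

runQueryUpdateSteps(attached, toySerialize);
runQueryUpdateSteps(detached, toySerialize); // no url object, so nothing happens
```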

The URLQuery(init) constructor must return a new URLQuery object using init if given.

The get(name) method must return the value of the first name-value pair whose name is name, or null if there is no such pair.

The getAll(name) method must return the values of all name-value pairs whose name is name, in list order, or the empty sequence if there are no such pairs.

The set(name, value) method must run these steps:

  1. If there is a name-value pair whose name is name, set the value of the first such name-value pair to value.

  2. Otherwise, append a new name-value pair whose name is name and value is value, to the list of name-value pairs.

  3. Run the update steps.

The append(name, value) method must run these steps:

  1. Append a new name-value pair whose name is name and value is value, to the list of name-value pairs.

  2. Run the update steps.

The has(name) method must return true if there is a name-value pair whose name is name, and false otherwise.

The delete(name) method must run these steps:

  1. Remove all name-value pairs whose name is name.

  2. Run the update steps.

The size attribute must return the number of name-value pairs.
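
The methods above can be modeled, non-normatively, on an ordered list of pairs. QueryList is a hypothetical class, not the URLQuery interface itself, and its onUpdate callback stands in for the update steps.

```typescript
type Pair = { name: string; value: string };

class QueryList {
  pairs: Pair[] = [];
  onUpdate: () => void;

  constructor(onUpdate: () => void = () => {}) {
    this.onUpdate = onUpdate;
  }

  get(name: string): string | null {
    for (const pair of this.pairs) {
      if (pair.name === name) return pair.value;
    }
    return null;
  }

  getAll(name: string): string[] {
    return this.pairs.filter((p) => p.name === name).map((p) => p.value);
  }

  // Only the first matching pair is changed; later duplicates are kept.
  set(name: string, value: string): void {
    const first = this.pairs.filter((p) => p.name === name)[0];
    if (first !== undefined) {
      first.value = value;
    } else {
      this.pairs.push({ name, value });
    }
    this.onUpdate();
  }

  append(name: string, value: string): void {
    this.pairs.push({ name, value });
    this.onUpdate();
  }

  has(name: string): boolean {
    return this.pairs.some((p) => p.name === name);
  }

  delete(name: string): void {
    this.pairs = this.pairs.filter((p) => p.name !== name);
    this.onUpdate();
  }

  get size(): number {
    return this.pairs.length;
  }
}

let listUpdates = 0;
const list = new QueryList(() => {
  listUpdates++;
});
list.append("a", "1");
list.append("a", "2");
list.append("b", "3");
const firstA = list.get("a");
list.set("a", "9");
const allA = list.getAll("a").join(",");
list.delete("a");
```

Note that, as specified above, set leaves any later pairs with the same name in place, which the sketch mirrors.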

References

[ENCODING]
Encoding Standard, Anne van Kesteren. WHATWG.
[HTML]
(Non-normative) HTML, Ian Hickson. WHATWG.
[IPV6]
IP Version 6 Addressing Architecture, Robert Hinden and Steve Deering. IETF.
[IPV6TEXT]
(Non-normative) A Recommendation for IPv6 Address Text Representation, S. Kawamura and M. Kawashima. IETF.
[IRI]
(Non-normative) Internationalized Resource Identifiers (IRIs), Martin Dürst and Michel Suignard. IETF.
[ORIGIN]
The Web Origin Concept, Adam Barth. IETF.
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, Scott Bradner. IETF.
[URI]
(Non-normative) Uniform Resource Identifier (URI): Generic Syntax, Tim Berners-Lee, Roy Fielding and Larry Masinter. IETF.

Acknowledgments

Thanks to Adam Barth, Alexandre Morgaut, Boris Zbarsky, David Sheets, Erik Arvidsson, Gavin Carothers, Glenn Maynard, Henri Sivonen, Ian Hickson, James Graham, James Manger, James Ross, Martin Dürst, Mathias Bynens, Michael™ Smith, Rodney Rehm, Simon Pieters, Tab Atkins, and Tantek Çelik for being awesome!

While this standard has been written from scratch, special thanks should be extended to the editors of the various specifications that previously defined what we now call URLs: Larry Masinter, Martin Dürst, Michel Suignard, Roy Fielding, and Tim Berners-Lee.