This specification is provided to promote interoperability among implementations and users of in-band text tracks sourced for [[HTML5]]/[[HTML]] from media resource containers. The specification provides guidelines for the creation of video, audio and text tracks and their attribute values as mapped from in-band tracks from media resource types typically supported by User Agents. It also explains how the UA should map in-band text track content into text track cues.

Mappings are defined for [[MPEGDASH]], [[ISOBMFF]], [[MPEG2TS]], [[OGGSKELETON]] and [[WebM]].

This is the first draft. Please send feedback to: public-inbandtracks@w3.org.

Introduction

The specification maintains mappings from in-band audio, video and other data tracks of media resources to HTML VideoTrack, AudioTrack, and TextTrack objects and their attribute values.

This specification defines the mapping of tracks from media resources depending on the MIME type of that resource. If an implementation claims to support that MIME type and exposes a track from a resource of that type, the exposed track must conform to this specification.

Which actual tracks are exposed by a user agent from a supported media resource is implementation dependent. A user agent may expose tracks, for which it supports parsing, decoding and rendering, for playback selection by the web application or user. A user agent may also decide to expose tracks coded in formats it is not able to decode, but which it can identify, and describe through metadata such as the HTML kind attribute and others as defined in this specification. For text tracks, the track content may be exposed to the Web application via TextTrackCue or DataCue objects.

A generic rule to follow is that a track as exposed in HTML only ever represents a single semantic concept. When mapping from a media resource, sometimes an in-band track does not relate 1-to-1 to an HTML text, audio or video track.

For example, an HTML TextTrack object is either a subtitle track or a caption track, never both. However, in-band text tracks may encapsulate caption and subtitle cues of the same language as a single in-band track. Since a caption track is essentially a subtitle track with additional cues of transcripts of audio-only information, such an encapsulation in a single in-band track can save space. In HTML, these tracks should be exposed as two TextTrack objects, since they represent different semantic concepts. The cues appear in their relevant tracks - subtitle cues would be present in both. This allows users to choose between the two tracks and activate the desired one in the same manner that they do when the two tracks are provided through two track elements.
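The split described above can be sketched as follows. This is a minimal illustration only: the `InBandCue` record and its `caption_only` flag are hypothetical names introduced here, and real container formats signal caption-only cues in their own ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InBandCue:
    """Hypothetical in-band cue record; caption_only is an assumption."""
    start: float
    end: float
    text: str
    caption_only: bool  # True for transcripts of audio-only information


def split_mixed_track(cues):
    """Return (subtitle_cues, caption_cues) for two separate text tracks.

    Subtitle cues appear in both result lists; caption-only cues
    appear only in the captions list.
    """
    subtitles = [c for c in cues if not c.caption_only]
    captions = list(cues)
    return subtitles, captions


mixed = [
    InBandCue(0.0, 2.0, "Hello.", False),
    InBandCue(2.0, 3.0, "[door slams]", True),
    InBandCue(3.0, 5.0, "Who is there?", False),
]
subtitle_cues, caption_cues = split_mixed_track(mixed)
```

Each resulting list would then back its own TextTrack object, one of kind "subtitles" and one of kind "captions".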

A similar logic applies to in-band text tracks that have subtitle cues of different languages mixed together in one track. These, too, should be exposed as separate tracks, one for each language.

A further example is when a UA decides to implement rendering for a caption track but without exposing the caption track through the TextTrack API. To the Web developer and the Web page user, such a video appears as though it has burnt-in captions. Therefore, the UA could expose two video tracks on the HTMLMediaElement - one with captions and a kind attribute set to captions and one without captions with a kind attribute set to main. In this way, the user and the Web developer still get the choice of whether to see the video with or without captions.

Another generic rule to follow for in-band data tracks is that in order to map them to TextTrack objects, the contents of the track need to be mapped to media-time aligned cues that relate to a non-zero interval of time.

For every MIME-type/subtype of an existing media container format, this specification defines the following information:

  1. Track order.

    Tracks sourced according to this specification are referenced by HTML TrackList objects (audioTracks, videoTracks or textTracks). The [[HTML5]]/[[HTML]] specification mandates that the tracks in those objects be consistently ordered. This requirement ensures that the order of tracks is not changed when a track is added or removed, e.g. that videoTracks[3] points to the same object if the tracks with indices 0, 1, 2 and 3 were not removed. This also ensures a deterministic result when calls to getTrackById are made with media resources, possibly invalid, that declare two tracks with the same id. This specification defines a consistent ordering of tracks between the media resource and TrackList objects when the media resource is consumed by the user agent.

    Note that in some media workflows, the order of tracks in a media resource may be subject to changes (e.g. tracks may be added or removed) between authoring and publication. Applications associated with a media resource should not rely on an order of tracks being the same between when the media resource was authored and when it is consumed by the user agent.

    All media resource formats used in this specification support identifying tracks using a unique identifier. This specification defines how those unique identifiers are mapped onto the id attribute of HTML Track objects. Application authors are encouraged to use the id attribute to identify tracks, rather than the index in a TrackList object.
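The interaction between stable ordering and id lookup can be illustrated with a small model. This is a sketch only: it models a TrackList as a plain Python list of dictionaries, which is an assumption made for illustration and not an HTML API.

```python
def get_track_by_id(track_list, track_id):
    """Return the first track whose id matches, or None.

    Because the list order is stable, a resource that (invalidly)
    declares two tracks with the same id still gives the same answer
    on every call.
    """
    for track in track_list:
        if track["id"] == track_id:
            return track
    return None


# Hypothetical track list with a duplicated id.
tracks = [
    {"id": "1", "label": "first"},
    {"id": "2", "label": "second"},
    {"id": "1", "label": "duplicate"},
]
```

Looking tracks up by id keeps working when tracks are added or removed upstream, whereas a stored index may then point at a different track.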

  2. How to identify the type of tracks - one of audio, video or text.
  3. Setting the attributes id, kind, language and label for sourced TextTrack objects.
  4. Setting the attributes id, kind, language and label for sourced AudioTrack and VideoTrack objects.
  5. Mapping Text Track content into text track cues.

MPEG-DASH

MIME type/subtype: application/dash+xml

[[MPEGDASH]] defines formats for a media manifest, called MPD (Media Presentation Description), which references media containers, called media segments. [[MPEGDASH]] also defines some media segment formats based on [[MPEG2TS]] or [[ISOBMFF]]. Processing of media manifests and segments to expose tracks to Web applications can be done by the user agent. Alternatively, a Web application can process the manifests and segments to expose tracks. When the user agent processes MPD and media segments directly, it exposes tracks for AdaptationSet and ContentComponent elements, as defined in this document. When the Web application processes the MPD and media segments, it passes media segments to the user agent according to the Media Source Extensions [[MSE]] specification. In this case, the tracks are exposed by the user agent according to [[MSE]]. The Web application may set default track attributes from MPD data, using the trackDefaults object, that will be used by the user agent to set attributes not set from initialization segment data.

  1. Track Order

    If an AdaptationSet contains ContentComponents, a track is created for each ContentComponent. Otherwise, a track is created for the AdaptationSet itself. The order of tracks specified in the MPD (Media Presentation Description) format [[MPEGDASH]] is maintained when sourcing multiple MPEG DASH tracks into HTML.

  2. Determining the type of track

    A user agent recognises and supports data from an MPEG DASH media resource as being equivalent to an HTML track using the content type given by the MPD. The content type of the track is the first present value out of: the ContentComponent's "contentType" attribute, the AdaptationSet's "contentType" attribute, or the main type in the AdaptationSet's "mimeType" attribute (i.e. for "video/mp2t", the main type is "video").

    • text track:
    • video track: the content type is "video"
    • audio track: the content type is "audio"
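The "first present value" lookup for the content type can be sketched as below. This is a minimal illustration, assuming the MPD fragment has already been parsed and, for brevity, carries no XML namespace (real MPDs are namespaced); the function name `content_type` is hypothetical.

```python
import xml.etree.ElementTree as ET


def content_type(adaptation_set, content_component=None):
    """Return the first present value out of the ContentComponent's
    contentType, the AdaptationSet's contentType, and the main type of
    the AdaptationSet's mimeType. Returns None if none is present."""
    if content_component is not None:
        value = content_component.get("contentType")
        if value:
            return value
    value = adaptation_set.get("contentType")
    if value:
        return value
    mime_type = adaptation_set.get("mimeType")
    if mime_type:
        # "video/mp2t" -> "video"
        return mime_type.split("/", 1)[0]
    return None


# Simplified, namespace-free example fragment (an assumption).
aset = ET.fromstring(
    '<AdaptationSet mimeType="video/mp2t">'
    '<ContentComponent id="1" contentType="audio"/>'
    '<ContentComponent id="2"/>'
    "</AdaptationSet>"
)
components = aset.findall("ContentComponent")
```

Here the first component resolves to "audio" from its own attribute, while the second falls back to the main type of the AdaptationSet's mimeType.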
  3. Track Attributes for sourced Text Tracks

    Data for sourcing text track attributes may exist in the media content or in the MPD. Text track attribute values are first sourced from track data in the media container, as described for text track attributes in MPEG-2 Transport Streams and text track attributes in MPEG-4 ISOBMFF. If a track attribute's value cannot be determined from the media container, then the track attribute value is sourced from data in the track's ContentComponent. If the needed attribute or element does not exist on the ContentComponent (or if the AdaptationSet doesn't contain any ContentComponents), then that attribute or element is sourced from the AdaptationSet:

    Attribute How to source its value
    id The track is:
    • An ISOBMFF CEA 608 caption service: the string "cc" concatenated with the value of the 'channel-number' field in the Accessibility descriptor in the ContentComponent or AdaptationSet.
    • An ISOBMFF CEA 708 caption service: the string "sn" concatenated with the value of the 'service-number' field in the Accessibility descriptor in the ContentComponent or AdaptationSet.
    • Otherwise, the content of the 'id' attribute in the ContentComponent, or AdaptationSet.
    kind The track:
    • Represents a ContentComponent or AdaptationSet containing a Role descriptor with schemeIdUri attribute = "urn:mpeg:dash:role:2011":
      • "captions": if the Role descriptor's value is "caption"
      • "subtitles": if the Role descriptor's value is "subtitle"
      • "metadata": otherwise
    • Is an ISOBMFF CEA 608 or 708 caption service: "captions".
    label The empty string.
    language The track is:
    • An ISOBMFF CEA 608 or 708 caption service: the value of the 'language' field in the Accessibility descriptor, in the ContentComponent or AdaptationSet, where the corresponding 'channel-number' or 'service-number' is the same as this track's 'id' attribute. The empty string if there is no such corresponding 'channel-number' or 'service-number'.
    • Otherwise: the content of the 'lang' attribute in the ContentComponent or AdaptationSet element.
    inBandMetadataTrackDispatchType If kind is "metadata", an XML document containing the AdaptationSet element and all child Role descriptors and ContentComponents, and their child Role descriptors. The empty string otherwise.
    mode "disabled"
  4. Track Attributes for sourced Audio and Video Tracks

    Data for sourcing audio and video track attributes may exist in the media content or in the MPD. Audio and video track attribute values are first sourced from track data in the media container, as described for audio and video track attributes in MPEG-2 Transport Streams and audio and video track attributes in MPEG-4 ISOBMFF. If a track attribute's value cannot be determined from the media container, then the track attribute value is sourced from data in the track's ContentComponent. If the needed attribute or element does not exist on the ContentComponent (or if the AdaptationSet doesn't contain any ContentComponents), then that attribute or element is sourced from the AdaptationSet:

    Attribute How to source its value
    id Content of the id attribute in the ContentComponent or AdaptationSet element. Empty string if the id attribute is not present on either element.
    kind

    Given a Role scheme of "urn:mpeg:dash:role:2011", determine the kind attribute from the value of the Role descriptors in the ContentComponent and AdaptationSet elements.

    • "alternative": if the role is "alternate" but not also "main", "commentary", or "dub"
    • "captions": if the role is "caption" and also "main"
    • "descriptions": if the role is "description" and also "supplementary"
    • "main": if the role is "main" but not also "caption", "subtitle", or "dub"
    • "main-desc": if the role is "main" and also "description"
    • "sign": not used
    • "subtitles": if the role is "subtitle" and also "main"
    • "translation": if the role is "dub" and also "main"
    • "commentary": if the role is "commentary" but not also "main"
    • "": otherwise
    label The empty string.
    language Content of the lang attribute in the ContentComponent or AdaptationSet element.
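The kind rules for audio and video tracks can be read as a decision over the set of Role values found on the ContentComponent and AdaptationSet. The sketch below is one possible reading: several rules can match the same set of roles (for example "main" together with "description"), and the order in which they are checked here is an assumption of this example, not something the rules state. The function name is hypothetical.

```python
def av_kind_from_roles(roles):
    """Map DASH Role values (scheme urn:mpeg:dash:role:2011) to an HTML kind.

    The order of the checks is an assumption of this sketch; the rules
    themselves do not state a precedence when several of them match.
    """
    r = set(roles)
    if "main" in r and "description" in r:
        return "main-desc"
    if "main" in r and "caption" in r:
        return "captions"
    if "main" in r and "subtitle" in r:
        return "subtitles"
    if "main" in r and "dub" in r:
        return "translation"
    if "description" in r and "supplementary" in r:
        return "descriptions"
    if "main" in r:
        # No "caption", "subtitle" or "dub" can be present at this point.
        return "main"
    if "commentary" in r:
        return "commentary"
    if "alternate" in r and "dub" not in r:
        return "alternative"
    return ""
```

Checking the two-role combinations before the single-role rules keeps the "but not also" conditions implicit, since the excluded combinations have already returned.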
  5. Mapping Text Track content into text track cues

    TextTrackCue objects may be sourced from DASH media content in the WebVTT, TTML, MPEG-2 TS or ISOBMFF format.

    Media content with the MIME type "text/vtt" is in the WebVTT format and should be exposed as a VTTCue object as defined in [[WEBVTT]].

    Media content with the MIME type "application/ttml+xml" is in the TTML format and should be exposed as an as yet to be defined TTMLCue object. Alternatively, browsers can also map the TTML features to VTTCue objects [[WEBVTT]]. Finally, browsers that cannot render TTML [[ttaf1-dfxp]] format data should expose them as DataCue objects [[HTML51]]. In this case, the TTML file must be parsed in its entirety and then converted into a sequence of TTML Intermediate Synchronic Documents (ISDs). Each ISD creates a DataCue object with attributes sourced as follows:

    Attribute How to source its value
    id Decimal representation of the id attribute of the head element in the XML document. Null if there is no id attribute.
    startTime Value of the beginning media time of the active temporal interval of the ISD.
    endTime Value of the ending media time of the active temporal interval of the ISD.
    pauseOnExit "false"
    data The (UTF-16 encoded) ArrayBuffer composing the ISD resource.
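The DataCue attributes for one ISD can be collected as in the sketch below. It assumes the ISD has already been serialized to a string and its active interval resolved to media time by a TTML processor, neither of which is shown. The `DataCueFields` record is a hypothetical stand-in for a DataCue, and the little-endian byte order is an assumption, since the table only says "UTF-16 encoded".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCueFields:
    """Hypothetical stand-in for the attributes of a DataCue."""
    id: Optional[str]
    start_time: float
    end_time: float
    pause_on_exit: bool
    data: bytes


def data_cue_from_isd(isd_xml, begin, end, head_id=None):
    """Build cue fields for one Intermediate Synchronic Document.

    begin/end are the media times of the ISD's active interval;
    head_id is the head element's id attribute, or None if absent.
    """
    return DataCueFields(
        id=head_id,
        start_time=begin,
        end_time=end,
        pause_on_exit=False,
        data=isd_xml.encode("utf-16-le"),  # byte order is an assumption
    )


isd = "<tt xmlns='http://www.w3.org/ns/ttml'><body/></tt>"
cue = data_cue_from_isd(isd, 1.5, 4.0)
```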

    Media content with the MIME type "application/mp4" or "video/mp4" is in the [[ISOBMFF]] format and should be exposed following the same rules as for ISOBMFF text track.

    Media content with the MIME type "video/mp2t" is in the MPEG-2 TS format and should be exposed following the same rules as for MPEG-2 TS text track.

MPEG-2 Transport Streams

MIME type/subtype: audio/mp2t, video/mp2t
  1. Track Order

    Tracks are called "elementary streams" in an MPEG-2 Transport Stream (TS) [[MPEG2TS]]. The order in which elementary streams are listed in the "Program Map Table" (PMT) of an MPEG-2 TS is maintained when sourcing multiple MPEG-2 tracks into HTML. Additions or deletions of elementary streams in the PMT should invoke addtrack or removetrack events in the user agent.

    The order of elementary streams in the PMT may change between when the media resource was created and when it is received by the user agent. Scripts should not infer any information from the ordering, or rely on any particular ordering being present.

  2. Determining the type of track

    A user agent recognizes and supports data in an MPEG-2 TS elementary stream identified by the elementary_PID field in the Program Map Table as being equivalent to an HTML track based on the value of the stream_type field associated with that elementary_PID:

    • text track:
      • The elementary stream with PID 0x02 or the stream_type value is "0x02", "0x05" or between "0x80" and "0xFF".
      • The CEA 708 caption service [[CEA708]], as identified by:
        • A caption_service_descriptor [[ATSC65]] in the 'Elementary Stream Descriptors' in the PMT entry for a video stream with stream type 0x02 or 0x1B.
        • For stream_type 0x02, the presence of caption data in the user_data() field [[ATSC52]].
        • For stream_type 0x1B, the presence of caption data in the ATSC1_data() field [[SCTE128-1]].
      • a DVB subtitle component [[DVB-SUB]] as identified by a subtitling_descriptor [[DVB-SI]] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a stream_type of "0x06"
      • an ITU-R System B Teletext component [[DVB-TXT]] as identified by a teletext_descriptor [[DVB-SI]] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a stream_type of "0x06"
      • a VBI data component [[DVB-VBI]] as identified by a VBI_data_descriptor [[DVB-SI]] or a VBI_teletext_descriptor [[DVB-SI]] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a stream_type of "0x06"
    • video track: the stream_type value is "0x01", "0x02", "0x10", "0x1B", between "0x1E" and "0x24" or "0xEA".
    • audio track:
      • the stream_type value is "0x03", "0x04", "0x0F", "0x11", "0x1C", "0x81" or "0x87".
      • an AC-3 audio component as identified by an AC-3_descriptor [[DVB-SI]] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a stream_type of "0x06"
      • an Enhanced AC-3 audio component as identified by an enhanced_ac-3_descriptor [[DVB-SI]] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a stream_type of "0x06"
      • a DTS® audio component as identified by a DTS_audio_stream_descriptor [[DVB-SI]] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a stream_type of "0x06"
      • a DTS-HD® audio component as identified by a DTS-HD_audio_stream_descriptor [[DVB-SI]] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a stream_type of "0x06"
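The stream_type part of the classification above can be sketched as a set lookup. The listed values overlap (for example "0x81" appears under both audio and text), and no precedence is stated, so this example returns every track type whose stream_type rule matches rather than picking one. The descriptor-based rules for stream_type "0x06" and the in-video caption services are left out, and all names are hypothetical.

```python
# stream_type values taken from the lists above.
VIDEO_STREAM_TYPES = {0x01, 0x02, 0x10, 0x1B, 0xEA} | set(range(0x1E, 0x25))
AUDIO_STREAM_TYPES = {0x03, 0x04, 0x0F, 0x11, 0x1C, 0x81, 0x87}
TEXT_STREAM_TYPES = {0x02, 0x05} | set(range(0x80, 0x100))


def candidate_track_types(stream_type):
    """Return the set of track types whose stream_type rule matches.

    Descriptor-based identification (stream_type 0x06, CEA 708 in
    video) is not covered by this sketch.
    """
    types = set()
    if stream_type in TEXT_STREAM_TYPES:
        types.add("text")
    if stream_type in VIDEO_STREAM_TYPES:
        types.add("video")
    if stream_type in AUDIO_STREAM_TYPES:
        types.add("audio")
    return types
```

A user agent would still need its own policy for the overlapping values, which this sketch deliberately does not choose.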
  3. Track Attributes for sourced Text Tracks

    Attribute How to source its value
    id Decimal representation of the elementary stream's identifier (elementary_PID field) in the PMT.

    For CEA 608 closed captions, the string "cc" concatenated with the decimal representation of the channel number.

    For CEA 708 closed captions, the string "sn" concatenated with the decimal representation of the service_number field in the 'Caption Channel Service Block'.

    If program 0 (zero) is present in the transport stream, a string of the format "OOOO.TTTT.SSSS.CC" consisting of the following, lower-case hexadecimal encoded fields:

    • OOOO is the four character representation of the 16-bit original_network_id [[DVB-SI]].
    • TTTT is the four character representation of the 16-bit transport_stream_id [[DVB-SI]].
    • SSSS is the four character representation of the 16-bit service_id [[DVB-SI]].
    • CC is:
      • If a stream_identifier_descriptor [[DVB-SI]] is present in the PMT, a two character representation of the 8-bit component_tag value.
      • Otherwise, a four character representation of the elementary stream's identifier (13-bit elementary_PID field) in the PMT.
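The "OOOO.TTTT.SSSS.CC" identifier can be assembled as in the following sketch. The function and parameter names are hypothetical; the field values are assumed to have been read from the DVB service information and the PMT already.

```python
def dvb_track_id(original_network_id, transport_stream_id, service_id,
                 elementary_pid, component_tag=None):
    """Format a track id as lower-case hex "OOOO.TTTT.SSSS.CC".

    CC is the two-character component_tag when a
    stream_identifier_descriptor supplied one, otherwise the
    four-character 13-bit elementary_PID.
    """
    if component_tag is not None:
        cc = format(component_tag & 0xFF, "02x")
    else:
        cc = format(elementary_pid & 0x1FFF, "04x")
    return "{:04x}.{:04x}.{:04x}.{}".format(
        original_network_id & 0xFFFF,
        transport_stream_id & 0xFFFF,
        service_id & 0xFFFF,
        cc,
    )
```

For example, with illustrative values, `dvb_track_id(0x233A, 0x1004, 0x1044, 0x0101, component_tag=0x0A)` gives "233a.1004.1044.0a", and omitting the component tag gives "233a.1004.1044.0101".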

    kind
    • "captions":
      • For a CEA708 caption service.
      • for a DVB subtitle component [[DVB-SUB]] as identified by a subtitling_descriptor [[DVB-SI]] in the PMT with a subtitling_type in the range "0x20" to "0x25".
      • an ITU-R System B Teletext component [[DVB-TXT]] as identified by a teletext_descriptor [[DVB-SI]] with a teletext_type value of "0x05" in the PMT
      • a VBI data component [[DVB-VBI]] as identified by a VBI_teletext_descriptor [[DVB-SI]] with a teletext_type value of "0x05" in the PMT.
    • "subtitles":
      • If the stream type value is "0x82".
      • for a DVB subtitle component [[DVB-SUB]] as identified by a subtitling_descriptor [[DVB-SI]] in the PMT with a subtitling_type in the range "0x10" to "0x15".
      • an ITU-R System B Teletext component [[DVB-TXT]] as identified by a teletext_descriptor [[DVB-SI]] with a teletext_type value of "0x02" in the PMT
      • a VBI data component [[DVB-VBI]] as identified by a VBI_teletext_descriptor [[DVB-SI]] with a teletext_type value of "0x02" in the PMT.
    • "metadata": otherwise
    label
    • If a component_name_descriptor [[ATSC65]] is found immediately after the ES_info_length field in the Program Map Table [[MPEG2TS]], the DOMString representation of the component_name_string in that component_name_descriptor.
    • If a component_descriptor [[DVB-SI]] for the component is present in the SDT or EIT, the DOMString representation of the content of the text field in that component_descriptor
    • The empty string otherwise.
    language kind is
    • "captions":
      • For a CEA708 caption service.
        • Content of the language field for the caption service in the caption_service_descriptor, if present.
        • Otherwise, for the first caption service, as identified by the service_number field in the service_block [[CEA708]] with a value of 1, the value of language of the audio track where kind has the value "main".
        • The empty string for all other caption services, as identified by values greater than 1 in the service_number field.
      • For a DVB subtitle component [[DVB-SUB]], the value of the ISO_639_language_code field in the subtitling_descriptor [[DVB-SI]] in the PMT
      • For an ITU-R System B Teletext component [[DVB-TXT]], the value of the ISO_639_language_code field in the teletext_descriptor [[DVB-SI]] in the PMT
      • For a VBI data component [[DVB-VBI]], the value of the ISO_639_language_code field in the VBI_teletext_descriptor [[DVB-SI]] in the PMT
    • "subtitles":
      • If stream_type value is "0x82", the content of the ISO_639_language_code field in the ISO_639_language_descriptor in the elementary stream descriptor array in the PMT.
      • for a DVB subtitle component [[DVB-SUB]], the value of the ISO_639_language_code field in the subtitling_descriptor [[DVB-SI]] in the PMT
      • for an ITU-R System B Teletext component [[DVB-TXT]], the value of the ISO_639_language_code field in the teletext_descriptor [[DVB-SI]] in the PMT
      • for a VBI data component [[DVB-VBI]], the value of the ISO_639_language_code field in the VBI_teletext_descriptor [[DVB-SI]] in the PMT
    • "metadata": The empty string.
    inBandMetadataTrackDispatchType If kind is "metadata", then the concatenation of the stream_type byte field in the program map table and ES_info_length bytes following the ES_info_length field expressed in hexadecimal using uppercase ASCII hex digits. The empty string otherwise.
    mode "disabled"
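The inBandMetadataTrackDispatchType string for a metadata track can be built as sketched below. The function name is hypothetical, and the PMT entry is assumed to have been parsed already so that the stream_type byte and the ES_info_length descriptor bytes are available.

```python
def ts_dispatch_type(stream_type, es_info):
    """Concatenate the stream_type byte and the descriptor bytes that
    follow ES_info_length, as uppercase hexadecimal digits."""
    return (bytes([stream_type]) + bytes(es_info)).hex().upper()


# Illustrative bytes: stream_type 0x86 followed by a six-byte descriptor.
example = ts_dispatch_type(0x86, bytes([0x05, 0x04, 0x43, 0x55, 0x45, 0x49]))
```

With these illustrative bytes, `example` is "86050443554549". A script can compare this string against known values to decide whether it understands the metadata track.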
  4. Track Attributes for sourced Audio and Video Tracks

    Attribute How to source its value
    id
    • Decimal representation of the elementary stream's identifier (elementary_PID field) in the PMT.
    • If a program 0 (zero) is present in the transport stream, a string of the format "OOOO.TTTT.SSSS.CC" or "OOOO.TTTT.SSSS.CC&CC", consisting of the following, lower-case hexadecimal encoded fields:
      • OOOO is the four character representation of the 16-bit original_network_id [[DVB-SI]].
      • TTTT is the four character representation of the 16-bit transport_stream_id [[DVB-SI]].
      • SSSS is the four character representation of the 16-bit service_id [[DVB-SI]].
      • CC is:
        • If a stream_identifier_descriptor [[DVB-SI]] is present in the PMT, a two character representation of the 8-bit component_tag value.
        • Otherwise, a four character representation of the elementary stream's identifier (13-bit elementary_PID field) in the PMT.

      Where a track is derived from two components, the second form ("CC&CC") identifies the independent and dependent streams, where the first 'CC' identifies the independent stream, and the second 'CC' identifies the dependent stream. Otherwise the first form is used.

    kind
    • If a supplementary_audio_descriptor [[DVB-SI]] is present in the PMT for an audio component, the value is derived according to the audio purpose defined in table J.3 of [[DVB-SI]] using the following rules:
      • "main" if PSI signalling of audio purpose indicates "Main audio" for the audio track that the user agent would select by default, otherwise to "translation"

        Need to define how UA would select track by default.

      • components with an audio purpose of "Audio description (broadcast-mix)" map to "main-desc"
      • components with an audio purpose of "Audio description (receiver-mix)":
        • The user agent exposes an audio track of kind "main-desc" for each permitted combination of this track with another audio track as defined in annex J.2 of [[DVB-SI]]. Enabling this track results in the combination being presented.
        • If the user agent can present the stream in isolation, it also exposes an audio track of kind "descriptions" for this audio component.
      • components with an audio purpose of "Clean audio (broadcast-mix)", "Parametric data dependent stream", or "Unspecific audio for the general audience" map to "alternative"
      • components with other audio purposes map to the empty string
    • Otherwise:
      • "descriptions":
        • For AC-3 audio [[ATSC52]] if the bsmod field is 2 and the full_svc field is 0 in the AC-3_audio_stream_descriptor() in the PMT
        • For E-AC-3 audio [[ATSC52]] if the audio_service_type field is 2 and the full_service_flag is 0 in the E-AC-3_audio_descriptor() in the PMT
        • For AAC audio [[SCTE193-2]] if the AAC_service_type field is 2 and the receiver_mix_rqd is 1 in the MPEG_AAC_descriptor() in the PMT
      • "main": if this is the first audio (video) elementary stream in the PMT and the audio_type field in the ISO_639_language_descriptor, if present, is "0x00" or "0x01"
      • "main-desc":
        • For AC-3 audio [[ATSC52]] if the bsmod field is 2 and the full_svc field is 1 in the AC-3_audio_stream_descriptor()
        • For E-AC-3 audio [[ATSC52]] if the audio_service_type field is 2 and the full_service_flag is 1 in the E-AC-3_audio_descriptor()
        • For AAC audio [[SCTE193-2]] if the AAC_service_type field is 2 and the receiver_mix_rqd is 0 in the MPEG_AAC_descriptor()
      • "sign" video components with a component_descriptor [[DVB-SI]] in the SDT or EIT, where the stream_content is "0x3" and the component_type is "0x30" or "0x31"
      • "translation": if this is not the first audio elementary stream in the PMT and the audio_type field in the ISO_639_language_descriptor is "0x00" or "0x01" and bsmod=0
      • "": otherwise
    label
    • If a component_descriptor [[DVB-SI]] is present in the SDT or EIT, the DOMString representation of the content of the text field in that component_descriptor
    • If a component_name_descriptor [[ATSC65]] is present for this elementary stream in the Program Map Table [[MPEG2TS]], the DOMString representation of the component_name_string field in that descriptor.
    • The empty string otherwise.
    language kind is:
    • "descriptions" or "main-desc": Content of the language field in the AC-3_audio_stream_descriptor or E-AC-3_audio_descriptor [[ATSC52]] if present.
    • otherwise: Content of the ISO_639_language_code field in the ISO_639_language_descriptor.
  5. Mapping Text Track content into text track cues for MPEG-2 TS

    MPEG-2 transport streams may contain data that should be exposed as cues on "captions", "subtitles" or "metadata" text tracks. No data is defined that equates to "descriptions" or "chapters" text track cues.

    1. Metadata cues

      Cues on an MPEG-2 metadata text track are created as DataCue objects [[HTML51]]. Each section in an elementary stream identified as a text track creates a DataCue object with its TextTrackCue attributes sourced as follows:

      Attribute How to source its value
      id The empty string.
      startTime 0
      endTime The time, in the media resource timeline, that corresponds to the presentation time of the video frame received immediately prior to the section in the media resource.
      pauseOnExit "false"
      data The entire MPEG-TS section, starting with table_id and ending section_length bytes after the section_length field.
    2. Captions cues

      • CEA 708

        MPEG-2 TS captions in the CEA 708 format [[CEA708]] are carried in the video stream in Picture User Data [[ATSC53-4]] for stream_type 0x02 and in Supplemental Enhancement Information [[ATSC72-1]] for stream_type 0x1B. Browsers that can render the CEA 708 format should expose the caption data to the web application by mapping the CEA 708 features to VTTCue objects [[VTT708]].

      • DVB

        MPEG-2 TS captions in the DVB subtitle format [[DVB-SUB]], ITU-R System B Teletext [[DVB-TXT]] and VBI [[DVB-VBI]] formats are not exposed in a TextTrackCue.

    3. Subtitles cues

      • SCTE 27

        MPEG-2 TS subtitles in the SCTE 27 format [[SCTE27]] should be exposed in an as yet to be specified SCTE27Cue object. Alternatively, browsers can also map the SCTE 27 features to VTTCue objects via an as yet to be specified mapping process. Finally, browsers that cannot render SCTE 27 subtitles should expose them as DataCue objects [[HTML51]]. In this case, each section in an elementary stream identified as a subtitles text track creates a DataCue object with TextTrackCue attributes sourced as follows:

        Attribute How to source its value
        id The empty string.
        startTime The time, in the HTML media resource timeline, that corresponds to the display_in_PTS field in the section data.
        endTime The sum of the startTime and the display_duration field in the section data expressed in seconds.
        pauseOnExit "false"
        data The entire MPEG-TS section, starting with table_id and ending section_length bytes after the section_length field.
      • DVB

        MPEG-2 TS subtitles in the DVB subtitle format [[DVB-SUB]], ITU-R System B Teletext [[DVB-TXT]] and VBI [[DVB-VBI]] formats are not exposed in a TextTrackCue.

MPEG-4 ISOBMFF

MIME type/subtype: audio/mp4, video/mp4, application/mp4
  1. Track Order

    The order of tracks specified by TrackBox (trak) boxes in the MovieBox (moov) container [[ISOBMFF]] is maintained when sourcing multiple MPEG-4 tracks into HTML.

  2. Determining the type of track

    A user agent recognises and supports data from a TrackBox as being equivalent to an HTML track based on the value of the handler_type field in the HandlerBox (hdlr) of the MediaBox (mdia) of the TrackBox:

    • text track:
      • the handler_type value is "meta", "subt" or "text"
      • the handler_type value is "vide" and an ISOBMFF CEA 608 or 708 caption service is encapsulated in the video track as an SEI message as defined in [[DASHIFIOP]].
    • video track: the handler_type value is "vide"
    • audio track: the handler_type value is "soun"
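The handler_type rules above can be sketched as a small lookup. The function name and the `has_cea_captions` flag are hypothetical: detecting CEA 608 or 708 data in SEI messages requires inspecting the video samples, which is not shown here.

```python
def isobmff_track_types(handler_type, has_cea_captions=False):
    """Return the set of HTML track types exposed for one TrackBox.

    A "vide" track that carries CEA 608/708 captions in SEI messages
    yields both a video track and a text track.
    """
    if handler_type in ("meta", "subt", "text"):
        return {"text"}
    if handler_type == "vide":
        return {"video", "text"} if has_cea_captions else {"video"}
    if handler_type == "soun":
        return {"audio"}
    return set()
```

Returning a set reflects that a single in-band track can map to more than one HTML track, as with embedded caption services.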
  3. Track Attributes for sourced Text Tracks

    Attribute How to source its value
    id

    For ISOBMFF CEA 608 closed captions, the string "cc" concatenated with the decimal representation of the channel_number.

    For ISOBMFF CEA 708 closed captions, the string "sn" concatenated with the decimal representation of the service_number field in the 'Caption Channel Service Block'.

    Otherwise, the decimal representation of the track_ID of a TrackHeaderBox (tkhd) in a TrackBox (trak).

    kind
    • "captions":
      • WebVTT caption: handler_type is "text" and SampleEntry format is WVTTSampleEntry [[ISO14496-30]] and the VTT metadata header Kind is "captions"
      • SMPTE-TT caption: handler_type is "subt" and SampleEntry format is XMLSubtitleSampleEntry [[ISO14496-30]] and the namespace is set to "http://www.smpte-ra.org/schemas/2052-1/2013/smpte-tt#cea708" [[SMPTE2052-11]].
      • An ISOBMFF CEA 608 or 708 caption service.
      • 3GPP caption: handler_type is "text" and the SampleEntry code (format field) is "tx3g".

        Are all sample entries of this type "captions"?

    • "subtitles":
      • WebVTT subtitle: handler_type is "text" and SampleEntry format is WVTTSampleEntry [[ISO14496-30]] and the VTT metadata header Kind is "subtitles"
      • SMPTE-TT subtitle: handler_type is "subt" and SampleEntry format is XMLSubtitleSampleEntry [[ISO14496-30]] and the namespace is set to a TTML namespace that does not indicate a SMPTE-TT caption.
    • "metadata": otherwise
    label Content of the name field in the HandlerBox.
    language If the track is an ISOBMFF CEA 608 or 708 caption service then the empty string ("").

    Otherwise, the content of the language field in the MediaHeaderBox.

    No signaling is currently defined for specifying the language of CEA 608 or 708 captions in ISOBMFF. MPEG DASH MPDs may specify caption track metadata, including language [[DASHIFIOP]]. The user agent should set the language attribute of CEA 608 or 708 caption text tracks to the empty string so that script may use the media source extensions [[MSE]] TrackDefault object to provide a default for the language attribute.

    inBandMetadataTrackDispatchType
    • kind is "metadata":
      • if an XMLMetaDataSampleEntry box is present, the concatenation of the string "metx", a U+0020 SPACE character, and the value of the namespace field
      • if a TextMetaDataSampleEntry box is present, the concatenation of the string "mett", a U+0020 SPACE character, and the value of the mime_format field
      • otherwise the empty string
    • otherwise the empty string
    mode "disabled"
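
    The inBandMetadataTrackDispatchType rule above can be sketched as follows. This is a minimal illustration only: the SampleEntry type and its field names are hypothetical stand-ins for whatever a demuxer reports, and are not defined by this specification.

```typescript
// Hypothetical, simplified view of the sample entry of an ISOBMFF text track.
type SampleEntry =
  | { type: "metx"; namespace: string }   // XMLMetaDataSampleEntry
  | { type: "mett"; mimeFormat: string }  // TextMetaDataSampleEntry
  | { type: "other" };                    // any other sample entry

// Returns the dispatch type string for a track of the given kind.
function isobmffDispatchType(kind: string, entry: SampleEntry): string {
  if (kind !== "metadata") {
    return ""; // only metadata tracks carry a dispatch type
  }
  switch (entry.type) {
    case "metx":
      return "metx " + entry.namespace;  // "metx", U+0020 SPACE, namespace
    case "mett":
      return "mett " + entry.mimeFormat; // "mett", U+0020 SPACE, mime_format
    default:
      return "";
  }
}
```

    A script could compare the resulting string against the formats it understands before attaching cue handlers to a metadata track.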
  4. Track Attributes for sourced Audio and Video Tracks

    Attribute How to source its value
    id Decimal representation of the track_ID of a TrackHeaderBox (tkhd) in a TrackBox (trak).
    kind
    • "alternative": not used
    • "captions": not used
    • "descriptions"
      • For E-AC-3 audio [[ETSI102366]] if the bsmod field is 2 and the asvc is 1 in the EC3SpecificBox
    • "main": first audio (video) track
    • "main-desc":
      • For AC-3 audio [[ETSI102366]] if the bsmod field is 2 in the AC3SpecificBox
      • For E-AC-3 audio [[ETSI102366]] if the bsmod field is 2 and the asvc is 0 in the EC3SpecificBox
    • "sign": not used
    • "subtitles": not used
    • "translation": not first audio (video) track
    • "commentary": not used
    • "": otherwise
    label Content of the name field in the HandlerBox.
    language Content of the language field in the MediaHeaderBox.
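
    The kind rules for audio tracks in the table above might be evaluated as in the sketch below. The table does not state an order of precedence, so the sketch assumes that the codec-specific rules ("descriptions", "main-desc") are checked before the positional rules ("main", "translation"). The AudioTrackInfo type is hypothetical.

```typescript
// Hypothetical summary of what a demuxer knows about an ISOBMFF audio track.
interface AudioTrackInfo {
  codec: "ac-3" | "ec-3" | "other";
  bsmod?: number;   // bsmod field of the AC3SpecificBox or EC3SpecificBox
  asvc?: number;    // asvc field of the EC3SpecificBox (E-AC-3 only)
  isFirst: boolean; // true if this is the first audio track in the file
}

// Assumption: codec-specific rules take precedence over track position.
function isobmffAudioKind(t: AudioTrackInfo): string {
  if (t.codec === "ec-3" && t.bsmod === 2 && t.asvc === 1) {
    return "descriptions";
  }
  if (t.codec === "ac-3" && t.bsmod === 2) {
    return "main-desc";
  }
  if (t.codec === "ec-3" && t.bsmod === 2 && t.asvc === 0) {
    return "main-desc";
  }
  return t.isFirst ? "main" : "translation";
}
```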
  5. Mapping Text Track content into text track cues for MPEG-4 ISOBMFF

    [[ISOBMFF]] text tracks may be in the WebVTT or TTML format [[ISO14496-30]], 3GPP Timed Text format [[3GPP-TT]], or other format.

    [[ISOBMFF]] text tracks carry WebVTT data if the media handler type is "text" and a WVTTSampleEntry format is used, as described in [[ISO14496-30]]. Browsers that can render text tracks in the WebVTT format should expose a VTTCue [[WEBVTT]] as follows:

    Attribute How to source its value
    id The cue_id field in the CueIDBox.
    startTime The sample presentation time.
    endTime The sum of the startTime and the sample duration.
    pauseOnExit "false"
    cue setting attributes The settings field in the CueSettingsBox.
    text The cue_text field in the CuePayloadBox.
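
    As an illustration of the table above, the following sketch derives the cue fields from one WebVTT sample. The WvttSample type is hypothetical. The sketch also assumes that sample times are given in units of the track timescale and are converted to seconds, which is the unit HTML uses for cue times. It returns a plain object so that it can run outside a browser.

```typescript
// Hypothetical, already-parsed WebVTT sample from an ISOBMFF track.
interface WvttSample {
  presentationTime: number; // in timescale units
  duration: number;         // in timescale units
  timescale: number;        // units per second
  cueId: string;            // cue_id field of the CueIDBox
  settings: string;         // settings field of the CueSettingsBox
  cueText: string;          // cue_text field of the CuePayloadBox
}

function cueFieldsFromSample(s: WvttSample) {
  const startTime = s.presentationTime / s.timescale;
  return {
    id: s.cueId,
    startTime,
    endTime: startTime + s.duration / s.timescale, // startTime plus sample duration
    pauseOnExit: false,
    settings: s.settings, // still to be parsed into cue setting attributes
    text: s.cueText,
  };
}
```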

    [[ISOBMFF]] captions in the CEA 708 format [[CEA708]] are carried in the video stream in SEI messages [[DASHIFIOP]]. Browsers that can render the CEA 708 format should expose the caption data to the web application by mapping the CEA 708 features to VTTCue objects [[VTT708]].

    ISOBMFF text tracks carry TTML data if the media handler type is "subt" and an XMLSubtitleSampleEntry format is used with a TTML-based name_space field, as described in [[ISO14496-30]]. Browsers that can render text tracks in the TTML format should expose an as yet to be defined TTMLCue. Alternatively, browsers can also map the TTML features to VTTCue objects. Finally, browsers that cannot render TTML [[ttaf1-dfxp]] format data should expose them as DataCue objects [[HTML51]]. Each TTML subtitle sample consists of an XML document and creates a DataCue object with attributes sourced as follows:

    Attribute How to source its value
    id Decimal representation of the id attribute of the head element in the XML document. Null if there is no id attribute.
    startTime Value of the beginning media time of the top-level temporal interval of the XML document.
    endTime Value of the ending media time of the top-level temporal interval of the XML document.
    pauseOnExit "false"
    data The (UTF-16 encoded) ArrayBuffer composing the XML document.
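
    The data attribute in the table above could be produced as in this sketch. The function name is hypothetical, and the byte order is an assumption: the table does not specify one, and a Uint16Array writes code units in the byte order of the platform.

```typescript
// Encodes an XML document string as UTF-16 code units in an ArrayBuffer.
// Assumption: platform byte order, since none is specified.
function ttmlCueData(xml: string): ArrayBuffer {
  const buffer = new ArrayBuffer(xml.length * 2); // two bytes per UTF-16 code unit
  const view = new Uint16Array(buffer);
  for (let i = 0; i < xml.length; i++) {
    view[i] = xml.charCodeAt(i);
  }
  return buffer;
}
```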

    TTML data may contain tunneled CEA708 captions [[SMPTE2052-11]]. Browsers that can render CEA708 data should expose it as defined for MPEG-2 TS CEA708 cues.

    3GPP timed text data is carried in [[ISOBMFF]] as described in [[3GPP-TT]]. Browsers that can render text tracks in the 3GPP Timed Text format should expose an as yet to be defined 3GPPCue. Alternatively, browsers can also map the 3GPP features to VTTCue objects.

WebM

MIME type/subtype: audio/webm, video/webm
  1. Track Order

    The order of tracks specified in the EBML initialisation segment [[WebM]] is maintained when sourcing multiple WebM tracks into HTML.

  2. Determining the type of track

    A user agent recognises and supports data from a WebM resource as being equivalent to a HTML track based on the value of the TrackType field of the track in the Segment info:

    • text track: TrackType field is "0x11" or "0x21"
    • video track: TrackType field is "0x01"
    • audio track: TrackType field is "0x02"
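
    The TrackType mapping above amounts to a small lookup, sketched here. The function name is hypothetical, and returning undefined for unrecognised values is an assumption of the sketch.

```typescript
type HtmlTrackType = "text" | "video" | "audio";

// Maps the WebM TrackType field to the type of HTML track it is exposed as.
function webmTrackType(trackType: number): HtmlTrackType | undefined {
  switch (trackType) {
    case 0x11:
    case 0x21:
      return "text";
    case 0x01:
      return "video";
    case 0x02:
      return "audio";
    default:
      return undefined; // not exposed as an HTML track in this sketch
  }
}
```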
  3. Track Attributes for sourced Text Tracks

    WebM has defined how to store WebVTT [[WEBVTT]] files in WebM [[WebM]][[WEBVTT-WEBM]]. Chapter tracks are sourced differently from tracks of other kinds, as explained below the table.

    Attribute How to source its value
    id Decimal representation of the TrackNumber field of the track in the Track section of the WebM file Segment.
    kind

    Map the content of the TrackType and CodecID fields of the track as follows:

    • "captions": TrackType is "0x11" and CodecID is "D_WEBVTT/captions"
    • "subtitles": TrackType is "0x11" and CodecID is "D_WEBVTT/subtitles"
    • "descriptions": TrackType is "0x11" and CodecID is "D_WEBVTT/descriptions"
    • "metadata": otherwise
    label Content of the name field of the track.
    language Content of the language field of the track.
    inBandMetadataTrackDispatchType If kind is "metadata", then the value of the CodecID element. The empty string otherwise.
    mode "disabled"
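
    The kind and inBandMetadataTrackDispatchType rows of the table above can be sketched together. The function names are hypothetical. Chapter tracks are not covered.

```typescript
// Maps TrackType and CodecID of a WebM text track to the HTML kind.
function webmTextKind(trackType: number, codecId: string): string {
  if (trackType === 0x11) {
    switch (codecId) {
      case "D_WEBVTT/captions":
        return "captions";
      case "D_WEBVTT/subtitles":
        return "subtitles";
      case "D_WEBVTT/descriptions":
        return "descriptions";
    }
  }
  return "metadata"; // otherwise
}

// The dispatch type is the CodecID for metadata tracks, else the empty string.
function webmDispatchType(kind: string, codecId: string): string {
  return kind === "metadata" ? codecId : "";
}
```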

    Tracks of kind "chapters" are found in the "Chapters" section of the WebM file Segment, which is located at the beginning of the WebM file so that chapters can be used for navigation. The details of this mapping have not been specified yet; for now, refer to the more powerful Matroska chapter specification [[Matroska]]. Presumably, the id attribute could be sourced from EditionUID, label is the empty string, and language could come from the first ChapterAtom's ChapLanguage value.

    The Matroska container format, which is the basis for WebM, has specifications for other text tracks, in particular SRT, SSA/ASS, and VOBSUB. The described attribute mappings can be applied to these, too, except that the kind field will always be "subtitles". The information of their CodecPrivate field is exposed in the inBandMetadataTrackDispatchType attribute.

  4. Track Attributes for sourced Audio and Video Tracks

    Attribute How to source its value
    id Decimal representation of the TrackNumber field of the track in the Segment info.
    kind
    • "alternative": not used
    • "captions": not used
    • "descriptions": not used
    • "main": the FlagDefault element is set on the track
    • "main-desc": not used
    • "sign": not used
    • "subtitles": not used
    • "translation": not first audio (video) track
    • "commentary": not used
    • "": otherwise
    label Content of the name field of the track in the Segment info.
    language Content of the language field of the track in the Segment info.
  5. Mapping Text Track content into text track cues

    The only text track format defined for WebM is WebVTT [[WEBVTT-WEBM]]. Therefore, cues on a text track are created as VTTCue objects [[WEBVTT]]. Each Block in the BlockGroup of the WebM track that carries the actual data of the text track creates a VTTCue object with its TextTrackCue attributes sourced as follows:

    Attribute How to source its value
    id First line of the Block's data.
    startTime Calculated from the BlockTimecode field in the Block's header and the Timecode field in the Cluster relative to which BlockTimecode is specified.
    endTime Calculated from the BlockDuration field in the Block's header and the startTime.
    pauseOnExit "false"
    cue setting attributes Parsed from the second line of the Block's data.
    text The third and all following lines of the Block's data.
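
    A sketch of the table above follows. The WebmBlock type is hypothetical. The timing arithmetic is an assumption based on the usual Matroska convention: Block and Cluster timecodes are counted in ticks of TimecodeScale nanoseconds, with a default of 1,000,000. The sketch returns a plain object so that it can run outside a browser.

```typescript
// Hypothetical, already-demuxed WebM Block of a WebVTT text track.
interface WebmBlock {
  clusterTimecode: number; // Timecode field of the enclosing Cluster
  blockTimecode: number;   // BlockTimecode, relative to the Cluster
  blockDuration: number;   // BlockDuration field
  data: string;            // Block payload decoded as text
}

function cueFieldsFromBlock(b: WebmBlock, timecodeScaleNs: number = 1e6) {
  // Assumption: ticks are converted to seconds via TimecodeScale (nanoseconds).
  const toSeconds = (ticks: number): number => (ticks * timecodeScaleNs) / 1e9;
  const lines = b.data.split("\n");
  const startTime = toSeconds(b.clusterTimecode + b.blockTimecode);
  return {
    id: lines[0],                                // first line
    startTime,
    endTime: startTime + toSeconds(b.blockDuration),
    pauseOnExit: false,
    settings: lines.length > 1 ? lines[1] : "",  // second line, still unparsed
    text: lines.slice(2).join("\n"),             // third and following lines
  };
}
```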

    Other text tracks of the Matroska container format can also be mapped to TextTrackCue objects. These are created as DataCue objects [[HTML51]] with id, startTime, endTime, and pauseOnExit attributes sourced identically to the VTTCue objects, and with the data attribute containing the Block's data.

Ogg

MIME type/subtype: audio/ogg, video/ogg
  1. Track Order

    The order of tracks specified in the Skeleton fisbone headers [[OGGSKELETON]] is maintained when sourcing multiple Ogg tracks into HTML. If no Skeleton track is available, the order of the "beginning of stream" (BOS) pages determines track order [[OGG]].

  2. Determining the type of track

    A user agent recognises and supports data from an Ogg resource as being equivalent to a HTML track based on the value of the Role field of the fisbone header in Ogg Skeleton:

    • text track: Role starts with "text"
    • video track: Role starts with "video"
    • audio track: Role starts with "audio"

    If no Skeleton track is available, determine the type based on the codec used in the BOS pages, e.g. Vorbis is an audio track and Theora is a video track.
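
    The type determination above, including the fallback when no Skeleton track exists, might look like the following sketch. The function name and the lower-case codec identifiers are hypothetical. Only the two codecs named in the text are handled in the fallback.

```typescript
type OggTrackType = "text" | "video" | "audio";

// role: Role field of the fisbone header, or undefined without Skeleton.
// bosCodec: codec name identified from the BOS page (hypothetical identifier).
function oggTrackType(role: string | undefined, bosCodec?: string): OggTrackType | undefined {
  if (role !== undefined) {
    if (role.indexOf("text") === 0) return "text";
    if (role.indexOf("video") === 0) return "video";
    if (role.indexOf("audio") === 0) return "audio";
    return undefined;
  }
  // Fallback without Skeleton: decide by codec.
  switch ((bosCodec || "").toLowerCase()) {
    case "vorbis":
      return "audio";
    case "theora":
      return "video";
    default:
      return undefined;
  }
}
```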

  3. Track Attributes for sourced Text Tracks

    Attribute How to source its value
    id Content of the name message header field of the fisbone header in Ogg Skeleton. If no Skeleton header is available, use a decimal representation of the stream's serialnumber as given in the BOS.
    kind

    Map the content of the Role message header fields of Ogg Skeleton as follows:

    • "captions": Role is "text/captions"
    • "subtitles": Role is "text/subtitle" or "text/karaoke"
    • "descriptions": Role is "text/textaudiodesc"
    • "chapters": Role is "text/chapters"
    • "metadata": otherwise
    label Content of the title message header field of the fisbone header. If no Skeleton header is available, the empty string.
    language Content of the language message header field of the fisbone header. If no Skeleton header is available, the empty string.
    inBandMetadataTrackDispatchType If kind is "metadata", then the value of the Role header field. The empty string otherwise.
    mode "disabled"
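
    The id and kind rows of the table above can be sketched as follows. The function names are hypothetical, and the Skeleton fields are passed as undefined when no Skeleton header is available.

```typescript
// id: the Skeleton name field, or the decimal serial number from the BOS page.
function oggTrackId(name: string | undefined, serialNumber: number): string {
  return name !== undefined ? name : serialNumber.toString(10);
}

// kind: mapped from the Role message header field of Ogg Skeleton.
function oggTextKind(role: string | undefined): string {
  switch (role) {
    case "text/captions":
      return "captions";
    case "text/subtitle":
    case "text/karaoke":
      return "subtitles";
    case "text/textaudiodesc":
      return "descriptions";
    case "text/chapters":
      return "chapters";
    default:
      return "metadata"; // otherwise
  }
}
```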
  4. Track Attributes for sourced Audio and Video Tracks

    Attribute How to source its value
    id Content of the name message header field of the fisbone header in Ogg Skeleton. If no Skeleton header is available, use a decimal representation of the stream's serialnumber as given in the BOS.
    kind

    Map the content of the Role message header fields of Ogg Skeleton as follows:

    • "alternative": Role is "audio/alternate" or "video/alternate"
    • "captions": Role is "video/captioned"
    • "descriptions": Role is "audio/audiodesc"
    • "main": Role is "audio/main" or "video/main"
    • "main-desc": Role is "audio/described"
    • "sign": Role is "video/sign"
    • "subtitles": Role is "video/subtitled"
    • "translation": Role is "audio/dub"
    • "commentary": Role is "audio/commentary"
    • "": otherwise
    label Content of the title message header field of the fisbone header. If no Skeleton header is available, the empty string.
    language Content of the language message header field of the fisbone header. If no Skeleton header is available, the empty string.
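
    For audio and video tracks, the Role mapping in the table above is a direct lookup, sketched here with a hypothetical function name.

```typescript
// Role values of Ogg Skeleton mapped to HTML AudioTrack / VideoTrack kinds.
const OGG_AV_KINDS: Record<string, string> = {
  "audio/alternate": "alternative",
  "video/alternate": "alternative",
  "video/captioned": "captions",
  "audio/audiodesc": "descriptions",
  "audio/main": "main",
  "video/main": "main",
  "audio/described": "main-desc",
  "video/sign": "sign",
  "video/subtitled": "subtitles",
  "audio/dub": "translation",
  "audio/commentary": "commentary",
};

function oggAvKind(role: string): string {
  // Own-property check avoids matching inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(OGG_AV_KINDS, role)
    ? OGG_AV_KINDS[role]
    : ""; // otherwise
}
```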
  5. Mapping Text Track content into text track cues

    TBD

Acknowledgements

Thanks to all In-band Track Community Group members for helping to create this specification.

Thanks also to the WHATWG and W3C HTML WG where a part of this specification originated.