Copyright © 2014 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document presents the accessibility requirements users with disabilities have with respect to audio and video on the web.
It first provides an introduction to the needs of users with disabilities in relation to audio and video.
Then it explains what alternative content technologies have been developed to help such users gain access to the content of audio and video.
A third section explains how these content technologies fit in the larger picture of accessibility, both technically within a web user agent and from a production process point of view.
This document is most explicitly not a collection of baseline user agent or authoring tool requirements. It is important to recognize that not all user agents (nor all authoring tools) will support all the features discussed in this document. Rather, this document attempts to supply a comprehensive collection of user requirements needed to support media accessibility in the context of HTML5.
Please also note this document is not an inventory of technology currently provided by, or missing from, HTML5 specification drafts. Technology is listed here because it's important for accommodating the alternative access needs of users with disabilities to web-based media. This document is an inventory of Media Accessibility User Requirements.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is a Working Draft by the Protocols & Formats Working Group (PFWG) of the Web Accessibility Initiative. This document is reasonably stable, and represents a consensus within the Working Group. This draft addresses comments received since the publication of the previous working draft. A diff file identifying the resulting changes is available along with a commit history. The Working Group is looking for feedback prior to publication as a Note. In particular, the Working Group seeks input about substantive changes to practices and technologies for media accessibility since the last publication of this document.
To comment, send email to public-pfwg-comments@w3.org (comment archive). Comments are requested by 19 September 2014. In-progress updates to the document may be viewed in the publicly visible editors' draft.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 August 2014 W3C Process Document.
The following User Requirements have also been distilled into a Media Accessibility Checklist. Developers and implementers may want to refer to this checklist when implementing audio and video content and features.
Editorial note: This section is a rough draft. It will be edited to align with How People with Disabilities Use the Web once that document is complete. This draft is included now to provide general background for sections 2 and 3 of this document.
Comprehension of media may be affected by loss of visual function, loss of audio function, cognitive issues, or a combination of all three. Cognitive disabilities may affect access to and/or comprehension of media. Physical disabilities such as dexterity impairment, loss of limbs, or loss of use of limbs may affect access to media. Once richer forms of media, such as virtual reality, become more commonplace, tactile issues may come into play. Control of the media player can be an important issue, e.g., for people with physical disabilities; however, this is typically not addressed by the media formats themselves, but is a requirement of the technology used to build the player.
People who are blind cannot access information if it is presented only in the visual mode. They require information in an alternative representation, which typically means the audio mode, although information can also be presented as text. It is important to remember that not only the main video is inaccessible, but also any other visible ancillary information such as stock tickers, status indicators, or other on-screen graphics, as well as any visual controls needed to operate the content. Since people who are blind use a screen reader and/or refreshable braille display, these assistive technologies (ATs) need to work hand-in-hand with the access mechanism provided for the media content.
People with low vision can use some visual information. Depending on their visual ability they might have specific issues such as difficulty discriminating foreground information from background information, or discriminating colors. Glare caused by excessive scattering in the eye can be a significant challenge, especially for very bright content or surroundings. They may be unable to react quickly to transient information, and may have a narrow angle of view and so may not detect key information presented temporarily where they are not looking, or in text that is moving or scrolling. A person with low vision will likely use screen magnification software. This means that they will only be viewing a portion of the screen, and so must manage tracking media content via their AT. They may have difficulty reading when text is too small or has poor background contrast (too high or too low), or when outlined or other fancy font types or effects are used. If the font is an image, it is likely to appear grainy when magnified. They may be using an AT that adjusts all the colors of the screen, such as inverting the colors, so the media content must be viewable through the AT. Users with low vision will often benefit from the same text streams and instructions that are sometimes hidden or displayed off screen for users of screen readers or refreshable braille displays.
A significant percentage of the population has atypical color perception, and may not be able to discriminate between different colors, or may miss key information when coded with color only. They might have difficulty discriminating foreground information from background information, or discriminating colors. Such issues can be minimized when the user has the ability to customize the color and contrast of text content.
People who are deaf generally cannot use audio. Thus, an alternative representation is required, typically through synchronized captions and/or sign translation.
People who are hard of hearing may be able to use some audio material, but might not be able to discriminate certain types of sound, and may miss any information presented as audio only if it contains frequencies they can't hear, or is masked by background noise or distortion. They may miss audio which is too quiet, or of poor quality. Speech may be challenging if it is too fast and cannot be played back more slowly. Information presented using multichannel audio (e.g., stereo) may not be perceived by people who are deaf in one ear.
Individuals who are deaf-blind have a combination of conditions that may result in one of the following: blindness and deafness; blindness and difficulty in hearing; low vision and deafness; or low vision and difficulty in hearing. Depending on their combination of conditions, individuals who are deaf-blind may need captions that can be enlarged, changed to high-contrast colors, or otherwise styled; or they may need captions and/or described video that can be presented with AT (e.g., a refreshable braille display). They may need synchronized captions and/or described video, or they may need a non-time-based transcript which they can read at their own pace.
People with physical disabilities such as poor dexterity, loss of limbs, or loss of use of limbs may use the keyboard alone rather than the combination of a pointing device plus keyboard to interact with content and controls, or may use a switch with an on-screen keyboard, or other assistive technology. The player itself must be usable via the keyboard and pointing devices. The user must have full access to all player controls, including methods for selecting alternative content.
Cognitive and neurological disabilities include a wide range of conditions that may include intellectual disabilities (called learning disabilities in some regions), autism-spectrum disorders, memory impairments, mental-health disabilities, attention-deficit disorders, audio- and/or visual-perceptive disorders, dyslexia and dyscalculia (called learning disabilities in other regions), or seizure disorders. Necessary accessibility supports vary widely for these different conditions. Individuals with some conditions may process information aurally better than by reading text; therefore, information that is presented as text embedded in a video should also be available as audio descriptions. Individuals with other conditions may need to reduce distractions or flashing in presentations of video. Some conditions such as autism-spectrum disorders may have multi-system effects, and individuals may need a combination of different accommodations. Overall, the media experience for people on the autism spectrum should be customizable and well designed so as to not be overwhelming. Care must be taken to present a media experience that focuses on the purpose of the content and provides alternative content in a clear, concise manner.
A number of alternative content types have been developed to help users with sensory disabilities gain access to audio-visual content. This section lists them, explains generally what they are, and provides a number of requirements on each that need to be satisfied with technology developed in HTML5 around the media elements.
Described video contains descriptive narration of key visual elements designed to make visual media accessible to people who are blind or visually impaired. The descriptions include actions, costumes, gestures, scene changes or any other important visual information that someone who cannot see the screen might ordinarily miss. Descriptions are traditionally audio recordings timed and recorded to fit into natural pauses in the program, although they may also briefly obscure the main audio track. (See the section on extended descriptions for an alternative approach.) The descriptions are usually read by a narrator with a voice that cannot be easily confused with other voices in the primary audio track. They are authored to convey objective information (e.g., a yellow flower) rather than subjective judgments (e.g., a beautiful flower).
As with captions, descriptions can be open or closed.
Described video provides benefits that reach beyond blind or visually impaired viewers; e.g., students grappling with difficult materials or concepts. Descriptions can be used to give supplemental information about what is on screen—the structure of lengthy mathematical equations or the intricacies of a painting, for example.
Described video is available on some television programs and in many movie theaters in the U.S. and other countries. Regulations in the U.S. and Europe are increasingly focusing on description, especially for television, reflecting its priority with citizens who have visual impairments. The technology needed to deliver and render basic video descriptions is in fact relatively straightforward, being an extension of common audio-processing solutions. Playback products must support multi-audio channels required for description, and any product dealing with broadcast TV content must provide adequate support for descriptions. Descriptions can also provide text that can be indexed and searched.
Systems supporting described video that are not open descriptions must:
Described video that uses text for the description source rather than a recorded voice creates specific requirements.
Text video descriptions (TVDs) are delivered to the client as text and rendered locally by assistive technology such as a screen reader or a braille device. This can have advantages for screen-reader users who want full control of the preferred voice and speaking rate, or other options to control the speech synthesis.
Text video descriptions are provided as text files containing start times for each description cue. Since the duration that a screen reader takes to read out a description cannot be determined during authoring of the cues, it is difficult to ensure they don't obscure the main audio or other description cues. This is likely to be caused by at least three reasons:
People with low vision may also benefit from having access to text video descriptions.
Systems supporting text video descriptions must:
Video descriptions are usually provided as recorded speech, timed to play in the natural pauses in dialog or narration. In some types of material, however, there is not enough time to present sufficient descriptions. To meet such cases, the concept of extended description was developed. Extended descriptions work by pausing the video and program audio at key moments, playing a longer description than would normally be permitted, and then resuming playback when the description is finished playing. This will naturally extend the timeline of the entire presentation. This procedure has not been possible in broadcast television; however, hard-disk recording and on-demand Internet systems can make this a practical possibility.
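The effect of extended descriptions on the presentation timeline can be illustrated with a minimal sketch. The `ExtendedDescription` shape and both helper functions are hypothetical names introduced here for illustration only; the sketch assumes the program is fully paused for the whole length of each extended description, as described above.

```typescript
// Hypothetical shape: an extended description that pauses the program at
// `pauseAt` (seconds of media time) and plays for `duration` seconds.
interface ExtendedDescription {
  pauseAt: number;
  duration: number;
}

// Total presentation length: the program itself plus every pause.
function extendedDuration(
  mediaDuration: number,
  descriptions: ExtendedDescription[],
): number {
  return descriptions.reduce((total, d) => total + d.duration, mediaDuration);
}

// Map a media time to elapsed presentation time by adding every pause
// that starts at or before that media time.
function presentationTime(
  mediaTime: number,
  descriptions: ExtendedDescription[],
): number {
  return descriptions
    .filter((d) => d.pauseAt <= mediaTime)
    .reduce((elapsed, d) => elapsed + d.duration, mediaTime);
}
```

With pauses of 5 and 8 seconds, a 60-second program would occupy 73 seconds of presentation time, which is why such descriptions do not fit a fixed broadcast schedule.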
Extended video description (EVD) has been reported to have benefits for cognitive disabilities; for example, it might benefit people with Asperger Syndrome and other Autistic Spectrum Disorders, in that it can make connections between cause and effect, point out what is important to look at, or explain moods that might otherwise be missed.
Systems supporting extended audio descriptions must:
Because the user is the ultimate arbiter of the rate at which TTS playback occurs, it is not feasible for an author to guarantee that any texted audio description can be played within the natural pauses in dialog or narration of the primary audio resource. Therefore, all texted descriptions must be treated as extended text descriptions potentially requiring the pausing and resumption of primary resource playback.
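A rough sketch of the reasoning above: whether a text description fits in a pause depends on the speaking rate the user has chosen, so the player can only decide at playback time how long the primary resource must stay paused. The function names and the words-per-minute estimate are assumptions made for illustration; a real speech synthesizer would report its own timing.

```typescript
// Assumption: speech time is approximated from the word count and a
// user-selected rate in words per minute. Real TTS engines vary.
function estimatedSpeechSeconds(text: string, wordsPerMinute: number): number {
  const words = text
    .trim()
    .split(/\s+/)
    .filter((w) => w.length > 0).length;
  return (words * 60) / wordsPerMinute;
}

// Seconds the primary resource would need to stay paused so that the
// description does not overlap the audio following the gap.
function requiredPauseSeconds(
  text: string,
  gapSeconds: number,
  wordsPerMinute: number,
): number {
  return Math.max(0, estimatedSpeechSeconds(text, wordsPerMinute) - gapSeconds);
}
```

The same six-word description needs a one-second pause in a two-second gap at 120 words per minute, and no pause at all at 360 words per minute.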
A relatively recent development in television accessibility is the concept of clean audio, which takes advantage of the increased adoption of multichannel audio. This is primarily aimed at audiences who are hard of hearing, and consists of isolating the audio channel containing the spoken dialog and important non-speech information that can then be amplified or otherwise modified, while other channels containing music or ambient sounds are attenuated.
Using the isolated audio track may make it possible to apply more sophisticated audio processing such as pre-emphasis filters, pitch-shifting, and so on to tailor the audio to the user's needs, since hearing loss is typically frequency-dependent, and the user may have usable hearing in some bands yet none at all in others.
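As a minimal sketch of the idea, the following computes a linear gain for each channel from user preferences expressed in decibels. The `ChannelRole` and `CleanAudioPrefs` types are hypothetical; in a browser the resulting values could plausibly be applied through per-channel gain controls, but that wiring is not shown here.

```typescript
// Hypothetical channel roles for a multichannel program.
type ChannelRole = "dialog" | "music" | "ambient";

interface CleanAudioPrefs {
  dialogBoostDb: number; // gain added to the dialog channel, in dB
  otherCutDb: number; // attenuation applied to all other channels, in dB
}

// Standard conversion from decibels to a linear amplitude ratio.
function dbToLinear(db: number): number {
  return Math.pow(10, db / 20);
}

// Linear gain to apply to a channel: boost dialog, attenuate the rest.
function channelGain(role: ChannelRole, prefs: CleanAudioPrefs): number {
  return role === "dialog"
    ? dbToLinear(prefs.dialogBoostDb)
    : dbToLinear(-prefs.otherCutDb);
}
```

Frequency-dependent processing such as pre-emphasis would replace the single gain value with a filter per band; the sketch covers only the amplify-and-attenuate step.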
Systems supporting clean audio and multiple audio tracks must:
Sign language shares the same concept as captioning: it presents both speech and non-speech information in an alternative format. Note that due to the wide regional variation in signing systems (e.g., American Sign Language vs British Sign Language), sign translation may not be appropriate for content with a global audience unless localized variants can be made available.
Signing can be open (mixed with the video and offered as an entirely alternative stream) or closed (using some form of picture-in-picture or alpha-blending technology). It is possible to use quite low bit rates for much of the signing track, but it is important that facial, arm, hand and other body gestures be delivered at sufficient resolution to support legibility. Animated avatars may not currently be sufficient as a substitute for human signers, although research continues in this area and it may become practical at some point in the future.
Acknowledging that not all devices will be capable of handling multiple video streams, this is a SHOULD requirement for browsers where hardware is capable of support. Strong authoring guidance for content creators will mitigate situations where user agents are unable to support multiple video streams (WCAG). For example, on mobile devices that cannot support multiple streams, authors should be encouraged to offer two versions of the media stream, including one with signed captions burned into the media.
Selecting from multiple tracks for different sign languages should be achieved in the same fashion that multiple caption/subtitle files are handled.
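A minimal sketch of this shared selection logic, assuming tracks are labelled with BCP 47 language tags (for example `ase` for American Sign Language and `bfi` for British Sign Language). The `AltTrack` shape, the `selectTrack` helper, and the file names are hypothetical.

```typescript
// Hypothetical description of an alternative-content track.
interface AltTrack {
  kind: "captions" | "subtitles" | "sign";
  language: string; // BCP 47 language tag
  src: string; // illustrative file name
}

// Return the first track of the requested kind that matches the user's
// ordered language preferences, or undefined when nothing matches.
function selectTrack(
  tracks: AltTrack[],
  kind: AltTrack["kind"],
  preferred: string[],
): AltTrack | undefined {
  const candidates = tracks.filter((t) => t.kind === kind);
  for (const lang of preferred) {
    const wanted = lang.toLowerCase();
    const match = candidates.find((t) => {
      const have = t.language.toLowerCase();
      // Exact match, or the track carries a more specific subtag.
      return have === wanted || have.startsWith(wanted + "-");
    });
    if (match) return match;
  }
  return undefined;
}
```

Because the same function serves captions, subtitles, and signing, a user's language preferences only need to be expressed once.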
Systems supporting sign language must:
While synchronized captions are generally preferable for people with hearing impairments, for some users they are not viable: those who are deaf-blind, for example, or those with cognitive or reading impairments that make it impossible to follow synchronized captions. And even with ordinary captions, it is possible to miss some information as the captions and the video require two separate loci of attention. The full transcript supports different user needs and is not a replacement for captioning. A transcript can be presented simultaneously with the media material, which can assist slower readers or those who need more time to reference context, but it should also be made available independently of the media.
A full text transcript should include information that would be in both the caption and video description, so that it is a complete representation of the material, as well as containing any interactive options.
Systems supporting transcripts must:
While not all devices may support the capability, a standard control API must support the ability to speed up or slow down content presentation without altering audio pitch.
While perhaps unfamiliar to some, this feature has been present on many devices, especially audiobook players, for some 20 years now.
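A minimal sketch of such a control, written against a small structural type so that it runs outside a browser. HTML media elements expose `playbackRate` and `preservesPitch` properties with these names; the rate bounds and the `setPlaybackRate` helper are assumptions chosen for illustration.

```typescript
// Minimal structural type covering the two media-element properties used
// here, so the sketch does not depend on a DOM being present.
interface RateControllable {
  playbackRate: number;
  preservesPitch?: boolean;
}

// Assumed bounds, chosen for illustration only.
const MIN_RATE = 0.5;
const MAX_RATE = 2.0;

// Clamp the requested rate, ask for pitch preservation, and apply it.
// Returns the rate actually in effect afterwards.
function setPlaybackRate(media: RateControllable, requested: number): number {
  if (!Number.isFinite(requested)) {
    return media.playbackRate; // ignore invalid input, keep current rate
  }
  const rate = Math.min(MAX_RATE, Math.max(MIN_RATE, requested));
  media.preservesPitch = true;
  media.playbackRate = rate;
  return rate;
}
```

In a page, the same function could be called with a `video` or `audio` element, provided the user agent supports pitch preservation.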
The user can adjust the playback rate of prerecorded time-based media content, such that all of the following are true (UAAG 2.0 2.11.4):
One of the biggest challenges to date has been the lack of a universal system for media access. In response to user requirements, various countries and groups have defined systems to provide accessibility, especially captioning for television. However, these systems are typically not compatible. In some cases the formats can be inter-converted, but some formats (for example, DVD sub-pictures) are image based and are difficult to convert to text.
Caption formats are often geared towards delivery of the media, for example as part of a television broadcast. They are not well suited to the production phases of media creation. Media creators have developed their own internal formats which are more amenable to the editing phase, but to date there has been no common format that allows interchange of this data.
Any media-based solution should reduce, as far as possible, the layers of translation between production and delivery.
In general, captioners use a proprietary workstation to prepare caption files; these can often export to various standard broadcast ingest formats, but in general the files are not inter-convertible. Most video editing suites are not set up to preserve captioning, and so this typically has to be added after the final edit is decided on. Furthermore, since this work is often outsourced, the copyright holder may not hold the final editable version of the captions. Thus when programming is later re-purposed, e.g. a shorter edit is made, or a ‘director's cut’ produced, the captioning may have to be redone in its entirety. Similarly, and particularly for news footage, parts of the media may go to the web before the final TV edit is made, and thus the captions that are produced for the final TV edit are not available for the web version.
It is important when purchasing or commissioning media that captioning and described video are taken into account and given equal priority, in terms of ownership, rights of use, etc., with the video and audio themselves.
This is primarily an authoring requirement. It is understood that a common time-stamp format must be declared in HTML5, so that authoring tools can conform to a required output.
Systems supporting accessibility needs for media must:
As described above, individuals need a variety of media (alternative content) in order to perceive and understand the content. The author or some web mechanism provides the alternative content. This alternative content may be part of the original content, embedded within the media container as 'fallback content', or linked from the original content. The user is faced with discovering the availability of alternative content.
Alternative content must be both discoverable by the user and accessible in device-agnostic ways. The development of APIs and user-agent controls should adhere to the following UAAG guidance:
The user agent can facilitate the discovery of alternative content by following these criteria:
This feature can be user configurable to allow maximum flexibility in trading off the anticipated future need for the description against the amount of extra data storage required. A flexible solution giving maximum control to the user would be to provide a global setting with the following options:
Often forgotten in media systems, especially with the newer forms of packaging such as DVD menus and on-screen program guides, is the fact that the user needs to actually get to the content, control its playback, and turn on any required accessibility options. For user agents supporting accessibility APIs implemented for a platform, any media controls need to be connected to that API.
On self-contained products that do not support assistive technology, any menus in the content need to provide information in alternative formats (e.g., talking menus). Products with a separate remote control, or that are self-contained boxes, should ensure the physical design does not block access, and should make accessibility controls, such as the closed-caption toggle, as prominent as the volume or channel controls.
The video viewport plays a particularly important role with respect to alternative-content technologies. Mostly it provides a bounding box for many of the visually represented alternative-content technologies (e.g., captions, hierarchical navigation points, sign language), although some alternative content does not rely on a viewport (e.g., full transcripts, descriptive video).
One key principle to remember when designing player ‘skins’ is that the lower third of the video may be needed for caption text. Caption consumers rely on being able to make fast eye movements between the captions and the video content. If the captions are in a non-standard place, this may cause viewers to miss information. The use of this area for things such as transport controls, while appealing aesthetically, may lead to accessibility conflicts.
If alternative content has a different height or width than the media content, then the user agent will reflow the (HTML) viewport (UAAG 2.0 1.8.7).
This may create a need to provide an author hint to the web page when embedding alternative content in order to instruct the web page how to render the content: to scale with the media resource, scale independently, or provide a position hint in relation to the media. On small devices where the video takes up the full viewport, only limited rendering choices may be possible, such that the UA may need to override author preferences.
This should be achievable through UA configuration, or even through something like a Greasemonkey script or user CSS that can override styles dynamically in the browser.
This can be achieved by simply zooming into the web page, which will automatically rescale the layout and reflow the content.
This is a user-agent device requirement and should already be addressed in the UAAG. In live content, it may even be possible to adjust camera settings to achieve this requirement. It is also a "SHOULD" level requirement, since it does not account for limitations of various devices.
If there are several types of overlapping overlays, the controls should stay on the bottom edge of the viewport and the others should be moved above this area, all stacked above each other.
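The stacking rule can be sketched as a small layout function. The `Overlay` and `PlacedOverlay` types and the `stackOverlays` helper are hypothetical; heights and offsets are in arbitrary units measured from the bottom edge of the viewport.

```typescript
// Hypothetical overlay description: an identifier and a rendered height.
interface Overlay {
  id: string;
  height: number;
}

// An overlay with its computed offset from the viewport's bottom edge.
interface PlacedOverlay extends Overlay {
  bottom: number;
}

// Keep the controls on the bottom edge and stack every other overlay
// above them, in the order given.
function stackOverlays(controls: Overlay, others: Overlay[]): PlacedOverlay[] {
  const placed: PlacedOverlay[] = [{ ...controls, bottom: 0 }];
  let offset = controls.height;
  for (const overlay of others) {
    placed.push({ ...overlay, bottom: offset });
    offset += overlay.height;
  }
  return placed;
}
```

For controls 40 units high and captions 60 units high, captions would sit 40 units above the bottom edge and a further overlay would start at 100.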
Multiple secondary user devices must be directly addressable. This functionality is increasingly also known by the new term, "Second Screen," even though there may be more than two screens in any given viewing environment, and even though not all secondary devices are video displays. It must be assumed that many users will have at least one additional display device (such as a tablet), and/or at least one additional audio output device (such as a Bluetooth headset) attached to a primary video display device, an individual computer, or locally addressable on a LAN. It must be possible to configure certain types of media for presentation on specific devices, and these configuration settings must be readily overwritable on a case-by-case basis by users.
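A minimal sketch of the configuration model described above: stored defaults map each type of media to a device, and a per-session override takes precedence. The `MediaKind` values, the `Routing` type, the device names, and the `resolveDevice` helper are all hypothetical.

```typescript
// Hypothetical kinds of media that a user might route separately.
type MediaKind = "video" | "captions" | "description" | "sign";

// Maps a media kind to a device identifier; entries are optional.
type Routing = Partial<Record<MediaKind, string>>;

// Case-by-case overrides win over stored defaults, which win over the
// primary device used as a fallback.
function resolveDevice(
  kind: MediaKind,
  defaults: Routing,
  overrides: Routing,
  primaryDevice: string,
): string {
  return overrides[kind] ?? defaults[kind] ?? primaryDevice;
}
```

A user might, for instance, send descriptions to a headset by default and redirect them to the main speakers for a single viewing session.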
Systems supporting secondary devices must:
The following people contributed to the development of this document.
Kazuyuki Ashimura (W3C), Simon Bates, Chris Blouch (AOL), Ben Caldwell (Trace), Charles Chen (Google, Inc.), Christian Cohrs, Dimitar Denev (Fraunhofer Gesellschaft), Donald Evans (AOL), Geoff Freed (Invited Expert, NCAM), Kentarou Fukuda (IBM Corporation), Becky Gibson (IBM), Alfred S. Gilman, Andres Gonzalez (Adobe Systems Inc.), Georgios Grigoriadis (SAP AG), Jeff Grimes (Oracle), Barbara Hartel, John Hrvatin (Microsoft Corporation), Masahiko Kaneko (Microsoft Corporation), Earl Johnson (Sun), Jael Kurz, Diego La Monica (International Webmasters Association / HTML Writers Guild (IWA-HWG)), Gez Lemon (International Webmasters Association / HTML Writers Guild (IWA-HWG)), Aaron Leventhal (IBM Corporation), Alex Li (SAP), Thomas Logan (HiSoftware Inc.), William Loughborough (Invited Expert), Linda Mao (Microsoft), Anders Markussen (Opera Software), Matthew May (Adobe Systems Inc.), Joshue O Connor (Invited Expert), Artur Ortega (Yahoo!, Inc.), Lisa Pappas (Society for Technical Communication (STC)), Dave Pawson (RNIB), David Poehlman, Simon Pieters (Opera Software), Sarah Pulis (Media Access Australia), T.V. Raman (Google, Inc.), Jan Richards (IDRC), Gregory Rosmaita (Invited Expert), Tony Ross (Microsoft Corporation), Martin Schaus (SAP AG), Marc Silbey (Microsoft Corporation), Henri Sivonen (Mozilla), Andi Snow-Weaver (IBM Corporation), Henny Swan (Opera Software), Vitaly Sourikov, Mike Squillace (IBM), Gregg Vanderheiden (Invited Expert, Trace), Ryan Williams (Oracle), Tom Wlodkowski.
This publication has been funded in part with Federal funds from the U.S. Department of Education, National Institute on Disability, Independent Living, and Rehabilitation Research (NIDILRR) under contract number ED-OSE-10-C-0067. The content of this publication does not necessarily reflect the views or policies of the U.S. Department of Education, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.