This document presents the accessibility requirements users with disabilities have with respect to audio and video on the web.
It first provides an introduction to the needs of users with disabilities in relation to audio and video.
Then it explains what alternative content technologies have been developed to help such users gain access to the content of audio and video.
A third section explains how these content technologies fit in the larger picture of accessibility, both technically within a web user agent and from a production process point of view.
This document is most explicitly not a collection of baseline user agent or authoring tool requirements. It is important to recognize that not all user agents (nor all authoring tools) will support all the features discussed in this document. Rather, this document attempts to supply a comprehensive collection of user requirements needed to support media accessibility in the context of HTML5.
Please also note this document is not an inventory of technology currently provided by, or missing from, HTML5 specification drafts. Technology is listed here because it's important for accommodating the alternative access needs of users with disabilities to web-based media. This document is an inventory of Media Accessibility User Requirements.
This document is reasonably stable, and represents a consensus within the Working Group. The Working Group is looking for feedback prior to publication as a Note.
This section provides examples of the accessibility requirements of people with auditory, cognitive, neurological, physical, speech, or visual disabilities to access and comprehend media, including requirements for media formats and media player technologies. For a broader exploration of how people with different disabilities interact with web content and tools, see How People with Disabilities Use the Web.
People who are blind cannot access visual information in videos, player controls, status indicators, etc. They need the information in an alternative representation of audio or text. People who are blind use a screen reader and/or refreshable Braille display, and media content needs to be operable with these assistive technologies (ATs).
People with low vision can use some visual information. Depending on their visual ability they might have specific issues such as difficulty discriminating foreground information from background information, or discriminating colors. Glare caused by excessive scattering in the eye can be a significant challenge, especially for very bright content or surroundings. They may be unable to react quickly to transient information, and may have a narrow angle of view and so may not detect key information presented temporarily where they are not looking, or in text that is moving or scrolling. A person will likely use screen magnification software. This means that they will only be viewing a portion of the screen, and so must manage tracking media content via their AT. They may have difficulty reading when text is too small, has poor background contrast (too high or too low), or when outlined or other fancy font types or effects are used. If the font is an image, it is likely to appear grainy when magnified. They may be using an AT that adjusts all the colors of the screen, such as inverting the colors, so the media content must be viewable through the AT. Users with low vision will often benefit from the same text streams and instructions that are sometimes hidden or displayed off screen for users of screen readers or refreshable Braille.
People with atypical color perception (often called "color blindness") may not be able to discriminate between different colors, or may miss key information when coded with color only, such as colors in media controls and text overlays.
People who are deaf generally cannot use audio. Thus, an alternative representation is required, typically through synchronized captions and/or sign translation.
People who are hard of hearing may be able to use some audio material, but might not be able to discriminate certain types of sound, and may miss any information presented as audio only if it contains frequencies they can't hear, or is masked by background noise or distortion. They may miss audio which is too quiet, or of poor quality. Speech may be challenging if it is too fast and cannot be played back more slowly. Information presented using multichannel audio (e.g., stereo) may not be perceived by people who are deaf in one ear. People with cochlear implants may not have issues with audio volume levels, but comprehension may be challenging if the media experience is overwhelming.
Individuals who are deaf-blind have a combination of conditions that may result in one of the following: blindness and deafness; blindness and difficulty in hearing; low vision and deafness; or low vision and difficulty in hearing. Depending on their combination of conditions, individuals who are deaf-blind may need captions that can be enlarged, changed to high-contrast colors, or otherwise styled; or they may need captions and/or described video that can be presented with AT (e.g., a refreshable Braille display). They may need synchronized captions and/or described video, or they may need a non-time-based transcript which they can read at their own pace.
Some people with physical disabilities such as limited muscle control (including tremors, lack of coordination, and paralysis), pain that impedes movement, or missing limbs cannot use a keyboard or mouse to interact with content and controls. Some use a keyboard but not a pointing device, some use a switch with an on-screen keyboard, and some use other assistive technology. The media player must be usable with only a keyboard, including access to all player controls and methods for selecting alternative content.
Cognitive disabilities include a wide range of conditions that may include intellectual disabilities (called learning disabilities in some regions), autism-spectrum disorders, memory impairments, mental-health disabilities, attention-deficit disorders, audio- and/or visual-perceptive disorders, dyslexia and dyscalculia (called learning disabilities in some regions), or seizure disorders. The accessibility supports for these different conditions vary widely. Individuals with some conditions may process information aurally better than by reading text; therefore, information that is presented as text embedded in a video should also be available as audio descriptions. Individuals with other conditions may need to reduce distractions or flashing in presentations of video. Some conditions, such as autism-spectrum disorders, may have multisystem effects. Individuals may need a combination of different accommodations. Overall, the media experience for people on the autism spectrum should be customizable and well designed so as to not be overwhelming. Care must be taken to present a media experience that focuses on the purpose of the content and provides alternative content in a clear, concise manner.
A number of alternative content types have been developed to help users with sensory disabilities gain access to audio-visual content. This section lists them, explains generally what they are, and provides a number of requirements on each that need to be satisfied with technology developed in HTML5 around the media elements.
Described video contains descriptive narration of key visual elements designed to make visual media accessible to people who are blind or visually impaired. The descriptions include actions, costumes, gestures, scene changes or any other important visual information that someone who cannot see the screen might ordinarily miss. Descriptions are traditionally audio recordings timed and recorded to fit into natural pauses in the program, although they may also briefly obscure the main audio track (see the section on extended descriptions for an alternative approach).
As with captions, descriptions can be open or closed.
Described video provides benefits that reach beyond blind or visually impaired viewers; e.g., students grappling with difficult materials or concepts. Descriptions can be used to give supplemental information about what is on screen—the structure of lengthy mathematical equations or the intricacies of a painting, for example.
Described video is available on some television programs and in many movie theaters in the U.S. and other countries. Regulations in the U.S. and Europe are increasingly focusing on description, especially for television, reflecting its priority with citizens who have visual impairments. The technology needed to deliver and render basic video descriptions is in fact relatively straightforward, being an extension of common audio-processing solutions. Playback products must support multi-audio channels required for description, and any product dealing with broadcast TV content must provide adequate support for descriptions. Descriptions can also provide text that can be indexed and searched.
Systems supporting described video where the descriptions are available as independent file or channel resources must:
Described video that uses text for the description source rather than a recorded voice creates specific requirements.
Text video descriptions (TVDs) are delivered to the client as text and rendered locally by assistive technology such as a screen reader or a Braille device. This can have advantages for screen-reader users who want full control of the preferred voice and speaking rate, or other options to control the speech synthesis.
Text video descriptions are provided as text files containing start times for each description cue. Since the duration that a screen reader takes to read out a description cannot be determined during authoring of the cues, it is difficult to ensure they don't obscure the main audio or other description cues. This is likely to be caused by at least three reasons:
People with low vision may also benefit from having access to text video descriptions.
Systems supporting text video descriptions must:
Video descriptions are usually provided as recorded speech, timed to play in the natural pauses in dialog or narration. In some types of material, however, there is not enough time to present sufficient descriptions. To meet such cases, the concept of extended description was developed. Extended descriptions work by pausing the video and program audio at key moments, playing a longer description than would normally be permitted, and then resuming playback when the description is finished playing. This will naturally extend the timeline of the entire presentation. This procedure has not been possible in broadcast television; however, hard-disk recording and on-demand Internet systems can make this a practical possibility.
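The pause-and-resume behavior described above can be sketched as a small planning step. This is a minimal illustration, not a prescribed algorithm: the `RecordedDescription` and `PlannedPause` shapes and both function names are hypothetical, and it assumes that a description which does not fit its natural pause makes the main content pause for the description's full length.

```typescript
// Hypothetical shapes for illustration; not part of any specification.
interface RecordedDescription {
  startTime: number;      // media time (s) at which the description begins
  speechDuration: number; // length (s) of the recorded description
  gapDuration: number;    // length (s) of the natural pause in the program audio
}

interface PlannedPause {
  atMediaTime: number; // media time (s) at which the main content is paused
  pauseFor: number;    // wall-clock seconds the main content stays paused
}

// A description that fits its natural pause plays alongside the program.
// One that does not fit is treated as extended (an assumption of this sketch):
// the main content pauses at the description's start and resumes once the
// description has finished playing.
function planExtendedPauses(descriptions: RecordedDescription[]): PlannedPause[] {
  const pauses: PlannedPause[] = [];
  for (const d of descriptions) {
    if (d.speechDuration > d.gapDuration) {
      pauses.push({ atMediaTime: d.startTime, pauseFor: d.speechDuration });
    }
  }
  return pauses;
}

// Each pause lengthens the overall presentation by its full duration.
function extendedPresentationDuration(mediaDuration: number, pauses: PlannedPause[]): number {
  let total = mediaDuration;
  for (const p of pauses) {
    total += p.pauseFor;
  }
  return total;
}
```

With a 60-second program, one 3-second description inside a 4-second pause, and one 8-second description inside a 2-second pause, only the second description triggers a pause and the presentation grows to 68 seconds.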
Extended video description (EVD) has been reported to have benefits for cognitive disabilities; for example, it might benefit people with Asperger's Syndrome and other Autistic Spectrum Disorders, in that it can make connections between cause and effect, point out what is important to look at, or explain moods that might otherwise be missed.
Systems supporting extended audio descriptions must:
Because the user is the ultimate arbiter of the rate at which TTS playback occurs, it is not feasible for an author to guarantee that any texted audio description can be played within the natural pauses in dialog or narration of the primary audio resource. Therefore, all texted descriptions must be treated as extended text descriptions potentially requiring the pausing and resumption of primary resource playback.
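To illustrate why the user's speech rate decides whether a pause is needed, the following sketch estimates how long a text description takes to speak. It is a rough approximation under stated assumptions: counting whitespace-separated words and a words-per-minute rate are simplifications, and both function names are hypothetical.

```typescript
// Rough estimate of how long a speech synthesizer needs for a cue.
// Counting whitespace-separated words is a simplification; real engines
// vary with voice, punctuation, and language.
function estimateSpeechSeconds(cueText: string, wordsPerMinute: number): number {
  const trimmed = cueText.trim();
  if (trimmed.length === 0) {
    return 0;
  }
  const wordCount = trimmed.split(/\s+/).length;
  return (wordCount * 60) / wordsPerMinute;
}

// Seconds the primary resource would need to stay paused so that the spoken
// description does not run past the natural pause (0 when it fits).
function requiredPauseSeconds(
  cueText: string,
  gapSeconds: number,
  wordsPerMinute: number
): number {
  const overflow = estimateSpeechSeconds(cueText, wordsPerMinute) - gapSeconds;
  return overflow > 0 ? overflow : 0;
}
```

The same ten-word cue in a 4-second gap needs a 1-second pause at 120 words per minute and none at 300, which is why the author cannot decide in advance whether a description is "extended".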
A relatively recent development in television accessibility is the concept of clean audio, which takes advantage of the increased adoption of multichannel audio. This is primarily aimed at audiences who are hard of hearing, and consists of isolating the audio channel containing the spoken dialog and important non-speech information that can then be amplified or otherwise modified, while other channels containing music or ambient sounds are attenuated.
Using the isolated audio track may make it possible to apply more sophisticated audio processing such as pre-emphasis filters, pitch-shifting, and so on to tailor the audio to the user's needs, since hearing loss is typically frequency-dependent, and the user may have usable hearing in some bands yet none at all in others.
Systems supporting clean audio and multiple audio tracks must:
For people who are deaf or hard-of-hearing, captioning is a prime alternative representation of audio. Captions are in the same language as the main audio track and, in contrast to foreign-language subtitles, render a transcription of dialog or narration as well as important non-speech information, such as sound effects, music, and laughter. Historically, captions have been either closed or open. Closed captions have been transmitted as data along with the video but were not visible until the user elected to turn them on, usually by invoking an on-screen control or menu selection. Open captions have always been visible; they had been merged with the video track and could not be turned off.
Ideally, captions should be a verbatim representation of the audio; however, captions are sometimes edited for various reasons, for example for reading speed or for language level. In general, consumers of captions have expressed that the text should represent exactly what is in the audio track. If edited captions are provided, then they should be clearly marked as such, and the full verbatim version should also be available as an option.
The timing of caption text can coincide with the mouth movement of the speaker (where visible), but this is not strictly necessary. For timing purposes, captions may sometimes precede or extend slightly after the audio they represent. Captioning should also use adequate means to distinguish between speakers as turn-taking occurs during conversation; this has in the past been done by positioning the text near the speaker, by associating different colors to different speakers, or by putting the name and a colon in front of the text line of a speaker.
Captions are useful to a wide array of users in addition to their originally intended audiences. Gyms, bars, and restaurants regularly employ captions as a way for patrons to watch television while in those establishments. People learning to read or learning the language of the country where they live as a second language also benefit from captions: research has shown that captions help reinforce vocabulary and language. Captions can also provide a powerful search capability, allowing users and search engines to search the caption text to locate a specific video or an exact point in a video.
Formats for captions, subtitles or foreign-language subtitles must:
Most of the time, the main audio track would be the best candidate for the timebase. Where a video without audio, but with a text track, is available, the video track becomes the timebase master. Also, there may be situations where an explicit timing track is available.
This should be possible both within media resources and caption formats.
This means that caption cues should be able to either let the start time of the subsequent cue be determined by the duration of the cue or have the end time be implied by the start of the next cue. For overlapping captions, explicit start and end times are then required.
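One way to read the implied-end-time rule is sketched below. The `AuthoredCue` and `ResolvedCue` shapes and the function name are hypothetical, and the choice to end the last cue at the media duration is an assumption of this sketch rather than something the requirement states.

```typescript
// Hypothetical cue shapes for illustration.
interface AuthoredCue {
  start: number; // seconds
  end?: number;  // optional; when omitted it is implied by the next cue
  text: string;
}

interface ResolvedCue {
  start: number;
  end: number;
  text: string;
}

function resolveCueEndTimes(cues: AuthoredCue[], mediaDuration: number): ResolvedCue[] {
  // Sort a copy by start time so that "the next cue" is well defined.
  const sorted = cues.slice().sort((a, b) => a.start - b.start);
  return sorted.map((cue, i) => {
    // Assumption: the last cue runs to the end of the media.
    const impliedEnd = i + 1 < sorted.length ? sorted[i + 1].start : mediaDuration;
    return {
      start: cue.start,
      // An explicit end time always wins; this is what permits overlapping cues.
      end: cue.end !== undefined ? cue.end : impliedEnd,
      text: cue.text,
    };
  });
}
```

A cue with an explicit end of 6 seconds can therefore overlap a following cue that starts at 4 seconds, while cues without an end time simply yield to their successor.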
This means that well-defined character encodings should be supported, either by making the character encoding explicit or by enforcing a single default such as UTF-8.
The minimum requirement is a bounding box (with an optional background) into which text is flowed, and that probably needs to be pixel aligned. The absolute position of text within the bounding box is less critical, although it is important to be able to avoid bad word-breaks and have adequate white space around letters and so on. There is more on this in a separate requirement.
The caption format could provide a min-width/min-height for its bounding box, which typically is calculated from the bottom of the video viewport, but can be placed elsewhere by the web page, with the web page being able to make that box larger and scale the text relatively, too. The positions inside the box should probably be divided into regions, such as top, right, bottom, left, center.
This typically relates to multiple text cues that are defined on overlapping times. If the cues' rendering targets are different spatial regions, they can be displayed simultaneously.
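The condition for simultaneous display can be checked mechanically. The sketch below is illustrative only: the `RegionCue` shape, the string region names, and the function names are assumptions, and it treats two cues as conflicting only when they overlap in time and target the same region.

```typescript
// Hypothetical cue shape; region names such as "top" or "bottom" are examples.
interface RegionCue {
  start: number; // seconds
  end: number;   // seconds
  region: string;
  text: string;
}

// Two cues are active together when their time intervals intersect.
function cuesOverlapInTime(a: RegionCue, b: RegionCue): boolean {
  return a.start < b.end && b.start < a.end;
}

// Returns index pairs of cues that overlap in time AND target the same region.
// Cues that overlap in time but use different regions are not reported,
// because they can be displayed simultaneously.
function findRegionConflicts(cues: RegionCue[]): Array<[number, number]> {
  const conflicts: Array<[number, number]> = [];
  for (let i = 0; i < cues.length; i++) {
    for (let j = i + 1; j < cues.length; j++) {
      if (cues[i].region === cues[j].region && cuesOverlapInTime(cues[i], cues[j])) {
        conflicts.push([i, j]);
      }
    }
  }
  return conflicts;
}
```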
Internationalization is important not just for subtitles, as captions can be used in all languages.
The legibility of the rendered text depends upon the size of the text as perceived by the viewer. This is in turn dependent upon the display size and the distance between the display and the viewer. Users must be able to select an appropriate format for their environment. See also CC-11 below.
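The dependence on display size and viewing distance can be expressed as a visual angle. The sketch below uses the standard geometric relation between an object's height, its distance, and the angle it subtends; the function names are hypothetical, and no particular minimum angle is implied by this document.

```typescript
// Visual angle, in degrees, subtended by text of a given height seen from a
// given distance. Height and distance must use the same unit.
function visualAngleDegrees(textHeight: number, viewingDistance: number): number {
  return (2 * Math.atan(textHeight / (2 * viewingDistance)) * 180) / Math.PI;
}

// Inverse relation: the text height needed to subtend a target angle at a
// given distance, in the same unit as the distance.
function textHeightForAngle(angleDegrees: number, viewingDistance: number): number {
  return 2 * viewingDistance * Math.tan((angleDegrees * Math.PI) / 360);
}
```

Because the required height grows in proportion to the distance, text that is comfortable on a phone held at arm's length may be far too small on a television across a room, even if the television is physically larger.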
A default palette of colors suitable for users with atypical color perception should be available to distinguish editorial concepts such as speakers. There are likely to be conflicting requirements between different users with differing cognitive conditions to maximize the accessibility of content, so full color customization should be available. For example, users with cognitive conditions such as dyslexia (itself an umbrella label for a variety of conditions), ADHD, and Asperger's may find that viewing content that is given a particular color cast, akin to viewing through blue eyeglasses, helps them to read presented text.
While users must have the ability to customize their experience, it is preferable that developers do their best to ensure the legibility of the content they are presenting. For example, the use of Media Queries and alternate style sheets based upon screen size is a common technique for tuning the size and style of fonts used depending on the output device (e.g., a large monitor vs. a small smart-phone screen). Ideally, a combination of techniques such as this along with sensible system-provided defaults will reduce the need for end-users to customize beyond general system settings.
The use of drop shadows is not a suitable general alternative to displaying text on a non-transparent background. For example, white text with drop shadows on a transparent background is not legible over white content (e.g., footage of snow).
The use of drop shadows can increase the sense of 'busyness', and can have negative impacts upon viewers with some cognitive conditions. In general developers will improve text legibility if they avoid the use of drop shadows.
It may be technically possible to have cues without text.
Similarly, in karaoke, individual characters are often "painted on".
The comprehension and appreciation of captions and subtitles depends on how well matched they are 'editorially' to the related video content. In particular the pacing of the content should be reflected in the caption text; for example, a fast-paced drama is likely to benefit from relatively short captions that change more often in comparison to a slow-paced one. In the most extreme case, very fast-changing short subtitles can cause readability problems because they leave viewers too little attention for the video itself; such extremes should be avoided.
Caption/subtitle files that are alternatives in different languages are probably better provided in different caption resources and should be user selectable. Realistically, no more than 2 languages should be present at the same time on the screen.
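User selection among per-language caption resources, capped at two simultaneous languages, could look like the following sketch. The `CaptionResource` shape and function name are hypothetical, and the matching rule (an exact tag match, or a preferred primary subtag such as "fr" matching "fr-CA") is a simplifying assumption rather than a full language-matching algorithm.

```typescript
// Hypothetical description of one caption or subtitle resource.
interface CaptionResource {
  language: string; // language tag, e.g. "en" or "fr-CA"
  label: string;
}

// Picks caption resources in the user's order of preference, never returning
// more than maxSimultaneous entries (two by default, following the note above).
function selectCaptionResources(
  available: CaptionResource[],
  preferredLanguages: string[],
  maxSimultaneous: number = 2
): CaptionResource[] {
  const selected: CaptionResource[] = [];
  for (const preferred of preferredLanguages) {
    if (selected.length >= maxSimultaneous) {
      break;
    }
    const wanted = preferred.toLowerCase();
    for (const resource of available) {
      const lang = resource.language.toLowerCase();
      // Simplified matching: exact tag, or the preference is the primary subtag.
      const matches = lang === wanted || lang.indexOf(wanted + "-") === 0;
      if (matches && selected.indexOf(resource) === -1) {
        selected.push(resource);
        break;
      }
    }
  }
  return selected;
}
```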
Italics markup may be sufficient for a human user, but it is important to be able to mark up languages so that the text can be rendered correctly, since the same Unicode code points can be shared between languages and rendered differently in different contexts. This is mainly a localization issue. It is also important for audio rendering, to get correct pronunciation.
Further, systems that support captions must:
It is desirable to expose the same API to both.
This requires a menu of some sort that displays the available tracks for activation/deactivation.
Edited and verbatim captions may be provided in two separate caption resources. How these differ should be explained to the user.
These different-language "tracks" can be provided in different resources.
Enhanced captions are timed text cues that have been enriched with further information - examples are glossary definitions for acronyms and other initialisms, foreign terms (for example, Latin), jargon or descriptions for other difficult language. They may be age-graded, so that multiple caption tracks are supplied, or the glossary function may be added dynamically through machine lookup.
Glossary information can be added in the normal time allotted for the cue (e.g., as a callout or other overlay), or it might take the form of a hyperlink that, when activated, pauses the main content and allows access to more complete explanatory material.
Such extensions can provide important additional information to the content that will enable or improve the understanding of the main content to users of assistive technology. Enhanced text cues will be particularly useful for those with restricted reading skills, to subtitle users, and to caption users. Users may often come across keywords in text cues that lend themselves to further in-depth information or hyperlinks, such as an e-mail contact or phone number for a person, an unfamiliar term that needs a link to a definition, or an idiom that needs comments to explain it to a foreign-language speaker.
Systems that support enhanced captions must:
Such "metadata" markup can be realized through a title attribute on a <span> of the text, or a hyperlink to another location where a term is explained, an <abbr> element, an <acronym> element, a <dfn> element, or through RDFa or microdata.
This could be realized by including hyperlinks or buttons into timed text cues, where additional overlays could be created or a different page loaded. One needs to deal here with the need to pause the media timeline for reading of the additional information.
This feature is analogous to extended video descriptions - where timing for a text cue is longer than the available time for the cue. It may be necessary to halt the media to allow the user more time to read the text and its additional material. In such cases the pause is dependent on the user's reading speed, so this implies user control or automated timeouts.
This can be a setting in the UA, which will define user-interface behavior.
Sign language shares the same concept as captioning: it presents both speech and non-speech information in an alternative format. Note that due to the wide regional variation in signing systems (e.g., American Sign Language vs British Sign Language), sign translation may not be appropriate for content with a global audience unless localized variants can be made available.
Signing can be open (mixed with the video and offered as an entirely alternative stream) or closed (using some form of picture-in-picture or alpha-blending technology). It is possible to use quite low bit rates for much of the signing track, but it is important that facial, arm, hand and other body gestures be delivered at sufficient resolution to support legibility. Animated avatars may not currently be sufficient as a substitute for human signers, although research continues in this area and it may become practical at some point in the future.
Acknowledging that not all devices will be capable of handling multiple video streams, this is a SHOULD requirement for browsers where hardware is capable of support. Strong authoring guidance for content creators will mitigate situations where user-agents are unable to support multiple video streams (WCAG) - for example, on mobile devices that cannot support multiple streams, authors should be encouraged to offer two versions of the media stream, including one with signed captions burned into the media.
Selecting from multiple tracks for different sign languages should be achieved in the same fashion that multiple caption/subtitle files are handled.
Systems supporting sign language must:
While synchronized captions are generally preferable for people with hearing impairments, for some users they are not viable – those who are deaf-blind, for example, or those with cognitive or reading impairments that make it impossible to follow synchronized captions. And even with ordinary captions, it is possible to miss some information as the captions and the video require two separate loci of attention. The full transcript supports different user needs and is not a replacement for captioning. A transcript can be presented simultaneously with the media material, which can assist slower readers or those who need more time to reference context, but it should also be made available independently of the media.
A full text transcript should include information that would be in both the caption and video description, so that it is a complete representation of the material, as well as containing any interactive options.
Systems supporting transcripts must:
Media elements offer a rich set of interaction possibilities to users. These interaction possibilities must be available to all users, including those that cannot use a pointer device for interaction. Further, these interaction possibilities must be available to all users for all means in which the controls are exposed - no matter whether they are exposed by the user agent, or are scripted. Further, the interaction possibilities need to be rich enough to allow all users fine grained control over media playback.
It is imperative that controls be device independent, so that control may be achieved by keyboard, pointing device, speech, etc.
Systems supporting accessibility for interactive controls must:
This means that all interaction possibilities with media elements need to be keyboard accessible; e.g., through being able to tab onto the play, pause, mute buttons, and to move the playback position from the keyboard.
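As an illustration of keyboard operability, the sketch below maps key presses onto player actions. The key bindings, the 5-second seek step, and the `PlayerState` shape are all hypothetical choices made for this example; only the key strings themselves follow the values that `KeyboardEvent.key` reports in browsers. The state is modeled as plain data so the logic is independent of any particular player.

```typescript
// Hypothetical, minimal model of a media player's state.
interface PlayerState {
  paused: boolean;
  muted: boolean;
  currentTime: number; // seconds
  duration: number;    // seconds
}

// Assumed seek step for the arrow keys; not taken from any specification.
const SEEK_STEP_SECONDS = 5;

// Returns the state that results from pressing a key. Unknown keys leave the
// state unchanged. The bindings here are examples only.
function applyPlayerKey(state: PlayerState, key: string): PlayerState {
  const next: PlayerState = {
    paused: state.paused,
    muted: state.muted,
    currentTime: state.currentTime,
    duration: state.duration,
  };
  switch (key) {
    case " ": // space bar toggles play/pause
      next.paused = !state.paused;
      break;
    case "m": // toggles mute
      next.muted = !state.muted;
      break;
    case "ArrowRight": // seek forward, clamped to the end of the media
      next.currentTime = Math.min(state.duration, state.currentTime + SEEK_STEP_SECONDS);
      break;
    case "ArrowLeft": // seek backward, clamped to the start of the media
      next.currentTime = Math.max(0, state.currentTime - SEEK_STEP_SECONDS);
      break;
    default:
      break;
  }
  return next;
}
```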
This means that the controls content attribute needs to provide an extended set of control functionality including functionality for accessibility users.
This means that new IDL attributes need to be added to the media elements for the extra controls that are accessibility related.
This could be enabled through a context menu, which is keyboard accessible and its keyboard access cannot be turned off.
This is below the level of HTML and means that the accessibility platform needs to be extended to allow access to these controls.
This could be enabled through encouraging publishers to use @autoplay, encouraging UAs to implement accessibility settings that allow turning off all autoplay, and encouraging AT to implement a shortcut key to stop all autoplay on a web page.
As explained in "Content navigation" above, a real-time control mechanism must be provided for adjusting the granularity of the specific structural navigation point next and previous. Users must be able to set the range/scope of next and previous in real time.
While all devices may not support the capability, a standard control API must support the ability to speed up or slow down content presentation without altering audio pitch.
While perhaps unfamiliar to some, this feature has been present on many devices, especially audiobook players, for some 20 years now.
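A rate control of this kind might step through a bounded range, as in the sketch below. The limits, the step size, and the function names are assumptions for illustration. In a browser the resulting value would typically be assigned to a media element's `playbackRate`; whether pitch is preserved is up to the user agent (for example through the `preservesPitch` attribute where it is supported), and is not modeled here.

```typescript
// Assumed limits and step size; real players may offer a different range.
const MIN_PLAYBACK_RATE = 0.5;
const MAX_PLAYBACK_RATE = 2;
const PLAYBACK_RATE_STEP = 0.25;

// Moves the rate one step faster (direction 1) or slower (direction -1),
// keeping it inside the supported range.
function stepPlaybackRate(currentRate: number, direction: 1 | -1): number {
  const next = currentRate + direction * PLAYBACK_RATE_STEP;
  return Math.min(MAX_PLAYBACK_RATE, Math.max(MIN_PLAYBACK_RATE, next));
}

// Wall-clock time needed to play a stretch of media at a given rate.
function playbackSeconds(mediaSeconds: number, rate: number): number {
  return mediaSeconds / rate;
}
```

At half speed, for instance, a 60-second passage of fast speech takes 120 seconds to play, which is the kind of slowdown a listener may need in order to follow it.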
The user can adjust the playback rate of prerecorded time-based media content, such that all of the following are true:
One of the biggest challenges to date has been the lack of a universal system for media access. In response to user requirements various countries and groups have defined systems to provide accessibility, especially captioning for television. However these systems are typically not compatible. In some cases the formats can be inter-converted, but some formats — for example DVD sub-pictures — are image based and are difficult to convert to text.
Caption formats are often geared towards delivery of the media, for example as part of a television broadcast. They are not well suited to the production phases of media creation. Media creators have developed their own internal formats which are more amenable to the editing phase, but to date there has been no common format that allows interchange of this data.
Any media based solution should attempt to reduce as far as possible layers of translation between production and delivery.
In general, captioners use a proprietary workstation to prepare caption files; these can often export to various standard broadcast ingest formats, but in general files are not inter-convertible. Most video editing suites are not set up to preserve captioning, and so this typically has to be added after the final edit is decided on; furthermore, since this work is often outsourced, the copyright holder may not hold the final editable version of the captions. Thus when programming is later re-purposed, e.g. a shorter edit is made, or a ‘director's cut’ produced, the captioning may have to be redone in its entirety. Similarly, and particularly for news footage, parts of the media may go to the web before the final TV edit is made, and thus the captions that are produced for the final TV edit are not available for the web version.
It is important when purchasing or commissioning media that captioning and described video are taken into account and given equal priority in terms of ownership, rights of use, etc., as the video and audio itself.
This is primarily an authoring requirement. It is understood that a common time-stamp format must be declared in HTML5, so that authoring tools can conform to a required output.
Systems supporting accessibility needs for media must:
As described above, individuals need a variety of media (alternative content) in order to perceive and understand the content. The author or some web mechanism provides the alternative content. This alternative content may be part of the original content, embedded within the media container as 'fallback content', or linked from the original content. The user is faced with discovering the availability of alternative content.
Alternative content must be both discoverable by the user, and accessible in device agnostic ways. The development of APIs and user-agent controls should adhere to the following UAAG guidance:
The user agent can facilitate the discovery of alternative content by following these criteria:
This feature can be user configurable to allow maximum flexibility in trading off the anticipated future need for the description against the amount of extra data storage required. A flexible solution giving maximum control to the user would be to provide a global setting with the following options:
Often forgotten in media systems, especially with the newer forms of packaging such as DVD menus and on-screen program guides, is the fact that the user needs to actually get to the content, control its playback, and turn on any required accessibility options. For user agents supporting accessibility APIs implemented for a platform, any media controls need to be connected to that API.
On self-contained products that do not support assistive technology, any menus in the content need to provide information in alternative formats (e.g., talking menus). Products with a separate remote control, or that are self-contained boxes, should ensure the physical design does not block access, and should make accessibility controls, such as the closed-caption toggle, as prominent as the volume or channel controls.
The video viewport plays a particularly important role with respect to alternative-content technologies. Mostly it provides a bounding box for many of the visually represented alternative-content technologies (e.g., captions, hierarchical navigation points, sign language), although some alternative content does not rely on a viewport (e.g., full transcripts, descriptive video).
One key principle to remember when designing player ‘skins’ is that the lower-third of the video may be needed for caption text. Caption consumers rely on being able to make fast eye movements between the captions and the video content. If the captions are in a non-standard place, this may cause viewers to miss information. The use of this area for things such as transport controls, while appealing aesthetically, may lead to accessibility conflicts.
If alternative content has a different height or width than the media content, then the user agent will reflow the (HTML) viewport.
This may create a need to provide an author hint to the web page when embedding alternative content in order to instruct the web page how to render the content: to scale with the media resource, scale independently, or provide a position hint in relation to the media. On small devices where the video takes up the full viewport, only limited rendering choices may be possible, such that the UA may need to override author preferences.
This should be achievable through UA configuration or user-defined JavaScript or CSS that can override styles dynamically in the browser.
This can be achieved by simply zooming into the web page, which will automatically rescale the layout and reflow the content.
This is a user-agent device requirement and should already be addressed in the UAAG. In live content, it may even be possible to adjust camera settings to achieve this requirement. It is also a "SHOULD" level requirement, since it does not account for limitations of various devices.
If there are several types of overlapping overlays (including captions and subtitles), implementations should attempt to ensure that none of them overlaps with editorially important content. In particular, user agents should avoid obscuring video components such as mouths, "burned in text" (embedded captions or other annotations in the main video stream), etc. When in constrained environments where it is impossible to avoid obscuring all of these components, user agents should make every effort to avoid the most important of them. Users typically expect controls to appear at the bottom of the viewport. Controls should not be prevented from becoming usable due to repositioning.
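The "avoid the most important of them" guidance can be sketched as a cost-based choice among candidate overlay positions. Everything named here is hypothetical: the rectangle and `ProtectedArea` shapes, the numeric `importance` weights, and the rule that earlier candidates win ties are assumptions made for the example, not behavior required of user agents.

```typescript
// Rectangle in viewport coordinates (any consistent unit).
interface ViewportRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// An editorially important area, weighted by a hypothetical importance score.
interface ProtectedArea {
  rect: ViewportRect;
  importance: number;
}

function rectsIntersect(a: ViewportRect, b: ViewportRect): boolean {
  return (
    a.x < b.x + b.width &&
    b.x < a.x + a.width &&
    a.y < b.y + b.height &&
    b.y < a.y + a.height
  );
}

// Picks the candidate position that obscures the least total importance.
// Candidates are assumed to be listed in order of preference, so the earliest
// candidate wins ties. Returns undefined when there are no candidates.
function chooseOverlayPosition(
  candidates: ViewportRect[],
  protectedAreas: ProtectedArea[]
): ViewportRect | undefined {
  let best: ViewportRect | undefined = undefined;
  let bestCost = Infinity;
  for (const candidate of candidates) {
    let cost = 0;
    for (const area of protectedAreas) {
      if (rectsIntersect(candidate, area.rect)) {
        cost += area.importance;
      }
    }
    if (cost < bestCost) {
      best = candidate;
      bestCost = cost;
    }
  }
  return best;
}
```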
Multiple secondary user devices must be directly addressable. This functionality is increasingly also known by the new term, "Second Screen," even though there may be more than two screens in any given viewing environment, and even though not all secondary devices are video displays. It must be assumed that many users will have at least one additional display device (such as a tablet), and/or at least one additional audio output device (such as a Bluetooth headset) attached to a primary video display device, an individual computer, or locally addressable on a LAN. It must be possible to configure certain types of media for presentation on specific devices, and these configuration settings must be readily overwritable on a case-by-case basis by users.
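The configuration and case-by-case override described above could be modeled as two lookups with a fallback. This is only a sketch: the `DeviceRouting` map, the media-type keys, the device identifiers, and the function name are all hypothetical.

```typescript
// Hypothetical mapping from a kind of media (e.g. "captions") to a device id.
interface DeviceRouting {
  [mediaType: string]: string;
}

// Resolves where a kind of media should be presented. A per-session override
// set by the user takes precedence over the stored configuration, which in
// turn takes precedence over the primary device.
function resolveOutputDevice(
  mediaType: string,
  configured: DeviceRouting,
  overrides: DeviceRouting,
  primaryDevice: string
): string {
  if (Object.prototype.hasOwnProperty.call(overrides, mediaType)) {
    return overrides[mediaType];
  }
  if (Object.prototype.hasOwnProperty.call(configured, mediaType)) {
    return configured[mediaType];
  }
  return primaryDevice;
}
```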
Systems supporting secondary devices must:
A table-ized version of the requirements in this document is available in the Wiki at http://www.w3.org/WAI/PF/HTML/wiki/Media_Accessibility_Checklist.
The following people contributed to the development of this document.
Participants in the PFWG and HTML-WG's HTML Accessibility Task Force:
Other contributors and commenters:
Additionally, the following W3C groups contributed to, and commented on, this document:
This publication has been funded in part with Federal funds from the U.S. Department of Education, National Institute on Disability and Rehabilitation Research (NIDRR) under contract number ED-OSE-10-C-0067. The content of this publication does not necessarily reflect the views or policies of the U.S. Department of Education, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.