Getting to know Voice

By Jonny Axelsson

July 27th 2011: Please note that Voice only works in Opera on Windows 2000/XP, and we no longer officially support it.

Article update, 4th June 2010: Updates made to Opera support information, and details about Aural CSS specifications.

From outside the traditional browsing world comes a range of techniques that let a developer code speech behaviours much more easily than was previously possible. Opera has supported this since Opera 7.6.

Speech Recognition

Understanding natural speech is hard for a machine, sometimes almost impossible. One approach to make this easier is to train the machine to recognize the user's voice and speech patterns. The user is asked to read some predetermined texts aloud so that the machine becomes accustomed to the user's voice.

This is not practical for web pages, so another approach is used instead: explicit grammars of expected utterances. Grammars have two advantages. First, the machine has a dramatically higher chance of guessing what the user meant. Second, because the accepted values are restricted and known in advance, it is easier to give a sensible response to each of them.
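To illustrate why a restricted grammar helps, here is a minimal Python sketch. It is not how Opera or any speech engine works internally: a real recognizer matches audio, while this stand-in uses string similarity on text. The grammar contents, the function name recognize and the cutoff value are all assumptions made for the example.

```python
from difflib import get_close_matches

# Hypothetical grammar: the only utterances this page accepts.
SIZE_GRAMMAR = ["small", "medium", "large"]


def recognize(heard, grammar, cutoff=0.6):
    """Map a noisy hypothesis to the closest grammar entry.

    Returns None when nothing in the grammar is similar enough,
    which corresponds to a "no match" situation.
    """
    normalized = heard.lower().strip()
    matches = get_close_matches(normalized, grammar, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for attempt in ("medum", "Large ", "banana"):
        print(repr(attempt), "->", recognize(attempt, SIZE_GRAMMAR))
```

Because the result is always either one of the three known values or None, the code that responds to the user only has to handle four cases.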

Text To Speech

The machine talks back. It is possible to get fairly natural machine-generated voices, but this is a trade-off with file size, resource use, and flexibility. The machine voices typically used for web pages would fool nobody, but they are still easy to understand. Text To Speech is fairly common by now, but Opera is the only browser offering aural styling using CSS.


The language or profile Opera uses to combine XHTML and Voice is called XHTML+Voice, or X+V for short. More than for normal XHTML pages, X+V design is primarily about interaction design. You create the storyboard for what the user can do and when. You can make simple web page enhancements, or you can craft elaborate Byzantine labyrinths for the user to get lost in. It is all up to you.

An X+V web page is a normal XHTML page with additional Voice forms in the head. These voice forms cover both speech recognition and text to speech markup. The interaction is event based: you use XML Events to describe which handlers should run in response to different events. An event can for instance be that the page has loaded, that the user clicks on a button, or that the user says something the speech recognizer doesn't understand. The consequence can be a voice monologue or a dialogue with the user. This in turn can for instance trigger a script, reformat the page, or throw another event.
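The structure described above can be sketched by assembling a tiny page with Python's standard ElementTree module. This is only an illustration under stated assumptions: the namespace URIs, the vxml:form and vxml:block elements and the ev:event and ev:handler attributes are written from memory of typical X+V examples and should be checked against the XHTML+Voice Programmer's Guide. The function name build_page and the id sayHello are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as assumed for this sketch; verify against the X+V spec.
XHTML = "http://www.w3.org/1999/xhtml"
VXML = "http://www.w3.org/2001/vxml"
EV = "http://www.w3.org/2001/xml-events"


def build_page(greeting):
    """Return an X+V-style page that speaks `greeting` when it loads."""
    # Register prefixes so the output uses xmlns, xmlns:vxml and xmlns:ev.
    for prefix, uri in (("", XHTML), ("vxml", VXML), ("ev", EV)):
        ET.register_namespace(prefix, uri)

    html = ET.Element(f"{{{XHTML}}}html")
    head = ET.SubElement(html, f"{{{XHTML}}}head")
    ET.SubElement(head, f"{{{XHTML}}}title").text = "Voice greeting"

    # The voice form lives in the head, next to the usual metadata.
    form = ET.SubElement(head, f"{{{VXML}}}form", {"id": "sayHello"})
    ET.SubElement(form, f"{{{VXML}}}block").text = greeting

    # XML Events attributes tie the load event to the voice form.
    body = ET.SubElement(
        html,
        f"{{{XHTML}}}body",
        {f"{{{EV}}}event": "load", f"{{{EV}}}handler": "#sayHello"},
    )
    ET.SubElement(body, f"{{{XHTML}}}p").text = greeting
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    print(build_page("Hello, world"))
```

The point of the sketch is the wiring: the visual content stays in the body, the voice form sits in the head, and an event attribute on an ordinary XHTML element names the form that should handle it.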

Getting started with X+V

X+V in Style
After all, it isn't as much what you say as how you say it. Add styled speech to X+V.
X+V in Action
Do as I say, not as I do. Use X+V to voice-enable JavaScript.
How to Add Voice Interactivity to Your Site
Practical experience of adding voice to a web site
Multimodal FAQ (PDF)
What is this multimodal anyway? An introduction to X+V in question-and-answer form.

Getting to know it

Even though X+V is still cutting edge, there are places you can go for help and more information.

Opera accessibility and voice browsing
Welcome to our forum on everything voice-related.
IBM multimodal site
This IBM site has a large collection of documents on voice and multimodal interaction.
W3C Multimodal Interaction working group
The documents here are quite technical, but this is the place where the future standards are defined. It has a public mailing list. You might also want to look at the HTML and Voice working group pages. Styling speech is done by the CSS working group.

Learning to speak


XHTML+Voice Programmer's Guide (PDF)
The book of X+V, including a reference for every element added to XHTML.
X+V Speech Considerations
The alternatives and trade-offs for text to speech. Why the most natural-sounding voice may not always be the best choice.
Multimodal Application Design Issues (PDF)
Many good tips for how to design good X+V pages.
Developing Multimodal Applications using XHTML+Voice (PDF)
Much the same topics as the above article, but more code oriented.
XML Events tutorial
How to get what you want when something happens. This tutorial explains how XML Events works. Alternatively, you can go directly to the specification.

Speech recognition

Speech recognition is assisted by grammars.

Speech Recognition Grammar Specification
The language to describe what user utterances are accepted.
Semantic Interpretation for Speech Recognition
This lets you specify which part of the recognized speech to use, for instance to voice-enable a JavaScript application.

VoiceXML, the dialog language

Just as HTML is the language for web documents, VoiceXML is the language for voice interaction.

VoiceXML Forum
Industry organization for the promotion of VoiceXML, with some useful information.
VoiceXML 2.0
The specification itself.

Styling speech

CSS3 Speech Module
This specification is incomplete and currently available as a working draft.
CSS 2 Aural CSS
In 1998 CSS2 Aural CSS (ACSS) was specified. CSS2 Aural CSS is intended to be superseded by CSS3 Speech, but since that module is still incomplete (last working draft: 2004), CSS2 Aural CSS is actually more powerful than CSS3 Speech, and is still the spec to use for aural CSS.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.


The forum archive of this article is still available on My Opera.
