Getting to know Voice

By Jonny Axelsson

July 27th 2011: Please note that Voice only works in Opera on Windows 2000/XP, and we no longer officially support it.

Article update, 4th June 2010: Updates made to Opera support information, and details about Aural CSS specifications.

From outside the traditional browsing world comes a range of techniques that let a developer code for speech behaviours much more easily than was previously possible. Opera has supported these techniques since Opera 7.6.

Speech Recognition

Understanding natural speech is hard for a machine, sometimes almost impossible. One approach to make this easier is to train the machine to recognize the user's voice and speech patterns. The user is asked to read some predetermined texts aloud so that the machine becomes accustomed to that voice.

This is not practical for web pages, so another approach is used instead: explicit grammars of expected utterances. Grammars have two advantages. First, the machine has a dramatically higher chance of guessing what the user meant. Second, because the accepted values are restricted, it is easier to give a sensible response to each of the known possible values.
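
One way such a grammar might look is sketched below, using the XML form of the Speech Recognition Grammar Specification. The rule name "drink" and the three accepted words are made up for illustration; they are assumptions, not taken from any real page.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical grammar: the only accepted utterances are
     "coffee", "tea" and "milk". -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="drink">
  <!-- The root rule lists every utterance the recognizer should accept. -->
  <rule id="drink" scope="public">
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <item>milk</item>
    </one-of>
  </rule>
</grammar>
```

With a grammar like this active, the recognizer only has to decide between three words, and the page only has to prepare a response for three possible answers.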

Text To Speech

The machine talks back. It is possible to get fairly natural machine-generated voices, but this is a trade-off with file size, resource use, and flexibility. The machine voices typically used for web pages would fool nobody, but they are still easy to understand. Text To Speech is fairly common by now, but Opera is the only browser offering aural styling using CSS.

XHTML+Voice

The language or profile Opera uses to combine XHTML and Voice is called XHTML+Voice, or X+V for short. More than for normal XHTML pages, X+V design is primarily about interaction design. You create the storyboard for what the user can do and when. You can make simple web page enhancements, or you can craft elaborate Byzantine labyrinths for the user to get lost in. It is all up to you.

An X+V web page is a normal XHTML page with additional Voice forms in the head. These voice forms cover both speech recognition and text to speech markup. The interaction is event based: you use XML Events to describe the consequences, or handlers, of different events. An event can for instance be that the page has loaded, that the user clicks on a button, or that the user says something the speech recognizer doesn't understand. The consequence can be a voice monolog or a dialog with the user. This in turn can for instance trigger a script, reformat the page, or throw another event.
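
As a rough sketch of that structure, assuming the customary X+V namespace prefixes vxml and ev, a page that speaks a greeting once it has loaded could look like this. The "sayHello" id and the page text are hypothetical, and the DOCTYPE declaration is left out for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal illustrative X+V page; DOCTYPE omitted for brevity. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xml:lang="en-US">
  <head>
    <title>Hello, Voice</title>
    <!-- The voice form lives in the head. This one only speaks. -->
    <vxml:form id="sayHello">
      <vxml:block>Hello, and welcome to this page.</vxml:block>
    </vxml:form>
  </head>
  <!-- XML Events: when the load event fires, run the voice form above. -->
  <body ev:event="load" ev:handler="#sayHello">
    <p>Hello, and welcome to this page.</p>
  </body>
</html>
```

The ev:event and ev:handler attributes on the body element are the XML Events part: they tie the load event to the voice form whose id is "sayHello". The visible paragraph is plain XHTML and is unaffected by the voice markup.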

Getting started with X+V

X+V in Style: After all, it isn't as much what you say as how you say it. Add styled speech to X+V.
X+V in Action: Do as I say, not as I do. Use X+V to voice-enable JavaScript.
How to Add Voice Interactivity to Your Site: Practical experience of adding voice to a web site.

Multimodal FAQ (PDF): What is this multimodal anyway? An introduction to X+V in question-and-answer form.

Getting to know it

Even though X+V is still cutting edge, there are places you can go for help and more information.

Opera accessibility and voice browsing: Welcome to our forum on everything voice-related.
IBM multimodal site: This IBM site has a large collection of documents on voice and multimodal interaction.
W3C Multimodal Interaction working group: The documents here are quite technical, but this is the place where the future standards are defined. It has a public mailing list. You might also want to look at the HTML and Voice working group pages. Styling speech is done by the CSS working group.

Learning to speak

XHTML+Voice

XHTML+Voice Programmer's Guide (PDF): The book of X+V, including a reference for every element added to XHTML.
X+V Speech Considerations: The alternatives and trade-offs for text to speech. Why the most natural-sounding voice may not always be the best choice.
Multimodal Application Design Issues (PDF): Many good tips for how to design good X+V pages.
Developing Multimodal Applications using XHTML+Voice (PDF): Much the same topics as the above article, but more code oriented.
XML Events tutorial: How to get what you want when something happens. This tutorial explains how XML Events works. Alternatively you can go directly to the specification.

Speech recognition

Speech recognition is assisted by grammars.

Speech Recognition Grammar Specification: The language to describe what user utterances are accepted.
Semantic Interpretation for Speech Recognition: This lets you specify which part of the recognized speech to use, for instance to voice-enable a JavaScript application.

VoiceXML, the dialog language

Just as HTML is the language for web documents, VoiceXML is the language for voice interaction.

VoiceXML Forum: Industry organization for the promotion of VoiceXML, with some useful information.
VoiceXML 2.0: The specification itself.

Styling speech

CSS3 Speech Module: This specification is incomplete and currently available as a working draft.
CSS 2 Aural CSS: CSS2 Aural CSS (ACSS) was specified in 1998. It is intended to be superseded by CSS3 Speech, but since that module is still incomplete (last working draft: 2004), CSS2 Aural CSS is actually more powerful than CSS3 Speech, and is still the spec to use for aural CSS.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
