XHTML+Voice By Example

By Jonny Axelsson

July 27th 2011: Please note that Voice only works in Opera on Windows 2000/XP, and we no longer officially support it.

This article assumes that you already have Voice installed and working on your computer. For more on what XHTML+Voice (X+V) is all about, read our Getting to Know X+V article.

Hello World!

You can make an X+V browser say "Hello World" with the 'block' element, like this:

<block>Hello World!</block>

The full web page will look like this:

<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN"
"http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd"> [1]
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"> [2]
<head>
  <title>Example 1: "Hello, World"</title>
  <form xmlns="http://www.w3.org/2001/vxml" id="sayHello"> [3]
    <block>Hello World!</block> [4]
  </form>
</head>
<body ev:event="load" ev:handler="#sayHello"> [5]
  <h1>"Hello World!" example</h1>
  <p>If your browser is voice enabled, you will hear it say "Hello, world!".</p>
</body>
</html>

  • [1] The DOCTYPE describes the type of document this is. It isn't necessary for voice processing, but is necessary for the document to be valid. Also see DOCTYPE sniffing.
  • [2] XHTML is the default namespace, and XML Events has prefix "ev".
  • [3] This is the voice form, the element that contains voice code. It sets the default namespace to VoiceXML, so everything contained in the voice form is VoiceXML by default. It has an id ("sayHello") so that other elements can refer to it.
  • [4] This is the code that actually says "Hello World".
  • [5] This is what triggers the voice form. When the body element receives the "load" event (i.e. when the page has loaded) the handler with id sayHello is activated.

Try it in action.
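The trigger in [5] does not have to be the "load" event. The following is a minimal sketch, not taken from the original example, that attaches the same handler to a paragraph instead of the body. It assumes that the browser delivers an ordinary "click" event to the voice handler in the same way it delivers "load"; the title and the paragraph text are made up for illustration.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
<head>
  <title>Hello World on request</title>
  <!-- Same voice form as in the example above; only the trigger changes. -->
  <form xmlns="http://www.w3.org/2001/vxml" id="sayHello">
    <block>Hello World!</block>
  </form>
</head>
<body>
  <h1>"Hello World!" on request</h1>
  <!-- Assumption: a "click" on this paragraph activates the handler,
       just as "load" does on the body element in the example above. -->
  <p ev:event="click" ev:handler="#sayHello">Click here to hear the greeting.</p>
</body>
</html>
```

With this variant the page stays silent until the user asks for the greeting, which may be preferable when a page has more than one voice form.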

Make it listen

So now you can have the browser talk to you. It is far more exciting when you can talk back to it, but this can also be more challenging. It isn't too hard to make the browser listen; it is more difficult to make it understand. The application will not behave more intelligently than you code it to. You need to specify what to listen for in the user's chatter, and give good hints about what is expected. The examples in this article are very literal-minded and do not give much leeway for creative user responses, but it isn't too hard to make them more flexible.

Using your options

When you want to restrict the choices in HTML, you can use a select element. To give the user the choice between "one" and "two", and nothing else, you can code:

<select name="example">
  <option>one</option>
  <option>two</option>
</select>

You can use the mouse or the Tab key to activate the select box and the arrow keys to choose one of the options.

In a voice form the field element fulfils a similar role. To present the same choice in voice, you code:

<field name="example">
  <option>one</option>
  <option>two</option>
</field>

The following example will ask the user for the name of the best browser (Opera of course), handle mismatches, and give an inspired lecture at the end.

<field name="browser">
  <prompt>What is the name of the best browser?</prompt> [1]
  <option>Opera</option> [2]
  <nomatch>Try again.</nomatch> [3]
  <filled>Yes, that's a fact. Opera is the best browser, full of wonderful
    features.</filled> [4]
</field>

  • [1] The spoken texts are called prompts. They are similar to the p elements in XHTML.
  • [2] The option elements give the choices the user has. In this case the only value for the best browser is "Opera".
  • [3] If the user says anything that doesn't match "Opera", they will be asked to try again (and again and again). If no nomatch is set, the standard phrase (normally "Sorry. I did not understand") will be used instead.
  • [4] The filled element will be executed when a match (i.e. "Opera") has occurred.

Try it in action.
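The field above only reacts to a wrong answer. VoiceXML also defines a noinput element for silence, and a count attribute that lets a handler change its wording on repeated failures. The following is a sketch of how the same field might use them, assuming the browser's voice engine supports both; the hint texts are invented for illustration. Like the example above, it is a fragment that belongs inside a voice form.

```xml
<field name="browser">
  <prompt>What is the name of the best browser?</prompt>
  <option>Opera</option>
  <!-- Spoken when the user stays silent. -->
  <noinput>I did not hear anything. Please name a browser.</noinput>
  <!-- count="1" handles the first mismatch; count="2" the second and later ones. -->
  <nomatch count="1">Try again.</nomatch>
  <nomatch count="2">Here is a hint: it starts with the letter O.</nomatch>
  <filled>Yes, that's a fact. Opera is the best browser.</filled>
</field>
```

Escalating hints like these are one way to give the user the "good hints about what is expected" that a voice dialogue needs.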

Going for grammar

The collection of accepted responses is called a grammar (the collection of options above is also a grammar). At its simplest, it is only a collection of alternatives, e.g. <fruit> = apple | orange | slime mold. The voice recognition version of Hello World, coffee-tea-milk, will politely ask you if you want coffee, tea, or milk, and (in this version) refuse to give it to you afterwards.

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:ev="http://www.w3.org/2001/xml-events">
<head>
  <title>Example 3: Drink dispenser</title>
  <vxml xmlns="http://www.w3.org/2001/vxml" id="drinkform"> [1]
    <field name="drink">
      <prompt>Would you like coffee, tea, or milk?</prompt>
      <grammar><![CDATA[    [2]
        #JSGF V1.0;
        grammar drinks;
        public <drinks> = coffee | tea | milk; [3]
      ]]></grammar> [2]
      <filled> [4]
        <block>Sorry, I'm out of <value expr="drink"/>.</block> [5]
      </filled>
    </field>
  </vxml>
</head>
<body ev:event="load" ev:handler="#drinkform"> [1]
<h1>Example 3: Drink dispenser</h1>
<p>Our drink dispenser can offer you a wide choice of refreshing drinks.</p>
</body>
</html>

  • [1] This voice form will be triggered when the document has loaded.
  • [2] Grammars can have characters like "<" that are normally treated as markup in XML (in this case, the start of a tag), but content inside <![CDATA[ and ]]> is taken literally and not processed as markup. If you write <![CDATA[<b>bold</b>]]> in the source code, it should be displayed as plain text "<b>bold</b>", not as "bold".
  • [3] This is the grammar: you can choose between coffee, tea, and milk. The "|" vertical bar means "or".
  • [4] The 'filled' element is a conditional element: it is entered when the field has received a value.
  • [5] This is where it refuses to serve you anything.

Try it in action.
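Because the field variable holds the word that was matched, the filled element can also branch on it. The following sketch is a hypothetical variation on the drink dispenser, not part of the original example. It uses the VoiceXML if and else elements so that the dispenser serves milk and refuses everything else, and assumes it replaces the filled element inside the "drink" field.

```xml
<filled>
  <!-- "drink" holds the word the grammar matched. -->
  <if cond="drink == 'milk'">
    <prompt>Here is your milk.</prompt>
  <else/>
    <prompt>Sorry, I'm out of <value expr="drink"/>.</prompt>
  </if>
</filled>
```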

The grammar could in this case just as well be expressed using option, like this:
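A minimal sketch of that option-based version, assuming the same field name and prompt as in the drink dispenser and the VoiceXML namespace inherited from the enclosing voice form:

```xml
<field name="drink">
  <prompt>Would you like coffee, tea, or milk?</prompt>
  <!-- Each option is one accepted answer; together they form the grammar. -->
  <option>coffee</option>
  <option>tea</option>
  <option>milk</option>
  <filled>
    <prompt>Sorry, I'm out of <value expr="drink"/>.</prompt>
  </filled>
</field>
```

The three option elements play the same part as the single JSGF rule, so no CDATA section is needed.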


I don't understand what you are saying, but I can pretend

The advantage of grammar over option is that you can handle more natural language this way. Here is a more advanced grammar example:

<form xmlns="http://www.w3.org/2001/vxml" id="command">
  <field name="commandInterpreter">
    <grammar><![CDATA[
      #JSGF V1.0;
      grammar command;
      public <command> = [I want to] <action> {the_action = $action} [1]
                         <object> {the_object = $object}
                         [with <instrument> {the_instrument = $instrument}];
      <action>         = watch | shut down | surprise | control | buy | hide | ignore; [2]
      <object>         = [the|a] tv | [the|a] phone | [the|my] neighbor | Opera | my boss;
      <instrument>     = [the] remote control | [my famous] wit | [a] stick |
                         [a] camera | [an] [expired] credit card | a wet blanket;
    ]]></grammar>
    <prompt>Give me a command</prompt>
    <nomatch>I refuse to do that.</nomatch>
    <filled> <!-- Give feedback --> [3]
      Why do you want to <value expr="the_action"/> the poor <value expr="the_object"/>?
    </filled>
  </field>
</form>

  • [1] Phrases in [brackets], like "I want to", are optional. <action> refers to the action rule further down; the variable the_action is set to $action, which is a special variable containing whatever the <action> rule has returned.
  • [2] The <action> rule lets you pick one out of a set of verbs, much like option would.
  • [3] The value element refers to the values set in [1]. These values can also be used when scripting X+V.

This grammar would accept utterances like:

  • watch tv
  • I want to surprise my neighbor with my famous wit
  • buy Opera with an expired credit card
  • I want to hide the phone
  • shut down tv with remote control
  • ignore my boss

Try it in action.

The grammar can easily be modified. Try using different verbs and nouns for the <action>, <object>, and <instrument> rules.
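As one possible modification, the sketch below keeps the public <command> rule unchanged and swaps in a different vocabulary. All the words in the three lower rules are invented for illustration, and the grammar element is assumed to replace the one inside the commandInterpreter field.

```xml
<grammar><![CDATA[
  #JSGF V1.0;
  grammar command;
  // Hypothetical replacement vocabulary; the public rule is the same as before.
  public <command> = [I want to] <action> {the_action = $action}
                     <object> {the_object = $object}
                     [with <instrument> {the_instrument = $instrument}];
  <action>     = feed | paint | tickle;
  <object>     = [the|my] cat | [the|a] fence | [my] little brother;
  <instrument> = [a] feather | [a] brush | [some] leftovers;
]]></grammar>
```

With these rules the field should accept utterances such as "tickle my cat with a feather" or "I want to paint the fence", while the prompt and feedback in the rest of the field stay the same.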

This example may be more advanced than you would need, but it does show some of the things that grammars allow you to do. To learn more about this see the grammar links in Getting to Know X+V.


This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.

