VoiceXML Gives a Voice to Web Applications

"Stock quotes."

"OK stock quotes. Say a company name."

"Apple"

"Apple Computer. Down .35 to 21.30."

"Sony"

"Boeing. Down .01 to 39.89."

"So-Neeee"

"Sony ADR. Up .98 to 48.9"

....

"Traffic"

"OK, traffic. Say a city and state..."

"San Francisco, California."

"The San Francisco metro area...say the name of a major road."

"Highway 101"

"OK, Highway 101. Currently there are four incidents reported. Highway 101 at 85, there's a disabled vehicle blocking the center lane, affecting the northbound direction....."

I had this telephone conversation recently. A woman's voice would give me a prompt, and in response to my reply, would pass on some useful information. She wasn't the most creative conversationalist I'd ever spoken with, but she mostly understood what I said, and in any case, was unfailingly polite. "She," of course, was a computer, a demonstration of voice portal technology developed by Tellme Networks using the VoiceXML 2.0 markup language.

The idea behind VoiceXML is to provide a standard way of building voice interfaces, primarily to Web applications. While these applications are not as smart as a live operator, they are easier to navigate and less aggravating than the traditional "phone tree," where the interface is the 12-key touchpad. "Voice recognition systems can provide high quality service for the vast majority of callers," says Jeff Kunins, Tellme's director of corporate and product marketing. "It's not a 100 percent solution, but it is not supposed to be. And there are advantages over live operators: you're never put on hold and you can obtain information in the middle of the night."

Of course, the Web itself never puts you on hold and is also open all hours, but that's the whole point of VoiceXML: to add a voice interface to existing, text-based Web applications. "VoiceXML enables the back-end integration of the business rules, databases and other elements that companies have invested in the Web," says Kunins. "It's a better alternative to maintaining separate telecom and web groups---who both hate each other and require redundant staff and infrastructure."

VoiceXML is significant not because voice is the best or most convenient interface to the Web, but because the phone remains the universal communication medium. Wireless text remains a rarity among mobile users, while the cellular telephone, and with it voice, is approaching the must-carry status of a wallet and keys.

In Japan, voice access has even more potential because the language takes longer to key in than English. The Japanese version of the initiative is overseen by the VoiceXML WG [working group] under the XML Consortium Japan. Yoshihisa Shinozaki, a sales manager with Motorola Japan's Commercial Government Industry Solution Sector, notes that voice application development has paralleled the U.S., with flight reservation and help desk applications both in use. He says that Japanese drivers are already familiar with talking computers through the use of car navigation systems, some of which also provide voice recognition.

Ironically, one of the most popular cell phone applications in Japan, e-mail, may not catch on as a voice application because of train and bus authorities prohibiting cell phone use. "Because of this situation, some believe that voice-driven cellular phone applications will not become popular in Japan," he writes. "On the other hand, in areas where people use the automobile, hands-free voice applications may indeed catch on." And beyond the car? Shinozaki thinks the kitchen is another possibility. Like driving, cooking requires a hands-free interface.

Roots at AT&T Bell Labs

VoiceXML dates back to a 1995 project called Phone Web that began at AT&T's Bell Laboratories. Researchers Dave Ladd, Chris Ramming, Ken Rehor, and Curt Tuckey were looking at how telephone applications and the Internet could be linked. The following year, AT&T spun off Lucent Technologies as a hardware company, which took most of Bell Labs with it. The breakup broke up the team, with Ramming remaining at AT&T, Ladd going to Lucent, and Tuckey and Rehor to Motorola. The project divided up, as well, and as a result, three different voice markup languages were produced, including Lucent's TelePortal and Motorola's VoxML. A fourth language, SpeechML, was developed by IBM.

If there were ever a situation ripe for a common standard, this was it---hence the VoiceXML Forum. In August 1999, the group produced VoiceXML 0.9, a markup language that incorporated the best features of the previous languages. After community feedback, VoiceXML 1.0 was submitted to the World Wide Web Consortium (W3C) for consideration in March 2000. The W3C Voice Browser Working Group produced VoiceXML 2.0, which has been on the web since last October at www.voicexml.org/spec. AT&T, Lucent, Motorola and IBM remain the sponsoring members, and they have been joined by some 600 other companies.

Notably missing from this group is Microsoft, which, along with Intel, Cisco, Philips and others, is backing a somewhat competing initiative called Speech Application Language Tags (SALT). SALT is less mature, however, and VoiceXML is generally believed to have a running start.

Some of SALT's backers are supporting both. "We're agnostic to the two standards: both run on the Intel architecture," says Tim Moynihan, director of product marketing in Intel's Telecommunications and Embedded Group. The group was formed after Intel acquired the New Jersey-based Dialogic Corp. in 1999. SALT claims it will better support "multimodal" applications, that is, access by devices other than the telephone. "Speech is a more natural way of interacting with a PDA [personal digital assistant]," Moynihan says, "even though the output may still be text. Consider a real estate agent driving around with a wireless PDA. In response to a spoken address, the PDA screen would show details about a property for sale."

So far, however, neither initiative has the clear edge in multimodal support. Supporters of VoiceXML are only beginning to talk about multimodal extensions, while SALT is a much younger spec with no working applications. At this writing, its supporters have yet to submit it to a standards body like the W3C---an important step if SALT is to be considered more than just a Microsoft technology masquerading as a standard.

Developing VoiceXML applications

A VoiceXML application is composed of "dialogs," each of which represents a given point in the conversation where the system expects voice input from the user. A form dialog assigns a value to one or more variables, while a menu dialog offers the user a set of choices, and the selection determines which dialog comes next. In a "machine-directed" application, the active vocabulary at any moment is limited to that of the current dialog. In a "mixed initiative" application, some dialogs may be active throughout, enabling a user to shortcut the process.
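The difference between a machine-directed and a mixed-initiative application can be sketched in a few lines of Python. This is only an illustration of the routing idea, not VoiceXML itself: the dialog names, the phrase tables and the `next_dialog` function are all hypothetical, and typed text stands in for recognized speech.

```python
# Hypothetical dialog table: each dialog lists the phrases it listens for
# and the dialog each phrase leads to.
DIALOGS = {
    "main_menu": {"stock quotes": "stocks", "traffic": "traffic"},
    "stocks": {"apple": "quote_apple", "sony": "quote_sony"},
    "traffic": {"san francisco": "sf_roads"},
}

# Phrases that stay active in every dialog of a mixed-initiative application
# (an assumption made for this sketch).
GLOBAL_SHORTCUTS = {"stock quotes": "stocks", "traffic": "traffic"}


def next_dialog(current, utterance, mixed_initiative=False):
    """Return the name of the dialog to route to after the caller speaks."""
    phrase = utterance.strip().lower()
    local = DIALOGS[current]
    if phrase in local:
        # The phrase belongs to the current dialog's own vocabulary.
        return local[phrase]
    if mixed_initiative and phrase in GLOBAL_SHORTCUTS:
        # The caller took a shortcut to another part of the application.
        return GLOBAL_SHORTCUTS[phrase]
    # Not understood here: stay put so the application can re-prompt.
    return current
```

In this sketch, saying "traffic" while in the stock dialog is ignored in the machine-directed case, but jumps straight to the traffic dialog when `mixed_initiative=True`.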

Motorola engineer Jim Ferrans says that application development is a four-step process. The first step is to work out the call flow---what questions you want to ask, how you want the call to proceed, what information you want to give out. "At the high end, a human factors specialist will come up with an application design that resembles a flow chart," Ferrans says. "That chart represents a state machine for the conversation. Every conversational state asks certain questions and gets certain answers back, with the goal being to fill in one or more fields with information. Once the state is completed, you transition to the next state."
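The state-machine view of a call flow that Ferrans describes might look something like the following Python sketch. It is a toy model under stated assumptions: the state names, the field names and the `run_call` helper are invented for illustration, and a list of scripted answers replaces a live caller.

```python
# Hypothetical call flow for a traffic report: each state fills one or more
# fields, then hands off to the next state (None ends the call).
CALL_FLOW = {
    "ask_city": {"fields": ["city", "state"], "next": "ask_road"},
    "ask_road": {"fields": ["road"], "next": None},
}


def run_call(answers, start="ask_city"):
    """Walk the call flow, consuming one scripted answer per field."""
    replies = iter(answers)
    filled = {}
    visited = []
    state = start
    while state is not None:
        visited.append(state)
        # A state is complete once every one of its fields has a value.
        for name in CALL_FLOW[state]["fields"]:
            filled[name] = next(replies)
        state = CALL_FLOW[state]["next"]
    return visited, filled
```

Calling `run_call(["San Francisco", "California", "Highway 101"])` visits both states in order and returns the three filled fields. A real design would also need states for errors and re-prompts, which this sketch leaves out.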

The second step is to write the VoiceXML code, based on the flow chart. The third step, carried out around the same time, is to develop the speech grammars, which tell the speech recognizer what voice input to expect. "That job is often performed by a specialist, who listens to real users saying real phrases. You might expect a user to say 'vanilla,' but in fact, people might say 'I'd like vanilla, please.' Or people might use slang words or a shortened form. You have to capture these variants." The final step is tuning: checking that the speech grammars are correct and that the questions elicit the responses you are expecting.
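To make the idea of capturing variants concrete, here is a rough Python sketch that maps several phrasings onto one canonical answer. It works on text rather than audio, and everything in it is an assumption for illustration: the `recognize` function, the filler phrases, and the shortened form "choc" do not come from any real speech grammar.

```python
import re

# Hypothetical variants for each canonical answer, including a shortened form.
SYNONYMS = {
    "vanilla": ["vanilla"],
    "chocolate": ["chocolate", "choc"],
    "strawberry": ["strawberry"],
}
# Polite filler that callers tend to wrap around the word that matters.
PREFIXES = ["i'd like", "i want", "give me"]
SUFFIXES = ["please"]


def recognize(utterance):
    """Return the canonical flavor for an utterance, or None if no match."""
    # Lowercase, drop punctuation, and collapse whitespace.
    text = " ".join(re.sub(r"[^a-z' ]", "", utterance.lower()).split())
    for prefix in PREFIXES:
        if text.startswith(prefix + " "):
            text = text[len(prefix) + 1:]
            break
    for suffix in SUFFIXES:
        if text.endswith(" " + suffix):
            text = text[:-(len(suffix) + 1)]
            break
    for canonical, variants in SYNONYMS.items():
        if text in variants:
            return canonical
    return None
```

Here `recognize("I'd like vanilla, please.")` and `recognize("vanilla")` both resolve to the same answer, while a flavor outside the list returns `None`, which is the cue for the application to re-prompt.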

Tellme's Kunins argues that companies should invest the resources to do these applications right---they are, after all, a reflection of the company and its brand. Applications should expect the unexpected, with graceful error recovery that takes into account "not just what you want people to say, but what they'll say anyway." If callers phoning into a travel site keep saying "hotel," the site either needs to add that service, or at least acknowledge the request: "I'm sorry, we don't provide hotel listings."

Audio quality is also important. "People hate inauthentic speech. It's not only annoying, but hard to understand." Text-to-speech (TTS) systems have jokingly been called "drunken robots" because of the tipsy yet mechanical sound of the synthesized voice. A newer form, called waveform concatenation speech synthesis, assembles words from a library of prerecorded waveforms. The Tellme voice sounds pretty realistic, though I'm not sure you'd want to ask her out on a date.

Voice recognition---the ability to decipher what callers are saying---has also come far in the last five years, though it is not perfect. In my dialog with the Tellme demo, the application mistook "Sony" for "Boeing," although it got it right the second time. Kunins notes, however, that human beings also slip up, especially when conversing with a caller on a cell phone. In any case, the number of words a system can understand has grown large. In the U.S., Tellme powers the toll-free 800 number lookup, which gets 200 million "hits" a year. You say the company name, it tells you the number, sometimes presenting sub-options in the process. This ability to recognize company names throughout North America is a remarkable feat, especially considering that callers have different pronunciations. By contrast, PC voice recognition programs must be trained to understand a single voice, and that input is generally done with a microphone connected directly to the computer's sound board.

People are likely to give voice applications that provide information a lot of leeway. After all, if I'm driving in my car, I may be grateful for any guidance on highway conditions, or movie showtimes, or telephone numbers---whether it comes from a human being or a computer. More complex transactions, including many purchases, will take more programming talent. Here in the U.S., for example, it is hard to imagine a computer replacing the knowledgeable phone representatives of the outdoor clothing manufacturer L.L. Bean. Just try asking a computer whether the color of that "heather gray" sweater is closer to taupe or charcoal. But even in less nuanced conversations, people will expect more from a voice application. When clicking on a keyboard, I'm constantly reminded that I'm dealing with a machine interface. But when I'm in a conversation, I'm apt to forget that the pleasant sounding voice has a silicon heart.

Sidebar: An interview with Motorola's Jim Ferrans

In 1998, Jim Ferrans was working at a not-so-successful startup when he read an article on VoxML, Motorola's first voice markup language. As it happened, the development work was proceeding nearby, and so Ferrans paid a visit to Motorola's facility in Downers Grove, Illinois, near Chicago, where he "begged and pleaded" to get into the VoxML group. His timing was good---work was just proceeding to merge VoxML and three other voice markup languages (from AT&T, Lucent and IBM) into a single standard. Ferrans became one of the handful of people to work on the VoiceXML Forum standardization before it was submitted to the W3C, and has been involved ever since.

Today, Ferrans' business card says simply that he's a Motorola Distinguished Member of the Technical Staff. At his day job, he works on implementing Motorola's VoiceXML platform, with special emphasis on multimodal dialogs.

Voice recognition has been thought of primarily as a PC application. How is it different over the phone?

Phones typically have more background noise, especially with a cellular connection, and of course the line quality is a factor. And while PC applications must be "trained" to recognize each individual, phone applications must work for any caller. So rather than allowing for unconstrained voice input, you have to tell the speech recognizer that, at a given point, it's only listening for 'chocolate,' 'vanilla' or 'strawberry.' These limited vocabulary lists are called "speech recognition grammars," and include the kind of things the speech recognizer should expect to hear. The speech recognizer also includes a dictionary that includes the different pronunciations of a given word.

But speaker-independent speech recognition over the phone can be surprisingly broad. Fidelity Investments has a system that asks what mutual fund you are interested in. It recognizes any of hundreds of funds. I did a funds transfer a while ago and the system was so accurate, it was scary. Even as a researcher in this field, I found it a bit unnerving that there was no human at the other end of the line.

Will grammars always be limited?

Right now, the assumption is that you must cue the recognizer into what is being said. As speech recognition gets better, we'll be able to make a semantic analysis of what the user said, to come up with information from open-ended inputs. That's on the horizon.

Are mobile phones the prime area for application development?

Mobile phones are certainly fertile ground---you often want to have your hands free. But many of the IVR [interactive voice response] systems typically accessed by landlines are also good candidates. Legacy IVR systems typically use proprietary hardware and programming languages. With VoiceXML, the applications are separated from the voice services; they're put on separate boxes. That means the voice services can be done by a specialist company, while the applications can live on ordinary Web servers, anywhere. So you get a huge productivity boost in terms of application development, because the applications are served off the Web, built using traditional Web programming models and with Web development tools. And VoiceXML applications can be accessed from any voice server that conforms to the standard.

Another area of development is in information services, like those on the Tellme voice portal. You could imagine, for example, a world weather service where you don't have to press the country code; you just say the name. By eliminating the cumbersome menus, VoiceXML could make such applications much easier to use.

Will you be able to do anything as intricate as book a hotel?

Certainly. At Motorola, for example, we've done voice interfaces to flight reservation applications, allowing callers to find flight times and purchase tickets over the phone.

An admirable part of the VoiceXML spec is its attempt, at least, to keep conversations interruptible. It lets the caller break in.

VoiceXML conversations are interruptible in a couple of senses. While the machine is talking to you, you can barge in with the right answer---because you've been through this dialog before and know what's coming. The other way users can be put in control is what we call "mixed initiative," a step toward a more open conversation. The application can ask the user for his airline and flight number, and the user can respond instead: "I'd like to make a car reservation." And the system will jump to that dialog instead. It's not easy to design a mixed initiative application, but if you put some careful thought into it, you can make it very usable.

What else will help the spec succeed?

I think the key issue is how easy it makes voice applications to write. A computer hobbyist can put together a voice application, then tell a voice service where the application resides, and have the service execute his application over the phone. I heard of a guy who, in the midst of a family health crisis, put up a voice site that he updated periodically. It wasn't a sophisticated application, but it was written almost spur of the moment.

Programming has gotten easier because you no longer have to worry about the details of speech recognition and interacting with the phone. Where voice applications once required special purpose C and Java programs, you now just write a little markup page. The language gets more complex if you need to do complex things, but the basic stuff---reading an announcement or simple question and response---is really easy. Ease-of-use is the main benefit of VoiceXML.

What kind of an industry does VoiceXML create?

VoiceXML breaks the connection between the telephony platform, the speech recognition software and the application software. That means I can write an application that doesn't depend on a particular voice server. I can shop around---for the best speech recognizer, text-to-speech engine, or simply, the best technology for the price. Switching vendors, in theory at least, is as simple as forwarding your toll-free number to a different vendor's voice server, as long as both vendors have faithfully implemented the VoiceXML specification.

Do you expect most applications will be built using an external voice portal or will companies grow their own?

Both. At first glance, voice portals would seem the obvious way to go. But just as with other web applications, some companies will want to control their own voice servers, for security, or to present a common "look and feel", or to guarantee a certain responsiveness.

So I assume VoiceXML development tools will play a role here.

We're seeing some already. One of the companies Motorola does business with is called Voxeo, which has a flow-chart editor for describing how you want the user to proceed. You write the flow chart, add information, press a button and get VoiceXML code. I haven't used it personally, but it has gotten some good reviews.

How does VoiceXML compare with SALT?

SALT is Microsoft's attempt to get into the space and control the standard---at least that's how some might interpret it. Talking Web pages means that Microsoft can sell more software and Intel can sell more processing power. Philips and SpeechWorks develop speech recognition technology, hoping for increased sales---though it may open them up to increased competition from Microsoft.

SALT is not as mature, but aren't you at a disadvantage without Microsoft behind you?

We do have over 600 companies in the VoiceXML Forum, with a lot of VoiceXML interest out there. The standard has been out for two years, and has gotten strong traction in telecommunications, the interactive voice response industry, voice portals, and other places. So for pure voice, VoiceXML will probably win out.

SALT's emphasis is on "multimodal" applications: voice and XHTML working together. SALT does not build on VoiceXML. The Microsoft ideal, from what I can tell, is a fairly thick client---a desktop PC or high-end handheld---accessing a Web page, enabling users to interact by voice, as well as keyboard. But at Motorola, while we're very interested in the multimodal aspects of dialogs, we are hoping that some sort of VoiceXML/XHTML combination will take root. Whether SALT or VoiceXML will dominate here remains to be seen. W3C is starting a multimodal working group in February [2002]. IBM and Opera, the European browser company, have made a multimodal language proposal that uses VoiceXML and XHTML. At Motorola, we liked it so much that we co-sponsored the proposal. It could be an interesting standards tussle.

What's next for the VoiceXML spec?

While we're still polishing 2.0, the real changes are done. It will become a de jure standard, or at least a de facto one. We're going to start talking about 3.0 in the next working group.

Is it settled enough to provide a stable platform?

Definitely. It's reached a point where people trust it.