In-Depth

Multiple modes

One of the most exciting areas of innovation in speech tech today centers on the concept of ''multimodality.'' Multimodal applications give users a choice of input modes, generally including voice, keypad, keyboard, mouse and stylus. Output takes the form of spoken prompts, audio and/or graphical displays.

Multimodality is a way of enhancing the usability of applications, said Sandeep Sibal, CTO at Kirusa, Berkeley Heights, N.J. Kirusa was founded in 2001 specifically to develop multimodal solutions for mobile applications.

''Just like the graphical user interface revolutionized the way people used their PCs,'' Sibal said, ''I think multimodality brings in a fundamental shift in how we understand user interfaces and, for the first time, begins to combine two very dissimilar interfaces: the GUI and the voice user interface, or VUI.''

A VUI (pronounced ''vooey'') is the speech equivalent of a GUI, typically found on a PDA or smart phone. It is more sophisticated than a traditional interactive voice response (IVR) system, accepting a far wider range of spoken commands than a simple ''yes'' or ''no.''

Combining these two interfaces is an extremely complex exercise, said Sibal, because they differ fundamentally: GUIs express themselves in two-dimensional space, while VUIs express themselves over time.

A typical multimodal architecture keeps speech recognition and synthesis on the server side; the processing and memory demands of speech engines are usually more than small client devices can handle.

There are two types of multimodality: sequential and simultaneous. In sequential multimodality, users switch between interfaces. ''The notion here is that you are using only one interface at a given instant,'' Sibal said, ''but in a single session you might go back and forth.''
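The voice-only leg of a sequential session is typically described in a dialog markup language such as VoiceXML, which comes up again later in this story. The form below is a minimal illustrative sketch, not Kirusa's actual markup; the grammar file, field name and submit URL are assumptions.

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <!-- One voice-only exchange; in a sequential session, the GUI takes over in a later leg -->
      <form id="getDestination">
        <field name="city">
          <prompt>Which city do you want directions to?</prompt>
          <!-- Illustrative SRGS grammar listing the phrases the recognizer should accept -->
          <grammar src="cities.grxml" type="application/srgs+xml"/>
          <filled>
            <!-- Hand the recognized value back to the application server (URL is an assumption) -->
            <submit next="http://example.com/directions" namelist="city"/>
          </filled>
        </field>
      </form>
    </vxml>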

In simultaneous multimodality, both interfaces are active at the same time. Users can pull up a map, say ''How do I get to here?'' and tap the destination on the screen. ''It's very natural to do it this way,'' said Sibal. ''It's just not something that apps do today.

''With today's devices, many of which are not able to keep both the speech and GUI active simultaneously, you can start off with sequential and then move on to simultaneous as the devices that allow you to do that become available,'' he added.
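The map scenario Sibal describes is roughly what SALT, a speech markup covered later in this story, is designed to express: speech tags embedded alongside ordinary HTML elements, so a tap and an utterance can act on the same page. The snippet below is a minimal sketch under that assumption; the grammar file, element names and click handler are illustrative and are not taken from any Kirusa product.

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
    <body>
      <!-- Ordinary GUI elements: a map image and a text field for the destination -->
      <!-- Assumes a SALT-capable browser that exposes the listen object's Start() method to script -->
      <img id="map" src="map.png" onclick="askDestination.Start()"/>
      <input id="destination" type="text"/>

      <!-- Speech input, started when the user taps the map, so tap and talk work together -->
      <salt:listen id="askDestination">
        <!-- Illustrative SRGS grammar covering phrases like "How do I get to here?" -->
        <salt:grammar src="directions.grxml"/>
        <!-- Copy the recognized destination from the result into the GUI's text field -->
        <salt:bind targetelement="destination" value="//destination"/>
      </salt:listen>
    </body>
    </html>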

Kirusa's flagship product, the Kirusa Multimodal Platform (KMMP), currently supports sequential multimodality, but Sibal said upcoming versions will also support the simultaneous mode.

The company's early products were built on its own markup languages based on VoiceXML, but Kirusa was also an early supporter of SALT (Speech Application Language Tags).

''I think the SALT forum's initiative helped us not just in terms of coming up with some kind of standard for representing multimodal applications, but also in terms of getting the awareness of multimodality raised in the community,'' Sibal said. ''Cool technology alone is not enough. You have to evangelize the stuff. When you have a bunch of industry heavyweights like Microsoft doing it for you... well, it's just what this industry needs.''

See the following related stories:
Giving applications a voice, by John K. Waters
Talking speech tech, by John K. Waters
Speech specs, by John K. Waters

About the Author

John K. Waters is a freelance writer based in Silicon Valley. He can be reached at [email protected].