Multiple modes
- By John K. Waters
- December 1, 2002
One of the most exciting areas of innovation in speech tech today centers on
the concept of ''multimodality.'' Multimodal applications give users a choice
of input modes, generally including voice, keypad, keyboard, mouse and stylus.
Output takes the form of spoken prompts, audio and/or graphical displays.
Multimodality is a way of enhancing the usability of applications, said
Sandeep Sibal, CTO at Kirusa, Berkeley Heights, N.J. Founded in 2001, Kirusa
focuses specifically on developing multimodal solutions for mobile
applications.
''Just like the graphical user interface revolutionized the way people used
their PCs,'' Sibal said, ''I think multimodality brings in a fundamental shift in
how we understand user interfaces and, for the first time, begins to combine two
very dissimilar interfaces: the GUI and the voice user interface, or VUI.''
A VUI (pronounced ''vooey'') is the speech equivalent of a GUI, typically
residing on a PDA or smart phone. It is more sophisticated than an interactive
voice response (IVR) system, and offers a wider range of commands than simply
''yes'' or ''no.''
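For illustration only (this is not code from Kirusa's platform), a simple VUI
dialog is often authored in a markup language such as VoiceXML, which comes up
again later in this story. In the sketch below the field name and grammar file
are hypothetical; the network-side voice browser speaks the prompt and then
listens for any phrase the grammar allows, rather than a bare yes or no:

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="directions">
        <field name="destination">
          <!-- Speak the question, then listen for a match against the grammar. -->
          <prompt>Where would you like to go?</prompt>
          <grammar src="destinations.grxml" type="application/srgs+xml"/>
          <filled>
            <!-- Echo the recognized value back to the caller. -->
            <prompt>Getting directions to <value expr="destination"/>.</prompt>
          </filled>
        </field>
      </form>
    </vxml>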
Combining these two interfaces is an extremely complex exercise, said Sibal,
because they differ fundamentally: GUIs present information in two-dimensional
space, while VUIs present it over time.
Multimodal architectures typically keep speech recognition and synthesis on
the server side, because the memory demands of speech are usually more than
small client devices can handle.
There are two types of multimodality: sequential and simultaneous. In
sequential multimodality, users can switch between interfaces. ''The notion
here is that you are using only one interface at a given instant,'' Sibal
said, ''but in a single session you might go back and forth.''
In simultaneous multimodality, both interfaces are active at the same time.
Users can click on a map, say ''How do I get to here?'' and then tap the
destination on the screen. ''It's very natural to do it this way,'' said Sibal.
''It's just not something that apps do today.
''With today's devices, many of which are not able to keep both the speech and
GUI active simultaneously, you can start off with sequential and then move on to
simultaneous as the devices that allow you to do that become available,'' he
added.
Kirusa's flagship product, the Kirusa Multimodal Platform (KMMP), currently
supports sequential multimodality, but Sibal said upcoming versions will also
support the simultaneous mode.
Although the company's early products were built on its own languages, which
were based on VoiceXML, Kirusa was also an early supporter of SALT (Speech
Application Language Tags).
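A rough sketch, loosely based on published SALT examples rather than on
Kirusa's own products, shows the idea: speech tags embedded in an ordinary
HTML page let a spoken answer fill the same field a keypad or stylus would,
much like the ''tap and talk'' map scenario Sibal describes above. The element
IDs and grammar file here are hypothetical:

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
      <body>
        <!-- Ordinary GUI input: the user can type or tap a destination. -->
        <input type="text" id="txtDestination"/>

        <!-- Tapping the button starts recognition ("tap and talk"). -->
        <input type="button" value="Talk" onclick="recoDestination.Start()"/>

        <!-- Spoken prompt played to the user. -->
        <salt:prompt id="askDestination">Where would you like to go?</salt:prompt>

        <!-- Listen for speech and bind the recognized result
             into the same field the keypad or stylus would fill. -->
        <salt:listen id="recoDestination">
          <salt:grammar src="destinations.grxml"/>
          <salt:bind targetelement="txtDestination" value="//destination"/>
        </salt:listen>
      </body>
    </html>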
''I think the SALT forum's initiative helped us not just in terms of coming up
with some kind of standard for representing multimodal applications, but also in
terms of getting the awareness of multimodality raised in the community,'' Sibal
said. ''Cool technology alone is not enough. You have to evangelize the stuff.
When you have a bunch of industry heavyweights like Microsoft doing it for
you... well, it's just what this industry needs.''
See the following related stories:
- Giving applications a voice, by John K. Waters
- Talking speech tech, by John K. Waters
- Speech specs, by John K. Waters
About the Author
John K. Waters is a freelance writer based in Silicon Valley. He can be reached
at [email protected].