Giving applications a voice
By John K. Waters
When the talk turns to speech -- that is, the evolving array of technologies
for computer-voice interactions -- the consensus seems to be that ''we're not
there yet.'' There, say industry watchers, is the point at which computers can
handle actual conversations with human beings.
''People expect Hal from 2001: A Space
Odyssey,'' said Michael Gartenberg, research director at Jupiter Research,
a New York City-based research firm. ''They expect the computer on the starship
. They expect those kinds of
natural interactions. And when they don't get them -- when they're hit by the
limitations of the current state of the technology -- there's a sense of
But not being there does not mean there is nothing here. Although it has yet
to meet our science-fiction-inspired expectations, speech technology has
improved so rapidly over the past few years that a growing number of name-brand
vendors are planting flags in the market, and industry heavyweights like
Microsoft and Philips are planning to embed speech into almost everything.
Despite the economic downturn, barely a week goes by without the announcement of
a new product, specification or enterprise-changing implementation. From call
centers to handhelds to corporate portals, the technologies that allow computers
to respond to the human voice have come a long way. And according to Meta
Group's Earl Perkins, IT managers ignore it at their peril.
''Enterprise users who don't recognize the value of some of the more
mainstream capabilities [of speech technologies] that exist today are actually
jeopardizing their competitive advantage,'' said Perkins, senior program director
of global networking strategies at the Stamford, Conn.-based research firm.
''Just look at the airlines. Any airline that doesn't have speech recognition
today is at a pronounced competitive disadvantage to any airline that does. Flat
Perkins' airline example points to one of speech technology's greatest
success stories. The travel and tourism industry has been an early adopter of
speech technology, and successful implementations of available applications in
that sector abound.
''The travel industry had terrible self-service call completion rates,''
explained Bern Elliot, research director at Gartner Inc. ''The information is
often very complex to enter, and it was hard for callers to accomplish their
transaction with touch-tone responses. We're talking about things like city of
origin or destination. Just try to spell 'Philadelphia' on your touch-tone phone
or distinguish between 'Washington, D.C.' and 'Washington state.' Then you have
date and time of travel, class of travel, and a bunch of different times to
choose from. But it was all within a fairly limited range, so speech recognition
could really perform in this instance.''
In an effort to solve these problems, United Airlines launched a
speech-enabled, 24x7 flight information system in the fall of 1999. Running on a
platform from InterVoice-Brite Inc., and utilizing speech-recognition technology
from SpeechWorks International Inc., the system allowed callers to speak their
request into the phone to find flight arrivals, departures and gate information
for all United flights. According to United, in its first 30 months of
operation, the system handled more than 50 million calls.
According to Gartner's Elliot, implementations of similar speech-recognition
systems have resulted in significant improvements in call completion rates
throughout the travel and tourism industry. ''Everybody got it,'' he said. ''And
they have achieved their objectives in terms of ROI, customer satisfaction and
operational efficiency. In many cases, they even exceeded their goals.''
United's return on its investment in speech technology is impressive.
According to Steve Chambers, chief marketing officer at Boston-based SpeechWorks
International Inc., the system saved the company $24 million. ''I don't know many
other technologies that can claim those kinds of numbers,'' Chambers said. ''It's
true that we're 'not there yet' and this industry has a ways to go, but that's
not stopping us from delivering solid technology today.''
The speech or the speaker?
The system implemented by United
employed speech-recognition software, which should not be confused with
speaker-recognition technologies. (See Talking speech tech.) Speech
recognition refers to applications and systems that 'understand' words spoken by
anyone -- in other words, the language itself. It takes the form of a range of
applications, from shrink-wrapped dictation programs that live on a desktop to
sophisticated business apps that allow customers to interact with a computer
over the telephone.
Speech recognition may be said to have two constituent qualities: ''accuracy''
and ''fluency.'' Accuracy refers to the system's ability to identify each word a
user speaks. Fluency refers to the depth of a system's vocabulary and its
There are three types of speech recognition applications: Command systems
recognize a few hundred words and eliminate the need for a mouse or keyboard.
Discrete voice-recognition systems are used for dictation, but users must pause
between words. Continuous voice-recognition systems are designed to understand
natural speech without pauses.
Speaker recognition (sometimes called voice authentication) refers to systems
with the ability to distinguish and confirm the identity of the individual
speaking to it. Speaker recognition can be further subdivided into speaker
identification, which determines which registered speaker provides a given
utterance from among a set of known speakers; and speaker verification, which
accepts or rejects the identity claim of a speaker.
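The difference between identification and verification can be sketched in a few lines of Python. This is only an illustration under loud assumptions: voiceprints are treated as plain lists of numbers, cosine similarity stands in for whatever scoring a real product uses, and the names `similarity`, `identify`, `verify` and the 0.8 threshold are all hypothetical.

```python
import math
from typing import Dict, List, Optional

Voiceprint = List[float]  # assumed representation of an enrolled voice


def similarity(a: Voiceprint, b: Voiceprint) -> float:
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(sample: Voiceprint, enrolled: Dict[str, Voiceprint]) -> Optional[str]:
    """Identification: pick the closest match from the set of known speakers."""
    if not enrolled:
        return None
    return max(enrolled, key=lambda name: similarity(sample, enrolled[name]))


def verify(
    sample: Voiceprint,
    claimed: str,
    enrolled: Dict[str, Voiceprint],
    threshold: float = 0.8,  # assumed cutoff, chosen only for illustration
) -> bool:
    """Verification: accept or reject one identity claim."""
    if claimed not in enrolled:
        return False
    return similarity(sample, enrolled[claimed]) >= threshold
```

Identification answers "which of the known speakers is this?" and always compares against the whole set, while verification checks a single claim against a threshold, which is the shape a password-reset line would need.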
One of the most common applications of speaker-recognition technology has
been to free call center agents from an annoying task: password reset.
''You cannot appreciate the volume of password resetting that is going on,''
said Gartner's Elliot. ''Everyone has forgotten a password at least once. The
more you have, the more you're likely to forget. And nowadays, people have lots
Automating similarly routine calls is also one of the key benefits of an
effective speech-recognition system. Giga Information Group analyst Elizabeth
Herrell calls that objective the current sweet spot for that technology.
''A lot of the speech-recognition applications have been deployed to offer
automated services, replace agents and eliminate some of the touch-tone commands
on telephones,'' she said. ''That's where there is high demand today. Most
companies want to implement more automated services. They're finding out that,
even with an interactive voice-response system, which can be a very
cost-effective tool, people are opting out and going back to the operator
because they want to speak to someone. They're finding that, by using speech
technology, they can get a much higher percentage of users to stay on the line
and complete their transactions without going directly to an agent.''
Interactive voice-response systems (IVRs) -- which speak to callers, who then
respond with the touch-tone keypad -- have been around a long time. According to
Jupiter Research's Gartenberg, the most commonly deployed
speech-recognition technologies are overlaid on these systems.
''The goal is to create these types of voice-enabled applications so that
people can navigate that voice menu from hell without having to constantly punch
in numbers,'' he said. ''It's not really speech interaction, but rather enabling
speech into some of these applications to maximize customer satisfaction, which
can be very useful in the enterprise.''
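A rough sketch of what overlaying speech on a touch-tone menu amounts to: the same menu option is reachable by a keypad digit or by the spoken word. The menu contents and the names `MENU`, `INPUT_TO_OPTION` and `route` are invented for illustration, and the spoken input is assumed to arrive already transcribed as text.

```python
from typing import Optional

# Hypothetical bank menu: option number -> action name.
MENU = {1: "transfer_funds", 2: "recent_transactions", 3: "current_balance"}

# Each option accepts either the keypad digit or the spoken word.
INPUT_TO_OPTION = {
    "1": 1, "one": 1,
    "2": 2, "two": 2,
    "3": 3, "three": 3,
}


def route(caller_input: str) -> Optional[str]:
    """Map a keypad press or a recognized word to a menu action."""
    token = caller_input.strip().lower()
    option = INPUT_TO_OPTION.get(token)
    if option is None:
        return None  # unrecognized input; a real system would re-prompt
    return MENU[option]
```

With this mapping, `route("3")` and `route("Three")` reach the same action, which is the "press or say" behavior callers now meet on many phone lines.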
Speech recognition is also finding a place in the unified messaging space, as
growing numbers of firms explore services that make faxes, e-mail and voice mail
accessible via phone-based voice commands. And an entire industry appears to be
growing around the loosely defined concept of a ''voice portal'' or ''vortal.''
''Essentially,'' said Meta Group's Perkins, ''it's the idea of offering a
variety of [Web-based] services -- financial reporting, traffic reports, sports
scores, entertainment -- on a speech-enabled platform accessible from a
''Voice portal'' is a buzzword in need of an adjective, said Gartner's Elliot.
Consumer voice portals are interfaces for consumer information -- newsletters,
sports and stocks -- typically offered by service providers. Enterprise voice
portals provide an integrated telephony interface to a wide range of enterprise
applications and information.
SpeechWorks' Chambers agrees that automating routine calls in these types of
systems is an important application of speech recognition. But he cautions that
it is not about replacing people, but freeing agents for more complex tasks.
''People should definitely be looking at speech to cut costs,'' Chambers said,
''but also to increase agent yields. When you deploy speech, your agents are
freed up for the cross sell and the upsell. They become much more effective.
Also, you're handling the routine calls that used to [annoy] the agents, so
those agents are less likely to turn over.''
Cutting agent turnover is no small thing. According to a Gartner survey, call
centers in the U.S. experienced an average 30% turnover rate in 2000 and
It is fair to say that currently available
speech technologies are already so widespread that we are beginning to take them
for granted. Dial toll-free directory assistance, and a TellMe application asks
you to ''Say the full name of the listing you want.'' Call your bank, and you can
''Press or say three'' to hear your current balance.
Yet for most IT organizations, speech is still a whole new ballgame. IT
managers thinking about implementing speech in their organizations have a lot to
consider. The solution itself consists of three basic elements: the speech
engine, which is the software that processes spoken input or produces speech
output; the platform, such as an IVR system or a voice portal; and the specific
application, which may be pre-built or custom-developed.
Gartner's Elliot advises IT managers to consider all three components in
parallel. ''Sometimes companies get blocked on which one is more important,'' he
said, ''but all three areas have tools that need to work together.''
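One way to picture how the three elements fit together is the toy sketch below. It is not any vendor's architecture: `EchoEngine`, `Platform` and `flight_info_app` are hypothetical stand-ins, the "audio" is plain text, and the flight data is invented.

```python
from typing import Callable


class EchoEngine:
    """Stand-in speech engine: treats 'audio' as already-transcribed text."""

    def recognize(self, audio: str) -> str:
        return audio.strip().lower()

    def synthesize(self, text: str) -> str:
        return f"<spoken>{text}</spoken>"


class Platform:
    """Stand-in platform (an IVR, say) that connects engine and application."""

    def __init__(self, engine: EchoEngine, application: Callable[[str], str]) -> None:
        self.engine = engine
        self.application = application

    def handle_call(self, audio: str) -> str:
        request = self.engine.recognize(audio)   # spoken input -> text
        reply = self.application(request)        # business logic
        return self.engine.synthesize(reply)     # text -> speech output


def flight_info_app(request: str) -> str:
    """Stand-in application: pre-built or custom business logic."""
    flights = {"flight 100": "on time at gate B4"}  # invented sample data
    return flights.get(request, "Sorry, I did not understand.")
```

Each call passes through all three pieces, so a weakness in any one of them shows up to the caller.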
Many enterprises prefer to utilize internal resources to develop their own
speech apps, but speech app development is a highly specialized discipline. In
the words of Sunil Soares, director of product management at IBM's Pervasive
Computing Division, it is ''tricky stuff.''
''Believe it or not,'' Soares explained, ''there are many ways that you can say
even something as simple as 'yes' and 'no.' There's 'yes,' 'yeah,' 'okay,' 'all
right,' 'no,' 'nah' and even more. And it gets worse. I'm writing a stock quote
application, for example, and let's say I know that it's going to have multiple
stocks, maybe 15,000 that are regularly traded in the U.S. And let's say that
[some] of them will be IBM. The grammar for IBM could be 'IBM,' 'International
Business Machines' or 'Big Blue.' I need the ability to write that grammar. If
I'm adding Microsoft, 'Microsoft' doesn't exist in the dictionary, so the engine
has to be able to recognize a specific pronunciation.''
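The synonym problem Soares describes can be sketched as a lookup from spoken phrases to canonical values. This is a minimal illustration in Python, not VoiceXML or any vendor's grammar format. The names `GRAMMAR` and `interpret` are hypothetical, and a real engine would match audio against pronunciations rather than text.

```python
from typing import Optional

# Hypothetical grammar: each canonical value lists the phrases that map to it.
GRAMMAR = {
    "yes": ["yes", "yeah", "okay", "all right"],
    "no": ["no", "nah"],
    "IBM": ["ibm", "international business machines", "big blue"],
}

# Invert the grammar once so each lookup is a single dictionary access.
PHRASE_TO_VALUE = {
    phrase: value for value, phrases in GRAMMAR.items() for phrase in phrases
}


def interpret(utterance: str) -> Optional[str]:
    """Return the canonical value for a recognized phrase, or None."""
    normalized = " ".join(utterance.lower().split())
    return PHRASE_TO_VALUE.get(normalized)
```

Here `interpret("Big Blue")` and `interpret("International Business Machines")` both resolve to the same stock. Multiply that by the 15,000 stocks Soares mentions and the size of the grammar-writing job becomes clear.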
Gartner's Elliot strongly advises IT organizations with no experience in
speech to seek outside expertise, at least at the front end of a project, to
bring their development teams up to speed. ''We recommend against having your
touch-tone app developers start right in on speech recognition,'' he said. ''Bring
in some people with speech recognition experience and then, over time, you'll
grow into it.''
Fortunately -- or unfortunately -- there are lots of places to look for that
level of expertise. The speech-tech market is currently populated by a wide
range of large and small companies, from speech-engine providers to IVR system
vendors to text-to-speech application developers. Giga Information Group's
Herrell advises IT managers to consider the advantages of working with the
''I see such a mix of players in this market,'' she said, ''everything from big
telecoms down to three-man shops of kids just out of school who understand the
technology. But you'll want to work with the leading vendors. There are some
good new technologies out there, and some very promising ones, but the early
market leaders are the ones who have already gone through a lot of the rocky
The list of speech-tech vendors is a long one, and growing, but a few names
stand out in certain segments. Nuance, SpeechWorks, Philips Speech Processing
and IBM top many lists of the leading speech-engine vendors. Intervoice, Nortel,
Avaya, Edify and Syntellect are among the leading IVR vendors. TellMe, BeVocal,
VoiceGenie and Voxeo are among the best-known enterprise voice portal vendors.
This is by no means a definitive list, and the product offerings change as the
Bringing in outside expertise is a useful strategy, but
Meta Group's Perkins said most organizations getting into speech should not
actually pass off the job to outsiders before they have spent some time with
their own hands in the guts of the technology.
''It's very, very early in the game,'' he said, ''too early
to consider outsourcing. You don't want to hire someone to develop, deploy and run your speech
solutions for you until you understand the issues facing you in your enterprise
environment. You can't really outsource what you don't know the true cost of or the
true value of.''
Talking to customers
In a nutshell, the key challenge for
enterprise users implementing a speech solution is to cut costs without cutting
service levels. ''You want to take very good care of your customers, obviously,''
said Giga Information Group's Herrell. ''And the way you do that is by
eliminating the routine interactions that could be very easily put into a voice
application. But if you make it harder for them, you'll lose them.''
Jupiter Research's Gartenberg agrees: ''The consumer expectation here is
simple: It's got to be 100% accurate 100% of the time. Keep in mind, a 90%
accuracy rate still means that one out of every 10 times the system gets
something wrong. Customers get very frustrated very quickly with that [kind of]
A logical approach, then, is to try out an app internally before launching it
for public consumption. That is what United did. Before its flight information
system was exposed to a single customer, the company worked out the bugs in a
speech-enabled employee reservation system application provided by
''You want to make sure that there's a proof of concept in house to kick the
tires on the technology before rolling it out to customers,'' said IBM's Soares.
''I don't think that's specific to voice technologies. For anything, we recommend
you do a proof of concept before rolling it out to a broader audience. There are
a lot of challenges associated with this technology. And there's a fair amount
of testing you have to do.''
Meta Group's Perkins recommends that you ''test, test and test mercilessly''
before undertaking any public-facing deployments. ''One of the downsides of
deploying a major voice application in an environment with lots of customers is
that if you screw it up, you don't get a second chance,'' he said. ''It's not the
same thing as a Web site not being available for an hour or so. If you happen to
create 7,000 or 8,000 disgruntled customers who cannot interact properly with
the application you deploy, they won't do business with you again.''
Jupiter Research's Gartenberg advises using a variety of people of all ages
and genders, speaking as many different dialects as possible, as part of the
testing methodology. ''The whole goal here is to decrease the consumer's time on
hold, and to increase the customer's satisfaction,'' he said. ''You're trying to
help the users get to the information they want and to get to it quickly.
Anything that creates an inverse experience is going to cost you.''
SpeechWorks' Chambers agrees that organizations should do some tire-kicking
before deploying speech apps for their customers, but he is not convinced that
an internal pre-deployment deployment is necessary. He points out that, although
the industry itself may be characterized as young, current deployments have been
racking up some serious mileage.
''I can understand starting with an internal version, as United did, if the
economics are there,'' he said. ''But in many ways, I think we're a little past
that point. Three or four years ago, people still wanted a little more proof of
concept. Today, given our customer list and the millions of calls we take a day,
that kind of proof isn't called for. It may be that the best application for a
particular enterprise is an internal app. If it makes financial sense, that's
great. But it's certainly not the case that people are afraid to launch
publicly, and therefore want to start internally.''
''Application identification'' is the critical phase in the implementation of a
speech strategy, Chambers said. His company utilizes a process that pinpoints
what he calls the critical speech factors. That process begins with a vision
that is focused on customers.
''You should focus on them completely,'' Chambers said. ''Ask yourself why you
want a speech strategy. It's probably because you want to increase your
company's financial health and in a way that actually is more customer-friendly.
Touch-tone increased the financial health of many companies, but in a less
customer-friendly way. Speech can increase your corporate health and satisfy
callers in a very, very friendly way.''
IBM's Soares suggests approaching a speech
implementation holistically -- in other words, do not think of speech tech as a
silo. ''You want to think about your entire channel strategy,'' he said. ''How
are you going to offer access to this application using the Web, the voice
channel, the wireless channel? Then think about multimodal [see Multiple modes].
Even if you decide not to
implement a given channel, you definitely want to understand how your
application is architected to support multiple channels and modalities.''
Killer tools and the killer app
According to market researchers
at Scottsdale, Ariz.-based In-Stat/MDR, the speech-recognition industry has
continued to grow, even through the economic ugliness of 2001. The enterprise
''continued to embrace speech recognition products ... as a means of lowering
costs, providing greater customer service, and launching new, innovative,
products,'' the firm wrote in a recent report. ''We are just on the cusp of a
speech-recognition revolution,'' senior analyst Brian Strachman stated in the
Worldwide spending on speech recognition will reach $41 billion by 2005,
according to market researchers at the Kelsey Group, and In-Stat/MDR analysts
expect the use of standards such as VoiceXML and SALT to ''open up vast new
markets and opportunities.''
Other growth factors include continuing increases in processor speeds, which
will allow for greater speech-recognition vocabulary size and accuracy; greater
emphasis on measuring ROI in a tight economy and voice recognition's ability to
demonstrate it; an increase in the number of wireless subscribers; and increased
regulation in the use of wireless handsets while operating a car.
The expanding population of mobile workers in this country and their use of
untethered computing devices is increasing the potential desirability of
non-keyboard input in a host of environments, from a realtor's car to a
lineman's telephone pole. And internationally, millions without access to PCs
carry cell phones and a hankering for information.
''You can't go anywhere on the globe without seeing a cell phone,'' said Giga
Information Group's Herrell. ''They're everywhere. The U.S. hardly touches the
number [of cell phone users] in Europe and Asia. But PC access over there is
much more limited than it is here. You have this huge population needing
information, and their access devices are cell phones with tiny keypads.''
Meta Group's Perkins agrees: ''The evolution of speech technology is going
hand in hand with the evolution of mobility technology. From a programming
perspective, you need to consider the fact that you have 1.5 billion clients.
You've got the Windows 2000 and XP clients, but guess what, you've also got
Nokia, Ericsson and Motorola clients. So you need to think of the apps that you
deploy, not merely in terms of clients who will be sitting in front of a PC, but
also the millions of clients with cell phones in their hands.''
Perhaps the most important driver in this market is the emergence of
developer tools and the resulting increase in production speed and lower costs.
Microsoft's .NET Speech SDK, in particular, is likely to push speech app
development. Coupled with Microsoft's Visual Studio .NET developer tools, the
SDK is designed to add voice to the list of methods for inputting data, which
includes the mouse, keyboard and stylus.
IBM, too, is bound to be influential here. The company pioneered
voice-recognition technology, and its ViaVoice dictation software is probably
its best-known application. But Big Blue is investing in speech-recognition
technology, which the company considers an essential component of its ''pervasive
computing'' strategy. IBM's WebSphere Studio now includes a plug-in called the
WebSphere Voice Toolkit, which allows users to write VoiceXML applications and
test and debug the grammars and pronunciation. IBM's Soares sees a big
opportunity in this market for developers, which his company wants to
''Now enterprise customers look at me in meetings and say, 'This is great, we
can now have our regular Web developers and our enterprise application
developers and voice developers all using the same environment: WebSphere
Studio,''' he said.
''I think the killer weakness in this market is the lack of good
applications,'' said Giga Information Group's Herrell. ''Everything is customized,
and everyone builds their own. I think there's room for developers to come in
and seize some of these applications. I would like to see a lot more developers
out there, and I don't think they're going to lack for customers.''
But even with great tools, Meta Group's Perkins said, the real challenge for
developers of speech technologies in the years ahead will be actually
understanding the human-computer interface. ''We're still in the first
generation,'' he said, ''and we haven't built that many apps that were developed
from the ground up that accept and recognize speech as the input. When
programmers truly understand how humans communicate by voice, and accurately
reflect that in their interactions with applications -- when someone combines the
natural way that people communicate by voice with applications that could
actually be exploited using that capability, that's when you'll have your killer
See the following related stories:
Talking speech tech, by John K. Waters
Multiple modes, by John K. Waters
Speech specs, by John K. Waters