In-Depth

Giving applications a voice

When the talk turns to speech -- that is, the evolving array of technologies for computer-voice interactions -- the consensus seems to be that ''we're not there yet.'' There, say industry watchers, is the point at which computers can handle actual conversations with human beings.

''People expect Hal from 2001: A Space Odyssey,'' said Michael Gartenberg, research director at Jupiter Research, a New York City-based research firm. ''They expect the computer on the starship Enterprise. They expect those kinds of natural interactions. And when they don't get them -- when they're hit by the limitations of the current state of the technology -- there's a sense of frustration.''

But not being there does not mean there is nothing here. Although it has yet to meet our science-fiction-inspired expectations, speech technology has improved so rapidly over the past few years that a growing number of name-brand vendors are planting flags in the market, and industry heavyweights like Microsoft and Philips are planning to embed speech into almost everything. Despite the economic downturn, barely a week goes by without the announcement of a new product, specification or enterprise-changing implementation. From call centers to handhelds to corporate portals, the technologies that allow computers to respond to the human voice have come a long way. And according to Meta Group's Earl Perkins, IT managers ignore them at their peril.

''Enterprise users who don't recognize the value of some of the more mainstream capabilities [of speech technologies] that exist today are actually jeopardizing their competitive advantage,'' said Perkins, senior program director of global networking strategies at the Stamford, Conn.-based research firm. ''Just look at the airlines. Any airline that doesn't have speech recognition today is at a pronounced competitive disadvantage to any airline that does. Flat out.''

Perkins' airline example points to one of speech technology's greatest success stories. The travel and tourism industry has been an early adopter of speech technology, and successful implementations of available applications in that sector abound.

''The travel industry had terrible self-service call completion rates,'' explained Bern Elliot, research director at Gartner Inc. ''The information is often very complex to enter, and it was hard for callers to accomplish their transaction with touch-tone responses. We're talking about things like city of origin or destination. Just try to spell 'Philadelphia' on your touch-tone phone or distinguish between 'Washington, D.C.' and 'Washington state.' Then you have date and time of travel, class of travel, and a bunch of different times to choose from. But it was all within a fairly limited range, so speech recognition could really perform in this instance.''

In an effort to solve these problems, United Airlines launched a speech-enabled, 24x7 flight information system in the fall of 1999. Running on a platform from InterVoice-Brite Inc., and utilizing speech-recognition technology from SpeechWorks International Inc., the system allowed callers to speak their request into the phone to find flight arrivals, departures and gate information for all United flights. According to United, in its first 30 months of operation, the system handled more than 50 million calls.

According to Gartner's Elliot, implementations of similar speech-recognition systems have resulted in significant improvements in call completion rates throughout the travel and tourism industry. ''Everybody got it,'' he said. ''And they have achieved their objectives in terms of ROI, customer satisfaction and operational efficiency. In many cases, they even exceeded their goals.''

United's return on its investment in speech technology is impressive. According to Steve Chambers, chief marketing officer at Boston-based SpeechWorks International Inc., the system saved the company $24 million. ''I don't know many other technologies that can claim those kinds of numbers,'' Chambers said. ''It's true that we're 'not there yet' and this industry has a ways to go, but that's not stopping us from delivering solid technology today.''

The speech or the speaker?
The system implemented by United employed speech-recognition software, which should not be confused with speaker-recognition technologies. (See Talking speech tech.) Speech recognition refers to applications and systems that 'understand' words spoken by anyone -- in other words, the language itself. It takes the form of a range of applications, from shrink-wrapped dictation programs that live on a desktop to sophisticated business apps that allow customers to interact with a computer over the telephone.

Speech recognition may be said to have two constituent qualities: ''accuracy'' and ''fluency.'' Accuracy refers to the system's ability to identify each word a user speaks. Fluency refers to the depth of a system's vocabulary and its grammar range.

There are three types of speech recognition applications: Command systems recognize a few hundred words and eliminate the need for a mouse or keyboard. Discrete voice-recognition systems are used for dictation, but users must pause between words. Continuous voice-recognition systems are designed to understand natural speech without pauses.

Speaker recognition (sometimes called voice authentication) refers to systems with the ability to distinguish and confirm the identity of the individual speaking to them. Speaker recognition can be further subdivided into speaker identification, which determines which registered speaker provides a given utterance from among a set of known speakers; and speaker verification, which accepts or rejects the identity claim of a speaker.
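
The difference between the two subtypes can be sketched in a few lines of Python. This is an illustrative toy only, not how any vendor named in this article does it: it assumes each enrolled speaker is represented by a short made-up feature vector (a ''voiceprint''), compares vectors by cosine similarity, and uses an arbitrary acceptance threshold of 0.9. The names ENROLLED, identify and verify are hypothetical.

```python
import math

# Hypothetical enrolled "voiceprints": made-up feature vectors, one per speaker.
ENROLLED = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.5],
}


def cosine_similarity(a, b):
    """Similarity of two non-zero vectors, from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(sample, enrolled):
    """Speaker identification: pick the closest of the known speakers."""
    return max(enrolled, key=lambda name: cosine_similarity(sample, enrolled[name]))


def verify(sample, claimed_name, enrolled, threshold=0.9):
    """Speaker verification: accept or reject a single identity claim.

    The 0.9 threshold is an arbitrary assumption for illustration.
    """
    if claimed_name not in enrolled:
        return False
    return cosine_similarity(sample, enrolled[claimed_name]) >= threshold


caller = [0.85, 0.15, 0.25]  # made-up features from an incoming call
who = identify(caller, ENROLLED)             # closest enrolled speaker
claim_ok = verify(caller, "bob", ENROLLED)   # does the caller match "bob"?
```

Identification always answers ''which of the known speakers is closest,'' while verification answers a yes-or-no question about one claimed identity, which is the shape of the password-reset scenario described next.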

One of the most common applications of speaker-recognition technology has been to free call center agents from an annoying task: password reset.

''You cannot appreciate the volume of password resetting that is going on,'' said Gartner's Elliot. ''Everyone has forgotten a password at least once. The more you have, the more you're likely to forget. And nowadays, people have lots of them.''

Automating similarly routine calls is also one of the key benefits of an effective speech-recognition system. Giga Information Group analyst Elizabeth Herrell calls that objective the current sweet spot for that technology.

''A lot of the speech-recognition applications have been deployed to offer automated services, replace agents and eliminate some of the touch-tone commands on telephones,'' she said. ''That's where there is high demand today. Most companies want to implement more automated services. They're finding out that, even with an interactive voice-response system, which can be a very cost-effective tool, people are opting out and going back to the operator because they want to speak to someone. They're finding that, by using speech technology, they can get a much higher percentage of users to stay on the line and complete their transactions without going directly to an agent.''

Interactive voice-response systems (IVRs) -- which speak to callers, who then respond with the touch-tone keypad -- have been around a long time. According to Jupiter Research's Gartenberg, the most commonly deployed speech-recognition technologies are overlaid on these systems.

''The goal is to create these types of voice-enabled applications so that people can navigate that voice menu from hell without having to constantly punch in numbers,'' he said. ''It's not really speech interaction, but rather enabling speech into some of these applications to maximize customer satisfaction, which can be very useful in the enterprise.''

Speech recognition is also finding a place in the unified messaging space, as growing numbers of firms explore services that make faxes, e-mail and voice mail accessible via phone-based voice commands. And an entire industry appears to be growing around the loosely defined concept of a ''voice portal'' or ''vortal.''

''Essentially,'' said Meta Group's Perkins, ''it's the idea of offering a variety of [Web-based] services -- financial reporting, traffic reports, sports scores, entertainment -- on a speech-enabled platform accessible from a telephone.''

''Voice portal'' is a buzzword in need of an adjective, said Gartner's Elliot. Consumer voice portals are interfaces for consumer information -- newsletters, sports and stocks -- typically offered by service providers. Enterprise voice portals provide an integrated telephony interface to a wide range of enterprise applications and information.

SpeechWorks' Chambers agrees that automating routine calls in these types of systems is an important application of speech recognition. But he cautions that it is not about replacing people, but freeing agents for more complex tasks.

''People should definitely be looking at speech to cut costs,'' Chambers said, ''but also to increase agent yields. When you deploy speech, your agents are freed up for the cross sell and the upsell. They become much more effective. Also, you're handling the routine calls that used to [annoy] the agents, so those agents are less likely to turn over.''

Cutting agent turnover is no small thing. According to a Gartner survey, call centers in the U.S. experienced an average 30% turnover rate in 2000 and 2001.

Implementing speech
It is fair to say that currently available speech technologies are already so widespread that we are beginning to take them for granted. Dial toll-free directory assistance, and a TellMe application asks you to ''Say the full name of the listing you want.'' Call your bank, and you can ''Press or say three'' to hear your current balance.

Yet for most IT organizations, speech is still a whole new ballgame. IT managers thinking about implementing speech in their organizations have a lot to consider. The solution itself consists of three basic elements: the speech engine, which is the software that processes spoken input or produces speech output; the platform, such as an IVR system or a voice portal; and the specific application, which may be pre-built or custom-developed.

Gartner's Elliot advises IT managers to consider all three components in parallel. ''Sometimes companies get blocked on which one is more important,'' he said, ''but all three areas have tools that need to work together.''

Many enterprises prefer to utilize internal resources to develop their own speech apps, but speech app development is a highly specialized discipline. In the words of Sunil Soares, director of product management at IBM's Pervasive Computing Division, it is ''tricky stuff.''

''Believe it or not,'' Soares explained, ''there are many ways that you can say even something as simple as 'yes' and 'no.' There's 'yes,' 'yeah,' 'okay,' 'all right,' 'no,' 'nah' and even more. And it gets worse. I'm writing a stock quote application, for example, and let's say I know that it's going to have multiple stocks, maybe 15,000 that are regularly traded in the U.S. And let's say that [some] of them will be IBM. The grammar for IBM could be 'IBM,' 'International Business Machines' or 'Big Blue.' I need the ability to write that grammar. If I'm adding Microsoft, 'Microsoft' doesn't exist in the dictionary, so the engine has to be able to recognize a specific pronunciation.''
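
The kind of grammar Soares describes can be sketched as a simple mapping from many spoken variants to one canonical meaning. The Python below is a minimal illustration under stated assumptions: it works on text that a speech engine has already transcribed, it ignores the pronunciation problem he raises, and the names YES_NO_GRAMMAR, STOCK_GRAMMAR, build_lookup and interpret are invented for this example rather than taken from any IBM toolkit.

```python
# Hypothetical grammars: canonical meaning -> phrases a caller might say.
YES_NO_GRAMMAR = {
    "yes": ["yes", "yeah", "okay", "all right"],
    "no": ["no", "nah"],
}

STOCK_GRAMMAR = {
    "IBM": ["IBM", "International Business Machines", "Big Blue"],
    "Microsoft": ["Microsoft"],
}


def normalize(utterance):
    """Lowercase and collapse whitespace so variants compare equal."""
    return " ".join(utterance.lower().split())


def build_lookup(grammar):
    """Invert {canonical: [phrases]} into {normalized phrase: canonical}."""
    lookup = {}
    for canonical, phrases in grammar.items():
        for phrase in phrases:
            lookup[normalize(phrase)] = canonical
    return lookup


def interpret(utterance, lookup):
    """Return the canonical meaning, or None if the phrase is out of grammar."""
    return lookup.get(normalize(utterance))


YES_NO = build_lookup(YES_NO_GRAMMAR)
STOCKS = build_lookup(STOCK_GRAMMAR)

answer = interpret("Yeah", YES_NO)       # maps to "yes"
company = interpret("big blue", STOCKS)  # maps to "IBM"
unknown = interpret("maybe", YES_NO)     # out of grammar, so None
```

Even this toy version shows why the work is specialized: every phrase a caller might plausibly say has to be anticipated, and anything outside the grammar needs its own handling.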

Gartner's Elliot strongly advises IT organizations with no experience in speech to seek outside expertise, at least at the front end of a project, to bring their development teams up to speed. ''We recommend against having your touch-tone app developers start right in on speech recognition,'' he said. ''Bring in some people with speech recognition experience and then, over time, you'll grow into it.''

Fortunately -- or unfortunately -- there are lots of places to look for that level of expertise. The speech-tech market is currently populated by a wide range of large and small companies, from speech-engine providers to IVR system vendors to text-to-speech application developers. Giga Information Group's Herrell advises IT managers to consider the advantages of working with the market leaders.

''I see such a mix of players in this market,'' she said, ''everything from big telecoms down to three-man shops of kids just out of school who understand the technology. But you'll want to work with the leading vendors. There are some good new technologies out there, and some very promising ones, but the early market leaders are the ones who have already gone through a lot of the rocky spots.''

The list of speech-tech vendors is a long one, and growing, but a few names stand out in certain segments. Nuance, SpeechWorks, Philips Speech Processing and IBM top many lists of the leading speech-engine vendors. Intervoice, Nortel, Avaya, Edify and Syntellect are among the leading IVR vendors. TellMe, BeVocal, Voice Genie and Voxeo are among the best-known enterprise voice portal vendors. This is by no means a definitive list, and the product offerings change as the market evolves.

Bringing in outside expertise is a useful strategy, but Meta Group's Perkins said most organizations getting into speech should not actually pass off the job to outsiders before they have spent some time with their own hands in the guts of the technology.

''It's very, very early in the game,'' he said, ''too early to consider outsourcing. You don't want to hire someone to develop, deploy and run your speech solutions for you until you understand the issues facing you in your enterprise environment. You can't really outsource what you don't know the true cost of or the true value of.''

Talking to customers
In a nutshell, the key challenge for enterprise users implementing a speech solution is to cut costs without cutting service levels. ''You want to take very good care of your customers, obviously,'' said Giga Information Group's Herrell. ''And the way you do that is by eliminating the routine interactions that could be very easily put into a voice application. But if you make it harder for them, you'll lose them.''

Jupiter Research's Gartenberg agrees: ''The consumer expectation here is simple: It's got to be 100% accurate 100% of the time. Keep in mind, a 90% accuracy rate still means that one out of every 10 times the system gets something wrong. Customers get very frustrated very quickly with that [kind of] experience.''
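
Gartenberg's arithmetic can be taken one step further. The short Python sketch below assumes, purely for illustration, that a call requires several recognized inputs (say origin, destination, date and time) and that each is recognized independently at the same accuracy; neither assumption comes from the analysts quoted here, and the function names are hypothetical.

```python
def error_rate(accuracy):
    """Share of recognition attempts that go wrong at a given accuracy."""
    return 1.0 - accuracy


def all_correct_probability(accuracy, steps):
    """Chance that every one of `steps` inputs is recognized correctly.

    Assumes each input is recognized independently at the same accuracy,
    which is a simplification made for this illustration.
    """
    return accuracy ** steps


one_step_errors = error_rate(0.90)                 # about 0.10, i.e. 1 in 10
four_step_clean = all_correct_probability(0.90, 4)  # about 0.656
```

Under those assumptions, a system that is 90% accurate on each input gets through a four-input transaction without a single error only about two times in three.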

A logical approach, then, is to try out an app internally before launching it for public consumption. That is what United did. Before its flight information system was exposed to a single customer, the company worked out the bugs in a speech-enabled employee reservation system application provided by SpeechWorks.

''You want to make sure that there's a proof of concept in house to kick the tires on the technology before rolling it out to customers,'' said IBM's Soares. ''I don't think that's specific to voice technologies. For anything, we recommend you do a proof of concept before rolling it out to a broader audience. There are a lot of challenges associated with this technology. And there's a fair amount of testing you have to do.''

Meta Group's Perkins recommends that you ''test, test and test mercilessly'' before undertaking any public-facing deployments. ''One of the downsides of deploying a major voice application in an environment with lots of customers is that if you screw it up, you don't get a second chance,'' he said. ''It's not the same thing as a Web site not being available for an hour or so. If you happen to create 7,000 or 8,000 disgruntled customers who cannot interact properly with the application you deploy, they won't do business with you again.''

Jupiter Research's Gartenberg advises using a variety of people of all ages and genders, speaking as many different dialects as possible, as part of the testing methodology. ''The whole goal here is to decrease the consumer's time on hold, and to increase the customer's satisfaction,'' he said. ''You're trying to help the users get to the information they want and to get to it quickly. Anything that creates an inverse experience is going to cost you.''

SpeechWorks' Chambers agrees that organizations should do some tire-kicking before deploying speech apps for their customers, but he is not convinced that an internal pre-launch deployment is necessary. He points out that, although the industry itself may be characterized as young, current deployments have been racking up some serious mileage.

''I can understand starting with an internal version, as United did, if the economics are there,'' he said. ''But in many ways, I think we're a little past that point. Three or four years ago, people still wanted a little more proof of concept. Today, given our customer list and the millions of calls we take a day, that kind of proof isn't called for. It may be that the best application for a particular enterprise is an internal app. If it makes financial sense, that's great. But it's certainly not the case that people are afraid to launch publicly, and therefore want to start internally.''

''Application identification'' is the critical phase in the implementation of a speech strategy, Chambers said. His company utilizes a process that pinpoints what he calls the critical speech factors. That process begins with a vision that is focused on customers.

''You should focus on them completely,'' Chambers said. ''Ask yourself why you want a speech strategy. It's probably because you want to increase your company's financial health and in a way that actually is more customer-friendly. Touch-tone increased the financial health of many companies, but in a less customer-friendly way. Speech can increase your corporate health and satisfy callers in a very, very friendly way.''

IBM's Soares suggests approaching a speech implementation holistically -- in other words, do not think of speech tech as a silo. ''You want to think about your entire channel strategy,'' he said. ''How are you going to offer access to this application using the Web, the voice channel, the wireless channel? Then think about multimodal [see Multiple modes]. Even if you decide not to implement a given channel, you definitely want to understand how your application is architected to support multiple channels and modalities.''

Killer tools and the killer app
According to market researchers at Scottsdale, Ariz.-based In-Stat/MDR, the speech-recognition industry has continued to grow, even through the economic ugliness of 2001. The enterprise ''continued to embrace speech recognition products ... as a means of lowering costs, providing greater customer service, and launching new, innovative, products,'' the firm wrote in a recent report. ''We are just on the cusp of a speech-recognition revolution,'' senior analyst Brian Strachman stated in the report.

Worldwide spending on speech recognition will reach $41 billion by 2005, according to market researchers at the Kelsey Group, and In-Stat/MDR analysts expect the use of standards such as VoiceXML and SALT to ''open up vast new markets and opportunities.''

Other growth factors include continuing increases in processor speeds, which will allow for greater speech-recognition vocabulary size and accuracy; greater emphasis on measuring ROI in a tight economy and voice recognition's ability to demonstrate it; an increase in the number of wireless subscribers; and increased regulation of the use of wireless handsets while operating a car.

The expanding population of mobile workers in this country and their use of untethered computing devices is increasing the potential desirability of non-keyboard input in a host of environments, from a realtor's car to a lineman's telephone pole. And internationally, millions without access to PCs carry cell phones and a hankering for information.

''You can't go anywhere on the globe without seeing a cell phone,'' said Giga Information Group's Herrell. ''They're everywhere. The U.S. hardly touches the number [of cell phone users] in Europe and Asia. But PC access over there is much more limited than it is here. You have this huge population needing information, and their access devices are cell phones with tiny keypads.''

Meta Group's Perkins agrees: ''The evolution of speech technology is going hand in hand with the evolution of mobility technology. From a programming perspective, you need to consider the fact that you have 1.5 billion clients. You've got the Windows 2000 and XP clients, but guess what, you've also got Nokia, Ericsson and Motorola clients. So you need to think of the apps that you deploy, not merely in terms of clients who will be sitting in front of a PC, but also the millions of clients with cell phones in their hands.''

Perhaps the most important driver in this market is the emergence of developer tools and the resulting increase in production speed and lower costs. Microsoft's .NET Speech SDK, in particular, is likely to push speech app development. Coupled with Microsoft's Visual Studio .NET developer tools, the SDK is designed to add voice to the list of methods for inputting data, which includes the mouse, keyboard and stylus.

IBM, too, is bound to be influential here. The company pioneered voice-recognition technology, and its ViaVoice dictation software is probably its best-known application. But Big Blue is investing in speech-recognition technology, which the company considers an essential component of its ''pervasive computing'' strategy. IBM's WebSphere Studio now includes a plug-in called the WebSphere Voice Toolkit, which allows users to write VoiceXML applications and test and debug the grammars and pronunciation. IBM's Soares sees a big opportunity in this market for developers, which his company wants to support.

''Now enterprise customers look at me in meetings and say, 'This is great, we can now have our regular Web developers and our enterprise application developers and voice developers all using the same environment: WebSphere Studio,''' he said.

''I think the killer weakness in this market is the lack of good applications,'' said Giga Information Group's Herrell. ''Everything is customized, and everyone builds their own. I think there's room for developers to come in and seize some of these applications. I would like to see a lot more developers out there, and I don't think they're going to lack for customers.''

But even with great tools, Meta Group's Perkins said, the real challenge for developers of speech technologies in the years ahead will be actually understanding the human-computer interface. ''We're still in the first generation,'' he said, ''and we haven't built that many apps that were developed from the ground up that accept and recognize speech as the input. When programmers truly understand how humans communicate by voice, and accurately reflect that in their interactions with applications -- when someone combines the natural way that people communicate by voice with applications that could actually be exploited using that capability, that's when you'll have your killer app.''

See the following related stories:
Talking speech tech, by John K. Waters
Multiple modes, by John K. Waters
Speech specs, by John K. Waters