In-Depth
Talking speech tech
- By John K. Waters
- December 1, 2002
Specialized technology areas tend to have their own
jargon, and speech tech is quickly generating an alphabet soup of acronyms. Here
are definitions of some key speech tech terms.
Automatic Speech Recognition (ASR) systems -- utilize voice recognition to
replace keypad entry for telephone voice menus. These are the systems that tell
callers to speak the digits 0 through 9.
Computer Telephony Integration (CTI) -- combines data with voice systems for
enhanced telephone services.
Dual-Tone MultiFrequency (DTMF) -- the type of audio signals produced by a
touch-tone telephone.
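As a rough illustration of the "dual-tone" idea, the sketch below maps each key to its standard pair of frequencies (one low-group tone, one high-group tone) and sums two sine waves to produce the signal. The function names, the 8 kHz sample rate and the 0.1-second duration are assumptions chosen for the example, not part of any particular telephony product.

```python
import math

# Standard DTMF frequencies in Hz: each key combines one row (low-group)
# tone with one column (high-group) tone.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key):
    """Return the (low, high) frequency pair for a single keypad character."""
    if len(key) != 1:
        raise ValueError(f"expected one character, got {key!r}")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key)
        if col != -1:
            return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key, duration_s=0.1, sample_rate=8000):
    """Synthesize the key's tone as floats in [-1, 1] by summing two sines."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    return [
        0.5 * math.sin(2 * math.pi * low * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * n / sample_rate)
        for n in range(count)
    ]
```

Calling `dtmf_frequencies("5")` returns the 770 Hz and 1336 Hz pair; a receiving system detects which two tones are present to work out which key was pressed.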
Grammars -- in speech tech circles, ''grammars'' are the phrases a user might
say that a speech engine can recognize.
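A toy example may make the idea concrete. The sketch below treats a grammar as nothing more than a table of phrases grouped by the action each one triggers; the action names and phrases are invented for illustration, and real speech engines use far richer grammar formats and match against audio rather than text.

```python
# Hypothetical grammar: each action maps to the phrases that trigger it.
GRAMMAR = {
    "check_balance": {"check my balance", "account balance", "balance"},
    "transfer_funds": {"transfer funds", "move money"},
    "agent": {"operator", "speak to an agent"},
}


def normalize(utterance):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(utterance.lower().split())


def match(utterance, grammar=GRAMMAR):
    """Return the action whose phrase set contains the utterance, else None."""
    text = normalize(utterance)
    for action, phrases in grammar.items():
        if text in phrases:
            return action
    return None
```

Anything outside the listed phrases returns `None`, which is the point of a grammar: the engine recognizes only what the designer has anticipated.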
Interactive Voice Response (IVR) -- an automated telephone information system
to which callers respond by using the keypad or by speaking words. The system
communicates with callers using a combination of fixed voice menus and real-time
data from databases.
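The following sketch shows, in simplified form, how such a system might treat a key press and a spoken word as the same choice, and mix a fixed menu with looked-up data. The account table, menu wording and function name are all hypothetical stand-ins; a production IVR would sit on top of telephony hardware and a real database.

```python
# Hypothetical stand-in for a real-time database lookup.
BALANCES = {"1001": 250.75}

MENU_PROMPT = (
    "For your balance, press 1 or say 'balance'. "
    "For store hours, press 2 or say 'hours'."
)

# Keypad digits and spoken words map to the same menu choices.
CHOICES = {"1": "balance", "balance": "balance", "2": "hours", "hours": "hours"}


def respond(caller_input, account_id):
    """Return the text the system would speak for one caller input."""
    choice = CHOICES.get(caller_input.strip().lower())
    if choice == "balance":
        balance = BALANCES.get(account_id)
        if balance is None:
            return "Sorry, that account was not found."
        return f"Your balance is {balance:.2f} dollars."
    if choice == "hours":
        return "We are open 9 a.m. to 5 p.m., Monday through Friday."
    # Unrecognized input: replay the fixed menu.
    return MENU_PROMPT
```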
Prompts -- phrases that a voice system plays back to callers, indicating
which information the system needs next. For example: ''Please enter your credit
card number.''
Speech Application Language Tags (SALT) -- extensions to HTML, XHTML and XML
for voice recognition and synthesized speech output. SALT is the newest
specification to emerge from the speech market. It is designed to support
''multimodality,'' including audio, video, text and graphics, depending on the
hardware.
Speaker recognition (sometimes called voice authentication) -- refers to
systems with the ability to distinguish and confirm the identity of the
individual speaking to them. Speaker recognition can be further subdivided into
speaker identification, which determines which registered speaker provides a
given utterance from among a set of known speakers; and speaker verification,
which accepts or rejects the identity claim of a speaker.
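The difference between identification and verification can be sketched in a few lines. The example assumes each speaker has already been reduced to a short list of numbers (a made-up "voiceprint"), and it compares voiceprints with cosine similarity; the enrolled names, the numbers and the 0.95 threshold are all illustrative assumptions, not how any particular product works.

```python
import math

# Hypothetical enrolled voiceprints: name -> feature vector.
ENROLLED = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(features, enrolled=ENROLLED):
    """Identification: pick the closest speaker from the known set."""
    return max(enrolled, key=lambda name: cosine_similarity(features, enrolled[name]))


def verify(features, claimed_name, enrolled=ENROLLED, threshold=0.95):
    """Verification: accept or reject one claimed identity."""
    reference = enrolled.get(claimed_name)
    if reference is None:
        return False
    return cosine_similarity(features, reference) >= threshold
```

Identification always answers "who, among the people I know, is this most like?", while verification answers a yes-or-no question about a single claimed identity.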
Speech engine -- software that either processes speech input or produces
speech output.
Speech recognition -- refers to applications and systems that ''understand''
language, regardless of the speaker. It takes the form of a range of
applications, from shrink-wrapped dictation programs that live on a desktop to
sophisticated business apps that allow customers to interact with a computer
over the telephone.
Text-to-Speech (TTS) -- TTS systems convert text into synthesized speech
output. These systems were first designed to allow blind users to listen to
written material. Today, TTS is used extensively to convey financial data,
e-mail messages and other information via telephone.
Voice User Interface (VUI) -- the speech tech equivalent of a GUI, typically
residing on a PDA or smart phone. A VUI is more sophisticated than an IVR
system, and offers a wider range of commands than simply ''yes'' or ''no.''
Voice browser -- allows users to access the Web using speech synthesis,
pre-recorded audio and speech recognition.
Voice portal -- offers a variety of Web-based services on a speech-enabled
platform accessible from a telephone. A consumer voice portal is an interface
for consumer information, such as newsletters, sports and stocks, typically
offered by service providers. An enterprise voice portal provides an integrated
telephony interface to a wide range of enterprise applications and
information.
VoiceXML (VXML) -- a markup language designed to create audio dialogs that
feature synthesized speech, digitized audio, recognition of spoken and DTMF key
input, recording of spoken input, telephony and mixed-initiative
conversations.
See the following related stories:
Giving applications a voice, by John K. Waters
Multiple modes, by John K. Waters
Speech specs, by John K. Waters
About the Author
John K. Waters is a freelance writer based in Silicon Valley. He can be reached
at [email protected].