The Business of Speech Technologies

Wilpon, Jay; Gilbert, Mazin E.; Cohen, Jordan

doi:10.1007/978-3-540-49127-9_34

Part of the book series: Springer Handbooks (SHB)

Abstract

With the fast pace of development in communications networks and devices, immediate and easy access to information and services is now the expected norm. Several critical technologies have entered the marketplace as key enablers of this shift. In particular, speech technologies, such as speech recognition and natural language understanding, have permanently changed how businesses provide services to consumers. In 30 short years, speech has progressed from an idea in research laboratories across the world to a multibillion-dollar industry of software, hardware, service hosting, and professional services. Speech is now almost ubiquitous in cell phones. Yet the industry is still very much in its infancy, focused on simple, low-hanging-fruit applications where the current state of the technology fits a specific market need, such as voice enabling of call center services or voice dialing on a cell phone.

With broadband access to networks (and therefore data) anywhere, anytime, and from any device almost a reality, speech technologies will continue to be essential for unlocking the potential that such access provides. Doing so, however, requires advances in basic speech technologies beyond the current state of the art. In this chapter, we review the business of speech technologies and its development since the 1980s. How did it start? What were the key inventions that got us where we are, and the service innovations that supported the industry over the past few decades? What are the future trends in how speech technologies will be used? And what are the key technical challenges researchers must address and resolve for the industry to move forward and meet this vision of the future? This chapter is by no means exhaustive, but it gives the reader an understanding of speech technologies, the speech business, and the areas where continued technical invention and innovation will be needed before speech technologies see ubiquitous use in the marketplace.

Abbreviations

ARPA: Advanced Research Projects Agency
ART: advanced recognition technology
ASR: automatic speech recognition
ATIS: airline travel information system
BBN: Bolt, Beranek and Newman
CE: categorical estimation
CMU: Carnegie Mellon University
DARPA: Defense Advanced Research Projects Agency
DM: dialog management
DP: dynamic programming
DSP: digital signal processing
DT: discriminative training
DTW: dynamic time warping
FFT: fast Fourier transform
GMM: Gaussian mixture model
HMIHY: How May I Help You
HMM: hidden Markov model
IP: internet protocol
IVR: interactive voice response
LDA: linear discriminant analysis
LPC: linear predictive coding
MCE: minimum classification error
MLLR: maximum-likelihood linear regression
MMI: maximum mutual information
NLU: natural language understanding
PDA: pitch determination algorithm
SDC: shifted delta cepstral
SLM: statistical language model
SMS: speaker model synthesis
SVM: support vector machine
TI: transinformation index
UE: user experience
VRCP: voice recognition call processing
VTLN: vocal-tract-length normalization
VoIP: voice over IP
XML: extensible markup language

Author information

Authors and Affiliations

Voice and IP Services, Research AT&T Labs, 07932, Florham Park, NJ, USA
Jay Wilpon
AT&T Labs, Inc., Research, 180 Park Ave., 07932, Florham Park, NJ, USA
Mazin E. Gilbert
SRI International, 300 Ravenswood Drive, 94019, Menlo Park, CA, USA
Jordan Cohen Ph.D

Corresponding authors

Correspondence to Jay Wilpon, Mazin E. Gilbert or Jordan Cohen Ph.D.

Editor information

Editors and Affiliations

INRS-EMT, University of Quebec, 800 de la Gauchetiere Ouest, H5A 1K6, Montreal, Quebec, Canada
Jacob Benesty Dr.
Avayalabs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA
M. Mohan Sondhi Ph.D.
Alcatel-Lucent, Bell Laboratories, 600 Mountain Avenue, 07974, Murray Hill, NJ, USA
Yiteng Arden Huang Dr.

About this chapter

Cite this chapter

Wilpon, J., Gilbert, M.E., Cohen, J. (2008). The Business of Speech Technologies. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_34

DOI: https://doi.org/10.1007/978-3-540-49127-9_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49125-5
Online ISBN: 978-3-540-49127-9
eBook Packages: Engineering, Engineering (R0)
