Modern Conversational Agents

Technologien für digitale Innovationen

Abstract

Conversational agents are computer programs that engage human users in conversation to assist, educate, or entertain. The field has attracted substantial research interest ever since the advent of artificial intelligence in the 1950s and 60s, and recent advances in cloud computing and the availability of smart devices with wireless high-speed Internet connections have led to rapid progress in the engineering of conversational technology. “Modern” conversational agents understand spoken language, are able to answer complicated questions, or interact with humans in dialogs of hundreds of user turns. This article describes the strengths of modern conversational agents, which are driven by synergy among high-performing speech recognition, smart devices, high-speed Internet, cloud computing, standardization, and crowdsourcing. It also shows how the field is primarily driven by commercial stakeholders and how open-source alternatives are expected to play a major role in the future of modern conversational agents.

Author information

Corresponding author

Correspondence to David Suendermann-Oeft.

Copyright information

© 2014 Springer Fachmedien Wiesbaden

About this chapter

Cite this chapter

Suendermann-Oeft, D. (2014). Modern Conversational Agents. In: Jähnert, J., Förster, C. (eds) Technologien für digitale Innovationen. Springer VS, Wiesbaden. https://doi.org/10.1007/978-3-658-04745-0_4

  • DOI: https://doi.org/10.1007/978-3-658-04745-0_4

  • Publisher Name: Springer VS, Wiesbaden

  • Print ISBN: 978-3-658-04744-3

  • Online ISBN: 978-3-658-04745-0

  • eBook Packages: Humanities, Social Science (German Language)
