Assembling the Jigsaw: How Multiple Open Standards Are Synergistically Combined in the HALEF Multimodal Dialog System

  • Vikram Ramanarayanan
  • David Suendermann-Oeft
  • Patrick Lange
  • Robert Mundkowsky
  • Alexei V. Ivanov
  • Zhou Yu
  • Yao Qian
  • Keelan Evanini


As dialog systems become increasingly multimodal and distributed in nature with advances in technology and computing power, they become correspondingly more complicated to design and implement. However, open industry and W3C standards provide a silver lining here, allowing different components to be designed in a distributed fashion while remaining interoperable. In this chapter we examine how an open-source, modular, multimodal dialog system, HALEF, can be seamlessly assembled, much like a jigsaw puzzle, by putting together multiple distributed components that comply with W3C recommendations or other open industry standards. We highlight the specific standards that HALEF currently uses, along with a perspective on other useful standards that could be adopted in the future. HALEF has an open codebase to encourage progressive community contribution and to serve as a common, standards-based testbed for multimodal dialog system development and benchmarking.
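To make the "jigsaw" idea concrete, the snippet below is a minimal, illustrative VoiceXML document of the kind a W3C-standards-compliant voice browser can interpret: it plays a synthesized prompt, listens for a caller response constrained by an external SRGS grammar, and echoes the result. It is a generic sketch of the standard, not taken from HALEF's codebase; the form id, field name, and the grammar file `yes_no.grxml` are hypothetical names chosen for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal VoiceXML 2.1 form: one prompt, one speech-recognition field. -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="greeting">
    <field name="answer">
      <!-- Rendered by the speech synthesizer component. -->
      <prompt>Hello. Shall we begin?</prompt>
      <!-- Hypothetical SRGS grammar constraining the recognizer. -->
      <grammar src="yes_no.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>You said <value expr="answer"/>. Goodbye.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Because the document only assumes the VoiceXML and SRGS recommendations, any compliant voice browser, recognizer, or synthesizer can in principle be swapped in behind it, which is exactly the interoperability the chapter argues the standards provide.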


Keywords: Interactive Voice Response · Voice Activity Detector · Speech Recognizer · Dialog System · Speech Synthesizer



Copyright information

© Springer International Publishing Switzerland 2017

Authors and Affiliations

  • Vikram Ramanarayanan¹
  • David Suendermann-Oeft¹
  • Patrick Lange¹
  • Robert Mundkowsky²
  • Alexei V. Ivanov¹
  • Zhou Yu³
  • Yao Qian¹
  • Keelan Evanini²

  1. Educational Testing Service (ETS) R&D, San Francisco, USA
  2. Educational Testing Service (ETS) R&D, Princeton, USA
  3. Carnegie Mellon University, Pittsburgh, USA
