Multimodal HALEF: An Open-Source Modular Web-Based Multimodal Dialog Framework

  • Zhou Yu
  • Vikram Ramanarayanan
  • Robert Mundkowsky
  • Patrick Lange
  • Alexei Ivanov
  • Alan W. Black
  • David Suendermann-Oeft
Chapter in the Lecture Notes in Electrical Engineering book series (LNEE, volume 427)

Abstract

We present an open-source web-based multimodal dialog framework, “Multimodal HALEF”, which integrates video conferencing and telephony capabilities into the existing HALEF cloud-based dialog framework via the FreeSWITCH video telephony server. Owing to its distributed, cloud-based architecture, Multimodal HALEF allows researchers to collect video and speech data from participants interacting with the dialog system outside of traditional lab settings, thereby greatly reducing the cost and labor incurred by conventional audio-visual data collection. The framework is equipped with a set of tools, including a web-based user survey template; a speech transcription, annotation, and rating portal; a web-based visual processing server that performs head tracking; and a database that logs full-call audio and video recordings as well as other call-specific information. We present observations from an initial data collection based on a job interview application. Finally, we report on future plans for the development of the framework.
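To make the call-logging component described above concrete, the following is a minimal sketch of what a database that stores full-call recordings and call-specific metadata could look like. All table, column, function, and path names here are illustrative assumptions; the chapter does not specify HALEF’s actual schema.

```python
# Minimal sketch of a call-log store for full-call audio/video recordings
# and call-specific metadata. All names are hypothetical assumptions,
# not HALEF's actual implementation.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("halef_calls.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS calls (
        call_id     TEXT PRIMARY KEY,   -- unique per telephony/WebRTC session
        started_at  TEXT NOT NULL,      -- ISO-8601 UTC timestamp
        audio_path  TEXT,               -- full-call audio recording
        video_path  TEXT,               -- full-call video recording
        survey_id   TEXT,               -- link to the post-call web survey
        notes       TEXT                -- other call-specific information
    )
    """
)

def log_call(call_id, audio_path, video_path, survey_id=None, notes=""):
    """Insert one completed call's recordings and metadata."""
    conn.execute(
        "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)",
        (call_id, datetime.now(timezone.utc).isoformat(),
         audio_path, video_path, survey_id, notes),
    )
    conn.commit()

# Example: register one finished call and its recordings.
log_call("call-0001", "/recordings/call-0001.wav", "/recordings/call-0001.mp4")
```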

Keywords

Dialog systems · Multimodal inputs


Copyright information

© Springer Science+Business Media Singapore 2017

Authors and Affiliations

  • Zhou Yu (1, 2, 3)
  • Vikram Ramanarayanan (1, 2)
  • Robert Mundkowsky (1, 2)
  • Patrick Lange (1, 2)
  • Alexei Ivanov (1, 2)
  • Alan W. Black (3)
  • David Suendermann-Oeft (1, 2)

  1. Educational Testing Service (ETS) R&D, San Francisco, USA
  2. Educational Testing Service (ETS) R&D, Princeton, USA
  3. Carnegie Mellon University, Pittsburgh, USA
