Audio-Visual Processing in Meetings: Seven Questions and Current AMI Answers

  • Marc Al-Hames
  • Thomas Hain
  • Jan Cernocky
  • Sascha Schreiber
  • Mannes Poel
  • Ronald Müller
  • Sebastien Marcel
  • David van Leeuwen
  • Jean-Marc Odobez
  • Sileye Ba
  • Herve Bourlard
  • Fabien Cardinaux
  • Daniel Gatica-Perez
  • Adam Janin
  • Petr Motlicek
  • Stephan Reiter
  • Steve Renals
  • Jeroen van Rest
  • Rutger Rienks
  • Gerhard Rigoll
  • Kevin Smith
  • Andrew Thean
  • Pavel Zemcik
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4299)

Abstract

The project Augmented Multi-party Interaction (AMI) is concerned with the development of meeting browsers and remote meeting assistants for instrumented meeting rooms, and with the required component technologies. Its R&D themes are group dynamics; audio, visual, and multimodal processing; content abstraction; and human-computer interaction. The audio-visual processing workpackage within AMI addresses automatic recognition from audio, video, and combined audio-video streams that have been recorded during meetings. In this article we describe the progress made in the first two years of the project. We show how the large problem of audio-visual processing in meetings can be split into seven questions, such as "Who is acting during the meeting?". We then describe the algorithms and methods that have been developed and evaluated to answer these questions automatically.
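
To make the flavour of such questions concrete, the sketch below shows the simplest possible take on "Who is acting during the meeting?": a naive per-channel energy comparison over individual headset recordings. This is only an illustration under stated assumptions (synthetic signals, hypothetical channel names, an arbitrary 10 dB margin and frame sizes); it is not the speaker-activity or turn-segmentation method developed within AMI.

```python
# Illustrative sketch only: naive energy-based "who is speaking" detection
# over per-speaker headset channels. Channel names, the 10 dB margin, and
# the noise-floor heuristic are hypothetical assumptions, not AMI's method.
import numpy as np

def frame_energies(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Short-time log energy per frame (25 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)

def who_is_speaking(channels: dict, margin_db: float = 10.0) -> list:
    """Per frame, pick the channel with the highest energy, or 'silence'
    if that channel does not rise above its own noise floor plus a margin."""
    energies = {name: frame_energies(sig) for name, sig in channels.items()}
    n = min(len(e) for e in energies.values())
    labels = []
    for t in range(n):
        best = max(energies, key=lambda name: energies[name][t])
        floor = np.percentile(energies[best], 10)  # crude noise-floor estimate
        labels.append(best if energies[best][t] > floor + margin_db else "silence")
    return labels

# Toy usage with synthetic audio standing in for real headset recordings.
rng = np.random.default_rng(0)
sil = 0.01 * rng.standard_normal(16000)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000.0)
labels = who_is_speaking({"speaker_A": np.concatenate([speech, sil]),
                          "speaker_B": np.concatenate([sil, speech])})
print(labels[0], labels[-1])  # expect: speaker_A speaker_B
```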

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Marc Al-Hames (1)
  • Thomas Hain (2)
  • Jan Cernocky (3)
  • Sascha Schreiber (1)
  • Mannes Poel (4)
  • Ronald Müller (1)
  • Sebastien Marcel (5)
  • David van Leeuwen (6)
  • Jean-Marc Odobez (5)
  • Sileye Ba (5)
  • Herve Bourlard (5)
  • Fabien Cardinaux (5)
  • Daniel Gatica-Perez (5)
  • Adam Janin (8)
  • Petr Motlicek (3, 5)
  • Stephan Reiter (1)
  • Steve Renals (7)
  • Jeroen van Rest (6)
  • Rutger Rienks (4)
  • Gerhard Rigoll (1)
  • Kevin Smith (5)
  • Andrew Thean (6)
  • Pavel Zemcik (3)
  1. Institute for Human-Machine-Communication, Technische Universität München
  2. Department of Computer Science, University of Sheffield
  3. Faculty of Information Technology, Brno University of Technology
  4. Department of Computer Science, University of Twente
  5. IDIAP Research Institute and Ecole Polytechnique Federale de Lausanne (EPFL)
  6. Netherlands Organisation for Applied Scientific Research (TNO)
  7. Centre for Speech Technology Research, University of Edinburgh
  8. International Computer Science Institute, Berkeley, USA