Towards Computer Understanding of Human Interactions

  • Iain McCowan
  • Daniel Gatica-Perez
  • Samy Bengio
  • Darren Moore
  • Hervé Bourlard
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3361)

Abstract

People meet in order to interact – disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Iain McCowan (1)
  • Daniel Gatica-Perez (1)
  • Samy Bengio (1)
  • Darren Moore (1)
  • Hervé Bourlard (1)

  1. IDIAP Research Institute, Martigny, Switzerland
