MMCorp 2008: Multimodal Corpora, pp. 93–108

Unsupervised Clustering in Multimodal Multiparty Meeting Analysis

  • Yosuke Matsusaka
  • Yasuhiro Katagiri
  • Masato Ishizaki
  • Mika Enomoto
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5509)

Abstract

Nonverbal signals such as gaze, head nods, facial expressions, and bodily gestures play significant roles in organizing human interactions. Their significance is even greater in multiparty settings, since many interaction organization behaviors, for example turn-taking and participation role assignment, are realized nonverbally. Several projects have collected multimodal corpora of multiparty dialogues [3,4] in order to develop techniques for recognizing meeting events from nonverbal as well as verbal signals (e.g., [11,2]).

The task of annotating nonverbal signals exchanged in conversational interactions poses both theoretical and practical challenges for the development of multimodal corpora. Many projects rely on a combination of manual annotation and automatic signal processing in corpus building. Some apply different methods to different types of signals, making corpus construction more efficient through a division of labor [9]. Others treat manual annotations as reference values against which their signal processing methods are validated [7].
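The title refers to unsupervised clustering of such nonverbal signals. As a purely illustrative sketch, and not the authors' actual method (which is not described in this preview), the example below applies a plain k-means to per-frame head yaw/pitch angles so that recurring head-direction states, such as looking toward a particular participant, emerge without manual labels; the feature layout and the cluster count k = 3 are assumptions.

```python
import numpy as np

def kmeans(features, k=3, iters=50, seed=0):
    """Plain k-means over per-frame head-direction features.

    features: (n_frames, 2) array of [yaw, pitch] angles in degrees
    (hypothetical format; real head trackers may output rotation matrices).
    Returns one integer cluster label per frame.
    """
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each frame to the nearest cluster center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Synthetic head-direction track: three gaze targets plus noise.
rng = np.random.default_rng(1)
targets = np.array([[-30.0, 0.0], [0.0, -10.0], [25.0, 5.0]])
frames = np.vstack([t + rng.normal(0, 3, size=(100, 2)) for t in targets])
labels = kmeans(frames, k=3)
print("frames per cluster:", np.bincount(labels))
```

In such a setup, clusters found without supervision could later be compared against manual gaze or listener-response annotations, in the spirit of the validation strategies cited above.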

Keywords

Nonverbal Behavior · Manual Annotation · Unsupervised Cluster · Head Direction · Listener Response

References

  1. Asano, F., Ogata, J.: Detection and separation of speech events in meeting recordings. In: Proc. Interspeech, pp. 2586–2589 (2006)
  2. Ba, S., Odobez, J.-M.: A study on visual focus of attention recognition from head pose in a meeting room. In: Renals, S., Bengio, S., Fiscus, J.G. (eds.) MLMI 2006. LNCS, vol. 4299, pp. 75–87. Springer, Heidelberg (2006)
  3. Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., Post, W., Reidsma, D., Wellner, P.: The AMI meeting corpus: A pre-announcement. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 28–39. Springer, Heidelberg (2006)
  4. Chen, L., Travis Rose, R., Qiao, Y., Kimbara, I., Parrill, F., Welji, H., Han, T.X., Tu, J., Huang, Z., Harper, M., Quek, F., Xiong, Y., McNeill, D., Tuttle, R., Huang, T.: VACE multimodal meeting corpus. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 40–51. Springer, Heidelberg (2006)
  5. Kipp, M.: Gesture Generation by Imitation: From Human Behavior to Computer Character Animation. Dissertation.com, Boca Raton, FL (2004)
  6. Lee, S., Hayes, M.H.: An application for interactive video abstraction. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 5, pp. 17–21 (2004)
  7. Martin, J.-C., Caridakis, G., Devillers, L., Karpouzis, K., Abrilian, S.: Manual annotation and automatic image processing of multimodal emotional behaviours: Validating the annotation of TV interviews. In: Proc. LREC 2006, pp. 1127–1132 (2006)
  8. Matsusaka, Y.: Recognition of 3-party conversation using prosody and gaze. In: Proc. Interspeech, pp. 1205–1208 (2005)
  9. Pianesi, F., Zancanaro, M., Leonardi, C.: Multimodal annotated corpora of consensus decision making meetings. In: LREC 2006 Workshop on Multimodal Corpora, pp. 6–9 (2006)
  10. Sas, C., O'Hare, G., Reilly, R.: Virtual environment trajectory analysis: a basis for navigational assistance and scene adaptivity. Future Generation Computer Systems 21, 1157–1166 (2005)
  11. Stiefelhagen, R., Yang, J., Waibel, A.: Modeling focus of attention for meeting indexing based on multiple cues. IEEE Transactions on Neural Networks 13(4), 923–938 (2002)
  12. Turaga, P.K., Veeraraghavan, A., Chellappa, R.: From videos to verbs: Mining videos for events using a cascade of dynamical systems. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (2007)
  13. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision 57(2), 137–154 (2004)
  14. Wang, T., Shum, H., Xu, Y., Zheng, N.: Unsupervised analysis of human gestures. In: Shum, H.-Y., Liao, M., Chang, S.-F. (eds.) PCM 2001. LNCS, vol. 2195, pp. 174–181. Springer, Heidelberg (2001)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Yosuke Matsusaka, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan
  • Yasuhiro Katagiri, Future University Hakodate, Hakodate, Japan
  • Masato Ishizaki, The University of Tokyo, Tokyo, Japan
  • Mika Enomoto, Tokyo University of Technology, Tokyo, Japan
