Smart Posterboard: Multi-modal Sensing and Analysis of Poster Conversations

  • Tatsuya Kawahara


Conversations in poster sessions at academic events, referred to as poster conversations, pose interesting and challenging problems in multi-modal, multi-party interaction. This article gives an overview of our CREST project on the smart posterboard for multi-modal conversation analysis. The smart posterboard is equipped with multiple sensing devices to record poster conversations, so that we can review who came to the poster and what questions or comments they made. The conversation analysis combines speech and image processing, including face and eye-gaze tracking, speech enhancement, and speaker diarization. It is shown that eye-gaze information is useful both for predicting turn-taking and for improving speaker diarization. Moreover, high-level indexing of the audience's interest and comprehension levels is explored based on their multi-modal behaviors during the conversation. This is realized by predicting the audience's speech acts, such as questions and reactive tokens.
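The fusion of eye-gaze and prosodic cues for turn-taking prediction described above can be illustrated with a minimal sketch. This is not the model from the project; the feature names, weights, and logistic form are all illustrative assumptions, chosen only to show how gaze behavior (e.g. the presenter looking at the audience) and speech cues (e.g. a trailing pause) might be combined into a single turn-yield score.

```python
import math

# Hypothetical feature weights for a logistic turn-yield predictor.
# All names and values are illustrative, not taken from the paper.
WEIGHTS = {
    "gaze_at_audience": 2.0,   # fraction of the last second the presenter gazes at the audience
    "pause_length":     1.5,   # trailing silence in seconds
    "f0_slope":        -0.8,   # rising pitch often signals holding the turn
}
BIAS = -2.0

def turn_yield_probability(features: dict) -> float:
    """Logistic fusion of gaze and prosodic cues into P(turn is yielded)."""
    score = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))

# Gazing at the audience over a long pause suggests the floor is offered.
p_yield = turn_yield_probability(
    {"gaze_at_audience": 0.9, "pause_length": 1.2, "f0_slope": -0.5})

# Sustained gaze at the poster with no pause suggests the turn is held.
p_hold = turn_yield_probability(
    {"gaze_at_audience": 0.1, "pause_length": 0.1, "f0_slope": 0.4})
```

In practice such weights would be learned from annotated conversation data, and the same kind of gaze-plus-audio feature fusion can feed a diarization back end, as the article describes.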


Keywords: Multi-modal · Conversation analysis · Speech processing · Posterboard



This work was conducted by the members of the CREST project including Hiromasa Yoshimoto, Tony Tung, Yukoh Wakabayashi, Kouhei Sumi, Zhi-Qiang Chang, Takuma Iwatate, Soichiro Hayashi, Koji Inoue, Katsuya Takanashi (Kyoto University) and Yuji Onuma, Shunsuke Nakai, Ryoichi Miyazaki, Hiroshi Saruwatari (Nara Institute of Science and Technology).



Copyright information

© Springer Japan 2016

Authors and Affiliations

  1. Kyoto University, Kyoto, Japan
