A Speaker Diarization System with Robust Speaker Localization and Voice Activity Detection

  • Yangyang Huang (email author)
  • Takuma Otsuka
  • Hiroshi G. Okuno
Part of the Studies in Computational Intelligence book series (SCI, volume 489)


In real-world auditory scene analysis for human-robot interaction, three types of information must be extracted from the observed data: who speaks, when, and where. We present a speaker diarization system that answers these three questions. Multiple signal classification (MUSIC) is a powerful method for voice activity detection (VAD) and direction-of-arrival (DOA) estimation. We describe our system and compare its VAD and DOA performance with that of a method based on the MUSIC algorithm.
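
The abstract names MUSIC as the comparison method but does not describe it. As a rough illustration only, the following sketch estimates a single DOA with narrowband MUSIC. It assumes a uniform linear array with half-wavelength spacing, a known number of sources, and simulated data; the array geometry, the helper names (`steering_vector`, `music_spectrum`, `simulate`), and all parameter values are hypothetical and are not taken from the authors' system.

```python
import numpy as np

def steering_vector(theta_deg, n_mics, spacing_wavelengths=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    theta = np.deg2rad(theta_deg)
    k = np.arange(n_mics)
    return np.exp(-2j * np.pi * spacing_wavelengths * k * np.sin(theta))

def music_spectrum(snapshots, n_sources, grid_deg, spacing_wavelengths=0.5):
    """MUSIC pseudo-spectrum from narrowband snapshots of shape (n_mics, n_frames)."""
    n_mics, n_frames = snapshots.shape
    # Sample spatial covariance matrix of the array signals.
    cov = snapshots @ snapshots.conj().T / n_frames
    # eigh returns eigenvalues in ascending order, so the first
    # n_mics - n_sources eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(cov)
    noise = eigvecs[:, : n_mics - n_sources]
    spectrum = np.empty(len(grid_deg))
    for i, theta in enumerate(grid_deg):
        a = steering_vector(theta, n_mics, spacing_wavelengths)
        proj = noise.conj().T @ a  # projection onto the noise subspace
        spectrum[i] = 1.0 / np.real(proj.conj() @ proj)
    return spectrum

def simulate(true_deg, n_mics=8, n_frames=400, noise_std=0.1, seed=0):
    """One complex Gaussian source plus white sensor noise (toy data)."""
    rng = np.random.default_rng(seed)
    a = steering_vector(true_deg, n_mics)
    s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
    noise = noise_std * (
        rng.standard_normal((n_mics, n_frames))
        + 1j * rng.standard_normal((n_mics, n_frames))
    )
    return np.outer(a, s) + noise

grid = np.arange(-90.0, 90.5, 0.5)        # candidate directions in degrees
x = simulate(true_deg=20.0)               # simulated source at 20 degrees
p = music_spectrum(x, n_sources=1, grid_deg=grid)
doa_estimate = grid[np.argmax(p)]         # direction of the spectrum peak
print(f"estimated DOA: {doa_estimate:.1f} degrees")
```

A MUSIC-based VAD typically thresholds the height of the spectrum peak per time block. The abstract gives no threshold or block length, so that step is left out here.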


Ground Truth · Sound Source · Audio Signal · Blind Source Separation · Free Speech
These keywords were added by machine and not by the authors.




Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Yangyang Huang (1), email author
  • Takuma Otsuka (1)
  • Hiroshi G. Okuno (1)

  1. Graduate School of Informatics, Kyoto University, Kyoto, Japan
