Robot Audition: Missing Feature Theory Approach and Active Audition

  • Hiroshi G. Okuno
  • Kazuhiro Nakadai
  • Hyun-Don Kim
Part of the Springer Tracts in Advanced Robotics book series (STAR, volume 70)

Abstract

A robot's capability of listening to several things at once with its own ears, that is, robot audition, is important in improving interaction and symbiosis between humans and robots. The critical issues in robot audition are real-time processing and robustness against noisy environments, with enough flexibility to support various kinds of robots and hardware configurations. This paper presents two important aspects of robot audition: the Missing-Feature-Theory (MFT) approach and active audition. The HARK open-source robot audition software incorporates the MFT approach to recognize speech signals that are localized and separated from a mixture of sounds captured by an 8-channel microphone array. HARK has been ported to four robots, Honda ASIMO, SIG2, Robovie-R2 and HRP-2, with different microphone configurations, and recognizes three simultaneous utterances with 1.9 sec latency. In binaural hearing, the most famous problem is front-back confusion of sound sources. Active binaural robot audition implemented on SIG2 resolves this ambiguity well by rotating its head with pitching, and this active audition improves localization at the periphery.
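
To make the MFT approach concrete, here is a minimal sketch of missing-feature marginalization for a diagonal-covariance Gaussian mixture model (GMM), the acoustic-model form listed in the keywords. The function name, the boolean reliability mask, and the array shapes are illustrative assumptions, not HARK's actual API.

    import numpy as np

    def mft_log_likelihood(x, reliable, weights, means, variances):
        # Log-likelihood of one feature frame x (shape (D,)) under a
        # K-component diagonal-covariance GMM, using only the features the
        # mask marks as reliable. With a diagonal covariance, marginalizing
        # out the unreliable dimensions is exactly equivalent to dropping
        # their terms from the per-component product.
        r = np.asarray(reliable, dtype=bool)
        xr, mu, var = x[r], means[:, r], variances[:, r]
        log_comp = -0.5 * np.sum(
            (xr - mu) ** 2 / var + np.log(2.0 * np.pi * var), axis=1)
        a = np.log(weights) + log_comp
        m = a.max()  # log-sum-exp for numerical stability
        return m + np.log(np.exp(a - m).sum())

A decoder that scores each frame this way simply ignores the spectral regions corrupted by interfering talkers, which is why recognition of separated speech degrades gracefully rather than collapsing.

Active audition can be sketched in the same spirit. A single interaural time difference (ITD) is consistent with one azimuth in front and its mirror image behind, but turning the head shifts the ITDs predicted by the two hypotheses in opposite directions. The free-field geometry and the microphone spacing below are simplifying assumptions for illustration, not SIG2's actual estimator.

    def disambiguate_front_back(itd_before, itd_after, d_rot,
                                mic_dist=0.18, c=343.0):
        # The lateral angle recovered from an ITD is front-back ambiguous:
        # theta (front) and pi - theta (back) produce the same delay.
        theta = np.arcsin(np.clip(c * itd_before / mic_dist, -1.0, 1.0))
        itd = lambda az: (mic_dist / c) * np.sin(az)
        # After the head turns by d_rot radians, the two hypotheses predict
        # opposite ITD shifts; keep whichever matches the new measurement.
        front_pred = itd(theta - d_rot)
        back_pred = itd(np.pi - theta - d_rot)
        return ("front" if abs(itd_after - front_pred)
                <= abs(itd_after - back_pred) else "back")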

Keywords

Gaussian Mixture Model, Sound Source, Automatic Speech Recognition, Acoustic Model, Microphone Array

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Hiroshi G. Okuno (1)
  • Kazuhiro Nakadai (2, 3)
  • Hyun-Don Kim (1)
  1. Graduate School of Informatics, Kyoto University, Kyoto, Japan
  2. Honda Research Institute Japan Co., Ltd., Saitama, Japan
  3. Tokyo Institute of Technology, Japan
