Soft missing-feature mask generation for robot audition

  • Research Article
Paladyn

Abstract

This paper describes an improvement in automatic speech recognition (ASR) for robot audition that introduces Missing Feature Theory (MFT) with soft missing-feature masks (MFMs) to realize natural human-robot interaction. In an everyday environment, a robot’s microphones capture various sounds besides the user’s utterances. Although sound-source separation is an effective way to enhance the user’s utterances, it inevitably produces errors due to reflection and reverberation. MFT can cope with these errors in two steps. First, MFMs are generated based on the reliability of time-frequency components. Then ASR weights the time-frequency components according to the MFMs. We propose a new method that automatically generates soft MFMs, which consist of continuous values from 0 to 1 produced by a sigmoid function. The proposed MFM generation was implemented for HRP-2 using HARK, our open-source robot audition software. Preliminary results show that the soft MFM outperformed a hard (binary) MFM in recognizing three simultaneous utterances. In a human-robot interaction task, the minimum required interval between two adjacent loudspeakers was reduced from 60 degrees to 30 degrees by using soft MFMs.
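The difference between a hard and a soft mask can be illustrated with a minimal sketch. It is not the implementation used in HARK: the function names, the `threshold` and `steepness` parameters, the assumption that reliability lies in [0, 1], and the use of mask values as weights on per-feature log-likelihoods are all assumptions made here for illustration.

```python
import math


def hard_mask(reliability, threshold=0.5):
    """Binary mask: 1.0 if the component is judged reliable, else 0.0.

    `threshold` is a hypothetical parameter, not a value from the paper.
    """
    return 1.0 if reliability > threshold else 0.0


def soft_mask(reliability, threshold=0.5, steepness=10.0):
    """Soft mask: a sigmoid maps reliability to a continuous value in (0, 1).

    `threshold` shifts the sigmoid and `steepness` controls its slope;
    both are hypothetical parameters chosen for this example. Reliability
    is assumed to lie in [0, 1].
    """
    return 1.0 / (1.0 + math.exp(-steepness * (reliability - threshold)))


def masked_log_likelihood(log_likelihoods, mask):
    """Weight each feature's log-likelihood by its mask value and sum.

    This is one common way to apply a mask during decoding; it is assumed
    here rather than taken from the paper.
    """
    return sum(m * ll for m, ll in zip(mask, log_likelihoods))


if __name__ == "__main__":
    reliabilities = [0.1, 0.45, 0.55, 0.9]
    print([hard_mask(r) for r in reliabilities])
    print([round(soft_mask(r), 3) for r in reliabilities])
```

With a hard mask, components just below and just above the threshold are treated as fully unreliable and fully reliable; the sigmoid instead gives them similar intermediate weights, which is the intuition behind using continuous mask values.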

Author information

Corresponding author

Correspondence to Toru Takahashi.

About this article

Cite this article

Takahashi, T., Nakadai, K., Komatani, K. et al. Soft missing-feature mask generation for robot audition. Paladyn 1, 37–47 (2010). https://doi.org/10.2478/s13230-010-0005-1

  • DOI: https://doi.org/10.2478/s13230-010-0005-1
