Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement

  • Yasuaki Iwata
  • Tomohiro Nakatani
  • Takuya Yoshioka
  • Masakiyo Fujimoto
  • Hirofumi Saito


When speech signals are captured in real acoustic environments, the captured signals are distorted by various types of interference, such as ambient noise, reverberation, and the utterances of extraneous speakers. There are two important approaches to speech enhancement for reducing such interference. One is based on the spatial features of the signals, such as the direction of arrival and the acoustic transfer functions, and enhances speech using multichannel audio signal processing. The other is based on speech spectral models that represent the probability density function of speech spectra, and enhances speech by distinguishing speech from noise according to those models. In this chapter, we propose a new approach that integrates these two. The proposed approach uses the spatial and spectral features of the signals in a complementary manner to achieve reliable and accurate speech enhancement, and it can be applied to various speech enhancement problems, including denoising, dereverberation, and blind source separation (BSS). Here we focus on its application to BSS, and we show experimentally that the proposed integration improves BSS performance compared with a conventional approach.
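To make the spectral-model side of this integration concrete, the sketch below illustrates maximum a posteriori (MAP) estimation of a clean log-spectrum under a Gaussian-mixture log-spectral prior. It is only a minimal illustration of the MAP idea, not the chapter's multichannel algorithm: it assumes a single-channel Gaussian observation model directly in the log-spectral domain (`y = x + e`), and all function names, mixture parameters, and variances here are hypothetical.

```python
import numpy as np

def map_log_spectrum(y, weights, means, prior_vars, obs_var):
    """MAP estimate of a clean log-spectrum x from a noisy observation y.

    Assumed model (illustrative only): y = x + e with e ~ N(0, obs_var),
    and a Gaussian-mixture prior on x with K components over F frequency bins.
    y: (F,), weights: (K,), means: (K, F), prior_vars: (K, F), obs_var: scalar.
    """
    y = np.asarray(y, dtype=float)
    # Marginal likelihood p(y | k) = N(y; mean_k, prior_var_k + obs_var),
    # summed in the log domain over frequency bins.
    marg_var = prior_vars + obs_var                                  # (K, F)
    log_lik = -0.5 * (np.log(2 * np.pi * marg_var)
                      + (y - means) ** 2 / marg_var).sum(axis=1)     # (K,)
    # Hard MAP over mixture components, then the conjugate Gaussian
    # posterior mean for the winning component: a weighted average that
    # pulls the observation toward the prior mean.
    k = int(np.argmax(np.log(weights) + log_lik))
    return (prior_vars[k] * y + obs_var * means[k]) / (prior_vars[k] + obs_var)

# Toy demo with a two-component prior ("silence-like" vs "speech-like"):
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0, 0.0],    # silence-like component
                  [5.0, 5.0, 5.0]])   # speech-like component
prior_vars = np.ones((2, 3))
y = np.array([4.0, 5.5, 4.5])         # observation close to the speech component
x_hat = map_log_spectrum(y, weights, means, prior_vars, obs_var=1.0)
# x_hat lies between y and the speech-like mean: [4.5, 5.25, 4.75]
```

In the chapter's setting, the likelihood term would instead come from a multichannel spatial model, so the spatial and spectral cues jointly determine the posterior rather than the log-domain additive-noise model assumed here.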


Keywords: Expectation maximization algorithm · Blind source separation · Speech enhancement · Clean speech · Conditional probability density function
(These keywords were added by machine, not by the authors; the process is experimental and the keywords may be updated as the learning algorithm improves.)



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Yasuaki Iwata (1, 2)
  • Tomohiro Nakatani (2)
  • Takuya Yoshioka (2)
  • Masakiyo Fujimoto (2)
  • Hirofumi Saito (1)

  1. Graduate School of Information Science, Nagoya University, Nagoya, Japan
  2. NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
