Abstract
Recent automatic speech recognition systems perform well when the training data matches the test data, but much worse when the two differ in some important regard, such as the number and arrangement of microphones or the reverberation and noise conditions. Because these configurations are difficult to predict a priori and difficult to exhaustively train over, unsupervised spatial-clustering methods are attractive. Such methods separate sources using differences in their spatial characteristics, without needing a full model of the spatial configuration of the acoustic scene. This chapter discusses several approaches to unsupervised spatial clustering, with a focus on model-based expectation maximization source separation and localization (MESSL). It describes the basic two-microphone version of this model, which clusters spectrogram points based on the relative differences in phase and level between pairs of microphones; its generalization to more than two microphones; and its use to drive minimum variance distortionless response (MVDR) beamforming. These systems are evaluated for speech enhancement as well as automatic speech recognition, where they reduce word error rates by 9.9–17.1% relative over a standard delay-and-sum beamformer in mismatched train–test conditions.
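The core idea behind such spatial clustering can be sketched with a toy example: time–frequency points dominated by different sources exhibit different inter-channel phase differences (IPDs), so clustering those IPDs recovers a binary mask without any model of the room or array geometry. The sketch below is a hypothetical illustration using synthetic IPD values and plain k-means; it is not the MESSL algorithm itself, which instead fits explicit probabilistic models of phase and level differences with EM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical T-F points from two sources with different inter-channel
# phase differences (IPDs, in radians); values are illustrative only.
n = 200
ipd_a = rng.normal(0.5, 0.1, n)   # source A: IPD concentrated near 0.5 rad
ipd_b = rng.normal(-1.0, 0.1, n)  # source B: IPD concentrated near -1.0 rad
ipd = np.concatenate([ipd_a, ipd_b])

# Embed the circular IPDs on the unit circle so clustering respects
# phase wrap-around at +/- pi.
X = np.stack([np.cos(ipd), np.sin(ipd)], axis=1)

# Minimal 2-means clustering (Lloyd's algorithm), seeded with one point
# from each source's region of the feature space.
centers = X[[0, n]]
for _ in range(20):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.stack([X[labels == k].mean(0) for k in range(2)])

# The recovered labels form a binary mask that separates the two sources.
purity = max((labels[:n] == 0).mean() + (labels[n:] == 1).mean(),
             (labels[:n] == 1).mean() + (labels[n:] == 0).mean()) / 2
assert purity > 0.95
```

The per-point labels play the role of a time–frequency mask; as the abstract notes, such masks can then be used to estimate the spatial statistics that steer an MVDR beamformer.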
Notes
1. For the purposes of the MESSL-MRF discussion, the indices k and k′ are a shorthand for the T–F coordinates (t_k, f_k) and (t_{k′}, f_{k′}).
Acknowledgements
The work reported here was carried out during the 2015 Jelinek Memorial Summer Workshop on Speech and Language Technologies at the University of Washington, Seattle, and was supported by Johns Hopkins University via NSF Grant No. IIS 1005411, and gifts from Google, Microsoft Research, Amazon, Mitsubishi Electric, and MERL. It is also based upon work supported by the NSF under Grant No. IIS 1409431. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Mandel, M.I., Barker, J.P. (2017). Multichannel Spatial Clustering Using Model-Based Source Separation. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_3
DOI: https://doi.org/10.1007/978-3-319-64680-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science (R0)