Abstract
Recent automatic speech recognition systems perform well when the training data matches the test data, but much worse when the two differ in some important regard, such as the number and arrangement of microphones or the reverberation and noise conditions. Because these configurations are difficult to predict a priori and difficult to exhaustively train over, unsupervised spatial-clustering methods are attractive. Such methods separate sources using differences in their spatial characteristics, without needing a full model of the spatial configuration of the acoustic scene. This chapter discusses several approaches to unsupervised spatial clustering, with a focus on model-based expectation maximization source separation and localization (MESSL). It describes the basic two-microphone version of this model, which clusters spectrogram points based on the relative differences in phase and level between pairs of microphones; its generalization to more than two microphones; and its use to drive minimum variance distortionless response (MVDR) beamforming. These systems are evaluated for speech enhancement as well as automatic speech recognition, where they reduce word error rates by 9.9–17.1% relative over a standard delay-and-sum beamformer in mismatched train–test conditions.
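The core idea behind such spatial clustering can be sketched with a toy example: time–frequency points dominated by different sources exhibit different inter-channel phase differences (IPDs), so clustering those IPDs recovers a binary mask without any model of the room or array geometry. The sketch below is a hypothetical illustration using synthetic IPD values and plain k-means; it is not the MESSL algorithm itself, which instead fits explicit probabilistic models of phase and level differences with EM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical T-F points from two sources with different inter-channel
# phase differences (IPDs, in radians); values are illustrative only.
n = 200
ipd_a = rng.normal(0.5, 0.1, n)   # source A: IPD concentrated near 0.5 rad
ipd_b = rng.normal(-1.0, 0.1, n)  # source B: IPD concentrated near -1.0 rad
ipd = np.concatenate([ipd_a, ipd_b])

# Embed the circular IPDs on the unit circle so clustering respects
# phase wrap-around at +/- pi.
X = np.stack([np.cos(ipd), np.sin(ipd)], axis=1)

# Minimal 2-means clustering (Lloyd's algorithm), seeded with one point
# from each source's region of the feature space.
centers = X[[0, n]]
for _ in range(20):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.stack([X[labels == k].mean(0) for k in range(2)])

# The recovered labels form a binary mask that separates the two sources.
purity = max((labels[:n] == 0).mean() + (labels[n:] == 1).mean(),
             (labels[:n] == 1).mean() + (labels[n:] == 0).mean()) / 2
assert purity > 0.95
```

The per-point labels play the role of a time–frequency mask; as the abstract notes, such masks can then be used to estimate the spatial statistics that steer an MVDR beamformer.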
Notes
1. For the purposes of the MESSL-MRF discussion, the indices k and k′ are a shorthand for the T–F coordinates (t_k, f_k) and (t_{k′}, f_{k′}).
Acknowledgements
The work reported here was carried out during the 2015 Jelinek Memorial Summer Workshop on Speech and Language Technologies at the University of Washington, Seattle, and was supported by Johns Hopkins University via NSF Grant No. IIS 1005411, and gifts from Google, Microsoft Research, Amazon, Mitsubishi Electric, and MERL. It is also based upon work supported by the NSF under Grant No. IIS 1409431. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Mandel, M.I., Barker, J.P. (2017). Multichannel Spatial Clustering Using Model-Based Source Separation. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_3
DOI: https://doi.org/10.1007/978-3-319-64680-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science (R0)