Past review, current progress, and challenges ahead on the cocktail party problem

  • Yan-min Qian
  • Chao Weng
  • Xuan-kai Chang
  • Shuai Wang
  • Dong Yu
Review

Abstract

The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved before automatic speech recognition (ASR) systems can be widely deployed. In this overview paper, we review the techniques proposed in the last two decades for attacking this problem. We focus our discussion on the speech separation problem, given its central role in the cocktail party environment, and describe the conventional single-channel techniques, such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF), and generative models; the conventional multi-channel techniques, such as beamforming and multi-channel blind source separation; and the newly developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself, using more powerful models and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.
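To make the conventional single-channel techniques named above concrete, the sketch below shows supervised NMF-based separation, assuming nonnegative spectral dictionaries W1 and W2 have already been learned per speaker on clean training data. It is a minimal illustration using the well-known Lee-Seung multiplicative updates, not an implementation from any system reviewed in the paper; all function and variable names are ours.

```python
import numpy as np

def nmf_separate(V, W1, W2, n_iter=200, eps=1e-10):
    """Separate a nonnegative mixture spectrogram V (F x T) given
    pre-trained per-speaker dictionaries W1, W2 (F x K each).
    Activations H are fit with multiplicative updates (Euclidean cost);
    Wiener-like soft masks then split V between the two speakers."""
    W = np.hstack([W1, W2])                       # stacked dictionary (F, 2K)
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)    # multiplicative update for H
    K = W1.shape[1]
    S1, S2 = W1 @ H[:K], W2 @ H[K:]               # per-speaker reconstructions
    mask1 = S1 / (S1 + S2 + eps)                  # soft mask for speaker 1
    return mask1 * V, (1.0 - mask1) * V
```

Masking the observed mixture, rather than returning the raw reconstructions, keeps the two estimates consistent with V, which is the usual practice in masking-based separation.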
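On the multi-channel side, the simplest member of the beamforming family surveyed here is delay-and-sum. The sketch below assumes a far-field source, a known look direction, and STFT-domain processing; again, the names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def delay_and_sum(stft_frame, mic_pos, look_dir, freqs, c=343.0):
    """Frequency-domain delay-and-sum beamformer for one STFT frame.

    stft_frame: (M, F) complex spectra observed at M microphones.
    mic_pos:    (M, 3) microphone coordinates in meters.
    look_dir:   unit 3-vector pointing from the array toward the talker.
    freqs:      (F,) STFT bin center frequencies in Hz.
    """
    delays = mic_pos @ look_dir / c                           # (M,) arrival delays
    steering = np.exp(-2j * np.pi * np.outer(delays, freqs))  # (M, F) steering
    # Undo each microphone's phase shift, then average: signals from the
    # look direction add coherently, interference from elsewhere does not.
    return np.mean(np.conj(steering) * stft_frame, axis=0)
```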
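Finally, the label-permutation problem that PIT resolves is easy to state in code: a separation network's output streams carry no fixed speaker identity, so the training loss is taken as the minimum over all output-to-reference assignments. Below is a minimal NumPy sketch of the utterance-level criterion under the assumption of magnitude-spectrogram targets; it is not the authors' implementation.

```python
import itertools
import numpy as np

def pit_mse_loss(estimates, references):
    """Utterance-level PIT loss: MSE under the best output-to-speaker
    assignment. estimates/references are lists of S arrays of shape (T, F)."""
    n_src = len(references)
    best = np.inf
    for perm in itertools.permutations(range(n_src)):
        # Error when output stream i is assigned to reference perm[i].
        err = np.mean([np.mean((estimates[i] - references[p]) ** 2)
                       for i, p in enumerate(perm)])
        best = min(best, err)
    return best

# Toy check: references arrive in swapped order, plus a little noise.
rng = np.random.default_rng(0)
est = [rng.random((100, 257)) for _ in range(2)]
ref = [est[1] + 0.01 * rng.random((100, 257)), est[0]]
print(pit_mse_loss(est, ref))  # near zero: PIT finds the swapped assignment
```

Because the permutation is chosen once per utterance rather than per frame, the same assignment is enforced across all frames, which is what lets utterance-level PIT trace each speaker through time.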

Keywords

Cocktail party problem; Computational auditory scene analysis; Non-negative matrix factorization; Permutation invariant training; Multi-talker speech processing

CLC number

TP391.4 

Copyright information

© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Tencent AI Lab, Tencent, Bellevue, USA
  2. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China