Science China Information Sciences, Volume 54, Issue 12, pp 2471–2480

Monaural voiced speech segregation based on elaborate harmonic grouping strategies

  • WenJu Liu
  • XueLiang Zhang
  • Wei Jiang
  • Peng Li
  • Bo Xu
Research Papers Special Focus


In this paper, an enhanced algorithm for monaural voiced speech segregation is proposed, based on several elaborate harmonic grouping strategies. The main contributions lie in three aspects. First, the algorithm classifies time-frequency (T-F) units as resolved or unresolved by the carrier-to-envelope energy ratio, which yields more accurate classification than cross-channel correlation. Second, resolved T-F units are grouped according to the minimum amplitude principle, which has been verified to operate in human perception, together with the harmonic principle. Finally, an "enhanced" envelope autocorrelation function is employed to detect amplitude modulation rates, which substantially reduces half-frequency errors when grouping unresolved units. Systematic evaluation and comparison show that the proposed algorithm greatly improves separation performance: the signal-to-noise ratio (SNR) is improved by 0.96 dB over the previous method, and the algorithm also improves the PESQ score and the subjective perception score.
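Two of the steps above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the carrier-to-envelope energy ratio is assumed here to be the energy of a band-limited signal divided by the energy of its mean-removed amplitude envelope (resolved units, dominated by a single harmonic, give a high ratio; unresolved units, dominated by beating between harmonics, give a low one), and the "enhanced" autocorrelation follows the time-stretch-and-subtract scheme of Tolonen and Karjalainen, which suppresses the subharmonic peak responsible for half-frequency errors. All function names are illustrative.

```python
import numpy as np

def envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

def carrier_to_envelope_ratio(x):
    """Energy of the band-limited signal over energy of its mean-removed envelope.
    A nearly constant envelope (resolved unit) makes the denominator small and
    the ratio large; strong amplitude modulation (unresolved unit) does the opposite."""
    env = envelope(x)
    env = env - env.mean()
    return np.sum(x ** 2) / (np.sum(env ** 2) + 1e-12)

def enhanced_acf(x):
    """'Enhanced' autocorrelation: half-wave rectify the ACF, time-stretch it by a
    factor of 2, and subtract; peaks at double lags (half the true AM rate) are
    cancelled, reducing octave-down errors when picking the modulation rate."""
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:]   # non-negative lags only
    acf = np.maximum(acf, 0.0)                      # half-wave rectification
    stretched = np.interp(np.arange(n) / 2.0, np.arange(n), acf)
    return np.maximum(acf - stretched, 0.0)
```

As a usage example, a 2500 Hz carrier fully modulated at 200 Hz (sampled at 8 kHz) yields an enhanced-ACF peak near lag 40 samples, i.e. the correct 200 Hz AM rate, while the plain ACF also shows a competing peak at lag 80 (the 100 Hz half-frequency error) that the subtraction removes.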


computational auditory scene analysis · voiced speech separation · harmonic principle · minimum amplitude principle · elaborate harmonic grouping strategies



Supplementary material

Supplementary material, approximately 2.75 MB.



Copyright information

© Science China Press and Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • WenJu Liu (1, corresponding author)
  • XueLiang Zhang (1)
  • Wei Jiang (1)
  • Peng Li (2)
  • Bo Xu (2)
  1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. Digital Media Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
