Optimizing the Objective Measure of Speech Quality in Monaural Speech Separation

Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 43)


Monaural speech separation based on computational auditory scene analysis (CASA) is a challenging problem in signal processing. The Ideal Binary Mask (IBM), proposed by DeLiang Wang and colleagues, is considered the benchmark goal of CASA. However, it introduces an objectionable distortion known as musical noise, and the perceived speech quality is poor at low SNRs. The main cause of this degradation is the binary masking itself: some time-frequency units of the speech are discarded entirely during synthesis. To address the musical-noise problem of the IBM and improve speech quality, this work proposes a new soft mask as the computational goal of CASA. The performance of the proposed soft mask is evaluated with the perceptual evaluation of speech quality (PESQ) measure, using the IEEE speech corpus and NOISEX-92 noises. The experimental results indicate that the proposed soft mask outperforms the traditional IBM in the context of monaural speech separation.
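The paper's own optimized soft mask is not reproduced here; as a generic illustration of the distinction the abstract draws, the sketch below contrasts a hard 0/1 binary mask with a Wiener-style ratio (soft) mask on a toy time-frequency power grid. The threshold `lc_db` and the toy power values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Ideal Binary Mask: keep a T-F unit (mask = 1) only when its
    local SNR exceeds the local criterion (LC, in dB); otherwise
    discard it (mask = 0), which is what causes musical noise."""
    local_snr_db = 10.0 * np.log10(speech_power / noise_power)
    return (local_snr_db > lc_db).astype(float)

def soft_mask(speech_power, noise_power):
    """Wiener-style soft (ratio) mask: weight each T-F unit by the
    fraction of its energy attributed to speech, avoiding hard
    0/1 decisions."""
    return speech_power / (speech_power + noise_power)

# Toy 2x3 time-frequency power grids (hypothetical values).
speech = np.array([[4.0, 1.0, 9.0],
                   [0.5, 6.0, 2.0]])
noise  = np.array([[1.0, 2.0, 1.0],
                   [2.0, 1.0, 2.0]])

ibm = ideal_binary_mask(speech, noise)  # hard 0/1 decisions
sm  = soft_mask(speech, noise)          # graded weights in (0, 1)
```

Applied to a noisy mixture, the binary mask zeroes out low-SNR units entirely, while the soft mask merely attenuates them, which is the property the paper exploits to reduce musical noise.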


Keywords: Monaural speech separation · Computational auditory scene analysis · Perceptual evaluation of speech quality · Ideal binary mask · Optimum soft mask


References

  1. Loizou, P.C.: Speech Enhancement: Theory and Practice, 2nd edn. CRC Press (2013)
  2. Naik, G.R., Kumar, D.K.: An overview of independent component analysis and its applications. Informatica 35, 63–81 (2011)
  3. Grais, E., Erdogan, H.: Single channel speech music separation using nonnegative matrix factorization and spectral masks. In: The 17th International Conference on Digital Signal Processing, pp. 1–6. Island of Corfu, Greece (2011)
  4. Jang, G.J., Lee, T.W.: A probabilistic approach to single channel source separation. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1173–1180 (2003)
  5. Bregman, A.S.: Auditory Scene Analysis. MIT Press, Cambridge (1990)
  6. Hummersone, C., Stokes, T., Brookes, T.: On the ideal ratio mask as the goal of computational auditory scene analysis. In: Naik, G.R., Wang, W. (eds.) Blind Source Separation: Advances in Theory, Algorithms and Applications. Signals and Communication Technology, pp. 369–393. Springer-Verlag, Heidelberg (2014)
  7. Radfar, M.H., Dansereau, R.M., Chan, W.Y.: Monaural speech separation based on gain adapted minimum mean square error estimation. J. Signal Process. Syst. 61, 21–37 (2010)
  8. Mowlaee, P., Saeidi, R., Martin, R.: Model-driven speech enhancement for multisource reverberant environment. In: Theis, F., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) Latent Variable Analysis and Signal Separation. Lecture Notes in Computer Science, vol. 7191, pp. 454–461. Springer-Verlag, Heidelberg (2012)
  9. Wang, D.: On ideal binary mask as the computational goal of auditory scene analysis. In: Divenyi, P. (ed.) Speech Separation by Humans and Machines, pp. 181–197. Kluwer Academic, Norwell (2005)
  10. Geravanchizadeh, M., Ahmadnia, R.: Monaural speech enhancement based on multi-threshold masking. In: Naik, G.R., Wang, W. (eds.) Blind Source Separation: Advances in Theory, Algorithms and Applications. Signals and Communication Technology, pp. 369–393. Springer-Verlag, Heidelberg (2014)
  11. Li, N., Loizou, P.C.: Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. J. Acoust. Soc. Am. 123(3), 1673–1682 (2008)
  12. Araki, S., Sawada, H., Mukai, R., Makino, S.: Blind sparse source separation with spatially smoothed time-frequency masking. In: International Workshop on Acoustic Echo and Noise Control, Paris (2006)
  13. Cao, S., Li, L., Wu, X.: Improvement of intelligibility of ideal binary-masked noisy speech by adding background noise. J. Acoust. Soc. Am. 129, 2227–2236 (2011)
  14. Patterson, R.D., Nimmo-Smith, I., Holdsworth, J., Rice, P.: An efficient auditory filterbank based on the gammatone function. Report No. 2341, MRC Applied Psychology Unit, Cambridge (1985)
  15. Rajavel, R., Sathidevi, P.S.: A new GA optimised reliability ratio based integration weight estimation scheme for decision fusion audio-visual speech recognition. Int. J. Signal Imaging Syst. Eng. 4(2), 123–131 (2011)
  16. Rothauser, E.H., Chapman, W.D., Guttman, N., Hecker, M.H.L., Nordby, K.S., Silbiger, H.R., Urbanek, G.E., Weinstock, M.: IEEE recommended practice for speech quality measurements. IEEE Trans. Audio Electroacoust. 17, 225–246 (1969)
  17.
  18. ITU-T: Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. Series P: Telephone Transmission Quality, Recommendation P.862, ITU (2001)

Copyright information

© Springer India 2016

Authors and Affiliations

  • M. Dharmalingam (PRIST University, Thanjavur, India)
  • M. C. John Wiselin (Department of EEE, Travancore Engineering College, Kollam, India)
  • R. Rajavel (Department of ECE, SSN College of Engineering, Chennai, India)