A near-end listening enhancement system by RNN-based noise cancellation and speech modification

  • Gang Li
  • Ruimin Hu
  • Xiaochen Wang
  • Rui Zhang


Near-end listening enhancement (NELE) is a technology that improves speech intelligibility against environmental noise when people listen to a phone in noisy surroundings. The complex acoustic environments encountered in mobile communications have inspired many scholars to pursue NELE research. Although many NELE systems have been proposed, they focus only on speech modification to enhance intelligibility; few scholars have attempted to further enhance intelligibility through noise cancellation. Traditional noise cancellation is based on adaptive filtering, but in the most common handset mode its performance is poor, because the feedback microphone, exposed to complex environments, provides inadequate feedback. With the boom in deep neural networks (DNNs), and recurrent neural networks (RNNs) in particular, a network can predict the noise signal for cancellation without a feedback microphone. In this study, we propose a NELE system based on RNN noise cancellation and speech modification (RNC-SM), which introduces a noise cancellation stage after speech modification. Compared with existing NELE systems, the RNC-SM system effectively improves objective speech intelligibility index (SII) scores and subjective listening quality.
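The core idea summarized above is that an RNN, rather than an adaptive filter driven by a feedback microphone, predicts the upcoming noise signal, and the negated prediction serves as anti-noise. The sketch below is a minimal illustration of that principle, not the authors' RNC-SM architecture: it uses an echo-state-style RNN (a fixed random recurrent reservoir with an online-trained linear readout) to predict a synthetic quasi-periodic noise one sample ahead; all signal shapes, sizes, and learning rates here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic quasi-periodic "environmental noise" (stand-in for a real recording).
t = np.arange(4000)
noise = np.sin(2 * np.pi * t / 50) + 0.5 * np.sin(2 * np.pi * t / 13)

# Echo-state-style RNN: fixed random recurrent weights, trained linear readout.
n_hidden = 100
W_in = rng.uniform(-0.5, 0.5, size=n_hidden)
W = rng.normal(0.0, 1.0, size=(n_hidden, n_hidden))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # keep spectral radius < 1
w_out = np.zeros(n_hidden)

h = np.zeros(n_hidden)
lr = 0.5  # normalized-LMS step size for the readout
preds = np.zeros_like(noise)
for i in range(len(noise) - 1):
    h = np.tanh(W_in * noise[i] + W @ h)       # RNN state update
    preds[i + 1] = w_out @ h                   # one-step-ahead noise prediction
    err = noise[i + 1] - preds[i + 1]
    w_out += lr * err * h / (h @ h + 1e-8)     # online NLMS update of readout

# Cancellation: play -preds as anti-noise; the listener hears the residual.
residual = noise - preds
tail = slice(3000, 4000)
print("noise power:   ", np.mean(noise[tail] ** 2))
print("residual power:", np.mean(residual[tail] ** 2))
```

Once the readout converges, the residual power over the last segment is well below the original noise power, which is the feedforward-prediction effect that removes the need for a feedback microphone.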


Keywords: NELE · Speech intelligibility · Noise cancellation · Phase prediction · RNN



This work was supported by the National Key R&D Program of China (No. 2017YFB1002803) and the National Natural Science Foundation of China (No. 61801334, No. 61762005, No. U1736206).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Gang Li (1, 2, 3)
  • Ruimin Hu (1, 2, 3)
  • Xiaochen Wang (1)
  • Rui Zhang (1)

  1. National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China
  2. Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China
  3. Collaborative Innovation Center of Geospatial Technology, Wuhan, China
