Abstract
This paper estimates the Ideal Binary Mask (IBM) from the speech cochleagram and visual cues to improve speech intelligibility and quality using an Audio-Visual Convolutional Neural Network (AVCNN). Many earlier speech enhancement techniques depended heavily on audio attributes alone to reduce the noise present in the speech signal. Several recent studies have shown that speech enhancement using visual data as an auxiliary input alongside audio data is more effective in reducing acoustic noise in speech signals. In the proposed work, a multichannel CNN extracts the dynamics of both visual and audio signal features, which are then integrated to estimate, using the proposed algorithm, the threshold that yields the IBM for speech enhancement. The performance of the proposed model is evaluated primarily for speech intelligibility in terms of STOI, ESTOI, and CSII; speech quality is additionally measured in terms of PESQ, SSNR, CSIG, CBAK, and COVL. The evaluation results reveal that the proposed audio-visual mask estimation model outperforms the audio-only, visual-only, and existing audio-visual mask estimation models. The proposed AVCNN model thus demonstrates its effectiveness in merging the dynamics of audio information with visual speech information for speech enhancement.
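To illustrate the ideal binary mask referred to in the abstract (not the paper's AVCNN-based threshold estimation itself), the sketch below computes the textbook IBM: a time-frequency unit is retained when its local SNR exceeds a local criterion (LC) threshold. The function name, the 0 dB default threshold, and the toy magnitudes are illustrative assumptions.

```python
import numpy as np

def ideal_binary_mask(speech_tf, noise_tf, lc_db=0.0):
    """Textbook ideal binary mask from clean-speech and noise T-F magnitudes.

    A T-F unit gets mask value 1 when its local SNR (in dB) exceeds the
    local criterion lc_db, and 0 otherwise.
    """
    eps = 1e-12  # guard against log of zero
    snr_db = 20.0 * np.log10((speech_tf + eps) / (noise_tf + eps))
    return (snr_db > lc_db).astype(np.float32)

# Toy example: 2 frequency channels x 3 time frames of magnitudes
speech = np.array([[1.0, 0.1, 0.5],
                   [0.2, 0.8, 0.05]])
noise = np.array([[0.1, 0.5, 0.5],
                  [0.2, 0.1, 0.5]])
mask = ideal_binary_mask(speech, noise)
```

Applying the mask elementwise to the noisy cochleagram and resynthesizing recovers the enhanced speech; in the paper, the AVCNN's role is to estimate this mask when the clean speech and noise are not separately available.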
Data Availability
The present work utilizes two datasets: the audio-visual 'GRID Corpus' developed by Cooke et al. in 2006, and the 'Real-world Noise dataset' developed by Thiemann et al. in 2013. The two works are cited in the manuscript as [15] and [54], respectively.
References
A.H. Abdelaziz, Comparing fusion models for DNN-based audiovisual continuous speech recognition. IEEE/ACM Trans. Audio, Speech, Lang Process 26, 475–484 (2017)
A. Adeel, M. Gogate, A. Hussain, Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Inf. Fusion 59, 163–170 (2020)
A. Adeel, M. Gogate, A. Hussain, W.M. Whitmer, Lip-Reading driven deep learning approach for speech enhancement. IEEE Trans. Emerg. Top Comput. Intell. 5, 481–490 (2021). https://doi.org/10.1109/TETCI.2019.2917039
I. Almajai, B. Milner, Visually derived wiener filters for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 19, 1642–1651 (2010)
A. Arriandiaga, G. Morrone, L. Pasa, L. Badino, and C. Bartolozzi, Audio-visual target speaker enhancement on multi-talker environment using event-driven cameras. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, pp. 1–5 (2021)
B. Atal, M. Schroeder, Predictive coding of speech signals and subjective error criteria. IEEE Trans. Acoust. 27, 247–254 (1979)
M. Berouti, R. Schwartz, and J. Makhoul, Enhancement of speech corrupted by acoustic noise. In: ICASSP’79. IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, pp. 208–211 (1979)
S. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust 27, 113–120 (1979)
E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J ACM 58, 1–37 (2011)
J. Chen, J. Benesty, Y. Huang, E.J. Diethorn, Fundamentals of noise reduction, in Springer Handbook of Speech Processing (Springer, 2008), pp. 843–872
J. Chen, Y. Wang, D. Wang, A feature study for classification-based speech separation at low signal-to-Noise ratios. IEEE/ACM Trans. Audio, Speech, Lang Process 22, 1993–2002 (2014). https://doi.org/10.1109/TASLP.2014.2359159
Z. Chen, S. Watanabe, H. Erdogan, and J.R. Hershey, Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. In: Sixteenth Annual Conference of the International Speech Communication Association, (2015)
Y.-H. Chin, J.-C. Wang, C.-L. Huang, K.-Y. Wang, C.-H. Wu, Speaker identification using discriminative features and sparse representation. IEEE Trans. Inf. Forensics Secur. 12, 1979–1987 (2017)
H.S. Choi, S. Park, J.H. Lee, H. Heo, D. Jeon, K. Lee, Real-time denoising and dereverberation with tiny recurrent U-Net. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5789–5793 (2021)
M. Cooke, J. Barker, S. Cunningham, X. Shao, An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120, 2421–2424 (2006)
T. Darrell, J.W. Fisher Iii, and P. Viola, Audio-visual segmentation and “the cocktail party effect.” In: International Conference on Multimodal Interfaces. Springer, pp. 32–40 (2000)
J. Eggert, and E. Korner, Sparse coding and NMF. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541). IEEE, pp. 2529–2533 (2004)
A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K.W. Wilson, A. Hassidim, W.T. Freeman, M. Rubinstein, Looking to listen at the cocktail party. ACM Trans. Graph 37, 1–11 (2018)
R. Frazier, S. Samsam, L. Braida, and A. Oppenheim, Enhancement of speech by adaptive filtering. In: ICASSP’76. IEEE International Conference on Acoustics, Speech, and Signal Processing. Citeseer, pp. 251–253 (1976)
A. Gabbay, A. Shamir, S. Peleg, Visual speech enhancement. In: Interspeech 2018. ISCA, pp. 1170–1174 (2018)
L. Girin, J.-L. Schwartz, G. Feng, Audio-visual enhancement of speech in noise. J. Acoust. Soc. Am. 109, 3007–3020 (2001)
M. Gogate, A. Adeel, R. Marxer, J. Barker, A. Hussain, DNN driven speaker independent audio-visual mask estimation for speech separation. In: Proc. Interspeech 2018, pp. 2723–2727 (2018). https://doi.org/10.21437/Interspeech.2018-2516
S. Graetzer, C. Hopkins, Intelligibility prediction for speech mixed with white Gaussian noise at low signal-to-noise ratios. J. Acoust. Soc. Am. 149, 1346–1362 (2021)
R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, D. Yu, Multi-modal multi-channel target speech separation. IEEE J. Sel. Top Signal Process 14, 530–541 (2020)
Y. Guo, W. Yu, J. Zhou, ZTrans: A new transformer for speech enhancement. In: 2021 4th International Conference on Information Communication and Signal Processing (ICICSP). IEEE, pp. 178–182 (2021)
X. Hao, X. Su, R. Horaud, X. Li, Fullsubnet: A full-band and sub-band fusion model for real-time single-channel speech enhancement. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 6633–6637 (2021)
E.W. Healy, M. Delfarah, J.L. Vasko, B.L. Carter, D. Wang, An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker. J. Acoust. Soc. Am. 141, 4230–4239 (2017)
J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, H.-M. Wang, Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Trans. Emerg. Top Comput. Intell. 2, 117–128 (2018)
Y. Hu, P.C. Loizou, A subspace approach for enhancing speech corrupted by colored noise. In: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, pp. I–573 (2002)
Y. Hu, P.C. Loizou, Evaluation of objective measures for speech enhancement. In: Ninth International Conference on Spoken Language Processing (2006)
P.-S. Huang, S.D. Chen, P. Smaragdis, M. Hasegawa-Johnson, Singing-voice separation from monaural recordings using robust principal component analysis. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 57–60 (2012)
E. Ideli, B. Sharpe, I.V. Bajić, R.G. Vaughan, Visually assisted time-domain speech enhancement. In: 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, pp. 1–5 (2019)
J.M. Kates, K.H. Arehart, Coherence and the speech intelligibility index. J. Acoust. Soc. Am. 117, 2224–2237 (2005)
K. Kinoshita, M. Delcroix, A. Ogawa, T.Nakatani, Text-informed speech enhancement with deep neural networks. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)
U. Kjems, J. Jensen, Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement. In: 2012 Proceedings of the 20th European signal processing conference (EUSIPCO). IEEE, pp. 295–299 (2012)
M. Kolbæk, Z.-H. Tan, J. Jensen, Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Trans. Audio, Speech, Lang Process 25, 153–167 (2016)
J. Li, D. Luo, Y. Liu, Y. Zhu, Z. Li, G. Cui, W. Tang, W. Chen, Densely Connected multi-stage model with channel wise subband feature for real-time speech enhancement. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 6638–6642 (2021)
J. Lin, A.J.D.L. Van Wijngaarden, K.C. Wang, M.C. Smith, Speech enhancement using multi-stage self-attentive temporal convolutional networks. IEEE/ACM Trans. Audio, Speech, Lang Process 29, 3440–3450 (2021). https://doi.org/10.1109/TASLP.2021.3125143
T. Lotter, P. Vary, Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model. EURASIP J. Adv. Signal Process 2005, 1–17 (2005)
X. Lu, Y. Tsao, S. Matsuda, C. Hori, Speech enhancement based on deep denoising autoencoder. In: Interspeech 2013, pp. 436–440 (2013)
J. Makhoul, Linear prediction: a tutorial review. Proc IEEE 63, 561–580 (1975)
N. Neverova, C. Wolf, G. Taylor, F. Nebout, ModDrop: adaptive multi-modal gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38, 1692–1706 (2015)
S.R. Quackenbush, T.P. Barnwell, M.A. Clements, Objective measures of speech quality (Prentice-Hall, 1988)
L. Rabiner, B. Juang, An introduction to hidden Markov models. IEEE ASSP Mag 3, 4–16 (1986)
R. Rajavel, P.S. Sathidevi, Static and dynamic features for improved HMM based visual speech recognition. In: Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, pp. 184–194 (2009)
R. Rajavel, P.S. Sathidevi, Adaptive reliability measure and optimum integration weight for decision fusion audio-visual speech recognition. J. Signal Process Syst. 68, 83–93 (2012)
A. Rezayee, S. Gazor, An adaptive KLT approach for speech enhancement. IEEE Trans. Speech Audio Process 9, 87–95 (2001)
A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221). IEEE, pp. 749–752 (2001)
P. Scalart, Speech enhancement based on a priori signal to noise estimation. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. IEEE, pp. 629–632 (1996)
S. Shoba, R. Rajavel, A new Genetic Algorithm based fusion scheme in monaural CASA system to improve the performance of the speech. J. Ambient. Intell. Humaniz Comput. 11, 433–446 (2020)
S. Suhadi, C. Last, T. Fingscheidt, A data-driven approach to a priori SNR estimation. IEEE Trans. Audio Speech Lang Process 19, 186–195 (2010)
L. Sun, J. Du, L.-R. Dai, C.-H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement. In: 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). IEEE, pp. 136–140 (2017)
C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, A short-time objective intelligibility measure for time-frequency weighted noisy speech. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 4214–4217 (2010)
J. Thiemann, N. Ito, E. Vincent, The diverse environments multi-channel acoustic noise database: a database of multichannel environmental noise recordings. J. Acoust. Soc. Am. 133, 3591 (2013). https://doi.org/10.1121/1.4806631
P. Viola, M.J. Jones, Robust real-time face detection. Int. J. Comput. Vis. 57, 137–154 (2004)
D. Wang, G.J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (Wiley-IEEE Press, 2006)
J.-C. Wang, Y.-S. Lee, C.-H. Lin, S.-F. Wang, C.-H. Shih, C.-H. Wu, Compressive sensing-based speech enhancement. IEEE/ACM Trans. Audio, Speech, Lang Process 24, 2122–2131 (2016)
M. Weintraub, A theory and computational model of auditory monaural sound separation (1985)
J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, D. Yu, Time domain audio visual speech separation. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, pp. 667–673 (2019)
B. Xia, C. Bao, Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Commun. 60, 13–29 (2014)
Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio, Speech, Lang Process 23, 7–19 (2014)
C. Yu, K.-H. Hung, S.-S. Wang, Y. Tsao, J. Hung, Time-domain multi-modal bone/air conducted speech enhancement. IEEE Signal Process Lett. 27, 1035–1039 (2020)
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Balasubramanian, S., Rajavel, R. & Kar, A. Estimation of Ideal Binary Mask for Audio-Visual Monaural Speech Enhancement. Circuits Syst Signal Process 42, 5313–5337 (2023). https://doi.org/10.1007/s00034-023-02340-3