Abstract
This paper estimates the Ideal Binary Mask (IBM) from the speech cochleagram and visual cues to improve speech intelligibility and quality using an Audio-Visual Convolutional Neural Network (AVCNN). Many earlier speech enhancement techniques depended heavily on audio attributes alone to reduce the noise present in the speech signal. Several recent studies have shown that speech enhancement using visual data as an auxiliary input alongside audio data is more effective in reducing acoustic noise in speech signals. In the proposed work, a multichannel CNN extracts the dynamics of both visual and audio signal features, which are then integrated to estimate, using the proposed algorithm, the threshold that yields the IBM for speech enhancement. The performance of the proposed model is evaluated primarily for speech intelligibility in terms of STOI, ESTOI, and CSII; speech quality is additionally measured in terms of PESQ, SSNR, CSIG, CBAK, and COVL. The evaluation results reveal that the proposed audio-visual mask estimation model outperforms the audio-only, visual-only, and existing audio-visual mask estimation models. The proposed AVCNN model thus demonstrates its effectiveness in merging the dynamics of audio information with visual speech information for speech enhancement.
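To illustrate the ideal binary mask referred to in the abstract (not the paper's AVCNN-based threshold estimation itself), the sketch below computes the textbook IBM: a time-frequency unit is retained when its local SNR exceeds a local criterion (LC) threshold. The function name, the 0 dB default threshold, and the toy magnitudes are illustrative assumptions.

```python
import numpy as np

def ideal_binary_mask(speech_tf, noise_tf, lc_db=0.0):
    """Textbook ideal binary mask from clean-speech and noise T-F magnitudes.

    A T-F unit gets mask value 1 when its local SNR (in dB) exceeds the
    local criterion lc_db, and 0 otherwise.
    """
    eps = 1e-12  # guard against log of zero
    snr_db = 20.0 * np.log10((speech_tf + eps) / (noise_tf + eps))
    return (snr_db > lc_db).astype(np.float32)

# Toy example: 2 frequency channels x 3 time frames of magnitudes
speech = np.array([[1.0, 0.1, 0.5],
                   [0.2, 0.8, 0.05]])
noise = np.array([[0.1, 0.5, 0.5],
                  [0.2, 0.1, 0.5]])
mask = ideal_binary_mask(speech, noise)
```

Applying the mask elementwise to the noisy cochleagram and resynthesizing recovers the enhanced speech; in the paper, the AVCNN's role is to estimate this mask when the clean speech and noise are not separately available.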
Data Availability
The present work utilizes two datasets: the audio-visual 'GRID Corpus' developed by Cooke et al. in 2006, and the 'Real-world Noise dataset' developed by Thiemann et al. in 2013. The two works are cited in the manuscript as [15] and [54], respectively.
References
A.H. Abdelaziz, Comparing fusion models for DNN-based audiovisual continuous speech recognition. IEEE/ACM Trans. Audio, Speech, Lang Process 26, 475–484 (2017)
A. Adeel, M. Gogate, A. Hussain, Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Inf. Fusion 59, 163–170 (2020)
A. Adeel, M. Gogate, A. Hussain, W.M. Whitmer, Lip-Reading driven deep learning approach for speech enhancement. IEEE Trans. Emerg. Top Comput. Intell. 5, 481–490 (2021). https://doi.org/10.1109/TETCI.2019.2917039
I. Almajai, B. Milner, Visually derived wiener filters for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 19, 1642–1651 (2010)
A. Arriandiaga, G. Morrone, L. Pasa, L. Badino, and C. Bartolozzi, Audio-visual target speaker enhancement on multi-talker environment using event-driven cameras. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, pp. 1–5 (2021)
B. Atal, M. Schroeder, Predictive coding of speech signals and subjective error criteria. IEEE Trans. Acoust. 27, 247–254 (1979)
M. Berouti, R. Schwartz, and J. Makhoul, Enhancement of speech corrupted by acoustic noise. In: ICASSP’79. IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, pp. 208–211 (1979)
S. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust 27, 113–120 (1979)
E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J ACM 58, 1–37 (2011)
J. Chen, J. Benesty, Y. Huang, E.J. Diethorn, Fundamentals of noise reduction, in Springer Handbook of Speech Processing (Springer, 2008), pp. 843–872
J. Chen, Y. Wang, D. Wang, A feature study for classification-based speech separation at low signal-to-Noise ratios. IEEE/ACM Trans. Audio, Speech, Lang Process 22, 1993–2002 (2014). https://doi.org/10.1109/TASLP.2014.2359159
Z. Chen, S. Watanabe, H. Erdogan, and J.R. Hershey, Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. In: Sixteenth Annual Conference of the International Speech Communication Association, (2015)
Y.-H. Chin, J.-C. Wang, C.-L. Huang, K.-Y. Wang, C.-H. Wu, Speaker identification using discriminative features and sparse representation. IEEE Trans. Inf. Forensics Secur. 12, 1979–1987 (2017)
H.S. Choi, S. Park, J.H. Lee, H. Heo, D. Jeon, K. Lee, Real-time denoising and dereverberation with tiny recurrent U-Net. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5789–5793 (2021)
M. Cooke, J. Barker, S. Cunningham, X. Shao, An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120, 2421–2424 (2006)
T. Darrell, J.W. Fisher Iii, and P. Viola, Audio-visual segmentation and “the cocktail party effect.” In: International Conference on Multimodal Interfaces. Springer, pp. 32–40 (2000)
J. Eggert, and E. Korner, Sparse coding and NMF. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541). IEEE, pp. 2529–2533 (2004)
A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K.W. Wilson, A. Hassidim, W.T. Freeman, M. Rubinstein, Looking to listen at the cocktail party. ACM Trans. Graph 37, 1–11 (2018)
R. Frazier, S. Samsam, L. Braida, and A. Oppenheim, Enhancement of speech by adaptive filtering. In: ICASSP’76. IEEE International Conference on Acoustics, Speech, and Signal Processing. Citeseer, pp. 251–253 (1976)
A. Gabbay, A. Shamir, S. Peleg, Visual speech enhancement. In: Interspeech 2018. ISCA, pp. 1170–1174 (2018)
L. Girin, J.-L. Schwartz, G. Feng, Audio-visual enhancement of speech in noise. J. Acoust. Soc. Am. 109, 3007–3020 (2001)
M. Gogate, A. Adeel, R. Marxer, J. Barker, A. Hussain, DNN driven speaker independent audio-visual mask estimation for speech separation. In: Proc. Interspeech 2018, pp. 2723–2727 (2018). https://doi.org/10.21437/Interspeech.2018-2516
S. Graetzer, C. Hopkins, Intelligibility prediction for speech mixed with white Gaussian noise at low signal-to-noise ratios. J. Acoust. Soc. Am. 149, 1346–1362 (2021)
R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, D. Yu, Multi-modal multi-channel target speech separation. IEEE J. Sel. Top Signal Process 14, 530–541 (2020)
Y. Guo, W. Yu, J. Zhou, ZTrans: A new transformer for speech enhancement. In: 2021 4th International Conference on Information Communication and Signal Processing (ICICSP). IEEE, pp. 178–182 (2021)
X. Hao, X. Su, R. Horaud, X. Li, Fullsubnet: A full-band and sub-band fusion model for real-time single-channel speech enhancement. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 6633–6637 (2021)
E.W. Healy, M. Delfarah, J.L. Vasko, B.L. Carter, D. Wang, An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker. J. Acoust. Soc. Am. 141, 4230–4239 (2017)
J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, H.-M. Wang, Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Trans. Emerg. Top Comput. Intell. 2, 117–128 (2018)
Y. Hu, P.C. Loizou, A subspace approach for enhancing speech corrupted by colored noise. In: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, pp. I–573 (2002)
Y. Hu, P.C. Loizou, Evaluation of objective measures for speech enhancement. In: Ninth International Conference on Spoken Language Processing (2006)
P.-S. Huang, S.D. Chen, P. Smaragdis, M. Hasegawa-Johnson, Singing-voice separation from monaural recordings using robust principal component analysis. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 57–60 (2012)
E. Ideli, B. Sharpe, I.V. Bajić, R.G. Vaughan, Visually assisted time-domain speech enhancement. In: 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, pp. 1–5 (2019)
J.M. Kates, K.H. Arehart, Coherence and the speech intelligibility index. J. Acoust. Soc. Am. 117, 2224–2237 (2005)
K. Kinoshita, M. Delcroix, A. Ogawa, T.Nakatani, Text-informed speech enhancement with deep neural networks. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)
U. Kjems, J. Jensen, Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement. In: 2012 Proceedings of the 20th European signal processing conference (EUSIPCO). IEEE, pp. 295–299 (2012)
M. Kolbæk, Z.-H. Tan, J. Jensen, Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Trans. Audio, Speech, Lang Process 25, 153–167 (2016)
J. Li, D. Luo, Y. Liu, Y. Zhu, Z. Li, G. Cui, W. Tang, W. Chen, Densely Connected multi-stage model with channel wise subband feature for real-time speech enhancement. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 6638–6642 (2021)
J. Lin, A.J.D.L. Van Wijngaarden, K.C. Wang, M.C. Smith, Speech enhancement using multi-stage self-attentive temporal convolutional networks. IEEE/ACM Trans. Audio, Speech, Lang Process 29, 3440–3450 (2021). https://doi.org/10.1109/TASLP.2021.3125143
T. Lotter, P. Vary, Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model. EURASIP J. Adv. Signal Process 2005, 1–17 (2005)
X. Lu, Y. Tsao, S. Matsuda, C. Hori, Speech enhancement based on deep denoising autoencoder. In: Interspeech 2013, pp. 436–440 (2013)
J. Makhoul, Linear prediction: a tutorial review. Proc IEEE 63, 561–580 (1975)
N. Neverova, C. Wolf, G. Taylor, F. Nebout, ModDrop: adaptive multi-modal gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38, 1692–1706 (2015)
S.R. Quackenbush, T.P. Barnwell, M.A. Clements, Objective measures of speech quality (Prentice-Hall, 1988)
L. Rabiner, B. Juang, An introduction to hidden Markov models. IEEE ASSP Mag 3, 4–16 (1986)
R. Rajavel, P.S. Sathidevi, Static and dynamic features for improved HMM based visual speech recognition. In: Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, pp. 184–194 (2009)
R. Rajavel, P.S. Sathidevi, Adaptive reliability measure and optimum integration weight for decision fusion audio-visual speech recognition. J. Signal Process Syst. 68, 83–93 (2012)
A. Rezayee, S. Gazor, An adaptive KLT approach for speech enhancement. IEEE Trans. Speech Audio Process 9, 87–95 (2001)
A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221). IEEE, pp. 749–752 (2001)
P. Scalart, Speech enhancement based on a priori signal to noise estimation. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. IEEE, pp. 629–632 (1996)
S. Shoba, R. Rajavel, A new Genetic Algorithm based fusion scheme in monaural CASA system to improve the performance of the speech. J. Ambient. Intell. Humaniz Comput. 11, 433–446 (2020)
S. Suhadi, C. Last, T. Fingscheidt, A data-driven approach to a priori SNR estimation. IEEE Trans. Audio Speech Lang Process 19, 186–195 (2010)
L. Sun, J. Du, L.-R. Dai, C.-H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement. In: 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). IEEE, pp. 136–140 (2017)
C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, A short-time objective intelligibility measure for time-frequency weighted noisy speech. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 4214–4217 (2010)
J. Thiemann, N. Ito, E. Vincent, The diverse environments multi-channel acoustic noise database: a database of multichannel environmental noise recordings. J. Acoust. Soc. Am. 133, 3591 (2013). https://doi.org/10.1121/1.4806631
P. Viola, M.J. Jones, Robust real-time face detection. Int. J. Comput. Vis. 57, 137–154 (2004)
D. Wang, G.J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (Wiley-IEEE Press, 2006)
J.-C. Wang, Y.-S. Lee, C.-H. Lin, S.-F. Wang, C.-H. Shih, C.-H. Wu, Compressive sensing-based speech enhancement. IEEE/ACM Trans. Audio, Speech, Lang Process 24, 2122–2131 (2016)
M. Weintraub, A theory and computational model of auditory monaural sound separation (1985)
J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, D. Yu, Time domain audio visual speech separation. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, pp. 667–673 (2019)
B. Xia, C. Bao, Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Commun. 60, 13–29 (2014)
Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio, Speech, Lang Process 23, 7–19 (2014)
C. Yu, K.-H. Hung, S.-S. Wang, Y. Tsao, J. Hung, Time-domain multi-modal bone/air conducted speech enhancement. IEEE Signal Process Lett. 27, 1035–1039 (2020)
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Balasubramanian, S., Rajavel, R. & Kar, A. Estimation of Ideal Binary Mask for Audio-Visual Monaural Speech Enhancement. Circuits Syst Signal Process 42, 5313–5337 (2023). https://doi.org/10.1007/s00034-023-02340-3