1 Introduction

Automatic speaker recognition systems can achieve high accuracy under ideal conditions, i.e., with speech samples free from noise, reverberation, or channel variations. However, real-world applications rarely offer such conditions [2, 6].

Speech is the most natural form of human communication. Voice signals convey not only the linguistic message but also the speaker's identity and accent, together with emotional and acoustic context. In forensic speaker verification, audio recordings collected at the scene of a crime are used to identify the persons involved; identification systems therefore struggle to evaluate material that has been manipulated, forged, or organically altered [3].

The COVID-19 pandemic poses a substantial challenge to biometrics in two respects. First, the virus remains on surfaces for a long time, discouraging users from physically engaging with shared biometric devices, so contact-free biometrics has become critical. Second, the virus is airborne [39], and health officials worldwide therefore recommend the frequent use of face masks to prevent viral transmission [11]. This daily usage impairs current face recognition systems, whether personal (on computers or phones) or public (at hospitals or airports).

Speaker recognition systems have been studied extensively [7, 8] with several benchmark datasets [3], and Deep Learning (DL) approaches to speech biometrics have also been proposed [9]. The performance of speaker identification systems degrades when the training and evaluation conditions are mismatched [35]. In the context of masks, this means identifying mask-wearing speakers with models trained on mask-wearing speech and non-mask-wearing speakers with models trained on non-mask-wearing speech. Automatically detecting from speech features whether a speaker is wearing a mask may therefore enhance voice biometric systems.

The effect of wearing a mask has been examined widely in various situations; it has been shown to alter human-to-human speech perception, although the findings are mixed for non-hearing-impaired listeners [22, 40]. The impact of mask-wearing appears to stem from the loss of the visual information that the brain uses to compensate for degraded auditory information, rather than from a direct effect of the facial covering on the acoustics itself. Some research [27] examines the acoustic consequences of wearing a mask: the affected frequencies lie between 1 and 8 kHz, with the most significant influence between 2 and 4 kHz. These ranges are essential for voice biometrics, specifically 1–3 kHz [29]. By evaluating the audio cues the algorithms use to predict whether a speaker is wearing a mask, we may learn more about the effects of mask-wearing.

In this study, we investigate text-independent speaker identification under fabric and surgical face masks. We present a new speech dataset compiled to support this research. Speaker-specific models are trained with a state-of-the-art i-vector based recognition system on audio captured under various face masks. Normal speech, referred to as "no mask," serves as the natural choice of training utterances. During the test phase, recognition rates are reported for face coverings in both matched and mismatched conditions. We also consider a mask classification scenario in which the type of mask used is determined from a short speech sample, given a closed set of face masks.

2 Speech under face cover

During the COVID-19 epidemic it became apparent that there had been little investigation into how wearing face masks affects the acoustic qualities of speech. When a face mask is worn, the voice production system adapts to the changed conditions [21]. The degree to which speech production is altered depends on the kind of mask, the amount of contact the mask has with the speech production organs, and the extent to which the mask modifies the wearer's perception of their own voice [17, 34]. Wearing a face mask therefore affects the recorded speech signal both actively and passively: it has been demonstrated to influence specific speech articulation processes as well as sound absorption properties. Depending on the kind of mask used, its contact with the face, and the pressure it applies, lip and jaw movements can be severely restricted.

The influence of wearing a surgical mask on automatic speaker recognition has been investigated alongside other forms of voice disguise. The researchers examined the identification scores for each member of a group of target speakers and observed that wearing a surgical mask considerably affected the recognition system's performance [23, 43]. Speech intelligibility has also been explored under three distinct facial coverings: the niqab (a fabric covering worn by Muslim women), the balaclava (a ski mask that exposes only part of the face), and the surgical mask. The authors observed that listeners could correctly recognise the target phrases regardless of the kind of mask used.

The frequency ranges attenuated by a given microphone and by masks in general can be determined by playing speech through a loudspeaker and re-recording it with a microphone separated from the loudspeaker by a face mask [12, 19]. In such experiments, the researchers observed only modest differences in signal attenuation across several mask materials. As reported earlier, the acoustic transmission loss of 44 different woven fabrics has been studied over various audible frequency ranges [28, 37]. According to the data, the transmission loss of a fabric depends strongly on its weight, thickness, density/porosity, and other properties [28]. The absorption of sound energy by different fabrics results in higher energy loss at high frequencies than at low frequencies. In a recent study of wearable microphones, researchers found no difference in transmission qualities between different shirt types, or between shirts and the bare-microphone condition (Fig. 1) [38].

Fig. 1

Masks used in the experiment: a surgical mask, b fabric mask [38]

3 Speaker recognition system

Microsoft Research (MSR) has released an identity toolbox for speaker recognition research [1, 33]. The toolkit implements recognition based on the Gaussian Mixture Model (GMM) and Universal Background Model (UBM) and provides a framework for i-vector analysis. Speaker recognition systems operate in two primary stages, the front end and the back end. The first stage extracts features from each speaker's enrolled speech signals and transforms them into acoustic features. Cepstral features such as MFCCs are the most frequently used in speaker recognition systems [20, 26], because the Mel scale underlying the MFCC maps frequency to perceived pitch in a way that approximates the human auditory system, which is not linear. In the second stage (back end), reference models for the enrolled speakers are generated from the features retrieved by the front end. Both the Gaussian Mixture Model (GMM) and the Gaussian Mixture Model-Universal Background Model (GMM-UBM) are recognised as the basis of ASR systems. In the GMM-UBM framework, the expectation–maximization (EM) algorithm is used to estimate the GMM parameters. During enrolment, speaker models are obtained through maximum a posteriori (MAP) adaptation [32]. Scoring and decisions are based on log-likelihood thresholding. The universal background model (UBM) is trained as a speaker-independent model by pooling speech samples from a large set of speakers. The block diagram of the GMM-UBM framework is shown in Fig. 2.

Fig. 2

Block diagram of the GMM-UBM framework
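To make this pipeline concrete, the following is a minimal Python sketch of UBM training, mean-only MAP adaptation, and log-likelihood ratio scoring using scikit-learn's GaussianMixture. The function names, mixture size, and relevance factor are illustrative assumptions; the sketch does not reproduce the MSR toolkit's API.

```python
# Minimal GMM-UBM sketch: UBM training, mean-only MAP adaptation, and
# average log-likelihood ratio scoring. Feature matrices are (n_frames, n_dims)
# acoustic vectors (e.g. MFCCs or GFCCs); names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64):
    """Train the speaker-independent UBM on pooled background speech."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=200)
    ubm.fit(np.vstack(background_features))
    return ubm

def map_adapt_means(ubm, speaker_features, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to one enrolled speaker."""
    post = ubm.predict_proba(speaker_features)            # (T, C) responsibilities
    n_c = post.sum(axis=0)                                 # soft counts per component
    f_c = post.T @ speaker_features                        # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]             # adaptation coefficients
    return alpha * (f_c / np.maximum(n_c[:, None], 1e-10)) \
        + (1.0 - alpha) * ubm.means_

def llr_score(ubm, adapted_means, test_features):
    """Average log-likelihood ratio of the speaker model versus the UBM."""
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    spk.weights_, spk.covariances_ = ubm.weights_, ubm.covariances_
    spk.means_ = adapted_means
    spk.precisions_cholesky_ = 1.0 / np.sqrt(ubm.covariances_)
    return spk.score(test_features) - ubm.score(test_features)
```

In this sketch only the component means are adapted, which is the common choice for GMM-UBM enrolment; weights and covariances are shared with the UBM so that the log-likelihood ratio isolates speaker-specific information.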

4 Background

4.1 Baseline ASV: i-vector

Several speech classification tasks, including speaker, language, and dialect recognition, speaker diarization, speech recognition, and clustering, rely heavily on the i-vector technique. The i-vector concisely represents what transpires during a specific speech recording. The classical i-vector approach is built on the Gaussian Mixture Model (GMM) and uses subspace modeling: the i-vector represents the GMM supervector within a total variability subspace. The finding that the JFA channel space contains information that can distinguish speakers [14] provided the impetus for this single-subspace method. The i-vector model of the speaker-adapted GMM can be represented by (1)

$$\mu = m + {\rm T}w$$
(1)

where m denotes the mean supervector taken from the Universal Background Model (UBM), T is a rectangular low-rank matrix, and w is a normally distributed random vector, the so-called i-vector [15]. Scoring in the i-vector framework consists of measuring how well the target and test speakers' factors align. Several post-processing steps have been proposed to this end, such as linear discriminant analysis (LDA) and Gaussian probabilistic linear discriminant analysis (PLDA), which maximise between-class variation and reduce within-class variance. Scores are then obtained by log-likelihood thresholding based on the PLDA hyperparameters. The total variability factors w are independent random variables, and i-vectors are assumed to follow a standard normal distribution N(0, I). Extracting the i-vector is essentially a maximum a posteriori (MAP) estimation of w in the subspace spanned by T, the total variability subspace being the subspace from which i-vectors are extracted. Because the total variability space denoted by T encompasses both speaker and channel variability, i-vector techniques need additional intersession compensation before scoring to mitigate the effects of channel variability. The block diagram of the i-vector framework is shown in Fig. 3.

Fig. 3

Block diagram of the i-vector framework
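As an illustrative sketch of this back end, the snippet below applies length normalisation and LDA to already-extracted i-vectors and scores trials with cosine similarity, a simplified stand-in for full PLDA scoring. i-vector extraction itself (estimating T and w) is assumed to be performed by a toolkit such as the MSR Identity Toolbox, and the dimensionalities are assumptions.

```python
# i-vector back-end sketch: length normalisation, LDA to maximise
# between-speaker and reduce within-speaker variance, then cosine scoring
# (a simplified stand-in for PLDA). Rows of the i-vector matrices are i-vectors.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def length_normalise(ivectors):
    """Project i-vectors onto the unit sphere."""
    return ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)

def train_lda(dev_ivectors, dev_speaker_labels, out_dim=150):
    """Fit LDA on a labelled development set; out_dim must be <= n_speakers - 1."""
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    lda.fit(length_normalise(dev_ivectors), dev_speaker_labels)
    return lda

def cosine_score(lda, enrol_ivector, test_ivector):
    """Cosine similarity between projected enrolment and test i-vectors."""
    e = lda.transform(length_normalise(enrol_ivector[None, :]))[0]
    t = lda.transform(length_normalise(test_ivector[None, :]))[0]
    return float(e @ t / (np.linalg.norm(e) * np.linalg.norm(t)))
```

Cosine scoring is used here only to keep the sketch short; the experiments reported later rely on PLDA-based log-likelihood scoring as described above.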

4.2 Noise problems in speech signals

Environmental sounds are a rich source of acoustic data for ambient noise and soundscape monitoring, but they also contain a great deal of background noise, which makes extracting useful information difficult. Ambient noise recordings are typically regarded as underused quasi-datasets, and for many people such noise is a source of concern [24]. It is difficult to describe these noises with standard audio characteristics and to design relevant ambient-noise features for automatic acoustic classification systems [5, 8]. A range of ambient sound sources generates various types of noise; these harmful environmental contributions are referred to as "noise sources." In environmental and soundscape monitoring, acoustic sensors are deployed outdoors, where they are exposed to changing and severe weather conditions and microphone wind noise becomes a major concern. Signals of interest are therefore often contaminated by environmental and system-introduced noise, such as channel interference [4, 8].

4.3 Feature extraction

Mel Frequency Cepstral Coefficients (MFCCs), computed from log energies distributed on a Mel scale, are the most extensively used cepstral features. They are widely adopted because of their simplicity and strong performance in automatic speaker identification under clean, matched conditions [6, 13, 25]. However, narrow-band localised random disturbances to which human perception is insensitive still affect MFCCs [8, 36], and noise and reverberation degrade MFCC performance rapidly. Because MFCCs are not robust under noisy, reverberant, or mismatched conditions, numerous researchers have investigated robust variants. The Gammatone feature (GF) is generated from a bank of Gammatone filters, proposed to replicate human cochlear filtering, whose bandwidths are proportional to the auditory ERB function defined in [2, 18]. The GFCC is derived from the GF. Because it models the human auditory system, the GFCC has demonstrated good resistance to acoustic changes [1, 31]. Research shows that the GFCC's Gammatone filterbank is more noise-resistant than the MFCC's Mel filterbank; therefore, the GFCC feature was used in this experiment. The HTK toolkit is often employed to derive GFCCs [42]. The main steps of GFCC extraction are shown below (Fig. 4):

Fig. 4

Block Diagram of GFCC extraction [3]
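The following Python sketch illustrates these steps: an ERB-spaced gammatone filterbank, frame-level subband energies, cubic-root loudness compression, and a DCT. The 64-channel filterbank, 19 coefficients, and 30 ms / 15 ms framing follow the setup reported later; the ERB constants, filter length, and helper names are assumptions rather than the exact configuration of the HTK-based extraction.

```python
# GFCC extraction sketch: ERB-spaced gammatone filterbank, framed subband
# energies, cubic-root compression, DCT. Parameters are illustrative.
import numpy as np
from scipy.signal import fftconvolve
from scipy.fftpack import dct

def erb_space(low_hz, high_hz, n_channels):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    ear_q, min_bw = 9.26449, 24.7
    lo = np.log(low_hz + ear_q * min_bw)
    hi = np.log(high_hz + ear_q * min_bw)
    return np.exp(np.linspace(hi, lo, n_channels)) - ear_q * min_bw

def gammatone_ir(fc, fs, duration=0.04, order=4):
    """Impulse response of a 4th-order gammatone filter centred at fc."""
    t = np.arange(0, duration, 1.0 / fs)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gfcc(signal, fs=16000, n_channels=64, n_ceps=19,
         frame_len=0.030, frame_shift=0.015):
    """Compute GFCCs for a mono signal; returns (n_frames, n_ceps)."""
    centres = erb_space(50.0, fs / 2.0, n_channels)
    subbands = np.stack([fftconvolve(signal, gammatone_ir(fc, fs), mode="same")
                         for fc in centres])                # (channels, samples)
    win, hop = int(frame_len * fs), int(frame_shift * fs)
    n_frames = 1 + (subbands.shape[1] - win) // hop
    feats = np.empty((n_frames, n_channels))
    for i in range(n_frames):
        seg = subbands[:, i * hop:i * hop + win]
        feats[i] = np.cbrt(np.mean(seg ** 2, axis=1))       # loudness compression
    return dct(feats, type=2, axis=1, norm="ortho")[:, :n_ceps]
```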

5 Experimental setup

In the past two decades, there has been a significant increase in interest in text-independent speaker verification [4]. The i-vector approach is considered the state-of-the-art method for text-independent speaker recognition. The architecture of our i-vector-based identification system is displayed in Fig. 3. As is usual in real-world applications, we employ a commercially available, state-of-the-art recognition system whose settings cannot be tailored to the test environment because of a shortage of data for the i-vector recognition system. The training phase involves around 20 s of active speech on average, whereas the test phase uses non-overlapping chunks of about 2.5 s of active speech. Test segments are taken from speech delivered under various masks.

5.1 Dataset

Regarding the construction of the dataset and the training and testing speech material, several existing datasets are related to speaker verification, such as the Face Mask dataset (FMD) [41]. For this study, short speech samples were recorded by 30 participants, 15 male and 15 female, using a Zoom H6 recorder. Recordings were made without a face covering, with a fabric cover, and with a surgical mask, in the School of Science, Engineering, and Environment at Salford University, United Kingdom. A specific noise-reducing wall element installed in the studio helps to decrease background noise; as a result, the recording quality is high and reverberation is minimal. The room is approximately five by five metres, with two windows and one door. The data was initially captured in 44.1 kHz WAV format, and the sampling frequency was reduced to 16 kHz for the speaker identification experiments in this study. The microphone was placed close to the speaker's lips to ensure a clear sound. We apply simple energy-based voice activity detection to remove the large chunks of silence in each excerpt. Each participant was asked to provide two speech samples of different lengths (approximately 30 s for the first sample and around 20 s for the second). Unless otherwise stated, the speakers speak in English and read a random selection of general material from easily readable sources (newspapers, books, pamphlets, articles, and similar material may be used to produce text/language-independent samples).
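As an illustration of the pre-processing described above, the sketch below shows the 44.1 kHz to 16 kHz downsampling step and a simple energy-based voice activity detector. The frame sizes and the energy threshold are illustrative assumptions, not the exact values used in our processing.

```python
# Downsampling and simple energy-based VAD sketch (illustrative parameters).
import numpy as np
from scipy.signal import resample_poly

def downsample_to_16k(signal, fs_in=44100, fs_out=16000):
    # 44100 * 160 / 441 = 16000, so polyphase resampling with these
    # integer factors performs the conversion exactly.
    return resample_poly(signal, up=160, down=441)

def energy_vad(signal, fs=16000, frame_ms=30, hop_ms=15, threshold_db=-35.0):
    """Keep frames whose energy is above `threshold_db` relative to the peak frame."""
    win, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]
    energy_db = 10 * np.log10([np.mean(f ** 2) + 1e-12 for f in frames])
    keep = energy_db > (energy_db.max() + threshold_db)
    return (np.concatenate([f for f, k in zip(frames, keep) if k])
            if keep.any() else signal)
```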

5.2 Noisy mixing

For this investigation, the recorded samples described in the preceding section were mixed with various noise levels to create noisy utterances at a range of SNR values. The noisy speech samples were generated by mixing cafeteria noise with the clean speech to produce a range of signal-to-noise ratios (SNRs from 20 dB down to -5 dB). MATLAB code was used to create an audio mixer that combines speech and noise samples based on signal strength. In the mixing procedure, the noise files were truncated to the same length as the target speech utterance, ensuring that noise mixing occurred over the entire speech signal. The mixing process handles normalisation so that appropriate amounts of speech and noise are added: the mixed signals are normalised to, and compared at, the same observed (RMS) level. The utterance samples obtained from the 30 speakers (15 male and 15 female) are then mixed with cafeteria noise at five different SNRs ranging from -5 dB to 20 dB, and the noisy utterances are used to validate the proposed speaker recognition technique. The cafeteria noise employed in this study was captured at a sampling rate of 44.1 kHz.
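The following Python sketch (the study itself used MATLAB) reproduces the described mixing procedure: the noise is truncated to the speech length, both signals are equalised in RMS, and the noise gain is chosen to reach the requested SNR; the peak normalisation at the end is an assumption added to avoid clipping.

```python
# Sketch of mixing clean speech with cafeteria noise at a target SNR (dB).
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise at the requested SNR."""
    noise = noise[:len(speech)]                    # truncate to speech length
    noise = noise * (rms(speech) / rms(noise))     # equalise RMS levels
    gain = 10.0 ** (-snr_db / 20.0)                # noise gain for target SNR
    noisy = speech + gain * noise
    return noisy / np.max(np.abs(noisy))           # avoid clipping

# Example usage, e.g. for the conditions shown in the DET plots:
# for snr_db in (20, 10, 0, -5):
#     noisy = mix_at_snr(clean_speech, cafeteria_noise, snr_db)
```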

5.3 Evaluation methods

The verification scores for the trials are calculated as the log-likelihood ratio between the speaker models and the test observations. To grade ASR systems, the National Institute of Standards and Technology (NIST) developed a set of standard performance measures [16]. False positives (false alarms) and false negatives (misses) are the most common errors in hypothesis tests. A false positive occurs when the system wrongly accepts an impostor as the target during impostor trials; a false negative occurs when, during target trials, the system rejects the target as an impostor. A plot of the two error rates on a normal deviate scale, known to NIST as the Detection Error Trade-off (DET) curve, is a useful way of showing system performance [10]; the curve is a straight line when the underlying score distributions are normal. The Equal Error Rate (EER) is the operating point on the curve where the false rejection rate (FRR) and the false acceptance rate (FAR) are equal. The system's actual decisions are evaluated with a decision cost function (DCF).
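As an illustration, the sketch below computes the EER from lists of target and impostor trial scores by sweeping the decision threshold until the FAR and FRR curves cross; it is a simple approximation, not the official NIST scoring tools.

```python
# EER sketch: sweep thresholds over the pooled scores and find the
# operating point where the false acceptance and false rejection rates meet.
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate (in %) from target and impostor trial scores."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(scores)                  # ascending thresholds
    labels = labels[order]
    # At threshold index i, trials 0..i are rejected and the rest accepted.
    frr = np.cumsum(labels) / labels.sum()                   # missed targets
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # accepted impostors
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2.0
```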

6 Experiments results

The findings of the open-set speaker verification experiments are reported in Table 1. Under matched conditions, the speaker verification error rate for the "no cover" situation is somewhat lower than for the other conditions. The experiment extracts 64-dimensional GFCCs from the speech signal and stores them on disk in HTK format. The speech signal is divided into 30 ms frames with a 15 ms overlap; 19 GFCCs are then extracted and augmented with frame energies, yielding 60-dimensional feature vectors. After removing unvoiced speech, we apply feature warping [30]. Template i-vectors were created for speech under the various face masks and without a face mask in order to examine the behaviour of the verification system when comparing i-vectors derived from speech under different face masks. The experiment used clean utterances of 5 s length in the training stage versus speech signals uttered under the two face masks and corrupted with various signal-to-noise ratios of cafeteria noise in the testing phase. In general, a high verification rate is observed when the enrolment and testing utterances come from the same group, while recognition performance degrades when the test segment contains speech under a face mask.
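For completeness, the following sketch illustrates the feature warping step [30]: within a sliding window, each feature dimension is mapped through its rank onto a standard normal distribution. The 301-frame (approximately 3 s) window is a common choice and an assumption here, not a value stated above.

```python
# Short-time feature warping sketch: rank-based mapping of each feature
# dimension onto N(0, 1) within a sliding window.
import numpy as np
from scipy.stats import norm

def feature_warp(features, window=301):
    """Warp each column of `features` (n_frames, n_dims) to N(0, 1)."""
    n, half = len(features), window // 2
    warped = np.empty_like(features, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        block = features[lo:hi]
        rank = (block < features[i]).sum(axis=0) + 0.5   # rank of centre frame
        warped[i] = norm.ppf(rank / len(block))          # inverse normal CDF
    return warped
```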

Table 1 Baseline performance using clean speech for training and various SNRs in testing

An EER of 2.0% was observed across the comparisons at 20 dB SNR, indicating near-perfect discrimination of speakers. Speaker recognition was therefore significantly more accurate at high SNRs (EER of 2.0%) than at low SNRs (EER of 17.2%, as shown in Table 1). Same-speaker scores were lower under the fabric covering condition, but not to a sufficient extent to introduce errors, whereas the same-speaker scores with a surgical mask and with no face covering were similar. Nevertheless, speaker verification accuracy was degraded when masks were used. As shown in Tables 2 and 3, masks had little to no effect at low noise levels (high SNR), with an EER of no more than 2.4% for speech under a fabric mask compared with the no-mask condition. The surgical mask condition had an EER of 2.50%, as did the hoodie/scarf combination, the closest analogue to the fabric coverings, although again the surgical mask same-speaker scores were the highest. At high noise levels (low SNR), the EER rose from 17.2% with no mask to 19.7% with a fabric mask, higher than in the no-mask condition, but there was no statistically significant difference between the surgical and fabric masks. Similarly, masks had little effect at low noise levels, with an EER of no more than 2.6% for speech under a surgical mask compared with the no-mask condition, as shown in Table 3; at low SNR the EER increased from 17.2% with no mask to 20.0% with a surgical mask. The results show that the two types of masks provide equivalent accuracy at low background noise levels; overall, the effect of masks on speaker recognition performance varied with the SNR (Figs. 5, 6, 7 and 8).

Table 2 System performance with different SNRs using a fabric face mask
Table 3 System performance with different SNRs using a surgical mask
Fig. 5

DET plot with SNR = 20 dB

Fig. 6

DET plot with SNR = 10 dB

Fig. 7

DET plot with SNR = 0 dB

Fig. 8

DET plot with SNR = -5 dB

7 Concluding remarks

We have reported the findings of a preliminary study investigating the effect of wearing masks on the performance of a state-of-the-art text-independent speaker verification scheme. These experiments aim to determine whether a face mask has a passive or active effect on recorded speech. The research relies on a small speech corpus, built specifically for the study, of 30 volunteer speakers and two kinds of face masks (fabric and surgical). This paper presents preliminary results from trials comparing speech under face cover to normal speech in the context of speaker verification and mask categorisation tasks. The performance of the i-vector-based speaker verification system with GFCC features diminishes when used with mismatched masks; however, the relatively minor deterioration implies that state-of-the-art verification algorithms can at least partly compensate for the mask-induced mismatch. The results show that masks had little to no effect in low background noise, with an EER of 2.4–2.5 percent at 20 dB SNR for both masks, compared to the no-mask condition at the same level. In noisy conditions, at a 5 dB SNR, the EER was 12.7–13.0%, higher than without a mask, indicating that the two masks perform similarly at both low and high levels of background noise, while the degradation in system accuracy becomes more noticeable at high noise levels compared with the no-mask condition at the same level. To better understand the influence of wearing a face cover on the speech signal, future studies should investigate the acoustic alterations related to individual sections of speech. An in-depth investigation into classifying face masks efficiently is ongoing.