1 Introduction

Automatic speaker recognition systems can achieve high accuracy under ideal conditions, i.e., with speech samples free from noise, reverberation, or channel variations. However, real-world applications rarely offer such conditions [2, 6].

Speech is the most natural form of human communication. Voice signals convey not only the linguistic message but also the speaker's identity and accent, together with emotional and acoustic context. In forensic speaker verification, audio recordings collected at the scene of a crime are used to identify the persons involved; identification systems therefore struggle to evaluate material that has been manipulated, forged, or organically altered [3].

The COVID-19 pandemic poses a substantial challenge to biometrics in two respects. First, the virus remains on surfaces for a long time, discouraging users from physically engaging with shared biometric devices, so contact-free biometrics has become critical. Second, the virus is airborne [39], and health officials worldwide therefore recommend the frequent use of face masks to prevent viral transmission [11]. This daily usage impairs current face recognition systems, whether personal (on computers or phones) or public (at hospitals or airports).

Speaker recognition systems have been studied extensively [7, 8] with several benchmark datasets [3], and Deep Learning (DL) approaches to speech biometrics have also been proposed [9]. The performance of speaker identification systems degrades when the training and evaluation conditions are mismatched [35]. In the context of masks, this means identifying mask-wearing speakers with models trained on mask-wearing speech and non-mask-wearing speakers with models trained on non-mask-wearing speech. Automatically detecting from speech features whether a speaker is wearing a mask may therefore enhance voice biometric systems.

The effect of wearing a mask has been examined widely in various situations; it has been shown to alter human-to-human speech perception, although the findings are mixed for non-hearing-impaired listeners [22, 40]. The impact of mask-wearing appears to stem from the loss of the visual information that the brain uses to compensate for degraded auditory information, rather than from a direct effect of the facial covering on the acoustics itself. Some research [27] examines the acoustic consequences of wearing a mask: the affected frequencies lie between 1 and 8 kHz, with the most significant influence between 2 and 4 kHz. These ranges are essential for voice biometrics, specifically 1–3 kHz [29]. By evaluating the audio cues the algorithms use to predict whether a speaker is wearing a mask, we may learn more about the effects of mask-wearing.

In this study, we investigate text-independent speaker identification under fabric and surgical face masks. We present a new speech dataset compiled to support this research. Speaker-specific models are trained with a state-of-the-art i-vector based recognition system on audio captured under various face masks. Normal speech, referred to as "no mask," serves as the natural choice of training utterances. During the test phase, recognition rates are reported for face coverings in both matched and mismatched conditions. We also consider a mask classification scenario in which the type of mask used is determined from a short speech sample, given a closed set of face masks.

2 Speech under face cover

During the COVID-19 epidemic it became apparent that there had been little investigation into how wearing face masks affects the acoustic qualities of speech. When a face mask is worn, the voice production system adapts to the changed conditions [21]. The degree to which speech production is altered depends on the kind of mask, the amount of contact the mask has with the speech production organs, and the extent to which the mask modifies the wearer's perception of their own voice [17, 34]. Wearing a face mask therefore affects the recorded speech signal both actively and passively: it has been demonstrated to influence specific speech articulation processes as well as sound absorption properties. Depending on the kind of mask used, its contact with the face, and the pressure it applies, lip and jaw movements can be severely restricted.

The influence of wearing a surgical mask on automatic speaker recognition has been investigated alongside other forms of voice disguise. The researchers examined the identification scores for each member of a group of target speakers and observed that wearing a surgical mask considerably affected the recognition system's performance [23, 43]. Speech intelligibility has also been explored under three distinct facial coverings: the niqab (a fabric covering worn by Muslim women), the balaclava (a ski mask that exposes only part of the face), and the surgical mask. The authors observed that listeners could correctly recognise the target phrases regardless of the kind of mask used.

The frequency ranges attenuated by a given microphone and by masks in general can be determined by playing speech through a loudspeaker and re-recording it with a microphone separated from the loudspeaker by a face mask [12, 19]. In such experiments, the researchers observed only modest differences in signal attenuation across several mask materials. As reported earlier, the acoustic transmission loss of 44 different woven fabrics has been studied over various audible frequency ranges [28, 37]. According to the data, the transmission loss of a fabric depends strongly on its weight, thickness, density/porosity, and other properties [28]. The absorption of sound energy by different fabrics results in higher energy loss at high frequencies than at low frequencies. In a recent study of wearable microphones, researchers found no difference in transmission qualities between different shirt types, or between shirts and the bare-microphone condition (Fig. 1) [38].

Fig. 1

Masks used in the experiment: a surgical mask, b fabric mask [38]

3 Speaker recognition system

Microsoft Research (MSR) has released an identity toolbox for speaker recognition research [1, 33]. The toolkit implements recognition based on the Gaussian Mixture Model (GMM) and Universal Background Model (UBM) and provides a framework for i-vector analysis. Speaker recognition systems operate in two primary stages, the front end and the back end. The first stage extracts features from each speaker's enrolled speech signals and transforms them into acoustic features. Cepstral features such as MFCCs are the most frequently used in speaker recognition systems [20, 26], because the Mel scale underlying the MFCC maps frequency to perceived pitch in a way that approximates the human auditory system, which is not linear. In the second stage (back end), reference models for the enrolled speakers are generated from the features retrieved by the front end. Both the Gaussian Mixture Model (GMM) and the Gaussian Mixture Model-Universal Background Model (GMM-UBM) are recognised as the basis of ASR systems. In the GMM-UBM framework, the expectation–maximization (EM) algorithm is used to estimate the GMM parameters. During enrolment, speaker models are obtained through maximum a posteriori (MAP) adaptation [32]. Scoring and decisions are based on log-likelihood thresholding. The universal background model (UBM) is trained as a speaker-independent model by pooling speech samples from a large set of speakers. The block diagram of the GMM-UBM framework is shown in Fig. 2.

Fig. 2

Block diagram of the GMM-UBM framework
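To make this pipeline concrete, the following is a minimal Python sketch of UBM training, mean-only MAP adaptation, and log-likelihood ratio scoring using scikit-learn's GaussianMixture. The function names, mixture size, and relevance factor are illustrative assumptions; the sketch does not reproduce the MSR toolkit's API.

```python
# Minimal GMM-UBM sketch: UBM training, mean-only MAP adaptation, and
# average log-likelihood ratio scoring. Feature matrices are (n_frames, n_dims)
# acoustic vectors (e.g. MFCCs or GFCCs); names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64):
    """Train the speaker-independent UBM on pooled background speech."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=200)
    ubm.fit(np.vstack(background_features))
    return ubm

def map_adapt_means(ubm, speaker_features, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to one enrolled speaker."""
    post = ubm.predict_proba(speaker_features)            # (T, C) responsibilities
    n_c = post.sum(axis=0)                                 # soft counts per component
    f_c = post.T @ speaker_features                        # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]             # adaptation coefficients
    return alpha * (f_c / np.maximum(n_c[:, None], 1e-10)) \
        + (1.0 - alpha) * ubm.means_

def llr_score(ubm, adapted_means, test_features):
    """Average log-likelihood ratio of the speaker model versus the UBM."""
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    spk.weights_, spk.covariances_ = ubm.weights_, ubm.covariances_
    spk.means_ = adapted_means
    spk.precisions_cholesky_ = 1.0 / np.sqrt(ubm.covariances_)
    return spk.score(test_features) - ubm.score(test_features)
```

In this sketch only the component means are adapted, which is the common choice for GMM-UBM enrolment; weights and covariances are shared with the UBM so that the log-likelihood ratio isolates speaker-specific information.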

4 Background

4.1 Baseline ASV: i-vector

Several speech classification tasks, including speaker, language, and dialect recognition, speaker diarization, speech recognition, and clustering, rely heavily on the i-vector technique. The i-vector concisely represents what transpires during a specific speech recording. The classical i-vector approach is built on the Gaussian Mixture Model (GMM) and uses subspace modeling: the i-vector represents the GMM supervector within a total variability subspace. The finding that the JFA channel space contains information that can distinguish speakers [14] provided the impetus for this single-subspace method. The i-vector model of the speaker-adapted GMM can be represented by (1)

$$\mu = m + {\rm T}w$$
(1)

where m denotes the mean supervector taken from the Universal Background Model (UBM), T is a rectangular low-rank matrix, and w is a normally distributed random vector, the so-called i-vector [15]. Scoring in the i-vector framework consists of measuring how well the target and test speakers' factors align. Several post-processing steps have been proposed to this end, such as linear discriminant analysis (LDA) and Gaussian probabilistic linear discriminant analysis (PLDA), which maximise between-class variation and reduce within-class variance. Scores are then obtained by log-likelihood thresholding based on the PLDA hyperparameters. The total variability factors w are independent random variables, and i-vectors are assumed to follow a standard normal distribution N(0, I). Extracting the i-vector is essentially a maximum a posteriori (MAP) estimation of w in the subspace spanned by T, the total variability subspace being the subspace from which i-vectors are extracted. Because the total variability space denoted by T encompasses both speaker and channel variability, i-vector techniques need additional intersession compensation before scoring to mitigate the effects of channel variability. The block diagram of the i-vector framework is shown in Fig. 3.

Fig. 3

Block diagram of the i-vector framework
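As an illustrative sketch of this back end, the snippet below applies length normalisation and LDA to already-extracted i-vectors and scores trials with cosine similarity, a simplified stand-in for full PLDA scoring. i-vector extraction itself (estimating T and w) is assumed to be performed by a toolkit such as the MSR Identity Toolbox, and the dimensionalities are assumptions.

```python
# i-vector back-end sketch: length normalisation, LDA to maximise
# between-speaker and reduce within-speaker variance, then cosine scoring
# (a simplified stand-in for PLDA). Rows of the i-vector matrices are i-vectors.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def length_normalise(ivectors):
    """Project i-vectors onto the unit sphere."""
    return ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)

def train_lda(dev_ivectors, dev_speaker_labels, out_dim=150):
    """Fit LDA on a labelled development set; out_dim must be <= n_speakers - 1."""
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    lda.fit(length_normalise(dev_ivectors), dev_speaker_labels)
    return lda

def cosine_score(lda, enrol_ivector, test_ivector):
    """Cosine similarity between projected enrolment and test i-vectors."""
    e = lda.transform(length_normalise(enrol_ivector[None, :]))[0]
    t = lda.transform(length_normalise(test_ivector[None, :]))[0]
    return float(e @ t / (np.linalg.norm(e) * np.linalg.norm(t)))
```

Cosine scoring is used here only to keep the sketch short; the experiments reported later rely on PLDA-based log-likelihood scoring as described above.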

4.2 Noise problems in speech signals

Environmental sounds are a rich source of acoustic data for ambient noise and soundscape monitoring, but they also contain a great deal of background noise, which makes extracting useful information difficult. Ambient noise recordings are typically regarded as underused quasi-datasets, and for many people such noise is a source of concern [24]. It is difficult to describe these noises with standard audio characteristics and to design relevant ambient-noise features for automatic acoustic classification systems [5, 8]. A range of ambient sound sources generates various types of noise; these harmful environmental contributions are referred to as "noise sources." In environmental and soundscape monitoring, acoustic sensors are deployed outdoors, where they are exposed to changing and severe weather conditions and microphone wind noise becomes a major concern. Signals of interest are therefore often contaminated by environmental and system-introduced noise, such as channel interference [4, 8].

4.3 Feature extraction

Mel Frequency Cepstral Coefficients (MFCCs), computed from log energies distributed on a Mel scale, are the most extensively used cepstral features. They are widely adopted because of their simplicity and strong performance in automatic speaker identification under clean, matched conditions [6, 13, 25]. However, narrow-band localised random disturbances to which human perception is insensitive still affect MFCCs [8, 36], and noise and reverberation degrade MFCC performance rapidly. Because MFCCs are not robust under noisy, reverberant, or mismatched conditions, numerous researchers have investigated robust variants. The Gammatone feature (GF) is generated from a bank of Gammatone filters, proposed to replicate human cochlear filtering, whose bandwidths are proportional to the auditory ERB function defined in [2, 18]. The GFCC is derived from the GF. Because it models the human auditory system, the GFCC has demonstrated good resistance to acoustic changes [1, 31]. Research shows that the GFCC's Gammatone filterbank is more noise-resistant than the MFCC's Mel filterbank; therefore, the GFCC feature was used in this experiment. The HTK toolkit is often employed to derive GFCCs [42]. The main steps of GFCC extraction are shown below (Fig. 4):

Fig. 4

Block Diagram of GFCC extraction [3]
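The following Python sketch illustrates these steps: an ERB-spaced gammatone filterbank, frame-level subband energies, cubic-root loudness compression, and a DCT. The 64-channel filterbank, 19 coefficients, and 30 ms / 15 ms framing follow the setup reported later; the ERB constants, filter length, and helper names are assumptions rather than the exact configuration of the HTK-based extraction.

```python
# GFCC extraction sketch: ERB-spaced gammatone filterbank, framed subband
# energies, cubic-root compression, DCT. Parameters are illustrative.
import numpy as np
from scipy.signal import fftconvolve
from scipy.fftpack import dct

def erb_space(low_hz, high_hz, n_channels):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    ear_q, min_bw = 9.26449, 24.7
    lo = np.log(low_hz + ear_q * min_bw)
    hi = np.log(high_hz + ear_q * min_bw)
    return np.exp(np.linspace(hi, lo, n_channels)) - ear_q * min_bw

def gammatone_ir(fc, fs, duration=0.04, order=4):
    """Impulse response of a 4th-order gammatone filter centred at fc."""
    t = np.arange(0, duration, 1.0 / fs)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gfcc(signal, fs=16000, n_channels=64, n_ceps=19,
         frame_len=0.030, frame_shift=0.015):
    """Compute GFCCs for a mono signal; returns (n_frames, n_ceps)."""
    centres = erb_space(50.0, fs / 2.0, n_channels)
    subbands = np.stack([fftconvolve(signal, gammatone_ir(fc, fs), mode="same")
                         for fc in centres])                # (channels, samples)
    win, hop = int(frame_len * fs), int(frame_shift * fs)
    n_frames = 1 + (subbands.shape[1] - win) // hop
    feats = np.empty((n_frames, n_channels))
    for i in range(n_frames):
        seg = subbands[:, i * hop:i * hop + win]
        feats[i] = np.cbrt(np.mean(seg ** 2, axis=1))       # loudness compression
    return dct(feats, type=2, axis=1, norm="ortho")[:, :n_ceps]
```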

5 Experimental setup

In the past two decades, there has been a significant increase in interest in text-independent speaker verification [4]. The i-vector approach is considered the state-of-the-art method for text-independent speaker recognition. The architecture of our i-vector-based identification system is displayed in Fig. 3. As is usual in real-world applications, we employ a commercially available, state-of-the-art recognition system whose settings cannot be tailored to the test environment because of a shortage of data for the i-vector recognition system. The training phase involves around 20 s of active speech on average, whereas the test phase uses non-overlapping chunks of about 2.5 s of active speech. Test segments are taken from speech delivered under various masks.

5.1 Dataset

Regarding the construction of the dataset and the training and testing speech material, several existing datasets are related to speaker verification, such as the Face Mask dataset (FMD) [41]. For this study, short speech samples were recorded by 30 participants, 15 male and 15 female, using a Zoom H6 recorder. Recordings were made without a face covering, with a fabric cover, and with a surgical mask, in the School of Science, Engineering, and Environment at Salford University, United Kingdom. A specific noise-reducing wall element installed in the studio helps to decrease background noise; as a result, the recording quality is high and reverberation is minimal. The room is approximately five by five metres, with two windows and one door. The data was initially captured in 44.1 kHz WAV format, and the sampling frequency was reduced to 16 kHz for the speaker identification experiments in this study. The microphone was placed close to the speaker's lips to ensure a clear sound. We apply simple energy-based voice activity detection to remove the large chunks of silence in each excerpt. Each participant was asked to provide two speech samples of different lengths (approximately 30 s for the first sample and around 20 s for the second). Unless otherwise stated, the speakers speak in English and read a random selection of general material from easily readable sources (newspapers, books, pamphlets, articles, and similar material may be used to produce text/language-independent samples).
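As an illustration of the pre-processing described above, the sketch below shows the 44.1 kHz to 16 kHz downsampling step and a simple energy-based voice activity detector. The frame sizes and the energy threshold are illustrative assumptions, not the exact values used in our processing.

```python
# Downsampling and simple energy-based VAD sketch (illustrative parameters).
import numpy as np
from scipy.signal import resample_poly

def downsample_to_16k(signal, fs_in=44100, fs_out=16000):
    # 44100 * 160 / 441 = 16000, so polyphase resampling with these
    # integer factors performs the conversion exactly.
    return resample_poly(signal, up=160, down=441)

def energy_vad(signal, fs=16000, frame_ms=30, hop_ms=15, threshold_db=-35.0):
    """Keep frames whose energy is above `threshold_db` relative to the peak frame."""
    win, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]
    energy_db = 10 * np.log10([np.mean(f ** 2) + 1e-12 for f in frames])
    keep = energy_db > (energy_db.max() + threshold_db)
    return (np.concatenate([f for f, k in zip(frames, keep) if k])
            if keep.any() else signal)
```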

5.2 Noisy mixing

For this investigation, the recorded samples described in the preceding section were mixed with various noise levels to create noisy utterances at a range of SNR values. The noisy speech samples were generated by mixing cafeteria noise with the clean speech to produce a range of signal-to-noise ratios (SNRs from 20 dB down to -5 dB). MATLAB code was used to create an audio mixer that combines speech and noise samples based on signal strength. In the mixing procedure, the noise files were truncated to the same length as the target speech utterance, ensuring that noise mixing occurred over the entire speech signal. The mixing process handles normalisation so that appropriate amounts of speech and noise are added: the mixed signals are normalised to, and compared at, the same observed (RMS) level. The utterance samples obtained from the 30 speakers (15 male and 15 female) are then mixed with cafeteria noise at five different SNRs ranging from -5 dB to 20 dB, and the noisy utterances are used to validate the proposed speaker recognition technique. The cafeteria noise employed in this study was captured at a sampling rate of 44.1 kHz.
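The following Python sketch (the study itself used MATLAB) reproduces the described mixing procedure: the noise is truncated to the speech length, both signals are equalised in RMS, and the noise gain is chosen to reach the requested SNR; the peak normalisation at the end is an assumption added to avoid clipping.

```python
# Sketch of mixing clean speech with cafeteria noise at a target SNR (dB).
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise at the requested SNR."""
    noise = noise[:len(speech)]                    # truncate to speech length
    noise = noise * (rms(speech) / rms(noise))     # equalise RMS levels
    gain = 10.0 ** (-snr_db / 20.0)                # noise gain for target SNR
    noisy = speech + gain * noise
    return noisy / np.max(np.abs(noisy))           # avoid clipping

# Example usage, e.g. for the conditions shown in the DET plots:
# for snr_db in (20, 10, 0, -5):
#     noisy = mix_at_snr(clean_speech, cafeteria_noise, snr_db)
```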

5.3 Evaluation methods

The verification scores for the trials are calculated as the log-likelihood ratio between the speaker models and the test observations. To grade ASR systems, the National Institute of Standards and Technology (NIST) developed a set of standard performance measures [16]. False positives (false alarms) and false negatives (misses) are the most common errors in hypothesis tests. A false positive occurs when the system wrongly accepts an impostor as the target during impostor trials; a false negative occurs when, during target trials, the system rejects the target as an impostor. A plot of the two error rates on a normal deviate scale, known to NIST as the Detection Error Trade-off (DET) curve, is a useful way of showing system performance [10]; the curve is a straight line when the underlying score distributions are normal. The Equal Error Rate (EER) is the operating point on the curve where the false rejection rate (FRR) and the false acceptance rate (FAR) are equal. The system's actual decisions are evaluated with a decision cost function (DCF).
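As an illustration, the sketch below computes the EER from lists of target and impostor trial scores by sweeping the decision threshold until the FAR and FRR curves cross; it is a simple approximation, not the official NIST scoring tools.

```python
# EER sketch: sweep thresholds over the pooled scores and find the
# operating point where the false acceptance and false rejection rates meet.
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate (in %) from target and impostor trial scores."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(scores)                  # ascending thresholds
    labels = labels[order]
    # At threshold index i, trials 0..i are rejected and the rest accepted.
    frr = np.cumsum(labels) / labels.sum()                   # missed targets
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # accepted impostors
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2.0
```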

6 Experiments results

The findings of the open-set speaker verification experiments are reported in Table 1. Under matched conditions, the speaker verification error rate for the "no cover" situation is somewhat lower than for the other conditions. The experiment extracts 64-dimensional GFCCs from the speech signal and stores them on disk in HTK format. The speech signal is divided into 30 ms frames with a 15 ms overlap; 19 GFCCs are then extracted and augmented with frame energies, yielding 60-dimensional feature vectors. After removing unvoiced speech, we apply feature warping [30]. Template i-vectors were created for speech under the various face masks and without a face mask in order to examine the behaviour of the verification system when comparing i-vectors derived from speech under different face masks. The experiment used clean utterances of 5 s length in the training stage versus speech signals uttered under the two face masks and corrupted with various signal-to-noise ratios of cafeteria noise in the testing phase. In general, a high verification rate is observed when the enrolment and testing utterances come from the same group, while recognition performance degrades when the test segment contains speech under a face mask.
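For completeness, the following sketch illustrates the feature warping step [30]: within a sliding window, each feature dimension is mapped through its rank onto a standard normal distribution. The 301-frame (approximately 3 s) window is a common choice and an assumption here, not a value stated above.

```python
# Short-time feature warping sketch: rank-based mapping of each feature
# dimension onto N(0, 1) within a sliding window.
import numpy as np
from scipy.stats import norm

def feature_warp(features, window=301):
    """Warp each column of `features` (n_frames, n_dims) to N(0, 1)."""
    n, half = len(features), window // 2
    warped = np.empty_like(features, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        block = features[lo:hi]
        rank = (block < features[i]).sum(axis=0) + 0.5   # rank of centre frame
        warped[i] = norm.ppf(rank / len(block))          # inverse normal CDF
    return warped
```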

Table 1 Baseline performance using clean speech for training and various SNRs in testing

An EER of 2.0% was observed across the comparisons at 20 dB SNR, indicating near-perfect discrimination of speakers. Speaker recognition was therefore significantly more accurate at high SNRs (EER of 2.0%) than at low SNRs (EER of 17.2%, as shown in Table 1). Same-speaker scores were lower under the fabric covering condition, but not to a sufficient extent to introduce errors, whereas the same-speaker scores with a surgical mask and with no face covering were similar. Nevertheless, speaker verification accuracy was degraded when masks were used. As shown in Tables 2 and 3, masks had little to no effect at low noise levels (high SNR), with an EER of no more than 2.4% for speech under a fabric mask compared with the no-mask condition. The surgical mask condition had an EER of 2.50%, as did the hoodie/scarf combination, the closest analogue to the fabric coverings, although again the surgical mask same-speaker scores were the highest. At high noise levels (low SNR), the EER rose from 17.2% with no mask to 19.7% with a fabric mask, higher than in the no-mask condition, but there was no statistically significant difference between the surgical and fabric masks. Similarly, masks had little effect at low noise levels, with an EER of no more than 2.6% for speech under a surgical mask compared with the no-mask condition, as shown in Table 3; at low SNR the EER increased from 17.2% with no mask to 20.0% with a surgical mask. The results show that the two types of masks provide equivalent accuracy at low background noise levels; overall, the effect of masks on speaker recognition performance varied with the SNR (Figs. 5, 6, 7 and 8).

Table 2 System performance with different SNRs using a fabric face mask
Table 3 System performance with different SNRs using a surgical mask
Fig. 5

DET plot with SNR = 20 dB

Fig. 6

DET plot with SNR = 10 dB

Fig. 7

DET plot with SNR = 0 dB

Fig. 8

DET plot with SNR = -5 dB

7 Concluding remarks

We have reported the findings of a preliminary study investigating the effect of wearing masks on the performance of a state-of-the-art text-independent speaker verification scheme. These experiments aim to determine whether a face mask has a passive or active effect on recorded speech. The research relies on a small speech corpus, built specifically for the study, of 30 volunteer speakers and two kinds of face masks (fabric and surgical). This paper presents preliminary results from trials comparing speech under face cover to normal speech in the context of speaker verification and mask categorisation tasks. The performance of the i-vector-based speaker verification system with GFCC features diminishes when used with mismatched masks; however, the relatively minor deterioration implies that state-of-the-art verification algorithms can at least partly compensate for the mask-induced mismatch. The results show that masks had little to no effect in low background noise, with an EER of 2.4–2.5 percent at 20 dB SNR for both masks, compared to the no-mask condition at the same level. In noisy conditions, at a 5 dB SNR, the EER was 12.7–13.0%, higher than without a mask, indicating that the two masks perform similarly at both low and high levels of background noise, while the degradation in system accuracy becomes more noticeable at high noise levels compared with the no-mask condition at the same level. To better understand the influence of wearing a face cover on the speech signal, future studies should investigate the acoustic alterations related to individual sections of speech. An in-depth investigation into classifying face masks efficiently is ongoing.