1 Introduction

This study investigates the impact of identical twins on forensic voice comparison by examining the performance of deep speaker embedding methods in differentiating between the voices of twins. Specifically, we aim to answer the question: How does the presence of identical twins affect the performance of speaker verification systems in forensic applications?

Forensic voice comparison (FVC) is the scientific methodology of examining recorded voices to decide whether they belong to the same person or to different individuals (Ferragne et al., 2024; Sigona & Grimaldi, 2023; Zheng et al., 2020). It is usually applied in legal proceedings or criminal investigations where the origin of voice samples is in question (Frost et al., 2015; Geoffrey et al., 2020). Typically, FVC compares a voice recording of known origin (the suspect) with a recording of disputed origin (typically referred to as the offender sample) (Abed & Sztahó, 2023; Wang et al., 2022). In a machine learning setting, a typical way to conclude pair-wise comparisons is to label the trials (sample pairs to compare) as ‘same’ or ‘different’, which enables a model to be trained accordingly (Sztahó & Fejes, 2023; Sztahó et al., 2021). By this definition, FVC is technically a speaker verification scheme, despite the absence of explicit knowledge of the “known” speaker (Al-Ali et al., 2021; Ishihara, 2018). The relevant population can generally be delimited to individuals of a given gender who communicate in a specific language with a distinct accent (Morrison & Enzinger, 2016; Stewart & Enzinger, 2019). Through auditory perception, non-professionals, including members of the judiciary and jurors, can generally discern the gender of the speaker in a questioned audio recording, determine the language being spoken, and broadly identify the accent of that language (Morrison et al., 2022). Determining the specific person, however, is far more complex and challenging, even for those with adequate expertise.

In cases involving identical twins, verifying the speaker’s identity becomes even more intricate for specialists. One of the most critical issues researchers and experts face in forensic techniques is the problem of identical twins of the same gender, because they share very similar physical characteristics (Akin et al., 2018). Naturally, this similarity extends to the characteristics of the voice, making the output of both perception-based and automatic verification methods unreliable. Deep speaker embeddings have become state-of-the-art in speaker recognition and verification (Desplanques et al., 2020; Nagrani et al., 2019). Although several researchers have investigated traditional and deep learning models for feature extraction and speaker verification, it is not yet clear how identical twins affect (presumably deteriorate) the performance of such systems.

Ariyaeeinia and colleagues (2008) explored the efficacy of speaker verification technology in distinguishing between genetically identical twins, employing a customized clean-speech database that includes recordings from 49 pairs of identical twins. Their primary focus was the challenge speaker verification faces when dealing with identical twins. Although the authors demonstrated a method that enhances speaker verification for identical twins, the method is based on Gaussian mixture models (GMMs) and is now outdated. Sabatier et al. (2019) conducted a study in which they gathered speech samples from 167 pairs of identical twins. Their findings revealed that the error rate, as indicated by the Equal Error Rate (EER), was notably higher when working with twin samples than with non-twin samples. Their approach is based on i-vectors.
In the works of Akin and colleagues (Akin et al., 2018; Cihan et al., 2019), a multi-biometric approach is presented for identifying twins by leveraging speech samples and ear images; for this literature review, we concentrate solely on the speech component. The authors employed the AVTD dataset, the same as the one used in this study, to assess their methodology. These works do not detail how twins deteriorate performance but investigate whether jointly using ear-image information can increase performance, and the voice comparison methods they present are not state-of-the-art. Other works follow a more traditional acoustic analysis of FVC. Although these are not closely related to the topic presented here, like the work of San Segundo and Yang (2019), they all found that the performance of such systems is greatly affected by identical twins. Hence, an examination considering novel deep learning approaches is needed.

Furthermore, deep learning has gained widespread usage in speaker verification as a feature engineering method in recent years. One of the prominent architectures in this field is the x-vectors architecture (Snyder et al., 2017). The x-vectors represent a speech segment as a fixed-dimensional vector (typically 512-dimensional), obtained by passing the speech signal through a time-delay deep neural network. This network comprises multiple layers, including frame-level and segment-level layers, enabling the extraction of speaker-specific features. Training the model on labelled data in a speaker identification manner allows the network to learn and extract discriminative features specific to each speaker. Another noteworthy deep learning architecture for feature engineering in speaker verification is the ECAPA-TDNN model (Desplanques et al., 2020), an extension of the x-vectors. This model leverages convolutional layers and SE-Res2Blocks to capture high-level features from speech signals, along with multi-layer feature aggregation and summation, thus enhancing the discriminative power of the embedding vectors. By effectively extracting speaker-specific features, these architectures have enhanced the accuracy and reliability of speaker verification systems and become state-of-the-art techniques. Ongoing research in deep learning holds the potential for further advancements in speaker verification technology, fostering its broader adoption across real-world applications.

The models are evaluated in the likelihood-ratio framework used in forensic science and practice (Geoffrey, 2011). The framework allows for automatic and semi-automatic evaluation of evidence using different methods and measurement types (e.g. DNA, fingerprinting) and supports a processing pipeline that can be easily computed for multiple types of evidence. FVC, where speaker recognition techniques are adapted to the framework’s requirements, is one area where this paradigm can be applied. For a piece of evidence, the framework produces the ratio of the likelihoods or probability densities (at a given point) under the same-speaker and different-speaker hypotheses (Morrison, 2011).

The current study investigates the effect of identical twins on FVC, focusing on understanding the impact of twin samples on the performance of speaker verification systems based on deep speaker embeddings (namely x-vectors and ECAPA-TDNN).
Additionally, we examine whether fitting the LR scoring model and fine-tuning the deep speaker embedding models on twins’ voices can decrease the negative effect of twin voices. The results can help FVC experts and the judiciary determine how reliable such automatic methods are in court proceedings. The rest of this paper is organized as follows: Sect. 2 describes the materials and methods used in this work, including the dataset. Section 3 presents the results and analysis, followed by the discussion and the conclusion.

2 Materials and methods

The workflow of forensic speaker verification relying on speaker embedding vectors (specifically x-vectors and ECAPA-TDNN) was assessed using samples from the Audio-Visual Twins Database (AVTD) (Li et al., 2015). The diagram in Fig. 1 depicts the structure of the applied procedure. The SpeechBrain toolkit was utilized to fine-tune the pre-trained models. The deep learning models were employed to extract embedding vectors (features) from the speech samples, followed by cosine similarity calculation between pairs of embedding vectors. The similarity score was fed into a logistic regression model that calculated the probability that the two samples of a trial (test sample pair) originate from the same speaker, enabling the calculation of the likelihood-ratio (LR) score. From a machine learning point of view, two parts of this workflow can be influenced by, and adapted to, twin voices: the speaker embedding models and the LR score calculation model. Both are examined in this study: fine-tuning the speaker embedding models and fitting the LR score calculation (logistic regression) model on twins’ voice samples.
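
A minimal sketch of the first two stages of this workflow is given below, assuming the publicly available SpeechBrain ECAPA-TDNN checkpoint from Huggingface; the file names are illustrative, and recent SpeechBrain versions expose EncoderClassifier under speechbrain.inference rather than speechbrain.pretrained.

# Sketch: embedding extraction and cosine scoring for one trial.
import torchaudio
from scipy.spatial.distance import cosine
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path):
    # Load a 16 kHz mono sample and return its speaker embedding vector.
    signal, fs = torchaudio.load(path)
    return encoder.encode_batch(signal).squeeze().detach().numpy()

e_known = embed("suspect.wav")            # recording of known origin
e_quest = embed("offender.wav")           # recording of disputed origin
score = 1.0 - cosine(e_known, e_quest)    # scipy's cosine() is a distance
# `score` is then passed to the logistic regression LR model (Sect. 2.4).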

Fig. 1 Forensic speaker comparison process applied in the present study

2.1 Dataset description

The AVTD (Li et al., 2015) was used to evaluate the questions and test cases (detailed in Sect. 2.6). The audio recordings within the dataset were captured during the data collection phase of the International Twins Festival held in China, using a high-quality Audio-Technica condenser microphone across three separate recording sessions. The dataset comprises 39 pairs of twin volunteers. Most participants were Chinese, two were American, and three were Russian. The twin volunteers ranged in age from 7 to 54 years, most of them between 7 and 22 years old. Figure 2 illustrates the age distribution of the twins, and Fig. 3 shows an example speech sample for a twin pair reading the same sentence. The dataset contains three sessions in three different modalities; in this work, we focus on the audio recording session, where the participants were asked to read three short texts in either Chinese or English. Each subject read three texts — the numbers from 1 to 10 and two poems — and each text was read and recorded three times. The samples were converted from stereo to mono format and re-sampled from 44 kHz to 16 kHz to be suitable for the given pre-trained deep learning models. Figure 4 shows the number of samples per duration.
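
A hedged sketch of this preprocessing step with torchaudio follows (channel averaging for the downmix is our assumption; file names are illustrative):

# Sketch: stereo-to-mono downmix and resampling to 16 kHz.
import torchaudio

def preprocess(in_path, out_path, target_sr=16000):
    waveform, sr = torchaudio.load(in_path)      # (channels, samples)
    mono = waveform.mean(dim=0, keepdim=True)    # average the channels
    mono_16k = torchaudio.functional.resample(mono, orig_freq=sr, new_freq=target_sr)
    torchaudio.save(out_path, mono_16k, target_sr)

preprocess("avtd_pair01_a_text1.wav", "avtd_pair01_a_text1_16k.wav")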

Fig. 2 AVTD twin dataset age distribution (Li et al., 2015)

Fig. 3 Comparison between younger and elder twins reading the same sentence

Fig. 4 Distributions of sample duration in the AVTD segmented dataset used for the study

The dataset was divided into two parts: training and testing. The training part comprised 20 pairs of twins, while the testing part comprised 19 pairs. This partitioning was maintained throughout the workflow, and the diversity of the samples was carefully considered when creating the subsets: the age, language, and gender distributions were evaluated during the dataset-splitting process. The logistic regression models were fitted, and fine-tuning was done, on the training part, while the evaluation was carried out on the testing part. Due to the small number of samples, the training set was also used as a validation set during fine-tuning of the models. The final trials (test sample pairs) were constructed from the testing set.
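
Because both members of a twin pair must fall into the same subset, the split can be performed at the level of pair identifiers; a sketch with sklearn (the toy recording list and pair labels are illustrative):

# Sketch: pair-level train/test split so twins never straddle subsets.
from sklearn.model_selection import GroupShuffleSplit

recordings = [f"rec_{i:03d}.wav" for i in range(78)]  # toy: 2 files per pair
pair_ids = [i // 2 for i in range(78)]                # 39 twin-pair labels

splitter = GroupShuffleSplit(n_splits=1, train_size=20, random_state=0)  # 20 train pairs
train_idx, test_idx = next(splitter.split(recordings, groups=pair_ids))
train = [recordings[i] for i in train_idx]            # 20 pairs of twins
test = [recordings[i] for i in test_idx]              # 19 pairs of twins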

2.2 Deep learning models

Two pre-trained models were used for embedding vector extraction: the x-vector (Snyder et al., 2017) and ECAPA-TDNN (Desplanques et al., 2020). The models are fed speech features calculated from speech samples and, through various layers, output a speaker identifier; however, features can also be extracted from deeper layers. These vectors contain speaker characteristics and can be used for verification methods. This work utilised the SpeechBrain (Ravanelli et al., 2021) toolkit to run the deep learning-based models for extracting features. The pre-trained models were downloaded from Huggingface.

2.2.1 The x-vectors

The x-vectors model is a deep learning model primarily designed for speaker verification (Snyder et al., 2017). The approach utilizes a multi-layered architecture consisting of fully connected (FC) layers with distinct temporal contexts at each layer; because of the broader temporal context, the architecture is called a time-delay neural network (TDNN). Table 1 illustrates the architecture of the x-vectors and the layer contexts. The first part of the model, which consists of five layers, operates on individual frames of speech, each with a slightly expanded temporal context centred around the current frame t. The dimension of the embedding vector (b), as described in Table 1 [the output of segment 6 (“x-vectors”)], is configured as 512, while the input consists of 24 mel-frequency bands. The model used in this study was pre-trained on the VoxCeleb dataset and acquired from the Huggingface model repository. The fine-tuning performed here adhered to the original model’s input format, sampling rate (16 kHz), and network structure. The fine-tuning process spanned 35 epochs and used early stopping based on the minimum loss observed on the training set. The Adam optimizer was employed with an initial learning rate of 0.001.

Table 1 X-vectors DNN architecture (Snyder et al., 2017)
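
The frame-level part of this architecture can be rendered with dilated 1-D convolutions; the following simplified PyTorch sketch follows the layer contexts of Table 1 but is not the exact training recipe:

# Sketch: x-vector-style frame-level TDNN with statistics pooling.
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    def __init__(self, n_mels=24, emb_dim=512):
        super().__init__()
        self.frames = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, dilation=1), nn.ReLU(),  # [t-2, t+2]
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),    # {t-2, t, t+2}
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),    # {t-3, t, t+3}
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),                # {t}
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),               # {t}
        )
        self.segment6 = nn.Linear(2 * 1500, emb_dim)  # the "x-vector" layer

    def forward(self, feats):                   # feats: (batch, n_mels, frames)
        h = self.frames(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # stats pooling
        return self.segment6(stats)             # (batch, 512) embedding

emb = XVectorSketch()(torch.randn(1, 24, 300))  # 300 frames of 24 mel bands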

2.2.2 ECAPA-TDNN

The ECAPA-TDNN (Desplanques et al., 2020) extends the x-vectors architecture with three critical enhancements: channel- and context-dependent statistics pooling, 1-dimensional Squeeze-Excitation Res2Blocks (1D SE-Res2Blocks), and multi-layer feature aggregation and summation. Channel- and context-dependent statistics pooling allows the network to emphasize speaker characteristics that are not simultaneously activated at the same instants, such as distinguishing between speaker-specific properties of vowels and of consonants. The SE-Res2Block, a technique commonly used in computer vision, combines these characteristics effectively. The model also extends the temporal context of the x-vectors to be more suitable for the global properties of the voice samples. Moreover, the multi-layer feature aggregation of the ECAPA-TDNN model ensures that not only the activation of the selected deep layer is used as a feature map (as in the case of the x-vectors) but that shallower layers (in this case, the SE-Res2Blocks) are also concatenated to feed additional information about the speaker’s identity. A schematic representation of this architecture is illustrated in Fig. 5. The ECAPA-TDNN model downloaded from Huggingface was pre-trained on the VoxCeleb dataset for speaker verification, with a custom configuration in which the dimension of the extracted embedding vector was set to 192 and the input features consisted of 80 mel-frequency band energies. As for the x-vectors, the fine-tuning process involved training the model for 35 epochs with early stopping based on the training set loss, using the Adam optimizer with an initial learning rate of 0.001.
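
A minimal sketch of the 1-D squeeze-excitation idea used inside the SE-Res2Blocks is shown below (the Res2 multi-scale convolutions and the attentive pooling are omitted for brevity; the bottleneck size is an assumption):

# Sketch: channel re-weighting as done by a 1-D squeeze-excitation block.
import torch
import torch.nn as nn

class SqueezeExcitation1D(nn.Module):
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, channels), nn.Sigmoid(),
        )

    def forward(self, x):                   # x: (batch, channels, frames)
        weights = self.gate(x.mean(dim=2))  # squeeze: global mean over time
        return x * weights.unsqueeze(2)     # excite: rescale each channel

out = SqueezeExcitation1D(512)(torch.randn(1, 512, 300))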

2.2.3 Fine-tuning

During fine-tuning, the segmented twins dataset was subjected to data augmentation following the exact procedure used in the original training recipe: the samples were augmented with every combination of time-distorted (duration scaled with factors 0.95 and 1.05) and noise-distorted (15 dB white noise) variants. This augmentation was consistently applied to the entire dataset used in the study, allowing a fair comparison between fine-tuned and pre-trained models. It should be noted that we have no control over the pre-trained models in this regard; however, we followed the same augmentation method described by those models’ creators when fine-tuning. The published results on the pre-trained models indicate that this augmentation technique enhances the models’ robustness.
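
A hedged sketch of these two distortions follows (the exact SpeechBrain recipe may differ in implementation details; the resampling trick for duration scaling is one common realization):

# Sketch: duration scaling (0.95/1.05) and additive 15 dB white noise.
import torch
import torchaudio

def time_distort(waveform, sr, factor):
    # Re-interpret the signal at sr*factor, then resample back to sr;
    # the duration is scaled by 1/factor.
    return torchaudio.functional.resample(waveform, orig_freq=int(sr * factor), new_freq=sr)

def add_white_noise(waveform, snr_db=15.0):
    noise = torch.randn_like(waveform)
    scale = torch.sqrt(waveform.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    return waveform + scale * noise

wav, sr = torchaudio.load("sample_16k.wav")   # illustrative path
variants = [v for f in (0.95, 1.0, 1.05)
            for v in (time_distort(wav, sr, f), add_white_noise(time_distort(wav, sr, f)))]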

Fig. 5 Pre-trained ECAPA-TDNN architecture and its SE-Res2Block [taken from Desplanques et al. (2020)]

2.3 Cosine distance

The cosine distance was employed to evaluate the degree of similarity between the extracted embedding vectors of voice sample pairs. Equation 1 depicts the measure commonly used in speaker verification: the normalized dot product of the two vectors, which yields a similarity score (strictly speaking, this is the cosine similarity; the cosine distance is usually defined as one minus this value).

$$\begin{aligned} CD(a,b)=\frac{a \cdot b}{\Vert a\Vert \, \Vert b\Vert } \end{aligned}$$
(1)

where a and b are the two vectors, and \(\Vert a\Vert\) and \(\Vert b\Vert\) are their magnitudes (lengths), each calculated as the square root of the sum of the squares of the vector’s components. The dot (\(\cdot\)) denotes the dot product of the two vectors, i.e., the sum of the products of their corresponding components. In this study, the calculated cosine similarity scores were fed into a logistic regression model, which performed speaker verification between pairs of samples by estimating the probability of the trial belonging to each class.

2.4 LR score calculation

To calculate the LR scores, logistic regression was employed (with the Python sklearn package). The logistic regression model was fitted on the cosine distance scores, with each trial labelled as same-speaker or different-speaker. The probability of the same-speaker hypothesis was calculated with the logistic regression model using Eq. 2, where E is the evidence, \(H_{so}\) is the hypothesis of same-origin speakers, and \(H_{do}\) is the hypothesis of different-origin speakers. The probability under the different-origin hypothesis can then be calculated using Eq. 3, since \(H_{so}\) and \(H_{do}\) are mutually exclusive and exhaustive events and the input classes were weighted so that \(P(H_{so})=P(H_{do})\). Figure 6 shows an example of a trained logistic regression model, where the distributions of cosine distances of same-speaker and different-speaker sample pairs are shown as blue and yellow lines, respectively (Sztahó & Fejes, 2023).

Fig. 6 An example of a tuned logistic regression model [taken from Sztahó and Fejes (2023)]

$$\begin{aligned} LR &= \frac{P(E|H_{so})}{P(E|H_{do})} \end{aligned}$$
(2)
$$\begin{aligned} P(E|H_{do}) & = 1-P(E|H_{so}) \end{aligned}$$
(3)

Two logistic regression models were considered in this work: (1) a model pre-fitted on a larger dataset and (2) a model fitted on the training set of the AVTD dataset (as mentioned above). Using a pre-fitted model is valid because such models have been found to be language-independent (Sztahó & Fejes, 2023). The pre-fitted logistic regression model was trained on the ForVoice 120+ dataset using 2–10 s long speech samples from 40 native Hungarian speakers.

In logistic regression, the aim is to find the set of parameters that maximizes the likelihood of observing the given data. The model applies the sigmoid function to a linear combination of the input features and their weights, and an iterative optimization process estimates the weights that minimize the difference between the predicted probabilities and the actual class labels in the training data.
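
A sketch of this calibration step with sklearn follows (the training scores and labels are illustrative; class_weight="balanced" enforces the equal-prior weighting mentioned in Sect. 2.4):

# Sketch: fitting the LR calibration model on labelled cosine scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([0.82, 0.75, 0.91, 0.35, 0.28, 0.44]).reshape(-1, 1)
labels = np.array([1, 1, 1, 0, 0, 0])   # 1 = same origin, 0 = different origin

calibrator = LogisticRegression(class_weight="balanced").fit(scores, labels)

def lr_score(score):
    # With P(Hso) = P(Hdo), the posterior odds p/(1-p) equal the LR (Eq. 2).
    p = calibrator.predict_proba([[score]])[0, 1]
    return p / (1.0 - p)

print(lr_score(0.88))   # LR > 1 supports the same-origin hypothesis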

2.5 Evaluation metrics

The outputs of the models in the different test cases were evaluated using the Equal Error Rate (EER) and the log-likelihood-ratio cost (Cllr). The EER is the operating point at which the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) are equal (Eq. 4); it is commonly used in biometric security systems. The False Acceptance Rate is the percentage of impostor trials incorrectly accepted as genuine, while the False Rejection Rate is the percentage of genuine trials incorrectly rejected. The EER serves as a threshold-independent measure of how well a verification system can distinguish between genuine and impostor trials; lower EER values indicate better performance. As a concise summary of the discrimination capability of a detector, the EER is a powerful indicator across a wide range of applications. However, it does not measure calibration, i.e., the ability to set good decision thresholds (van Leeuwen & Brümmer, 2007).

The log-likelihood-ratio cost (Cllr) is defined by Eq. 5. It is based on the LR scores of same-origin and different-origin pairs and measures the balance of the LR scores of the two comparison types, as proposed by Brümmer and du Preez (2006). Ideally, same-origin and different-origin comparisons should have log(LR) values greater than 0 and less than 0, respectively. In addition to Cllr, the minimum Cllr (\(Cllr_{min}\)) value is also reported, which generalizes the original cost function and produces application-independent values.

Results are also illustrated with Tippett plots, a visualization tool commonly used in speaker verification: they plot the cumulative proportions of same-origin and different-origin trials against the log(LR) values.

$$\begin{aligned} EER & = (FAR+FRR)/2 \end{aligned}$$
(4)
$$\begin{aligned} Cllr &= \frac{1}{2} \left( \frac{1}{N_{so}}\sum _{i=1}^{N_{so}}\log _2\left( 1+\frac{1}{LR_{so_i}}\right) +\frac{1}{N_{do}}\sum _{j=1}^{N_{do}}\log _2\left( 1+LR_{do_j}\right) \right) \end{aligned}$$
(5)
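
For reference, both metrics can be computed from trial outputs as sketched below (the EER is read off at the crossing of FAR and FRR, and Cllr follows Eq. 5; the trial data are illustrative):

# Sketch: computing EER (Eq. 4) and Cllr (Eq. 5) from trial outputs.
import numpy as np
from sklearn.metrics import roc_curve

def eer(scores, labels):
    fpr, tpr, _ = roc_curve(labels, scores)   # FAR = fpr, FRR = 1 - tpr
    frr = 1.0 - tpr
    i = np.nanargmin(np.abs(fpr - frr))       # closest point to FAR = FRR
    return (fpr[i] + frr[i]) / 2.0

def cllr(lr_same, lr_diff):
    term_so = np.mean(np.log2(1.0 + 1.0 / np.asarray(lr_same)))
    term_do = np.mean(np.log2(1.0 + np.asarray(lr_diff)))
    return 0.5 * (term_so + term_do)

print(eer(np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2]), np.array([1, 1, 1, 0, 0, 0])))
print(cllr([8.0, 5.0, 0.9], [0.1, 0.3, 1.2]))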

2.6 Test cases

In this study, multiple cases were investigated to show the effect of twins on forensic speaker verification:

  1. Using pre-trained TDNNs without adapting the LR score calculation (using an already fitted LR score calculation model; Sztahó & Fejes, 2023).

  2. Using pre-trained TDNNs with the LR score calculation fitted on twins’ voice samples.

  3. Fine-tuning the pre-trained TDNNs with the LR score calculation fitted on twins’ voice samples.

The purpose of these experiments was to examine how the voice samples of twins impact forensic speaker verification and which part of the workflow should be adapted to increase performance. The pre-fitted LR score calculation model was fitted on a Hungarian voice dataset created for forensic purposes (Sztahó & Fejes, 2023); it has already been shown that this model works independently of language. In the same study, the pre-trained TDNN models (available on Huggingface) were also found to work on languages different from the ones they were trained on. Due to the limited number of samples, the new LR model fitting and the TDNN fine-tuning were done on the training set of the twins dataset.

3 Results

The efficacy of forensic voice comparison and speaker verification when dealing with twins’ identities, and the impact of twins on system performance, was evaluated using the metrics detailed in Sect. 2.5. Three trial groups were considered for each test scenario: (1) all samples, (2) only twin samples, and (3) only non-twin samples. In case (2), only twin different-speaker trials and same-speaker trials were included; in case (3), only non-twin different-speaker trials and same-speaker trials were included.

3.1 Pre-trained models without fine-tuning

Table 2 presents the performance of the pre-trained speaker embedding models (x-vectors and ECAPA-TDNN) with the logistic regression model pre-fitted on the ForVoice 120+ dataset. The ECAPA-TDNN model consistently outperformed the x-vectors model across all examined scenarios. Substantial reductions in performance were observed whenever twin trials were included. Particularly noteworthy results were attained with non-twin test samples: an EER of 3.4%, a \(Cllr_{min}\) of 0.122, and a Cllr of 0.288, all indicative of the effectiveness of the ECAPA-TDNN model. In contrast, test samples involving twins yielded markedly higher metric values for both models: EER values of 25.3% and 31.2% for ECAPA-TDNN and x-vectors, respectively. The Tippett plots of all test trials and twin test trials (for ECAPA-TDNN) are shown in Fig. 7.

Table 2 Results obtained with pre-trained models and pre-fitted LR models
Fig. 7 The Tippett plots of all test trials and twin test trials for ECAPA-TDNN

3.2 Pre-trained speaker embedding models and LR score calibration

Next, we investigated whether adapting the LR score calculation model to twin trials would increase performance. This adaptation means that the logistic regression model used to calculate the LR scores was fitted on the training part of the AVTD dataset (Sect. 2.1). Two fitting scenarios were evaluated: using (1) all training samples and (2) only non-twin samples. The results are presented in Table 3. Across all three trial groups — all samples, twins, and non-twin samples — the EER and \(Cllr_{min}\) were unaffected by refitting the LR model: the final values are the same as in Table 2. However, refitting did impact the base Cllr value. The most favourable outcomes were obtained when the logistic regression model was trained on non-twin samples and tested on non-twin trials with the ECAPA-TDNN model: the EER was 3.4% for non-twin trials, while it worsened to 25.3% when twins were present.

Table 3 Results obtained with pre-trained embedding models and LR models fitted on the AVTD dataset

3.3 Fine-tuning the embedding models

Besides adapting the LR models to twins’ voices, the speaker embedding models can also be adapted. Table 4 shows the metrics when the embedding models were also fine-tuned on the training part of the AVTD dataset. Two LR model setups were considered: (1) fitted on the AVTD training set and (2) pre-fitted on the ForVoice 120+ dataset. The performance metrics show that the results did not improve; the values are similar to those in Tables 2 and 3, and performance still drops significantly on twin test trials. ECAPA-TDNN continues to outperform x-vectors in all scenarios. Figure 8 shows the Tippett plots for all test trials and twin test trials for the fine-tuned ECAPA-TDNN embedding model.

Fig. 8 Tippett plots of all test trials and twin test trials in the case of the fine-tuned ECAPA-TDNN embedding model

Table 4 Results obtained with fine-tuned embedding models

4 Discussion

In the current study, the effects of identical twins were investigated from the forensic voice comparison perspective using deep speaker embeddings. The main goal was to assess whether twin samples affect voice comparison and whether pre-trained models can be helpful for forensic voice comparison. Furthermore, we examined whether adapting the LR score calculation affects the performance results.

The experiments show that identical twins strongly degrade the performance of a forensic voice verification system. The lowest EER of ECAPA-TDNN was 3.4% for non-twin trials, while the EER was 25.3% when twins were present. The same pattern occurred with x-vectors, where the EER was 31.2% with twins and a better 8.5% on non-twin trials. The results of both pre-trained models are poorer with twins, confirming the strong effect of twins on verification performance. Although the results are poorer on twin trials, the EER is still around 25%, so the presented methods are not useless for identical twins. Still, in actual casework, when a high LR score indicates the same speaker on the tested samples, the possibility of twins must also be considered.

Including twins’ samples (here, the AVTD twin dataset) in the dataset used to fit the logistic regression model for LR score calculation did not affect the final performance. The LR scores are better calibrated using the AVTD than using a different dataset (Table 4); hence, they are slightly shifted in the non-twin case. Nevertheless, the final \(Cllr_{min}\) and EER values are the same, meaning the discriminative power of the two LR models is the same. Thus, including twins in the training data seems beneficial for the calibration of the LR scores, but solely adapting the LR model did not improve the performance of the forensic voice comparison. Regarding the fine-tuning of the embedding models (x-vectors and ECAPA-TDNN), the results show that fine-tuning also did not increase performance; if anything, it slightly worsened the metrics, though not significantly. This suggests that twin voices are highly similar both to human perception and in the speaker embedding representations used in the study. The same tendencies were found for the LR models when breaking down the results into twin and non-twin cases.

The deep learning embedding models used in the study can handle identical twins to some extent, but challenges remain due to their high physical similarity. Advanced network architectures and more diverse twin training data may improve the models’ ability to distinguish between identical twins; one of the main limitations in this field is the limited number of twin samples, and such datasets are hard to obtain. Novel deep learning-based embedding models that can learn subtle differences in voices may also be developed, although the root of the problem is the high level of similarity of the voice production organs; therefore, such an advancement is not likely.

Forensic practitioners condense the outcomes of their speaker verification measurements within an interpretive framework. This framework’s configuration and attributes significantly influence the expert report’s conclusive determination of speaker identity probability.
By leveraging the findings of this investigation involving identical twins, such a framework can enhance the objectivity of expert analyses and facilitate a comprehensive evaluation, particularly given that the EER for twins reached 25.3% even in the most favourable scenario. Consequently, the framework warrants consideration for practical application, with due regard to the potential presence of twin individuals. In practical cases, when the LR score suggests that the two speakers in question are the same, other examinations should be used to rule out the possibility of identical twins. These can include manual inspection of multimodal data and of other documents and reports in an investigation. There are also studies dealing with automatically identifying identical twins using machine learning; such methods can be inserted into practical casework to reduce the risk of error. Some limitations of this study originate from the available dataset: a small sample size, particularly for identical twins, and an age distribution skewed towards younger participants. Future studies should create or use larger twin-focused datasets with a more representative age distribution.

5 Conclusion

The present study focused on the effects of identical twins on deep speaker embeddings, with a special focus on the metrics of forensic voice comparison. Two distinct deep learning models, x-vectors and ECAPA-TDNN, were employed to extract embedding features from audio samples of a dataset containing identical twins, and the likelihood-ratio framework used logistic regression as the LR score calculation method. The results show a significant reduction in speaker verification performance when twin samples are present. Neither adapting the LR score calculation to twin samples nor fine-tuning the speaker embedding models could mitigate this limitation. Recognising same or different speakers remained possible even in the case of identical twins, but the performance dropped significantly. Future work could investigate newly developed speaker embedding techniques and evaluate their performance against the current results. Such efforts may advance our understanding of the impact of twins on verification performance and contribute to improving forensic speaker verification methodologies.