1 Introduction

Spoken language diarization (LD) is the task of automatically segmenting and labeling the monolingual segments in a given multilingual speech signal. Existing works on LD are few [30]. The majority of them use phonotactic (i.e., distribution of sound units) based approaches [5, 17, 32]. Developing LD with a phonotactic approach requires transcribed speech utterances, which are difficult to obtain because most of the languages present in code-switched multilingual utterances are resource-scarce [30, 32]. Although some transfer learning approaches adapt the phonotactic models of a high-resource language to obtain models for a low-resource language, they may suffer performance degradation if the two languages do not belong to the same language group [30]. Further, LD is effortless for humans, especially for known languages, yet challenging for machines. Hence, there is a need to explore alternative approaches for LD.

Speaker diarization (SD) is the task of automatically segmenting and labeling the mono-speaker segments of a given multi-speaker utterance, and it is well explored in the literature. Though the information that needs to be captured differs between the LD and SD tasks, there are many similarities: for example, features approximating the vocal tract resonances have been used successfully for modeling both speakers and language-specific phonemes [4, 14, 15]. Furthermore, most approaches used for spoken language identification (LID) are inspired by approaches used for the speaker identification/verification (SID/SV) task [25, 31]. In addition, most of the successful LID systems borrowed from the SID/SV literature do not require transcribed speech data [14, 31], whereas LID systems developed using the phonotactic approach do. This motivates a study of the close association between the LD and SD tasks, which may be exploited to develop approaches for LD.

The SD field has evolved mainly in two directions: (1) change point detection followed by clustering and boundary refinement, and (2) fixed-duration segmentation followed by i-vector/embedding vector extraction, clustering, and boundary refinement [21, 23, 33]. The studies in [3, 7, 13, 23] reported that initial change point detection improved overall SD performance. Thus this study focuses on the development of spoken language change detection (LCD) through a comparative analysis between LCD and speaker change detection (SCD). The available SCD approaches can be broadly classified into two groups: (1) distance-based unsupervised approaches and (2) model-based supervised approaches [21, 23]. The distance-based approach extracts speaker-specific features from consecutive sliding windows of the speech signal and applies hypothesis testing (whether or not both windows come from a single speaker) to predict speaker changes [21, 23]. Following this approach, many feature extraction techniques, such as the excitation source [9, 27] and the fundamental frequency contour [13], and distance metrics, such as Kullback-Leibler (KL) divergence [29], Bayesian information criterion (BIC) [6], KL2 [29], generalized likelihood ratio (GLR) [11], and information bottleneck (IB) [7], have been proposed in the literature. Generally, the performance of the distance-based unsupervised approach degrades with variation in environment and background noise (it may predict false changes); hence, supervised model-based approaches have been proposed to resolve this issue [21, 23]. Early approaches modeled individual speakers using the Gaussian mixture model with universal background model (GMM-UBM) [2, 21], hidden Markov model (HMM) [19], etc., but nowadays, deep learning frameworks predict the speaker change by discriminating speaker-change segments (the neighborhood of the speaker change point) from no-change segments [21, 23]. However, the model-based approach smooths the output evidence and may lead to missed detection of change points [21]. Moreover, training a supervised model requires labeled speech data from a similar environment/recording condition, speaking style, language, etc., making system development complicated. Therefore, distance-based unsupervised approaches are more popular and widely used for the SCD task [7, 21, 23].

Fig. 1: a and c Two-speaker time-domain speech signal and its spectrogram, respectively; b and d two-language (bilingual) time-domain speech signal and its spectrogram, respectively

Even though the available SCD frameworks look simple to adopt, there are challenges in doing so. Figure 1a and b show the time-domain speech signals of an utterance having a speaker change and a language change, respectively. The speaker/language change points are marked manually by listening to the utterances and observing their time-domain representations. From the time-domain signal alone, it is very difficult to locate either the speaker or the language change point. Figure 1c and d show the spectrograms of the two utterances. The spectrograms show that the high-energy spectral structures vary significantly around the speaker change, whereas the structure remains intact around the language change. When the speaker changes, the speech production system changes, hence the variation in the spectral structures. However, the structure of the high-energy frequencies remains intact across the language change, as a single speaker speaks both languages.

It is interesting to note that humans discriminate between spoken languages without knowing the detailed lexical rules and phonemic distributions of the respective languages. Of course, humans need prior exposure to the languages [14]. Humans may exploit long-term phoneme dynamics to discriminate between languages. Therefore, a language change may be detected by capturing the long-term language-specific spectro-temporal dynamics, which may represent valid phoneme sequences and their combinations into syllables and subwords of a language. Based on this need to exploit long-term spectro-temporal evidence, it can be hypothesized that LCD by a human/machine requires a longer neighborhood duration around the change point than SCD. In addition, LCD may also benefit from prior exposure to the respective languages. A human subjective study focusing on language/speaker change detection is set up to validate this hypothesis. Further, for automatic detection of language change, initial studies are performed using the unsupervised distance-based SCD framework. Based on the experimental results, the framework is then tuned appropriately to improve the performance of the LCD task.

The main contributions of this work are summarized as follows:

  1. Analysis of the spectro-temporal representation around speaker and language changes indicates that, unlike speaker change detection, identifying language changes requires a larger duration around the change point and prior knowledge of the involved languages.

  2. The same hypothesis is investigated through a human subjective study.

  3. A synthetic dataset is generated using the gold-standard Indian Institute of Technology Madras text-to-speech (IITM-TTS) dataset to closely examine the relationship between language change detection (LCD) and speaker change detection (SCD).

  4. Initial baselines for LCD are established using SCD frameworks, and their performance is evaluated.

  5. Finally, these frameworks are further refined to improve the performance of LCD.

2 Database Setup

This section provides a brief description of the databases used in this study. For the human LCD/SCD study, we selected 32 and 15 utterances for the language and speaker change studies, respectively. All the utterances have exactly one language/speaker change point, with approximately \(6-8\) syllables on either side of the change point. For the language change study, the 32 utterances were collected from publicly available sources (mostly YouTube), whereas the 15 utterances for the speaker change study were chosen from the IITG-MV phase 3 and DIHARD datasets [12, 26]. The 32 utterances used for the language change study cover 10 language pairs, with 4, 4, 4, 4, 4, 2, 2, 4, 2, and 2 utterances from (1) Hindi-English (HIE), (2) Bengali-English (BEE), (3) Telugu-English (TEE), (4) Tamil-English (TAE), (5) Bengali-Assamese (BEA), (6) Bengali-Bengali (BEB), (7) Assamese-Assamese (ASA), (8) Tamil-Malayalam (TAM), (9) Tamil-Tamil (TAT), and (10) Malayalam-Malayalam (MAM), respectively. It is difficult to obtain utterances of the Bengali-Assamese and Tamil-Malayalam language pairs spoken by a single speaker. Hence, for these language pairs, utterances with and without a language change are considered along with a speaker change. The selected utterances for both the LCD and SCD tasks, along with their change point annotations, are available online.Footnote 1

Initially, the studies are performed with synthetically generated code-switched and multi-speaker utterances. For generating the utterances, we use the IITM-TTS corpus [1], which consists of speech recordings from native speakers of 13 Indian languages. For each native language, two speakers (a male and a female) recorded utterances in their native language and in English. In this study, for synthesizing the code-switched utterances, a female speaker speaking her native language Hindi and her second language English is considered. For each language, the first 5 hours of data are used for training. The remaining monolingual utterances are stitched together randomly to generate code-switched utterances. Altogether, 4000 utterances are generated, having one to five language change points each. The average monolingual segment durations of the generated code-switched utterances are approximately 6.5 and 5.2 s for Hindi and English, respectively. The generated dataset is termed the TTS female language change (TTSF-LC) corpus. Similarly, for generating speaker change utterances while keeping the language identical, we use English speech utterances from native Hindi and Assamese female speakers. The average mono-speaker segment durations of the generated utterances are 5.19 and 4.86 s, respectively. This dataset is termed the TTS female speaker change (TTSF-SC) corpus. The detailed data generation algorithm is given in Appendix 2.
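The stitching idea can be summarized by the following minimal sketch, written under our own assumptions (the file lists, sampling rate, and function names are hypothetical); the exact algorithm is given in Appendix 2.

```python
# Hypothetical sketch of generating a code-switched utterance by stitching
# randomly chosen monolingual utterances from two languages.
import random
import numpy as np
import soundfile as sf  # assumed audio I/O library

def stitch_code_switched(lang1_wavs, lang2_wavs, n_changes, sr=16000):
    """Concatenate alternating monolingual utterances so that the result
    contains `n_changes` language change points."""
    pools = [lang1_wavs, lang2_wavs]
    lang = random.randint(0, 1)            # language of the first segment
    segments, change_points, t = [], [], 0.0
    for k in range(n_changes + 1):         # n_changes + 1 segments
        x, _ = sf.read(random.choice(pools[lang]))
        segments.append(x)
        t += len(x) / sr
        if k < n_changes:
            change_points.append(t)        # ground-truth change time (s)
        lang = 1 - lang                    # switch language at the boundary
    return np.concatenate(segments), change_points

# e.g., one TTSF-LC-style utterance with one to five change points:
# y, cps = stitch_code_switched(hindi_list, english_list, random.randint(1, 5))
```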

Finally, to generalize the obtained observations, the experiments are performed on a standard LCD corpus: the Microsoft code-switching challenge task-B (MSCSTB) dataset. The dataset has development and training partitions consisting of code-switched utterances and language tags (at 200 ms resolution) from three language pairs: Gujarati-English (GUE), Tamil-English (TAE), and Telugu-English (TEE). The approximate duration of each language in the training and development sets is 16 and 2 hours, respectively. Details about the database can be found in [10].

3 Human Subjective Study for Language and Speaker Change Detection

An experimental procedure is set up in which each human subject is exposed to a pool of utterances that may or may not contain a language/speaker change. The subjects are asked to mark whether a language/speaker change exists or not. The utterances are classified into five groups, each represented by the approximate duration, in terms of the number of energy frames (NEF), taken around the true/false change point. The true change point refers to the actual change point of a selected utterance; the selected utterances are split around the change point to generate mono-language/speaker utterances. The false change point is the start location of the center energy frame of a given mono-language/speaker utterance. A frame is declared an energy frame if its short-time energy (computed with a frame size of 20 ms and a frame shift of 10 ms) exceeds a threshold of \(6\%\) of the utterance's average frame energy [24]. The 30 mono-speaker utterances are generated by splitting the selected 15 utterances around the true change point; of these 30, the 15 longest are chosen for this study. The same procedure is followed to generate the monolingual utterances from the selected code-switched utterances belonging to the HIE, BEE, TAE, and TEE language pairs. However, the utterances belonging to BEA and TAM are exceptions, as they have a speaker change along with the language change. Hence, for a fair comparison, the monolingual utterances for these cases are synthesized such that they also contain a speaker change, i.e., BEB, ASA, MAM, and TAT, respectively. After that, each utterance S(n) is masked by considering x energy frames (NEF-x) to the left and right of the true/false change point. According to the value of x, the masked utterances are grouped into five groups, termed NEF-10, NEF-20, NEF-30, NEF-50, and NEF-75. To avoid abrupt masking, a Gaussian mask G(n) with appropriate parameters is multiplied with the utterance to obtain the masked utterance \(S_{m}(n)=S(n) \times G(n)\). The masked signal is passed through an energy-based endpoint detection algorithm to obtain the final masked utterance [24]. The detailed procedure of masked utterance generation is given in Appendix 1, and the generated utterances are available online.Footnote 2
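The masking step can be illustrated with the following minimal sketch. It assumes 20 ms frames with a 10 ms shift as stated above; the Gaussian width is our assumption, and the exact mask parameters are given in Appendix 1.

```python
# Minimal sketch of the S_m(n) = S(n) * G(n) masking around a change point.
import numpy as np

def energy_frame_indices(s, sr=16000, flen=0.020, fshift=0.010, ratio=0.06):
    """Indices of frames whose short-time energy exceeds 6% of the mean."""
    n, h = int(flen * sr), int(fshift * sr)
    e = np.array([np.sum(s[i:i + n] ** 2) for i in range(0, len(s) - n, h)])
    return np.where(e > ratio * e.mean())[0]

def mask_around_change(s, change_sample, x, sr=16000, fshift=0.010):
    """Gaussian mask G(n) centered at the change point, wide enough to
    retain roughly x energy frames on either side (group NEF-x)."""
    width = x * int(fshift * sr)           # approximate span of x frames
    n = np.arange(len(s))
    g = np.exp(-0.5 * ((n - change_sample) / width) ** 2)
    return s * g                           # S_m(n) = S(n) * G(n)
```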

The listening experiment is conducted with 18 subjects, of whom 13 are male and 5 are female. The subjects are from the \(20-30\) years age group and have no prior exposure to speech samples of the speakers used in this study. All subjects are comfortable with English, while their comfort with the other languages varies. To quantify this, each subject is asked to provide a language comfortability score (LCS) from zero to three for each pair of languages.

The listening study is conducted with 390 utterances (i.e., 240 for LCD and 150 for SCD). As the LCD task is separate from SCD, the two are conducted in different sessions, and the subjects are well rested to avoid listener fatigue. A graphical user interface (GUI) is designed to perform the listening study. For a specific LCD/SCD study, all the masked utterances are presented to the listener in random order, irrespective of their segment duration. If a listener is unable to respond after a single playback, s/he is allowed to replay the utterance multiple times. The objective here is to observe how correctly humans recognize speaker and language changes by listening to utterances from the five different groups. Hence, the kinds of responses recorded in [28] for analyzing the talker change detection ability of humans are used here. Three kinds of responses are recorded: (1) whether a language/speaker change is detected or not, (2) the number of times replayed (NR), and (3) the response time (RT). RT is the time taken by a subject to provide his/her response after listening to the full utterance. It is computed by subtracting the respective utterance duration (UD) from the total duration (TD) (i.e., \(RT=TD-UD\)), where TD is the duration from pressing the play button to pressing the yes/no button.

$$\begin{aligned} DER=\frac{(FA+FR)}{N} \end{aligned}$$
(1)

For a given subject, three performance measures are computed in this study: (1) the average detection error rate (DER), (2) the average number of times replayed (NR), and (3) the average response time (RT). The DER is defined in Eq. 1, where N is the total number of trials, FA is the number of false language/speaker change utterances marked as true by the subject, and FR is the number of true language/speaker change utterances marked as false by the subject. The DER quantifies the inability of the subject to detect language/speaker changes. The NR estimates the average number of replays the subject requires to mark a response comfortably. Similarly, the RT estimates the average duration the subject requires to perceive the language/speaker change after listening to the respective utterance. Higher values of these measures indicate a greater inability of the human subject to perceive the language/speaker change, and vice versa.

Fig. 2: a DER distributions of the subjects; b distributional difference between the obtained DERs of LCD (L) and SCD (S) in terms of the F-statistic (F Stat) values of the ANOVA test

Fig. 3: Median values of the a NR and b RT distributions for LCD and SCD

After performing both the LCD and SCD experiments, the subject-specific DER, NR, and RT are computed with respect to NEF. The distributions of the obtained DER with respect to NEF are depicted in Fig. 2a. The DER values are smaller for SCD than for LCD, regardless of the NEF, suggesting that human subjects find it easier to detect a change of speaker than a change of language. Furthermore, as the NEF increases from 10 to 75, the DER decreases for both SCD and LCD, and the difference between the DER distributions of LCD and SCD shrinks. This suggests that human subjects' ease in detecting language change increases and becomes on par with speaker change as the NEF increases. To further validate this, an analysis of variance (ANOVA) test is performed between the DER distributions (after removing outliers) of LCD and SCD. The obtained F-statistic values are depicted in Fig. 2b. A higher F-statistic value indicates better discrimination between the two distributions, and vice versa. From the figure, it can be observed that the F-statistic values reduce with an increase in NEF. This supports the claim that humans' language discrimination ability improves and comes closer to their speaker discrimination ability with an increase in NEF. The median values of the recorded NR and RT are depicted in Fig. 3. Like the DER, the median values of NR and RT reduce with an increase in NEF, and both are smaller for SCD than for LCD. This shows that human subjects require a longer duration around the change point to comfortably detect a language change than a speaker change.

To observe the effect of language comfortability on detecting language change, the responses of the human subjects are considered for the groups NEF-50 and NEF-75, which have a median DER of less than 0.25 (assuming sufficient duration on either side). With respect to the LCS, the responses are segregated into four groups: 0: very low, 1: lower medium, 2: medium, and 3: excellent. The obtained DER distribution with respect to LCS is depicted in Fig. 4. The DER values decrease with an increase in LCS. This shows that a priori knowledge of the languages helps people discriminate between them better.

Fig. 4: DER versus language comfortability score (LCS) for LCD with NEF-50 and NEF-75

4 LCD and SCD Using Unsupervised Distance-Based Approach

The objective of this section is to perform the LCD task, inspired by the existing unsupervised distance-based SCD framework. In general, the SCD task is performed by computing and thresholding the distance contour obtained between the features of two consecutive sliding analysis windows of fixed length N. Initially, this study uses Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP), and shifted delta coefficients (SDC) [14] as feature representations to compare their language discrimination ability. Then, taking evidence from the discrimination study, LCD is performed. The basic block diagram of the change detection approach is depicted in Fig. 5.

First, feature vectors are extracted from the speech signal, and then energy-based speech activity detection (SAD) is performed to obtain the energy frame indices. The energy frame indices are stored for future reference, and the feature vectors corresponding to the energy frames are used for further processing. Two Gaussian distributions (\(g_{A}\) and \(g_{B}\)) are estimated using the feature vectors from two consecutive windows of fixed length N frames. The divergence distance contour is obtained by scanning the entire test utterance, sliding the analysis windows one frame at a time, as given in Eq. 2.

$$\begin{aligned} D(i)=KL(g_{A}|g_{B})+KL(g_{B}|g_{A}), \end{aligned}$$
(2)

The divergence distance contour is then smoothed with a Hamming window of length \(h_{l}\). The smoothed contour is used for peak detection with a peak-picking algorithm having a minimum peak distance parameter \(\gamma \). A higher value of \(\gamma \) reduces the number of detected peaks, and vice versa. To reduce the number of false change points, the approach of deriving a threshold contour proposed in [16] is used here. The threshold contour is computed as in Eq. 3.

$$\begin{aligned} Th(i) =\alpha \cdot \frac{1}{M}\sum _{n=0}^{M}D(i-n-1,i-n) \end{aligned}$$
(3)

where \(\alpha \) is the threshold amplification factor and M is the number of preceding samples of the divergence distance contour used for averaging. Finally, the change frames are obtained by comparing the strength of the detected peaks with the threshold contour. The change point's actual frame index and sample location are then recovered using the stored energy frame locations.
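A condensed sketch of this detector, assuming diagonal-covariance Gaussians, is given below; the hyperparameter names (\(\alpha \), \(\gamma \), \(h_{l}\), M) follow the text, while the implementation details (e.g., the centered running mean used for the threshold contour) are our simplifications.

```python
# Sketch of the unsupervised distance-based change detector (Fig. 5).
import numpy as np
from scipy.signal import find_peaks

def sym_kl_diag(mu_a, var_a, mu_b, var_b):
    """Symmetric KL divergence between two diagonal Gaussians (Eq. 2)."""
    kl_ab = 0.5 * np.sum(np.log(var_b / var_a)
                         + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)
    kl_ba = 0.5 * np.sum(np.log(var_a / var_b)
                         + (var_b + (mu_b - mu_a) ** 2) / var_a - 1.0)
    return kl_ab + kl_ba

def detect_changes(feats, N=150, alpha=1.0, gamma=0.9, h_l=50, M=100):
    """feats: (T, d) features of the energy frames. Returns change frames."""
    T = feats.shape[0]
    D = np.zeros(T)
    for i in range(N, T - N):              # two consecutive N-frame windows
        a, b = feats[i - N:i], feats[i:i + N]
        D[i] = sym_kl_diag(a.mean(0), a.var(0) + 1e-6,
                           b.mean(0), b.var(0) + 1e-6)
    w = np.hamming(h_l)
    D = np.convolve(D, w / w.sum(), mode='same')       # Hamming smoothing
    th = alpha * np.convolve(D, np.ones(M) / M, mode='same')  # Eq. 3 (approx.)
    peaks, _ = find_peaks(D, distance=int(gamma * N))  # min peak distance
    return [p for p in peaks if D[p] > th[p]]
```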

Fig. 5: Basic block diagram of the change detection framework for the unsupervised distance-based approach

Table 1 Performance of LCD and SCD with the unsupervised distance-based approach

The TTSF-SC dataset is used for designing and tuning the hyperparameters of the SCD system. Out of the 4000 test utterances, the first 100 are used to tune the hyperparameters. The performance is observed to be optimal with \(\alpha =1\), \(\gamma \) equal to 0.9 times the analysis window length, and an analysis window length of 150 frames. Keeping the methodology and hyperparameters identical, the TTSF-LC and MSCSTB datasets are used to perform the LCD task. For evaluation, the performance measures commonly used for event detection tasks, i.e., the identification rate (IDR), false acceptance rate (FAR), miss detection rate (MDR), and mean deviation (\(D_{m}\)), are used here [20, 22]. These measures are independent of the tolerance window and are calculated by observing the event activity in each region of interest (ROI) segment, where the ROI segments span the duration between the mid-locations of consecutive ground-truth change points. The obtained performances for both tasks are tabulated in Table 1.

From the results, it can be observed that the performance of SCD in terms of IDR is \(84.1\%\), whereas that of LCD is \(51.2\%\). The reduction in performance may be due to two reasons: (1) the MFCC features may fail to capture language-specific discriminative evidence, and (2) the hyperparameters, mostly the analysis window length, are tuned for SCD and may not be appropriate for LCD. Hence, to understand the issue, a study is carried out by varying the features and the analysis window length around the change point. The features most used in the literature for language identification (LID) tasks, i.e., MFCC, LPCC, SDC, and PLP, are considered here. The objective is to observe the language discrimination ability of the features for a fixed number of energy frames (NEF) x around the change point and compare it with the speaker discrimination ability of the MFCC features. This study helps explain the performance degradation of LCD compared with SCD, and the observations also help in optimally deciding the feature and analysis window length for performing LCD.

Fig. 6: Distance computation around the true and false change points of an utterance: a true change point and b, c false change points; fl_d and tr_d are the false and true distances, respectively

For this study, the TTSF-SC and TTSF-LC datasets are considered. Out of the 4000 test utterances, those having only one change point are selected, yielding 799 utterances for speaker change and 836 for language change. To observe the discrimination ability, the idea is to examine the distributional difference between the true and false distances. The true distance is the KL divergence distance between the x feature vectors on either side of the ground-truth change point. Similarly, the false distance is computed by placing the change point randomly anywhere within a mono-language/speaker segment. The procedure for computing the true and false distances is depicted in Fig. 6. To observe the effect of duration on discrimination, x is set to 10, 20, 30, 50, 75, 100, 150, 200, 250, and 300. For a given x, an ANOVA test is conducted between the obtained true and false distances. The resulting F-statistic values are depicted in Fig. 7.
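A sketch of this study, reusing sym_kl_diag from the previous sketch and with illustrative function names, could read as follows.

```python
# True/false KL distances around change points, followed by a one-way ANOVA.
import numpy as np
from scipy.stats import f_oneway

def window_distance(feats, center, x):
    """Symmetric KL distance between the x frames on either side of `center`."""
    a, b = feats[center - x:center], feats[center:center + x]
    return sym_kl_diag(a.mean(0), a.var(0) + 1e-6,
                       b.mean(0), b.var(0) + 1e-6)

def f_stat_for_nef(utterances, x, rng=np.random.default_rng(0)):
    """utterances: list of (feats, change_frame) pairs with one change each."""
    true_d, false_d = [], []
    for feats, cp in utterances:
        true_d.append(window_distance(feats, cp, x))
        # false change point: a random location inside the first segment
        false_cp = int(rng.integers(x, max(cp - x, x + 1)))
        false_d.append(window_distance(feats, false_cp, x))
    return f_oneway(true_d, false_d).statistic
```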

Fig. 7: ANOVA test F-statistic (F Stat) values obtained between the true and false KL divergence distances for the speaker/language change study with varying number of energy frames (NEF)

From the figure, it can be observed that the F-statistic values increase with an increase in NEF, saturate after a certain number of energy frames, and start decreasing thereafter. A similar trend is observed in the human LCD and SCD study, except that human performance does not degrade with an increase in NEF. This may be due to the inability of the Gaussian (with its assumption of statistical independence) to model the speaker and language spectral dynamics, leading to an increase in the class-specific variance of the distance distributions. Using the MFCC features, the F-statistic values for SCD are higher than for LCD, irrespective of the NEF. Further, the discrimination ability (in terms of F-statistics) of LCD follows that of SCD as the NEF increases. The highest F-statistic values for the speaker and language change studies are obtained at NEF 150 and 200, respectively.

In addition, for the language change study, the MFCC features provide the best F-statistic values, followed by PLP, LPCC, and SDC. For clearer observation, the distance distributions of the MFCC features for SCD, and of the MFCC and PLP features for LCD, with NEF of 50, 150, 200, and 250, are depicted as box plots in Fig. 8. From the box plots, it can be noticed that speaker and language discrimination saturate at NEF 150 and 200, respectively. Though the box plots appear to show better discrimination at larger NEF, the increase in within-class variance decreases the F-statistic values. Furthermore, the discrimination ability of MFCC is better than that of PLP, as the separation between the true and false distance distributions of the MFCC features is higher than that of the PLP features for LCD at NEF equal to 200. This motivates considering the MFCC features with an analysis window length of 200 for performing LCD on the TTSF-LC dataset. The performance of the LCD task with the modified analysis window length is tabulated in Table 1.

Fig. 8: True (T) and false (F) distance distributions (D-dist) a with the MFCC features for speaker change, b with the MFCC features for language change, c with the PLP features for language change, and d the corresponding F-statistic (F Stat) values

The table shows that the performance in terms of IDR, FAR, and MDR follows the observations made with respect to the F-statistics. The performance obtained on the TTSF-LC dataset with the MFCC features (with an analysis window length of 200) is \(66.1\%\) in terms of IDR, a relative improvement of \(29.1\%\), followed by an IDR of \(64.06\%\) using the PLP features. Similar observations are made on the MSCSTB dataset, where the performance in terms of IDR improves relatively by \(2.72\%\), \(2.85\%\), and \(1.63\%\) with analysis window lengths of 160, 180, and 170 for the GUE, TAE, and TEE language pairs, respectively. The analysis window lengths of 160, 180, and 170 are decided greedily by evaluating the performance for analysis window lengths from 100 to 250 in steps of 10 on the first 100 test trials. This justifies the hypothesis that LCD requires information of relatively longer duration than SCD.

5 Language Change Detection by Model-Based Approach

The human SCD and LCD study suggests that prior exposure to the languages makes humans more efficient at detecting language changes. This motivates extracting statistical/embedding vectors from a trained machine learning (ML)/deep learning (DL) framework and using them to perform the change detection task. The detailed procedure is explained in the following subsections.

5.1 Model-Based Change Detection Framework

The block diagram of the model-based change detection framework is depicted in Fig. 9.

Fig. 9: Block diagram for the model-based change detection study

5.1.1 Training of Statistical/Embedding Vector Extractor and PLDA Classifier

From the training data, MFCC\(+\Delta +\Delta \Delta \) features are first computed, and then SAD is used to select the feature vectors. The selected feature vectors are used to train statistical models, such as the universal background model (UBM), the adapted models, and the total variability matrix (T-matrix), and a DL model, namely the TDNN-based x-vector model. The statistical vectors, i.e., u/a/i-vectors, are extracted using the trained UBM, adapted models, and T-matrix, respectively. The u-vectors and a-vectors are obtained by computing the zeroth-order statistics from the UBM and the adapted models, respectively. The zeroth-order statistics are computed using Eq. 4, where \(1 \le i \le M\), M is the number of mixture components, \(x_{j}\) are the MFCC features, and T is the number of frames. The u-vectors are the M-dimensional vectors extracted using the UBM, whereas the a-vectors are the concatenation of the M-dimensional vectors extracted from the class-specific adapted models. The i-vectors are extracted as described in [8]. Similarly, the x-vectors are extracted from the trained TDNN-based x-vector model. Both the statistical and embedding vectors are computed by considering N feature vectors as the analysis window. The extracted vectors are then used to train the linear discriminant analysis (LDA) projection, the within-class covariance normalization (WCCN) matrix, and the probabilistic LDA (PLDA) model.

$$\begin{aligned} N(i)=\frac{1}{T}\sum _{j=1}^{T}P(i|x_{j}) \end{aligned}$$
(4)
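A small sketch of Eq. 4 is shown below, with scikit-learn's GaussianMixture standing in for the trained UBM (an assumption; any GMM implementation providing per-frame posteriors would do).

```python
# u-vector from zeroth-order statistics (Eq. 4).
import numpy as np
from sklearn.mixture import GaussianMixture

def u_vector(ubm: GaussianMixture, feats: np.ndarray) -> np.ndarray:
    """feats: (T, d) MFCC+deltas of one N-frame analysis window.
    Returns N(i) = (1/T) * sum_j P(i | x_j) for i = 1..M."""
    post = ubm.predict_proba(feats)   # (T, M) posteriors P(i | x_j)
    return post.mean(axis=0)          # average over the T frames

# a-vector (sketch): concatenate the same statistics computed from the
# class-specific adapted GMMs, e.g.
# a_vec = np.concatenate([u_vector(gmm_cls1, f), u_vector(gmm_cls2, f)])
```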

5.1.2 Change Detection During Testing

During testing, the feature vectors are extracted from the code-switched utterances. After that, using the SAD labels, the statistical/embedding (S/E) vectors are extracted over a fixed number of frames using the trained models. The S/E vector extraction and the distance contour for each test utterance are computed using Eq. 5, where the \(x_{i}\) are the feature vectors, \(\psi (.)\) is the distance computation function, and \({\mathbb {F}}(.)\) is the mapping function from the feature space to the S/E vector space.

$$\begin{aligned} D(i)=\psi ({\mathbb {F}}(x_{i-N},\ldots ,x_{i}),{\mathbb {F}}(x_{i+1},\ldots ,x_{i+N})) \end{aligned}$$
(5)

The distance contour is then smoothed using a Hamming window of length \(h_{l}\), where \(h_{l}\) is taken as \(1/\delta \) times N. The peaks of the smoothed contour are computed, and the peaks whose magnitude exceeds the threshold contour are considered change points.
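A sketch of Eq. 5 follows; `extract_embedding` stands for the mapping \({\mathbb {F}}(.)\) (e.g., an i/x-vector extractor) and is assumed to be available, while \(\psi (.)\) is shown here as a cosine distance (a PLDA-based score would replace it where the text uses PLDA).

```python
# Distance contour over sliding S/E vectors (Eq. 5).
import numpy as np

def cosine_distance(a, b):
    """One choice of psi(.): cosine distance between two embeddings."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def embedding_distance_contour(feats, extract_embedding, N,
                               psi=cosine_distance):
    """D(i) = psi(F(x_{i-N..i}), F(x_{i+1..i+N})) for each frame i."""
    T = feats.shape[0]
    D = np.zeros(T)
    for i in range(N, T - N):
        e_left = extract_embedding(feats[i - N:i])
        e_right = extract_embedding(feats[i:i + N])
        D[i] = psi(e_left, e_right)
    return D
```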

5.2 Experimental Setup

The TTSF-SC dataset is used for SCD, whereas TTSF-LC and MSCSTB are used for performing LCD tasks. The training partition of the TTSF-SC, TTSF-LC, and MSCSTB is used to train the statistical/embedding vector extractor, and the testing partition is used to evaluate the performance of the LCD and SCD tasks.

The 39-dimensional MFCC\(+\Delta +\Delta \Delta \) feature vectors are computed from the speech signal with 20 ms and 10 ms as the window and hop durations, respectively. The energy frames are those whose frame energy is greater than \(6\%\) of the utterance's average frame energy. The UBM and adapted models are trained with a cluster size of 32. The dimensions of the u/a/i-vectors are 32, 64, and 50, respectively. The x-vector framework has 8 layers, of which 5 operate at the frame level and the remaining 3 at the segment level (i.e., over N frames). The number of neurons in each layer is 512, 512, 512, 512, 1500, 512, 512, and 2, respectively. For the speaker-specific study, the x-vector model is trained without dropout and L2 normalization, whereas for the language-specific study, dropouts of 0.2 in the second, third, fourth, and sixth layers are used along with L2 normalization.
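A hedged PyTorch sketch of this architecture is given below: 5 frame-level TDNN layers, statistics pooling, and 3 segment-level layers with the stated neuron counts. The kernel sizes and dilations follow the common x-vector recipe and, like the simplified dropout placement here, are our assumptions.

```python
# Sketch of the 8-layer TDNN x-vector model (layer widths from the text).
import torch
import torch.nn as nn

class XVector(nn.Module):
    def __init__(self, feat_dim=39, n_classes=2, dropout=0.2):
        super().__init__()
        def tdnn(i, o, k, d):
            return nn.Sequential(nn.Conv1d(i, o, k, dilation=d),
                                 nn.ReLU(), nn.BatchNorm1d(o))
        self.frame = nn.Sequential(          # 5 frame-level layers
            tdnn(feat_dim, 512, 5, 1), tdnn(512, 512, 3, 2),
            tdnn(512, 512, 3, 3), tdnn(512, 512, 1, 1),
            tdnn(512, 1500, 1, 1))
        self.segment = nn.Sequential(        # 3 segment-level layers
            nn.Linear(3000, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_classes))

    def forward(self, x):                    # x: (batch, feat_dim, N)
        h = self.frame(x)
        stats = torch.cat([h.mean(-1), h.std(-1)], dim=1)  # stats pooling
        return self.segment(stats)

# The x-vector embedding is conventionally taken from the first
# segment-level affine layer, before the nonlinearity.
```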

During training, the speaker/language-specific feature vectors are used to extract the S/E vectors disjointly with a fixed N, whereas during testing, the S/E vectors are extracted with a single-frame shift. All the models are trained for 20 epochs. For TTSF-LC, the optimal N is decided experimentally as 200, and for TTSF-SC, N is taken as 50. After training, by observing the validation loss and accuracy, the models corresponding to the \(15^{th}\) and \(11^{th}\) epochs are chosen for the language- and speaker-specific studies, respectively. Similarly, for MSCSTB, x-vector models are trained for each language pair. After training for 100 epochs, by observing the validation loss and accuracy, the models belonging to the \(54^{th}\), \(29^{th}\), and \(26^{th}\) epochs for \(N=200\) are chosen for the GUE, TAE, and TEE language pairs, respectively.

For TTSF-LC and TTSF-SC, the extracted embedding vectors are normalized without applying LDA and WCCN. The normalized vectors are used for modeling the PLDA and computing the distance contour for the LCD and SCD tasks. On the MSCSTB dataset, it is observed that applying LDA and WCCN, along with using a cosine kernel distance instead of the PLDA distance contour, improves the change detection performance. This may be due to the nature of the datasets: TTSF-LC and TTSF-SC are studio recordings of read speech, whereas MSCSTB consists of conversational recordings in an office environment.

For the SCD task, after extracting the i/x-vectors, the change points are detected for each test utterance using the hyperparameters \(\alpha \), \(\delta \), and \(\gamma \) set to 2.6, 1.3, and 0.9, respectively. The hyperparameters are decided greedily by observing the change detection performance on the first 100 test trials. For the LCD task (using TTSF-LC), the hyperparameters are set to 3.2, 1.3, and 0.9, respectively. Similarly, for MSCSTB (\(N=200\)), the optimal hyperparameters for GUE, TAE, and TEE are (0.3, 4.5, and 1.1), (0.3, 4.5, and 1.1), and (0.3, 3.9, and 1.1), respectively.

5.3 Language Discrimination by Statistical/Embedding Vectors

The aim here is to observe the language discrimination ability of the extracted S/E vectors by synthetically emulating the code-switched (CS) scenario. The TTSF-LC corpus, where the same speaker speaks two languages, is considered for this study. The training partition is used to train the UBM, the adapted models, the T-matrix, and the TDNN-based x-vector model. From the test partition, two utterances spoken by the same speaker are selected, one from each language. Using the selected utterances, the MFCC\(+\Delta +\Delta \Delta \) features and the S/E vectors are extracted and projected into two dimensions using t-SNE [18].

Fig. 10: t-SNE feature distributions of Hindi (H) and English (E) for a MFCC features, b u-vector, c a-vector, d i-vector, and e x-vector. Within- and between-language PLDA score distributions, with EERs of f 28.5, g 17.35, h 12.55, and i 3.6 for the u-, a-, i-, and x-vector, respectively

The two-dimensional representations are depicted in Fig. 10a–e. It can be observed that the overlap between the languages reduces when moving from the feature space to the S/E vector space. This shows that, as with human subjects, prior exposure to the languages through ML/DL models helps in better discrimination. Furthermore, among the S/E vectors, the overlap between the languages is least in the x-vector space, followed by the i-vector, adapted-model, and UBM posterior spaces. This reflects the ability of the respective modeling techniques to capture the language-specific feature dynamics.

To strengthen this observation, the features extracted from the test utterances are pooled with respect to each language. The pooled feature vectors are randomly segmented with a context of 200 frames and used to extract the S/E vectors. The extracted S/E vectors are paired to form 2000 within-language (WL) and 2000 between-language (BL) trials, and the WL and BL vector pairs are compared using PLDA scores. Figure 10f–i shows box plots of the PLDA score distributions of the WL and BL pairs. From the box plots, it can be observed that the overlap between the WL and BL PLDA score distributions reduces as the modeling technique improves from the UBM to the x-vector.

In the change point detection task, the aim is to obtain a sudden change in the distance contour when there is a change in language. This can be achieved if the contour (the negative of the PLDA score) varies little for WL pairs and changes suddenly for BL pairs. Hence, the separation between the WL and BL PLDA score distributions should be maximized. Keeping this in mind, the equal error rate (EER) is used as an objective measure, where the WL and BL trials are treated as false scores and true scores, respectively. The obtained EERs for the u-vector, a-vector, i-vector, and x-vector are 28.5, 17.35, 12.55, and 3.6, respectively. Hence, based on discrimination ability, the change point detection study is carried out using the i/x-vectors as the representations of the speaker and language.
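The EER computation can be sketched as a simple threshold sweep over the pooled scores (our own illustrative implementation, not tied to any particular toolkit):

```python
# EER between true (BL) and false (WL) score sets.
import numpy as np

def equal_error_rate(true_scores, false_scores):
    """Return the EER (%) at the threshold where the false-acceptance
    and false-rejection rates are (approximately) equal."""
    true_scores = np.asarray(true_scores)
    false_scores = np.asarray(false_scores)
    thr = np.sort(np.concatenate([true_scores, false_scores]))
    far = np.array([(false_scores >= t).mean() for t in thr])  # false accepts
    frr = np.array([(true_scores < t).mean() for t in thr])    # false rejects
    i = int(np.argmin(np.abs(far - frr)))
    return 100.0 * (far[i] + frr[i]) / 2.0
```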

5.4 Experimental Results

Initially, the change detection study is conducted with TTSF-SC and TTSF-LC using i/x-vectors as the speaker/language representation. The discrimination ability and the LCD/SCD study suggest that the x-vector is a better representation of the speaker/language than the i-vector. Therefore, the LCD task on the MSCSTB dataset is conducted by considering x-vectors as language representations.

The experimental results are tabulated in Table 2. The performance obtained in terms of IDR on the SCD task using the i-vector and x-vector is \(87.75\%\) and \(92.27\%\), respectively. Similarly, for the LCD task on TTSF-LC, the performances are \(80.58\%\) and \(87.01\%\), respectively. As evidenced by the language discrimination study, LCD achieves relative improvements of \(21.9\%\) and \(31.63\%\) using i-vectors and x-vectors, respectively, over the best performance achieved with the unsupervised distance-based approach. This justifies the claim that, as for humans, the performance of LCD can be improved by incorporating language-specific prior information through computational models.

Table 2 Performance of LCD and SCD by model-based approaches
Fig. 11: Distribution of the monolingual segment durations of a TTSF-LC's HE, and MSCSTB's b GUE, c TAE, and d TEE language pairs, respectively

The performance of the LCD task on the MSCSTB dataset, using x-vectors as the language representation with N set to 200 (the same as for TTSF-LC), is \(46.56\%\), \(49.91\%\), and \(47.13\%\) in terms of IDR for the GUE, TAE, and TEE partitions, respectively. This provides relative improvements of \(5.6\%\), \(2.3\%\), and \(4.2\%\) over the best performance achieved with the unsupervised distance-based approach. However, the improvement is small compared with that achieved on the TTSF-LC data. This may be due to the distributional difference in the monolingual segment durations of the TTSF-LC and MSCSTB datasets. A box plot showing the distribution of the monolingual segments of the TTSF-LC and MSCSTB test sets is depicted in Fig. 11. The median monolingual segment durations for the primary and secondary languages are (5.54 and 4.9) s for TTSF-LC, and (1.46 and 0.51), (1.54 and 0.41), and (1.61 and 0.41) s for the GUE, TAE, and TEE partitions of MSCSTB, respectively. Further, it has been observed that language discrimination is better with N equal to 200 (i.e., approximately 2 s). Hence, as the monolingual segment durations of the MSCSTB dataset are smaller than the considered analysis window duration, the resulting distance contour is smoothed, which increases the MDR. The alternative is to reduce the analysis window length, but that may affect the language discrimination ability of the x-vectors.

Fig. 12: Within- and between-language cosine score distributions, with EERs of a 7.1, b 9.8, c 12.8, d 19.8, and e 29.2 for analysis window lengths of 200, 150, 100, 75, and 50, respectively

The language discrimination test is performed using the GUE partition of the MSCSTB dataset by reducing the analysis window length from 200 to 50. The cosine score distributions of the x-vectors' WL and BL pairs after the LDA and WCCN projections, with varying analysis window lengths, are depicted in Fig. 12. The test shows that the language discrimination ability reduces with a decrease in analysis window duration: the overlap between the WL and BL score distributions increases as N decreases. As an objective measure, the computed EERs for N equal to 200, 150, 100, 75, and 50 are 7.1, 9.8, 12.8, 19.8, and 29.2, respectively. Hence, resolving this issue may require a better embedding extraction strategy that provides good language discrimination over a small analysis window, or a framework that is independent of variations in the analysis window duration.

6 Discussion

The human-based LCD and SCD study suggests that comfortable language discrimination requires more neighborhood information than speaker discrimination. Further, prior exposure to the languages helps humans discriminate between them better. Motivated by this, it is hypothesized that the performance of LCD by machine can be improved by (a) incorporating a larger analysis window duration (N) and (b) providing language-specific exposure through computational models.

In the unsupervised distance-based approach, it is observed that the performance of LCD improves with an increase in the value of N. The optimal N for the SCD study is 150. With the same N, the LCD task is carried out on both the TTSF-LC and MSCSTB datasets, and the performances are tabulated in Table 3. For the MSCSTB dataset, the average IDR over all three language pairs is tabulated. Motivated by the human LCD/SCD study, the N value is increased, and the obtained optimal N for LCD on TTSF-LC is 200. Similarly, the optimal N for MSCSTB is 160, 180, and 170 for GUE, TAE, and TEE, respectively. The performance with the optimal N for TTSF-LC and MSCSTB is \(66.1\%\) and \(46.02\%\), providing relative improvements of \(29.1\%\) and \(2.4\%\), respectively. These observations justify the claim that the performance of LCD by machine can be improved by increasing the analysis window duration.

Table 3 Performance comparison

Furthermore, as hypothesized from the subjective study, incorporating language-specific exposure through computational models improves LCD performance. The i/x-vector models, which essentially capture the language-specific cepstral dynamics, have been trained for this purpose. With the x-vector approach, the obtained performance in terms of IDR is \(87.01\%\) for TTSF-LC and \(52.59\%\) for MSCSTB, providing relative improvements of \(31.63\%\) and \(4.01\%\) over the unsupervised distance-based approach. Similarly, for the SCD task on the TTSF-SC dataset, the relative improvement is \(9.71\%\). Comparing the performance of LCD and SCD on the synthetic data, the improvement is more significant for LCD than for SCD. This suggests that, as in the human subjective study, under ideal conditions (only speaker/language variation, with other variations limited), the model-based approach matters more for LCD than for SCD.

It is also observed that, in the LCD task, the performance improvement on the MSCSTB data is limited compared with that achieved on the synthetic TTSF-LC dataset, due to the difference in monolingual segment durations. The trade-off between the analysis window duration and the language discrimination ability shows that the discrimination ability improves with an increase in analysis window duration. At the same time, during change detection, since the monolingual segment duration can be less than 500 ms (approximately 50 energy frames), a larger analysis window degrades performance by smoothing the evidence contour (leading to an increase in MDR). Hence, to overcome this issue, there is a need to (1) achieve significant language discrimination with as small an N value as possible, and (2) develop a framework whose performance is least affected by, or independent of, variations in the analysis window duration.

7 Conclusion

In this work, we performed LCD using the available SCD frameworks. From the subjective study, it is observed that humans require comparatively more neighborhood information around a language change point than around a speaker change point. It is also observed that prior language-specific exposure improves performance on the LCD task. In the unsupervised distance-based approach, incorporating larger neighborhood information improves LCD performance relatively by \(29.1\%\) and \(2.4\%\) on the synthetic TTSF-LC and the practical MSCSTB datasets, respectively. Similarly, incorporating language-specific prior information through computational models provides relative improvements of \(31.63\%\) and \(4.01\%\) over the unsupervised distance-based approach. This shows that it is indeed possible to perform LCD by appropriately tuning the SCD frameworks.

It is also observed that the practical dataset does not perform as well as the synthetic data, owing to the distributional difference in monolingual segment durations between the two datasets. The MSCSTB dataset contains monolingual segments with durations of less than 0.5 s, whereas better language discrimination requires a duration of about 2 s (about 200 energy frames). Hence, it is challenging to decide on the analysis window duration: a larger duration smooths the evidence contour and increases the MDR, whereas a smaller duration of 0.5 s cannot provide adequate language discrimination.

Therefore, our future work will attempt to develop a framework that provides better language discrimination over small durations, and a change detection framework whose performance is independent of, or less affected by, variations in the analysis window duration.