Spoken language change detection inspired by speaker change detection

Spoken language change detection (LCD) refers to identifying the language transitions in a code-switched utterance. Similarly, identifying the speaker transitions in a multispeaker utterance is known as speaker change detection (SCD). Since the two tasks are similar, the architecture/framework developed for the SCD task may be suitable for the LCD task. Hence, the aim of the present work is to develop LCD systems inspired by SCD. Initially, both LCD and SCD are performed by humans. The study suggests that, compared to SCD, humans require (a) a larger duration around the change point and (b) language-specific prior exposure for performing LCD. The larger duration requirement is incorporated by increasing the analysis window length of the unsupervised distance-based approach. This leads to a relative performance improvement of 29.1% and 2.4%, and a priori language knowledge provides a relative improvement of 31.63% and 14.27% on the synthetic and practical code-switched datasets, respectively. The performance difference between the practical and synthetic datasets is mostly due to differences in the distribution of the monolingual segment duration.


I. INTRODUCTION
Spoken language diarization (LD) is the task of automatically segmenting and labeling the monolingual segments in a given multilingual speech signal. Existing works on LD are very few (Sitaram et al., 2019). The majority of them use phonotactic (i.e. the distribution of sound units) based approaches (Chan et al., 2004; Lyu et al., 2013; Spoorthy et al., 2018). The development of LD using a phonotactic-based approach requires transcribed speech utterances. These are difficult to obtain, as most of the languages present in code-switched multilingual utterances are resource-scarce in nature (Sitaram et al., 2019; Spoorthy et al., 2018). Even though there exist some transfer learning approaches that adapt the phonotactic models of a high-resource language to obtain models for a low-resource language, they may suffer performance degradation if the two languages are not from the same language group (Sitaram et al., 2019). Further, LD is effortless for humans, especially for known languages, and challenging for machines. Hence there is a need to explore alternative approaches for LD.
Speaker diarization (SD) is the task of automatically segmenting and labeling the mono-speaker segments of a given multispeaker utterance, and is well explored in the literature. Though there exist differences in the information that needs to be captured to perform the LD and SD tasks, there exist many similarities. For example, features approximating the vocal tract resonances have been successfully used for the modeling of both speaker- and language-specific phonemes (Carrasquillo et al., 2002; Li et al., 2013; Liu et al., 2021). Furthermore, most of the approaches used for spoken language identification (LID) are inspired by the approaches used for the speaker identification/verification (SID/SV) task (Richardson et al., 2015; Snyder et al., 2018). In addition, most of the successful LID systems that are borrowed from the SID/SV literature do not require transcribed speech data (Li et al., 2013; Snyder et al., 2018). In contrast, LID systems developed using the phonotactic approach require transcribed speech data. This motivates a study of the close association between the LD and SD tasks, which may be exploited to come up with approaches for LD.
a) jagabandhu.mishra.18@iitdh.ac.in b) prasanna@iitdh.ac.in
The SD field has evolved mainly in two ways: (1) change point detection followed by clustering and boundary refinement, and (2) fixed duration segmentation followed by i-vector/embedding vector extraction, clustering, and boundary refinement (Moattar and Homayounpour, 2012; Park et al., 2022; Tranter and Reynolds, 2006). Bredin et al. (2017), Dawalatabad et al. (2020), Hogg et al. (2019), and Park et al. (2022) reported that initial change point detection improved overall SD performance. Thus this study focuses on the development of spoken language change detection (LCD) through a comparative analysis between LCD and speaker change detection (SCD). The available SCD approaches can be broadly classified into two groups: (1) the distance-based unsupervised approach and (2) the model-based supervised approach (Moattar and Homayounpour, 2012; Park et al., 2022). The distance-based approach predicts a speaker change by applying hypothesis testing (whether or not the frames come from a single speaker) to speaker-specific features extracted from the speech signal with consecutive sliding windows (Moattar and Homayounpour, 2012; Park et al., 2022). Following this approach, many feature extraction techniques like excitation source (Dhananjaya and Yegnanarayana, 2008; Sarma et al., 2015), fundamental frequency contour (Hogg et al., 2019), etc., and distance metrics like Kullback-Leibler (KL) divergence (Siegler et al., 1997), Bayesian information criterion (BIC) (Chen et al., 1998), KL2 (Siegler et al., 1997), generalized likelihood ratio (GLR) (Gish et al., 1991), and information bottleneck (IB) (Dawalatabad et al., 2020) have been proposed in the literature. Generally, the performance of the distance-based unsupervised approach degrades with variation in environment and background noise (it may predict false changes); hence, to resolve the issue, supervised model-based approaches have been proposed in the literature (Moattar and Homayounpour, 2012; Park et al., 2022). In the early days, the proposed approaches modeled individual speakers using the Gaussian mixture model and universal background model (GMM-UBM) (Barras et al., 2006; Moattar and Homayounpour, 2012), hidden Markov model (HMM) (Meignier et al., 2006), etc. Nowadays, using the deep learning framework, the approach predicts the speaker change by discriminating between speaker change segments (the neighborhood of the speaker change point) and no-change segments (Moattar and Homayounpour, 2012; Park et al., 2022). However, the model-based approach smooths the output evidence and may lead to missed detection of the change points (Moattar and Homayounpour, 2012). In addition, training of the supervised model requires labeled speech data from a similar environment/recording condition, speaking style, language, etc., making the system development complicated. Therefore the distance-based unsupervised approaches are more popular and widely used for SCD tasks (Dawalatabad et al., 2020; Moattar and Homayounpour, 2012; Park et al., 2022).
Even though the available SCD frameworks look simple to adopt, there are challenges in doing so. Fig. 1 (a) and (b) show the time domain speech signals corresponding to an utterance having a speaker change and an utterance having a language change, respectively. The speaker/language change points are manually marked by listening to and observing the time domain representation of both utterances. From the time domain signal alone, it is very difficult to locate both the speaker and language change points. Fig. 1 (c) and (d) show the spectrograms of both utterances. From the spectrograms, it can be observed that around the speaker change the formant structure shows significant variation, whereas around the language change the structure is intact. When the speaker changes, the vocal tract system information changes, and hence the formant structure varies. However, the structure of the formant frequencies remains intact during a language change, as a single speaker is speaking both languages. It is interesting to note that humans discriminate between spoken languages without knowing the detailed lexical rules and phonemic distribution of the respective languages. Of course, humans need to have prior exposure to the languages (Li et al., 2013). Humans may exploit the long-term phoneme dynamics to discriminate between languages. Therefore, a language change may be detected by capturing the long-term language-specific spectro-temporal dynamics. These may represent valid phoneme sequences and their combinations to form syllables and subwords of a language.
Based on the need to exploit the long-term spectro-temporal evidence, it can be hypothesized that LCD by a human/machine may require a longer neighborhood duration around the change point than SCD. In addition, LCD may also benefit from prior exposure to the respective languages. A human subjective study that focuses on language/speaker change detection is set up to validate this hypothesis.
For automatic detection of language change, the initial studies are performed using the available unsupervised distance-based and supervised model-based SCD approaches. The model-based approaches include GMM-UBM, i-vector, and x-vector. Based on the experimental results for LCD and SCD, appropriate modifications are made to each framework to improve the performance of the LCD task.
The main contributions of this work are summarized as follows: (a) by observing the spectro-temporal representation around the speaker and language change, it is hypothesized that detecting a language change requires a larger duration around the change point and a priori knowledge of the language, as compared to detecting a speaker change, and this hypothesis is confirmed by the human subjective study; (b) the SCD frameworks are used as initial baselines to perform LCD and their performances are analyzed; and (c) these frameworks are further refined to improve the performance of LCD.

II. DATABASE SETUP
This section provides a brief description of the database used in this study.
Initially, the studies have been performed with synthetically generated code-switched and multi-speaker utterances. For generating the utterances, we have used the Indian Institute of Technology Madras text-to-speech (IITM-TTS) corpus (Baby et al., 2016). The IITM-TTS corpus consists of speech data recordings from native speakers of 13 Indian languages. For each native language, two speakers (a male and a female) recorded their utterances in their native language and English. In this study, for synthesizing the code-switched utterances, a female speaker speaking her native language Hindi and her second language English is considered. For each language, the first 5 hours of data are used for training purposes. The rest of the monolingual utterances are stitched randomly to generate code-switched utterances. Altogether, 4000 utterances are generated, having one to five language change points. The average monolingual segment durations of the generated code-switched utterances for Hindi and English are approximately 6.5 and 5.2 secs, respectively. The generated dataset is termed the TTS female language change (TTSF-LC) corpus. Similarly, for generating speaker change utterances while keeping the language identical, we have used English speech utterances from native Hindi and Assamese female speakers. The average mono-speaker segment durations of the generated utterances are 5.19 and 4.86 secs, respectively. The generated dataset is termed the TTS female speaker change (TTSF-SC) corpus.
Finally, for generalizing the obtained observations, the experiments are performed on a standard LCD corpus, the Microsoft code-switched challenge task-B (MSCSTB) dataset. The dataset has development and training partitions that consist of code-switched utterances and language tags (one per 200 msec) from three language pairs: Gujarati-English (GUE), Tamil-English (TAE), and Telugu-English (TEE). The approximate duration of each language in the training and development sets is 16 and 2 hours, respectively. Details about the database can be found in (Diwan et al., 2021).
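As an illustration of how ground-truth change points can be read off tags given every 200 msec, the following minimal sketch marks a change wherever two consecutive tags differ. The tag symbols ("T", "E") and the function name are hypothetical; the actual MSCSTB label format may differ (for example, it may include a silence tag that needs separate handling).

```python
def change_points_from_tags(tags, tag_dur=0.2):
    """Return the times (in seconds) at which consecutive language tags differ.

    tags    : sequence of per-segment language labels (hypothetical symbols)
    tag_dur : duration covered by one tag, 200 msec as stated for the dataset
    """
    return [round(i * tag_dur, 3)
            for i in range(1, len(tags))
            if tags[i] != tags[i - 1]]


# Hypothetical Tamil-English tag sequence with two language changes.
tags = ["T", "T", "T", "E", "E", "T"]
print(change_points_from_tags(tags))  # changes at 0.6 s and 1.0 s
```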

III. HUMAN SUBJECTIVE STUDY FOR LANGUAGE AND SPEAKER CHANGE DETECTION
An experimental procedure has been set up in which each human subject is exposed to a pool of utterances that may or may not contain a language/speaker change. The subjects are asked to mark whether or not a language/speaker change exists. The utterances are classified into five groups. Each group is characterized by an approximate duration, expressed as the number of voiced frames (NVF) taken around the true/false change point. The true change point refers to the actual change point of a selected utterance. The selected utterances are split around the change point to generate the mono-language/speaker utterances. The false change point is the start location of the central voiced frame of a given mono-language/speaker utterance. A frame is declared voiced using a threshold of 6% of the average short-time frame energy (computed with a frame size of 20 msec and a frameshift of 10 msec) of the given utterance (Rabiner, 1978). The 30 mono-speaker utterances are generated by splitting the selected 15 utterances around the true change point. Out of these 30, the 15 longest are chosen for this study. The same procedure is followed to generate the monolingual utterances from the selected code-switched utterances belonging to the HIE, BEE, TAE, and TEE language pairs. However, the utterances belonging to BEA and TAM are an exception, as they have a speaker change along with the language change. Hence, for a fair comparison, the monolingual utterances for these cases are synthesized such that they also have a speaker change, i.e. BEB, ASA, MAM, and TAT, respectively. After that, each utterance S(n) is masked by considering x voiced frames (NVF-x) to the left and right of the true/false change point. According to the value of x, the masked utterances are grouped into five different groups, termed NVF-10, NVF-20, NVF-30, NVF-50, and NVF-75. To avoid abrupt masking, a Gaussian mask G(n) with appropriate parameters is multiplied with the utterance to obtain the masked utterance, i.e. the sample-wise product of S(n) and G(n). The masked signal is passed through an energy-based endpoint detection algorithm to obtain the final masked utterance (Rabiner, 1978). The detailed procedure of the masked utterance generation is given in the supplementary material, and the generated utterances are available at https://github.com/jagabandhumishra/HUMAN-SUBJECTIVE-STUDY-FOR-LCD-and-SCD.
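The voiced-frame selection and masking steps can be sketched as follows. This is only an illustrative sketch, not the procedure from the supplementary material: the flat-top shape of the mask, the `sigma_frac` parameter, and all function names are assumptions, since the text only states that a Gaussian mask "with appropriate parameters" is used. The 20 msec frame, 10 msec shift, and 6% energy threshold follow the description above.

```python
import numpy as np


def voiced_frame_starts(signal, sr, frame_ms=20, shift_ms=10, ratio=0.06):
    """Start samples of frames whose energy exceeds ratio * average frame energy."""
    flen = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    starts = np.arange(0, len(signal) - flen + 1, shift)
    energy = np.array([np.sum(signal[s:s + flen] ** 2) for s in starts])
    return starts[energy > ratio * energy.mean()]


def gaussian_mask(n_samples, left, right, sigma_frac=0.25):
    """Assumed mask shape: 1 between left and right, Gaussian decay outside."""
    n = np.arange(n_samples)
    sigma = max(sigma_frac * (right - left), 1.0)  # hypothetical width choice
    mask = np.ones(n_samples)
    mask[n < left] = np.exp(-0.5 * ((n[n < left] - left) / sigma) ** 2)
    mask[n > right] = np.exp(-0.5 * ((n[n > right] - right) / sigma) ** 2)
    return mask


def mask_around(signal, sr, change_sample, x):
    """Keep roughly x voiced frames on each side of change_sample (NVF-x)."""
    v = voiced_frame_starts(signal, sr)
    left_v = v[v < change_sample][-x:]
    right_v = v[v >= change_sample][:x]
    left = left_v[0] if len(left_v) else 0
    right = right_v[-1] if len(right_v) else len(signal) - 1
    # Sample-wise product of the utterance S(n) and the mask G(n).
    return signal * gaussian_mask(len(signal), left, right)
```

In this sketch the region between the outermost selected voiced frames is passed unchanged and everything outside it is attenuated smoothly, which avoids the abrupt onset that a rectangular cut would introduce.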
The listening experiment is conducted with 18 subjects, of whom 13 are male and 5 are female. The selected subjects are from the 20-30 years age group. The subjects have no prior exposure to the voice samples of the speakers used in this study. All subjects are comfortable with English, whereas for the other languages the comfort level varies. To quantify this, each subject is asked to provide a language comfortability score (LCS) from zero to three for each pair of languages.
The listening study is conducted with 390 utterances (i.e. 240 for LCD and 150 for SCD). The LCD task is separate from SCD and is hence conducted in a different session, and the subjects are well rested so that they do not experience listener fatigue. A graphical user interface (GUI) has been designed to perform the listening study. For a specific LCD/SCD study, all the masked utterances are presented to the listener in random order, irrespective of their segment duration. If a listener is unable to provide a response after playing an utterance once, s/he is allowed to play the utterance multiple times. Our objective here is to observe how correctly humans recognize the speaker and language change by listening to the utterances coming from the five different groups. Hence, the responses recorded in (Sharma et al., 2019) for analyzing the talker change detection ability of humans are used here. Three kinds of responses have been recorded: (1) language/speaker change detected or not, (2) the number of times replayed (NR), and (3) response time (RT). RT is the time taken by a subject to provide his/her response after listening to the full utterance. The RT is computed by subtracting the respective utterance duration (UD) from the total duration (TD), i.e. RT = TD − UD. The TD is the time taken by a subject to provide his/her response, from pressing the play button to pressing the yes/no button.
For a given subject, three kinds of performance measures are computed in this study: (1) average detection error rate (DER), (2) average number of times replayed (NR), and (3) average response time (RT). The DER is defined in Eq. 1, where N is the total number of trials, FA is the number of false language/speaker change utterances marked as true by the subject, and FR is the number of true language/speaker change utterances marked as false by the subject. The DER measures the inability of the subject to detect a language/speaker change. The NR provides an estimate of the average number of replays required for the subject to mark their response comfortably. Similarly, the RT provides an estimate of the average duration required for the subject to perceive the language/speaker change after listening to the respective utterance. A higher value of these performance measures indicates the inability of the human subject to perceive the language/speaker change, and vice versa.
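The three per-subject measures can be sketched as below. Eq. 1 is not reproduced in this text, so the form DER = (FA + FR) / N used here is an assumption that is consistent with the stated definitions of N, FA, and FR. The dictionary keys and the function name are hypothetical.

```python
def subject_measures(trials):
    """Compute (DER, average NR, average RT) for one subject.

    Each trial is a dict with hypothetical keys:
      is_change   : True if the utterance contains a real change
      said_change : True if the subject marked a change
      replays     : number of times the utterance was replayed
      total_dur   : TD, play-button press to yes/no press (seconds)
      utt_dur     : UD, utterance duration (seconds)
    """
    n = len(trials)
    fa = sum(1 for t in trials if not t["is_change"] and t["said_change"])
    fr = sum(1 for t in trials if t["is_change"] and not t["said_change"])
    der = (fa + fr) / n                                  # assumed form of Eq. 1
    avg_nr = sum(t["replays"] for t in trials) / n
    avg_rt = sum(t["total_dur"] - t["utt_dur"] for t in trials) / n  # RT = TD - UD
    return der, avg_nr, avg_rt


trials = [
    {"is_change": True,  "said_change": True,  "replays": 0, "total_dur": 3.0, "utt_dur": 2.0},
    {"is_change": True,  "said_change": False, "replays": 1, "total_dur": 5.0, "utt_dur": 2.0},
    {"is_change": False, "said_change": True,  "replays": 0, "total_dur": 2.5, "utt_dur": 2.0},
    {"is_change": False, "said_change": False, "replays": 1, "total_dur": 3.5, "utt_dur": 2.0},
]
print(subject_measures(trials))  # one miss and one false alarm out of four trials
```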
After performing both the LCD and SCD experiments, the subject-specific DER, NR, and RT are computed with respect to NVF. The distributions of the obtained DER with respect to the NVF are depicted in Fig. 2(a). It can be seen that the DER values are smaller for SCD than for LCD, regardless of the NVF. This suggests that human subjects are more comfortable with detecting the switching of speakers than of languages. Furthermore, as the NVF increases from 10 to 75, the DER decreases for both SCD and LCD.
For observing the effect of language comfortability on detecting a language change, the responses of the human subjects are considered for the groups NVF-50 and NVF-75, which have a median DER of less than 0.25 (assuming sufficient duration on either side). With respect to the LCS, the responses are segregated into four groups: 0: very low, 1: lower medium, 2: medium, and 3: excellent. The obtained DER distribution with respect to LCS is depicted in Fig. 4. From the figure, it can be observed that the DER values decrease with an increase in LCS. This suggests that a priori knowledge of languages helps people to better discriminate between languages.

IV. LANGUAGE CHANGE DETECTION BY UNSUPERVISED DISTANCE-BASED APPROACH
The objective of this section is to perform LCD tasks inspired by the existing unsupervised distance-based SCD framework. In general, the SCD task is performed by computing and thresholding a distance contour obtained between the features of sliding analysis windows of fixed length N. The basic block diagram of the approach is depicted in Fig. 5.
First, feature vectors are extracted from the speech signal, and then energy-based voice activity detection (VAD) is performed to obtain the voiced frame indices. The voiced frame indices are stored for future reference, and the feature vectors corresponding to the voiced frames are used for further processing. The voiced feature vectors within two consecutive windows of fixed length are used to model two different Gaussian distributions (g_a and g_b). The divergence distance contour is obtained by scanning the entire test utterance, sliding the analysis windows by one frame at a time, as mentioned in Eq. 2. The evidence contour is then smoothed with a Hamming window of length h_l. The smoothed contour is then used for peak detection, with a peak-picking algorithm having a minimum peak distance parameter called γ. A higher value of γ reduces the number of detected peaks and vice versa. For reducing the number of false change points, the approach of deriving a threshold contour proposed in (Lu and Zhang, 2002) and mentioned in Eq. 3 is used here. Finally, the change frame is obtained by comparing the strength of the detected peaks with the threshold contour. The change point's actual frame index and sample location are obtained by using the stored voiced frame locations.
Initially, we used the TTSF-SC dataset for designing and tuning the hyperparameters of the SCD system. Out of 4000 test utterances, the first 100 utterances are used to tune the hyperparameters. It has been observed that the performance is optimal with α = 1, γ equal to 0.9 times the analysis window length, and an analysis window length of 150. Keeping the methodology and hyperparameters identical, the TTSF-LC and MSCSTB datasets are used to perform the LCD task. For evaluating the performance, the commonly used performance measures for event detection tasks, i.e. identification rate (IDR), false acceptance rate (FAR), miss rate (MR), and mean deviation (D_m), are used here (Mishra et al., 2021; Murty and Yegnanarayana, 2008). The performances of both tasks are tabulated in Table I.
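The pipeline described above (two adjacent windows, a divergence contour, Hamming smoothing, peak picking with a minimum peak distance, and a threshold contour) can be sketched as follows. Eq. 2 and Eq. 3 are not reproduced in this text, so several choices here are assumptions: a symmetric KL divergence between diagonal-covariance Gaussians stands in for the divergence measure, the threshold contour is taken as α times the average of the preceding contour values in the spirit of Lu and Zhang (2002), and the default smoothing length is arbitrary. The input `feats` is assumed to already contain only voiced frames; mapping the detected indices back to sample locations would use the stored voiced frame indices.

```python
import numpy as np
from scipy.signal import find_peaks


def sym_kl_diag(a, b, eps=1e-6):
    """Symmetric KL divergence between diagonal Gaussians fitted to blocks a and b."""
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    va, vb = a.var(axis=0) + eps, b.var(axis=0) + eps
    return 0.5 * np.sum(va / vb + vb / va - 2.0 + (ma - mb) ** 2 * (1.0 / va + 1.0 / vb))


def detect_changes(feats, win=150, alpha=1.0, gamma_frac=0.9, smooth_len=None):
    """Return (change frame indices, smoothed contour) for voiced features (T, D)."""
    centers = np.arange(win, len(feats) - win + 1)
    # Distance between the window ending at c and the window starting at c.
    d = np.array([sym_kl_diag(feats[c - win:c], feats[c:c + win]) for c in centers])

    # Smooth the contour with a normalized Hamming window (length is an assumption).
    h = smooth_len or max(win // 2, 1)
    hw = np.hamming(h)
    d_s = np.convolve(d, hw / hw.sum(), mode="same")

    # Peak picking with a minimum peak distance of gamma = gamma_frac * win frames.
    peaks, _ = find_peaks(d_s, distance=max(int(gamma_frac * win), 1))

    # Assumed threshold contour: alpha times the mean of the preceding `win` values.
    csum = np.concatenate([[0.0], np.cumsum(d_s)])
    thr = np.array([alpha * (csum[i] - csum[max(i - win, 0)]) / max(min(i, win), 1)
                    for i in range(len(d_s))])

    keep = peaks[d_s[peaks] > thr[peaks]]
    return centers[keep], d_s
```

With α = 1 a peak only needs to exceed the local average of the contour, so this sketch can still report spurious peaks in homogeneous regions; raising α trades false alarms for misses.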
From the results, it can be observed that the performance of SCD in terms of IDR is 84.1%, whereas the performance of LCD in terms of IDR is 51.2%. The reduction in performance may be due to two reasons: (1) the MFCC features used may fail to capture language-specific discriminative evidence, and (2) the hyperparameters, mostly the analysis window length, are tuned for SCD and may not be appropriate for LCD. Hence, to understand the issue, a study is carried out by varying the features and the analysis window length around the change point. The most used features in the literature for language identification (LID) tasks, i.e. MFCC, LPCC, SDC, and PLP, are considered here. The objective is to observe the language discriminative ability of the features by considering a fixed number of voiced frames (NVF), x, around the change point and to compare it with the speaker discrimination ability of the MFCC feature. This study helps to explain the performance degradation of LCD as compared to SCD. Further, the observations also help to decide the optimal feature and analysis window length for performing LCD.
For performing the study, the TTSF-SC and TTSF-LC datasets are considered. Out of 4000 test utterances, the utterances having only one change point are selected. The number of utterances selected for speaker change and language change is 799 and 836, respectively. For observing the discrimination ability, the idea is to observe the distributional difference between the true and false distances. The true distance is the KL divergence distance between the x feature vectors on either side of the ground truth change point. Similarly, the false distance is computed by placing the change point randomly anywhere in the mono-language/speaker segments. The procedure of computing the true and false distances is depicted in Fig. 6. For observing the effect of duration on the discrimination, the value of x is considered as 10, 20, 30, 50, 75, 100, 150, 200, 250, and 300, respectively. For a given x value, the ANOVA test is conducted between the obtained true and false distances.
The obtained F-statistics values of the ANOVA test are depicted in Fig. 7.
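For two groups of distances, the one-way ANOVA F-statistic is the ratio of the between-group variance to the within-group variance. The sketch below computes it explicitly and can be cross-checked against `scipy.stats.f_oneway`; the synthetic distances in the usage lines are made up for illustration and are not the paper's data.

```python
import numpy as np
from scipy.stats import f_oneway


def f_statistic(true_d, false_d):
    """One-way ANOVA F-statistic between true-change and false-change distances."""
    groups = [np.asarray(true_d, float), np.asarray(false_d, float)]
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    # Between-group mean square: spread of the group means around the grand mean.
    between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (len(groups) - 1)
    # Within-group mean square: spread of the values around their own group mean.
    within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (len(all_vals) - len(groups))
    return between / within


# Illustrative, synthetic distances only.
rng = np.random.default_rng(0)
true_d = rng.normal(5.0, 1.0, 200)
false_d = rng.normal(2.0, 1.0, 200)
print(f_statistic(true_d, false_d), f_oneway(true_d, false_d).statistic)
```

Because the within-group term is in the denominator, the F-statistic can fall even when the group means move further apart, if the spread inside each group grows faster.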
From the figure, it can be observed that the F-statistics values increase with an increase in NVF, saturate after a certain number of voiced frames, and start decreasing after that. A similar trend is observed in the LCD and SCD study by humans. However, in the case of humans, the performance does not degrade with an increase in NVF. The degradation for the machine may be due to the inability of the Gaussian (with its assumption of statistical independence) to model the speaker and language spectral dynamics, leading to an increase of the class-specific variance in the distance distribution.
Using the MFCC feature, the F-statistics values of SCD are higher than those of LCD irrespective of the NVF. Further, it can also be observed that the discrimination ability (in terms of F-statistics) of LCD follows that of SCD with an increase in the NVF. Furthermore, the highest F-statistics values for the speaker and language change studies are obtained at NVF 150 and 200, respectively. In addition, for the language change study, the MFCC features provide the best F-statistics value, followed by PLP, LPCC, and SDC. For clear observation, the distance distributions of the MFCC feature for SCD and of the MFCC and PLP features for LCD with NVF of 50, 150, 200, and 250 are depicted through box plots in Fig. 8. From the box plots, it can also be noticed that the speaker and language discrimination saturates at NVF 150 and 200, respectively. Though the box plots at larger NVF appear to show better discrimination, the increase in within-class variance leads to a decrease of the F-statistics values. Furthermore, the discrimination ability of MFCC is better than that of PLP, as the separation between the true and false distance distributions of the MFCC feature is higher than that of the PLP feature for LCD at NVF equal to 200. This motivates us to consider the MFCC feature with an analysis window length of 200 for performing LCD on the TTSF-LC dataset. The performance of the LCD task with the modified analysis window length is tabulated in Table I.
The table shows that the performance in terms of IDR, FAR, and MR follows the observations noticed with respect to the F-statistics. The performance obtained for the TTSF-LC dataset with the MFCC feature (considering an analysis window length of 200) is 66.1% in terms of IDR, a relative improvement of 29.1% over the 51.2% baseline, followed by an IDR of 64.06% using the PLP feature. Similar observations are made on the MSCSTB dataset, where the IDR improves relatively by 2.72%, 2.85%, and 1.63% with analysis window lengths of 160, 180, and 170 for the GUE, TAE, and TEE language pairs, respectively. The analysis window lengths of 160, 180, and 170 are decided greedily by evaluating the performance for analysis window lengths from 100 to 250 with a shift of 10 on the first 100 test trials. Hence, this supports the hypothesis that LCD requires information over a relatively longer duration than SCD.

V. LANGUAGE CHANGE DETECTION BY MODEL-BASED APPROACH
The SCD and LCD study by humans suggests that prior exposure to the language makes humans more efficient in detecting a language change. This motivates extracting statistical/embedding vectors from a trained machine learning (ML)/deep learning (DL) framework and using them to perform the change detection tasks. The detailed procedure is explained in the following subsections.

A. Model-based change detection framework
The block diagram of the model-based change detection framework is depicted in Fig. 9. From the training data, MFCC+∆+∆∆ features are first computed, and the voiced feature vectors are selected for further processing by using VAD. The voiced feature vectors are used to train statistical models like the universal background model (UBM), the adaptation model, and the total variability matrix (T-matrix), and DL models like the TDNN-based x-vector model. The statistical vectors, i.e. the u/a/i-vectors, are extracted using the trained UBM/adapt model/T-matrix, respectively. The u-vectors and a-vectors are obtained by computing the zeroth order statistics from the UBM and adapt models, respectively. The zeroth order statistics are computed using Equation 4, where 1 ≤ i ≤ M, M is the number of mixture components, x_j are the MFCC features, and T is the number of voiced frames. The u-vectors are the M-dimensional vectors extracted using the UBM model, whereas the a-vectors are the concatenation of the M-dimensional vectors extracted from the class-specific adapt models. The i-vectors are extracted as mentioned in (Dehak et al., 2010). Similarly, the x-vectors are extracted from the trained TDNN-based x-vector model. Both the statistical and embedding vectors are computed by considering N voiced feature vectors as the analysis window length. The extracted vectors are then used to train the linear discriminant analysis (LDA) matrix, the within-class covariance normalization (WCCN) matrix, and the probabilistic LDA (PLDA) model.
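Equation 4 is not reproduced in this text. A common definition of the zeroth order statistic of mixture i is the sum over the T voiced frames of the posterior probability of that mixture given frame x_j; the sketch below uses this form as an assumption (the original may additionally normalize by T). A diagonal-covariance GMM is assumed, and the function and argument names are hypothetical.

```python
import numpy as np


def zeroth_order_stats(x, weights, means, variances):
    """Assumed form of Eq. 4: n_i = sum_j P(i | x_j) for a diagonal-covariance GMM.

    x         : (T, D) voiced feature vectors
    weights   : (M,)   mixture weights
    means     : (M, D) mixture means
    variances : (M, D) diagonal covariances
    Returns an M-dimensional vector (a u-vector if the GMM is the UBM).
    """
    diff = x[:, None, :] - means[None, :, :]                          # (T, M, D)
    log_pdf = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2.0 * np.pi * variances), axis=1))  # (T, M)
    log_post = np.log(weights) + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)                   # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                           # P(i | x_j)
    return post.sum(axis=0)
```

Under this reading, an a-vector would be the concatenation of such vectors computed with each class-specific adapted model, which matches the stated dimensions of 32 for the u-vector and 64 for the a-vector with 32 mixtures and two classes.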
During testing, the feature vectors are extracted from the code-switched utterances. After that, using the VAD labels, the statistical/embedding (S/E) vectors are extracted with a fixed number of voiced frames using the trained models. The S/E vector extraction and the distance contour for each test utterance are computed using Eq. 5, where the x_i are the voiced feature vectors, ψ(.) is the distance computation function, and F(.) is the mapping function from the feature space to the S/E vector space.
The distance contour is then smoothed using a Hamming window of length h_l, where h_l is set to 1/δ times N. The peaks of the smoothed contour are computed, and the peaks whose magnitude is greater than the threshold contour are considered as the change points.
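The model-based contour can be sketched generically: a mapping F(.) turns each block of N voiced frames into an S/E vector, and a distance function ψ(.) compares the vectors of the two blocks on either side of each candidate frame. Eq. 5 is not reproduced in this text, so this windowing is an assumed reading of it. The mean-pooling `mean_embedding` below is only a hypothetical stand-in for a trained i-vector or x-vector extractor, and cosine distance stands in for ψ(.); PLDA scoring would replace it in the setup described above.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cosine similarity; used here as a stand-in for psi(.)."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)


def mean_embedding(block):
    """Hypothetical stand-in for F(.): a real system would run a trained extractor."""
    return block.mean(axis=0)


def embedding_contour(feats, embed=mean_embedding, n=200, dist=cosine_distance):
    """Distance between S/E vectors of the N frames before and after each frame."""
    centers = np.arange(n, len(feats) - n + 1)
    d = np.array([dist(embed(feats[c - n:c]), embed(feats[c:c + n])) for c in centers])
    return centers, d
```

The resulting contour would then be smoothed with a Hamming window of length N/δ and thresholded, as described above.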

B. Experimental Setup
The TTSF-SC dataset is used for SCD, whereas TTSF-LC and MSCSTB are used for performing LCD tasks. The 39-dimensional MFCC+∆+∆∆ feature vectors are computed from the speech signal with 20 msec and 10 msec as window and hop duration, respectively. The voiced frames are those whose frame energy is greater than 6% of the utterance's average frame energy. The UBM and adapt models are trained with a cluster size of 32. The dimensions of the u/a/i-vectors are 32, 64, and 50, respectively. The recipe from the SpeechBrain toolkit is used to train and extract the 512-dimensional x-vectors (Ravanelli et al., 2021). For the speaker-specific study, the x-vector models are trained without dropout and L2 normalization, whereas for the language-specific study, dropouts of 0.2 in the second, third, fourth, and sixth layers are used along with L2 normalization.
During training, the speaker/language-specific voiced feature vectors are used to extract the S/E vectors disjointly with a fixed N, whereas during testing the S/E vectors are extracted with a sample frameshift. All the models have been trained for 20 epochs. For TTSF-LC the optimal N is decided experimentally as 200, and for TTSF-SC N is considered as 50. After training, by observing the validation loss and accuracy, the models corresponding to the 15th and 11th epochs are chosen for the language- and speaker-specific studies, respectively. Similarly, for MSCSTB, x-vector models are trained for each language pair. After training for 100 epochs, by observing the validation loss and accuracy, the models belonging to the (54th, 29th, and 26th) epochs for N = 200 and the (25th, 80th, and 18th) epochs for N = 50 are chosen for the GUE, TAE, and TEE language pairs, respectively.
For TTSF-LC and TTSF-SC, the extracted embedding vectors are normalized without applying LDA and WCCN. The normalized vectors are used for modeling the PLDA and computing the distance contour for the LCD and SCD tasks. Using the MSCSTB dataset, it is observed that performing LDA and WCCN, along with using the cosine kernel distance instead of the PLDA distance contour, improves the change detection performance. This may be due to the nature of the datasets. TTSF-LC and TTSF-SC are studio recordings of read speech, whereas MSCSTB consists of conversational recordings in an office environment.

C. Language discrimination by statistical/embedding vectors
The aim here is to observe the ability of the extracted S/E vectors to discriminate between languages, by synthetically emulating the CS scenario. The TTSF-LC dataset, where the same speaker speaks two languages, is considered for this study. The training partition is used to train the UBM, adapt, T-matrix, and TDNN-based x-vector models. From the test partition, two utterances are selected, one from each language, spoken by the same speaker. Using the selected utterances, the MFCC+∆+∆∆ features and the S/E vectors are extracted and projected into two dimensions using t-SNE (Maaten and Hinton, 2008). The two-dimensional representations are depicted in Fig. 10(a-e). From the figure, it can be observed that the overlap between the languages reduces when moving from the feature space to the S/E vector space. This shows that, as with human subjects, prior exposure to the languages through ML/DL models helps in better discrimination. Furthermore, among the S/E vectors, the overlap between the languages is least in the x-vector space, followed by the i-vector, adapt, and UBM posterior spaces. This is due to the ability of the modeling techniques to capture the language-specific feature dynamics.
FIG. 10. t-SNE feature distribution between the Hindi (H) and English (E) (a) MFCC features, (b) u-vector, (c) a-vector, (d) i-vector, and (e) x-vector. Within and between language PLDA score distribution, with EER of (f) 28.5, (g) 17.35, (h) 12.55, and (i) 3.6 for u, a, i, and x-vector, respectively.
To strengthen this observation, the features are extracted from the test utterances and pooled per language. The pooled feature vectors are randomly segmented with a context of 200 frames and used to extract the S/E vectors. The extracted S/E vectors are paired to form 2000 within-language (WL) and 2000 between-language (BL) trials. The WL and BL vector pairs are compared using PLDA scores. Fig. 10(f-i) shows box plots of the PLDA score distributions of the WL and BL pairs. From the box plots, it can be observed that the overlap between the WL and BL PLDA score distributions reduces as the modeling technique improves from the UBM to the x-vector.
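The trial construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the helper names (`cosine_score`, `make_trials`, `toy_vectors`) and the two-dimensional toy vectors are hypothetical, and cosine similarity stands in for the PLDA score.

```python
import random
import math

def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def make_trials(vecs_l1, vecs_l2, n_trials, seed=0):
    """Randomly pair vectors into WL and BL trials and score each pair."""
    rng = random.Random(seed)
    wl_scores, bl_scores = [], []
    for _ in range(n_trials):
        # WL trial: two different vectors drawn from the same language.
        pool = rng.choice([vecs_l1, vecs_l2])
        a, b = rng.sample(pool, 2)
        wl_scores.append(cosine_score(a, b))
        # BL trial: one vector from each language.
        bl_scores.append(cosine_score(rng.choice(vecs_l1), rng.choice(vecs_l2)))
    return wl_scores, bl_scores

def toy_vectors(center, n, rng, spread=0.1):
    """Hypothetical stand-in for S/E vectors: Gaussian noise around a center."""
    return [[c + rng.gauss(0.0, spread) for c in center] for _ in range(n)]

rng = random.Random(1)
hindi = toy_vectors([1.0, 0.0], 50, rng)    # toy "language 1" vectors
english = toy_vectors([0.0, 1.0], 50, rng)  # toy "language 2" vectors
wl, bl = make_trials(hindi, english, n_trials=2000)
print(sum(wl) / len(wl), sum(bl) / len(bl))
```

With well-separated toy clusters, the mean WL score is much higher than the mean BL score; the less the two score distributions overlap, the better the representation discriminates between the languages.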
In the change point detection task, the aim is to obtain a sudden change in the distance contour wherever the language changes. This can be achieved if the contour (the negative of the PLDA score) varies little for WL pairs and changes abruptly for BL pairs. To ensure this, the separation between the WL and BL PLDA score distributions should be maximized. With this in mind, the equal error rate (EER) is used as an objective measure, where the WL and BL trials are treated as false and true trials, respectively. The EERs obtained for the UBM (u-vector), adapt (a-vector), i-vector, and x-vector are 28.5, 17.35, 12.55, and 3.6, respectively. Hence, based on their discrimination ability, the change point detection study is carried out using i/x-vectors as the speaker and language representations.
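The EER measure can be sketched as a simple threshold sweep. This is an illustrative sketch, not the scoring tool used in the study: the function name `equal_error_rate` and the toy PLDA-like scores are hypothetical, and the distance is taken as the negative of the score so that BL (true) trials receive the larger values.

```python
def equal_error_rate(true_scores, false_scores):
    """EER (%) where true trials are expected to score higher than false trials.

    Sweeps every observed score as a threshold and returns the average of the
    false-rejection and false-acceptance rates at the point where they are closest.
    """
    thresholds = sorted(set(true_scores) | set(false_scores))
    best_gap, eer = None, None
    for thr in thresholds:
        # True trials falling below the threshold are missed.
        frr = sum(s < thr for s in true_scores) / len(true_scores)
        # False trials at or above the threshold are wrongly accepted.
        far = sum(s >= thr for s in false_scores) / len(false_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return 100.0 * eer

# Toy PLDA-like scores (hypothetical values): WL pairs score high, BL pairs low.
wl_plda = [5.0, 4.0, 3.0, 0.5]
bl_plda = [-4.0, -3.0, -1.0, 1.0]

# Distance = negative score, so BL (true) trials get the larger distances.
bl_dist = [-s for s in bl_plda]
wl_dist = [-s for s in wl_plda]
print(equal_error_rate(bl_dist, wl_dist))  # 25.0 for these toy values
```

A lower EER means less overlap between the WL and BL distributions, which is the property the distance contour relies on to produce a clear peak at a language change.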

D. Experimental Results
Initially, the change detection study is conducted on TTSF-SC and TTSF-LC using i/x-vectors as the speaker/language representation. The discrimination ability and the LCD/SCD study suggest that the x-vector is a better representation of the speaker/language than the i-vector. Therefore, the LCD task on the MSCSTB dataset is conducted using x-vectors as the language representation.
The experimental results are tabulated in Table II. The performance obtained in terms of IDR on the SCD task using the i-vector and x-vector is 87.75% and 92.27%, respectively. Similarly, for the LCD task the performances on TTSF-LC are 80.58% and 87.01%, respectively. Consistent with the language discrimination study, the i-vector and x-vector LCD systems provide relative improvements of 21.9% and 31.63%, respectively, over the best performance achieved with the unsupervised distance-based approach. This supports the claim that, as with humans, the performance of LCD can be improved by incorporating language-specific prior information through computational models.
The performance of the LCD task on the MSCSTB dataset, using x-vectors as the language representation with N set to 200 (the same as for TTSF-LC), is 46.56%, 49.91%, and 47.13% in terms of IDR for the GUE, TAE, and TEE partitions, respectively. This corresponds to relative improvements of 5.6%, 2.3%, and 4.2%. However, the improvement is small compared to that achieved on the TTSF-LC data. This may be due to the difference in the monolingual segment duration of the two datasets, shown in Fig. 11. From the figure, it can be observed that the median monolingual segment durations for the primary and secondary languages are 5.54 and 4.9 seconds for TTSF-LC, and (1.46 and 0.51), (1.54 and 0.41), and (1.61 and 0.41) seconds for the GUE, TAE, and TEE partitions of MSCSTB, respectively. Further, it has been observed that language discrimination is better with N equal to 200 (i.e., approx. 2 seconds). Hence, because the monolingual segments of the MSCSTB dataset are shorter than the analysis window, the resulting distance contour is smoothed, which leads to an increase in the MDR. The alternative is to reduce the analysis window length, but that may affect the language discrimination ability of the x-vectors.
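The smoothing effect of a long analysis window on a short monolingual segment can be illustrated with an idealized, noise-free example. This sketch is not the x-vector system of the study: the function `distance_contour` is hypothetical, each frame is represented only by its language label (0 or 1), and the distance is the absolute difference between the mean labels of the N frames before and after each frame. Frames are assumed to be 10 ms apart, consistent with N = 200 corresponding to about 2 seconds.

```python
def distance_contour(labels, n):
    """Distance between the N frames before and the N frames after each index.

    Returns a dict mapping frame index t to |mean(right window) - mean(left window)|.
    """
    contour = {}
    for t in range(n, len(labels) - n + 1):
        left = labels[t - n:t]    # N frames preceding t
        right = labels[t:t + n]   # N frames starting at t
        contour[t] = abs(sum(right) / n - sum(left) / n)
    return contour

# 3 s of language 0, a 0.5 s (50-frame) segment of language 1, then 3 s of language 0.
labels = [0] * 300 + [1] * 50 + [0] * 300
change_point = 300  # first frame of the short secondary-language segment

peak_long = distance_contour(labels, 200)[change_point]   # N = 200 (about 2 s)
peak_short = distance_contour(labels, 50)[change_point]   # N = 50 (about 0.5 s)
print(peak_long, peak_short)  # 0.25 1.0
```

With N = 200 the 50-frame segment fills only a quarter of the window, so the peak at the change point drops to a quarter of its ideal height and is easier to miss. With N = 50 the peak is fully preserved, but in a real system the shorter window carries less evidence for language discrimination.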
A study is performed to observe the trade-off between the analysis window length and the language discrimination ability. The language discrimination test and the LCD task are performed using the GUE partition of the MSCSTB dataset by reducing the analysis window length from 200 to 50. The results of the LCD task are tabulated in Table III. The cosine score distributions of the x-vectors' WL and BL pairs after the LDA and WCCN projection, for varying analysis window lengths, are depicted in Fig. 12. From the table, it can be observed that the performance of the LCD task improves as N decreases, with the best performance of 54.74% achieved at N equal to 50. Hence, the change detection performance is computed with N equal to 50 for the GUE, TAE, and TEE language pairs and tabulated in Table II. However, the relative performance improvement obtained by incorporating language-specific prior exposure through the x-vector model is smaller than that observed on the TTSF-LC dataset. This is because the language discrimination ability of the x-vectors reduces as N decreases. From Fig. 12, it can be observed that the overlap between the WL and BL score distributions increases as N decreases. As an objective measure, the EERs computed for N equal to 200, 150, 100, 75, and 50 are 7.1, 9.8, 12.8, 19.8, and 29.2, respectively.

VI. DISCUSSION
The human-based LCD and SCD study suggests that language discrimination requires more neighborhood information than speaker discrimination. Further, prior exposure to the languages helps humans to discriminate between them better. Motivated by this, it is hypothesized that the performance of LCD by machine can be improved by (a) incorporating a larger analysis window (N) and (b) providing language-specific exposure through computational models.
In the unsupervised distance-based approach, it has been observed that the performance of LCD improves as N increases. The optimal N value for the SCD study is 150. With the same value of N, the LCD task is carried out on both the TTSF-LC and MSCSTB datasets, and the performances are tabulated in Table IV. For the MSCSTB dataset, the IDR values averaged over all three language pairs are tabulated. Motivated by the LCD/SCD study by humans, the N value is increased, and the optimal N value obtained for LCD with TTSF-LC is 200. Similarly, the optimal N value for MSCSTB is 160, 180, and 170 for the GUE, TAE, and TEE partitions, respectively. The performance with the optimal N value for TTSF-LC and MSCSTB is 66.1% and 46.02%, which corresponds to a relative improvement of 29.1% and 2.4%, respectively. These observations support the claim that the performance of LCD by machines can be improved by increasing the analysis window duration.
Furthermore, as hypothesized from the subjective study, the incorporation of language-specific exposure through computational models improves LCD performance. The i/x-vector models are trained to capture the language-specific cepstral dynamics. With the x-vector approach, the performance obtained in terms of IDR is 87.01% for TTSF-LC and 52.59% for MSCSTB, which corresponds to a relative improvement of 31.63% and 14.27% over the unsupervised distance-based approach. Similarly, for the SCD task on the TTSF-SC dataset, the x-vector approach provides a relative improvement of 9.71%. Comparing the performance of LCD and SCD on synthetic data, it can be observed that the improvement is larger for LCD than for SCD. This suggests that, as in the human subjective study, under ideal conditions (only speaker/language variation, with other variations kept limited), the need for a model-based approach is greater for LCD than for SCD.
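The relative improvements quoted above can be checked with a short calculation. This is a sketch under the assumption that relative improvement is defined as the gain divided by the baseline performance; the text does not state the formula, but this definition is consistent with the quoted IDR values. The function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, improved):
    """Relative improvement (%) of `improved` over `baseline` (assumed definition)."""
    return 100.0 * (improved - baseline) / baseline

# IDR values (%) quoted in the text: unsupervised baseline with optimal N,
# followed by the x-vector result.
ttsf_lc = relative_improvement(66.1, 87.01)
mscstb = relative_improvement(46.02, 52.59)
print(round(ttsf_lc, 2), round(mscstb, 2))
```

The computed values agree with the reported 31.63% and 14.27% up to rounding of the quoted IDR figures. The same function applied to the i-vector result on TTSF-LC (80.58% against 66.1%) gives approximately 21.9%.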
It is also observed that, in the LCD task, the performance improvement on the MSCSTB data is limited compared to the improvement achieved on the synthetic TTSF-LC dataset. This is due to the difference in the monolingual segment duration. The trade-off between the analysis window duration and the language discrimination ability shows that the discrimination ability improves as the analysis window duration increases. At the same time, since the monolingual segment duration can be less than 500 msec (approx. 50 voiced frames), a larger analysis window degrades change detection performance by smoothing the evidence contour, which leads to an increase in MDR. Hence, to overcome this issue, it is necessary (1) to achieve significant language discrimination with as small an N value as possible, and (2) to develop a framework whose performance is least affected by, or independent of, variations in the analysis window duration.

VII. CONCLUSION
In this work, we performed LCD using the available frameworks for SCD. From the subjective study, it is observed that humans require more neighborhood information around the change point for language change detection than for speaker change detection. It is also observed that prior language-specific exposure improves the performance of the LCD task. In the unsupervised distance-based approach, the incorporation of larger neighborhood information improves the LCD performance by a relative 29.1% and 2.4% on the synthetic TTSF-LC and the practical MSCSTB dataset, respectively. Similarly, incorporating language-specific prior information through the computational models provides a relative improvement of 31.63% and 14.27% over the unsupervised distance-based approach.
It has also been observed that the performance on the practical dataset does not reach the level achieved on the synthetic data. This is due to the difference in the distribution of the monolingual segment duration between the two datasets. The MSCSTB dataset contains monolingual segments shorter than 0.5 seconds, whereas the duration required for better language discrimination is about 2 seconds (about 200 voiced frames). Hence, it is challenging to decide on the analysis window duration. A larger duration smooths the evidence contour and increases the MDR, whereas a smaller duration of 0.5 seconds does not provide adequate language discrimination. Therefore, our future work will attempt to develop a framework that provides better language discrimination over short durations, and a change detection framework whose performance is independent of, or less affected by, variations in the analysis window duration.

itY), Govt. of India, for supporting us through different projects.

FIG. 1. (a) and (c) Two-speaker time-domain speech signal and its spectrogram, respectively. (b) and (d) Two-language (bilingual) time-domain speech signal and its spectrogram, respectively.
FIG. 2. (a) DER distributions of the subjects, (b) F-statistics (F Stat) values of the ANOVA test between the DER distributions of the LCD (L) and SCD (S) studies, respectively.

FIG. 5. Basic block diagram of the change detection framework for the unsupervised distance-based approach.

FIG. 6. Distance computation around the true and false change points of an utterance: (a) true change point and (b), (c) false change points. fl d and tr d are the false and true distances, respectively.
FIG. 7. ANOVA test F-statistics (F Stat) values obtained between the true and false KL divergence distances for the speaker/language change study with a varying number of voiced frames (NVF).

FIG. 9. Block diagram for the model-based change detection study.

TABLE I. Performance of LCD and SCD with the unsupervised distance-based approach. A: with N = 150 (tuned for SCD) and B: with the optimal N value (tuned for LCD).

TABLE II. Performance of LCD and SCD by model-based approaches. S: statistical i-vector, E: embedding-based x-vector, N: analysis window length.

TABLE IV. Performance comparison. RI: relative improvement, A: with N = 150 (tuned for SCD), B: with the optimal N value (tuned for LCD), and C: x-vector-based approach.