1 Introduction

Spoken language recognition, also called language identification (LID), is the task of identifying the language spoken in an utterance [1]. It is often used as a pre-processing step in multilingual applications such as speech translation [2] and multilingual speech recognition [3].

Classical end-to-end LID methods consist of a feature extractor that maps variable-length speech segments to fixed-length embeddings [4] and an identification module that makes the final decision. Features at different levels are beneficial for LID [5, 6], including acoustic features [7, 8], phonetic features [9], syntactic features [10], and prosodic features [11]. Among them, acoustic features are the most commonly used in classical end-to-end LID systems [4, 12, 13].

Although classical end-to-end LID methods use acoustic features directly as input, phonetic features better reveal the basic difference between languages, that is, the distribution or frequency of phones [5]. Since phonetic features represent information at a higher level than acoustic features, they are more robust to noise and channel variation [9]. Therefore, recent end-to-end methods mainly focus on phonetic features [14] (also called bottleneck features [15]), which are extracted by an automatic speech recognition (ASR) task.

While classical end-to-end LID methods using acoustic features can implicitly learn phonetic knowledge as well, phonetic features are more easily discovered by an ASR task with frame-level phonemic labels, since utterance-level linguistic labels may be too coarse [16]. ASR tasks extract phonetic features efficiently and filter out noise such as speaker information. Phonetic features extracted from a hidden layer of a pre-trained ASR model can be interpreted as a compression of phonetic information [17,18,19], and they are much richer: they are at the frame level and encode compact information about all phones [9].

Lately, end-to-end ASR models have achieved outstanding performance and largely simplified multilingual modeling by learning shared representations directly from data [20, 21]. These advances have greatly simplified the extraction of phonetic features for LID, making it a research hot-spot in recent years. Watanabe et al. [22] present a single multilingual model with a unified vocabulary that can recognize speech in 10 languages. Multi-task learning that jointly learns linguistic and phonetic information is studied in [23, 24]. Ren et al. [25] build a two-stage language identification system that outperforms a baseline multi-task system. Wang et al. [26] analyze different conformer-based architectures and demonstrate the large improvement of a two-stage system, referred to as transfer learning in their experiments. Duroselle et al. [16, 27] study different modeling and training strategies and show that bottleneck features can be greatly improved by using a language identification loss during the training of the feature extractor. Alum et al. [28] incorporate a large pre-trained multilingual XLSR-53 wav2vec 2.0 model and reveal its excellent modeling ability: fine-tuning the model with just one utterance per target language already outperforms a baseline model trained without pre-training on around 10,000 utterances per language.

Although ASR has been applied as an auxiliary task to LID, how to better optimize the two independent LID and ASR networks has not yet been well studied. Most existing methods use a two-stage training strategy, which is not optimal and faces a trade-off between recognition accuracy and generalization ability.

Therefore, in this paper, we introduce a three-stage training strategy with an orthogonality regularization method to overcome this trade-off between recognition accuracy and generalization ability. Our analysis attributes the trade-off to a tension between hybrid-level knowledge, that is, mixed-level features including phonetic information and prosodic cues, and the noise embedded in a shared encoder. Existing methods learn either only single-level phonetic features or only noisy hybrid-level features; the methods proposed in this paper learn hybrid-level and noiseless features, thereby overcoming the trade-off and outperforming existing methods.

The main contributions of this paper are summarized as follows.

  • This paper systematically analyzes the phonetic features and hybrid-level features extracted by the ASR task and discusses the trade-off between recognition accuracy and generalization ability when optimizing the LID model with an auxiliary ASR task.

  • A novel three-stage training strategy is proposed to learn hybrid-level knowledge by adding a multi-task learning stage to the traditional two-stage strategy. The supervision of the ASR task makes the encoder encode hybrid-level features while excluding most of the noise. Meanwhile, freezing the encoder in the final stage ensures that almost no new noise is learned. Such hybrid-level, noiseless features guarantee both recognition accuracy and generalization ability.

  • An orthogonality regularization method is introduced to improve the performance of both the two-stage and three-stage training strategies. By adding prior knowledge to model a better embedding space, the language classifier achieves higher accuracy.

The rest of the paper is organized as follows. An overview of joint speech and language recognition architectures is given in Section 2. Our proposed methods are discussed in Section 3. Section 4 details the full multi-task architecture, and the experiments are presented in Section 5. Finally, Section 6 concludes the paper.

Fig. 1 a The end-to-end LID architecture, which directly produces utterance-level LID decisions from frame-level acoustic features. b The ASR-LID parallel-branches architecture, in which a shared encoder trained with the ASR loss produces frame-level phonetic information that dramatically improves LID performance

2 Related works

2.1 The ASR-LID parallel branches architecture

Although the end-to-end LID architecture (Fig. 1a) has achieved great success in the past few years, it still suffers from domain mismatch [29, 30]. Domain information about speakers, genders, channels, and other kinds of noise may be distributed differently between the training set and the real environment, which can cause poor generalization. An ideal LID system should ignore such domain-related noise. As phonetic features are independent of noise, they can be used to boost LID performance [9].

Directly extracting frame-level phonetic information from raw acoustic features with LID supervision alone is quite inefficient, because the utterance-level language labels are too coarse to provide sufficient supervision [9]. Since the ASR task runs at the frame level, the output of hidden layers of a well-trained ASR model can incorporate phonetic information while being robust to noise. Such an architecture provides a dramatic performance improvement over end-to-end LID systems [9, 23, 26, 31,32,33].

The phonetic LID architecture is generally composed of two branches, one for the LID task and the other for the ASR task, as shown in Fig. 1b. Phonetic features and other high-level knowledge are extracted by the shared encoder and then fed into the two branches. The ASR branch tends to embed phonetic features into the encoder, whereas the LID branch tends to embed hybrid-level features contaminated by domain noise. How best to optimize the two individual tasks is still an open question.

2.2 Training strategies

Traditional methods use a two-stage training strategy [15], which updates parts of the neural network with the ASR loss and the LID loss one after another. In the first stage, the shared encoder is trained with an ASR decoder using a hybrid CTC/attention loss. Acoustic features are compressed into phonetic features, and other information is treated as noise and filtered out. In the second stage, the LID model, with or without the shared encoder, is trained using a softmax cross-entropy loss.

Both frozen and unfrozen encoders have been used in recent work, as shown in Table 1 (“-F” indicates a frozen encoder; otherwise the encoder is unfrozen). If the shared encoder is frozen, it is not trained in the second stage, and the features it extracts remain single-level phonetic features. If, instead, the shared encoder is unfrozen and trained together with the following LID model, it can extract additional higher-level features beyond the phonetic ones; we call all the extracted features hybrid-level features. Predictably, noise is naturally included in such end-to-end learning.

Table 1 Previously proposed methods
Fig. 2 Visualization of the language embedding space trained with the frozen encoder strategy (a) and the unfrozen encoder strategy (b)

3 Proposed methods

3.1 A trade-off between the recognition accuracy and the generalization ability

Table 1 lists the details of recently proposed methods. These methods differ in whether the ASR encoder is frozen during the second training stage. It is shown in [16] that the unfrozen encoder yields higher recognition accuracy. In our preliminary experiments, we tried both training strategies and extracted fixed-length embeddings from cross-channel test data. Visualizations of these embeddings are shown in Fig. 2, with each color representing one language. Since an ideal language identification model should ignore channel differences, embeddings of the same color should form a single cluster even when they come from different channels. Figure 2a shows that the model trained with a frozen encoder produces relatively compact categories with relatively smooth classification boundaries. The unfrozen encoder strategy, shown in Fig. 2b, exhibits multiple sub-centers within each category, and cross-channel data lie far apart, indicating poor generalization. However, the method with the poorer generalization achieved higher accuracy, which reflects the trade-off between accuracy and robustness.

Moreover, the methods listed in Table 1 use traditional phonetic features [33]. There is a long-standing hypothesis that phonetic features may be sufficient for LID [9], so the ASR branch is designed to extract phonetic features and the LID branch to decode them, while other kinds of features are generally ignored. However, this hypothesis cannot explain the trade-off mentioned above, nor why higher ASR accuracy does not yield better LID performance [34, 35]. Therefore, in this section, we analyze why the two-stage training strategy can hardly improve both recognition accuracy and generalization ability.

Fig. 3 Three types of features fed into LID. a End-to-end LID tends to learn hybrid-level knowledge, which includes noise. b A pre-trained frozen ASR encoder can only provide phonetic features for LID. c Ideally, a pre-trained unfrozen ASR encoder should provide noiseless hybrid-level features

Figure 3 illustrates the three types of features fed into LID under different training strategies. When we directly train an end-to-end LID model on acoustic features, the extracted information can be separated into three parts, as shown in Fig. 3a. First, according to whether the information can be extracted by ASR, the vertical line separates phonetic information from other high-level features. The horizontal line then further separates those high-level features into language-discriminative features and noise. Specifically, language-discriminative features include prosodic cues and some characteristic phone sequences [5], which can hardly be extracted by an end-to-end ASR task. The noise includes not only background noise but also dataset bias, such as an unbalanced speaker distribution. While such noise may be useful on a specific dataset, it can also hurt the generalization ability of the LID model.

If the encoder is frozen in the second (LID) training stage, the features fed into LID remain the phonetic features produced by the first (ASR) training stage, as shown in Fig. 3b. Because the encoder is frozen, the LID model can only make decisions based on the extracted phonetic features rather than on noise, thus ensuring good generalization.

If we train both the encoder and the LID model in the second stage, as shown in Fig. 3c, the encoder is ideally allowed to extract hybrid-level information that is suitable for LID but not discriminative for ASR, which explains the superiority of the unfrozen encoder. Nonetheless, since this stage is effectively end-to-end LID training, after many training steps the encoder will forget the phonetic information and will still pick up noise. The noise may then cause overfitting, as our visualization analysis in Fig. 6 reveals: although the unfrozen encoder may achieve better results, the embeddings extracted by frozen-encoder models have smoother classification boundaries.

Therefore, the traditional two-stage training strategy faces a trade-off between recognition accuracy and generalization ability. The features extracted by the shared encoder are either purely phonetic or hybrid-level but noisy, so they are either too sparse to achieve good recognition accuracy or too flexible to generalize well. Hence, our goal is to improve the recognition accuracy of LID while preserving its generalization ability.

3.2 The proposed three-stage training strategy

In this paper, we propose a three-stage training strategy that adds a multi-task learning stage between the first (ASR) stage and the final (LID) stage, as shown in Fig. 4, so that high-level features suitable for LID are embedded into the encoder before freezing. Meanwhile, after freezing, the phonetic features are not forgotten and no new noise can be learned in the final LID stage. With the proposed three-stage strategy using a frozen encoder, the final LID model combines the good performance of the unfrozen encoder with the generalization ability of the frozen encoder.

First, the conformer-based ASR model is trained using a hybrid CTC/attention loss function. Then, the whole end-to-end model is trained in a multi-task manner. The final loss combines the ASR loss and the LID loss with an empirical control factor \(\beta\) (distinct from the \(\alpha\) in Eq. 4):

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{asr} + \beta \mathcal {L}_{lid}, \end{aligned}$$
(1)

where \(\mathcal {L}_{asr}\) is defined in Eq. 4 in Section 4 and \(\mathcal {L}_{lid}\) is the classical softmax cross-entropy loss. Finally, the third stage fine-tunes the model with \(\mathcal {L}_{lid}\) only. The parameters shared between two adjacent stages are initialized with the values learned in the previous stage, as in transfer learning.
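To make the staging concrete, the following PyTorch-style sketch outlines the three stages under stated assumptions: encoder, asr_head, and lid_head are hypothetical modules standing in for the shared conformer, the hybrid CTC/attention decoders, and the LID decoder, each returning its own loss, and the beta default is illustrative; this is not the released implementation.

```python
import torch

def train_three_stage(encoder, asr_head, lid_head,
                      asr_loader, lid_loader, beta=0.1, lr=1e-5):
    # Stage 1: end-to-end ASR training of the shared encoder + ASR decoders.
    opt = torch.optim.Adam(list(encoder.parameters()) +
                           list(asr_head.parameters()), lr=lr)
    for feats, tokens in asr_loader:
        loss = asr_head(encoder(feats), tokens)      # hybrid CTC/attention loss
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: multi-task training with L = L_asr + beta * L_lid (Eq. 1).
    opt = torch.optim.Adam(list(encoder.parameters()) +
                           list(asr_head.parameters()) +
                           list(lid_head.parameters()), lr=lr)
    for feats, tokens, lang in lid_loader:
        h = encoder(feats)
        loss = asr_head(h, tokens) + beta * lid_head(h, lang)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 3: freeze the encoder and fine-tune the LID decoder with L_lid
    # only, so no new noise can be written into the shared features.
    encoder.requires_grad_(False)
    opt = torch.optim.Adam(lid_head.parameters(), lr=lr)
    for feats, _, lang in lid_loader:
        with torch.no_grad():
            h = encoder(feats)                       # frozen hybrid-level features
        loss = lid_head(h, lang)
        opt.zero_grad(); loss.backward(); opt.step()
```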

Fig. 4 The proposed three-stage training strategy. a The first end-to-end ASR training stage. b The second multi-task training stage. c The final end-to-end LID training stage

3.3 Orthogonality regularization

Orthogonality regularization is used in our work to improve LID performance. Specifically, it constrains the basis vectors to help model the language space, and it can be used in both the traditional two-stage strategy and our proposed three-stage strategy.

Empirically, compared with the unfrozen encoder, the frozen encoder has better generalization but weaker classification ability. We design an embedding space with a set of orthogonal basis vectors to help the frozen encoder achieve better classification without sacrificing generalization.

From a geometric point of view, the supervised training of an end-to-end LID method is a process of finding the most language-discriminative embedding space. The centers of the languages form a set of basis vectors for the language space. After an embedding is extracted from an input utterance by the LID model, its projection onto each basis vector corresponds to the similarity between the utterance and the center of the corresponding language.

The classification loss function used in this paper, the softmax cross-entropy loss, is as follows:

$$\begin{aligned} \mathcal {L}_{CE} = -\frac{1}{N}\sum \limits _{i=1}^{N} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum _{j=1}^{n} e^{W_j^T x_i + b_j}}, \end{aligned}$$
(2)

where \(x_i \in \mathbb {R}^d\) denotes the embedding of the i-th sample, which belongs to the \(y_i\)-th language, n is the number of languages, and N is the batch size. The term \(W_j^T x_i + b_j\) is the projection mentioned above, and the rows of \(W \in \mathbb {R}^{n \times d}\) are the n language centers, that is, the n basis vectors. The farther apart two basis vectors are, the larger the between-class variance of their corresponding languages and the more separable the two languages. Therefore, we encourage the basis vectors to be orthogonal to each other, which means \(WW^T - I\) should be as close to zero as possible. However, in our preliminary experiments this term did not always decrease, especially when the shared encoder was frozen, so we add an orthogonality regularization term to the LID loss:

$$\begin{aligned} \mathcal {L}_{lid} = \mathcal {L}_{CE} + \lambda \sigma _{max}(WW^T - I), \end{aligned}$$
(3)

where \(\lambda\) is a hyper-parameter, and \(\sigma _{max}(\cdot )\) denotes the spectral norm of a matrix.
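As a concrete illustration, a minimal PyTorch sketch of Eq. 3 follows; the function name and the lam default are our own, and W is assumed to be the weight matrix of the final fully connected layer (one row per language).

```python
import torch
import torch.nn.functional as F

def lid_loss(logits, labels, W, lam=0.1):
    # Softmax cross-entropy (Eq. 2) plus the orthogonality penalty of Eq. 3;
    # W: (n_languages, embed_dim) weight of the last fully connected layer.
    ce = F.cross_entropy(logits, labels)
    gram = W @ W.t() - torch.eye(W.size(0), device=W.device)  # WW^T - I
    sigma_max = torch.linalg.svdvals(gram)[0]                 # spectral norm
    return ce + lam * sigma_max
```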

4 The model architecture

This section presents the details of our phonetic LID model, which includes a shared encoder and two decoders, one for LID and one for ASR. The shared encoder can be implemented with various structures; in this paper, we use the conformer [36], which integrates components of CNNs and transformers and efficiently captures both local and global correlations. Combined with the hybrid CTC/attention ASR decoder [37, 38], the shared encoder has a strong capability to discriminate languages and therefore improves performance [26].

For the LID decoder, we do not choose complex structures such as TDNN [4] or LSTM [39]. We implement it in a simple but effective manner to expose the relationship between LID and ASR more clearly. Frame-level phonetic features extracted by the shared encoder are aggregated directly by a statistic pooling layer without any transformation. The aggregated embedding vector has a fixed length and contains linguistic information distilled from the phonetic features. Two fully connected layers then make the final decision: the first maps the fixed-length embeddings to a space with the best language discrimination, and the second projects the transformed embeddings onto the centers of the 10 languages. A minimal sketch of this decoder is given below.
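The following PyTorch sketch is one plausible realization of this decoder. Note that we concatenate the mean and standard deviation in the pooling, so the first layer's input here is twice the feature dimension, whereas Section 5.2 reports a 256-dimensional input (which would correspond to mean-only pooling); the dropout placement is also an assumption.

```python
import torch
import torch.nn as nn

class LIDDecoder(nn.Module):
    """Statistic pooling + two fully connected layers (a sketch)."""
    def __init__(self, feat_dim=256, hidden_dim=256, n_languages=10):
        super().__init__()
        self.fc1 = nn.Sequential(nn.Linear(2 * feat_dim, hidden_dim),
                                 nn.ReLU(), nn.BatchNorm1d(hidden_dim))
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(hidden_dim, n_languages)  # language centers W

    def forward(self, h):                # h: (batch, frames, feat_dim)
        # Pool frame-level features into one fixed-length statistic vector.
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)
        emb = self.fc1(stats)            # fixed-length language embedding
        return self.fc2(self.dropout(emb))
```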

Fig. 5 The architecture of the end-to-end multi-task model

As shown in Fig. 5, the architecture used for phonetic feature extraction is a hybrid CTC/attention model [38] consisting of three components: a shared encoder, an attention-based decoder, and a CTC decoder. This model mitigates the overly flexible alignment of the attention-based method by using CTC as a regularizer during training and as a score correction during decoding [40]. The encoder receives acoustic features of T frames, \(X = \lbrace x_t \in \mathbb {R}^{D_1} \vert t=1, \ldots , T\rbrace\), and extracts the phonetic features within them, \(H = \lbrace h_t \in \mathbb {R}^{D_2} \vert t=1, \ldots , T\rbrace\). The two decoders then map the T-length phonetic feature sequence to an N-length word sequence, \(W = \lbrace w_n \in \mathcal {V} \vert n=1, \ldots , N\rbrace\). Here, \(x_t\) is a \(D_1\)-dimensional feature vector (e.g., Mel filterbanks) at frame t, \(h_t\) is a \(D_2\)-dimensional feature vector containing phonetic information, and \(w_n\) is a word or sub-word at position n in the vocabulary \(\mathcal {V}\).

The loss functions are expressed as:

$$\begin{aligned} \mathcal {L}_{asr} = \alpha \mathcal {L}_{ctc} + (1-\alpha )\mathcal {L}_{att}, \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {L}_{ctc} = -\log p_{ctc}(W \vert X), \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {L}_{att} = -\log p_{att}(W \vert X), \end{aligned}$$
(6)

where the hyper-parameter \(\alpha\) is the weight of the CTC loss, \(p_{ctc}\) can be computed by dynamic programming [41], and \(p_{att}\) is the output of the attention-based decoder. The CTC module has been shown to ensure appropriate alignments and fast convergence [42].
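A compact PyTorch sketch of Eq. 4 follows, under stated assumptions: the CTC inputs are already log-softmaxed, the token targets are unpadded, and the alpha default is illustrative.

```python
import torch
import torch.nn.functional as F

def hybrid_asr_loss(ctc_log_probs, att_logits, tokens,
                    feat_lens, token_lens, alpha=0.3):
    # ctc_log_probs: (frames, batch, vocab), log-softmaxed encoder outputs.
    # att_logits:    (batch, tokens, vocab), attention-decoder outputs.
    # tokens:        (batch, tokens) target ids, assumed unpadded here.
    l_ctc = F.ctc_loss(ctc_log_probs, tokens, feat_lens, token_lens)  # Eq. 5
    l_att = F.cross_entropy(att_logits.transpose(1, 2), tokens)       # Eq. 6
    return alpha * l_ctc + (1 - alpha) * l_att                        # Eq. 4
```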

Building the vocabulary \(\mathcal {V}\) is a tricky problem, especially for multilingual ASR. To limit out-of-vocabulary tokens, the vocabulary should be large, and its units are typically at the character level [22, 38], although sub-word units such as byte-pair encoding, as used in [28], also achieve acceptable performance. With 10 languages in our experiments, we use a byte-pair-encoding vocabulary shared across all languages, composed of per-language sub-vocabularies of 1200 to 3500 units each. A hedged sketch of such a construction is shown below.
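The sketch below shows one way to build such a shared vocabulary with the sentencepiece toolkit; the file paths, language codes, and sizes are illustrative, not the paper's actual setup.

```python
import sentencepiece as spm

# Illustrative language codes and sub-vocabulary sizes (the paper uses 10
# languages with 1200 to 3500 units each); text paths are hypothetical.
languages = {"zh-yue": 3500, "zh-cn": 3500, "id-id": 1200, "ja-jp": 3000}

shared_vocab = []
for lang, size in languages.items():
    # Train one BPE model per language on that language's transcripts.
    spm.SentencePieceTrainer.train(
        input=f"text/{lang}.txt", model_prefix=f"bpe_{lang}",
        vocab_size=size, model_type="bpe")
    with open(f"bpe_{lang}.vocab", encoding="utf-8") as f:
        # Each line of a sentencepiece .vocab file is "<piece>\t<score>".
        shared_vocab.extend(line.split("\t")[0] for line in f)
```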

Table 2 Details of the OLR dataset

5 Experiments

5.1 The OLR dataset

We train and evaluate the model on the OLR dataset [43,44,45,46]. Table 2 details the contents of the dataset. ASR-train refers to the subset used for training the ASR system, composed of AP16-OL7 and AP17-OL3. The other, untranscribed subsets of the OLR dataset, including AP17-OLR-test, AP18-OLR-test, and AP19-OLR-test, are added to ASR-train for LID training; together they are named LID-train in our experiments. ASR-train covers 10 languages, namely Cantonese, Mandarin, Indonesian, Japanese, Russian, Korean, Vietnamese, Kazakh, Tibetan, and Uyghur, and contains about 71.42 h of speech recorded with mobile phones. The channel-test and noisy-test sets are the two standard test sets of the OLR20 challenge [43], used to evaluate robustness to different recording equipment and environments and to noisy (low-SNR) environments, respectively.

Table 3 Parameters of the conformer ASR model

5.2 Training setup

We conduct all experiments on the WeNet platform [47]. Eighty-dimensional FBank features are extracted from each speech frame with a 25-ms window and a 10-ms stride. Data augmentation, including speed and volume perturbation and SpecAugment [48], is applied before training. For speed perturbation, we apply speed factors of 0.9 and 1.1 to slow down or speed up the original recordings. For volume perturbation, a random volume factor is applied. For SpecAugment, we apply a time warp with a maximum width of 80, 2 time masks with a maximum width of 50, and 2 frequency masks with a maximum width of 10. No voice activity detection (VAD) is used. Cepstral mean and variance normalization [49] is applied to the input features. A rough sketch of this front-end is given below.
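The following torchaudio sketch approximates the front-end just described; it covers FBank extraction and the masking part of SpecAugment (the time warp is omitted for brevity), and the tensor layouts are assumptions.

```python
import torch
import torchaudio

def make_fbank(wav, sr=16000):
    # 80-dim FBank, 25-ms window, 10-ms stride, as stated in the text;
    # wav is a (channels, samples) waveform tensor.
    return torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
        sample_frequency=sr)                      # -> (frames, 80)

# Masking part of SpecAugment: 2 time masks (max width 50) and
# 2 frequency masks (max width 10).
spec_augment = torch.nn.Sequential(
    torchaudio.transforms.TimeMasking(time_mask_param=50),
    torchaudio.transforms.TimeMasking(time_mask_param=50),
    torchaudio.transforms.FrequencyMasking(freq_mask_param=10),
    torchaudio.transforms.FrequencyMasking(freq_mask_param=10),
)

# Usage: the masks expect a (..., freq, time) layout, hence the transpose.
# feats = make_fbank(wav); masked = spec_augment(feats.t().unsqueeze(0))
```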

We use the conformer as the ASR encoder and the transformer as the ASR decoder. The parameters are listed in Table 3.

For the LID model, we use statistic pooling followed by two fully connected layers, each with ReLU activation and BatchNorm. The input and output dimensions of the first fully connected layer are both 256, matching the output dimension of the conformer. The second fully connected layer has an input dimension of 256 and an output dimension of 10. A dropout rate of 0.5 is applied before the cross-entropy loss.

Both the ASR and LID models are trained with the Adam optimizer. The ASR models use the warm-up schedule [50] with 25,000 warm-up steps and a peak learning rate of 1e−2. For the LID model, a relatively low learning rate of 1e−5 is used. We train for 240, 80, and 40 epochs in the first, second, and third stages, respectively. The value 240 is the default number of epochs for training an ASR model in the WeNet framework [47], and the model converges well within 240 epochs in our experiments. The values 80 and 40 were set empirically from the convergence rate on the training set in preliminary experiments and ensure that all systems in this paper converge. We use a dynamic batch size to fit our NVIDIA 1080Ti GPU with 11 GB of memory. Model averaging is performed over the 10 best checkpoints according to the loss on the dev set, which allows the methods in our experiments to be compared fairly.

Table 4 Experimental results under different training strategies

5.3 Evaluation metrics

As in the OLR20 challenge [51], we use two metrics to evaluate the language recognition systems: Cavg as the principal metric and the equal error rate (EER) as the secondary metric. Since language recognition must keep both the false alarm rate and the false rejection rate low, Cavg and EER jointly measure these two rates. EER does not consider the performance of individual languages; the scores of all samples are pooled for its calculation. Because the optimal threshold may differ across languages, Cavg computes the error rate separately for each language and then averages, yielding a more precise evaluation.

5.3.1 Cavg

In most language recognition challenges, such as LRE [52] and OLR [43], Cavg is the principal evaluation metric. The pair-wise loss of a specific target/non-target language pair is defined as:

$$\begin{aligned} C\left( L_{t}, L_{n}\right) = P_{Target}\, P_{Miss}\left( L_{t}\right) + \left( 1-P_{Target}\right) P_{FA}\left( L_{t}, L_{n}\right) , \end{aligned}$$
(7)

where \(L_t\) and \(L_n\) are the target and non-target languages, respectively; \(P_{Miss}\) and \(P_{FA}\) are the miss and false alarm probabilities, respectively; and \(P_{Target}\) is the prior probability of the target language, which is set to 0.5 in the evaluation.

Cavg is defined as follows:

$$\begin{aligned} C_{avg} = \frac{1}{N}\left\{ P_{Target} \sum \limits _{L_{t}} P_{Miss}\left( L_{t}\right) + \sum \limits _{L_{t}} \sum \limits _{L_{n}} P_{Nontarget}\, P_{FA}\left( L_{t}, L_{n}\right) \right\} , \end{aligned}$$
(8)

where N is the number of languages and \(P_{Nontarget} = (1-P_{Target}) / (N-1)\).
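For reference, a small NumPy sketch of Eqs. 7 and 8 follows; the fixed decision threshold of 0 on calibrated scores is our assumption, since Cavg requires hard accept/reject decisions, and it assumes every language appears in the trials.

```python
import numpy as np

def cavg(scores, labels, p_target=0.5):
    # scores: (trials, N) matrix, scores[i, j] = score of trial i for
    # language j; labels[i] = true language index of trial i.
    scores, labels = np.asarray(scores), np.asarray(labels)
    n = scores.shape[1]
    p_non = (1 - p_target) / (n - 1)                 # P_Nontarget
    total = 0.0
    for t in range(n):                               # target language L_t
        tgt = scores[labels == t, t]
        total += p_target * np.mean(tgt < 0)         # P_Miss(L_t)
        for nt in range(n):                          # non-target language L_n
            if nt != t:
                non = scores[labels == nt, t]
                total += p_non * np.mean(non >= 0)   # P_FA(L_t, L_n)
    return total / n
```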

5.3.2 Equal error rate

EER is widely used in many recognition tasks [24]. To define it, the false rejection rate (FRR) and the false acceptance rate (FAR) must first be considered:

$$\begin{aligned} P_{FRR} = \frac{\text {number of target trials rejected}}{\text {total number of target trials}}, \end{aligned}$$
(9)
$$\begin{aligned} P_{FAR} = \frac{\text {number of non-target trials accepted}}{\text {total number of non-target trials}}. \end{aligned}$$
(10)

Plotting \(P_{FRR}\) on the vertical axis against \(P_{FAR}\) on the horizontal axis yields a continuous curve known as the detection error tradeoff (DET) curve.

EER is the point on the DET curve where \(P_{FRR} = P_{FAR}\).
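A simple NumPy sketch of this definition, sweeping thresholds over the observed scores:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    # Approximate EER: the threshold where P_FRR (Eq. 9) meets P_FAR (Eq. 10).
    target_scores = np.asarray(target_scores)
    nontarget_scores = np.asarray(nontarget_scores)
    thresholds = np.unique(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([np.mean(target_scores < t) for t in thresholds])
    far = np.array([np.mean(nontarget_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(frr - far))        # closest point on the DET curve
    return (frr[i] + far[i]) / 2
```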

5.4 Experimental results under different training strategies

The experimental results under different training strategies and our proposed orthogonality regularization are listed in Table 4.

We first implement the traditional two-stage strategy with the two encoder settings, frozen and unfrozen, reported as No. 0 and No. 1 in Table 4. We reach the same conclusion as [16]: fine-tuning the LID task with the unfrozen encoder outperforms fine-tuning with the frozen encoder. Note that although No. 1 achieves better Cavg, we believe this may come from overfitting to some languages; overall, its EER is higher.

Then, orthogonality regularization with \(\lambda =0.1\) is added to the above two experiments, reported as No. 2 and No. 3. Table 4 shows that orthogonality regularization improves the frozen-encoder method, while the results of the unfrozen method decrease slightly. A possible reason is that the frozen encoder cannot extract enough information, so our regularization acts as useful extra prior knowledge, whereas the unfrozen method already has the flexibility to explore whatever knowledge it needs and thus cannot benefit from the regularization.

In our proposed three-stage training strategy, a multi-task learning stage is added between the traditional two stages. It can likewise be trained with the frozen or unfrozen encoder, reported as No. 4 and No. 5 in Table 4. Compared with the corresponding two-stage experiments, No. 0 and No. 1, the proposed three-stage strategy performs clearly better, which implies that the additional knowledge is beneficial. There is one exception: comparing No. 0 and No. 4 on the noisy-test set, the two-stage strategy “asr-lid-F” outperforms the three-stage strategy “asr-mt-lid-F”. The reason may be that noise not completely filtered out by the ASR task enters the end-to-end LID training during the multi-task stage, is then fixed in place by the frozen encoder, and finally affects performance.

The results of the three-stage training strategy with orthogonality regularization are reported as No. 6 and No. 7. The trends are similar to those of the two-stage strategies: orthogonality regularization dramatically improves the frozen method but appears useless for the unfrozen method. Moreover, compared with the two-stage methods No. 2 and No. 3, the proposed three-stage methods achieve better performance on both channel-test and noisy-test.

The results of the two-stage strategy consisting of multi-task learning followed by LID fine-tuning, denoted mt-lid, are reported as No. 8 and No. 9. Since its multi-task stage also tends to learn hybrid-level, noiseless features, it should in theory achieve results similar to the three-stage strategy. However, its results are worse, which verifies the effectiveness of the three-stage strategy.

Table 5 Experimental results with different values of the hyperparameter \(\lambda\)

5.5 Hyperparameter analysis

To find the optimal setting of the proposed methods, the hyperparameter \(\lambda\) of Section 3.3 is analyzed both experimentally and theoretically.

As described in Section 3.3, \(\lambda\) is the weight of the orthogonality regularization in \(\mathcal {L}_{lid}\). We would like the embedding space to be spanned by a set of orthogonal basis vectors, so that the correlation between any two language centers is as small as possible. However, given the natural similarities between languages, completely orthogonal basis vectors may not directly lead to better classification. Therefore, we use \(\lambda\) to control the strength of the constraint.

To explore the optimal value of \(\lambda\), we set it to 0, 0.01, 0.1, and 1.0. As shown in Table 5, No. 0 to No. 3 denote the three-stage strategy with the frozen encoder in the final stage. The results are as expected: setting \(\lambda\) to 0 or 1 does not give the best results, while \(\lambda = 0.1\) is optimal. Moreover, using orthogonality regularization (\(\lambda \ne 0\)) is better than not using it (\(\lambda = 0\)), showing the advantage of the regularization.

No. 4 to No. 7 denote the three-stage strategy with the unfrozen encoder in the final stage. The best results on the two test sets appear in No. 7 and No. 5, respectively. Consistent with the results in Table 4, the unfrozen-encoder strategies with orthogonality regularization (No. 5 to No. 7) do not outperform the one without it (No. 4): the unfrozen encoder already gives the model enough flexibility to learn the optimal embedding space, so the regularization brings no gain. It does not hurt either; for example, \(\lambda = 1\) (No. 7) achieves the best results on channel-test. Observing that the results of No. 7 on the noisy-test set decrease, we suspect that the prior knowledge introduced by the orthogonality constraint still has some impact on modeling the language space, which deserves further analysis in future work.

Fig. 6 Language embeddings of different training strategies, plotted by t-SNE on channel-test. Each color represents a different language. Models are named by their strategies plus the value of \(\lambda\) and correspond to those in Table 4

5.6 Visualizing analysis

As mentioned in Section 3.1, when the pre-trained encoder is unfrozen in the second stage, it may achieve higher accuracy but poorer generalization. We compare the models’ generalization abilities by visualizing the distributions of their language embeddings; models with smoother classification boundaries are considered to generalize better. Language embeddings are extracted from the penultimate layer of the corresponding LID model and plotted after dimension reduction with t-distributed stochastic neighbor embedding (t-SNE), a non-linear dimension reduction algorithm.
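A minimal sketch of this visualization step with scikit-learn and matplotlib; the array names, output path, and plot settings are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_language_embeddings(embeddings, langs):
    # embeddings: (N, 256) array from the penultimate LID layer;
    # langs: integer language index per embedding (used for coloring).
    points = TSNE(n_components=2, init="pca").fit_transform(embeddings)
    plt.scatter(points[:, 0], points[:, 1], c=langs, cmap="tab10", s=4)
    plt.savefig("embeddings_tsne.png", dpi=200)
```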

We take the channel-test set as an example and extract language embeddings for the strategies listed in Table 4, as shown in Fig. 6. Each color represents a different language. For clarity, we slightly reorder the strategies so that those using the frozen encoder appear together; the two-stage strategies and the three-stage strategies occupy separate rows.

Generally speaking, the classification boundaries of the proposed three-stage strategies (e–h) in Fig. 6 are smoother and clearer than those of the traditional two-stage strategies (a–d). The strategies trained with the frozen encoder (c, d, g, and f) show better generalization, and the strategies trained with orthogonality regularization (b, d, f, and h) have higher between-class variance and lower within-class variance.

Specifically, panels a and c in Fig. 6 compare the effect of freezing the encoder in the traditional two-stage strategy. Although the unfrozen variant is more accurate, its uneven classification boundary leads to poorer generalization. The strategies trained with orthogonality regularization reduce the within-class variance, which is especially evident for Vietnamese (vi-vn) in panels e and f. With or without orthogonality regularization, the three-stage strategies show better generalization than the two-stage strategies, especially when using the unfrozen encoder.

Table 6 The comparison of our best model with the best two-stage model and the top three models on the OLR20 challenge

5.7 Comparison with state-of-the-arts

Back-end processing usually improves performance and hence was widely used in previous OLR challenges [51]. However, it is computationally expensive because it requires extracting embeddings for the entire training set. Therefore, in our experiments, we apply back-end processing only to our best two-stage and three-stage models in order to compare with other state-of-the-art models that also use back-end processing.

In our back-end processing, after training, the output of the penultimate layer of the LID model is extracted as the language embedding. Linear discriminant analysis (LDA) is fitted on the embeddings of the training set, and the embeddings are projected to 100 dimensions. After LDA projection and centering, logistic regression (LR) is used to generate the scores. A hedged sketch of this pipeline is shown below.
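A scikit-learn sketch of this back-end follows. Note that scikit-learn's LDA caps the projection at n_classes − 1 = 9 dimensions for 10 languages, so the 100-dimensional projection reported above evidently comes from a different LDA implementation; the solver settings here are assumptions.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

def backend_scores(train_emb, train_lang, test_emb):
    # LDA projection (capped at 9 dims here; see the caveat above),
    # centering on the training mean, then logistic regression scoring.
    lda = LinearDiscriminantAnalysis(n_components=9)
    tr = lda.fit_transform(train_emb, train_lang)
    te = lda.transform(test_emb)
    mean = tr.mean(axis=0)
    lr = LogisticRegression(max_iter=1000).fit(tr - mean, train_lang)
    return lr.predict_proba(te - mean)
```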

Table 6 compares our best two-stage and three-stage models with the top three models reported in the OLR20 challenge [51]. Both models are comparable with the top three on noisy-test. Notably, our best three-stage model with orthogonality regularization shows clear advantages on both channel-test and noisy-test, demonstrating the effectiveness of the proposed three-stage training and orthogonality regularization methods.

6 Conclusions

In this paper, we extended the input features of language recognition from phonetic features to hybrid-level features and analyzed the traditional two-stage training strategy. Preliminary experiments showed that the two-stage strategy cannot ensure both recognition accuracy and generalization ability. We therefore proposed two remedies: a three-stage training strategy and orthogonality regularization. A multi-task learning stage added between the traditional two stages extracts hybrid-level yet noiseless features for the subsequent LID task, and combining orthogonality regularization with the three-stage strategy yields our end-to-end multi-task architecture. The experimental results demonstrated that the proposed model achieves significant improvements over the baseline two-stage model and the top three models of the OLR20 challenge.