1 Introduction

Automatic speech recognition (ASR) [1,2,3] refers to the use of machines to convert human speech into the corresponding text. After more than half a century of rapid development, ASR technology has been widely applied in many areas of social production and daily life, including the economy, military, and culture [4,5,6,7]. Currently, ASR systems developed for international languages such as English have achieved human-level recognition, with robust, fast, and accurate performance. However, among the more than 7000 languages spoken worldwide, the majority lack the abundant training resources available for major languages such as English. According to statistics, about 40% of languages are endangered, with fewer than 1000 speakers [8,9,10]. It is the scarcity of transcribed data for these low-resource language communities that prevents large neural networks from being adequately trained, leading to poor performance and a lack of real-world applications. Therefore, ASR for low-resource languages has gradually become a research focus.

With the technological wave of large language models such as GPT (Generative Pre-trained Transformer) [11, 12] and BERT (Bidirectional Encoder Representations from Transformers) [13], large models in the speech domain have also gained momentum. Recently, multilingual, multi-task large speech models have gradually become a popular paradigm for low-resource ASR. Among them, OpenAI's weakly supervised speech processing model Whisper can simultaneously perform ASR, speech translation, language identification, and voice activity detection for 100 languages [14], and it achieves excellent results on many open-source datasets. Whisper is therefore the research object of this paper.

Despite its excellent capabilities, Whisper still falls short of practical application requirements in low-resource ASR tasks, leaving considerable room for improvement. Currently, some studies use a small amount of supervised data to fine-tune Whisper and thereby improve its performance on target languages. Sicard et al. fine-tuned Whisper to reduce the recognition error rate on Swiss German dialects [15]. Liu et al. proposed a parameter-efficient fine-tuning method that can quickly adapt to child ASR tasks [16]. Xie et al. fine-tuned Whisper on a self-constructed dataset and effectively improved performance on mixed-language ASR tasks [17]. Waheed et al. built an ASR and dialect identification system for various Arabic dialects by fine-tuning Whisper and several other models [18].

Although these studies have made effective explorations in theory and practice, three issues regarding Whisper's low-resource ASR still require further in-depth research. First, to what extent can fine-tuning improve performance? Second, which parts of the model are most critical during fine-tuning? Third, what are the advantages and disadvantages of parameter-efficient fine-tuning methods? These issues involve different fine-tuning strategies for Whisper, but they share commonalities and can be compared and analyzed comprehensively.

In view of this, this paper explores various fine-tuning strategies for Whisper in low-resource ASR in depth and conducts a detailed analysis and discussion. Specifically, we use a small amount of supervised data from seven low-resource languages and adopt three classes of fine-tuning methods (vanilla fine-tuning, fine-tuning with specific parameters, and fine-tuning with additional modules) to experimentally analyze the above issues. The results show that all fine-tuning strategies explored in this paper significantly improve Whisper's performance: vanilla fine-tuning greatly enhances performance on the target language, fine-tuning with specific parameters further improves upon vanilla fine-tuning, and fine-tuning with additional modules effectively prevents catastrophic forgetting with negligible performance loss while achieving parameter efficiency.

The main contributions of this paper are as follows: (1) exploring the range of performance that Whisper fine-tuning strategies can achieve in low-resource ASR; (2) further probing the inherent mechanism of Whisper's speech encoding by observing and analyzing the behavior of different sub-modules of the model; (3) comparing and analyzing the advantages and disadvantages of different fine-tuning strategies both subjectively and objectively. These conclusions can be applied directly to future research and engineering practice.

2 Material and method

2.1 Whisper

Whisper, developed by OpenAI, is an advanced speech processing model capable of performing various tasks, such as ASR, speech translation, language identification, and voice activity detection, in 100 languages.

In terms of data construction, the development team assembled a weakly supervised dataset of 680,000 h of speech through extensive collection, automated screening, and processing, and applied multi-task standardized annotation to the transcribed text. In terms of model architecture, the team chose a multi-layer stacked Transformer with an encoder-decoder structure as the basic network. Depending on the number of layers, the dimension of the feature representation (width), and the number of attention heads, the model comes in five versions: Tiny, Base, Small, Medium, and Large. Table 1 summarizes the specifics of each version. Additionally, building on the Large version, Large-v2 was trained for 2.5 times more iterations, while Large-v3 used the same data collection and processing pipeline together with pseudo-labeling by Large-v2 to increase the training data to 5 million hours. Both Large-v2 and Large-v3 outperform the Large model, and Large-v3 is even stronger than Large-v2. In terms of model training, Whisper uses multi-task training to update and optimize the model parameters, covering recognition, English translation, voice activity detection, and language identification.

Table 1 Architecture details of the Whisper model family

Whisper boasts superior multi-language ASR and translation capabilities. In some languages, its performance is even comparable to or better than that of humans. Currently, Whisper is becoming increasingly popular due to its advanced features and extensive applications.

2.2 Fine-tuning

Fine-tuning involves adjusting the model parameters to fit the hypothesis space of the target task using far less data than pretraining. It is typically used when the source and target domains are similar. There are numerous studies on using fine-tuning to improve model performance. Well-known self-supervised speech models such as the Wav2vec series [19, 20], HuBERT [21], WavLM [22], and MMS (Massively Multilingual Speech) [23] require fine-tuning on domain-specific data to adapt to downstream tasks. Jain et al. explored different pretraining and fine-tuning methods for the Wav2vec 2.0 model on child-speech ASR [24]. Zhang et al. analyzed various combinations of pretraining and fine-tuning on 15 low-resource languages in the OpenASR21 challenge [25]. Pasad et al. used various metrics, including canonical correlation analysis, mutual information, word recognition, and word similarity, to study the characteristics of Wav2vec 2.0's layer representations and to guide model improvements and better fine-tuning strategies [26].

However, fine-tuning faces three main challenges. First, because these models contain hundreds of millions or even billions of parameters, updating all of them during fine-tuning is computationally expensive and time-consuming. Second, fitting a small amount of data with such a large number of parameters may lead to overfitting and thus poor generalization. Third, large models tend to possess general capabilities across multiple tasks or languages, but after fine-tuning they may perform well only on the target task or language and lose these general abilities, a phenomenon known as catastrophic forgetting. Both overfitting and catastrophic forgetting degrade the generalization ability of the trained model, but the former concerns the training and test data of a single language, while the latter concerns multiple languages.

To address these issues, numerous studies have explored improved fine-tuning strategies. Rosin et al. used a partial parameter-freezing strategy when adapting an ASR system to German and explored the impact of different freezing configurations on system performance [27]. Pasad et al. reinitialized the last 1-3 layers of the Wav2vec 2.0 model, achieving even better results than the pretrained model [26]. Kannan et al. introduced a bottleneck adapter into the model, enabling full adaptation to specific languages [28]. Yu et al. introduced LoRA (Low-Rank Adaptation) into ASR systems, reducing training time by factors of 3.6 to 5.4 while using only 0.08% of the parameters of the pretrained model [29].

3 Experiments and analysis

3.1 Data and baseline

We use seven languages from the Fleurs dataset [30] in our experiments: Afrikaans (Af), Belarusian (Be), Icelandic (Is), Kazakh (Kk), Marathi (Mr), Nepali (Ne), and Swahili (Sw). The geographical areas associated with all experimental languages are listed in Table 2. We believe that these areas are related to the basic phonological structures of the languages, which is useful for an in-depth analysis of the interactions among multiple languages; adopting languages from multiple areas therefore helps ensure that our conclusions generalize. Fleurs is a multi-parallel speech dataset in which each language contains about 12 h of supervised speech, and it can be used for various tasks such as ASR, speech translation, machine translation, and retrieval. The training set contains less than 10 h of supervision, and the speakers in the training set differ from those in the development and test sets. The data are obtained from Hugging Face, and we train, validate, and test on the train, validation, and test split of each language, respectively. Table 2 presents the detailed statistics of the data used in this paper.

Table 2 The detailed statistics of the data used in this paper. h means hour. Num_rows is the number of speech utterances (rows), and the numbers in “( )” represent (train : validation : test)

Table 3 shows the performance of Whisper on the seven languages [14], measured with word error rate (WER) as the evaluation metric. Because the training data were changed for Large-v3, many low-resource languages received more data and no longer fit the low-resource setting. The focus of this paper is on comparing and analyzing fine-tuning strategies for Whisper rather than on applying lightweight models, so we selected Large-v2 as the baseline for our experiments and analysis and refer to it as the pretrained model (PT). We believe that the conclusions of this paper transfer directly to smaller models.

Table 3 WERs (%) of the 7 languages as tested by OpenAI. The number after each language is the duration of its training data in Whisper pretraining; h means hour. Each of these 7 languages has less than 20 h of pretraining data, satisfying the low-resource condition

3.2 Vanilla fine-tuning

To explore the extent to which vanilla fine-tuning can improve the low-resource ASR performance of Whisper, we first fine-tuned the model on the seven language datasets mentioned above, using the example provided by Hugging Face as a reference. Specifically, we optimized with AdamW [31] and trained for a total of 35 epochs. The learning rate was warmed up over the first 10% of the total steps to a peak of 0.00001 and then decayed linearly; this setting ensures sufficient training. During training, a checkpoint was saved every 200 steps, all checkpoints were evaluated on the test set, and the checkpoint with the lowest WER was selected as the fine-tuned model. We used a single 48 GB NVIDIA A40 GPU, and the batch size for all languages was 8. A sketch of this training configuration is given below.
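The configuration above maps onto the Hugging Face Seq2SeqTrainer roughly as follows. This is a minimal sketch, not the exact script used in our experiments: dataset preparation, the padding data collator, and the WER metric are assumed to be set up as in the Hugging Face example, and the names train_dataset, eval_dataset, and data_collator are placeholders for those components.

```python
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

# Baseline checkpoint used as the pretrained model (PT).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-ft",
    per_device_train_batch_size=8,     # batch size 8 on a single A40
    learning_rate=1e-5,                # peak learning rate after warm-up
    warmup_ratio=0.1,                  # warm up over the first 10% of steps
    lr_scheduler_type="linear",        # then decay linearly
    num_train_epochs=35,
    evaluation_strategy="steps",
    eval_steps=200,
    save_steps=200,                    # keep a checkpoint every 200 steps
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,       # Fleurs train split (prepared elsewhere)
    eval_dataset=eval_dataset,         # Fleurs validation split
    data_collator=data_collator,       # speech seq2seq padding collator
    tokenizer=processor.feature_extractor,
)
trainer.train()                        # AdamW is the Trainer's default optimizer
```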

Table 4 compares the test results before (PT) and after (FT) fine-tuning. Evidently, fine-tuning Whisper-large-v2 with less than 20 h of data can significantly reduce the error rate on the corresponding language: compared with PT, FT reduces the average WER by 56.94% relative. This result indicates that Whisper's multilingual, multi-task pretraining not only gives it preliminary ASR capabilities in many languages but also places the model parameters in a suitable initial region of the high-dimensional parameter space, enabling rapid adaptation to target tasks. Vanilla fine-tuning can then shift the parameters further in a direction favorable to the target task, even with limited data.

Table 4 WERs (%) of 7 languages tested before (PT) and after (FT) fine-tuning

3.3 Fine-tuning with specific parameters

Related studies have summarized that a multi-layer stacked encoder gradually extracts deeper speech information from bottom to top [25, 32]. The bottom-layer representations tend to store the most basic information of speech, such as pronunciation units and articulatory (lip and tooth) movements, while the higher-layer representations tend to express advanced features of the speech content. Because human articulatory structures are similar, pronunciation mechanisms are similar across languages and the differences between languages at this level are small. Therefore, updating the bottom-layer parameters of the encoder during fine-tuning is largely redundant. Reducing unnecessary parameter updates has two benefits: first, fewer parameters need to be updated, which reduces computational and time costs; second, it helps prevent overfitting when training a large model on very limited data.

To further analyze Whisper's speech encoding capabilities and identify which regions of the encoder are redundant to update and which are critical to task performance, we used centered kernel alignment (CKA) to compare the representations of each encoder layer before and after fine-tuning. Specifically, we randomly sampled 100 utterances from each language dataset and used the CKA method of [26] to compute the similarity of the representations of all encoder layers before and after fine-tuning.
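As a reference, the layer-wise similarity can be computed with linear CKA roughly as follows. This is a minimal sketch under the assumption that the representations of a given layer, before and after fine-tuning, are stacked frame by frame into matrices of shape (n_frames, dim) over the same 100 utterances; the exact procedure of [26] may differ in detail.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices of shape (n_frames, dim).

    Rows of X and Y must correspond to the same frames; only the feature
    dimension may differ between the two matrices.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, ord="fro")
                          * np.linalg.norm(Y.T @ Y, ord="fro")))

# Example: similarity of layer-l encoder outputs before (reps_pt) and after
# (reps_ft) fine-tuning, each stacked over the sampled utterances.
# similarity_l = linear_cka(reps_pt[l], reps_ft[l])
```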

Figure 1 visualizes the similarities. For all languages, the layer-wise change pattern of the encoder representations before and after fine-tuning is quite similar. Taking Kk (yellow line) as an example, the similarity of the representations of the first 15 encoder layers remains almost unchanged, staying above 0.95. From the 16th layer onward, the similarity gradually decreases, with lower similarity and a steeper downward trend at higher layers. We believe this pattern is consistent with the conclusions of existing research [25]: the bottom layers of the model capture features shared among languages, the higher layers extract language-specific features, and the intermediate layers provide a smooth transition between the two. It should be noted that Kk is unique in Fig. 1: the similarity of its initial layers first decreases and then increases, and its curve does not fall within the band formed by the curves of the other six languages. We believe this uniqueness is related to its geographical origin; among the seven languages, only Kk originates from Central Asia and belongs to the Turkic language family, and such differences in language characteristics lead to differences in model behavior.

Fig. 1 Similarities of encoder layer-wise representations by CKA in 7 languages before and after fine-tuning

To further verify the above analysis, we modified the fine-tuning procedure of Whisper and examined the objective results. During parameter updates, we froze the bottom n layers of the encoder while all other layers were still updated, with n = 10, 11, …, 31; when n = 32, the entire encoder is frozen and only the decoder is updated. We did not test n = 1, 2, …, 9, because freezing only a handful of layers defeats the purpose of the operation.
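A minimal sketch of the freezing step is shown below, assuming the Hugging Face implementation of Whisper, in which the encoder Transformer blocks are exposed as model.model.encoder.layers.

```python
from transformers import WhisperForConditionalGeneration

def freeze_bottom_encoder_layers(model, n: int):
    """Freeze the bottom n Transformer blocks of the Whisper encoder.

    All remaining parameters (upper encoder blocks and the full decoder)
    stay trainable, matching the FFT setting described above.
    """
    for layer in model.model.encoder.layers[:n]:
        for param in layer.parameters():
            param.requires_grad = False
    return model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model = freeze_bottom_encoder_layers(model, n=20)
```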

Figure 2 shows the results for each language under different numbers of frozen layers. Two observations can be made. First, as the number of frozen layers increases, performance on all languages consistently tends to worsen, supporting the conclusion that the bottom and top layers of the encoder encode different kinds of information. Second, the best result for each language appears when n ≤ 20, indicating that updating the bottom-layer parameters during vanilla fine-tuning is redundant.

Fig. 2 WERs (%) for different numbers of frozen bottom layers per language

Table 5 compares vanilla fine-tuning with fine-tuning in which the bottom encoder layers are frozen (FFT), reporting only the best FFT result per language. For all languages, FFT consistently reduces the WER relative to FT, with an average relative reduction of 6.27%. Combined with the results above, this shows that freezing the lower encoder layers not only reduces the number of parameters to be updated, along with the computational and time cost of fine-tuning, but also slightly improves performance.

Table 5 WERs (%) of 7 languages tested with FT, bottom-layer freezing fine-tuning (FFT), and top-layer reinitialization fine-tuning (RIFT). The number before “/” is the best WER (%) over all results per language, and the number after “/” is the corresponding number of frozen or reinitialized layers

In addition, according to the conclusions in [26], a possible reason for the large changes in the top encoder layers before and after fine-tuning is that pretraining did not provide good initial parameters for these layers, i.e., their initial state deviated from the target task. Reinitializing the top layers may therefore yield initial parameters better suited to the target task. To further analyze the speech encoding capabilities of Whisper, we implemented top-layer reinitialization fine-tuning: we reinitialized the top m encoder layers while freezing the bottom (32 − m) layers, where m = 1, 2, …, 5. The parameters of each Whisper encoder layer fall into two types of sublayers: linear layers and normalization layers. For the linear layers, we initialized the weights with Xavier normal initialization [33] and the biases to 0; for the normalization layers, we initialized the weights to 1 and the biases to 0.
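The reinitialization described above can be sketched as follows, again assuming the Hugging Face Whisper implementation; the bottom (32 − m) layers are additionally frozen as in the previous subsection (not shown).

```python
import torch.nn as nn
from transformers import WhisperForConditionalGeneration

def reinit_top_encoder_layers(model, m: int):
    """Re-initialize the top m encoder blocks of Whisper (RIFT setting)."""
    def _reset(module):
        if isinstance(module, nn.Linear):
            nn.init.xavier_normal_(module.weight)   # Xavier normal for linear weights
            if module.bias is not None:
                nn.init.zeros_(module.bias)         # linear biases set to 0
        elif isinstance(module, nn.LayerNorm):
            nn.init.ones_(module.weight)            # LayerNorm weights set to 1
            nn.init.zeros_(module.bias)             # LayerNorm biases set to 0

    for layer in model.model.encoder.layers[-m:]:
        layer.apply(_reset)
    return model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model = reinit_top_encoder_layers(model, m=1)       # reinitialize only the last encoder layer
```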

Figure 3 shows the results of reinitialization fine-tuning (RIFT) for each language, and Table 5 presents the best results across all reinitialization settings. For every language, the results worsen steadily as the number of reinitialized layers increases: when m = 5, the ASR capability is completely lost, and reinitializing only the last encoder layer (m = 1) works best. However, comparing RIFT with FT, we find that reinitialization does not improve performance. This indicates that the reinitialization operation is ineffective and suggests that Whisper's multilingual, multi-task pretraining has already placed its parameters in a reasonable initialization space for ASR; randomly resetting them only degrades performance.

Fig. 3 WERs (%) for different numbers of reinitialized top layers per language

3.4 Fine-tuning with additional modules

The above fine-tuning methods have two main drawbacks. First, as shown in Table 1, the Large-v2 model has about 1.55 B parameters. Even if the bottom encoder layers are frozen, the number of remaining parameters still exceeds 1 B (each encoder layer holds about 19.87 M parameters). Therefore, both vanilla fine-tuning and fine-tuning with specific parameters require updating a large number of parameters, consuming considerable computational power and time. Second, adjusting the model parameters to optimize performance on the target task inevitably degrades the model's generality across the original tasks and languages, which is known as catastrophic forgetting.
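The parameter budget can be checked directly. The following sketch counts the parameters of one encoder block and the number that remains trainable when, for example, the bottom 20 encoder blocks are frozen; the attribute names assume the Hugging Face implementation, and the printed values are approximate.

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

per_layer = sum(p.numel() for p in model.model.encoder.layers[0].parameters())
total = sum(p.numel() for p in model.parameters())

n_frozen = 20
trainable = total - n_frozen * per_layer
print(f"params per encoder block: {per_layer / 1e6:.2f} M")                        # ~19.9 M
print(f"total params:             {total / 1e9:.2f} B")
print(f"trainable with {n_frozen} frozen encoder blocks: {trainable / 1e9:.2f} B")  # still > 1 B
```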

One effective way to address the above issues is to insert additional lightweight modules into the model [34]. During fine-tuning on the target language, only the parameters of the additional modules are updated, while the original model parameters are frozen. The updated parameters of the additional modules thus transfer effectively to the target language while the original functionality of the model is preserved. Moreover, because the additional modules are lightweight relative to a Transformer-based large model, the increase in parameter count is very small, typically around 1% of the pretrained model, which greatly reduces the number of parameters to be updated and enables parameter-efficient fine-tuning. To explore the advantages and disadvantages of parameter-efficient fine-tuning methods, we conducted experiments and analysis on two classes of additional modules: the bottleneck adapter (BA) [34] and LoRA [35].

For BA, we inserted it as the last module of every encoder and decoder layer. Figure 4 shows the insertion point and internal structure of BA. The input to the BA module first undergoes layer normalization (LN), is then down-projected to a smaller dimension d (MLP), passes through a GELU (Gaussian Error Linear Unit) activation, and is up-projected back to the original dimension (MLP); the input and output are connected by a residual connection (a minimal sketch of this block is given after Fig. 5). During fine-tuning with additional BA modules, we used a learning rate of 0.0003 and otherwise followed the same procedure as vanilla fine-tuning. We first explored the impact of different values of d (assuming the optimal d is independent of the language, we used only Icelandic for this analysis), with d = 32, 64, 128, 256. Figure 5 shows the impact of BA's intermediate dimension, with d = 64 achieving the best result.

Fig. 4 Schematic diagram of inserting an adapter into Whisper

Fig. 5 WER (%) for different values of d on Icelandic
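For illustration, a minimal PyTorch sketch of the adapter block described above (LN, down-projection to d, GELU, up-projection, residual connection) is given below; the class and variable names are illustrative rather than taken from a specific library, and the hidden dimension of 1280 corresponds to the Large-v2 width.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: LN -> down-project -> GELU -> up-project."""

    def __init__(self, hidden_dim: int = 1280, bottleneck_dim: int = 64):
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # down-sample to d
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # up-sample back

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        x = self.layer_norm(hidden_states)
        x = self.up(self.act(self.down(x)))
        return residual + x                                 # residual connection

# During BAFT, one such adapter is appended as the last module of every encoder
# and decoder layer, and only the adapter parameters are updated; wiring it into
# the Hugging Face model requires wrapping each layer's forward, omitted here.
```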

Then, with d = 64, we inserted BA after each layer of the encoder and decoder. Table 6 compares BA fine-tuning (BAFT) with PT and FT. Compared with FT, the average WER of BAFT increased by 1.63%; however, only about 0.8% of the original model's parameters were updated during fine-tuning, and the training time was reduced by a factor of 5.2. In other words, inserting BA slightly degrades Whisper's performance but greatly reduces the computational and time costs of training. Moreover, compared with PT, BAFT reduces the average WER by 52.91% relative, confirming that fine-tuning with BA remains an effective strategy. It should be noted that inserting BA increases the model's parameter count and therefore its inference time; nevertheless, owing to BA's lightweight nature, this increase is negligible, and BA both minimizes the number of parameter updates and mitigates catastrophic forgetting, which makes it highly practical.

Table 6 WERs (%) of 7 languages tested by PT, FT, BAFT, LoRAFT, and BAFT-T. The two columns, “Paras” and “Time,” represent the number of trainable parameters and the average training time, respectively

For fine-tuning with LoRA, we applied it to the query and value projections within the attention modules of each encoder and decoder layer, following the example in PEFT [36]. The rank was set to 32 and alpha to 64. We used a single 80 GB NVIDIA A100 GPU with a batch size of 4, and the remaining hyperparameters were identical to vanilla fine-tuning.
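This setting can be reproduced with the PEFT library roughly as follows. The sketch assumes the Hugging Face Whisper implementation, in which the query and value projections of every attention module are named q_proj and v_proj; other LoRA options are left at their defaults.

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

lora_config = LoraConfig(
    r=32,                                 # rank
    lora_alpha=64,                        # scaling factor alpha
    target_modules=["q_proj", "v_proj"],  # query/value projections in all attention blocks
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # on the order of 0.7% of the original parameters
```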

As shown in Table 6, the results of fine-tuning with LoRA (LoRAFT) are slightly inferior to those of BAFT. However, LoRAFT updates even fewer parameters, only 0.67% of the original model. Compared with PT, LoRAFT reduces the WER by 47.01% relative, indicating that this strategy is still effective. In addition, BAFT reduces training time by a factor of 5.2 and LoRAFT by a factor of 5.6. Therefore, as concluded in [29], the overall performance of the model is significantly enhanced by introducing lightweight additional modules.

The above results and analysis demonstrate that both fine-tuning with BA and fine-tuning with LoRA can improve model performance while significantly reducing the amount and cost of parameter updates. However, when compared with vanilla fine-tuning, they perform slightly worse. Therefore, it is reasonable to choose an appropriate fine-tuning strategy based on practical conditions.

Finally, we kept BA inserted in all decoder layers but inserted it only into the top p layers of the encoder, with p = 1, 2, …, 10. Figure 6 shows the results of BAFT with adapters only in the top encoder layers (BAFT-T). As the number of layers increases, performance gradually improves. Table 6 lists the best result for each language and the corresponding number of top layers. The optimal number of encoder layers for inserting BA differs across target languages; for most languages (Af, Be, Kk, Ne), the best results are achieved when p = 9, while Sw peaks at p = 8 and Mr at p = 10, both close to p = 9. Only Icelandic performs best at p = 6. We attribute this uniqueness to the amount of its training data: among the seven languages, only Icelandic has less than 3 h of training data. With so little data, BA is trained effectively when p = 6, but data and model no longer fit well when p > 6. Additionally, comparing BAFT-T with BAFT, inserting BA only in the top encoder layers performs on par with inserting it in all layers while further reducing the number of parameters to be updated, making it a more efficient approach.

Fig. 6 WERs (%) of BAFT with adapters inserted only in the top encoder layers

4 Conclusions

Multilingual and multi-task large speech models are becoming a popular paradigm for solving low-resource speech recognition problems, and fine-tuning with a small amount of data can effectively improve their performance on the target task. Starting from the limitations of current research, this paper explores the scope of performance improvement that fine-tuning can bring to Whisper through experiments and analysis of five fine-tuning strategies. We found that all fine-tuning strategies examined in this paper effectively improve Whisper's ASR performance on the target language. Among them, vanilla fine-tuning yields a large improvement, freezing the bottom encoder layers performs best, and reinitializing the top layers is ineffective. Fine-tuning with bottleneck adapters or LoRA significantly reduces computational and time costs while sacrificing only a small amount of recognition accuracy. These conclusions can be applied directly in downstream applications and engineering practice.

Comparing all the experimental results and analyses in this paper, fine-tuning with additional adapters offers the dual benefits of avoiding catastrophic forgetting and reducing training time, and will therefore be the focus of our next research. Moreover, similarity among languages can help further improve performance on the target language, making multilingual fine-tuning another promising avenue for exploration.