1 Introduction

Automatic Speech Recognition (ASR) technology plays a pivotal role in facilitating human-computer interaction by converting speech signals into text [1]. Indeed, ASR technology built on deep learning has made significant strides in recent years [2]. However, as demands for accuracy and robustness in ASR models continue to grow, meeting these requirements remains challenging. While hybrid models built on deep neural networks (DNNs) [3], encompassing acoustic, language, and lexical models, have improved recognition accuracy, they involve multiple modules and a tedious training procedure. Each module requires independent tuning, which can lead to cumulative errors in the overall system. In response to these challenges, the field of speech recognition has undergone a noteworthy shift from hybrid models towards end-to-end (E2E) models [4, 5]. An E2E speech recognition model employs a single network to directly transform input speech sequences into output token sequences. By merging the acoustic and language models of traditional speech recognition into a unified network, the E2E model effectively simplifies the structure of the speech recognition pipeline. This transition brings the advantage of streamlining the ASR model, reducing complexity, and potentially improving overall performance and robustness. As research continues in this direction, we can anticipate further advancements in ASR technology, ultimately catering to the increasing demands of diverse applications and enhancing the quality of life for users.

Currently, there are several research directions in the field of end-to-end speech recognition: the connectionist temporal classification (CTC) method [6,7,8], recurrent neural network transducers (RNN-T) [9], and attention-based encoder-decoder models (AED) [10]. These end-to-end (E2E) models treat automatic speech recognition (ASR) as a sequence-to-sequence problem, where a neural network directly learns the mapping from speech to text. The CTC method has been extensively researched owing to its straightforward modeling process, which involves only an encoder and outputs each token independently; its decoding is fast, but its recognition accuracy is often subpar because it assumes conditional independence between output tokens. The RNN-T model comprises two networks: an encoder that maps input acoustic frames to a higher-level representation, and a prediction-plus-joint network that forms the decoder [11]. The decoder network is autoregressive, relying on past predictions. However, RNN-T training can be unstable and memory-intensive, which limits training speed; as a result, the trained model may be less accurate and harder to deploy in practice. Many advanced ASR systems [12] are based on the AED model, which incorporates an encoder for encoding acoustic data and a decoder for generating the most likely word sequence or sentence. While this model considers both previously generated tokens and the acoustic context when producing tokens, it can introduce recognition latency. Moreover, the alignment estimated by the attention mechanism is vulnerable to noise corruption, as is common in real-world speech recognition tasks, resulting in subpar recognition performance. As ASR technology continues to evolve, researchers are actively exploring ways to enhance the capabilities of end-to-end speech recognition models, aiming to strike a balance between accuracy, efficiency, and robustness for practical applications.

The combined CTC-Attention model has emerged as the dominant approach for end-to-end speech recognition systems [13, 14]. This model utilizes a multi-task learning framework and is trained with both CTC and attention objectives. The architecture consists of a shared encoder, a CTC layer, and an attention decoder. The shared encoder employs transformer [15] or conformer [16] blocks to effectively learn local and global properties of the input speech sequences, enhancing the model's ability to capture relevant information. The CTC layer consists of a linear projection and a log-softmax layer, optimized with the CTC loss during training. The CTC layer operates in streaming mode for the first pass, allowing for real-time streaming results. The attention decoder, consisting of transformer blocks, generates improved contextual representations and is utilized for the second pass during decoding. The attention-based decoder re-scores the N-best candidate hypotheses in a teacher-forcing manner, enabling more precise results during decoding; the recognized phrases are then re-ranked based on these scores, further improving recognition accuracy. Researchers have found that combining the CTC loss with AED leads to faster training convergence and superior recognition results. As a result, this approach has become the standard reference scheme for training end-to-end speech recognition models. However, existing end-to-end speech recognition models face limitations in mining supervised information from vast amounts of unsupervised data. They primarily focus on the output of the last encoder layer and overlook inter-layer information. This leaves room for improvement in model characterization, data utilization, and model resilience. Continued research in these areas presents opportunities for advancing end-to-end speech recognition systems, ultimately leading to more powerful and efficient models that better exploit unsupervised data and improve recognition performance across applications.

Based on these developments, we propose an innovative end-to-end speech recognition model that combines multi-scale feature fusion with multi-view self-supervised learning. The model is trained using a hybrid strategy incorporating both supervised and self-supervised objectives. The primary focus of the model is on leveraging the inter-layer information of the shared encoder to enhance its characterization capability. By exploiting the diversity of this information, the model becomes more adept at representing speech data accurately. Additionally, the model incorporates multi-view self-supervised learning, which maximizes the utilization of data information and improves the model's resilience. This is achieved by creating various shared-encoder sub-models, each excluding some information, and then using multi-view self-supervised learning to effectively exploit the data. The shared encoder consists of multiple conformer blocks, allowing it to learn both local and global features of the input speech sequence. The multi-scale feature fusion (MFF) module plays a crucial role in the model: it assigns different weights to the outputs of the conformer blocks and combines these weighted outputs, which are then aggregated to form the final representation. The model's decoding process applies both the CTC and attention decoders to this representation. To validate the performance of the proposed model, we use WeNet [17, 18], a speech recognition toolkit, as the baseline and the Aishell-1 [19] dataset for training and testing; the model is then further evaluated on the English WSJ corpus. The experimental results demonstrate a significant reduction in character error rate and improved speech recognition performance compared to the baseline under four different decoding techniques. This confirms the effectiveness and potential of the proposed end-to-end speech recognition model, showcasing its capability to enhance recognition accuracy and performance.

2 Related Work

Based on different training objectives, SSL methods can be categorized into generative learning, discriminative learning, and multi-task learning. The research line of generative learning can be traced back to the auto-encoding model, which reconstructs the entire speech signal from continuous [20,21,22] or discrete [23] latent variables. Recent works propose to predict future frames from the history with an autoregressive model [24,25,26,27], or to recover masked frames from corrupted speech with a non-autoregressive model [28,29,30,31,32]. Apart from generative learning, discriminative learning has also gathered interest recently. Well-known examples include CPC [33], wav2vec [34], vq-wav2vec [35], wav2vec 2.0 [36], DiscreteBERT [37], and HuBERT [38]. However, self-supervised paradigms require careful design, and the learned representations can be difficult to interpret. There is no guarantee that the model will learn a "good" speech representation in terms of identifying the most valuable information.

Convolutional neural networks (CNNs) have proven to be a useful model for handling various visual tasks [39,40,41,42]. Despite their great success, CNNs still have limitations: they mainly focus on local spatial modeling, lack global context fusion, and therefore cannot handle long-range dependencies well. Recently, in the field of speech processing, ECAPA-TDNN [43] and its follow-up efforts [44, 45] achieved a significant breakthrough based on TDNN blocks and the squeeze-and-excitation (SE) [46] layer unified with Res2Block [47], providing an equal error rate of less than 1\(\%\) on the VoxCeleb1-O benchmark. Among these, MFA-Conformer [48], which is based on multi-scale feature fusion, has achieved remarkable results in speaker recognition. However, the application of multi-scale feature fusion to speech recognition tasks is still rare.

Inspired by these recent advancements, we propose an innovative end-to-end speech recognition model that combines multi-scale feature fusion with multi-view self-supervised learning. The model uses a mixed training strategy that encompasses both supervised and self-supervised learning methods.

3 The Overall Architecture of MM-ASR

Figure 1 depicts the overall layout of the end-to-end speech recognition model based on multi-view self-supervised learning and multi-scale feature fusion developed in this research. The model is built on a standard joint CTC-Attention architecture, with conformer blocks forming the shared encoder and a self-supervised loss constructed via contrastive learning. It also includes a self-attention based multi-scale feature fusion module, a CTC layer, and an attention decoder made up of transformer blocks.

Fig. 1 MM-ASR model architecture

3.1 Conformer Structure

The architecture of the network proposed in this study integrates both convolutional neural networks (CNNs) and the Transformer model to extract speech representations. While CNNs are effective at extracting local features, they often fall short in capturing global ones. The self-attention module, on the other hand, is proficient in capturing long-range global context dependencies, thereby compensating for the CNN's inability to capture global features; hence, the Transformer network is incorporated to tackle this shortcoming. The network configuration of the encoder used in this study is shown in Fig. 2, composed of N layers of identical Conformer blocks [16].

The network is organized as a stack of four modules, each employing a residual connection structure [49]. These modules include the feedforward module, the multi-head self-attention (MHSA) module, the convolutional module, and a second feedforward module. The MHSA and the convolution module represent the core components of the Conformer block. The MHSA utilizes the relative position encoding scheme as proposed in the Transformer-XL model [50], which encodes the input considering the relative position deviation. It takes into account both the global content offset and the global position offset.

Following the MHSA is the convolutional module, which comprises pointwise convolution, depthwise convolution, and GLU and Swish activation layers. To assist in learning local features and facilitate the training of deep models, a BatchNorm layer is placed after the convolutional layer. Mathematically, for the input \(x_i\) of Conformer block i, the output \(y_i\) of the block can be expressed as:

$$\begin{aligned}&{\widetilde{x}}_i=LN(x_i+\frac{1}{2}FFN(x_i)) \end{aligned}$$
(1)
$$\begin{aligned}&x^{\prime }_i=LN({\widetilde{x}}_i+MHSA({\widetilde{x}}_i)) \end{aligned}$$
(2)
$$\begin{aligned}&x^{\prime \prime }_i=LN(x^{\prime }_i+Conv(x^{\prime }_i)) \end{aligned}$$
(3)
$$\begin{aligned}&y_i=LN(x^{\prime \prime }_i+\frac{1}{2}FFN(x^{\prime \prime }_i)) \end{aligned}$$
(4)

where FFN refers to the feed-forward module, MHSA to the multi-head self-attention module, Conv to the convolution module, and LN to the layer normalization module.
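To make the block structure concrete, the following is a minimal PyTorch sketch of a Conformer block following Eqs. (1)-(4). It is an illustrative sketch rather than the WeNet implementation: the sizes d_model = 256, four heads, and 2048 feed-forward units mirror the configuration in Sect. 4.2, while the convolution kernel size of 15 is an assumed default, standard absolute attention stands in for the relative position encoding, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=15):
        super().__init__()
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)   # pointwise conv, then GLU
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)                       # BatchNorm after the conv layer
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x.transpose(1, 2)                  # (batch, d_model, time) for Conv1d
        x = nn.functional.glu(self.pointwise1(x), dim=1)
        x = nn.functional.silu(self.bn(self.depthwise(x)))      # Swish activation
        x = self.pointwise2(x)
        return x.transpose(1, 2)

class ConformerBlock(nn.Module):
    """Eqs. (1)-(4): half-step FFN, MHSA, convolution, half-step FFN, each with residual and LayerNorm."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048):
        super().__init__()
        self.ffn1 = FeedForward(d_model, d_ff)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ffn2 = FeedForward(d_model, d_ff)
        self.ln1, self.ln2, self.ln3, self.ln4 = (nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x):                                  # x: (batch, time, d_model)
        x = self.ln1(x + 0.5 * self.ffn1(x))               # Eq. (1)
        x = self.ln2(x + self.mhsa(x, x, x)[0])            # Eq. (2)
        x = self.ln3(x + self.conv(x))                     # Eq. (3)
        return self.ln4(x + 0.5 * self.ffn2(x))            # Eq. (4)
```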

Fig. 2 Conformer model structure diagram

3.2 Shared Encoder Based on Multi-view self-supervised Learning

Supervised learning is a deep learning approach that identifies a functional relationship between input and output by categorizing or regressing labeled data. However, it cannot fully exploit the data as it only learns from labeled data. In contrast, self-supervised learning is a potent technique for extracting applicable and generalizable latent representations from large volumes of unlabeled data. This approach is commonly employed in sequence-to-sequence (seq2seq) model pre-training and in facilitating downstream tasks [51,52,53]. Through auxiliary or pretext tasks, the network is trained to acquire representations that are beneficial for downstream tasks, mining its supervised knowledge from large-scale unsupervised data.

Based on the above analysis, this study designs a shared encoder leveraging multi-view self-supervised learning. Figure 3 illustrates the network structure of this encoder. The green section in Fig. 3 denotes the encoder, which employs N layers of identical Conformer blocks to capture speech features more efficiently. The units that are randomly dropped during the training phase are depicted in the blue portion of the multi-view self-supervised learning block. This block employs the dropout regularization technique [54] to construct two distinct encoder views, thereby reducing the model's generalization error. Dropout randomly discards some units in each layer of the neural network to prevent co-adaptation and overfitting. This study uses a self-supervised approach to regularize the output predictions of the sub-models, leveraging the structural randomness introduced by dropout. The outputs of the two encoder views are compared to extract more reliable characterization information. To better exploit the data and enhance the robustness of the model, the supervised loss is coupled with the self-supervised contrastive loss.

Given the shared encoder input \(x_i\), \(x_i\) is fed through the network's forward pass twice during each training step. As a result, two distributions of the shared encoder output are obtained, denoted \(P_1\left( y_i \mid x_i\right) \) and \(P_2\left( y_i \mid x_i\right) \). \(D_{k L}\left( P_1\left( y_i \mid x_i\right) \Vert P_2\left( y_i \mid x_i\right) \right) \) denotes the Kullback-Leibler (KL) divergence between \(P_1\left( y_i \mid x_i\right) \) and \(P_2\left( y_i \mid x_i\right) \). Since the dropout operation randomly discards units in the shared encoder, the two forward passes are carried out using two different views of the same encoder, as indicated previously. The self-supervised method utilized in this study then regularizes the model predictions during training by minimizing the bidirectional KL divergence between \(P_1\left( y_i \mid x_i\right) \) and \(P_2\left( y_i \mid x_i\right) \) for the same batch, as follows:

$$\begin{aligned} L_{K L}=\frac{1}{2}\left( D_{k L}\left( P_1\left( y_i \mid x_i\right) \Vert P_2\left( y_i \mid x_i\right) \right) +D_{k L}\left( P_2\left( y_i \mid x_i\right) \Vert P_1\left( y_i \mid x_i\right) \right) \right) \end{aligned}$$
(5)
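A minimal sketch of Eq. (5) is given below, assuming the two forward passes yield log-probability tensors log_p1 and log_p2 over the output tokens; the tensor shapes and the batch-mean reduction are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def bidirectional_kl_loss(log_p1: torch.Tensor, log_p2: torch.Tensor) -> torch.Tensor:
    """Eq. (5): symmetric KL divergence between two dropout-perturbed forward passes.

    log_p1, log_p2: log-probabilities over output tokens obtained by feeding the
    same batch through the shared encoder twice with dropout active.
    """
    kl_12 = F.kl_div(log_p2, log_p1, log_target=True, reduction="batchmean")  # D_KL(P1 || P2)
    kl_21 = F.kl_div(log_p1, log_p2, log_target=True, reduction="batchmean")  # D_KL(P2 || P1)
    return 0.5 * (kl_12 + kl_21)

# usage: two forward passes over the same batch give two encoder "views"
# log_p1 = model(feats).log_softmax(dim=-1)
# log_p2 = model(feats).log_softmax(dim=-1)
# l_kl = bidirectional_kl_loss(log_p1, log_p2)
```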
Fig. 3 Structure diagram of a shared encoder based on multi-view self-supervised learning

3.3 Multi-scale Feature Fusion Module

In existing speech recognition models, the diversity of information between different layers is often overlooked, limiting their ability to represent the data: when the final speech representation is extracted by the encoder, only the features output by the last layer are passed to the decoder. This study proposes an attention-based multi-scale feature fusion (MFF) module to address this issue by maximizing the utilization of inter-layer information, thereby enhancing the model's representational capability.

Based on this analysis, scale information is extracted by the conformer block of each layer in the shared encoder, and there are dependencies between the scale information of different layers. In this work, we explicitly model the dependencies between the conformer blocks using the proposed multi-scale feature fusion module. After learning these dependencies, we sum the outputs of the conformer blocks, using the scale information extracted from each layer to form N-dimensional features. This process yields acoustic features with stronger characterization power. The structure of this module is depicted in Fig. 4.

Fig. 4 MFF structure diagram

The implementation process of the multi-scale feature fusion module involves the following steps: The output from each conformer block is first combined into \(X \in {\mathbb {R}}^{C \times H \times W}\). After X is transformed into the matrix \(A \in {\mathbb {R}}^{C \times N}\) and subjected to transposition, matrix multiplication, and softmax operations, the attention map \(V \in {\mathbb {R}}^{C \times C}\) is produced:

$$\begin{aligned} v_{j i}=\frac{\exp \left( a_i \cdot a_j\right) }{\sum _{i=1}^C \exp \left( a_i \cdot a_j\right) } \end{aligned}$$
(6)

where \(v_{j i}\) denotes the impact of the ith conformer block on the jth conformer block. The attention map V is then multiplied with the matrix A, and the resulting output of dimension (C \(\times \) N) is reshaped into (C \(\times \) H \(\times \) W). After learning the dependency relationship, the result is multiplied by the scale factor \(\beta \), and an element-wise summation with X is carried out to generate the output of each conformer block \(Y \in {\mathbb {R}}^{C \times H \times W}\):

$$\begin{aligned} y_j=\beta \sum _{i=1}^C\left( v_{j i} a_i\right) +a_j \end{aligned}$$
(7)

where \(\beta \) is initialized to 0 and gradually learns to assign larger weights. In Eq. (7), the weighted sum of all conformer block output features plus the original output features of that block gives the resulting features of each conformer block after learning the dependencies. This module models the dependencies between different conformer blocks, which helps to obtain more robust speech representations. The final acoustic representation provided to the decoder is generated by aggregating the outputs of the conformer blocks after learning their dependencies, as in Eq. (8). Through this weighted summation and integration of information from multiple conformer blocks, the end-to-end speech recognition model can effectively represent and comprehend complex speech patterns, enhancing its overall capability to achieve accurate and robust speech recognition.

$$\begin{aligned} y_c=\sum _{j=1}^C y_j \end{aligned}$$
(8)

where \(y_c\) is the final output of the multi-scale feature fusion module.
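The following PyTorch sketch illustrates Eqs. (6)-(8), under the assumption that the C conformer-block outputs are stacked along a block ("channel") axis and that N = T \(\times \) D for frame length T and feature dimension D; the names and shapes are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MFF(nn.Module):
    """Attention-based multi-scale feature fusion over C conformer-block outputs (Eqs. 6-8)."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))    # scale factor beta, initialized to 0 (Eq. 7)

    def forward(self, block_outputs):
        # block_outputs: list of C tensors of shape (batch, T, D), one per conformer block
        x = torch.stack(block_outputs, dim=1)       # X: (batch, C, T, D)
        b, c, t, d = x.shape
        a = x.reshape(b, c, t * d)                  # A in R^{C x N}, with N = T * D
        energy = torch.bmm(a, a.transpose(1, 2))    # (batch, C, C): entries a_j . a_i
        v = torch.softmax(energy, dim=-1)           # Eq. (6): attention map, normalized over i
        y = torch.bmm(v, a).reshape(b, c, t, d)     # V A, reshaped back to (batch, C, T, D)
        y = self.beta * y + x                       # Eq. (7): weighted sum plus residual
        return y.sum(dim=1)                         # Eq. (8): y_c, aggregated over the C blocks
```

In this sketch, C equals the number of fused conformer blocks (12 when all blocks are fused, as in Sect. 4.2), and the aggregated output has the same (batch, T, D) shape expected by the decoders.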

3.4 Decoder

The Connectionist Temporal Classification (CTC) method, developed by Graves et al. [6], is a technique primarily used to address the problem of output alignment between labels and neural network predictions.

To determine the likelihood of the CTC target sequence, the CTC model takes into account all feasible alignment routes between the target sequence y and the input sequence x. This likelihood is specified as:

$$\begin{aligned} P(y \mid x)=\sum _{q \in \beta ^{-1}(y)} P(q \mid x) \end{aligned}$$
(9)

where q is one of the alignment paths, and \(\beta ^{-1}(y)\) is the set of all paths that map the input sequence to the output label sequence y. Equation (10) defines the CTC loss function as the negative log probability of obtaining the correct label sequence during training.

$$\begin{aligned} L_{C T C}=-\ln P(y \mid x) \end{aligned}$$
(10)

Therefore, the CTC method significantly simplifies the training and modeling processes for speech recognition models. In this study, we use the CTC model as one of the decoders. Its architecture comprises a linear layer followed by a log-softmax layer. During training, the CTC loss function is applied to the softmax output, which is computed from the shared encoder output after it passes through the MFF module.
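As an illustration of how the CTC branch can be attached to the fused encoder output, the sketch below uses torch.nn.CTCLoss; the projection layer, blank index, and vocabulary size are assumed placeholders rather than the exact WeNet configuration.

```python
import torch
import torch.nn as nn

vocab_size, d_model, blank_id = 4233, 256, 0             # illustrative sizes, not the exact setup
ctc_proj = nn.Linear(d_model, vocab_size)                # CTC linear layer
ctc_criterion = nn.CTCLoss(blank=blank_id, zero_infinity=True)

def ctc_branch_loss(enc_out, enc_lens, targets, target_lens):
    """enc_out: (batch, T, d_model) fused encoder output after the MFF module."""
    log_probs = ctc_proj(enc_out).log_softmax(dim=-1)    # log-softmax over the vocabulary
    # nn.CTCLoss expects log-probabilities of shape (T, batch, vocab)
    return ctc_criterion(log_probs.transpose(0, 1), targets, enc_lens, target_lens)
```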

The attention decoder in this paper is made up of several identical transformer blocks, in which a multi-head cross-attention (MHCA) module is added alongside the feed-forward and self-attention modules in order to attend over the output of the shared encoder after it passes through the MFF. The attention decoder uses relative position encoding, consistent with the shared encoder. Mathematically, the output \(y_i\) for input \(x_i\) of transformer block i in the attention decoder can be written as:

$$\begin{aligned}&x^{\prime }_i=LN(x_i+MHSA(x_i)) \end{aligned}$$
(11)
$$\begin{aligned}&x^{\prime \prime }_i=LN(x^{\prime }_i+MHCA(x^{\prime }_i,{\widetilde{y}})) \end{aligned}$$
(12)
$$\begin{aligned}&y_i=LN(x^{\prime \prime }_i+FFN(x^{\prime \prime }_i)) \end{aligned}$$
(13)

where \({\widetilde{y}}\) denotes the shared encoder output after the MFF. MHSA stands for the multi-head self-attention module, MHCA for the multi-head cross-attention module, FFN for the feed-forward module, and LN for the layer normalization module.
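A compact sketch of a decoder block implementing Eqs. (11)-(13) is shown below; nn.MultiheadAttention stands in for both MHSA and MHCA, and the relative position encoding and causal masking used in the actual model are omitted.

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Eqs. (11)-(13): self-attention, cross-attention over the MFF output, feed-forward."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mhca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out):                            # enc_out: MFF-fused encoder output
        x = self.ln1(x + self.mhsa(x, x, x)[0])               # Eq. (11)
        x = self.ln2(x + self.mhca(x, enc_out, enc_out)[0])   # Eq. (12): cross-attention
        return self.ln3(x + self.ffn(x))                      # Eq. (13)
```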

3.5 Multi-task Learning Paradigm

The model proposed in this study employs two supervised losses, namely the connectionist temporal classification (CTC) loss and the attention-based encoder-decoder (AED) loss, in addition to a self-supervised contrastive loss. The training process follows a hybrid end-to-end approach that combines supervised and self-supervised training. By integrating the CTC and AED losses into a single supervised loss, the model benefits from improved convergence while fully capturing token dependencies within the data. Equations (14) and (15) define the joint supervised and self-supervised losses, where x is the acoustic feature and y is the corresponding label. The CTC and attention decoder losses are denoted \(L_{C T C}(x, y)\) and \(L_{A E D}(x, y)\), respectively; \(\lambda \in (0,1)\) is the hyperparameter that balances the two supervised losses, and \(\lambda \) and \(\mu \) weigh the contributions of the supervised and self-supervised losses.

$$\begin{aligned}&L_S=\lambda L_{C T C}(x, y)+(1-\lambda ) L_{A E D}(x, y) \end{aligned}$$
(14)
$$\begin{aligned}&L=\lambda L_S+\mu L_{K L} \end{aligned}$$
(15)
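A minimal sketch of the combined objective in Eqs. (14) and (15) is given below; the default values \(\lambda = 0.3\) (the CTC training weight of Sect. 4.2) and \(\mu = 0.05\) (the best value in Table 3) are placeholders for the tunable hyperparameters.

```python
def joint_loss(l_ctc, l_aed, l_kl, lam=0.3, mu=0.05):
    """Eqs. (14)-(15): supervised CTC/AED loss plus the weighted self-supervised KL term."""
    l_s = lam * l_ctc + (1.0 - lam) * l_aed   # Eq. (14): joint supervised loss
    return lam * l_s + mu * l_kl              # Eq. (15): total training loss
```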

3.6 Analysis

Compared with supervised learning, self-supervised learning methods attempt to learn powerful contextual representations from audio data only, and then fine-tune the model on paired data. Currently, there are some pre-trained models that achieve excellent performance, but these require a large amount of external data and model parameters for training. Moreover, these models mainly address general representations for speech tasks. Specifically, models such as CPC and the wav2vec series use contrastive InfoNCE loss to distinguish between related positive samples and negative samples. Inspired by masked language model loss in NLP, DiscreteBERT and HuBERT predict discrete targets in masked regions. However, our method focuses on an end-to-end ASR model that requires only a small amount of labeled data for training and achieves excellent performance through the proposed multi-view contrastive self-supervised approach.

The multi-scale feature fusion network structure is relatively flexible, with no rigid boundary between levels. The receptive field of the high-level network is relatively large and its semantic representation ability is strong, but the resolution of its feature maps is low and its ability to represent geometric information is weak. The receptive field of the low-level network is relatively small and its ability to represent geometric detail is strong; although its resolution is high, its semantic representation ability is weak. By fusing deep and shallow features, a multi-scale feature fusion network makes it easier for the model to achieve significant results on complex tasks. Recent research has demonstrated the potential of speech models on full-stack speech tasks by using a weighted sum of embeddings from different layers, finding that different layers contain information useful for different tasks; for example, the top hidden states are useful for ASR, while the bottom layers are more effective for speaker verification. Therefore, this study proposes an attention-based multi-scale feature fusion (MFF) module to enhance the model's ability to represent information by maximizing inter-layer information utilization.

4 Performance Testing and Analysis

We first demonstrate our results on the Aishell-1 test dataset to gain a deeper understanding of our method. Subsequently, we further validate the effectiveness of the method on the English corpus WSJ (80-h). To evaluate the effectiveness of the multi-scale feature fusion method and the multi-view self-supervised learning module, we conducted ablation experiments to compare the differences. The performance of the model is evaluated based on the character error rate (CER).

4.1 Dataset

The Aishell company provides the Aishell-1 dataset, an open-source speech dataset that resamples high-fidelity microphone audio data to 16 kHz, 16-bit WAV format. The dataset consists of speech data from 400 speakers, representing diverse dialect regions in China, and covers a wide range of topics such as technology, sports, entertainment, current news, finance, and economics. The Aishell-1 dataset is divided into three sets: a training set with 340 speakers, containing 150 h of speech data, a validation set with 40 speakers, comprising 10 h of speech data, and a test set with 20 speakers, containing 5 h of speech data. In total, the dataset contains 165 h of speech data. The composition of the Aishell-1 dataset is detailed in Table 1. The test set consists of 7176 speech samples. For this project, the Aishell-1 dataset was utilized for both training and testing the proposed speech recognition model.

Table 1 Aishell-1 speech corpus composition structure

4.2 Experimental Setup

The test configuration for this experiment includes an AMD R9-3090X processor, 32 GB of RAM, and an NVIDIA RTX-3090 GPU graphics card. The software environment is a 64-bit Ubuntu 20.04 operating system running the Pytorch deep learning framework.

The input features consist of an 80-dimensional log-Mel filter bank (Fbank) computed with a 25-ms window and a 10-ms shift. We apply speed perturbation to the entire dataset at factors of 0.9, 1.0, and 1.1, yielding three speed variants. SpecAugment is applied with 2 frequency masks (maximum frequency mask F = 10) and 2 time masks (maximum time mask T = 50).
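For reference, a hedged torchaudio sketch of the described front end (80-dimensional Fbank with a 25-ms window and 10-ms shift, followed by SpecAugment-style masking) is shown below; speed perturbation is omitted, and the parameter names follow torchaudio rather than the authors' exact pipeline.

```python
import torch
import torchaudio

def extract_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """waveform: (1, num_samples) at 16 kHz -> (num_frames, 80) log-Mel filter bank features."""
    return torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
        sample_frequency=sample_rate)

# SpecAugment-style masking applied to the (freq, time) feature map during training
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=10)   # F = 10
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=50)        # T = 50

def spec_augment(feats: torch.Tensor) -> torch.Tensor:
    """feats: (num_frames, 80); apply 2 frequency masks and 2 time masks."""
    x = feats.t().unsqueeze(0)            # (1, freq, time) as expected by the transforms
    for _ in range(2):
        x = freq_mask(x)
        x = time_mask(x)
    return x.squeeze(0).t()
```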

To reduce the computational burden, two-dimensional convolutional down-sampling is employed at the front end of the shared encoder, with a kernel size of 3\(\times \)3 and a stride of 2, giving an overall subsampling factor of 4. The shared encoder comprises 12 conformer blocks, each with four attention heads, an attention dimension of 256, and a feed-forward dimension of 2048, consistent with the baseline model. The attention decoder includes six transformer blocks, each with four attention heads. The weight of the CTC branch is set to 0.3 during joint training and 0.5 during decoding. Gradient accumulation is used to stabilize training, with gradients updated every 4 batches [55]. To prevent overfitting, dropout and label smoothing regularization are applied to each conformer and transformer block. The Adam optimizer is used for training, with a learning rate schedule of 25,000 warm-up steps and an initial learning rate of 0.002. Additionally, we conducted experiments with the hyperparameter \(\mu \) set to 0, 0.01, 0.05, 0.1, 1, and 10, and with the number of MFF fusion layers set to 2, 3, 4, and 12.

4.3 Evaluation Metrics

In automatic speech recognition, the results are usually presented as sequences of words or phrases. Three types of errors can occur: insertion, deletion, and substitution errors. An insertion error adds an extra word to the recognition result; a deletion error omits a correct word from the recognition result; and a substitution error replaces a correct word with an incorrect one. In English, recognition accuracy is typically measured at the word level, and the error rate is referred to as the Word Error Rate (WER); the same metric is appropriate for other languages written with word boundaries, such as Russian. However, in Chinese, word segmentation is ambiguous, making it difficult to measure errors directly at the word level. Therefore, the Character Error Rate (CER) is commonly used as the evaluation metric for Chinese speech recognition, and similar languages such as Japanese also employ CER. As the Chinese speech dataset Aishell-1 is employed in this experiment, CER is used as the evaluation metric, computed as follows:

$$\begin{aligned} C E R=\frac{N_{D e l}+N_{S u b}+N_{I n s}}{N_{R e f}} \end{aligned}$$
(16)

where \(N_{S u b}\) is the number of substitution errors, \(N_{I n s}\) the number of insertion errors, \(N_{D e l}\) the number of deletion errors relative to the reference annotation, and \(N_{R e f}\) the total number of characters in the reference transcription. Because of insertion errors, CER can exceed 100\(\%\); its minimum value is 0.
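A simple sketch of Eq. (16) is shown below; the total number of deletions, substitutions, and insertions equals the Levenshtein distance between the reference and the hypothesis at the character level.

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate, Eq. (16): (N_Del + N_Sub + N_Ins) / N_Ref."""
    r, h = list(ref), list(hyp)
    # dp[i][j]: minimum number of edits to turn the first i reference characters
    # into the first j hypothesis characters
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# example: one substituted character in a 4-character reference gives CER = 0.25
assert abs(cer("今天天气", "今天天汽") - 0.25) < 1e-9
```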

4.4 Performance Testing and Analysis

The experiments for the multi-scale feature fusion module investigate the impact of fusing the output data from different numbers of conformer blocks on the model's recognition performance. The experimental results are summarized in Table 2. In the shared encoder, B6+B12 corresponds to fusing the outputs of the sixth and twelfth conformer blocks; B4+B8+B12 to fusing the outputs of the fourth, eighth, and twelfth blocks; and B3+B6+B9+B12 to fusing the outputs of the third, sixth, ninth, and twelfth blocks. "All blocks", as proposed in this work, denotes fusing the outputs of every conformer block in the shared encoder. This ablation focuses on the MFF module alone, without the addition of SSL. The results clearly show that recognition performance improves as the number of fused blocks increases: models fusing two, three, or four blocks perform worse than the model fusing all blocks, confirming the importance of incorporating the output data from all conformer blocks.

Table 2 Experimental results of different number of blocks (CER\(\%\))

In this study, experiments are carried out for the multi-view self-supervised learning module to examine the impact of the hyperparameter \(\mu \) on recognition performance. The experimental results are displayed in Table 3. The best performance is obtained when \(\mu =0.05\), indicating that balancing the self-supervised and supervised losses is crucial in joint training.

Table 3 Weight sensitivity study on \(\mu \)

In this study, ablation experiments are conducted to demonstrate the effectiveness of the MM-ASR model’s multi-scale feature fusion module and multi-view self-supervised learning method. The experimental results are displayed in Table 4. The baseline model is the original WeNet model, with the decoder trained in supervised learning mode using features from the network’s final layer. The MM-ASR model, proposed in this paper, incorporates both the multi-scale feature fusion module and multi-view self-supervised learning method. Two additional variants are also evaluated: -SSL, which is the MM-ASR model with the multi-view self-supervised learning method eliminated, and -MFF, which is the MM-ASR model with the multi-scale feature fusion module removed. The experimental results demonstrate the efficacy of both multi-scale feature fusion and multi-view self-supervised learning. The MM-ASR model, which combines supervised and self-supervised losses for training and focuses on interlayer information, exhibits improved model resilience and achieves a lower character error rate (CER) compared to the original WeNet model. The proposed approach leads to a significant enhancement in voice recognition ability, reducing the character error rate by approximately 4.6\(\%\) when compared to the baseline. This demonstrates the effectiveness of the multi-scale feature fusion and multi-view self-supervised learning techniques in improving the performance of the end-to-end speech recognition model.

Table 4 Ablation study of the MM-ASR (CER\(\%\))

Table 5 presents a comparison of the Character Error Rate (CER) results between the MM-ASR model proposed in this study and several widely available models on the Aishell-1 test dataset. The models used for comparison include CTC/Attention [56], CAT [57], ESPnet [58], BAT [59], Paraformer [60], UMA [61] and WeNet [17, 18]. All assessment results in the paper are rounded to two decimal places for consistency. The findings in Table 5 demonstrate that the MM-ASR model outperforms the other models, indicating its superior performance in terms of speech recognition accuracy. This clearly demonstrates the effectiveness of multi-scale feature fusion and self-supervised learning within a single neural network. The experimental outcomes provide strong evidence supporting the effectiveness and usefulness of the proposed MM-ASR model for end-to-end speech recognition tasks, confirming its superiority compared to publicly available models like CTC/Attention, CAT, ESPnet, BAT, Paraformer, UMA and WeNet.

Table 5 Experimental results on the Aishell-1 test dataset (CER\(\%\))

Table 6 shows a comparison of character error rate (CER) results between the MM-ASR model proposed in this study and several widely available models on the English corpus WSJ (80-h). The models used for comparison include CTC/attention, CAT, ESPnet, LF-MMI [62], CTC-CRF ST-NAS [63], Wav2letter++ [64], and WeNet. The results in Table 6 demonstrate that on the English corpus WSJ, the MM-ASR model still outperforms other models.

Table 6 Experimental results on the WSJ dataset (CER\(\%\))

5 Conclusion

In this paper, a combination of supervised and self-supervised training techniques is leveraged to construct and train an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning. The proposed method emphasizes the use of inter-layer information in a shared encoder to improve the model's ability to represent and process speech data. A self-supervised contrastive loss is introduced in the shared encoder to increase the model's robustness, and the model is trained by combining supervised and self-supervised losses. Ablation experiments on the multi-view self-supervised learning component and the multi-scale feature fusion module demonstrate their respective contributions to recognition performance. Further experiments examine the impact of fusing different numbers of conformer blocks and of the hyperparameter \(\mu \) that balances the self-supervised and supervised losses. The proposed technique is assessed on the Aishell-1 dataset and further validated on the English WSJ corpus. The experimental findings demonstrate that the strategy enhances the recognition performance of the speech recognition model.