1 Introduction

Automatic Speech Recognition (ASR) technology plays a pivotal role in facilitating human-computer interaction by converting speech signals into text [1]. Indeed, ASR technology built on deep learning has made significant strides in recent years [2]. However, as demands for accuracy and robustness in ASR models continue to grow, meeting these requirements remains challenging. While hybrid models built on deep neural networks (DNNs) [3], encompassing acoustic, language, and lexical models, have improved recognition accuracy, they involve multiple modules and a tedious training procedure. Each module requires independent tuning, which can lead to cumulative errors in the overall system. In response to these challenges, the field of speech recognition has undergone a noteworthy shift from hybrid models towards end-to-end (E2E) models [4, 5]. An E2E speech recognition model employs a single network to directly transform input speech sequences into output token sequences. By merging the acoustic and language models of traditional speech recognition into a unified network, the E2E model effectively simplifies the structure of the speech recognition pipeline. This transition brings the advantage of streamlining the ASR model, reducing complexity, and potentially improving overall performance and robustness. As research continues in this direction, we can anticipate further advancements in ASR technology, ultimately catering to the increasing demands of diverse applications and enhancing the quality of life for users.

Currently, there are several research directions in the field of end-to-end speech recognition: the connectionist temporal classification (CTC) method [6,7,8], recurrent neural network transducers (RNN-T) [9], and attention-based encoder-decoder models (AED) [10]. These end-to-end (E2E) models treat automatic speech recognition (ASR) as a sequence-to-sequence problem, where a neural network directly learns the mapping from speech to text. The CTC method has been extensively researched owing to its straightforward modeling process, which involves only an encoder and outputs each token independently; its decoding is fast, but its recognition accuracy is often subpar because it assumes conditional independence between output tokens. The RNN-T model comprises two networks: an encoder that maps input acoustic frames to a higher-level representation, and a prediction-plus-joint network that forms the decoder [11]. The decoder network is autoregressive, relying on past predictions. However, RNN-T training can be unstable and memory-intensive, which limits training speed; as a result, the trained model may be less accurate and harder to deploy in practice. Many advanced ASR systems [12] are based on the AED model, which incorporates an encoder for encoding acoustic data and a decoder for generating the most likely word sequence or sentence. While this model considers both previously generated tokens and the acoustic context when producing tokens, it can introduce recognition latency. Moreover, the alignment estimated by the attention mechanism is vulnerable to noise corruption, as is common in real-world speech recognition tasks, resulting in subpar recognition performance. As ASR technology continues to evolve, researchers are actively exploring ways to enhance the capabilities of end-to-end speech recognition models, aiming to strike a balance between accuracy, efficiency, and robustness for practical applications.

The combined CTC-Attention model has emerged as the dominant approach for end-to-end speech recognition systems [13, 14]. This model utilizes a multi-task learning framework and is trained with both CTC and attention objectives. The architecture consists of a shared encoder, a CTC layer, and an attention decoder. The shared encoder employs transformer [15] or conformer [16] blocks to effectively learn local and global properties of the input speech sequences, enhancing the model's ability to capture relevant information. The CTC layer consists of a linear projection and a log-softmax layer, optimized with the CTC loss during training. The CTC layer operates in streaming mode for the first pass, allowing for real-time streaming results. The attention decoder, consisting of transformer blocks, generates improved contextual representations and is utilized for the second pass during decoding. The attention-based decoder re-scores the N-best candidate hypotheses in a teacher-forcing manner, enabling more precise results during decoding; the recognized phrases are then re-ranked based on these scores, further improving recognition accuracy. Researchers have found that combining the CTC loss with AED leads to faster training convergence and superior recognition results. As a result, this approach has become the standard reference scheme for training end-to-end speech recognition models. However, existing end-to-end speech recognition models face limitations in mining supervised information from vast amounts of unsupervised data. They primarily focus on the output of the last encoder layer and overlook inter-layer information. This leaves room for improvement in model characterization, data utilization, and model resilience. Continued research in these areas presents opportunities for advancing end-to-end speech recognition systems, ultimately leading to more powerful and efficient models that better exploit unsupervised data and improve recognition performance across applications.

Based on these developments, we propose an innovative end-to-end speech recognition model that combines multi-scale feature fusion with multi-view self-supervised learning. The model is trained using a hybrid strategy incorporating both supervised and self-supervised objectives. The primary focus of the model is on leveraging the inter-layer information of the shared encoder to enhance its characterization capability. By exploiting the diversity of this information, the model becomes more adept at representing speech data accurately. Additionally, the model incorporates multi-view self-supervised learning, which maximizes the utilization of data information and improves the model's resilience. This is achieved by creating various shared-encoder sub-models, each excluding some information, and then using multi-view self-supervised learning to effectively exploit the data. The shared encoder consists of multiple conformer blocks, allowing it to learn both local and global features of the input speech sequence. The multi-scale feature fusion (MFF) module plays a crucial role in the model: it assigns different weights to the outputs of the conformer blocks and combines these weighted outputs, which are then aggregated to form the final representation. The model's decoding process applies both the CTC and attention decoders to this representation. To validate the performance of the proposed model, we use WeNet [17, 18], a speech recognition toolkit, as the baseline and the Aishell-1 [19] dataset for training and testing; the model is then further evaluated on the English WSJ corpus. The experimental results demonstrate a significant reduction in character error rate and improved speech recognition performance compared to the baseline under four different decoding techniques. This confirms the effectiveness and potential of the proposed end-to-end speech recognition model, showcasing its capability to enhance recognition accuracy and performance.

2 Related Work

Based on different training objectives, SSL methods can be categorized into generative learning, discriminative learning, and multi-task learning. The research line of generative learning can be traced back to the auto-encoding model, which reconstructs the entire speech signal from continuous [20,21,22] or discrete [23] latent variables. Recent works propose to predict future frames from the history with an autoregressive model [24,25,26,27], or to recover masked frames from corrupted speech with a non-autoregressive model [28,29,30,31,32]. Apart from generative learning, discriminative learning has also gathered interest recently. Well-known examples include CPC [33], wav2vec [34], vq-wav2vec [35], wav2vec 2.0 [36], DiscreteBERT [37], and HuBERT [38]. However, self-supervised paradigms require careful design, and the learned representations can be difficult to interpret. There is no guarantee that the model will learn a "good" speech representation in terms of identifying the most valuable information.

Convolutional neural networks (CNNs) have proven to be a useful model for handling various visual tasks [39,40,41,42]. Despite their great success, CNNs still have limitations: they mainly focus on local spatial modeling, lack global context fusion, and therefore cannot handle long-range dependencies well. Recently, in the field of speech processing, ECAPA-TDNN [43] and its follow-up efforts [44, 45] achieved a significant breakthrough based on TDNN blocks and the squeeze-and-excitation (SE) [46] layer unified with Res2Block [47], providing an equal error rate of less than 1\(\%\) on the VoxCeleb1-O benchmark. Among these, MFA-Conformer [48], which is based on multi-scale feature fusion, has achieved remarkable results in speaker recognition. However, the application of multi-scale feature fusion to speech recognition tasks is still rare.

Inspired by these recent advancements, we propose an innovative end-to-end speech recognition model that combines multi-scale feature fusion with multi-view self-supervised learning. The model uses a mixed training strategy that encompasses both supervised and self-supervised learning methods.

3 The Overall Architecture of MM-ASR

Figure 1 depicts the overall layout of the end-to-end speech recognition model based on multi-view self-supervised learning and multi-scale feature fusion developed in this research. The model is built on a standard joint CTC-Attention architecture, with conformer blocks forming the shared encoder and a self-supervised loss constructed via contrastive learning. It also includes a self-attention based multi-scale feature fusion module, a CTC layer, and an attention decoder made up of transformer blocks.

Fig. 1 MM-ASR model architecture

3.1 Conformer Structure

The architecture of the network proposed in this study integrates both convolutional neural networks (CNNs) and the Transformer model to extract speech representations. While CNNs are effective at extracting local features, they often fall short in capturing global ones. The self-attention module, on the other hand, is proficient in capturing long-range global context dependencies, thereby compensating for the CNN's inability to capture global features; hence, the Transformer network is incorporated to tackle this shortcoming. The network configuration of the encoder used in this study is shown in Fig. 2, composed of N layers of identical Conformer blocks [16].

The network is organized as a stack of four modules, each employing a residual connection structure [49]. These modules include the feedforward module, the multi-head self-attention (MHSA) module, the convolutional module, and a second feedforward module. The MHSA and the convolution module represent the core components of the Conformer block. The MHSA utilizes the relative position encoding scheme as proposed in the Transformer-XL model [50], which encodes the input considering the relative position deviation. It takes into account both the global content offset and the global position offset.

Following the MHSA is the convolutional module, which comprises pointwise convolution, depthwise convolution, and GLU and Swish activation layers. To assist in learning local features and facilitate the training of deep models, a BatchNorm layer is placed after the convolutional layer. Mathematically, for the input \(x_i\) of Conformer block i, the output \(y_i\) of the block can be expressed as:

$$\begin{aligned}&{\widetilde{x}}_i=LN(x_i+\frac{1}{2}FFN(x_i)) \end{aligned}$$
(1)
$$\begin{aligned}&x^{\prime }_i=LN({\widetilde{x}}_i+MHSA({\widetilde{x}}_i)) \end{aligned}$$
(2)
$$\begin{aligned}&x^{\prime \prime }_i=LN(x^{\prime }_i+Conv(x^{\prime }_i)) \end{aligned}$$
(3)
$$\begin{aligned}&y_i=LN(x^{\prime \prime }_i+\frac{1}{2}FFN(x^{\prime \prime }_i)) \end{aligned}$$
(4)

where FFN refers to the feed-forward module, MHSA to the multi-head self-attention module, Conv to the convolution module, and LN to the layer normalization module.
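To make the block structure concrete, the following is a minimal PyTorch sketch of a Conformer block following Eqs. (1)-(4). It is an illustrative sketch rather than the WeNet implementation: the sizes d_model = 256, four heads, and 2048 feed-forward units mirror the configuration in Sect. 4.2, while the convolution kernel size of 15 is an assumed default, standard absolute attention stands in for the relative position encoding, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=15):
        super().__init__()
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)   # pointwise conv, then GLU
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)                       # BatchNorm after the conv layer
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x.transpose(1, 2)                  # (batch, d_model, time) for Conv1d
        x = nn.functional.glu(self.pointwise1(x), dim=1)
        x = nn.functional.silu(self.bn(self.depthwise(x)))      # Swish activation
        x = self.pointwise2(x)
        return x.transpose(1, 2)

class ConformerBlock(nn.Module):
    """Eqs. (1)-(4): half-step FFN, MHSA, convolution, half-step FFN, each with residual and LayerNorm."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048):
        super().__init__()
        self.ffn1 = FeedForward(d_model, d_ff)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ffn2 = FeedForward(d_model, d_ff)
        self.ln1, self.ln2, self.ln3, self.ln4 = (nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x):                                  # x: (batch, time, d_model)
        x = self.ln1(x + 0.5 * self.ffn1(x))               # Eq. (1)
        x = self.ln2(x + self.mhsa(x, x, x)[0])            # Eq. (2)
        x = self.ln3(x + self.conv(x))                     # Eq. (3)
        return self.ln4(x + 0.5 * self.ffn2(x))            # Eq. (4)
```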

Fig. 2 Conformer model structure diagram

3.2 Shared Encoder Based on Multi-view self-supervised Learning

Supervised learning is a deep learning approach that identifies a functional relationship between input and output by categorizing or regressing labeled data. However, it cannot fully exploit the data as it only learns from labeled data. In contrast, self-supervised learning is a potent technique for extracting applicable and generalizable latent representations from large volumes of unlabeled data. This approach is commonly employed in sequence-to-sequence (seq2seq) model pre-training and in facilitating downstream tasks [51,52,53]. Through auxiliary or pretext tasks, the network is trained to acquire representations that are beneficial for downstream tasks, mining its supervised knowledge from large-scale unsupervised data.

Based on the above analysis, this study designs a shared encoder leveraging multi-view self-supervised learning. Figure 3 illustrates the network structure of this encoder. The green section in Fig. 3 denotes the encoder, which employs N layers of identical Conformer blocks to capture speech features more efficiently. The units that are randomly dropped during the training phase are depicted in the blue portion of the multi-view self-supervised learning block. This block employs the dropout regularization technique [54] to construct two distinct encoder views, thereby reducing the model's generalization error. Dropout randomly discards some units in each layer of the neural network to prevent co-adaptation and overfitting. This study uses a self-supervised approach to regularize the output predictions of the sub-models, leveraging the structural randomness introduced by dropout. The outputs of the two encoder views are compared to extract more reliable characterization information. To better exploit the data and enhance the robustness of the model, the supervised loss is coupled with the self-supervised contrastive loss.

Given the shared encoder input \(x_i\), \(x_i\) is fed through the network's forward pass twice during each training step. As a result, two distributions of the shared encoder output are obtained, denoted \(P_1\left( y_i \mid x_i\right) \) and \(P_2\left( y_i \mid x_i\right) \). \(D_{k L}\left( P_1\left( y_i \mid x_i\right) \Vert P_2\left( y_i \mid x_i\right) \right) \) denotes the Kullback-Leibler (KL) divergence between \(P_1\left( y_i \mid x_i\right) \) and \(P_2\left( y_i \mid x_i\right) \). Since the dropout operation randomly discards units in the shared encoder, the two forward passes are carried out using two different views of the same encoder, as indicated previously. The self-supervised method utilized in this study then regularizes the model predictions during training by minimizing the bidirectional KL divergence between \(P_1\left( y_i \mid x_i\right) \) and \(P_2\left( y_i \mid x_i\right) \) for the same batch, as follows:

$$\begin{aligned} L_{K L}=\frac{1}{2}\left( D_{k L}\left( P_1\left( y_i \mid x_i\right) \Vert P_2\left( y_i \mid x_i\right) \right) +D_{k L}\left( P_2\left( y_i \mid x_i\right) \Vert P_1\left( y_i \mid x_i\right) \right) \right) \end{aligned}$$
(5)
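A minimal sketch of Eq. (5) is given below, assuming the two forward passes yield log-probability tensors log_p1 and log_p2 over the output tokens; the tensor shapes and the batch-mean reduction are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def bidirectional_kl_loss(log_p1: torch.Tensor, log_p2: torch.Tensor) -> torch.Tensor:
    """Eq. (5): symmetric KL divergence between two dropout-perturbed forward passes.

    log_p1, log_p2: log-probabilities over output tokens obtained by feeding the
    same batch through the shared encoder twice with dropout active.
    """
    kl_12 = F.kl_div(log_p2, log_p1, log_target=True, reduction="batchmean")  # D_KL(P1 || P2)
    kl_21 = F.kl_div(log_p1, log_p2, log_target=True, reduction="batchmean")  # D_KL(P2 || P1)
    return 0.5 * (kl_12 + kl_21)

# usage: two forward passes over the same batch give two encoder "views"
# log_p1 = model(feats).log_softmax(dim=-1)
# log_p2 = model(feats).log_softmax(dim=-1)
# l_kl = bidirectional_kl_loss(log_p1, log_p2)
```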
Fig. 3 Structure diagram of a shared encoder based on multi-view self-supervised learning

3.3 Multi-scale Feature Fusion Module

In existing speech recognition models, the diversity of information between different layers is often overlooked, limiting their ability to represent the data: when the final speech representation is extracted by the encoder, only the features output by the last layer are passed to the decoder. This study proposes an attention-based multi-scale feature fusion (MFF) module to address this issue by maximizing the utilization of inter-layer information, thereby enhancing the model's representational capability.

Based on this analysis, scale information is extracted by the conformer block of each layer in the shared encoder, and there are dependencies between the scale information of different layers. In this work, we explicitly model the dependencies between the conformer blocks using the proposed multi-scale feature fusion module. After learning these dependencies, we sum the outputs of the conformer blocks, using the scale information extracted from each layer to form N-dimensional features. This process yields acoustic features with stronger characterization power. The structure of this module is depicted in Fig. 4.

Fig. 4 MFF structure diagram

The implementation process of the multi-scale feature fusion module involves the following steps: The output from each conformer block is first combined into \(X \in {\mathbb {R}}^{C \times H \times W}\). After X is transformed into the matrix \(A \in {\mathbb {R}}^{C \times N}\) and subjected to transposition, matrix multiplication, and softmax operations, the attention map \(V \in {\mathbb {R}}^{C \times C}\) is produced:

$$\begin{aligned} v_{j i}=\frac{\exp \left( a_i \cdot a_j\right) }{\sum _{i=1}^C \exp \left( a_i \cdot a_j\right) } \end{aligned}$$
(6)

where \(v_{j i}\) denotes the impact of the ith conformer block on the jth conformer block. The attention map V is then multiplied with the matrix A, and the resulting output of dimension (C \(\times \) N) is reshaped into (C \(\times \) H \(\times \) W). After learning the dependency relationship, the result is multiplied by the scale factor \(\beta \), and an element-wise summation with X is carried out to generate the output of each conformer block \(Y \in {\mathbb {R}}^{C \times H \times W}\):

$$\begin{aligned} y_j=\beta \sum _{i=1}^C\left( v_{j i} a_i\right) +a_j \end{aligned}$$
(7)

where \(\beta \) is initialized to 0 and gradually learns to assign larger weights. In Eq. (7), the weighted sum of all conformer block output features plus the original output features of that block gives the resulting features of each conformer block after learning the dependencies. This module models the dependencies between different conformer blocks, which helps to obtain more robust speech representations. The final acoustic representation provided to the decoder is generated by aggregating the outputs of the conformer blocks after learning their dependencies, as in Eq. (8). Through this weighted summation and integration of information from multiple conformer blocks, the end-to-end speech recognition model can effectively represent and comprehend complex speech patterns, enhancing its overall capability to achieve accurate and robust speech recognition.

$$\begin{aligned} y_c=\sum _{j=1}^C y_j \end{aligned}$$
(8)

where \(y_c\) is the final output of the multi-scale feature fusion module.
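The following PyTorch sketch illustrates Eqs. (6)-(8), under the assumption that the C conformer-block outputs are stacked along a block ("channel") axis and that N = T \(\times \) D for frame length T and feature dimension D; the names and shapes are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MFF(nn.Module):
    """Attention-based multi-scale feature fusion over C conformer-block outputs (Eqs. 6-8)."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))    # scale factor beta, initialized to 0 (Eq. 7)

    def forward(self, block_outputs):
        # block_outputs: list of C tensors of shape (batch, T, D), one per conformer block
        x = torch.stack(block_outputs, dim=1)       # X: (batch, C, T, D)
        b, c, t, d = x.shape
        a = x.reshape(b, c, t * d)                  # A in R^{C x N}, with N = T * D
        energy = torch.bmm(a, a.transpose(1, 2))    # (batch, C, C): entries a_j . a_i
        v = torch.softmax(energy, dim=-1)           # Eq. (6): attention map, normalized over i
        y = torch.bmm(v, a).reshape(b, c, t, d)     # V A, reshaped back to (batch, C, T, D)
        y = self.beta * y + x                       # Eq. (7): weighted sum plus residual
        return y.sum(dim=1)                         # Eq. (8): y_c, aggregated over the C blocks
```

In this sketch, C equals the number of fused conformer blocks (12 when all blocks are fused, as in Sect. 4.2), and the aggregated output has the same (batch, T, D) shape expected by the decoders.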

3.4 Decoder

The Connectionist Temporal Classification (CTC) method, developed by Graves et al. [6], is a technique primarily used to address the problem of output alignment between labels and neural network predictions.

To determine the likelihood of the CTC target sequence, the CTC model takes into account all feasible alignment routes between the target sequence y and the input sequence x. This likelihood is specified as:

$$\begin{aligned} P(y \mid x)=\sum _{q \in \beta ^{-1}(y)} P(q \mid x) \end{aligned}$$
(9)

where q is one of the alignment paths, and \(\beta ^{-1}(y)\) is the set of all paths that map the input sequence to the output label sequence y. Equation (10) defines the CTC loss function as the negative log probability of obtaining the correct label sequence during training.

$$\begin{aligned} L_{C T C}=-\ln P(y \mid x) \end{aligned}$$
(10)

Therefore, the CTC method significantly simplifies the training and modeling processes for speech recognition models. In this study, we use the CTC model as one of the decoders. Its architecture comprises a linear layer followed by a log-softmax layer. During training, the CTC loss function is applied to the softmax output, which is computed from the shared encoder output after it passes through the MFF module.
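As an illustration of how the CTC branch can be attached to the fused encoder output, the sketch below uses torch.nn.CTCLoss; the projection layer, blank index, and vocabulary size are assumed placeholders rather than the exact WeNet configuration.

```python
import torch
import torch.nn as nn

vocab_size, d_model, blank_id = 4233, 256, 0             # illustrative sizes, not the exact setup
ctc_proj = nn.Linear(d_model, vocab_size)                # CTC linear layer
ctc_criterion = nn.CTCLoss(blank=blank_id, zero_infinity=True)

def ctc_branch_loss(enc_out, enc_lens, targets, target_lens):
    """enc_out: (batch, T, d_model) fused encoder output after the MFF module."""
    log_probs = ctc_proj(enc_out).log_softmax(dim=-1)    # log-softmax over the vocabulary
    # nn.CTCLoss expects log-probabilities of shape (T, batch, vocab)
    return ctc_criterion(log_probs.transpose(0, 1), targets, enc_lens, target_lens)
```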

The attention decoder in this paper is made up of several identical transformer blocks, in which a multi-head cross-attention (MHCA) module is added alongside the feed-forward and self-attention modules in order to attend over the output of the shared encoder after it passes through the MFF. The attention decoder uses relative position encoding, consistent with the shared encoder. Mathematically, the output \(y_i\) for input \(x_i\) of transformer block i in the attention decoder can be written as:

$$\begin{aligned}&x^{\prime }_i=LN(x_i+MHSA(x_i)) \end{aligned}$$
(11)
$$\begin{aligned}&x^{\prime \prime }_i=LN(x^{\prime }_i+MHCA(x^{\prime }_i,{\widetilde{y}})) \end{aligned}$$
(12)
$$\begin{aligned}&y_i=LN(x^{\prime \prime }_i+FFN(x^{\prime \prime }_i)) \end{aligned}$$
(13)

where \({\widetilde{y}}\) denotes the shared encoder output after the MFF. MHSA stands for the multi-head self-attention module, MHCA for the multi-head cross-attention module, FFN for the feed-forward module, and LN for the layer normalization module.
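A compact sketch of a decoder block implementing Eqs. (11)-(13) is shown below; nn.MultiheadAttention stands in for both MHSA and MHCA, and the relative position encoding and causal masking used in the actual model are omitted.

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Eqs. (11)-(13): self-attention, cross-attention over the MFF output, feed-forward."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mhca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out):                            # enc_out: MFF-fused encoder output
        x = self.ln1(x + self.mhsa(x, x, x)[0])               # Eq. (11)
        x = self.ln2(x + self.mhca(x, enc_out, enc_out)[0])   # Eq. (12): cross-attention
        return self.ln3(x + self.ffn(x))                      # Eq. (13)
```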

3.5 Multi-task Learning Paradigm

The model proposed in this study employs two supervised losses, namely the connectionist temporal classification (CTC) loss and the attention-based encoder-decoder (AED) loss, in addition to a self-supervised contrastive loss. The training process follows a hybrid end-to-end approach that combines supervised and self-supervised training. By integrating the CTC and AED losses into a single supervised loss, the model benefits from improved convergence while fully capturing token dependencies within the data. Equations (14) and (15) define the joint supervised and self-supervised losses, where x is the acoustic feature and y is the corresponding label. The CTC and attention decoder losses are denoted \(L_{C T C}(x, y)\) and \(L_{A E D}(x, y)\), respectively; \(\lambda \in (0,1)\) is the hyperparameter that balances the two supervised losses, and \(\lambda \) and \(\mu \) weigh the contributions of the supervised and self-supervised losses.

$$\begin{aligned}&L_S=\lambda L_{C T C}(x, y)+(1-\lambda ) L_{A E D}(x, y) \end{aligned}$$
(14)
$$\begin{aligned}&L=\lambda L_S+\mu L_{K L} \end{aligned}$$
(15)
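A minimal sketch of the combined objective in Eqs. (14) and (15) is given below; the default values \(\lambda = 0.3\) (the CTC training weight of Sect. 4.2) and \(\mu = 0.05\) (the best value in Table 3) are placeholders for the tunable hyperparameters.

```python
def joint_loss(l_ctc, l_aed, l_kl, lam=0.3, mu=0.05):
    """Eqs. (14)-(15): supervised CTC/AED loss plus the weighted self-supervised KL term."""
    l_s = lam * l_ctc + (1.0 - lam) * l_aed   # Eq. (14): joint supervised loss
    return lam * l_s + mu * l_kl              # Eq. (15): total training loss
```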

3.6 Analysis

Compared with supervised learning, self-supervised learning methods attempt to learn powerful contextual representations from audio data only, and then fine-tune the model on paired data. Currently, there are some pre-trained models that achieve excellent performance, but these require a large amount of external data and model parameters for training. Moreover, these models mainly address general representations for speech tasks. Specifically, models such as CPC and the wav2vec series use contrastive InfoNCE loss to distinguish between related positive samples and negative samples. Inspired by masked language model loss in NLP, DiscreteBERT and HuBERT predict discrete targets in masked regions. However, our method focuses on an end-to-end ASR model that requires only a small amount of labeled data for training and achieves excellent performance through the proposed multi-view contrastive self-supervised approach.

The multi-scale feature fusion network structure is relatively flexible, with no rigid boundary between levels. The receptive field of the high-level network is relatively large and its semantic representation ability is strong, but the resolution of its feature maps is low and its ability to represent geometric information is weak. The receptive field of the low-level network is relatively small and its ability to represent geometric detail is strong; although its resolution is high, its semantic representation ability is weak. By fusing deep and shallow features, a multi-scale feature fusion network makes it easier for the model to achieve significant results on complex tasks. Recent research has demonstrated the potential of speech models on full-stack speech tasks by using a weighted sum of embeddings from different layers, finding that different layers contain information useful for different tasks; for example, the top hidden states are useful for ASR, while the bottom layers are more effective for speaker verification. Therefore, this study proposes an attention-based multi-scale feature fusion (MFF) module to enhance the model's ability to represent information by maximizing inter-layer information utilization.

4 Performance Testing and Analysis

We first demonstrate our results on the Aishell-1 test dataset to gain a deeper understanding of our method. Subsequently, we further validate the effectiveness of the method on the English corpus WSJ (80-h). To evaluate the effectiveness of the multi-scale feature fusion method and the multi-view self-supervised learning module, we conducted ablation experiments to compare the differences. The performance of the model is evaluated based on the character error rate (CER).

4.1 Dataset

The Aishell company provides the Aishell-1 dataset, an open-source speech dataset that resamples high-fidelity microphone audio data to 16 kHz, 16-bit WAV format. The dataset consists of speech data from 400 speakers, representing diverse dialect regions in China, and covers a wide range of topics such as technology, sports, entertainment, current news, finance, and economics. The Aishell-1 dataset is divided into three sets: a training set with 340 speakers, containing 150 h of speech data, a validation set with 40 speakers, comprising 10 h of speech data, and a test set with 20 speakers, containing 5 h of speech data. In total, the dataset contains 165 h of speech data. The composition of the Aishell-1 dataset is detailed in Table 1. The test set consists of 7176 speech samples. For this project, the Aishell-1 dataset was utilized for both training and testing the proposed speech recognition model.

Table 1 Aishell-1 speech corpus composition structure

4.2 Experimental Setup

The test configuration for this experiment includes an AMD R9-3090X processor, 32 GB of RAM, and an NVIDIA RTX-3090 GPU graphics card. The software environment is a 64-bit Ubuntu 20.04 operating system running the Pytorch deep learning framework.

The input features consist of an 80-dimensional log-Mel filter bank (Fbank) computed with a 25-ms window and a 10-ms shift. We apply speed perturbation to the entire dataset at factors of 0.9, 1.0, and 1.1, yielding three speed variants. SpecAugment is applied with 2 frequency masks (maximum frequency mask F = 10) and 2 time masks (maximum time mask T = 50).
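For reference, a hedged torchaudio sketch of the described front end (80-dimensional Fbank with a 25-ms window and 10-ms shift, followed by SpecAugment-style masking) is shown below; speed perturbation is omitted, and the parameter names follow torchaudio rather than the authors' exact pipeline.

```python
import torch
import torchaudio

def extract_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """waveform: (1, num_samples) at 16 kHz -> (num_frames, 80) log-Mel filter bank features."""
    return torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
        sample_frequency=sample_rate)

# SpecAugment-style masking applied to the (freq, time) feature map during training
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=10)   # F = 10
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=50)        # T = 50

def spec_augment(feats: torch.Tensor) -> torch.Tensor:
    """feats: (num_frames, 80); apply 2 frequency masks and 2 time masks."""
    x = feats.t().unsqueeze(0)            # (1, freq, time) as expected by the transforms
    for _ in range(2):
        x = freq_mask(x)
        x = time_mask(x)
    return x.squeeze(0).t()
```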

To reduce the computational burden, two-dimensional convolutional down-sampling is employed at the front end of the shared encoder, with a kernel size of 3\(\times \)3 and a stride of 2, giving an overall subsampling factor of 4. The shared encoder comprises 12 conformer blocks, each with four attention heads, an attention dimension of 256, and a feed-forward dimension of 2048, consistent with the baseline model. The attention decoder includes six transformer blocks, each with four attention heads. The weight of the CTC branch is set to 0.3 during joint training and 0.5 during decoding. Gradient accumulation is used to stabilize training, with gradients updated every 4 batches [55]. To prevent overfitting, dropout and label smoothing regularization are applied to each conformer and transformer block. The Adam optimizer is used for training, with a learning rate schedule of 25,000 warm-up steps and an initial learning rate of 0.002. Additionally, we conducted experiments with the hyperparameter \(\mu \) set to 0, 0.01, 0.05, 0.1, 1, and 10, and with the number of MFF fusion layers set to 2, 3, 4, and 12.

4.3 Evaluation Metrics

In automatic speech recognition, the results are usually presented as sequences of words or phrases. Three types of errors can occur: insertion, deletion, and substitution errors. An insertion error adds an extra word to the recognition result; a deletion error omits a correct word from the recognition result; and a substitution error replaces a correct word with an incorrect one. In English, recognition accuracy is typically measured at the word level, and the error rate is referred to as the Word Error Rate (WER); the same metric is appropriate for other languages written with word boundaries, such as Russian. However, in Chinese, word segmentation is ambiguous, making it difficult to measure errors directly at the word level. Therefore, the Character Error Rate (CER) is commonly used as the evaluation metric for Chinese speech recognition, and similar languages such as Japanese also employ CER. As the Chinese speech dataset Aishell-1 is employed in this experiment, CER is used as the evaluation metric, computed as follows:

$$\begin{aligned} C E R=\frac{N_{D e l}+N_{S u b}+N_{I n s}}{N_{R e f}} \end{aligned}$$
(16)

where \(N_{S u b}\) is the number of substitution errors, \(N_{I n s}\) the number of insertion errors, \(N_{D e l}\) the number of deletion errors relative to the reference annotation, and \(N_{R e f}\) the total number of characters in the reference transcription. Because of insertion errors, CER can exceed 100\(\%\); its minimum value is 0.
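A simple sketch of Eq. (16) is shown below; the total number of deletions, substitutions, and insertions equals the Levenshtein distance between the reference and the hypothesis at the character level.

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate, Eq. (16): (N_Del + N_Sub + N_Ins) / N_Ref."""
    r, h = list(ref), list(hyp)
    # dp[i][j]: minimum number of edits to turn the first i reference characters
    # into the first j hypothesis characters
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# example: one substituted character in a 4-character reference gives CER = 0.25
assert abs(cer("今天天气", "今天天汽") - 0.25) < 1e-9
```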

4.4 Performance Testing and Analysis

The experiments for the multi-scale feature fusion module investigate the impact of fusing the output data from different numbers of conformer blocks on the model's recognition performance. The experimental results are summarized in Table 2. In the shared encoder, B6+B12 corresponds to fusing the outputs of the sixth and twelfth conformer blocks; B4+B8+B12 to fusing the outputs of the fourth, eighth, and twelfth blocks; and B3+B6+B9+B12 to fusing the outputs of the third, sixth, ninth, and twelfth blocks. "All blocks", as proposed in this work, denotes fusing the outputs of every conformer block in the shared encoder. This ablation focuses on the MFF module alone, without the addition of SSL. The results clearly show that recognition performance improves as the number of fused blocks increases: models fusing two, three, or four blocks perform worse than the model fusing all blocks, confirming the importance of incorporating the output data from all conformer blocks.

Table 2 Experimental results of different number of blocks (CER\(\%\))

In this study, experiments are carried out for the multi-view self-supervised learning module to examine the impact of the hyperparameter \(\mu \) on recognition performance. The experimental results are displayed in Table 3. The best performance is obtained when \(\mu =0.05\), indicating that balancing the self-supervised and supervised losses is crucial in joint training.

Table 3 Weight sensitivity study on \(\mu \)

In this study, ablation experiments are conducted to demonstrate the effectiveness of the MM-ASR model’s multi-scale feature fusion module and multi-view self-supervised learning method. The experimental results are displayed in Table 4. The baseline model is the original WeNet model, with the decoder trained in supervised learning mode using features from the network’s final layer. The MM-ASR model, proposed in this paper, incorporates both the multi-scale feature fusion module and multi-view self-supervised learning method. Two additional variants are also evaluated: -SSL, which is the MM-ASR model with the multi-view self-supervised learning method eliminated, and -MFF, which is the MM-ASR model with the multi-scale feature fusion module removed. The experimental results demonstrate the efficacy of both multi-scale feature fusion and multi-view self-supervised learning. The MM-ASR model, which combines supervised and self-supervised losses for training and focuses on interlayer information, exhibits improved model resilience and achieves a lower character error rate (CER) compared to the original WeNet model. The proposed approach leads to a significant enhancement in voice recognition ability, reducing the character error rate by approximately 4.6\(\%\) when compared to the baseline. This demonstrates the effectiveness of the multi-scale feature fusion and multi-view self-supervised learning techniques in improving the performance of the end-to-end speech recognition model.

Table 4 Ablation study of the MM-ASR (CER\(\%\))

Table 5 presents a comparison of the Character Error Rate (CER) results between the MM-ASR model proposed in this study and several widely available models on the Aishell-1 test dataset. The models used for comparison include CTC/Attention [56], CAT [57], ESPnet [58], BAT [59], Paraformer [60], UMA [61] and WeNet [17, 18]. All assessment results in the paper are rounded to two decimal places for consistency. The findings in Table 5 demonstrate that the MM-ASR model outperforms the other models, indicating its superior performance in terms of speech recognition accuracy. This clearly demonstrates the effectiveness of multi-scale feature fusion and self-supervised learning within a single neural network. The experimental outcomes provide strong evidence supporting the effectiveness and usefulness of the proposed MM-ASR model for end-to-end speech recognition tasks, confirming its superiority compared to publicly available models like CTC/Attention, CAT, ESPnet, BAT, Paraformer, UMA and WeNet.

Table 5 Experimental results on the Aishell-1 test dataset (CER\(\%\))

Table 6 shows a comparison of character error rate (CER) results between the MM-ASR model proposed in this study and several widely available models on the English corpus WSJ (80-h). The models used for comparison include CTC/attention, CAT, ESPnet, LF-MMI [62], CTC-CRF ST-NAS [63], Wav2letter++ [64], and WeNet. The results in Table 6 demonstrate that on the English corpus WSJ, the MM-ASR model still outperforms other models.

Table 6 Experimental results on the WSJ dataset (CER\(\%\))

5 Conclusion

In this paper, a combination of supervised and self-supervised training techniques is leveraged to construct and train an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning. The proposed method emphasizes the use of inter-layer information in a shared encoder to improve the model's ability to represent and process speech data. A self-supervised contrastive loss is introduced in the shared encoder to increase the model's robustness, and the model is trained by combining supervised and self-supervised losses. Ablation experiments on the multi-view self-supervised learning component and the multi-scale feature fusion module demonstrate their respective contributions to recognition performance. Further experiments examine the impact of fusing different numbers of conformer blocks and of the hyperparameter \(\mu \) that balances the self-supervised and supervised losses. The proposed technique is assessed on the Aishell-1 dataset and further validated on the English WSJ corpus. The experimental findings demonstrate that the strategy enhances the recognition performance of the speech recognition model.