1 Introduction

In recent years, speech-based human–computer interaction and speech communication have become commonplace owing to the growth of the telecommunications industry and the popularity of speech communication technologies such as online conferencing. Especially since the COVID-19 pandemic, many people have had to work from home, communicating and cooperating seamlessly through online conferencing software. During online conferencing, however, speech quality is frequently degraded by a variety of factors, such as background noise, reverberation, packet loss and network jitter. For example, degraded speech quality can lead to a sharp decline in communication efficiency during an online conference. Consequently, there is a high demand for speech quality assessment (SQA), which requires speech communication service providers to estimate the perceived speech quality in order to monitor the speech services offered to their customers [1].

Fig. 1 Spectrograms of speech at different MOS values: a 5, b 4, c 3, d 2 and e 1

Depending on the availability of a clean reference, speech quality assessment techniques can be categorized as intrusive or non-intrusive [2]. Intrusive objective measures can automatically assess the performance of communication systems without the intervention of human listeners, and they are generally more accurate; however, they require an unimpaired reference signal of excellent sound quality, which is typically not available in quality of service (QoS) monitoring scenarios such as online conferencing applications. This requirement limits their use in many practical, realistic applications. In contrast, non-intrusive methods need only the signal under evaluation to produce a quality score, without any sonically superior, unimpaired reference. This setting is consistent with the user's experience of online conferencing applications and is therefore more relevant for research.

A few indicators, such as the signal-to-noise ratio (SNR) [3], signal-to-distortion ratio (SDR) [4] and signal-to-interference ratio (SIR) [5], have been used to estimate speech quality by providing objective measures of the target speech. The MOS defined in ITU-T Recommendation P.808 is the most widely used SQA indicator of user opinion: using the absolute category rating (ACR) approach, a speech corpus is rated on a scale of 1–5 by human listeners. Although MOS obtained from subjective listening tests is considered the most reliable measure of speech quality, it requires many participants to complete listening tests and provide perceptual ratings. To address these issues, numerous assessment models have been proposed [1, 6,7,8,9,10,11,12,13,14,15]. For example, Fu et al. [12] predicted the quality of enhanced speech frame by frame using a BLSTM-based quality assessment model called Quality-Net. Yoshimura et al. [9] proposed a fully connected neural network and CNN-based synthetic speech naturalness predictor to predict MOS. Lo et al. [13] proposed MOSNet to predict the quality of synthetic speech. Tseng et al. [14] utilized self-supervised representations for MOS prediction. To make better use of the scores of each judge in a MOS dataset, Leng et al. [15] proposed MBNet, a MOS predictor with a mean subnet and a bias subnet. It is important to note that different listeners may provide different ratings for the same corpus. As a result, developing a model that correlates highly with subjective human evaluation remains a challenge.

Inspired by the successful combination of CNN and RNN and by ResNet's powerful ability to extract local features, this paper introduces a non-intrusive speech quality assessment method based on ResNet and BiLSTM. In addition, an attention mechanism is employed to focus on different parts of the input [16]. Specifically, multi-head attention is used to obtain scoring weights from the BiLSTM output, resulting in scores that correlate more closely with human scores. In summary, the contributions of this work are as follows: (1) A variant ResNet model for non-intrusive speech quality assessment is designed, in which the convolution is applied so that the generated feature maps keep the time series information. (2) We propose an effective ResNet-BiLSTM method for non-intrusive speech quality assessment, which outperforms state-of-the-art models in terms of accuracy on the PSTN Corpus [17] and ITU-T P Supplement-23.

The remainder of this paper is organized as follows. Section 2 presents the proposed method. Section 3 describes the datasets. The experimental results are reported in Sect. 4, and Sect. 5 provides a discussion. Finally, the conclusions and recommendations for further research are summarized in Sect. 6.

Fig. 2 Model architecture. A 257-dimensional 2D audio spectrogram is fed into the four residual blocks. The features extracted by ResNet are then passed to a BiLSTM with attention. Finally, two FC layers and an average pooling layer are used to obtain the final prediction

2 Methods

2.1 Feature extraction

Spectral representation is the basis of digital signal processing, and the magnitude spectrogram is commonly used to represent speech signals. Figure 1 shows spectrograms at different MOS values; clear differences can be observed between spectrograms with different scores. We therefore preprocess the speech signals into a sequence of frame-based spectral features. A short-time Fourier transform (STFT) is applied to each speech file every 256 sample points, resulting in 257-dimensional spectral features. The STFT of a signal x(t) is defined as follows:

$$\begin{aligned} \textrm{STFT}_{x}(t,f) = \int _{ - \infty }^{ + \infty }x(\tau ){g}(\tau - t){e^{ - j2\pi f\tau }}d\tau \end{aligned}$$
(1)

where g(t) is the window function. From the above formula, the STFT of the signal x(t) is the signal multiplied by an analysis window \({g}(\tau - t)\) centered on t. Multiplying x(t) by the analysis window function \({g}(\tau - t)\) is equivalent to taking out a slice of the signal near the analysis time point t. For a given time t, STFT\({_x} (t, f)\) can be regarded as the spectrum at that time. Finally, the magnitude spectrogram is obtained by taking the modulus of STFT\({_x} (t, f)\); the result is the magnitude spectrogram of the signal x, which is used as the model input.
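For illustration, the following minimal sketch shows how such a magnitude spectrogram can be computed with librosa (the toolkit used in Sect. 2.2). The 512-point FFT and 256-sample hop are assumptions chosen to yield 257 frequency bins per frame and a 256-sample analysis step; the paper does not state the window parameters explicitly.

```python
import librosa
import numpy as np

def magnitude_spectrogram(path, n_fft=512, hop_length=256):
    """Return the 257-dimensional magnitude spectrogram of a speech file.

    n_fft=512 gives n_fft // 2 + 1 = 257 frequency bins; hop_length=256
    corresponds to the 256-sample analysis step described above.
    """
    x, sr = librosa.load(path, sr=None)               # keep the native sampling rate
    stft = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft).T                             # shape: (frames, 257)
```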

2.2 Model architecture

The architecture of the proposed ResNet-BiLSTM model is shown in Fig. 2. Given a speech sample, we first compute its magnitude spectrogram x by STFT using the librosa [18] toolkit. Local feature maps are then extracted by ResNet. To capture long-term dependencies in the speech signal, a bidirectional LSTM is used, whose input is \(x_t\) and output is h. A self-attention mechanism [16] over the BiLSTM output further improves performance. Two fully connected layers are added to map the learned features to frame-level scores. Finally, the utterance-level MOS is derived by average pooling. All activation functions are rectified linear units (ReLU) [19], and batch normalization (BN) [20] is applied to the convolution operations. The architecture of ResNet-BiLSTM is detailed in Table 1.
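The sketch below illustrates this data flow in Keras. It is only an illustration: the filter counts, LSTM size, number of attention heads and FC widths are placeholders rather than the exact configuration of Table 1, and the four residual blocks are represented by simple convolution stages that downsample only the frequency axis (see Sect. 2.3 for the actual block).

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_resnet_bilstm(n_freq=257, lstm_units=128):
    """Illustrative sketch of the data flow in Fig. 2 (sizes are placeholders)."""
    spec = layers.Input(shape=(None, n_freq))                    # (frames, 257)
    x = layers.Lambda(lambda t: tf.expand_dims(t, -1))(spec)     # add a channel axis

    # Stand-in for the four residual blocks of Sect. 2.3: strides of (1, 2)
    # downsample only the frequency axis, so the time frames are preserved.
    for filters in (16, 32, 64, 128):
        x = layers.Conv2D(filters, 3, strides=(1, 2), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

    x = layers.TimeDistributed(layers.Flatten())(x)               # (frames, features)
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = layers.MultiHeadAttention(num_heads=4, key_dim=lstm_units)(x, x)
    x = layers.Dense(64, activation="relu")(x)
    frame_scores = layers.Dense(1)(x)                             # frame-level scores
    mos = layers.GlobalAveragePooling1D()(frame_scores)           # utterance-level MOS
    return tf.keras.Model(spec, [mos, frame_scores])
```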

2.3 ResNet

In time series data modeling, CNNs are frequently employed and perform effectively [21, 22]. A CNN increases the size of its receptive field by stacking additional convolution layers. However, due to vanishing or exploding gradients, deeper models are not always better. To address this, He et al. [23] proposed ResNet in 2015, which contains skip connections. The output of a residual block is defined as follows:

$$\begin{aligned} x_{l+1} = F(x_l)+x_l \end{aligned}$$
(2)

where \(x_{l+1}\) and \(x_l\) denote the output and input of the residual block, respectively, and \(F(\cdot )\) is the residual mapping function.

In this work, we propose an enhanced residual block that only requires a convolution operation along the time dimension to generate feature maps that keep the time series information. The residual block used in this work is shown in Fig. 3. For the SQA task, the information of the entire speech is useful, and this operation preserves the integrity of the time series information, which is beneficial for predicting speech quality. To extract local features from each local region of the input spectral features, ResNet utilizes a set of convolutional filters to generate feature maps. This is expressed as follows:

$$\begin{aligned} (h_k)_{ij} = (W_k \otimes x)_{ij} + b_k \end{aligned}$$
(3)

where x stands for the input feature maps, \(W_k\) and \(b_k\) stand for the kth filter and bias, respectively, and \((h_k)_{ij}\) denotes the (i, j) element of the kth output feature map. The 2D spatial convolution operation is represented by the symbol \(\otimes \).
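A minimal sketch of such a residual block is given below, combining the skip connection of Eq. (2) with the convolution of Eq. (3). The kernel sizes, the choice of strides (downsampling only the frequency axis so that the time frames are kept) and the 1 × 1 projection on the skip path are assumptions for illustration; the exact configuration is given in Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, freq_stride=2):
    """One residual block following Eq. (2): x_{l+1} = F(x_l) + x_l.

    Strides are 1 on the time axis and `freq_stride` on the frequency axis
    (an assumption), so the time resolution of the feature maps is preserved.
    """
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=(1, freq_stride), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=(1, 1), padding="same")(y)
    y = layers.BatchNormalization()(y)

    # Project the shortcut when the channel count or frequency size changes.
    if shortcut.shape[-1] != filters or freq_stride != 1:
        shortcut = layers.Conv2D(filters, 1, strides=(1, freq_stride),
                                 padding="same")(shortcut)

    out = layers.Add()([y, shortcut])
    return layers.ReLU()(out)
```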

There are four residual blocks, and each block has a different number of layers compared to ResNet-18 and ResNet-50. To reduce the number of trainable parameters, fewer residual blocks are used in the proposed ResNet-BiLSTM, and each residual block is configured with the same number of layers.

A BN layer is added to each residual block. BN couples all samples within a mini-batch, so the output for a given training sample no longer depends only on the sample itself but also on the other samples in the same batch. Because each mini-batch is drawn randomly, the network cannot fit deterministically to any individual sample, which avoids over-fitting to a certain extent.

Table 1 Configuration of ResNet-BiLSTM architectures
Fig. 3 Residual block used by the proposed model. Each block uses the output of the previous block as input

2.4 BiLSTM with attention

Long short-term memory (LSTM) [24], a type of recurrent neural network for temporal data, was developed to address the long-term dependency problem shared by all RNNs, which have the form of a chain of repeating neural network modules. For each layer, LSTM computes the following at time t:

$$\begin{aligned} f_t= & {} {\sigma }_g(W_fx_t + U_fh_{t-1} + b_f)\end{aligned}$$
(4)
$$\begin{aligned} i_t= & {} {\sigma }_g(W_ix_t + U_ih_{t-1} + b_i)\end{aligned}$$
(5)
$$\begin{aligned} o_t= & {} {\sigma }_g(W_ox_t + U_oh_{t-1} + b_o)\end{aligned}$$
(6)
$$\begin{aligned} c_t= & {} f_tc_{t-1} + i_t{\sigma }_c(W_cx_t + U_ch_{t-1} + b_c)\end{aligned}$$
(7)
$$\begin{aligned} h_t= & {} o_t{\sigma }_h(c_t) \end{aligned}$$
(8)

where \({\sigma }_g\) denotes the sigmoid function, \({\sigma }_c\) and \({\sigma }_h\) denote the hyperbolic tangent function, c is the internal cell state, h is the hidden state, and f, i and o are the forget, input and output gates, respectively.
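The following NumPy snippet is a direct transcription of Eqs. (4)–(8) for a single time step, intended only to make the gate computations concrete; the dictionary `p` of weight matrices and biases is a hypothetical container for the learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step implementing Eqs. (4)-(8)."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # Eq. (4)
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # Eq. (5)
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # Eq. (6)
    c_t = f_t * c_prev + i_t * np.tanh(
        p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])             # Eq. (7)
    h_t = o_t * np.tanh(c_t)                                       # Eq. (8)
    return h_t, c_t
```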

BiLSTM, which is composed of a forward LSTM and a backward LSTM, can capture both forward and backward information, whereas a unidirectional LSTM can only encode information from front to back. The backward pass performs the opposite of the forward pass, replacing \(t-1\) with \(t+1\) in Eqs. (4)–(7) to provide its feature maps. We then concatenate the forward and backward outputs as follows:

$$\begin{aligned} h=[h_{fw}:h_{bw}] \end{aligned}$$
(9)

where \(h_{fw}\) and \(h_{bw}\) represent the outputs of the forward and backward passes, respectively.

Because speech is a time series signal, its quality is closely tied to its temporal sequence, and BiLSTM has accordingly been shown to be effective in speech quality assessment [12, 13, 25].

Attention mechanisms have produced promising results in various domains [26, 27]. Consequently, we use a self-attention mechanism to obtain scoring weights from the BiLSTM output, taking the same input to derive the query \(Q_h\), key \(K_h\) and value \(V_h\). This is defined as follows:

$$\begin{aligned} f_{\mathrm{self-attention}}(Q_h,K_h,V_h)~{=}~\textrm{softmax}\left\{ \frac{Q_hK_{h}^{T} }{\sqrt{d_h} } \right\} V_h \end{aligned}$$
(10)

where \(d_h\) is the dimension of \(Q_h\), \(K_h\) and \(V_h\). \(Q_h = hW^Q\) is the query, where \(W^Q\) is the corresponding projection weight; \(K_h\) and \(V_h\) are obtained analogously with the projection weights \(W^K\) and \(W^V\).
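A minimal single-head sketch of Eq. (10) is shown below; h is the matrix of BiLSTM outputs (frames × d), and the projection matrices are assumed to have been learned.

```python
import numpy as np

def self_attention(h, W_q, W_k, W_v):
    """Scaled dot-product self-attention over the BiLSTM outputs, per Eq. (10)."""
    Q, K, V = h @ W_q, h @ W_k, h @ W_v
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V
```

In the multi-head case, this computation is repeated with separate projection weights per head and the head outputs are concatenated.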

2.5 Objective function

We regard MOS prediction as a regression task. The MOS labels of each dataset are used as the ground truth for training the model. Fu et al. and Lo et al. [12, 13] introduced a frame-level prediction error to obtain utterance-level predictions that are more closely related to human scores. The final utterance-level MOS is obtained by averaging the frame-level scores over all frames, and the ground-truth MOS of an utterance is used as the target for all of its frames when computing the frame-level MSE, which helps the model converge with better prediction accuracy [12]. Thus, we set the objective function as follows:

$$\begin{aligned} O = \frac{1}{S}\sum _{s = 1}^{S}\left[ ( \hat{X_s} - X_s)^2 + \frac{1}{T_s}\sum _{t = 1}^{T_s}(\hat{X_s}-x_{s, t})^2 \right] \end{aligned}$$
(11)

where \(\hat{X_s}\) is the ground-truth MOS and \(X_s\) is the predicted MOS for the sth speech sample, \(T_s\) is the total number of frames in the sth speech sample, S is the total number of training speech samples and \(x_{s, t}\) is the frame-level prediction at time t.
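A minimal TensorFlow sketch of Eq. (11) is given below, assuming the frame-level predictions of a batch are stored in a dense (S, T) tensor; batches padded to a common length would additionally need a frame mask, which is omitted here.

```python
import tensorflow as tf

def mos_objective(mos_true, mos_pred, frame_pred):
    """Eq. (11): utterance-level MSE plus frame-level MSE, where the
    ground-truth MOS of an utterance is the target for every frame.

    mos_true:   (S,)    ground-truth MOS per utterance
    mos_pred:   (S,)    predicted utterance-level MOS
    frame_pred: (S, T)  predicted frame-level scores
    """
    utt_term = tf.square(mos_true - mos_pred)                      # (S,)
    frame_term = tf.reduce_mean(
        tf.square(mos_true[:, None] - frame_pred), axis=1)         # (S,)
    return tf.reduce_mean(utt_term + frame_term)
```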

3 Datasets

In this section, the datasets used are presented, some of which come from the datasets recommended by ConferencingSpeech 2022 [28]. An additional set of narrowband datasets is taken from the ITU-T P Supplement-23.

3.1 ConferencingSpeech 2022

The ConferencingSpeech 2022 challenge was proposed to stimulate the development of NISQA technology for online conferencing applications. The challenge provides four datasets: NISQA Corpus [29], PSTN Corpus, IU Bloomington Corpus [30] and Tencent Corpus. Of these, the latter three datasets are publicly available for the first time; the NISQA corpus was already public. It is worth noting that the IU Bloomington Corpus uses ITU-R BS.1534 for its subjective test, with a score range of 0–100 instead of 1–5, and it is therefore not considered in the experiments.

NISQA Corpus This dataset is a publicly available SQA dataset containing more than 14,000 speech samples with simulated (e.g., codecs, packet loss, background noise) and live (e.g., mobile phone, Zoom, Skype, WhatsApp) conditions.

PSTN Corpus This dataset is derived from the public audiobook dataset LibriVox, filtered to obtain good-quality speech. The speech is divided into 10-second segments, and noise from the DNS Challenge 2021 is added. There are a total of 58,709 speech samples, of which 40,739 are based on noisy reference files and 17,970 on clean reference files.

Tencent Corpus This dataset contains samples with and without reverberation. There are around 10,000 Chinese speech samples without reverberation, each exhibiting the simulated degradations commonly encountered during online conferencing. For the reverberant case, a total of around 4000 samples with simulated impairments and live recordings are included.

3.2 ITU-T P supplement-23

Speech samples from the ITU-T P Supplement-23 were originally used in the characterization tests of the G.729 8 kbit/s codec. This corpus comprises ten datasets; however, only the seven rated on the ACR scale are employed in this work. Each dataset contains coded speech under frame-loss and noise conditions and is scored in a narrowband context.

4 Experiments

4.1 Experimental settings

For ResNet-BiLSTM, the dropout [31] rate is set to 0.3. The Adam [32] optimizer is used to train the model with a learning rate of 0.0001. Early stopping is applied to the validation set's MSE with a patience of 10 epochs. We use TensorFlow [33] as the framework and conduct our experiments on an Nvidia RTX-2080Ti GPU.
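A sketch of this training setup is shown below; `model`, `train_ds` and `val_ds` are placeholders for the ResNet-BiLSTM model and the prepared datasets, and the epoch budget and `restore_best_weights` choice are assumptions not stated in the paper.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",            # validation MSE
    patience=10,                   # stop after 10 epochs without improvement
    restore_best_weights=True)     # assumption; not specified in the paper

model.compile(optimizer=optimizer, loss="mse")
model.fit(train_ds, validation_data=val_ds,
          epochs=100,              # assumption; training stops early in practice
          callbacks=[early_stop])
```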

Table 2 Prediction results on NISQA Corpus with and without attention mechanism

We train and evaluate our model on four speech datasets with MOS labels (the model is trained and evaluated on each dataset individually), and zero padding is used to ensure that all audio has the same length. Root mean squared error (RMSE), the linear correlation coefficient (LCC) and Spearman's rank correlation coefficient (SRCC) are taken as the evaluation indicators. In detail, RMSE is the root of the mean squared difference between the observed value and the value predicted by the model. It can be obtained through:

$$\begin{aligned} \textrm{RMSE}=\sqrt{\frac{1}{N-1}\sum _{i = 1}^{N}(X_i-Y_i)^2} \end{aligned}$$
(12)

where \(X_i\) indicates the subjective MOS, \(Y_i\) represents the predicted objective MOS and N represents the total number of speech samples.

The linear correlation coefficient (LCC) describes the linear correlation between the subjective scores and the algorithm scores (assuming normally distributed data). It can be obtained through:

$$\begin{aligned} \textrm{LCC}=\frac{\sum _{i = 1}^{N}(X_i-{\overline{X}})(Y_i-{\overline{Y}})}{\sqrt{\sum (X_i-{\overline{X}})^2}\sqrt{\sum (Y_i-{\overline{Y}})^2}} \end{aligned}$$
(13)

Spearman’s rank correlation coefficient (SRCC) calculates the correlation coefficient of the monotonic relationship between two variables. It can be obtained through:

$$\begin{aligned} \textrm{SRCC}=\frac{\sum _{i = 1}^{N}(x_i-{\overline{x}})(y_i-{\overline{y}})}{\sqrt{\sum (x_i-{\overline{x}})^2}\sqrt{\sum (y_i-{\overline{y}})^2}} \end{aligned}$$
(14)

where the N raw data points \(X_i\), \(Y_i\) are converted into rank data \(x_i\), \(y_i\).
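These three indicators can be computed as in the following sketch; note that RMSE here follows Eq. (12) and divides by N − 1 rather than N.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(subjective, predicted):
    """Return RMSE, LCC and SRCC between subjective and predicted MOS."""
    subjective = np.asarray(subjective, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = len(subjective)
    rmse = np.sqrt(np.sum((subjective - predicted) ** 2) / (n - 1))  # Eq. (12)
    lcc, _ = pearsonr(subjective, predicted)                         # Eq. (13)
    srcc, _ = spearmanr(subjective, predicted)                       # Eq. (14)
    return rmse, lcc, srcc
```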

Table 3 Prediction results with different models
Fig. 4 Scatter plots of predictions on the Tencent Corpus a without reverberation and b with reverberation

4.2 Impact of attention mechanism

Table 2 shows the RMSE, LCC and SRCC values of the model with and without the attention mechanism. As shown, the RMSE improves from 0.5168 to 0.5068 when the attention mechanism is added to the proposed ResNet-BiLSTM model.

4.3 Impact of reverberation

In the real world, non-intrusive speech quality assessment typically has to be performed in the presence of reverberation, so the robustness of an SQA method is a crucial requirement. The Tencent Corpus contains speech samples with and without reverberation. Figure 4a, b shows the scatter plots of the predictions made by ResNet-BiLSTM on the Tencent Corpus without and with reverberation, respectively. Figure 5 shows that the proposed ResNet-BiLSTM model maintains good performance in the presence of reverberation, and its predictions remain highly correlated with human evaluations.

Fig. 5 Histogram of the predictions on the Tencent Corpus with and without reverberation

4.4 Comparison of different models

In this section, the effectiveness of various models is compared. Table 3 shows the RMSE, LCC and SRCC values of different models on the PSTN Corpus and ITU-T P Supplement-23. On the PSTN Corpus, ResNet-BiLSTM yields 0.5127 in RMSE, 0.8087 in LCC and 0.8078 in SRCC, all of which are superior to the other models. In addition, on ITU-T P Supplement-23, the LCC and SRCC performance of ResNet-BiLSTM is comparable to that of NISQA, while its RMSE is 6.5% lower than that of MOSNet.

5 Discussions

In general, a standard ResNet is not well suited to speech signals because it has a large number of trainable parameters and pooling operations that may destroy the contextual information of speech signals. In this work, we employ a ResNet variant that extracts local feature maps effectively while preserving the time series information with a manageable number of trainable parameters. It would therefore be worthwhile to investigate other sophisticated CNNs in combination with RNNs to improve the accuracy of NISQA even further.

A limitation of the proposed ResNet-BiLSTM model is that it rarely predicts very low or very high MOS. As can be seen from Fig. 6, most of the MOS values predicted by the ResNet-BiLSTM model lie between 1.5 and 4.5. The MSE-based objective function may cause this issue, and it could be alleviated by modifying the objective function in future research.

6 Conclusions

In this study, we investigated the combination of ResNet and BiLSTM (ResNet-BiLSTM) for non-intrusive speech quality assessment, in which an attention mechanism is used to obtain scoring weights from the BiLSTM output. Extensive evaluation against human MOS ratings shows that the scores predicted by our model are close to human scores. In the future, we will improve the network structure to further increase the accuracy of the predicted scores and better serve online conferencing applications. In addition, we expect the proposed model to also identify the causes of quality loss when producing a MOS and to provide timely feedback to service providers, so that they can adjust the call quality of real-time meetings in time and offer users an excellent experience.

Fig. 6 Histogram of the predictions of ResNet-BiLSTM