1 Introduction

With the global increase in the number of internet and mobile users, the usage of Voice over Internet Protocol (VoIP) applications, for example, Google Meet, Microsoft Skype, and Apple FaceTime, is growing at a high pace. VoIP has become vital for today's society, supporting remote working and online communication such as video conferencing. To fulfill users' expectations of a better quality of experience (QoE) while using such VoIP applications, it is necessary to measure and monitor speech quality in real-time. Various factors influence the QoE, including system, network, content, and context of use (Falk et al., 2010). The service and system factors include the type of channel (mono or stereo), the position of the microphone, central processing unit (CPU) overload, etc. The network factors include jitter, packet loss and delay of the transmitted speech signal. Content, that is, the characteristics of speech and voice, may be affected by processing and can influence the QoE. The contextual factors comprise the location in which a particular service is used, for example, inherently noisy environments such as an exhibition, restaurant, station or airport compared to the potential quietness of the home. The contextual factors are the primary focus of the current research work.

The traditional method to measure speech quality is absolute category rating (ACR; ITU, 1996), in which a number of subjects listen to speech material played in a suitable environment and give their quality ratings, which are then averaged to obtain the mean opinion score (MOS). ACR is a highly reliable and accurate method of obtaining speech quality ratings, but it is time-consuming and impractical for real-time speech quality measurement and monitoring. It is also not possible to have a group of listeners at every node of a speech communication network to provide subjective quality ratings (Holub et al., 2017). In contrast, measuring and monitoring real-time speech quality using objective speech quality assessment metrics is more convenient, faster and practical.

Various objective speech quality assessment metrics have been standardized by the International Telecommunication Union (ITU), for example, signal-based metrics, which use the received (degraded) speech signal to estimate speech quality. Two types of signal-based metrics exist (Möller et al., 2011): full-reference metrics (also called "intrusive" or "double-ended" metrics) and no-reference metrics (also called "non-intrusive" or "single-ended" metrics), as shown in Fig. 1.

Fig. 1 Flow diagram of intrusive and non-intrusive speech quality assessment metrics

Intrusive (reference-based) metrics usually compute the distance between spectral representations of the transmitted reference (clean) signal and the received degraded signal. For example, the perceptual evaluation of speech quality (PESQ; Rix et al., 2001), the perceptual objective listening quality assessment (POLQA; ITU, 2011), the virtual speech quality objective listener (ViSQOL; Hines et al., 2015b) and its improved version ViSQOL v3 (Chinen et al., 2020) are some of the popular intrusive metrics. Since the reference speech signal is not accessible in real-time at the receiver end or at the different nodes of a telecommunication system, intrusive metrics are not suitable for speech quality measurement and monitoring in telecommunication networks.

Non-intrusive (no-reference or reference-free) speech quality metrics (SQMs) are usually preferred for real-time measurement and monitoring of speech quality and for scenarios where the reference speech signal is not available. These metrics use only the degraded received speech signal to predict speech quality and can easily be installed at the end points of VoIP channels for monitoring speech quality (Shome et al., 2019). Two no-reference objective SQMs have been standardized for the assessment of narrow-band speech signals (Möller et al., 2011): the ITU-recommended P.563 (2004) and the American National Standards Institute (ANSI) standardized "auditory non-intrusive quality estimation plus" (ANIQUE+; Kim & Tarraf, 2007). ANIQUE+ is a perceptual model that simulates the functional roles of the human auditory system and employs improved quality estimation modelling based on a statistical learning paradigm (Kim & Tarraf, 2007). ANIQUE+ is only available commercially, whereas P.563 is publicly available. However, neither P.563 nor ANIQUE+ takes into account the type/class of the input speech signal while predicting speech quality.

Several non-standardized metrics for non-intrusive objective speech quality evaluation have been developed in the literature, for example, the low complexity speech quality assessment metric (LCQA; Bruhn et al., 2012), a metric using multiple time-scale auditory features (Dubey & Kumar, 2017) and deep neural network (DNN) based speech quality assessment metrics (Avila et al., 2019; Catellier & Voran, 2020; Fu et al., 2018; Ooster et al., 2018; Soni & Patil, 2021; Wang et al., 2019). The LCQA algorithm monitors speech quality over a communication network with low complexity (execution time). To estimate speech quality, it maps global statistical features, for example, the mean, variance, skewness and kurtosis of speech-coding parameters, using a Gaussian mixture model (GMM); the global features are calculated per frame from the speech-coding parameters (Grancharov et al., 2006). Reported results indicate that the LCQA metric performs poorly in the presence of competing-speaker degradations (Jaiswal & Hines, 2018). Moreover, the LCQA metric is restricted to a parametric representation of the input speech signal without a perceptual transform. In Dubey and Kumar (2017), the authors used a combination of different auditory features, such as Lyon's auditory features, Mel frequency cepstral coefficients (MFCC) and line spectral frequencies (LSF), and then trained a joint GMM to compute speech quality. However, the metric does not discriminate between different types of degraded noisy speech signals before estimating quality, even though each type of speech/noise has different spectral characteristics that can influence the quality estimate. Recent work on DNN-based speech quality assessment includes Avila et al. (2019), Catellier and Voran (2020), Fu et al. (2018), Ooster et al. (2018) and Wang et al. (2019). For example, MFCC features are extracted and used to train a DNN to predict speech quality in Avila et al. (2019), and a waveform-based convolutional neural network (CNN) is trained to predict speech quality in Catellier and Voran (2020). An output-based speech quality assessment metric incorporating an autoencoder and support vector regression is implemented in Wang et al. (2019) using the NTT-AT Chinese corpus, which contains different types of noise degradations. Further, Soni and Patil (2021) predict speech quality by extracting deep autoencoder (DAE) and sub-band autoencoder (SBAE) features and then training an artificial neural network (ANN) on noisy speech samples from the NOIZEUS speech corpus (Hu & Loizou, 2006). However, all of these DNN-based metrics predict speech quality directly, that is, without identifying the noise class (context) of the input speech signal. Since each noise class has distinct behaviour and spectral characteristics, identifying the noise class (context) of the speech signal and then switching to a noise class-specific (context-specific) quality predictor could significantly improve speech quality prediction accuracy.

Some metrics estimate speech quality using network and terminal parameters and are known as parametric metrics (Yang et al., 2016), for example, the E-Model (Bergstra & Middelburg, 2003). The network parameters include network delay and packet loss; the terminal parameters comprise jitter buffer overflow, coding distortions, jitter buffer delay, and echo cancellation. Using these parameters, the impairments of the received speech signal are predicted and a rating factor is converted into a mean opinion score (MOS). However, the E-Model cannot represent the non-linear relationship between the perceptual characteristics of the speech signal and the network planning parameters, because the characteristics of the speech signal change dynamically. Moreover, parametric metrics do not use the speech signal itself in their quality predictions and are therefore unsuitable for predicting speech quality based on signal-noise characteristics (Möller et al., 2011). To meet the QoE that users expect from VoIP applications, a no-reference signal-based speech quality assessment metric is needed that is aware of the noise class (context) of the input speech signal while measuring and monitoring real-time speech quality.

Fig. 2 Block diagram of the proposed speech quality prediction metric

A thorough review of the literature reveals no non-intrusive signal-based speech quality assessment metric that first identifies the noise class (context) of the speech signal before predicting speech quality in real-time. We were therefore motivated to develop a noise class-aware QoE prediction metric for real-time speech quality prediction and its efficient use in speech quality monitoring. Using artificial intelligence (AI) and machine learning (ML) techniques, smart decision-making AI-based algorithms can be introduced into mobile devices to improve the QoE and the performance gains of the end-user. To that end, our proposed non-intrusive noise class-aware speech quality prediction metric comprises three main components: first, a noise-class classifier for classifying the noise class (context) of the input speech signal; second, a voice activity detector (VAD) for identifying the voiced segments of the input speech signal; and third, a noise class-specific or context-specific speech quality metric (CSQM) for predicting noise class-specific speech quality. Training the noise-class classifier and the CSQMs well requires a large amount of noisy speech samples. Given the small number of speech samples per noise class in the NOIZEUS speech corpus (Hu & Loizou, 2006), we address this challenge by developing a novel machine learning classification algorithm for accurately classifying the noise class (context) of the speech signal and a collection of novel DNN-based CSQMs, each trained for one noise class, to estimate speech quality.

Internet service providers can easily deploy our proposed speech quality prediction metric to continuously measure and monitor service performance quality and can detect impairments by identifying the noise class (context). This can assist in identifying potential root causes and in installing QoE-aware management actions to react and maintain end-user QoE levels (Jahromi et al., 2018). The key contributions of the proposed work are as follows:

  • Addressing the speech quality monitoring problem by developing a novel noise class (context) aware SQM using a three-step process: the first step obtains a noise-class (context) classifier; the second step performs pre-processing using a VAD; and the third step develops noise-class (context) specific SQMs. Each of the three steps has separate relevance.

  • Developing a novel noise-class (context) classifier using a novel feature extraction technique to train the ML classifier.

  • Developing a VAD algorithm to detect the presence of speech signal segments.

  • Developing a group of DNN-based SQMs, each trained and optimized for a specific noise class, that is, noise-class (context) sensitive.

  • Classifying the noise class (context) of the input noisy speech signal and then switching to a specific speech quality metric (SQM) results in better speech quality prediction and allows the end-user to perceive a better QoE over VoIP applications.

  • The proposed three-step QoE framework shows improved performance and a significant advantage over a simple SQM in which speech quality is predicted directly without identifying the noise class (context).

The remainder of this paper is structured as follows: Sect. 2 presents a detailed explanation of the proposed SQM and its associated components. Section 3 describes the experimental dataset. The evaluation methodology for each component and for the overall metric is described in Sect. 4. Section 5 presents the system design for executing the program and the choice of different parameters and hyper-parameters. Section 6 presents and discusses the results of each component and the overall metric, before concluding remarks and future directions are presented in Sect. 7.

2 Proposed speech quality metric

Figure 2 shows each building block of the proposed noise class-aware speech quality prediction metric. It consists of three main sub-parts: (a) a noise-class classifier to identify the noise class (context) of the speech signal; (b) a VAD to segregate the voiced and non-voiced segments of the speech signal; and (c) noise class-specific SQMs, trained and optimized for each specific noise class using deep neural networks (DNNs), that is, context-aware DNN-based SQMs, to evaluate speech quality under that particular noisy scenario. The hypothesis behind our proposed QoE metric is: "With prior knowledge of the noise class of the speech signal under test, obtained using a classifier, the test signal can be directed to the speech quality assessment metric that is trained and optimised for that specific noise degradation."

Fig. 3 Block diagram of the noise-class classifier illustrating the feature extraction technique used to train the classifier (Jaiswal & Hines, 2020)

Table 1 Classes of speech enhancement algorithms with associated noise estimation techniques (Hu & Loizou, 2006, 2007)

2.1 Noise-class classifier

The noise-class classifier is the first building block of the proposed metric. To develop a robust classifier, we use the fact that different speech enhancement (SE) algorithms (Hu & Loizou, 2006, 2007) rely on different noise estimation techniques (see Table 1) and therefore enhance noisy speech signals with varying success. The key observation is that because different SE algorithms enhance a noisy speech signal to different degrees, the objective speech quality ratings (MOS scores) of their outputs will also differ. Using this prior information, each input noisy speech signal is processed by 12 standard speech enhancement algorithms (Hu & Loizou, 2006, 2007), as outlined in Table 1. The 12 processed speech signals together with the original unprocessed noisy speech signal yield 13 variants of each input signal. These 13 signal variants (12 processed and 1 unprocessed) are then fed to the P.563 metric (ITU, 2004; discussed next) to obtain 13 different speech quality ratings, called "MOS scores" (see Table 3). These MOS scores are concatenated to form the input feature vector for training the classifier to identify the noise class. This approach of extracting features from noisy speech samples to train a noise-class classifier is, to our knowledge, the only one of its kind in the literature. The block diagram of the noise-class classifier illustrating the feature extraction technique is shown in Fig. 3. In the following sub-sections, the two sub-parts of the noise-class classifier (speech enhancement algorithms and the objective SQM) and the ML classifier used are discussed.
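As an illustration, the sketch below builds the 13-dimensional MOS feature vector for one noisy sample. The callables `se_algorithms` (standing in for the 12 SE algorithms of Table 1) and `p563_mos` (a wrapper around the P.563 binary) are assumed interfaces for illustration, not part of the published implementation.

```python
import numpy as np

def extract_mos_features(noisy, se_algorithms, p563_mos):
    """Build the 13-dimensional MOS feature vector for one noisy sample.

    `se_algorithms` is a list of 12 callables implementing the SE
    algorithms of Table 1, and `p563_mos` maps a signal to its P.563
    MOS-LQO score; both interfaces are assumed for illustration.
    """
    variants = [noisy] + [se(noisy) for se in se_algorithms]  # 1 + 12 = 13
    return np.array([p563_mos(v) for v in variants])          # 13 MOS scores

# Feature matrix for a corpus: one 13-dimensional row per noisy sample.
# X = np.vstack([extract_mos_features(s, se_algorithms, p563_mos)
#                for s in noisy_samples])
```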

2.1.1 Speech enhancement

Speech enhancement (SE) algorithms are deployed to improve speech signals degraded by different types of noise (Das et al., 2020; Moore et al., 2017; Saleem & Khattak, 2019). The performance of an SE algorithm depends on the characteristics of the surrounding noise and on the noise estimation algorithm associated with it. The 12 standard SE algorithms presented in Hu and Loizou (2006, 2007) fall into four classes. Each class involves a separate noise estimation technique, resulting in varying success in enhancing degraded speech. The classes of SE algorithms and their associated noise estimation techniques are presented in Table 1.

2.1.2 ITU-T P.563 metric

Table 2 Requirements of speech signals in P.563 (ITU, 2004)
Table 3 Speech quality rating (MOS Scale)

The ITU-standardized P.563 (2004) is a no-reference speech quality assessment metric designed to estimate the quality of active speech in narrow-band speech signals. Three main principles define the P.563 metric (Malfait et al., 2006): first, a physical model of the vocal tract; second, reconstruction of a reference signal to evaluate unmasked distortions; and third, a focus on specific distortions, e.g., temporal clipping, robotization, and noise. P.563 predicts speech quality in several steps, including pre-processing, detection of the dominant distortion class, and perceptual mapping. The pre-processing includes reverse filtering, speech level adjustment, identification of speech portions, and calculation of speech and noise levels using a VAD (ITU, 2004). The distortion classes include unnaturalness of speech, robotic voice, beeps, background noise, signal-to-noise ratio (SNR), mutes and interruptions, extracted from the voiced parts of the speech signal. Finally, a dominant distortion class is detected and mapped to a single mean opinion score, denoted the "mean opinion score of objective listening quality (MOS-LQO)", as presented in Table 3. Table 2 presents the requirements on speech signals evaluated with P.563.

2.1.3 Classifier used

A major challenge in our data-driven approach is to develop an accurate and robust classifier from a small number of speech samples that can correctly detect the noise class of the input speech signal. To that end, different ML classifiers (Reddy et al., 2021; Shami & Verhelst, 2007; Singh & Singh, 2021), for example, XGBoost, decision tree, random forest, logistic regression, K-nearest neighbor, support vector machine, and Naive Bayes, are investigated following the classification approach of Jaiswal and Hines (2020), where it was applied to 8 noise classes instead of 4. The simulation results for 4 noise classes (see Tables 6, 7) show that the XGBoost classifier achieves high accuracy in classifying the noise class of the speech signal. Therefore, we deploy the XGBoost classifier.

Extreme gradient boosting (XGBoost) is an ensemble of classification and regression trees (CART; Chen & Guestrin, 2016). In XGBoost, the trees are optimized using the gradient boosting technique (Friedman, 2001), which minimizes the loss function of the model by adding weak learners using gradient descent. Let the output of a tree be:

$$f(x_{i})=w_{q(x_{i})},$$
(1)

where \(x_{i}\) is the input vector, q maps an input to its leaf, and \(w_{q(x_{i})}\) is the score of the corresponding leaf. The output of an ensemble of K trees is then (Dimitrakopoulos et al., 2018):

$$\hat{y}_{i} = \sum _{k=1}^{K} {f_k}(x_{i}).$$
(2)

The XGBoost algorithm tries to minimize the following objective function J at step t (Dimitrakopoulos et al., 2018):

$$J(t) = \sum _{i=1}^{n} L(y_{i},\hat{y_{i}}^{t-1}+ f_t(x_{i})) + \sum _{i=1}^{t} \varOmega (f_{i}),$$
(3)

where the first term is the training loss function L (e.g., mean square error for regression and binary cross-entropy for classification) between the actual class y and the predicted output \({\hat{y}}\) over n samples, and the second term is a regularization term that controls the complexity of the model and avoids overfitting. With T being the number of leaves, \(\gamma\) the pseudo-regularization hyper-parameter, and \(\lambda\) the L2 penalty on the leaf weights, the complexity \(\varOmega (f)\) is defined as (Dimitrakopoulos et al., 2018):

$$\varOmega (f) = \gamma T + \frac{1}{2} \lambda \sum _{j=1}^{T} {w_j}^{2}.$$
(4)

We anticipate that the classifier will learn the relationship between the quality estimates of the unprocessed speech and those of the enhanced speech, enabling it to correctly classify the noise class of the input speech signal.
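For concreteness, a minimal training sketch using the `xgboost` Python package is given below. The synthetic data and the hyper-parameter values are illustrative only; the tuned settings actually used per noise class are those reported in Table 4.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X: (n_samples, 13) MOS feature vectors; y: 0/1 labels for one
# one-vs-all noise-class task. Random values stand in for real data.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 4.5, size=(120, 13))
y = rng.integers(0, 2, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = XGBClassifier(
    n_estimators=100,   # number of boosted trees (K in Eq. 2)
    max_depth=3,        # tree depth, part of model complexity
    learning_rate=0.1,  # shrinkage applied to each weak learner
    reg_lambda=1.0,     # L2 penalty on leaf weights (lambda in Eq. 4)
    gamma=0.0,          # per-leaf penalty (gamma in Eq. 4)
)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```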

2.2 Voice activity detection

The VAD is the second building block of the proposed metric. Based on speech features, the VAD identifies the active voiced segments of the input speech signal (Fukuda et al., 2018), as shown in Fig. 4. Speech quality and intelligibility depend only on the voiced segments of the input speech signal. The VAD acts as a pre-processing unit that removes silences from the input speech signal before it is passed to the speech quality estimation component. Our previous study (Jaiswal, 2022) found that the weighted spectral centroid (WS) VAD performs excellently compared to other VADs in detecting and separating the voiced segments of noisy speech signals.

Fig. 4 Block diagram of the WS VAD illustrating the feature extraction technique used to segregate silences from the speech signal

The weighted spectral centroid VAD extracts the spectral centroid (SC) feature from the speech signal. The spectral centroid is the centre of mass of the spectrum, and a high value corresponds to the "brightness" of the sound. To develop the WS VAD, the speech samples are divided into overlapping frames of size 25 ms with a 10 ms shift, and the short-time spectral centroid of each frame is calculated as (Jaiswal, 2022):

$${SC}_{i}= \frac{\sum _{k=1}^{N} (k+1) X_{i}(k)}{\sum _{k=1}^{N} X_{i}(k)},$$
(5)

where \(X_{i}(k),\,k = 1,2,\ldots ,N\) are the discrete Fourier transform (DFT) coefficients of the ith short-time frame of signal X with frame length N. After calculating the short-time SC of each frame, a smoothing (median) filter (Deligiannidis & Arabnia, 2014) is applied across the feature sequence. A histogram of the smoothed sequence is then computed and its local maxima are detected. The threshold T is measured dynamically as the weighted average of the first and second local maxima, \(T = \frac{W_{1} M_{1} + W_{2} M_{2}}{W_{1} + W_{2}}\), where \(M_{1}\) and \(M_{2}\) are the first and second local maxima, respectively, and \(W_{1}\) and \(W_{2}\) are user-defined weights, set as \(W_{1} = 5\) and \(W_{2} = 1\). The measured threshold is applied to each frame to extract the voiced segments of the speech signal.
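A minimal sketch of this procedure is given below; the median-filter length and histogram bin count are illustrative choices, not values specified in the paper.

```python
import numpy as np
from scipy.signal import find_peaks, medfilt

def ws_vad_mask(x, fs, frame_ms=25, hop_ms=10, w1=5.0, w2=1.0):
    """Frame-wise voiced/unvoiced mask from the weighted spectral centroid."""
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame) // hop)
    sc = np.empty(n_frames)
    for i in range(n_frames):
        X = np.abs(np.fft.rfft(x[i * hop:i * hop + frame]))
        k = np.arange(1, len(X) + 1)
        sc[i] = np.sum((k + 1) * X) / (np.sum(X) + 1e-12)    # Eq. (5)
    sc = medfilt(sc, kernel_size=5)                          # smoothing filter
    hist, edges = np.histogram(sc, bins=30)                  # histogram of SC
    peaks, _ = find_peaks(hist)                              # local maxima
    if len(peaks) < 2:                                       # degenerate case
        return sc > sc.mean()
    m1, m2 = edges[peaks[0]], edges[peaks[1]]                # first two maxima
    t = (w1 * m1 + w2 * m2) / (w1 + w2)                      # dynamic threshold
    return sc > t                                            # True = voiced
```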

Fig. 5 VAD in P.563 (Jaiswal & Hines, 2018)

It is worth mentioning that the pre-processing stage in P.563 (ITU, 2004) includes an internal VAD based on an adaptive power threshold computed iteratively, as shown in Fig. 5. However, this internal VAD only computes a range of features and the active and inactive portions of speech activity; it does not remove silences from the speech sample, as can be seen in the full P.563 specification (ITU, 2004). Moreover, most VAD algorithms fail as the level of background noise increases (Ramirez et al., 2007). For example, Hines et al. (2015a) showed that the P.563 metric performs poorly when estimating the quality of speech samples containing silences. Therefore, we deploy the WS VAD (in Step 2) to pre-process the speech samples before developing the collection of noise class-specific (context-specific) SQMs in Step 3.

2.3 Speech quality estimation

A collection of DNN-based noise class-specific SQMs, each trained and optimised for a particular noise class (context), forms the third building block of the proposed metric. Fully-connected DNNs and feature selection techniques, namely lasso and ridge, are deployed for its development.

2.3.1 Fully-connected DNNs

A DNN is an artificial neural network (ANN) consisting of L layers: one input layer, \(L-2\) hidden layers, and one output layer, as shown in Fig. 6. The output of the DNN is a cascade of non-linear transformations of the input. Consider a feed-forward DNN with L layers, labelled \(l = 1,\ldots ,L\), each with corresponding dimension \(q_{l}\). Layer l is defined by the linear operation \({W_{l}} \in {\mathbb {R}}^{{q_{l-1}} \times q_{l}}\) followed by a non-linear activation function \(\sigma _{l} : R^{q_{l}} \rightarrow R^{q_{l}}\). Layer l receives the output of layer \(l-1\), denoted \({w_{l-1}} \in R^{q_{l-1}}\), and computes its own output \(w_{l} \in R^{q_{l}}\) as \({w_{l}} := \sigma _{l}({W_{l}}{{w_{l-1}}})\), where \(\sigma _{l}(\cdot )\) is a point-wise activation function. The final output of the DNN, \(w_{L}\), is related to the input \(w_{0}\) by propagation through the layers of the DNN as \(w_{L} = \sigma _{L}(W_{L}(\sigma _{L-1}(W_{L-1}(\ldots (\sigma _{1}({W_{1}}{w_{0}}))))))\), where the DNN learns the layer-wise weights \(W_{1},W_{2},\ldots ,W_{L}\) (Eisen et al., 2018).

Fig. 6 A simple three-layered feed-forward DNN used to develop a noise-class (context) specific speech quality metric (SQM)

To develop the collection of noise-class (context) specific speech quality metrics (SQMs), the noisy speech signals processed by the WS VAD (that is, speech signals with silences removed) and the noise class-specific training data are fed to the input of a DNN, as shown in Fig. 6. Note that zero padding (Engelberg, 2008) is applied at the end of the processed speech samples so that all samples fed to the DNN have the same length. The output of the DNN is the speech quality prediction (MOS score). The hidden-layer activation functions \(\sigma _{l}\) are either the rectified linear unit (ReLU), defined as \({\sigma _{l}(x) = 0}\) for \(x<0\) and x for \(x \ge 0\), or linear, defined as \({\sigma _{l}(x) = cx}\) where c is a constant; the choice varies per noise class. The output layer \(\sigma _{o}\) uses a linear activation function. The noise-class classifier acts as a filter: it first classifies the noise class of the input speech signal and then activates the corresponding trained DNN to obtain the predicted speech quality (MOS score).
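A minimal Keras sketch of one such noise class-specific SQM is shown below; the layer sizes, dropout rate and learning rate are illustrative placeholders, not the tuned values of Table 5.

```python
import tensorflow as tf

def build_sqm(input_dim, hidden=(128, 64), dropout=0.2):
    """One noise class-specific SQM: dense ReLU layers, linear MOS output."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(input_dim,)))
    for units in hidden:
        model.add(tf.keras.layers.Dense(units, activation="relu",
                                        kernel_initializer="glorot_uniform"))
        model.add(tf.keras.layers.Dropout(dropout))  # regularization
    model.add(tf.keras.layers.Dense(1, activation="linear"))  # MOS score
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=1e-3),
                  loss="mse")  # MSE loss, Eq. (10)
    return model

# One model per noise class, trained only on that class's data, e.g.:
# sqm = build_sqm(input_dim=X_airport.shape[1])
# sqm.fit(X_airport, mos_airport, epochs=200, batch_size=16,
#         validation_split=0.2)
```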

2.3.2 Lasso feature selection

The least absolute shrinkage and selection operator (Lasso; Li et al., 2017) is a high-dimensional feature selection technique used to separate relevant features from irrelevant ones, which helps reduce model complexity. It uses cross-validation to adjust the tuning parameter and to prevent overfitting. Lasso is a linear regression method that minimizes the usual squared error loss plus an L1 penalty on the regression coefficients. The overall loss function to be minimized is (Jain et al., 2014):

$${\hat{\beta }} \leftarrow \mathop {\text{argmin}}\limits _\beta \left\{ {\sum _{n=1}^{N} \left( y_{n} - b - \sum _{d=1}^{D} \beta _{d} x_{dn}\right) ^{2}} \right\} + {\lambda {||\beta ||}_{1}},$$
(6)

where \(\{x_{n},y_{n}\}_{n=1}^{N}\) is the training data, b is the intercept, and the Lagrange multiplier \(\lambda\) balances the trade-off between the squared error loss and the L1 penalty \({||\beta ||}_{1}\) on the regression coefficients. The L1 norm imposes a competition between the regression coefficients, shrinking some of them to 0 and therefore producing a sparse model that explains the data with as few features as possible. The value of \(\lambda\) is typically inferred by cross validation; here, the optimization is performed using 5-fold cross validation.
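The following sketch shows this selection with scikit-learn's `LassoCV` on synthetic data; the non-zero coefficients identify the retained features.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 200))                    # illustrative features
y = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=48)

# lambda (called alpha in scikit-learn) chosen by 5-fold cross
# validation; features whose coefficients shrink to 0 are discarded.
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("selected features:", selected, "chosen lambda:", lasso.alpha_)
```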

2.3.3 Ridge feature selection

Ridge regression (Chowdhury et al., 2018) is another high-dimensional feature selection technique used to obtain the relevant features, allowing the DNN models to be trained smoothly with reduced complexity. It is a variant of the regularized least squares problem in which the penalty function is the squared L2 norm. With \(A \in {\mathbb {R}}^{n \times d}\) being the design matrix, \(b \in {\mathbb {R}}^{n}\) the response vector, and \(\lambda > 0\) the regularization parameter, the ridge coefficients minimize the penalized residual sum of squares (Chowdhury et al., 2018):

$$Z = \mathop {\text{min}}\limits _{x} \left\{ {||Ax - b||}_{2}^{2} + {\lambda } {{||x||}_{2}^{2}}\right\} .$$
(7)

Solving the standard least-squares problem without regularization may fit the training data well but may not generalize to the test data. To address this, ridge regression waives the requirement of an unbiased estimator: at the cost of introducing bias, it reduces the variance and can thus reduce the overall mean square error (MSE). The tuning parameter is adjusted using the efficient leave-one-out cross-validation technique (Meijer & Goeman, 2013), which also helps prevent overfitting.
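A matching sketch with scikit-learn's `RidgeCV` is given below; with its default `cv=None`, the class uses the efficient leave-one-out formulation mentioned above, and the candidate lambdas are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 200))                    # illustrative features
y = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=48)

# Default cv=None performs efficient leave-one-out cross validation
# over the candidate regularization strengths.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print("chosen lambda:", ridge.alpha_)
```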

3 Experimental dataset

In daily life, noise appears in different shapes and forms, and each type of noise has distinct behaviour and spectral characteristics. For example, fan noise from a personal computer is stationary, whereas restaurant noise, where multiple people speak in the background, is non-stationary due to its constantly changing spectral characteristics. Consequently, improving the quality of a speech signal degraded by non-stationary noise is more difficult than for stationary noise.

Different datasets have different applicabilities. For example, the ITU-T P.Supplement-23 database (1998) contains the coded speech utterances used in the ITU-T 8 kbps codec characterization tests (Dubey & Kumar, 2013), which comprised three experiments. Experiment 1 examines the G.729 codec with coded speech samples and is thus not useful for our study. Experiment 2 investigates the effect of background noise on transmission quality, but the assessment method used is comparison category rating (CCR) rather than ACR, so it is also not suitable. Experiment 3 investigates the effects of channel degradations using coded speech samples and is likewise not useful. Hence, this dataset is not suitable for testing our proposed QoE metric because it does not contain the different types of noise degradations present in speech signals in noisy environments. Furthermore, there is no large open-source real-world noisy speech dataset with listener (subjective) quality ratings that could support the assessment of our proposed metric; obtaining subjective quality ratings for a large noisy dataset is cumbersome work.

Since speech signals are degraded by different types of environmental noise, a noisy speech dataset is needed to investigate the performance of the proposed QoE metric. NOIZEUS (Hu & Loizou, 2006) is an open-source noisy speech dataset containing different types of noise degradations. It comprises 30 phonetically-balanced IEEE English sentences, pronounced by three male and three female speakers. Each sentence is degraded with four types of commonly occurring real-world noise (airport, exhibition, restaurant and station) at two SNRs, 5 dB and 10 dB. The noises are taken from the AURORA database (Hirsch & Pearce, 2000). Each sentence is down-sampled from 25 to 8 kHz, that is, to narrow-band speech. All speech samples are saved in .wav format (16-bit PCM, mono) and the average duration of each utterance is three seconds.

4 Evaluation of the proposed speech quality metric

In this section, we present the methodology used to evaluate the proposed metric and its associated building blocks.

4.1 Evaluation of the noise-class classifier

To evaluate the noise-class classifier, we take the 30 noisy speech samples available in the NOIZEUS speech corpus for each of four real-world noise classes (airport, exhibition, restaurant and station) at two SNRs (5 dB and 10 dB). Thus, we have 60 \((30\,\text {samples} \times 2\,\text {SNRs})\) noisy speech samples per noise class, for a total of 240 \((30\,\text {samples} \times 4\,\text {noise classes} \times 2\,\text {SNRs})\) noisy speech samples. To extract the features for training the noise-class classifier, each noisy speech sample is processed by the 12 standard speech enhancement algorithms (see Table 1), yielding 12 processed speech samples plus the unprocessed original, that is, 13 (12 + 1) variants per sample. These 13 variants, \(\hat{n_{1}}(t),\hat{n_{2}}(t),\ldots ,\hat{n_{13}}(t)\), are fed to the P.563 metric (see Fig. 3) to obtain 13 objective speech quality predictions, \(\text {MOS}_{1},\text {MOS}_{2},\ldots ,\text {MOS}_{13}\). We then concatenate these 13 MOS scores and use them as the input feature vector for training the classifier to classify the noise class (context) of a given test speech sample.

We have only 60 noisy speech samples per noise class, which is very small for training a data-driven classifier accurately. With four noise classes, training a single multi-class ML classifier results in a poor classification accuracy of only 38%. Therefore, we adopt a one-vs-all approach to multi-class classification, that is, binary classification with imbalanced datasets for each noise class, following the strategy shown in Fig. 7 (a sketch is given after Fig. 7). We assign the first noise class (e.g., airport) as "class 0" and the remaining three noise classes as "class 1", labelling the task "Airport". Similarly, we assign the second noise class as "class 0" and the remaining three as "class 1", labelling it "Exhibition", and follow the same strategy for the remaining noise classes. This makes the binary class samples imbalanced, so to balance the two classes, we reduce the size of the majority class (class 1) to the size of the minority class (class 0) using under-sampling (Drummond & Holte, 2003; Fernández et al., 2018). Once both classes are balanced, we divide the data of each task in an 80:20 ratio for training and testing the classifier.

Fig. 7 Flow diagram of binary classification with an imbalanced dataset
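A sketch of this balancing strategy is given below; the feature matrix `X` and integer `labels` are assumed inputs, with `X` holding the 13-dimensional MOS feature vectors described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def one_vs_all_split(X, labels, target, seed=0):
    """Balanced binary dataset for one noise class (strategy of Fig. 7).

    `target` becomes class 0; the other noise classes form class 1 and
    are under-sampled to the minority-class size before the 80:20 split.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == target)               # class 0 (minority)
    neg = np.flatnonzero(labels != target)               # class 1 (majority)
    neg = rng.choice(neg, size=len(pos), replace=False)  # under-sampling
    idx = np.concatenate([pos, neg])
    y = (labels[idx] != target).astype(int)              # 0 = target, 1 = rest
    return train_test_split(X[idx], y, test_size=0.2,    # 80:20 split
                            stratify=y, random_state=seed)
```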

To evaluate the performance of the classifier, the F-score (FS), or test accuracy, and the geometric mean (G-mean), or balanced accuracy, are investigated. The F-score (Belarouci & Chikh, 2017) is the weighted harmonic mean of precision (PR) and recall (RC). The G-mean (Belarouci & Chikh, 2017) is the geometric average of the classification accuracy on the minority and majority classes and evaluates the model's ability to correctly classify both. With TP, TN, FP, and FN being the true positives, true negatives, false positives and false negatives, respectively, the F-score and G-mean are given as (Jaiswal & Hines, 2020):

$$\begin{aligned}&PR = \frac{TP}{TP+FP},\quad RC = \frac{TP}{TP+FN}, \\&FS = {\frac{2(PR \times RC)}{(PR + RC)}}, \end{aligned}$$
(8)
$$\begin{aligned}&G{\text {-mean}} = \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{TN+FP}}. \end{aligned}$$
(9)
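These two scores follow directly from the confusion-matrix counts, as the short helper below illustrates.

```python
import numpy as np

def fs_gmean(tp, tn, fp, fn):
    """F-score (Eq. 8) and G-mean (Eq. 9) from confusion-matrix counts."""
    pr = tp / (tp + fp)                   # precision
    rc = tp / (tp + fn)                   # recall (sensitivity)
    fs = 2 * pr * rc / (pr + rc)          # harmonic mean of PR and RC
    gmean = np.sqrt(rc * tn / (tn + fp))  # balanced accuracy
    return fs, gmean

# Example: fs_gmean(tp=10, tn=9, fp=2, fn=3)
```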

4.2 Evaluation of the VAD

To evaluate the performance of the VAD, we obtain the frame-wise binary mask of the noisy speech samples using the WS VAD and compare it to the ground truth (GT) VAD mask to obtain the TP, TN, FP and FN for calculating precision (PR), recall (RC) and F-score (FS) (Jaiswal & Hines, 2020). The F-score measures the test accuracy; its maximum value is 1 (PR = RC = 1), which signifies that the voice activity decisions of the VAD algorithm match the reference transcription exactly.

4.3 Evaluation of the noise-class specific speech quality metric

The availability of sufficiently large learning datasets is key to the success of DNN models in replicating different optimization-based speech quality solutions. However, large learning datasets are not openly available to satisfy the data-hungry nature of DNN models, because most real data is publicly unavailable and/or costly to obtain. Alternatively, one can generate realistic data using advanced deep learning techniques such as generative adversarial networks (GANs; Goodfellow et al., 2014); however, that is beyond the scope of this research.

The speech samples of each noise class processed by the WS VAD, together with the training data of the corresponding noise class (see Fig. 3), are fed as input to the DNNs (see Fig. 6). The subjective speech quality ratings (MOS-LQS) of each noise class, obtained from Dubey and Kumar (2015), are the DNN targets. A collection of DNNs (see Fig. 6) is trained and optimised, one per noise class, to obtain the noise-class specific speech quality prediction metrics. The motivation for exploiting DNNs is primarily their universal approximation capability (Sun et al., 2017), supplemented by the fact that trained DNN models are computationally simple to execute (Ye et al., 2017).

Next, the lasso and ridge feature selection techniques are used to extract the most appropriate features for training the DNN of each noise class: lasso is used for the station class, and ridge for the airport, exhibition and restaurant classes. The layers of the DNNs are densely connected. Various combinations of hidden layers and weights were experimented with to achieve the best DNN model of each noise class that could be trained in reasonable time. For the performance evaluation of the DNNs, the training and testing mean square error (MSE) of each noise class are computed. MSE is defined as the average of the squared differences between the actual (subjective) and predicted (objective) quality scores. With n being the number of speech samples, and \(y_{1},y_{2},\ldots ,y_{n}\) and \({\hat{y}}_{1},{\hat{y}}_{2},\ldots ,{\hat{y}}_{n}\) being the actual (subjective) and predicted (objective) quality scores, respectively, the MSE is given as:

$$MSE = \frac{1}{n}\sum _{i=1}^{n} (y_{i} - {\hat{y}}_{i})^{2}.$$
(10)

4.4 Evaluation of the overall proposed metric

The Pearson correlation \(\rho _{p}\), Spearman correlation \(\rho _{s}\) and root mean square error (RMSE) between the objective quality scores (MOS-LQO) obtained by the proposed metric for each test sample of each noise class and the corresponding subjective listener quality scores (MOS-LQS) are used for the performance evaluation of the overall proposed metric. With \({\mu }_{y}\) and \(\mu _{{\hat{y}}}\) being the means of the subjective and predicted quality scores, respectively, and \(d_{i}\) being the difference between the ranks of the subjective and predicted quality scores, these measures are given as (Falk & Chan, 2006; Sharma et al., 2016):

$$\begin{aligned}&\rho _{p} = \frac{\sum _{i=1}^{n} ({\hat{y}}_{i} - \mu _{{\hat{y}}}) (y_{i} - {\mu }_{y})}{\sqrt{\sum _{i=1}^{n} ({\hat{y}}_{i} - \mu _{{\hat{y}}})^{2}} \sqrt{\sum _{i=1}^{n} (y_{i} - {\mu }_{y})^{2}}}, \end{aligned}$$
(11)
$$\begin{aligned}&\rho _{s} = 1 - \frac{6\sum _{i=1}^{n} {d_{i}^{2}}}{n(n^{2}-1)}, \end{aligned}$$
(12)
$$\begin{aligned}&RMSE = \sqrt{\frac{1}{n}\sum _{i=1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}. \end{aligned}$$
(13)
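These three measures can be computed directly with `scipy.stats`, as in the sketch below.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_metric(mos_lqs, mos_lqo):
    """Pearson (Eq. 11), Spearman (Eq. 12) and RMSE (Eq. 13) between
    subjective (MOS-LQS) and objective (MOS-LQO) quality scores."""
    mos_lqs, mos_lqo = np.asarray(mos_lqs), np.asarray(mos_lqo)
    rho_p, _ = pearsonr(mos_lqs, mos_lqo)
    rho_s, _ = spearmanr(mos_lqs, mos_lqo)
    rmse = np.sqrt(np.mean((mos_lqs - mos_lqo) ** 2))
    return rho_p, rho_s, rmse
```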
Table 4 Parameters and hyper-parameters of XGBoost classifier for each noise class
Table 5 Parameters and hyper-parameters of each DNN-based speech quality metric (SQM)

5 System design

This section describes the simulation platform and the choices of parameters and hyper-parameters used to train the classifier and the DNN models of the proposed SQM. MATLAB 2018a is used to implement the VAD algorithm. The standard P.563 metric is programmed in "C". The classifier and the DNN-based SQMs are implemented in Python 3.7.3 with TensorFlow 2.2.0 on a Windows 10 laptop with an 8th-generation Intel Core i5 processor, Intel UHD Graphics 620, and 16 GB of memory.

5.1 Choice of parameters and hyper-parameters

To obtain a noise-class classifier that can accurately classify the noise class of a noisy speech signal, and to develop a collection of robust DNN-based SQMs trained for each noise class, different parameters and hyper-parameters were experimented with in our system design. Different search algorithms, for example, grid search and random search (Bergstra & Bengio, 2012), were also used to obtain the most appropriate parameters.

For the tested XGBoost classifier, the choices of parameters and hyper-parameters are presented in Table 4.
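As an illustration of the search procedure, the sketch below runs a grid search over a few XGBoost hyper-parameters; the grid values are placeholders, and the settings actually adopted per noise class are those in Table 4.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid only; X_train/y_train stand for one balanced
# one-vs-all task as constructed in Sect. 4.1.
search = GridSearchCV(
    XGBClassifier(),
    param_grid={"max_depth": [2, 3, 4],
                "n_estimators": [50, 100, 200],
                "learning_rate": [0.05, 0.1, 0.3]},
    scoring="f1", cv=5)
# search.fit(X_train, y_train)
# print(search.best_params_)
```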

To obtain robust DNN-based SQMs trained and optimized for each noise class, that is, context-specific speech quality metrics, different numbers of hidden layers with variable numbers of neurons were experimented with. ReLU and linear activation functions are used for the hidden layers, depending on the noise class, and a linear activation function is used for the output layer. The weights of the hidden and output layers are initialized normally and with the Glorot uniform initializer (Glorot & Bengio, 2010). MSE is used as the loss/cost function. The NADAM optimizer (Dozat, 2016), which is the Adam optimizer (Kingma & Ba, 2015) with Nesterov momentum, is used for the stochastic optimization of the DNNs. Suitable learning rates are selected and mini-batches of different sizes are used for each SQM. Dropout (Srivastava et al., 2014) of different rates is applied after each hidden layer to reduce the risk of overfitting and to speed up DNN training. The dataset is standardized by removing the mean and scaling to unit variance. Table 5 presents the parameters and hyper-parameters of the DNN-based SQMs trained and optimized for each noise class individually.

5.2 Training stage

For training the DNN-based SQMs, the data of each noise class is divided in an 80:20 ratio: 80% of the data is used for training and 20% for testing. The entire training set is used to optimize the weights of each DNN model.

5.3 Testing stage

During testing, each test speech sample is passed to the trained XGBoost classifier to identify its noise class. The test sample is then passed to the corresponding trained and optimized SQM to obtain the objective speech quality prediction (MOS-LQO). In addition, all test speech samples together are passed through the trained classifier to detect their noise classes and then through the corresponding SQMs to obtain the objective speech quality predictions (MOS-LQO).

6 Results and discussion

This section presents the numerical results, which complete our proposed metric and address the gap identified in the literature. It also discusses the results of each building block to showcase the effectiveness of the proposed metric.

6.1 Noise-class classifier response

The precision, recall and F-score of the XGBoost classifier for each noise class are presented in Table 6, and its G-mean (balanced accuracy) for each noise class in Table 7. Both tables show that the classifier has an average test accuracy and balanced accuracy of 82%. Therefore, the XGBoost classifier is used to develop the complete proposed metric.

Table 6 Precision, Recall and F-score of XGBoost classifier for each noise class along with average score
Table 7 G-mean of XGBoost classifier for each noise class along with average score

6.2 Voice activity detection response

The performance of the WS VAD (Jaiswal, 2022) is measured on the noisy speech samples of each noise class of the NOIZEUS speech corpus (see Sect. 3) at two SNRs (5 dB and 10 dB) and compared to the ground truth (GT) VAD, the energy (E) VAD and the weighted energy (WE) VAD. Precision, recall and F-score are calculated for each noise class; the F-scores are presented in Table 8. The E and WE VADs detect the voiced segments inaccurately compared to the GT VAD, resulting in poor accuracy, whereas the WS VAD correctly detects the speech components for each noise class, resulting in high accuracy. Moreover, when all 240 noisy speech samples are tested together, the WS VAD achieves an accuracy of 94.7%. The WS VAD is therefore highly robust for pre-processing the speech samples and is integrated into the complete proposed metric.

Table 8 F-score of each VAD for each noise class, and for all speech samples from all noise classes together

6.3 Speech quality metric response

Table 9 Model learning for each speech quality metric (SQM)
Fig. 8 Accuracy in terms of MSE for each speech quality metric: a Airport SQM, b Exhibition SQM, c Restaurant SQM, and d Station SQM

Table 9 presents the training and testing errors obtained while training the collection of DNNs that form the SQMs of each noise class. The test MSE of each SQM is comparable to its training MSE: all quality predictions of the test samples are estimated with a small error in the range of 0.01 to 0.04, that is, 1 to 4%. This indicates that the gap between the training and testing quality estimates of each individual SQM is very small and hence implies good quality predictions for our SQMs.

Figure 8 shows the accuracy of each trained speech quality metric in terms of MSE. The learning of each metric is smooth and converges towards a local minimum: as the number of epochs increases, the training loss (MSE) decreases and becomes stable, and the testing curve follows it, resulting in optimized accuracy. These noise-class specific (context-specific) optimized SQMs are therefore integrated into the complete proposed metric.

6.4 Overall proposed metric response

This section presents the Pearson correlation "\(\rho _{p}\)", Spearman correlation "\(\rho _{s}\)" and RMSE between the objective quality scores (MOS-LQO) obtained for each noise class and the corresponding subjective listener quality scores (MOS-LQS). Three scenarios are tested to investigate the effectiveness of our proposed metric.

6.4.1 Scenario A: proposed metric without noise-class classifier

In this scenario, the noise-class classifier is removed, that is, the noise class of the input noisy speech sample is not detected. The input noisy speech samples are processed only by the WS VAD, and the processed samples are then fed to the P.563 metric to obtain the objective quality scores (MOS-LQO). Table 10 presents \(\rho _{p}\), \(\rho _{s}\) and RMSE for each noise class in this scenario.

Table 10 shows that \(\rho _{p}\) and \(\rho _{s}\) are very poor for each noise class and for all noise classes grouped together: both correlations range between 26 and 52% only. When all noise classes are grouped together, \(\rho _{p}\) and \(\rho _{s}\) are only 0.428 and 0.432, respectively. Moreover, the RMSE is very high, ranging between 83 and 97%. This implies that Scenario A is not suitable for measuring and monitoring speech quality in real-time, and it suggests that classifying the noise class may result in better speech quality prediction, as incorporated in our proposed metric (Scenario C).

Table 10 \(\rho _{p}\), \(\rho _{s}\) and RMSE for each noise class, with grouped results for all noise classes, without the noise-class classifier (Scenario A)

6.4.2 Scenario B: proposed metric without both noise-class classifier and VAD

In this scenario, both the noise-class classifier and the VAD are removed: the noise class of the input noisy speech samples is not detected, and the samples are not processed by the VAD. In other words, the input noisy speech samples are fed directly to the P.563 metric to obtain the objective quality scores (MOS-LQO). Table 11 presents \(\rho _{p}\), \(\rho _{s}\) and RMSE for each noise class in this scenario.

Table 11 shows that \(\rho _{p}\) and \(\rho _{s}\) are again very poor for each noise class and for all noise classes grouped together: both correlations range between 35 and 52% only. When all noise classes are grouped together, \(\rho _{p}\) and \(\rho _{s}\) are only 0.440 and 0.448, respectively. Moreover, the RMSE is very high, ranging between 80 and 94%. Although the results in Scenario B are slightly better than those in Scenario A, Scenario B is also not suitable for measuring and monitoring speech quality in real-time. The results again suggest that classifying the noise class and pre-processing the speech samples with a VAD may result in better speech quality prediction; therefore, both the noise-class classifier and the VAD are incorporated in our proposed metric (Scenario C).

Table 11 \(\rho _{p}\), \(\rho _{s}\) and RMSE for each noise class, with grouped results for all noise classes, without both the noise-class classifier and the VAD, that is, only P.563 present (Scenario B)

6.4.3 Scenario C: proposed metric

This is the scenario of our main proposed noise class-aware speech quality prediction metric, in which both the noise-class classifier and the VAD are present: the noise class of each input noisy speech sample is first detected, the sample is then processed by the WS VAD to segregate silences, and the processed speech sample is fed to the P.563 metric to obtain the objective speech quality score (MOS-LQO). Table 12 presents \(\rho _{p}\), \(\rho _{s}\) and RMSE for each noise class in this scenario.

The results of our proposed metric (Scenario C), presented in Table 12, show that \(\rho _{p}\) and \(\rho _{s}\) are highest for the station class, followed by the airport class. The exhibition and restaurant classes also have good \(\rho _{p}\), though lower than the airport class; the restaurant class has better \(\rho _{s}\), followed by the exhibition and airport classes. The correlation values range between 76 and 97% for \(\rho _{p}\) and between 63 and 92% for \(\rho _{s}\). When all noise classes are grouped together, \(\rho _{p}\) and \(\rho _{s}\) are 0.931 and 0.928, respectively, which are much better. In addition, the RMSE of the restaurant class is the smallest among the noise classes, and the RMSE overall varies in the range of 23-33%, which is acceptable. Overall, our proposed metric performs outstandingly, giving around 93% accuracy in terms of both correlations and the lowest RMSE when tested with all test samples from each noise class together. This justifies the incorporation of both the noise-class classifier and a VAD in our proposed metric (Scenario C). As a result, our proposed metric is suitable for measuring and monitoring speech quality in real-time.

Table 12 \(\rho _{p}\), \(\rho _{s}\) and RMSE for each noise class, with grouped results for all noise classes, for our main proposed metric (Scenario C)

6.4.4 Comparison of all scenarios

Tables 10, 11 and 12 show that the \(\rho _{p}\), \(\rho _{s}\) and RMSE of our proposed SQM (Scenario C) are better than those of Scenario A and Scenario B for each noise class and for all noise classes tested together. When tested with all noise classes together, the \(\rho _{p}\) value of our proposed metric (0.931, Table 12) is the highest, compared to 0.428 for Scenario A (Table 10) and 0.440 for Scenario B (Table 11). Similarly, the \(\rho _{s}\) value of our proposed metric (0.928, Table 12) is the highest, compared to 0.432 for Scenario A (Table 10) and 0.448 for Scenario B (Table 11). In addition, the RMSE of our proposed metric (0.232, Table 12) is the lowest, compared to 0.920 for Scenario A (Table 10) and 0.892 for Scenario B (Table 11). Moreover, both correlation values of our proposed metric (Scenario C) are around 53% higher than those of Scenario A and Scenario B, and its RMSE is around 73% lower. This shows that our proposed metric clearly outperforms the two baseline scenarios. The deployment of both pipelines, that is, the noise-class classifier for detecting the noise class of the input noisy speech sample and the WS VAD for identifying and separating the voiced segments, is key to our proposed speech quality metric. In particular, it is the sensitivity to the noise class that leads to the better results of our proposed metric.

Fig. 9 Correlations and RMSE for two baselines (Scenario A and Scenario B) and our proposed metric (Scenario C): a Airport class, b Exhibition class, c Restaurant class, d Station class, and e All noise classes together

6.4.5 Plots of correlations and RMSE

To visualize and compare the results clearly, Fig. 9 presents the correlations and RMSE of our proposed metric (Scenario C) alongside the two baselines (Scenario A and Scenario B). Our proposed metric has higher correlations and lower RMSE for each noise class and for all noise classes together, showing superior performance.

6.4.6 Scatter plot between subjective and objective MOS

To explore the impact of our proposed metric, Fig. 10 shows the scatter plot between the objective quality predictions obtained from each speech quality metric trained for each noise class individually (MOS-LQO) and the corresponding subjective quality ratings (MOS-LQS). A good correlation can be observed for the test samples of each noise class.

Fig. 10 Subjective vs. objective quality predictions for each noise class

6.4.7 Comparison of execution time

Table 13 Execution time for a single speech sample

An important figure of merit of a speech quality assessment metric for real-time speech quality measurement and monitoring is its processing time. Table 13 presents the difference in computational complexity between our proposed metric and the P.563 metric when executing a single test speech sample. Our proposed metric is faster, saving computational time.

The overall simulation results of the proposed SQM show that the deep learning-based approach has a great advantage when speech signals are distorted by various types of noise degradation in the surroundings. They demonstrate that DNNs have great potential to memorize and analyze the complicated characteristics of speech signals. Therefore, we believe that our proposed SQM can be deployed by internet service providers for measuring and monitoring real-time speech quality in environments where speech quality is degraded by different types of background noise, so that QoE-aware management actions can be taken to react and maintain end-user QoE levels.

7 Conclusions and future work

This paper proposes a context-aware speech quality assessment metric for measuring and monitoring the real-time speech quality of voice communication systems in noisy environments. The first component of the proposed metric is a noise-class classifier, which uses speech enhancement algorithms in conjunction with the P.563 metric to compute speech quality scores (MOS scores) that serve as the input feature vector for training the classifier to detect the noise class (context) of the input noisy speech signal. The second component is a VAD, which pre-processes the speech samples to identify and segregate the voiced segments. The third component is a set of noise class-specific (context-specific) SQMs, developed by training and optimizing DNNs for each noise class individually. The noise class of an input test speech sample is identified by the noise-class classifier, which then activates the corresponding SQM to obtain the optimized speech quality prediction (MOS score). The results show that, for our proposed metric, the correlation between the subjective and objective quality predictions is high and the RMSE is low, both for each noise class individually and when all noise classes are tested together, compared to a metric in which the noise class is not detected before predicting speech quality. The proposed metric also takes less time to process a speech sample. This indicates that the proposed metric estimates the perceived quality of speech sequences with better accuracy. The proposed speech quality assessment metric is computationally efficient and presents a significant advantage over SQMs in the literature that predict speech quality directly without explicitly identifying the noise class.

The proposed metric is not limited to narrow-band speech signals; the narrow-band nature of the test data simply reflects the data used to develop the model, and the same strategy can be applied to wide-band speech signals in future. We also expect that extending the dataset with a wider variety of noise classes will produce even better results. Since no VoIP conversational dataset is publicly available in the literature, developing such a dataset with subjective quality ratings to validate the proposed metric is part of our future work.