1 Introduction

The objective of speech enhancement (SE, also called noise reduction) algorithms is to improve one or more perceptual aspects of the noisy speech by decreasing the background noise without affecting the intelligibility of the speech [1]. Research on SE can be traced back 40 years to two patents by Schroeder [2], in which an analog implementation of the spectral magnitude subtraction method was described. Since then, the problem of enhancing speech degraded by uncorrelated additive noise, when only the noisy speech is available, has become an area of active research [3]. Researchers and engineers have approached this challenging problem by exploiting different properties of speech and noise signals to achieve better performance [4].

SE techniques have a broad range of applications, from hearing aids to mobile communication, voice-controlled systems, multiparty teleconferencing, and automatic speech recognition (ASR) systems [4]. The algorithms can be summarized into four classes: spectral subtractive [5]–[8], sub-space [9],[10], statistical model-based [11]–[13], and Wiener-type [3],[14]–[16] algorithms.

Much progress has been made in the development of SE algorithms capable of improving speech quality [17],[18], which has been evaluated mainly by objective performance criteria such as the signal-to-noise ratio (SNR) [19]. However, an SE algorithm that improves speech quality may not perform well in real-world listening situations where the background noise level and characteristics are constantly changing [20]. The first intelligibility study, done by Lim [21] in the late 1970s, found no intelligibility improvement with the spectral subtraction algorithm for speech corrupted by white noise at −5 to 5 dB SNR. Thirty years later, a study conducted by Hu and Loizou [1] found that none of the eight algorithms examined improved speech intelligibility relative to unprocessed (corrupted) speech. Moreover, according to [1], the algorithms with the highest overall speech quality may not perform the best in terms of speech intelligibility (e.g., logMMSE [12]), and the algorithm that performs the worst in terms of overall quality may perform well in terms of preserving speech intelligibility (e.g., KLT [9]). To our knowledge, very few speech enhancement algorithms [22]–[25] have been claimed to improve speech intelligibility in subjective tests with either normal-hearing or hearing-impaired listeners. Hence, this paper focuses on improving the speech intelligibility performance of the SE algorithm.

From [19], we know that the perceptual effects of attenuation and amplification distortion on speech intelligibility are not equal. Amplification distortion in excess of 6.02 dB (region III) has the most detrimental effect on speech intelligibility, while attenuation distortion (region I) was found to have the least effect on intelligibility; region II covers amplification distortion of up to 6.02 dB. Region I+II constraints are the most robust in terms of yielding consistently large benefits in intelligibility independent of the SE algorithm used. However, dividing the spectrum into those three regions [19] requires comparing the estimated magnitude spectrum with the clean spectrum, which is usually unavailable in practice.

In this paper, we explore the multitaper spectrum, which was shown in [26] to have good bias and variance properties. The spectral estimate was further refined in [16] by wavelet thresholding the log multitaper spectrum. Here, the refined spectrum is proposed as an alternative to the clean spectrum. The region I+II constraints are then imposed and incorporated in the derivation of the gain function of the Wiener algorithm based on the a priori SNR [3]. We have experimentally evaluated its performance under a variety of noise types and SNR conditions.

The rest of this paper is organized as follows. Section 2 provides the background information on wavelet thresholding the multitaper spectrum, and Section 3 presents the proposed approach, which imposes constraints on the Wiener filtering gain function. Section 4 describes the speech and noise database and the metrics used in the evaluation. The simulation results are given in Section 5. Finally, the conclusions and a discussion of this work are given in Section 6.

2 Wavelet thresholding the multitaper spectrum

In real-world scenarios, the background noise level and characteristics are constantly changing [20]. Better estimation of the spectrum is required to alleviate the distortion caused by SE algorithms. For speech enhancement, the most frequently used power spectrum estimator is direct spectrum estimation based on Hann windowing. However, windowing reduces only the bias, not the variance, of the spectral estimate [27]. The multitaper spectrum estimator [26], on the other hand, can reduce this variance by computing a small number (L) of direct spectrum estimators (eigenspectra), each with a different taper (window), and then averaging the L spectral estimates. The underlying philosophy is similar to Welch's method of modified periodograms [27]. The multitaper spectrum estimator is given by

$$\hat{S}^{mt}(\omega)=\frac{1}{L}\sum_{k=0}^{L-1}\hat{S}_k^{mt}(\omega)$$
(1)

with

$$\hat{S}_k^{mt}(\omega)=\left|\sum_{m=0}^{N-1}a_k(m)\,x(m)\,e^{-j\omega m}\right|^2,$$
(2)

where $N$ is the data length, and $a_k$ is the $k$th sine taper used for the spectral estimate $\hat{S}_k^{mt}(\cdot)$, which was proposed by Riedel and Sidorenko [28] and is defined by

$$a_k(m)=\sqrt{\frac{2}{N+1}}\,\sin\frac{\pi k(m+1)}{N+1},\quad m=0,\ldots,N-1.$$
(3)

The sine tapers were proved in [28] to produce smaller local bias than the Slepian tapers, with roughly the same spectral concentration.
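As an illustration of Equations 1 to 3, the following is a minimal pure-Python sketch of the sine-taper multitaper estimator. The function names and the uniform frequency grid are hypothetical choices, not taken from the paper. The sketch also assumes that the taper index runs over k = 1, ..., L, because k = 0 in Equation 3 would yield an all-zero taper.

```python
import cmath
import math


def sine_tapers(n, num_tapers):
    """Return num_tapers sine tapers of length n (Equation 3).

    Assumption: the taper index runs k = 1..num_tapers, since k = 0
    would give an all-zero taper.
    """
    scale = math.sqrt(2.0 / (n + 1))
    return [
        [scale * math.sin(math.pi * k * (m + 1) / (n + 1)) for m in range(n)]
        for k in range(1, num_tapers + 1)
    ]


def multitaper_spectrum(x, num_tapers, num_bins):
    """Average the eigenspectra (Equations 1 and 2).

    The spectrum is evaluated on a uniform grid of num_bins
    frequencies in [0, pi); the grid is an illustrative choice.
    """
    tapers = sine_tapers(len(x), num_tapers)
    spectrum = []
    for b in range(num_bins):
        omega = math.pi * b / num_bins
        total = 0.0
        for taper in tapers:
            # Tapered DFT at frequency omega (inner sum of Equation 2).
            dft = sum(a * s * cmath.exp(-1j * omega * m)
                      for m, (a, s) in enumerate(zip(taper, x)))
            total += abs(dft) ** 2
        spectrum.append(total / num_tapers)
    return spectrum
```

The direct summation is quadratic in the frame length and is meant only to mirror the equations; a practical implementation would compute each eigenspectrum with an FFT.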

The multitaper estimated spectrum can be further refined by wavelet thresholding techniques [29]–[31]. Improved periodogram estimates were proposed in [29], and improved multitaper spectrum estimates were proposed in [30],[31]. The underlying idea behind those techniques is to represent the log periodogram as 'signal' plus 'noise', where the signal is the true spectrum and the noise is the estimation error [32]. It was shown in [33] that if the eigenspectra defined in Equation 2 are assumed to be uncorrelated, the ratio of the estimated multitaper spectrum $\hat{S}^{mt}(\omega)$ to the true power spectrum $S(\omega)$ conforms to a chi-square distribution with $2L$ degrees of freedom, i.e.,

$$\upsilon(\omega)=\frac{\hat{S}^{mt}(\omega)}{S(\omega)}\sim\frac{\chi_{2L}^{2}}{2L},\quad 0<\omega<\pi.$$
(4)

Taking the log of both sides, we get

$$\log\hat{S}^{mt}(\omega)=\log S(\omega)+\log\upsilon(\omega).$$
(5)

From Equation 5, we know that the log of the multitaper spectrum can be represented as the sum of the true log spectrum and a $\log\chi^2$ distributed noise term. It follows from Bartlett and Kendall [34] that $\log\upsilon(\omega)$ has mean $\phi(L)-\log(L)$ and variance $\phi'(L)$, where $\phi(\cdot)$ and $\phi'(\cdot)$ denote, respectively, the digamma and trigamma functions. For $L\ge 5$, the distribution of $\log\upsilon(\omega)$ will be close to a normal distribution [35]. Hence, provided $L$ is at least 5, the random variable $\eta(\omega)$

$$\eta(\omega)=\log\upsilon(\omega)-\phi(L)+\log(L)$$
(6)

will be approximately Gaussian with zero mean and variance $\sigma_\eta^2=\phi'(L)$. If $Z(\omega)$ is defined as

$$Z(\omega)=\log\hat{S}^{mt}(\omega)-\phi(L)+\log(L),$$
(7)

then we have

$$Z(\omega)=\log S(\omega)+\eta(\omega),$$
(8)

i.e., the log multitaper power spectrum plus a known constant $(\log(L)-\phi(L))$ can be written as the true log power spectrum plus approximately Gaussian noise $\eta(\omega)$ with zero mean and known variance $\sigma_\eta^2$ [30].
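To make the bias correction in Equation 7 concrete, here is a small sketch that evaluates the digamma and trigamma functions at an integer number of tapers using their standard series at positive integers, then forms Z(ω) together with the noise variance. The function names are hypothetical, and restricting L to an integer is an assumption that holds for a taper count.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function at a positive integer n: -gamma + H_(n-1)."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))


def trigamma_int(n):
    """Trigamma function at a positive integer n: pi^2/6 - sum 1/k^2."""
    return math.pi ** 2 / 6.0 - sum(1.0 / k ** 2 for k in range(1, n))


def bias_corrected_log_spectrum(multitaper_spectrum, num_tapers):
    """Return Z(omega) from Equation 7 and the variance of eta(omega)."""
    offset = math.log(num_tapers) - digamma_int(num_tapers)
    z = [math.log(s) + offset for s in multitaper_spectrum]
    return z, trigamma_int(num_tapers)
```

For L = 16 the additive constant is about 0.0316 and the noise variance about 0.0645, so the correction is small but the variance is known exactly, which is what the thresholding step relies on.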

The model in Equation 8 is well suited for wavelet denoising techniques [36]–[39] for eliminating the noise η(ω) and obtaining a better estimate of the log spectrum. The idea behind refining the multitaper spectrum by wavelet thresholding can be summarized into four steps [16].

  • Obtain the multitaper spectrum using Equations 1 to 3 and calculate $Z(\omega)$ using Equation 7.

  • Apply a standard periodic discrete wavelet transform (DWT) out to level $q_0$ to $Z(\omega)$ to get the empirical DWT coefficients $z_{j,k}$ at each level $j$, where $q_0$ is specified in advance [40].

  • Apply a thresholding procedure to $z_{j,k}$.

  • Apply the inverse DWT to the thresholded wavelet coefficients to obtain the refined log spectrum.
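The last three steps above can be sketched as follows. The paper does not state which wavelet or threshold rule is used, so this sketch assumes an orthonormal Haar wavelet, soft thresholding, and the universal threshold σ√(2 log N); all three are illustrative assumptions, and the function names are hypothetical.

```python
import math


def haar_dwt(signal, levels):
    """Orthonormal Haar DWT; len(signal) must be divisible by 2**levels."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2.0) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    return approx, details


def haar_idwt(approx, details):
    """Invert haar_dwt, rebuilding from the coarsest level upwards."""
    current = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(current, detail):
            rebuilt.append((s + d) / math.sqrt(2.0))
            rebuilt.append((s - d) / math.sqrt(2.0))
        current = rebuilt
    return current


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, keeping its sign."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def refine_log_spectrum(z, noise_variance, levels):
    """DWT of Z, threshold the detail coefficients, inverse DWT."""
    approx, details = haar_dwt(z, levels)
    # Assumed rule: universal threshold sigma * sqrt(2 * log(N)).
    threshold = math.sqrt(2.0 * noise_variance * math.log(len(z)))
    details = [[soft_threshold(d, threshold) for d in level]
               for level in details]
    return haar_idwt(approx, details)
```

Only the detail coefficients are thresholded here; the coarsest approximation is left untouched so that the overall spectral shape is preserved.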

3 Speech enhancement based on constrained Wiener filtering algorithm

Among the numerous techniques that were developed, the Wiener filter can be considered as one of the most fundamental SE approaches, which has been delineated in different forms and adopted in various applications [4]. The Wiener gain function is the least aggressive, in terms of suppression, providing small attenuation even at extremely low SNR levels.

A block diagram of the proposed SE algorithm is shown in Figure 1. The initial four frames are assumed to be noise only. The algorithm can be described as follows. The input noisy speech signal is decomposed into 20-ms frames with an overlap of 10 ms using a Hann window. Each frame is transformed using a 160-point discrete Fourier transform (DFT). The spectra of the segmented noisy and noise signals are estimated by the multitaper method and then further refined by the wavelet thresholding technique. The estimated 'clean' spectrum is obtained from the refined multitaper estimates of the noisy and noise spectra. In parallel, the noise-corrupted sentences are enhanced by the Wiener algorithm based on a priori SNR estimation [3]. The region I+II constraints are then imposed on the enhanced spectrum. Finally, the inverse fast Fourier transform (FFT) is applied to obtain the enhanced speech signal.

Figure 1 Block diagram of the proposed speech enhancement algorithm.

The implementation details of the proposed method can be described in the following four steps. For each speech frame,

  • compute the multitaper power spectrum $\hat{S}_y^{mt}$ of the noisy speech $y$ using Equation 1 and estimate the multitaper power spectrum $\hat{S}_x^{mt}$ of the clean speech signal by $\hat{S}_x^{mt}=\hat{S}_y^{mt}-\hat{S}_n^{mt}$, where $\hat{S}_n^{mt}$ is the multitaper power spectrum of the noise. $\hat{S}_n^{mt}$ can be obtained using noise samples collected during speech-absent frames. Here, $L$ is set to 16. Any negative elements of $\hat{S}_x^{mt}$ are floored as follows:

    $$\hat{S}_x^{mt}=\begin{cases}\hat{S}_y^{mt}-\hat{S}_n^{mt}, & \text{if } \hat{S}_y^{mt}>\hat{S}_n^{mt}\\ \beta\,\hat{S}_n^{mt}, & \text{if } \hat{S}_y^{mt}\le\hat{S}_n^{mt},\end{cases}$$
    (9)

    where $\beta$ is the spectral floor, set to $\beta=0.002$.

  • compute $Z(\omega)=\log\hat{S}_y^{mt}(\omega)-\phi(L)+\log(L)$ and then apply the DWT of $Z(\omega)$ out to level $q_0$ to obtain the empirical DWT coefficients $z_{j,k}$ for each level $j$, where $q_0$ is specified to be 5 [40]. Threshold the wavelet coefficients $z_{j,k}$ and apply the inverse DWT to the thresholded wavelet coefficients to obtain the refined log spectrum, $\log\hat{S}_y^{\omega mt}(\omega)$, of the noisy signal. Repeat the above procedure to obtain the refined log spectrum, $\log\hat{S}_n^{\omega mt}(\omega)$, of the noise signal. The power spectrum $\hat{S}_x^{\omega mt}(\omega)$ of the clean speech signal can then be estimated using

    $$\hat{S}_x^{\omega mt}(\omega)=\hat{S}_y^{\omega mt}(\omega)-\hat{S}_n^{\omega mt}(\omega)$$
    (10)

  • let $Y(\omega,t)$ denote the magnitude of the noisy spectrum at time frame $t$ and frequency bin $\omega$ estimated by the method in [41]. Then, the estimate of the signal spectrum magnitude is obtained by multiplying $Y(\omega,t)$ with a gain function $G(\omega,t)$ as $\hat{X}(\omega,t)=G(\omega,t)\cdot Y(\omega,t)$. The Wiener gain function is based on the a priori SNR and is given by

    $$G(\omega,t)=\frac{\mathrm{SNR}_{prio}(\omega,t)}{1+\mathrm{SNR}_{prio}(\omega,t)},$$
    (11)

    where $\mathrm{SNR}_{prio}$ is the a priori SNR estimated using the decision-directed approach [3],[19] as follows:

    $$\mathrm{SNR}_{prio}=\alpha\cdot\frac{X_M^2(\omega,t-1)}{\hat{P}_D^2(\omega,t-1)}+(1-\alpha)\cdot\max\left[\frac{Y^2(\omega,t)}{\hat{P}_D^2(\omega,t)}-1,\,0\right],$$
    (12)

    where $\hat{P}_D^2(\omega,t)$ is the estimate of the power spectral density of the background noise, and $\alpha$ is a smoothing constant (typically set to $\alpha=0.98$).

  • to maximize speech intelligibility, the final enhanced spectrum, $X_M(\omega,t)$, is obtained by applying the region I+II constraints to the enhanced spectrum $\hat{X}(\omega,t)$ as follows:

    $$X_M(\omega,t)=\begin{cases}\hat{X}(\omega,t), & \text{if } \hat{X}(\omega,t)<2\,\hat{S}_x^{\omega mt}(\omega)\\ 0, & \text{else}\end{cases}$$
    (13)

    Finally, the enhanced speech signal can be obtained by applying the inverse FFT to $X_M(\omega,t)$.
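The per-frame arithmetic of Equations 9 and 11 to 13 can be sketched as below. The function and parameter names are hypothetical. The sketch assumes that `noise_power` holds the noise power spectral density estimate that appears in the denominator of Equation 12, and that `clean_reference` is the refined estimate from Equation 10, compared exactly as Equation 13 is written.

```python
def floor_subtract(noisy_power, noise_power, beta=0.002):
    """Equation 9: subtract the noise power, flooring at beta * noise power."""
    return [y - n if y > n else beta * n
            for y, n in zip(noisy_power, noise_power)]


def constrained_wiener_frame(noisy_mag, noise_power, prev_enhanced_mag,
                             prev_noise_power, clean_reference, alpha=0.98):
    """Equations 11 to 13 for one frame; every argument is a list over bins."""
    enhanced = []
    for y, pd, x_prev, pd_prev, ref in zip(
            noisy_mag, noise_power, prev_enhanced_mag,
            prev_noise_power, clean_reference):
        # Equation 12: decision-directed a priori SNR.
        snr_prio = (alpha * x_prev ** 2 / pd_prev
                    + (1.0 - alpha) * max(y ** 2 / pd - 1.0, 0.0))
        # Equation 11: Wiener gain applied to the noisy magnitude.
        x_hat = snr_prio / (1.0 + snr_prio) * y
        # Equation 13: keep the bin only if it is below twice the reference.
        enhanced.append(x_hat if x_hat < 2.0 * ref else 0.0)
    return enhanced
```

The output of one frame is fed back as `prev_enhanced_mag` for the next frame, which is how the constraint in Equation 13 also influences the a priori SNR of subsequent frames.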

The above estimator was applied to 20-ms duration frames of the noisy signal with 50% overlap between frames. The enhanced speech signal was reconstructed using the overlap-and-add method.
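A minimal sketch of the framing and overlap-and-add stage follows. It assumes 8-kHz input, so that a 20-ms frame holds 160 samples and a 50% overlap gives a hop of 80 samples, and it assumes a periodic Hann window, whose half-overlapped copies sum to one. The function names are hypothetical.

```python
import math


def hann_window(length):
    """Periodic Hann window; with 50% overlap its shifted copies sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]


def split_frames(signal, frame_len=160, hop=80):
    """Cut the signal into Hann-windowed frames with the given hop."""
    window = hann_window(frame_len)
    return [
        [w * s for w, s in zip(window, signal[start:start + frame_len])]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def overlap_add(frames, hop=80):
    """Sum the frames back at their original offsets."""
    frame_len = len(frames[0])
    output = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for index, frame in enumerate(frames):
        for n, value in enumerate(frame):
            output[index * hop + n] += value
    return output
```

With no processing between the two calls, every sample covered by two frames is reconstructed exactly; only the first and last half frames are attenuated by the window.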

4 Evaluation setup

The proposed SE algorithm was tested using a speech database that was corrupted by eight different real-world noises at different SNRs. The system was evaluated using both the composite evaluation measures proposed in [42] and the SNRLOSS measure proposed in [43].

4.1 Database description

For the evaluation of SE algorithms, NOIZEUS [44] is preferred since it is a noisy speech corpus recorded by the authors of [18] to facilitate comparison of SE algorithms among different research groups [20]. The noisy database contains 30 IEEE sentences [45], which were recorded in a sound-proof booth using Tucker Davis Technologies (TDT; Alachua, FL, USA) recording equipment. The sentences were produced by three male and three female speakers (five sentences/speaker). The IEEE database was used as it contains phonetically balanced sentences with relatively low word-context predictability. The 30 sentences were selected from the IEEE database so as to include all phonemes in the American English language. The sentences were originally sampled at 25 kHz and downsampled to 8 kHz.

To simulate the receiving frequency characteristics of telephone handsets, the intermediate reference system (IRS) filter used in ITU-T P.862 [46] for the perceptual evaluation of speech quality (PESQ) measure was independently applied to the clean and noise signals [17]. Then, a noise segment of the same length as the speech signal was randomly cut out of the noise recordings, appropriately scaled to reach the desired SNR level (−8, −5, −2, 0, 5, 10, or 15 dB), and finally added to the filtered clean speech signal. Noise signals were taken from the AURORA database [47] and included the following recordings from different places: train, babble (crowd of people), car, exhibition hall, restaurant, street, airport, and train station. Therefore, in total, there are 1,680 (30 sentences × 8 noises × 7 SNRs) noisy speech segments in the test set.
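The noise-mixing step can be illustrated with the sketch below. It assumes that the SNR is defined from the average power of the whole speech and noise segments, which the text does not specify, and it omits the IRS filtering. The function name and the `rng` parameter are hypothetical.

```python
import math
import random


def mix_at_snr(speech, noise_recording, snr_db, rng=random):
    """Cut a random noise segment, scale it to snr_db, and add it to speech.

    Assumption: SNR is the ratio of average powers over the whole segment.
    """
    start = rng.randrange(len(noise_recording) - len(speech) + 1)
    segment = noise_recording[start:start + len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in segment) / len(segment)
    # Scale so that speech_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, segment)]
```

Passing a seeded `random.Random` instance as `rng` makes the choice of noise segment reproducible across runs.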

4.2 Performance evaluation

The performance of an SE algorithm can be evaluated both subjectively and objectively. In general, a subjective listening test is the most accurate and preferable method for evaluating speech quality and intelligibility. However, it is time-consuming and costly. Recently, many researchers have devoted much effort to developing objective measures that predict subjective quality and intelligibility with high correlation with subjective listening tests [42],[43],[48],[49]. Among them, the composite objective measures [42] were shown to have high correlation with subjective ratings and, at the same time, to capture different characteristics of the distortions present in the enhanced signals [35], while the SNRLOSS measure [43] was found appropriate for predicting speech intelligibility in fluctuating noisy conditions, yielding a high correlation with sentence recognition. Therefore, the composite objective measures and the SNRLOSS measure were adopted to predict the performance of the proposed SE algorithm on subjective quality and speech intelligibility, respectively.

4.2.1 The composite measures to predict subjective speech quality

The composite objective measures are obtained by linearly combining existing objective measures that highly correlate with subjective ratings. The objective measures include segmental SNR (segSNR) [18], weighted-slope spectral distance (WSS) [50], PESQ [51], and log likelihood ratio (LLR) [18].

The three new composite measures obtained from multiple linear regression analysis are given below:

  • Csig: A five-point scale of signal distortion (SIG) formed by linearly combining the LLR, PESQ, and WSS measures (Table 1).

  • Cbak: A five-point scale of noise intrusiveness (BAK) formed by linearly combining the segSNR, PESQ, and WSS measures (Table 1).

  • Covl: The mean opinion score of overall quality (OVRL) formed by linearly combining the PESQ, LLR, and WSS measures.

Table 1 Scale of signal distortion, background intrusiveness and overall quality

The regression equations for the three composite measures are as follows:

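$$C_{sig}=3.093-1.029\cdot\mathrm{LLR}+0.603\cdot\mathrm{PESQ}-0.009\cdot\mathrm{WSS}$$
(14)
$$C_{bak}=1.634+0.478\cdot\mathrm{PESQ}-0.007\cdot\mathrm{WSS}+0.063\cdot\mathrm{segSNR}$$
(15)
$$C_{ovl}=1.594+0.805\cdot\mathrm{PESQ}-0.512\cdot\mathrm{LLR}-0.007\cdot\mathrm{WSS}$$
(16)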

The correlation coefficients between the three composite measures and real subjective measures are given in Table 2 [42]. For all three measures, higher values indicate better performance.
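As a worked illustration of Equations 14 to 16, the sketch below combines the four basic measures into the three composite scores. The function name is hypothetical, and no clipping of the results to a rating scale is applied, since the text does not describe any.

```python
def composite_measures(llr, pesq, wss, seg_snr):
    """Return (C_sig, C_bak, C_ovl) from Equations 14 to 16."""
    c_sig = 3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wss
    c_bak = 1.634 + 0.478 * pesq - 0.007 * wss + 0.063 * seg_snr
    c_ovl = 1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wss
    return c_sig, c_bak, c_ovl
```

For example, with LLR = 1, PESQ = 2, WSS = 50, and segSNR = 5 dB, the three scores come out at approximately 2.82, 2.56, and 2.34, respectively.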

Table 2 Correlation coefficients between the composite measures and subjective measure

4.2.2 The SNRLOSS measure to predict speech intelligibility

The SNR loss in band j and frame m is defined as follows [43]:

$$SL(j,m)=\mathrm{SNR}_X(j,m)-\mathrm{SNR}_{\hat{X}}(j,m),$$
(17)

where $\mathrm{SNR}_X(j,m)$ is the input SNR in band $j$, and $\mathrm{SNR}_{\hat{X}}(j,m)$ is the SNR of the enhanced signal in the $j$th frequency band at the $m$th frame.

Assuming the SNR range is restricted to $[-\mathrm{SNR}_{Lim},\mathrm{SNR}_{Lim}]$ dB ($\mathrm{SNR}_{Lim}=3$ in this paper), the $SL(j,m)$ term is then limited as follows:

$$\widehat{SL}(j,m)=\min\left(\max\left(SL(j,m),-\mathrm{SNR}_{Lim}\right),\mathrm{SNR}_{Lim}\right)$$
(18)

and subsequently mapped to the range of [0, 1] using the following equation:

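$$\mathrm{SNR}_{LOSS}(j,m)=\begin{cases}-\dfrac{C_-}{\mathrm{SNR}_{Lim}}\,\widehat{SL}(j,m), & \text{if } \widehat{SL}(j,m)<0\\[2mm] \dfrac{C_+}{\mathrm{SNR}_{Lim}}\,\widehat{SL}(j,m), & \text{if } \widehat{SL}(j,m)\ge 0\end{cases}$$
(19)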

where $C_-$ and $C_+$ are parameters (both fixed to 1 in this paper) controlling the slopes of the mapping function, which is defined over the range [0, 1]; the frame SNRLOSS is therefore normalized to the range $0\le\mathrm{SNR}_{LOSS}(j,m)\le 1$. The average SNRLOSS is finally computed by averaging over all frames in the signal as follows:

$$\overline{\mathrm{SNR}}_{LOSS}=\frac{1}{M}\sum_{m=0}^{M-1}f\mathrm{SNR}_{LOSS}(m),$$
(20)

where $M$ is the total number of data segments in the signal, and $f\mathrm{SNR}_{LOSS}(m)$ is the average (across bands) SNR loss computed as follows:

$$f\mathrm{SNR}_{LOSS}(m)=\frac{\sum_{j=1}^{K}W(j)\cdot\mathrm{SNR}_{LOSS}(j,m)}{\sum_{j=1}^{K}W(j)},$$
(21)

where $W(j)$ is the weight (i.e., band importance function [52]) placed on the $j$th frequency band and was taken from Table B.1 in the ANSI standard [52].

An implementation of the SNRLOSS measure is available on the website of the authors of [43]. The smaller the value of the SNRLOSS measure, the better the performance of the SE algorithm.
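For readers who want to trace Equations 17 to 21 numerically, the following sketch computes the measure from per-band, per-frame SNR values that are assumed to be already available in dB. It is not the reference implementation from [43]; the function names and the nested-list data layout are hypothetical.

```python
def snr_loss_band(snr_input_db, snr_enhanced_db,
                  snr_lim=3.0, c_minus=1.0, c_plus=1.0):
    """Equations 17 to 19 for a single band and frame."""
    loss = snr_input_db - snr_enhanced_db            # Equation 17
    limited = min(max(loss, -snr_lim), snr_lim)      # Equation 18
    slope = c_minus if limited < 0 else c_plus
    return slope * abs(limited) / snr_lim            # Equation 19


def average_snr_loss(snr_input_db, snr_enhanced_db, weights):
    """Equations 20 and 21; inputs are lists of frames of per-band SNRs."""
    weight_sum = sum(weights)
    frame_losses = []
    for in_frame, out_frame in zip(snr_input_db, snr_enhanced_db):
        weighted = sum(w * snr_loss_band(a, b)
                       for w, a, b in zip(weights, in_frame, out_frame))
        frame_losses.append(weighted / weight_sum)   # Equation 21
    return sum(frame_losses) / len(frame_losses)     # Equation 20
```

Because the limited loss is mapped through its absolute value, both a drop and a gain in band SNR move the score away from zero, within the ±3 dB limit used in this paper.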

5 Simulation results

The predicted subjective quality and intelligibility of the speech enhanced by our proposed SE algorithm are reported in this section. Three other SE schemes, namely, wavelet thresholding (WT) [16], KLT [9], and the Wiener algorithm with the clean signal present (Wiener_Clean) [19], were also evaluated to provide a comparative analysis of the proposed SE algorithm. The KLT algorithm was shown by subjective tests in [1] to perform well in terms of preserving speech intelligibility for normal-hearing listeners and in [22] to improve speech intelligibility significantly for cochlear implant users in the recognition of sentences corrupted by stationary noises. The Wiener_Clean algorithm was taken as the ground truth in this paper because it uses the clean signal. The unprocessed noisy signal (UP) was also evaluated by the SNRLOSS measure for comparison purposes. The implementations of these three schemes were taken from [18].

5.1 Performance of predicting subjective quality

5.1.1 Performance average over all eight kinds of noise

In Figure 2, the proposed algorithm is compared with the WT and KLT algorithms in terms of the composite measures averaged over all eight noises for seven SNRs. The four objective measures (LLR, segSNR, WSS, and PESQ) that make up the composite measures are also given in the first row for reference. The Wiener_Clean algorithm, as the ground truth, performed the best for all four objective evaluation measures. According to [42], the LLR measure performed the best in terms of predicting signal distortion, while the PESQ measure gave the best prediction for both noise intrusiveness and overall speech quality. From the first row of Figure 2, we can see that our proposed algorithm gives better performance than both WT and KLT in terms of the LLR measure for all seven SNRs tested. Moreover, when the SNR is smaller than 5 dB, our proposed algorithm also performed better than both WT and KLT for the PESQ measure.

Figure 2 Averaged over seven SNRs. The composite measure comparisons for four SE schemes (WT, KLT, proposed, and Wiener_Clean) averaged over seven SNRs (−8, −5, −2, 0, 5, 10, and 15 dB) for eight kinds of noise.

The second row of Figure 2 shows the composite measures Csig, Cbak, and Covl, which are computed by combining the four objective measures shown in the first row. In terms of both signal distortion Csig and overall quality Covl, our proposed method performs the best when the SNR is less than 10 dB. Specifically, for the overall quality measure Covl, the proposed algorithm improved by 10.94%, 18.94%, 21.63%, 23.66%, and 6.67% at −8, −5, −2, 0, and 5 dB, respectively, when compared with the KLT method. Overall, the proposed algorithm achieved 13.88% and 6.40% improvement for Csig and Covl, respectively, when averaged over all seven tested SNR levels. However, for Cbak, the WT and KLT algorithms give similar results, both better than our proposed one, when the SNR is no smaller than 0 dB. Their improvement over the proposed algorithm was 0.98%, 6.98%, 11.11%, and 16.55% at 0, 5, 10, and 15 dB, respectively. On average, the WT and KLT methods were 5.14% better than our proposed algorithm in terms of background intrusiveness Cbak.

5.1.2 Performance average over seven SNRs

Figure 3 shows the three composite measures averaged over seven SNRs for eight kinds of noise, computed for the WT, KLT, Wiener_Clean, and proposed SE algorithms. The Wiener_Clean algorithm again serves as the ground truth. From Figure 3, it is clear that in terms of Csig, KLT works much better than WT. Hence, the proposed algorithm is compared with only the KLT method here. We observe that on average, the proposed algorithm is better than the KLT method in terms of Csig for train (9.19%), babble (15.93%), car (14.51%), exhibition hall (7.74%), restaurant (16.74%), street (13.42%), airport (16.64%), and train station (16.23%) noises. The number in brackets is the percentage by which the Csig of our proposed algorithm exceeds that of the KLT method. The mean Csig of our proposed SE algorithm over all eight noise types is 13.88% better than that of the KLT method. Furthermore, the proposed SE algorithm outperforms KLT in terms of Covl by an average of 6.40% over all eight kinds of noise considered. However, in terms of background intrusiveness Cbak, the KLT algorithm gives results that are 5.14% better on average than those of our proposed algorithm.

Figure 3 Averaged over eight kinds of noise. The composite measure comparisons for four SE schemes (WT, KLT, proposed, and Wiener_Clean) averaged over eight kinds of noise for seven SNRs (−8, −5, −2, 0, 5, 10, and 15 dB).

Thus, the proposed SE algorithm was predicted to achieve the best overall subjective quality for most SNRs and all noise types considered when compared with the WT and KLT algorithms.

5.2 Performance of predicting speech intelligibility

The SNRLOSS measure values obtained from each algorithm (including UP) were subjected to statistical analysis in order to assess the significance of their differences. Analysis of variance (ANOVA) found a highly significant effect (p<0.005) at all SNR levels and for all types of noise. Following the ANOVA, multiple comparison tests according to Tukey's HSD test were done to assess the significance of the differences between algorithms. A difference was deemed significant if the p value was smaller than 0.05.

Table 3 gives the statistical comparisons of the SNRLOSS measure between unprocessed noisy sentences (UP) and sentences enhanced by the four SE algorithms (WT, KLT, Wiener_Clean, and proposed). Comparisons between our proposed SE algorithm and the other three algorithms are also given. From Table 3, we can see that when compared with UP, our proposed algorithm was predicted by the SNRLOSS measure to improve intelligibility at low SNRs for most noises tested (italicized). The R in the table gives the percentage by which our algorithm is better than the others; the value is negative because better performance gives a smaller SNRLOSS measure. Furthermore, our proposed SE algorithm was also compared with the WT and KLT algorithms and was shown to provide better performance for most conditions tested.

Table 3 Statistical comparisons of the SNRLOSS measure between unprocessed sentences and enhanced sentences by SE algorithms

6 Conclusions

The main contribution of this paper is the introduction of a new SE algorithm based on imposing constraints on the Wiener gain function. Experiments were done on the NOIZEUS database for eight kinds of noise (from the AURORA database) across seven different SNRs ranging from −8 to 15 dB. The Wiener_Clean algorithm was taken as the ground truth. The performance of our proposed algorithm was compared with the WT and KLT methods. The results were analyzed mainly by three composite measures and the SNRLOSS measure to predict the performance on subjective quality and speech intelligibility, respectively. Through extensive experiments, we showed that when averaged over all eight kinds of noise, our proposed SE algorithm achieved the best results in terms of predicted signal distortion Csig and overall quality Covl when the SNR is below 10 dB. Furthermore, we investigated the individual performance on each noise type. Our proposed SE algorithm outperformed the KLT algorithm for all noise types tested in terms of both Csig and Covl. In addition, the SNRLOSS measure comparisons with both UP and the other SE algorithms predicted that our proposed algorithm was able to improve speech intelligibility at low SNR levels and outperform the WT and KLT algorithms for most conditions examined.

It is important to point out that the three composite measures and the SNRLOSS measure used in this paper were adopted for predicting the subjective quality and intelligibility of noisy speech enhanced by noise suppression algorithms because of their high correlation with real subjective tests [42],[43]. Further subjective tests with both normal-hearing and hearing-impaired listeners are needed to verify the effectiveness of the proposed algorithm in improving both subjective quality and speech intelligibility. It is also worth mentioning that, depending on the nature of the application, some practical SE systems may require very high quality speech but can tolerate a certain amount of noise, while other systems may want speech as clean as possible even with some degree of speech distortion. Therefore, different SE algorithms should be chosen to meet the requirements of different applications.