1 Introduction

Speech enhancement plays an important role in the performance of speech processing applications under real-time conditions. Speech data corrupted by various types of noise has an adverse effect on the performance of the system. Therefore, it is essential to reduce the various types of noise in degraded speech data using noise reduction techniques [1]. A technique based on optimal smoothing and minimum statistics (OSMS) was proposed to estimate the power spectral density (PSD) of non-stationary noise in degraded speech data [2, 3]. The algorithm was able to estimate the noise PSD in degraded speech data without using voice activity detection (VAD). Instead, the author tracked spectral minima in every frequency band without distinguishing between speech activity and speech pauses. The optimal smoothing parameter was derived by minimizing the mean square error (MSE) in each step of the recursive smoothing of the PSD of the degraded speech data. The method was combined with a noise reduction algorithm for speech enhancement, and the results revealed that the proposed technique outperformed the existing estimators.

Noise estimation was demonstrated in [4] through a method called minima controlled recursive averaging (MCRA). The noise was approximated by averaging past power spectral values of the degraded speech data. The ratio of the local energy of the noisy speech to its minimum within a stipulated time window indicated the presence of speech in the subbands. The author concluded that the noise approximation is computationally efficient and robust across signal-to-noise ratio (SNR) conditions. An improved MCRA (IMCRA) for noise approximation under different conditions was demonstrated by Cohen [5]. The noise was approximated by averaging past power spectral values using a smoothing parameter conditioned on the signal presence probability. The author claimed that the IMCRA algorithm worked better under non-stationary noisy environments and different SNR conditions.

A noise reduction algorithm was developed in [6] for speech signals degraded by short-time stationary noises and electrical disturbances. Spectral amplitude approximation was used for the enhancement of the corrupted speech data. No speech pause detectors were required for the developed noise reduction algorithm, the author claimed. The experimental results showed an improvement in noise reduction in the degraded speech data compared to existing algorithms. Noise estimation and the characteristics of noise were explained in [7]. To estimate the noise, past noisy segments were considered, and no external speech pause detectors were used for the detection of pauses in speech. The implemented algorithm can also be integrated with non-linear spectral subtraction (SS) algorithms, the authors claimed. The results revealed that the proposed algorithm outperformed the existing speech enhancement algorithms. In [8], time and frequency domain representations of corrupted speech data were used for estimating the noise. VAD was deployed to identify the active regions of the speech signal. An attenuation procedure was applied to speech and non-speech activity regions to obtain the enhanced speech data. The proposed technique had low computational complexity, making it feasible for hearing aid applications. The experimental results were compared with two speech enhancement techniques, and the proposed method performed better, the authors claimed.

In conclusion, there have been significant advancements in the area of speech enhancement under various degraded conditions. However, research specifically focused on noise estimation under highly non-stationary noisy conditions is limited [3,4,5,6,7,8,9,10, 19,20,21, 24]. In the process of noise estimation, understanding the complexities and suppressing the various types of noise under highly non-stationary noisy scenarios would be a unique contribution. Existing works have covered noise estimation techniques under different SNR conditions, but there is a gap concerning the elimination of noise under sudden variations in noise level. The main objective of the proposed work is to fill this gap by proposing a robust noise estimation technique for speech enhancement under highly non-stationary noisy scenarios. The major contributions of the proposed work are as follows:

  • Development of a robust technique for noise estimation in highly non-stationary noisy conditions.

  • Estimation of noise in each segment using time-frequency dependent smoothing factors which are calculated using Bayesian probability of speech presence.

  • Computation of the noise estimate without using VAD.

  • Performance evaluation of the proposed method with existing techniques in terms of speech quality and intelligibility after speech enhancement.

The rest of the article is organized as follows: Section 2 describes the related work. Section 3 presents the implementation of the proposed OSMC technique for speech enhancement. The experimental setup and results analysis of the proposed and existing noise estimation techniques are described in Section 4. Section 5 presents the conclusions.

2 Related work

A mono-channel speech enhancement algorithm using the Wiener filter was proposed in [21]. The algorithm estimates the noise using a first-order recursive equation and uses a smoothing parameter to update the noise in every frame. The experiments were conducted on speech sentences degraded by various types of noise. The proposed algorithm outperformed the conventional SS technique in estimating the noise in corrupted speech data, as demonstrated by objective speech quality measures. In [22], a technique was proposed for enhancing the short-time spectral amplitude (STSA) in the frequency domain. The Weibull distribution was used to model the DFT magnitudes of the clean speech signal under a Gaussian noise scenario. A decision-directed approach with a smoothing factor was used to approximate the a priori SNR. The a priori probability of speech absence was set to 0.2. The experiments were conducted on the TIMIT speech database, with the speech sentences degraded by various additive non-stationary noises. The results demonstrated the efficacy of the proposed technique over existing algorithms in terms of quality and intelligibility of speech.

An algorithm was proposed for noise power spectral density (PSD) estimation using a high-pass filter derivative under noisy conditions [23]. The authors used a spectral-flatness adaptive thresholding method for the detection of speech activity in the corrupted speech frames. The NOIZEUS database was used for the subjective and objective evaluation of the proposed noise PSD estimation algorithm and its comparison with competing methods. The results revealed that the proposed technique performed well for various degraded conditions compared to the competing algorithms. The work in [26] presented a real-time mono-channel speech enhancement technique for suppressing stationary and transient noise. Quantile noise estimation was employed for stationary noise suppression, while transient noise suppression was based on the normalized variance and gravity center of the signal. The results revealed that the proposed algorithm outperformed the existing techniques under high SNR conditions.

The authors of [24] proposed an optimized SS method for mono-channel speech enhancement. The method performed noise reduction through SS based on minimum statistics. The results revealed that the proposed method outperformed the competing techniques in terms of speech quality. In [20], a modified version of minimum-statistics noise estimation (MSNE) was implemented for speech enhancement. The noise was estimated by considering the minimum statistics of the degraded speech data. The experimental results showed that the proposed estimator outperforms the competing techniques under moderately degraded conditions but performs worse under highly non-stationary noisy conditions.

An algorithm was proposed for degraded speech enhancement by combining SS-VAD and an MMSE spectrum power estimator based on zero crossing (SPZC) [9]. Babble and musical noise suppression in degraded speech data was achieved by exploiting the efficacy of SS-VAD and MMSE-SPZC. The TIMIT and Kannada speech databases were used for the experimentation. The PESQ and composite measures were considered to evaluate the performance of the proposed technique against existing methods. The experimental results show that the amalgamation of the SS-VAD and MMSE-SPZC estimators gave better improvements in the quality and intelligibility of the enhanced speech compared to the individual techniques. In [25], the authors proposed a speech enhancement approach that optimizes subspace partitioning using a modified version of accelerated particle swarm optimization. The degraded speech data was divided into noise-only, speech-plus-noise, and speech-only components. VAD [29,30,31] was used for the detection of voice activity. The results showed that the proposed method performed well under different corrupted conditions compared to existing methods. The authors also claimed an improvement with less speech distortion.

Enhancements to the Kannada automatic speech recognition (ASR) system were described in [10]. Initially, the Kannada ASR system was developed in [11] for noisy Kannada speech data. Due to various types of degradation, the Kannada ASR system yielded low speech recognition accuracies. Therefore, the authors in [10] developed a robust noise reduction technique by combining the SS-VAD and MMSE-SPZC estimator [9]. The developed noise reduction technique was integrated with an interactive voice response system as a front end. Further, to improve the speech recognition accuracy, deep neural network and subspace Gaussian mixture modeling techniques were used. The experiments were conducted on speech sentences taken from the TIMIT and Kannada speech databases. The results demonstrated a consistent improvement in speech recognition accuracy using the enhanced speech data compared to the earlier Kannada ASR system [11]. A system was developed in [19] for robust speech encoding under degraded conditions. The authors combined SS-VAD with linear predictive coding (LPC) for encoding the noisy speech data. The experimental results revealed that the combined SS-VAD and LPC method gave better performance in terms of SNR and compression ratios. The authors also mentioned that the proposed method works better at higher SNRs and is inefficient under medium and negative SNR conditions.

A spatial approach to SS for speech enhancement was proposed in [27]. The drawbacks of SS were addressed with experimental demonstrations on various types of noise. The experimental results revealed that the proposed method gave better results in terms of speech quality and intelligibility compared to existing conventional SS algorithms. In [28], the authors used the speech enhancement algorithm of [27] to improve speech recognition accuracy under real-time conditions. The noise elimination algorithm was placed before the feature extraction stage in the spoken query system. The experiments revealed a reduction in word error rate for the offline ASR models, and the speech recognition accuracy improved during online testing.

In conclusion, the work related to noise estimation has demonstrated significant progress under different noisy conditions. Nevertheless, one crucial aspect that needs consideration is the estimation of noise under highly non-stationary noisy scenarios. Estimating noise under an instantaneous increase in noise level can effectively improve the efficacy of other speech processing applications, such as ASR, speech encoding, speaker verification, speaker identification, and speaker recognition, under noisy conditions.

3 Implementation of proposed OSMC technique for speech enhancement

Consider a sampled degraded speech signal c(i) obtained when a noise model b(i) is added to a clean speech signal a(i), where i is the sampling time index. It is assumed that a(i) and b(i) are statistically independent and have zero mean.

$$\begin{aligned} c(i) = a(i) + b(i) \end{aligned}$$
(1)

The degraded sampled speech signal c(i) is transformed into the frequency domain by applying a window h(i) to a frame of L consecutive samples of c(i). The fast Fourier transform (FFT) analysis of the sliding window yields a set of frequency domain samples, which can be written mathematically as

$$\begin{aligned} C(\lambda , k)=\sum _{\mu =0}^{L-1} c(\lambda S+\mu )\, h(\mu )\, \textrm{e}^{-\textrm{j} 2 \pi k \mu / L} \end{aligned}$$
(2)

where \(\lambda \) is the subsampled time index, S is the frame shift, and k is the frequency bin index with \(k\in \{0,1,\ldots ,L-1\}\), corresponding to the normalized center frequency \(\Omega _k=2\pi k/L\). The probability density function (PDF) of \(\mid C(\lambda ,k)\mid ^{2}\) can be represented as follows:

$$\begin{aligned} f_{\mid C(\lambda , k)\mid ^{2}}(x)=\frac{U(x)}{\sigma _{B}^{2}(\lambda , k)+\sigma _{A}^{2}(\lambda , k)} e^{-x /\left( \sigma _{B}^{2}(\lambda , k)+\sigma _{A}^{2}(\lambda , k)\right) } \end{aligned}$$
(3)

where \(\sigma _A^2(\lambda ,k) = E\{\mid A(\lambda ,k)\mid ^2\}\) and \(\sigma _B^2(\lambda ,k) = E\{\mid B(\lambda ,k)\mid ^2\}\) are the PSDs of the clean speech signal and the noise model, respectively, and U(x) denotes the unit step function. A first-order recursive equation is used to compute the smoothed power spectrum of the degraded speech data. It can be written as follows:

$$\begin{aligned} \textrm{Z}(\lambda , k)=\beta \textrm{Z}(\lambda -1, \textrm{k})+(1-\beta )\mid \textrm{C}(\lambda , \textrm{k})\mid ^{2} \end{aligned}$$
(4)

Tracking the minima of degraded speech data was explained by Martin in [3]. The term \(\beta \) in (4) can be computed as

$$\begin{aligned} \beta =\left( T_{\textrm{SM}} f_{s} / S-1\right) /\left( T_{\textrm{SM}} f_{s} / S+1\right) \end{aligned}$$
(5)

where \(T_{SM}\) is the smoothing window length of 0.2 s, S = 128 is the frame shift, and the sampling frequency is \(f_{s}\) = 8 kHz; substituting these values gives \(\beta \approx 0.85\). The tracking method depends mainly on the length of the minimum search window. A nonlinear method [6] is used in the proposed approach for tracking the minimum of the degraded speech data by continuously averaging past spectral values.

$$\begin{aligned} Z_{\min }(\lambda , k)=\left\{ \begin{array}{ll} \gamma Z_{\min }(\lambda -1, k) + \frac{1-\gamma }{1-\xi }\left( Z(\lambda , k)-\xi Z(\lambda -1, k)\right) ; &{} \text{ if } Z_{\min }(\lambda -1, k)<Z(\lambda , k) \\ Z(\lambda , k) ; &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(6)

where \(Z_{\min }(\lambda ,k)\) is the local minimum of the corrupted speech power spectrum, and \(\gamma = 0.999\) and \(\xi = 0.8\) are constants set during experimentation. The factor \(\xi \) controls the adaptation time of the minima. To find the speech activity in each frequency bin, the ratio of the degraded speech spectrum to its local minimum is considered, which can be written as follows:

$$\begin{aligned} T_{r}(\lambda , k)=\frac{Z(\lambda , k)}{{Z_{\min }}(\lambda , k)} \end{aligned}$$
(7)
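The smoothing, minima-tracking, and ratio steps in (4)–(7) can be sketched per frame as follows. This is a minimal illustration using the stated parameter values (\(T_{SM}\) = 0.2 s, S = 128, \(f_s\) = 8 kHz, \(\gamma \) = 0.999, \(\xi \) = 0.8); the function names and interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Parameters from (5) and the text: beta evaluates to about 0.852.
T_SM, S, f_s = 0.2, 128, 8000.0
beta = (T_SM * f_s / S - 1) / (T_SM * f_s / S + 1)   # (5)
gamma, xi = 0.999, 0.8

def track_frame(C_mag2, Z_prev, Zmin_prev):
    """One frame update: smoothed PSD (4), local minimum (6), ratio (7).

    C_mag2:    |C(lambda, k)|^2, power spectrum of the current noisy frame
    Z_prev:    Z(lambda-1, k), previous smoothed power spectrum
    Zmin_prev: Z_min(lambda-1, k), previous local minimum
    """
    Z = beta * Z_prev + (1 - beta) * C_mag2                     # (4)
    # (6): nonlinear minima tracking by continuous spectral averaging
    Zmin = np.where(
        Zmin_prev < Z,
        gamma * Zmin_prev + (1 - gamma) / (1 - xi) * (Z - xi * Z_prev),
        Z,
    )
    Tr = Z / np.maximum(Zmin, 1e-12)                            # (7)
    return Z, Zmin, Tr
```

Because \(\gamma \) is close to 1, the tracked minimum rises only slowly during speech activity, while the ratio \(T_r\) grows whenever the smoothed spectrum moves above the minimum.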

A Bayes minimum-cost decision rule related to speech activity detection is given by

$$\begin{aligned} \frac{p\left( T_{r} \mid H_{1}\right) }{p\left( T_{r} \mid H_{0}\right) } \lessgtr _{H_{0}^{\prime }}^{H_{1}^{\prime }} \frac{c_{10} P\left( H_{0}\right) }{c_{01} P\left( H_{1}\right) } \end{aligned}$$
(8)

where \(P(H_{0})\) and \(P(H_{1})\) are the a priori probabilities of speech absence and presence, respectively, and \(c_{ij}\) is the cost of deciding \(H_i^{\prime }\) when \(H_j\) is true. Since the likelihood ratio \(\frac{p\left( T_{r} \mid H_{1}\right) }{p\left( T_{r} \mid H_{0}\right) }\) is a monotonic function of \(T_r\), the decision rule in (8) can be simplified as follows:

$$\begin{aligned} T_{r}(\lambda , k) \lessgtr _{H_{0}^{\prime }}^{H_{1}^{\prime }} \theta (k) \end{aligned}$$
(9)

The value of \(T_{r}(\lambda , k)\) is compared with a threshold: if it exceeds the threshold, there is speech activity in that particular frequency bin; otherwise, speech activity is absent. This relies on the observation that the corrupted speech power spectrum stays close to its local minimum when there is no speech activity. The speech activity decision is made as follows:

$$\begin{aligned} D(\lambda , k)= {\left\{ \begin{array}{ll}1 ; &{} \text{ Presence } \text{ of } \text{ speech } \text{ if } {\text {Tr}}(\lambda , k)>\theta (k) \\ 0 ; &{} \text{ Otherwise } \end{array}\right. } \end{aligned}$$
(10)

where \(\theta (k)\) is a frequency-dependent threshold determined experimentally. It is set as

$$\begin{aligned} \theta (\textrm{k})=\left\{ \begin{array}{rr} 2 &{} 1 \le k \le L F \\ 2 &{} L F<k \le M F \\ 5 &{} M F<k \le f_s / 2 \end{array}\right. \end{aligned}$$
(11)

where LF and MF are the frequency bins corresponding to 1 kHz and 3 kHz, respectively. In [4], the value of \(\theta (k)\) was fixed for all frequencies. From the above rule, the speech presence decision \(D(\lambda ,k)\) is smoothed using a first-order recursion to update the speech presence probability:

$$\begin{aligned} z(\lambda , k)=\mu _{z} z(\lambda -1, k)+\left( 1-\mu _{z}\right) D(\lambda , k) \end{aligned}$$
(12)

where \(\mu _z\) is a smoothing constant. The recursion exploits the correlation of speech presence across adjacent frames or segments. Using the speech presence probability estimate, the time-frequency dependent smoothing factor is calculated as:

$$\begin{aligned} \mu _{\textrm{s}}(\lambda , k) \triangleq \mu _{\textrm{d}}+\left( 1-\mu _{\textrm{d}}\right) \textrm{z}(\lambda , k) \end{aligned}$$
(13)

where \(\mu _{\textrm{d}}\) is a constant; the values of \(\mu _{\textrm{d}}\) and \(\mu _z\) are 0.85 and 0.25, respectively. The value of \(\mu _{\textrm{s}}(\lambda , k)\) always lies in the range \(\mu _{\textrm{d}} \le \mu _{\textrm{s}}(\lambda , k) \le 1\). The noise spectrum estimate is updated as follows:

$$\begin{aligned} \textit{E}(\lambda , k)=\mu _{\textrm{s}}(\lambda , k) \textrm{E}(\lambda -1, k)+\left( 1-\mu _{\textrm{s}}(\lambda , k)\right) \mid C(\lambda , k)\mid ^{2} \end{aligned}$$
(14)

where \({E(\lambda ,k)}\) is the noise power spectrum estimate. The proposed algorithm is summarized as follows: once the classification into speech activity/non-speech activity is completed, the speech presence/absence probability is updated. Using this probability, the time-frequency dependent smoothing factor is computed. Finally, the noise spectrum estimate is updated using the time-frequency dependent smoothing factor. The proposed noise estimator is used in a speech enhancement technique to reconstruct the speech signal: negative values are floored and conjugate symmetry is ensured for a real-valued reconstruction. The phase information is applied to the whole FFT frame, which is converted back to the time domain using the inverse fast Fourier transform. The proposed method is represented as a flowchart and pseudo code in Fig. 1 and Algorithm 1, respectively.
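The per-frame update chain in (10)–(14) can be sketched as below, a minimal illustration assuming the stated constants \(\mu _d\) = 0.85 and \(\mu _z\) = 0.25; the function name and interface are hypothetical, not taken from the authors' code.

```python
import numpy as np

mu_d, mu_z = 0.85, 0.25  # smoothing constants stated in the text

def update_noise(Tr, theta, z_prev, E_prev, C_mag2):
    """One frame of the proposed noise PSD update.

    Tr:     ratio Z/Z_min from (7)
    theta:  frequency-dependent threshold from (11)
    z_prev: previous speech presence probability z(lambda-1, k)
    E_prev: previous noise PSD estimate E(lambda-1, k)
    C_mag2: |C(lambda, k)|^2 of the current noisy frame
    """
    D = (Tr > theta).astype(float)            # (10) speech presence decision
    z = mu_z * z_prev + (1 - mu_z) * D        # (12) presence probability
    mu_s = mu_d + (1 - mu_d) * z              # (13) smoothing factor in [mu_d, 1]
    E = mu_s * E_prev + (1 - mu_s) * C_mag2   # (14) noise PSD update
    return z, E
```

When speech is detected in a bin, \(\mu _s\) approaches 1 and the noise estimate freezes; in non-speech bins, \(\mu _s\) stays near \(\mu _d\) and the estimate tracks the current spectrum, which is what allows the estimator to follow a suddenly rising noise floor.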

Fig. 1
figure 1

Flowchart of the proposed algorithm

Algorithm 1
figure a

A pseudo code of the proposed algorithm.

4 Experimental results and analysis

Initially, the two performance metrics for the assessment of speech quality and intelligibility are discussed. Later, the integration procedure of the proposed noise estimation technique with a speech enhancement algorithm is explained. Further, the performance evaluation of the proposed method with existing noise estimation techniques in terms of speech quality and intelligibility after speech enhancement is demonstrated.

4.1 Performance metrics

The evaluation of the proposed and existing techniques in terms of speech quality and intelligibility involves two performance measures, namely segmental SNR (SSNR) and the normalized covariance metric (NCM) [12, 14,15,16,17]. The SSNR is a widely used objective metric for assessing speech quality. The NCM is formally characterized as a statistical measure of the covariance between the input and output envelope signals [13]. The NCM is determined by passing the stimuli through a band-pass filter bank that segments the signal into K bands across its bandwidth. The Hilbert transform is employed to compute the envelope of every band, which is subsequently down-sampled to a rate of 25 Hz. Hence, the range of envelope modulation frequencies is restricted to 0-12.5 Hz. Let \(x_i(t)\) represent the down-sampled envelope in the \(i^{th}\) frequency band of the clean signal, while \(y_i(t)\) denotes the down-sampled envelope of the processed signal. The normalized covariance for the \(i^{th}\) frequency band can be written as follows:

$$\begin{aligned} r_{i}=\frac{\sum _{t}\left( x_{i}(t)-\mu _{i}\right) \left( y_{i}(t)-\nu _{i}\right) }{\sqrt{\sum _{t}\left( x_{i}(t)-\mu _{i}\right) ^{2}} \sqrt{\sum _{t}\left( y_{i}(t)-\nu _{i}\right) ^{2}}} \end{aligned}$$
(15)
Fig. 2
figure 2

The representation of input speech sentence from NOIZEUS database, output waveforms and spectrograms of: (a). Clean speech, (b). Noisy speech (babble noise at 5 dB SNR), Enhanced speech signal using (c). OSMS [3], (d). MCRA [4], (e). IMCRA [5], (f). Spectral minima tracking [6], (g). Weighted spectral averaging [7], (h). Connected time-frequency [8], (i). SS-VAD and MMSE-SPZC [9], (j). SS-VAD [19], (k). MSNE [20] and (l). Proposed algorithm (OSMC).

where \(\mu _{i}\) and \(\nu _{i}\) are the mean values of the clean and processed envelope signals, respectively. The SNR in each band is written as follows:

$$\begin{aligned} \textrm{SNR}_{i}=10 \log _{10}\left( \frac{r_{i}^{2}}{1-r_{i}^{2}}\right) \end{aligned}$$
(16)
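The band-wise covariance and SNR computations in (15) and (16) can be sketched as below; the envelope arrays and function name are illustrative assumptions.

```python
import numpy as np

def band_snr(x_env, y_env):
    """Covariance-based band SNR from (15)-(16).

    x_env, y_env: down-sampled Hilbert envelopes of the clean and
    processed signals in one frequency band.
    """
    xc = x_env - x_env.mean()
    yc = y_env - y_env.mean()
    # (15) normalized covariance r_i between the two envelopes
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    # (16) band SNR in dB
    return 10.0 * np.log10(r ** 2 / (1.0 - r ** 2))
```

Highly correlated envelopes drive \(r_i\) toward 1 and the band SNR toward large positive values; uncorrelated envelopes yield large negative values, which (17) then maps into a transmission index.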

The transmission index (TI) for the \(i^{th}\) band is calculated using the following equation:

$$\begin{aligned} \textrm{TI}_{i}=\frac{\textrm{SNR}_{i}+15}{30} \end{aligned}$$
(17)

The NCM index is generated by averaging the TI values across all frequency bands:

$$\begin{aligned} \textrm{NCM}=\frac{\sum _{i=1}^{K} W_{i} \times \textrm{TI}_{i}}{\sum _{i=1}^{K} W_{i}} \end{aligned}$$
(18)

where \(W_i\) are the weights applied to each of the K bands. The proposed OSMC noise estimation technique is integrated with the Wiener speech enhancement technique [16] using the following gain function:

$$\begin{aligned} G(\lambda , k)= & {} \frac{A(\lambda , k)}{A(\lambda , k)+E(\lambda , k) \mu _{k}} \end{aligned}$$
(19)
$$\begin{aligned} A(\lambda , k)= & {} \max \left\{ \Vert C(\lambda , k)\Vert ^{2}-E(\lambda , k), v E(\lambda , k)\right\} \end{aligned}$$
(20)

where v = 0.001. The value of \(\mu _{k}\) in (19) is computed from the posterior SSNR [16].
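The gain computation in (19)–(20) applied to one noisy frame can be sketched as follows. This is a minimal sketch: \(\mu _k\) is taken here as a fixed scalar for illustration, whereas the text derives it from the posterior SSNR; names are hypothetical.

```python
import numpy as np

v = 0.001  # spectral floor from the text

def wiener_gain(C_mag2, E, mu_k=1.0):
    """Gain G(lambda, k) from (19)-(20).

    C_mag2: |C(lambda, k)|^2 of the noisy frame
    E:      noise PSD estimate E(lambda, k) from (14)
    """
    A = np.maximum(C_mag2 - E, v * E)   # (20) floored speech PSD estimate
    return A / (A + E * mu_k)           # (19) gain, always in (0, 1)

def enhance_frame(C, E, mu_k=1.0):
    """Apply the gain to the complex spectrum C, retaining the noisy phase;
    the inverse FFT of the result gives the enhanced time-domain frame."""
    G = wiener_gain(np.abs(C) ** 2, E, mu_k)
    return G * C
```

The floor \(v E(\lambda ,k)\) prevents negative PSD estimates, so bins dominated by noise are attenuated strongly but never driven to exactly zero, which reduces musical noise artifacts.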

4.2 A comparative analysis of proposed noise estimation technique with existing algorithms

The NOIZEUS [18] and Kannada speech databases are considered for the experimentation. The clean speech data is degraded by various types of noise, viz., babble, restaurant, and street noises, at 5 dB and 10 dB SNR levels. The Kannada speech sentences are recorded in controlled/uncontrolled environments and subsequently degraded by the same noises at the same SNR levels. From the analysis perspective, the proposed noise estimation technique is compared with the existing algorithms as follows. The main drawback of [3,4,5] is that the noise estimation incurs a computational delay and gives poor noise reduction for speech data degraded by car, babble, street, and train noises. The algorithms in [6, 7] fail to distinguish between an increase in the noise floor and an increase in speech power. The algorithms in [8,9,10, 19,20,21, 24] fail to update the noise estimate when the noise floor increases suddenly and stays at a particular level. In comparison with the existing signal processing-based noise estimation techniques [3,4,5,6,7,8,9,10, 19,20,21, 24], the noise spectrum estimated by the proposed method remains unaffected by a small search window. The proposed method uses a time-frequency dependent threshold to compute the speech presence/absence probability. In addition, the proposed algorithm works better when the noise floor increases instantaneously and stays at a level.

Table 1 Performance assessment of the proposed and existing methods in terms of SSNR for NOIZEUS and Kannada speech databases
Table 2 Performance evaluation of the proposed and existing methods in terms of NCM for NOIZEUS and Kannada speech databases

The input-output waveforms and corresponding spectrograms of the proposed technique and the existing algorithms are shown in Fig. 2. From the figure, it can be inferred that the proposed method achieves a significant amount of noise suppression in the enhanced speech data compared to the competing algorithms. Tables 1 and 2 show the performance evaluation of the proposed noise estimation technique against the existing algorithms in terms of SSNR and NCM for the NOIZEUS and Kannada speech databases, respectively. From the tables, it can be observed that the proposed noise reduction algorithm outperformed the competing techniques in terms of the average SSNR and NCM values. It is also observed that there is a consistent improvement in the quality and intelligibility of the speech data after enhancement across various types of noise at 5 dB and 10 dB SNR levels.

5 Conclusions

In this work, an algorithm was proposed for degraded speech enhancement by OSMC through iterative averaging under highly non-stationary noisy conditions. Unlike state-of-the-art signal processing-based noise estimation techniques, the proposed algorithm updates the noise estimate continuously under highly non-stationary noisy scenarios without using VAD. The experiments were conducted on the NOIZEUS and Kannada speech databases for different types of noise and SNR levels. The SSNR and NCM metrics were used for the performance evaluation of the proposed and existing algorithms in terms of speech quality and intelligibility. The results revealed that the proposed algorithm gives consistent improvements over the existing algorithms in terms of SSNR and NCM values for different types of noise and SNR levels. In the future, it would be interesting to investigate whether the proposed technique can deliver better speech quality and intelligibility under negative SNR and heavily noisy conditions. A further direction is to integrate the proposed noise elimination algorithm with an end-to-end ASR system to improve speech recognition accuracy under real-time conditions.