1 Introduction

Speech enhancement plays an important role in the performance of speech processing applications under real-time conditions. Speech data corrupted by various types of noise has an adverse effect on the performance of the system. Therefore, it is essential to reduce the various types of noise in degraded speech data using noise reduction techniques [1]. A technique based on optimal smoothing and minimum statistics (OSMS) was proposed to estimate the power spectral density (PSD) of non-stationary noise in degraded speech data [2, 3]. The algorithm was able to estimate the noise PSD in degraded speech data without using voice activity detection (VAD). Instead, the author tracked spectral minima in every frequency band without distinguishing between speech activity and speech pauses. The optimal smoothing parameter was derived by minimizing the mean square error (MSE) in each step of the recursive smoothing of the PSD of the degraded speech data. The method was combined with a noise reduction algorithm for speech enhancement, and the results revealed that the proposed technique outperformed the existing estimators.

Noise estimation was demonstrated in [4] through a method called minima controlled recursive averaging (MCRA). The noise was approximated by averaging past power spectral values of the degraded speech data. The ratio of the local energy of the noisy speech to its minimum within a stipulated time window indicated the presence of speech in the subbands. The author concluded that the noise approximation is computationally efficient and robust across signal-to-noise ratio (SNR) conditions. An improved MCRA (IMCRA) for noise approximation under different conditions was demonstrated by Cohen [5]. The noise was approximated by averaging past power spectral values using a smoothing parameter conditioned on the signal presence probability. The author claimed that the IMCRA algorithm worked better under non-stationary noisy environments and different SNR conditions.

A noise reduction algorithm was developed in [6] for speech signals degraded by short-time stationary noises and electrical disturbances. Spectral amplitude approximation was used for the enhancement of the corrupted speech data. No speech pause detectors were required for the developed noise reduction algorithm, the author claimed. The experimental results showed an improvement in noise reduction in the degraded speech data compared to existing algorithms. Noise estimation and the characteristics of noise were explained in [7]. To estimate the noise, past noisy segments were considered, and no external speech pause detectors were used for the detection of pauses in speech. The implemented algorithm can also be integrated with non-linear spectral subtraction (SS) algorithms, the authors claimed. The results revealed that the proposed algorithm outperformed the existing speech enhancement algorithms. In [8], time and frequency domain representations of corrupted speech data were used for estimating the noise. VAD was deployed to identify the active regions of the speech signal. An attenuation procedure was applied to speech and non-speech activity regions to obtain the enhanced speech data. The proposed technique had low computational complexity, making it feasible for hearing aid applications. The experimental results were compared with two speech enhancement techniques, and the proposed method performed better, the authors claimed.

In conclusion, there have been significant advancements in the area of speech enhancement under various degraded conditions. However, research specifically focused on noise estimation under highly non-stationary noisy conditions is limited [3,4,5,6,7,8,9,10, 19,20,21, 24]. In the process of noise estimation, understanding the complexities and suppressing the various types of noise under highly non-stationary noisy scenarios would be a unique contribution. Existing works have covered noise estimation techniques under different SNR conditions, but there is a gap concerning the elimination of noise under sudden variations in noise level. The main objective of the proposed work is to fill this gap by proposing a robust noise estimation technique for speech enhancement under highly non-stationary noisy scenarios. The major contributions of the proposed work are as follows:

  • Development of a robust technique for noise estimation in highly non-stationary noisy conditions.

  • Estimation of noise in each segment using time-frequency dependent smoothing factors which are calculated using Bayesian probability of speech presence.

  • Computation of the noise estimate without using VAD.

  • Performance evaluation of the proposed method with existing techniques in terms of speech quality and intelligibility after speech enhancement.

The rest of the article is organized as follows: Section 2 describes the related work. Section 3 presents the implementation of the proposed OSMC technique for speech enhancement. The experimental setup and results analysis of the proposed and existing noise estimation techniques are described in Section 4. Section 5 presents the conclusions.

2 Related work

A mono-channel speech enhancement algorithm using the Wiener filter was proposed in [21]. The algorithm estimates the noise using a first-order recursive equation and uses a smoothing parameter to update the noise in every frame. The experiments were conducted on speech sentences degraded by various types of noise. The proposed algorithm outperformed the conventional SS technique in estimating the noise in corrupted speech data, as demonstrated by objective speech quality measures. In [22], a technique was proposed for enhancing the short-time spectral amplitude (STSA) in the frequency domain. The Weibull distribution was used to model the DFT magnitudes of the clean speech signal under a Gaussian noise scenario. A decision-directed approach with a smoothing factor was used to approximate the a priori SNR. The a priori probability of speech absence was set to 0.2. The experiments were conducted on the TIMIT speech database, with the speech sentences degraded by various additive non-stationary noises. The results demonstrated the efficacy of the proposed technique over existing algorithms in terms of quality and intelligibility of speech.

An algorithm was proposed for noise power spectral density (PSD) estimation using a high-pass filter derivative under noisy conditions [23]. The authors used a spectral-flatness adaptive thresholding method for the detection of speech activity in the corrupted speech frames. The NOIZEUS database was used for the subjective and objective evaluation of the proposed noise PSD estimation algorithm and its comparison with competing methods. The results revealed that the proposed technique performed well for various degraded conditions compared to the competing algorithms. The work in [26] presented a real-time mono-channel speech enhancement technique for suppressing stationary and transient noise. Quantile noise estimation was employed for stationary noise suppression, while transient noise suppression was based on the normalized variance and gravity center of the signal. The results revealed that the proposed algorithm outperformed the existing techniques under high SNR conditions.

The authors of [24] proposed an optimized SS method for mono-channel speech enhancement. The method performed noise reduction through SS based on minimum statistics. The results revealed that the proposed method outperformed the competing techniques in terms of speech quality. In [20], a modified version of minimum-statistics noise estimation (MSNE) was implemented for speech enhancement. The noise was estimated by considering the minimum statistics of the degraded speech data. The experimental results showed that the proposed estimator outperforms the competing techniques under moderately degraded conditions but performs worse under highly non-stationary noisy conditions.

An algorithm was proposed for degraded speech enhancement by combining SS-VAD and an MMSE spectrum power estimator based on zero crossing (SPZC) [9]. Babble and musical noise suppression in degraded speech data was achieved by exploiting the efficacy of SS-VAD and MMSE-SPZC. The TIMIT and Kannada speech databases were used for the experimentation. The PESQ and composite measures were considered to evaluate the performance of the proposed technique against existing methods. The experimental results show that the amalgamation of the SS-VAD and MMSE-SPZC estimators gave better improvements in the quality and intelligibility of the enhanced speech compared to the individual techniques. In [25], the authors proposed a speech enhancement approach that optimizes subspace partitioning using a modified version of accelerated particle swarm optimization. The degraded speech data was divided into noise-only, speech-plus-noise, and speech-only components. VAD [29,30,31] was used for the detection of voice activity. The results showed that the proposed method performed well under different corrupted conditions compared to existing methods. The authors also claimed an improvement with less speech distortion.

Enhancements to the Kannada automatic speech recognition (ASR) system were described in [10]. Initially, the Kannada ASR system was developed in [11] for noisy Kannada speech data. Due to various types of degradation, the Kannada ASR system yielded low speech recognition accuracies. Therefore, the authors in [10] developed a robust noise reduction technique by combining the SS-VAD and MMSE-SPZC estimator [9]. The developed noise reduction technique was integrated with an interactive voice response system as a front end. Further, to improve the speech recognition accuracy, deep neural network and subspace Gaussian mixture modeling techniques were used. The experiments were conducted on speech sentences taken from the TIMIT and Kannada speech databases. The results demonstrated a consistent improvement in speech recognition accuracy using the enhanced speech data compared to the earlier Kannada ASR system [11]. A system was developed in [19] for robust speech encoding under degraded conditions. The authors combined SS-VAD with linear predictive coding (LPC) for encoding the noisy speech data. The experimental results revealed that the combined SS-VAD and LPC method gave better performance in terms of SNR and compression ratios. The authors also mentioned that the proposed method works better at higher SNRs and is inefficient under medium and negative SNR conditions.

A spatial approach to SS for speech enhancement was proposed in [27]. The drawbacks of SS were addressed with experimental demonstrations on various types of noise. The experimental results revealed that the proposed method gave better results in terms of speech quality and intelligibility compared to existing conventional SS algorithms. In [28], the authors used the speech enhancement algorithm of [27] to improve speech recognition accuracy under real-time conditions. The noise elimination algorithm was placed before the feature extraction stage in the spoken query system. The experiments revealed a reduction in word error rate for the offline ASR models, and the speech recognition accuracy improved during online testing.

In conclusion, the work related to noise estimation has demonstrated significant progress under different noisy conditions. Nevertheless, one crucial aspect that needs consideration is the estimation of noise under highly non-stationary noisy scenarios. Estimating noise under an instantaneous increase in noise level can effectively improve the efficacy of other speech processing applications, such as ASR, speech encoding, speaker verification, speaker identification, and speaker recognition, under noisy conditions.

3 Implementation of proposed OSMC technique for speech enhancement

Consider a sampled degraded speech signal c(i) obtained when a noise model b(i) is added to a clean speech signal a(i), where i is the sampling time index. It is assumed that a(i) and b(i) are statistically independent and have zero mean.

$$\begin{aligned} c(i) = a(i) + b(i) \end{aligned}$$
(1)

The degraded sampled speech signal c(i) is transformed into the frequency domain by applying a window h(i) to a frame of L consecutive samples of c(i). The fast Fourier transform (FFT) analysis of the sliding window yields a set of frequency domain samples, which can be written mathematically as

$$\begin{aligned} C(\lambda , k)=\sum _{\mu =0}^{L-1} c(\lambda S+\mu )\, h(\mu )\, \textrm{e}^{-\textrm{j} 2 \pi k \mu / L} \end{aligned}$$
(2)

where \(\lambda \) is the subsampled time index, S is the frame shift, and k is the frequency bin index with \(k\in \{0,1,\ldots ,L-1\}\), corresponding to the normalized center frequency \(\Omega _k=2\pi k/L\). The probability density function (PDF) of \(\mid C(\lambda ,k)\mid ^{2}\) can be represented as follows:

$$\begin{aligned} f_{\mid C(\lambda , k)\mid ^{2}}(x)=\frac{U(x)}{\sigma _{B}^{2}(\lambda , k)+\sigma _{A}^{2}(\lambda , k)} e^{-x /\left( \sigma _{B}^{2}(\lambda , k)+\sigma _{A}^{2}(\lambda , k)\right) } \end{aligned}$$
(3)

where \(\sigma _A^2(\lambda ,k) = E\{\mid A(\lambda ,k)\mid ^2\}\) and \(\sigma _B^2(\lambda ,k) = E\{\mid B(\lambda ,k)\mid ^2\}\) are the PSDs of the clean speech signal and the noise model, respectively, and U(x) denotes the unit step function. A first-order recursive equation is used to compute the smoothed power spectrum of the degraded speech data. It can be written as follows:

$$\begin{aligned} \textrm{Z}(\lambda , k)=\beta \textrm{Z}(\lambda -1, \textrm{k})+(1-\beta )\mid \textrm{C}(\lambda , \textrm{k})\mid ^{2} \end{aligned}$$
(4)

Tracking the minima of degraded speech data was explained by Martin in [3]. The term \(\beta \) in (4) can be computed as

$$\begin{aligned} \beta =\left( T_{\textrm{SM}} f_{s} / S-1\right) /\left( T_{\textrm{SM}} f_{s} / S+1\right) \end{aligned}$$
(5)

where \(T_{SM}\) is the smoothing window length of 0.2 s, S = 128 is the frame shift, and the sampling frequency is \(f_{s}\) = 8 kHz; substituting these values gives \(\beta \approx 0.85\). The tracking method depends mainly on the length of the minimum search window. A nonlinear method [6] is used in the proposed approach for tracking the minimum of the degraded speech data by continuously averaging past spectral values.

$$\begin{aligned} Z_{\min }(\lambda , k)=\left\{ \begin{array}{ll} \gamma Z_{\min }(\lambda -1, k) + \frac{1-\gamma }{1-\xi }\left( Z(\lambda , k)-\xi Z(\lambda -1, k)\right) ; &{} \text{ if } Z_{\min }(\lambda -1, k)<Z(\lambda , k) \\ Z(\lambda , k) ; &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(6)

where \(Z_{\min }(\lambda ,k)\) is the local minimum of the corrupted speech power spectrum, and \(\gamma = 0.999\) and \(\xi = 0.8\) are constants set during experimentation. The factor \(\xi \) controls the adaptation time of the minima. To find the speech activity in each frequency bin, the ratio of the degraded speech spectrum to its local minimum is considered, which can be written as follows:

$$\begin{aligned} T_{r}(\lambda , k)=\frac{Z(\lambda , k)}{{Z_{\min }}(\lambda , k)} \end{aligned}$$
(7)
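The smoothing, minima-tracking, and ratio steps in (4)–(7) can be sketched per frame as follows. This is a minimal illustration using the stated parameter values (\(T_{SM}\) = 0.2 s, S = 128, \(f_s\) = 8 kHz, \(\gamma \) = 0.999, \(\xi \) = 0.8); the function names and interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Parameters from (5) and the text: beta evaluates to about 0.852.
T_SM, S, f_s = 0.2, 128, 8000.0
beta = (T_SM * f_s / S - 1) / (T_SM * f_s / S + 1)   # (5)
gamma, xi = 0.999, 0.8

def track_frame(C_mag2, Z_prev, Zmin_prev):
    """One frame update: smoothed PSD (4), local minimum (6), ratio (7).

    C_mag2:    |C(lambda, k)|^2, power spectrum of the current noisy frame
    Z_prev:    Z(lambda-1, k), previous smoothed power spectrum
    Zmin_prev: Z_min(lambda-1, k), previous local minimum
    """
    Z = beta * Z_prev + (1 - beta) * C_mag2                     # (4)
    # (6): nonlinear minima tracking by continuous spectral averaging
    Zmin = np.where(
        Zmin_prev < Z,
        gamma * Zmin_prev + (1 - gamma) / (1 - xi) * (Z - xi * Z_prev),
        Z,
    )
    Tr = Z / np.maximum(Zmin, 1e-12)                            # (7)
    return Z, Zmin, Tr
```

Because \(\gamma \) is close to 1, the tracked minimum rises only slowly during speech activity, while the ratio \(T_r\) grows whenever the smoothed spectrum moves above the minimum.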

A Bayes minimum-cost decision rule related to speech activity detection is given by

$$\begin{aligned} \frac{p\left( T_{r} \mid H_{1}\right) }{p\left( T_{r} \mid H_{0}\right) } \lessgtr _{H_{0}^{\prime }}^{H_{1}^{\prime }} \frac{c_{10} P\left( H_{0}\right) }{c_{01} P\left( H_{1}\right) } \end{aligned}$$
(8)

where \(P(H_{0})\) and \(P(H_{1})\) are the a priori probabilities of speech absence and presence, respectively, and \(c_{ij}\) is the cost of deciding \(H_i^{\prime }\) when \(H_j\) is true. Since the likelihood ratio \(\frac{p\left( T_{r} \mid H_{1}\right) }{p\left( T_{r} \mid H_{0}\right) }\) is a monotonic function of \(T_r\), the decision rule in (8) can be simplified as follows:

$$\begin{aligned} T_{r}(\lambda , k) \lessgtr _{H_{0}^{\prime }}^{H_{1}^{\prime }} \theta (k) \end{aligned}$$
(9)

The value of \(T_{r}(\lambda , k)\) is compared with a threshold: if it exceeds the threshold, there is speech activity in that particular frequency bin; otherwise, speech activity is absent. This relies on the observation that the corrupted speech power spectrum stays close to its local minimum when there is no speech activity. The speech activity decision is made as follows:

$$\begin{aligned} D(\lambda , k)= {\left\{ \begin{array}{ll}1 ; &{} \text{ Presence } \text{ of } \text{ speech } \text{ if } {\text {Tr}}(\lambda , k)>\theta (k) \\ 0 ; &{} \text{ Otherwise } \end{array}\right. } \end{aligned}$$
(10)

where \(\theta (k)\) is a frequency-dependent threshold determined experimentally. It is set as

$$\begin{aligned} \theta (\textrm{k})=\left\{ \begin{array}{rr} 2 &{} 1 \le k \le L F \\ 2 &{} L F<k \le M F \\ 5 &{} M F<k \le f_s / 2 \end{array}\right. \end{aligned}$$
(11)

where LF and MF are the frequency bins corresponding to 1 kHz and 3 kHz, respectively. In [4], the value of \(\theta (k)\) was fixed for all frequencies. From the above rule, the speech presence decision \(D(\lambda ,k)\) is smoothed using a first-order recursion to update the speech presence probability:

$$\begin{aligned} z(\lambda , k)=\mu _{z} z(\lambda -1, k)+\left( 1-\mu _{z}\right) D(\lambda , k) \end{aligned}$$
(12)

where \(\mu _z\) is a smoothing constant. The recursion exploits the correlation of speech presence across adjacent frames or segments. Using the speech presence probability estimate, the time-frequency dependent smoothing factor is calculated as:

$$\begin{aligned} \mu _{\textrm{s}}(\lambda , k) \triangleq \mu _{\textrm{d}}+\left( 1-\mu _{\textrm{d}}\right) \textrm{z}(\lambda , k) \end{aligned}$$
(13)

where \(\mu _{\textrm{d}}\) is a constant; the values of \(\mu _{\textrm{d}}\) and \(\mu _z\) are 0.85 and 0.25, respectively. The value of \(\mu _{\textrm{s}}(\lambda , k)\) always lies in the range \(\mu _{\textrm{d}} \le \mu _{\textrm{s}}(\lambda , k) \le 1\). The noise spectrum estimate is updated as follows:

$$\begin{aligned} \textit{E}(\lambda , k)=\mu _{\textrm{s}}(\lambda , k) \textrm{E}(\lambda -1, k)+\left( 1-\mu _{\textrm{s}}(\lambda , k)\right) \mid C(\lambda , k)\mid ^{2} \end{aligned}$$
(14)

where \({E(\lambda ,k)}\) is the noise power spectrum estimate. The proposed algorithm is summarized as follows: once the classification into speech activity/non-speech activity is completed, the speech presence/absence probability is updated. Using this probability, the time-frequency dependent smoothing factor is computed. Finally, the noise spectrum estimate is updated using the time-frequency dependent smoothing factor. The proposed noise estimator is used in a speech enhancement technique to reconstruct the speech signal: negative values are floored and conjugate symmetry is ensured for a real-valued reconstruction. The phase information is applied to the whole FFT frame, which is converted back to the time domain using the inverse fast Fourier transform. The proposed method is represented as a flowchart and pseudo code in Fig. 1 and Algorithm 1, respectively.
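The per-frame update chain in (10)–(14) can be sketched as below, a minimal illustration assuming the stated constants \(\mu _d\) = 0.85 and \(\mu _z\) = 0.25; the function name and interface are hypothetical, not taken from the authors' code.

```python
import numpy as np

mu_d, mu_z = 0.85, 0.25  # smoothing constants stated in the text

def update_noise(Tr, theta, z_prev, E_prev, C_mag2):
    """One frame of the proposed noise PSD update.

    Tr:     ratio Z/Z_min from (7)
    theta:  frequency-dependent threshold from (11)
    z_prev: previous speech presence probability z(lambda-1, k)
    E_prev: previous noise PSD estimate E(lambda-1, k)
    C_mag2: |C(lambda, k)|^2 of the current noisy frame
    """
    D = (Tr > theta).astype(float)            # (10) speech presence decision
    z = mu_z * z_prev + (1 - mu_z) * D        # (12) presence probability
    mu_s = mu_d + (1 - mu_d) * z              # (13) smoothing factor in [mu_d, 1]
    E = mu_s * E_prev + (1 - mu_s) * C_mag2   # (14) noise PSD update
    return z, E
```

When speech is detected in a bin, \(\mu _s\) approaches 1 and the noise estimate freezes; in non-speech bins, \(\mu _s\) stays near \(\mu _d\) and the estimate tracks the current spectrum, which is what allows the estimator to follow a suddenly rising noise floor.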

Fig. 1
figure 1

Flowchart of the proposed algorithm

Algorithm 1
figure a

A pseudo code of the proposed algorithm.

4 Experimental results and analysis

Initially, the two performance metrics for the assessment of speech quality and intelligibility are discussed. Later, the integration procedure of the proposed noise estimation technique with a speech enhancement algorithm is explained. Further, the performance evaluation of the proposed method with existing noise estimation techniques in terms of speech quality and intelligibility after speech enhancement is demonstrated.

4.1 Performance metrics

The evaluation of the proposed and existing techniques in terms of speech quality and intelligibility involves two performance measures, namely segmental SNR (SSNR) and the normalized covariance metric (NCM) [12, 14,15,16,17]. The SSNR is a widely used objective metric for assessing speech quality. The NCM is formally characterized as a statistical measure of the covariance between the input and output envelope signals [13]. The NCM is determined by passing the stimuli through a band-pass filter bank that segments the signal into K bands across its bandwidth. The Hilbert transform is employed to compute the envelope of every band, which is subsequently down-sampled to a rate of 25 Hz. Hence, the range of envelope modulation frequencies is restricted to 0-12.5 Hz. Let \(x_i(t)\) represent the down-sampled envelope in the \(i^{th}\) frequency band of the clean signal, while \(y_i(t)\) denotes the down-sampled envelope of the processed signal. The normalized covariance for the \(i^{th}\) frequency band can be written as follows:

$$\begin{aligned} r_{i}=\frac{\sum _{t}\left( x_{i}(t)-\mu _{i}\right) \left( y_{i}(t)-\nu _{i}\right) }{\sqrt{\sum _{t}\left( x_{i}(t)-\mu _{i}\right) ^{2}} \sqrt{\sum _{t}\left( y_{i}(t)-\nu _{i}\right) ^{2}}} \end{aligned}$$
(15)
Fig. 2
figure 2

The representation of input speech sentence from NOIZEUS database, output waveforms and spectrograms of: (a). Clean speech, (b). Noisy speech (babble noise at 5 dB SNR), Enhanced speech signal using (c). OSMS [3], (d). MCRA [4], (e). IMCRA [5], (f). Spectral minima tracking [6], (g). Weighted spectral averaging [7], (h). Connected time-frequency [8], (i). SS-VAD and MMSE-SPZC [9], (j). SS-VAD [19], (k). MSNE [20] and (l). Proposed algorithm (OSMC).

where \(\mu _{i}\) and \(\nu _{i}\) are the mean values of the clean and processed envelope signals, respectively. The SNR in each band is written as follows:

$$\begin{aligned} \textrm{SNR}_{i}=10 \log _{10}\left( \frac{r_{i}^{2}}{1-r_{i}^{2}}\right) \end{aligned}$$
(16)
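The band-wise covariance and SNR computations in (15) and (16) can be sketched as below; the envelope arrays and function name are illustrative assumptions.

```python
import numpy as np

def band_snr(x_env, y_env):
    """Covariance-based band SNR from (15)-(16).

    x_env, y_env: down-sampled Hilbert envelopes of the clean and
    processed signals in one frequency band.
    """
    xc = x_env - x_env.mean()
    yc = y_env - y_env.mean()
    # (15) normalized covariance r_i between the two envelopes
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    # (16) band SNR in dB
    return 10.0 * np.log10(r ** 2 / (1.0 - r ** 2))
```

Highly correlated envelopes drive \(r_i\) toward 1 and the band SNR toward large positive values; uncorrelated envelopes yield large negative values, which (17) then maps into a transmission index.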

The transmission index (TI) for the \(i^{th}\) band is calculated using the following equation:

$$\begin{aligned} \textrm{TI}_{i}=\frac{\textrm{SNR}_{i}+15}{30} \end{aligned}$$
(17)

The NCM index is generated by averaging the TI values across all frequency bands:

$$\begin{aligned} \textrm{NCM}=\frac{\sum _{i=1}^{K} W_{i} \times \textrm{TI}_{i}}{\sum _{i=1}^{K} W_{i}} \end{aligned}$$
(18)

where \(W_i\) are the weights applied to each of the K bands. The proposed OSMC noise estimation technique is integrated with the Wiener speech enhancement technique [16] using the following gain function:

$$\begin{aligned} G(\lambda , k)= & {} \frac{A(\lambda , k)}{A(\lambda , k)+E(\lambda , k) \mu _{k}} \end{aligned}$$
(19)
$$\begin{aligned} A(\lambda , k)= & {} \max \left\{ \Vert C(\lambda , k)\Vert ^{2}-E(\lambda , k), v E(\lambda , k)\right\} \end{aligned}$$
(20)

where v = 0.001. The value of \(\mu _{k}\) in (19) is computed from the posterior SSNR [16].
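The gain computation in (19)–(20) applied to one noisy frame can be sketched as follows. This is a minimal sketch: \(\mu _k\) is taken here as a fixed scalar for illustration, whereas the text derives it from the posterior SSNR; names are hypothetical.

```python
import numpy as np

v = 0.001  # spectral floor from the text

def wiener_gain(C_mag2, E, mu_k=1.0):
    """Gain G(lambda, k) from (19)-(20).

    C_mag2: |C(lambda, k)|^2 of the noisy frame
    E:      noise PSD estimate E(lambda, k) from (14)
    """
    A = np.maximum(C_mag2 - E, v * E)   # (20) floored speech PSD estimate
    return A / (A + E * mu_k)           # (19) gain, always in (0, 1)

def enhance_frame(C, E, mu_k=1.0):
    """Apply the gain to the complex spectrum C, retaining the noisy phase;
    the inverse FFT of the result gives the enhanced time-domain frame."""
    G = wiener_gain(np.abs(C) ** 2, E, mu_k)
    return G * C
```

The floor \(v E(\lambda ,k)\) prevents negative PSD estimates, so bins dominated by noise are attenuated strongly but never driven to exactly zero, which reduces musical noise artifacts.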

4.2 A comparative analysis of proposed noise estimation technique with existing algorithms

The NOIZEUS [18] and Kannada speech databases are considered for the experimentation. The clean speech data is degraded by various types of noise, viz., babble, restaurant, and street noises, at 5 dB and 10 dB SNR levels. The Kannada speech sentences are recorded in controlled/uncontrolled environments and subsequently degraded by the same noises at the same SNR levels. From the analysis perspective, the proposed noise estimation technique is compared with the existing algorithms as follows. The main drawback of [3,4,5] is that the noise estimation incurs a computational delay and gives poor noise reduction for speech data degraded by car, babble, street, and train noises. The algorithms in [6, 7] fail to distinguish between an increase in the noise floor and an increase in speech power. The algorithms in [8,9,10, 19,20,21, 24] fail to update the noise estimate when the noise floor increases suddenly and stays at a particular level. In comparison with the existing signal processing-based noise estimation techniques [3,4,5,6,7,8,9,10, 19,20,21, 24], the noise spectrum estimated by the proposed method remains unaffected by a small search window. The proposed method uses a time-frequency dependent threshold to compute the speech presence/absence probability. In addition, the proposed algorithm works better when the noise floor increases instantaneously and stays at a level.

Table 1 Performance assessment of the proposed and existing methods in terms of SSNR for NOIZEUS and Kannada speech databases
Table 2 Performance evaluation of the proposed and existing methods in terms of NCM for NOIZEUS and Kannada speech databases

The input-output waveforms and corresponding spectrograms of the proposed technique and the existing algorithms are shown in Fig. 2. From the figure, it can be inferred that the proposed method achieves a significant amount of noise suppression in the enhanced speech data compared to the competing algorithms. Tables 1 and 2 show the performance evaluation of the proposed noise estimation technique against the existing algorithms in terms of SSNR and NCM for the NOIZEUS and Kannada speech databases, respectively. From the tables, it can be observed that the proposed noise reduction algorithm outperformed the competing techniques in terms of the average SSNR and NCM values. It is also observed that there is a consistent improvement in the quality and intelligibility of the speech data after enhancement across various types of noise at 5 dB and 10 dB SNR levels.

5 Conclusions

In this work, an algorithm was proposed for degraded speech enhancement by OSMC through iterative averaging under highly non-stationary noisy conditions. Unlike state-of-the-art signal processing-based noise estimation techniques, the proposed algorithm updates the noise estimate continuously under highly non-stationary noisy scenarios without using VAD. The experiments were conducted on the NOIZEUS and Kannada speech databases for different types of noise and SNR levels. The SSNR and NCM metrics were used for the performance evaluation of the proposed and existing algorithms in terms of speech quality and intelligibility. The results revealed that the proposed algorithm gives consistent improvements over the existing algorithms in terms of SSNR and NCM values for different types of noise and SNR levels. In the future, it would be interesting to investigate whether the proposed technique can deliver better speech quality and intelligibility under negative SNR and heavily noisy conditions. A further direction is to integrate the proposed noise elimination algorithm with an end-to-end ASR system to improve speech recognition accuracy under real-time conditions.