1 Introduction

Speech is typically distorted in real-world environments by both room resonances and background noise [10]. The goal of speech enhancement is to remove noise from a noisy speech signal while retaining the speech component and keeping speech distortion as low as possible [29]. Speech enhancement is required in a variety of applications, including mobile communication and speech recognition [46]. Speech is one of the most common ways for humans to share information [11] and remains the most natural mode of interaction in today’s technological society; it is the tool that allows us to communicate with each other [34]. Recently, the speech signal has even been used to detect the presence or absence of Covid-19 [5, 26]. The goal of single-channel speech enhancement is to improve the quality and intelligibility of speech corrupted by environmental noise, which degrades many real-world applications such as speech recognition, hearing aids, and speech telephony [32]. Speech enhancement is a common technique for improving speech quality [8]. Such disruptions diminish speech quality and intelligibility, particularly when the Signal-to-Noise Ratio (SNR) is low. Binaural speech enhancement strategies are of particular interest for assistive listening devices, such as hearing aids or headsets, where the end user expects both high speech quality and speech clarity [38]. The noise-corrupted signal is improved by using spatial or temporal modifications [20]. The speech signal is regarded as the quickest and most natural way for humans to communicate [12]. Understanding distorted speech can be complicated for both Normal Hearing (NH) and Hearing Impaired (HI) listeners, and many voice-related applications, such as Automatic Speech Recognition (ASR) and Speaker Identification (SID), perform poorly in the presence of noise [39]. Therefore, speech enhancement is essential.

There has been a great deal of study on improving speech in noisy environments [2]. Speech recognition in background noise is particularly difficult for people with hearing loss [3]. The goal of speech enhancement algorithms is to remove additive background noise from a noisy speech signal in order to improve its quality or intelligibility [9]. In general, speech enhancement refers to the processing of noisy speech signals in order to improve signal perception through better decoding by systems or humans [14]. Noises such as airport noise, train noise, and street noise frequently distort speech signals. These noises have a negative impact on the quality of the speech signal, especially in voice communication, automatic speech recognition, and speaker identification [15]. Feature selection is also an important step in improving systems that recognise emotions in speech [2]. Background noise is the primary source of speech degradation, particularly in hands-free scenarios [22]. The field of speech enhancement (SE) is concerned with the enhancement of speech signals that have been degraded by noise [16]. Speech enhancement in non-stationary noise environments is a difficult area of study [13], and it aims to improve the clarity and intelligibility of noisy speech [28]. The major intention behind speech enhancement is to suppress the noise and boost the SNR of noisy speech signals in challenging environments. Well-known techniques such as spectral subtraction, Minimum Mean Square Error (MMSE), Log-MMSE, OM-LSA, and Wiener filtering are commonly preferred for speech enhancement [19]. Speech enhancement is thus the problem of estimating clean-speech signals from noisy single-channel or multi-channel audio recordings [27].
Speech enhancement techniques have been studied for several decades with a variety of promising applications, such as telecommunications and hearing aid systems, to mitigate the harmful effects of background noise and interference [33]. Background noise acoustically added to speech can degrade the performance of digital voice processors used for applications such as speech compression, recognition, transcription, and authentication [7]. The key aim of these speech enhancement methods is to improve the speech SNR. Techniques have been introduced for boosting speech quality and compacting speech bandwidth by suppressing the additive background noise [6]. Deep Neural Networks (DNNs) are widely deployed in speech enhancement [41]. Such methods generally produce a time-frequency mask that is employed to estimate the clean speech spectrum [21]. Optimal mask generation was also introduced in a traditional method [44], but this masking strategy generally leaves residual and musical noise in the enhanced speech. Kalman Filtering (KF)-based speech enhancement is introduced in [17], where the Linear Prediction Coefficients (LPCs) are calculated using a DNN; however, under non-stationary noise settings the noise covariance is computed during speech gaps, which is ineffective. In addition, a deep audio-visual speech enhancement has been suggested [35], but this approach might break down at low SNR values.

Recently, Generative Adversarial Network (GAN)-based speech enhancement has been utilized to overcome these traditional difficulties. In particular, Speech Enhancement GAN (SEGAN) [30], conditional GAN (cGAN) [23], Wasserstein GAN (WGAN) [18], and Relativistic Standard GAN (RSGAN) [1] techniques were introduced. Despite the success of GAN-based speech enhancement techniques, two major difficulties remain: training instability and a lack of consideration for varied speech characteristics [42]. Researchers have therefore been making significant contributions to this field for decades; however, the accuracy and intelligibility of the outcomes were not always adequate.

Thus, to overcome the existing issues, an LSTM with trained speech features and an adaptive Wiener filter is introduced in this work. The major contribution of this research is listed below:

  • A modified Wiener filter is introduced for decomposing the speech spectral signal.

  • In addition, an LSTM model is introduced to properly estimate the tuning factor of the Wiener filter for each input signal.

  • In the training phase, the LSTM model is trained with the features extracted (via EMD) from the output of the modified Wiener filter.

The rest of this paper is organized as follows: Section 2 addresses the literature on speech enhancement. Section 3 gives an architectural description of the proposed speech enhancement model, and Section 4 describes its processing steps. The results acquired with the proposed work are discussed in Section 5, and Section 6 concludes the paper.

1.1 Problem statement

Most studies have shown that reducing signal noise without distorting speech is a difficult challenge, which is one of the main reasons why perfect enhancement systems are not available. In this research, we focus on issues such as low robustness and unsuitability for complex noise conditions [37], residual noise and low SNR [45], reduced speech intelligibility [43], weak denoising effect and low PESQ [40], and limited consideration of speech quality measures [10].

Compared to the existing models, the proposed work introduces a Wiener filter-assisted deep learning LSTM model. The LSTM model estimates the tuning factor of the Wiener filter with the aid of extracted features to obtain the de-noised speech signal. For simulation, the proposed model considers speech quality measures such as SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The proposed model attains higher SNR, PESQ, and robustness, and is well suited to complex noisy environments.

2 Literature review

In 2017, Zou et al. [46] introduced two speech enhancement frameworks with super-Gaussian speech modeling. Under the assumption that the Discrete Cosine Transform (DCT) coefficients of clean speech follow a Laplacian or Gamma distribution while the DCT coefficients of the noise are Gaussian distributed, the clean speech components were calculated using the MMSE estimator. Then, under the condition of speech presence uncertainty, MMSE estimators were derived. Correct estimators of the speech statistical parameters were also recommended, and a modern decision-directed approach was used to approximate the speech Laplacian component. According to the simulation data, the suggested algorithm generates very little residual distortion and has higher speech quality than Gaussian-based speech enhancement algorithms.

In 2020, Zhang et al. [37] developed an LSTM-Convolutional-BLSTM Encoder-Decoder (LCLED) for enhancing the speech signal. Transpose convolution and skip connections were both included in the LCLED. Besides that, the a priori SNR was used as the learning objective of the LCLED to achieve a higher level of enhanced speech, and post-processing was done using the MMSE method. The findings indicate that the suggested LCLED increases the accuracy and intelligibility of enhanced speech. Furthermore, the running time of the LCLED model was 130 sec.

In 2021, Khattak et al. [29] proposed a “phase compensated perceptually weighted -order Bayesian estimator” that modifies both the magnitude and phase spectra to improve noisy speech. First, the phase of the noisy speech spectra alone was compensated. Second, the magnitude spectra were manipulated using a perceptually motivated -order Bayesian estimator; to obtain a stronger gain function, this estimator combines the benefits of the perceptually-weighted and -order spectral amplitude estimators. The compensated phase spectra and estimated magnitude spectra were then merged to reconstruct the noise-attenuated speech signals. Using the NOIZEUS and AURORA corpora, the proposed speech enhancement strategy was tested over various noise ranges (0 dB to +10 dB) in terms of quantitative quality and intelligibility tests. In both non-stationary and stationary noisy settings, the proposed approach significantly enhanced performance while ensuring intelligibility.

In 2020, Tan et al. [45] introduced a Fully Convolutional Neural Network (FCNN) to achieve end-to-end speech enhancement. The encoder and decoder, together with an extra Convolutional-Based Short-Time Fourier Transform (CSTFT) layer and an inverse CSTFT (CISTFT) layer, were applied to simulate the forward and inverse STFT operations, respectively. Since the fundamental phonetic information of speech is presented more clearly by Time-Frequency (T-F) representations, these layers seek to incorporate frequency-domain knowledge into the proposed model. In addition, a Temporal Convolutional Module (TCM), which is effective for processing the long-term correlations of speech signals, was integrated between the encoder and decoder. According to the experimental findings, the suggested model consistently outperforms other competitive speech enhancement models.

In 2020, Zhu et al. [43] used a “Deep Neural Network (DNN)-augmented colored-noise Kalman filter” to develop a novel speech enhancement system. The authors modelled both the noise and the clean speech signal as Autoregressive (AR) processes. A multi-objective DNN was trained to map the noisy acoustic features to the Line Spectrum Frequencies (LSFs), from which the LPCs are obtained. Denoising was applied to the noisy speech using the colored-noise Kalman filter with DNN-estimated parameters. Finally, residual noise in the Kalman-filtered speech was removed using a post-subtraction procedure. The proposed work achieved the best estimation accuracy for street noise and produced better outcomes on unseen noise.

In 2021, Wei et al. [40] proposed the Constant Q Transform (CQT) to enhance the resolution of lower-frequency speech components, with the NMF/Sparse NMF (SNMF) algorithm used at the back end. The experimental results demonstrate that, at low SNR, the proposed approach outperforms the Short-Time Fourier Transform (STFT) baseline in terms of PESQ and STOI.

In 2020, Zhou et al. [32] suggested a modified bark spectral distortion loss mechanism, which can be thought of as an auditory perception-based MSE, to replace the traditional MSE in DNN-based speech enhancement approaches and further increase objective perceptual quality. Compared to DNN-based methods using the traditional MSE criterion, experiments demonstrated that the proposed method can boost speech enhancement performance, particularly in terms of objective perceptual quality, in all experimental settings.

In 2021, Chen et al. [10] proposed a multi-objective multi-channel speech enhancement approach. To deal with noise and reverberation, the work used a Bidirectional Long Short-Term Memory (BiLSTM) network. The Log-Power Spectra (LPS) of the noisy speech from each channel of the microphone array were given as input to the BiLSTM network to predict the LPS and Ideal Ratio Mask (IRM) of clean speech. The intermediate LPS and IRM features obtained from both channels were then fused into a single LPS using a fusion layer. Moreover, the relation between the fused single-channel features and the clean speech LPS was learned via a DNN. Experimental findings showed the suggested method’s viability and adaptability.

The advantages, as well as the challenges of the existing literature works discussed in the literature section, are manifested in Table 1.

Table 1 Review on Speech Enhancement models

3 Proposed speech enhancement model: An architectural description

The proposed speech enhancement model’s design is shown in Fig. 1, with the overall mechanism divided into “two main phases (i) Training Phase (ii) Testing Phase”.

Fig. 1
figure 1

Schematic overview of the proposed work

The proposed model is constructed in three major phases: (a) noise spectrum and signal spectrum estimation, (b) feature extraction, and (c) speech enhancement. In the training phase, the noise signal W(t) (“airport noise, exhibition noise, restaurant noise, station noise, and street noise”) is added to the clean speech signal S(t). The formulated noisy speech signal is shown in Eq. (1)

$$ R(t)=S(t)+W(t) $$
(1)

Then, for this R(t), the NMF-based spectrum is estimated to find the noise spectrum SpeN(n) and signal spectrum SpeS(n), respectively. The obtained spectra (noise and signal) are given as input to the statistical Wiener filter, from which the filtered signal F(n) is generated. Since the tuning factor η plays a key role in the Wiener filter, it has to be determined for each signal and is learned by the LSTM algorithm. The filtered signals F(n) are subjected to EMD, from which the denoised signal is obtained. Then, from the denoised signal acquired via EMD, the bark frequency b(n) is evaluated, and from the computed bark frequency, the fractional delta AMS-based features are extracted. Subsequently, with these extracted features fFD − AMS, the LSTM algorithm (a deep learning model) is trained. The LSTM provides the suited tuning factor ηtuned for each input signal in the modified Wiener filter.

This ηtuned is fed as input to the modified Wiener filter, whose output spectral signal is decomposed by EMD; the output of EMD is the denoised signal.

4 Processing steps of proposed speech enhancement model

This is the initial step, where the noise spectrum SpeN(n) and signal spectrum SpeS(n) are extracted from the noisy signal R(t). The NMF model has higher physical significance and is easier to implement than traditional matrix decomposition algorithms, which is the reason for utilizing NMF in this research work. In voice applications, a priori information can be obtained by applying NMF to training data instead of the clean signal.

To improve the speech signal, the noisy signal and the speech signal in the time-frequency (γ, p) domain are computed using the STFT, as defined in Eq. (2), where the clean speech STFT S(p, γ), the distorted speech STFT R(p, γ), and the noise signal STFT W(p, γ) are given for the pth frequency bin of the γth frame. Eq. (3) shows the magnitude-spectrum approximation of the noisy speech, which is the most often used assumption in NMF-based speech and audio signal processing.

$$ R\left(p,\gamma \right)=S\left(p,\gamma \right)+W\left(p,\gamma \right) $$
(2)
$$ \mid R\left(p,\gamma \right)\mid =\mid S\left(p,\gamma \right)+W\left(p,\gamma \right)\mid $$
(3)

Eq. (4) shows the magnitude spectrum matrix of a signal, where j_{p,γ} is the magnitude spectral value of the pth bin in the γth frame. The frequency bin count is denoted by H, and the number of time frames by I.

$$ J=\left[{j}_{p,\gamma}\right]\in {N}_{+}^{H\times I} $$
(4)

Eq. (5) is applied separately in the training stage to the training data \( {J}_S\in {N}_{+}^{H\times {I}_S} \) and \( {J}_W\in {N}_{+}^{H\times {I}_W} \), yielding the basis matrices of clean speech \( {F}_S=\left[{r}_{Hl}^S\right]\in {N}_{+}^{H\times {L}_S} \) and noise \( {F}_W=\left[{r}_{Hl}^W\right]\in {N}_{+}^{H\times {L}_W} \), respectively, where L denotes the total number of basis vectors. In Eq. (5), T′ denotes the transpose and ζ is an H × I matrix whose entries are all equal to one. In the enhancement stage, the basis matrix is fixed as \( F=\left[{F}_S{F}_W\right]\in {N}_{+}^{H\times \left({L}_S+{L}_W\right)} \). The activation matrix \( {E}_{\hat{T}}={\left[{E}_S^{T\prime }{E}_W^{T\prime}\right]}^{T\prime}\in {N}_{+}^{\left({L}_S+{L}_W\right)\times {I}_{\hat{T}}} \) corresponding to the noisy speech is estimated from \( {J}_{\hat{T}}\in {N}_{+}^{H\times {I}_{\hat{T}}} \) by employing the NMF activation update only. Once the activation matrix is obtained, the clean speech spectrum is evaluated with the help of the Wiener Filter (WF), as per Eq. (6). In Eq. (6), the estimated Power Spectral Density (PSD) matrices corresponding to clean speech are denoted by \( V{\prime}_S=\left[V{\prime}_S\left(p,\gamma \right)\right] \), while those corresponding to the noise are denoted by \( V{\prime}_W=\left[V{\prime}_W\left(p,\gamma \right)\right]\in {N}_{+}^{H\times {I}_{\hat{T}}} \). These PSDs are obtained by temporal smoothing, as seen in Eqs. (7) and (8), where ρS and ρW are the temporal smoothing factors of speech and noise, respectively.

$$ {\displaystyle \begin{array}{c}F\leftarrow F\otimes \frac{\left(J/F.E\right)E}{\zeta E},\\ {}E\leftarrow E\otimes \frac{F\left(J/F.E\right)}{F^{T^{\prime }}\zeta}\end{array}} $$
(5)
$$ Q=\frac{V{\prime}_S}{V{\prime}_S+V{\prime}_W}\otimes \hat{T} $$
(6)
$$ V{\prime}_S\left(p,\gamma \right)={\rho}_SV{\prime}_S\left(p,\gamma -1\right)+\left(1-{\rho}_S\right){\left({\left[{F}_S{E}_S\right]}_{p\gamma}\right)}^2 $$
(7)
$$ V{\prime}_W\left(p,\gamma \right)={\rho}_WV{\prime}_W\left(p,\gamma -1\right)+\left(1-{\rho}_W\right){\left({\left[{F}_W{E}_W\right]}_{p\gamma}\right)}^2 $$
(8)

The signal spectrum SpecS and the noise spectrum SpecN are obtained as the outcomes, and both are then filtered using the Wiener filtering method.
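As a sketch of this stage, the multiplicative updates of Eq. (5) for a magnitude spectrogram J ≈ F·E can be written as below. The function name, iteration count, random initialization, and small ε guards are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def nmf_kl(J, L, n_iter=200, seed=0):
    """Factor a nonnegative magnitude spectrogram J (H x I) as J ~ F @ E
    using the multiplicative updates of Eq. (5), with zeta the all-ones matrix."""
    rng = np.random.default_rng(seed)
    H, I = J.shape
    F = rng.random((H, L)) + 1e-6   # basis matrix
    E = rng.random((L, I)) + 1e-6   # activation matrix
    ones = np.ones_like(J)          # the all-ones H x I matrix "zeta"
    for _ in range(n_iter):
        F *= ((J / (F @ E + 1e-12)) @ E.T) / (ones @ E.T + 1e-12)
        E *= (F.T @ (J / (F @ E + 1e-12))) / (F.T @ ones + 1e-12)
    return F, E
```

In the training stage this factorization would be run separately on clean-speech and noise spectrograms to obtain F_S and F_W; in the enhancement stage the basis is fixed to [F_S F_W] and only the activation update for E is iterated.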

4.1 Wiener filter

“The Wiener filter’s purpose is to compute a statistical estimate of an unknown signal by taking a similar signal as an input and filtering it to create the estimate as an output”. The Wiener filter is being used on a wide scale in signal amplification techniques [36]. The Wiener filter is premised on the idea of estimating the clean signal from the distorted noise signal. The major goal of the Wiener filter is to diminish the noise from the corrupted signal. Thus, the approximation is done by reducing the MSE between the target signal and the noise distorted signal.

The estimated noise spectrum SpecN and signal spectrum SpecS are fed as input to statistical Wiener filtering, in which both are filtered. The solution to this frequency-domain optimization problem is given by the filter transfer function shown in Eq. (9); to arrive at this equation, the signal spectrum SpecS and the noise spectrum SpecN are treated as uncorrelated and stationary signals. SpecS has a power spectral density of pdfS(ω), while SpecN has a power spectral density of pdfW(ω). Eq. (10) shows the statistical formula for the SNR, and Eq. (11) shows how the SNR can be used in the filter transfer function. Here, GW(ω) denotes the estimated noise spectrum.

$$ F\left(\omega \right)=\frac{pdf_S\left(\omega \right)}{pdf_S\left(\omega \right)+{pdf}_W\left(\omega \right)} $$
(9)
$$ SNR=\frac{pdf_S\left(\omega \right)}{G_W\left(\omega \right)} $$
(10)
$$ F\left(\omega \right)={\left[1+\frac{1}{SNR}\right]}^{-1} $$
(11)

At the end of filtration, the filtered signal F(n) is generated. Then, from these filtered signals, the features like the EMD, bark frequency, and delta AMS are extracted.
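The gain of Eqs. (9)–(11) can be sketched directly from the two PSD estimates; the helper name and the small ε guard are assumptions added for illustration and numerical safety:

```python
import numpy as np

def wiener_gain(psd_speech, psd_noise, eps=1e-12):
    """Wiener transfer function F(w) = P_S / (P_S + P_W), cf. Eq. (9).
    Equivalent to (1 + 1/SNR)^(-1) with SNR = P_S / P_W, cf. Eqs. (10)-(11)."""
    return psd_speech / (psd_speech + psd_noise + eps)
```

The two forms agree term by term: dividing numerator and denominator of Eq. (9) by the speech PSD gives exactly the (1 + 1/SNR)⁻¹ expression of Eq. (11).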

4.2 Empirical mode curve decomposition

The EMD features are extracted from F(n). Huang proposed EMD as an adaptive strategy in which a finite number of Intrinsic Mode Functions (IMFs) represent complex data. The IMFs ye(n) and residue q(n) are decomposed from the data F(n). Eq. (12) describes the formula corresponding to this decomposition.

$$ F(n)=\sum \limits_e{y}_e(n)+q(n) $$
(12)

The steps are given below:

  • Step 1: Initialization: set d ≔ 1 and the initial residue q0(n) ≔ F(n).

  • Step 2: The dth IMF is extracted using the steps below.

    (a) Let k0(n) ≔ qd − 1(n) and m ≔ 1.

    (b) All local maxima and minima of km − 1(n) are identified.

    (c) Using cubic spline interpolation, the upper envelope UBm − 1(n) of km − 1(n) is defined by the maxima, and the lower envelope LBm − 1(n) by the minima.

    (d) The mean zm − 1(n) of both envelopes of km − 1(n) is calculated as \( {z}_{m-1}(n)=\frac{1}{2}\left({UB}_{m-1}(n)+{LB}_{m-1}(n)\right) \). This moving mean is called the low-frequency local trend, while the high-frequency local detail is assessed through the sifting process.

    (e) The mth sifting component is formed as km(n) ≔ km − 1(n) − zm − 1(n).

      • If km(n) does not meet all of the IMF conditions, the sifting process is continued from step (b) with m ≔ m + 1.

      • If km(n) satisfies all of the IMF conditions, then set yd(n) ≔ km(n) and qd(n) ≔ qd − 1(n) − yd(n).

  • Step 3: If qd(n) is a residuum, the sifting process is stopped; otherwise, the process resumes from Step 2 with d ≔ d + 1.

Furthermore, the EMD algorithm directly satisfies completeness of the decomposition, since \( F(n)=\sum \limits_{d=1}^v{y}_d(n)+q(n) \) is an identity. Since neighbouring IMFs can occupy equivalent frequencies at different time points, the locally orthogonal IMFs provided by the EMD algorithm do not guarantee global orthogonality. From the denoised signal, the bark frequency b(u) is then obtained.
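A minimal sketch of the sifting procedure above, using NumPy and SciPy cubic splines. The extrema detection, the simple stopping threshold, and the fixed IMF count are simplifying assumptions, not the paper's exact IMF criteria:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def _envelope(x, idx, n):
    # cubic-spline envelope through the extrema, anchored at the endpoints
    pts = np.concatenate(([0], idx, [n - 1]))
    return CubicSpline(pts, x[pts])(np.arange(n))

def sift_imf(x, max_sift=50):
    """Extract one IMF from x following steps (b)-(e) of Section 4.2."""
    k = x.copy()
    n = len(k)
    for _ in range(max_sift):
        d = np.diff(k)
        maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
        minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
        if len(maxima) < 2 or len(minima) < 2:
            break
        ub = _envelope(k, maxima, n)
        lb = _envelope(k, minima, n)
        z = 0.5 * (ub + lb)          # mean of upper and lower envelopes
        if np.mean(np.abs(z)) < 1e-3 * np.mean(np.abs(k)):
            break                    # crude IMF stopping criterion (assumption)
        k = k - z
    return k

def emd(x, n_imfs=3):
    """Decompose x into IMFs plus a residue: x = sum(imfs) + residue (Eq. (12))."""
    residue = x.copy()
    imfs = []
    for _ in range(n_imfs):
        imf = sift_imf(residue)
        imfs.append(imf)
        residue = residue - imf
    return imfs, residue
```

Because each residue is formed by subtraction, the identity x = Σ IMFs + residue holds exactly by construction, mirroring the completeness property noted above.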

4.3 Fractional DeltaAMS feature

The AMS features are the spectral amplitudes of b(n). The delta features capture the fine variations across the frequency and time domains; let f(tim, freq) be the AMS feature vector of b(n). The augmented feature vector is defined in Eq. (13) – Eq. (15).

$$ f\left( tim,\boldsymbol{freq}\right)=\left[f\left( tim,\boldsymbol{freq}\right),\Delta {f}_{Tim}\left( tim,\boldsymbol{freq}\right),\Delta {f}_{freq}\left( tim,\boldsymbol{freq}\right)\right] $$
(13)
$$ {\displaystyle \begin{array}{c}\Delta {f}_{Tim}\left( tim,\boldsymbol{freq}\right)=f\left( tim,\boldsymbol{freq}\right)-f\left( tim-1,\boldsymbol{freq}\right);\\ {}\mathrm{where}\; tim=2,\dots, Tim\end{array}} $$
(14)
$$ \Delta {f}_{Tim}\left( tim=1,\boldsymbol{freq}\right)=f\left( tim=2,\boldsymbol{freq}\right)-f\left( tim=1,\boldsymbol{freq}\right) $$
(15)

The delta feature vector computed across frequency and time is denoted by ΔfTim(tim, freq). Fractional calculus is used to retain the most important information in the delta-AMS features; incorporating it improves the convergence speed while reducing the computing load. As a result, Eq. (15) can be rewritten as

$$ \Delta f\left( tim,\boldsymbol{freq}\right)=f\left( tim,\boldsymbol{freq}\right)-f\left( tim-1,\boldsymbol{freq}\right)\cong {E}^{\sigma}\left[\Delta f\left( tim,\boldsymbol{freq}\right)\right] $$
(16)

Here, Eσ[Δf(tim, freq)] denotes the fractional calculus term with fractional order σ. The incorporation of Eσ[Δf(tim, freq)] into the delta-AMS features plays a key role in enhancing the speech signal. The formulated FD-AMS features are given by Eq. (17) and Eq. (18), respectively, and the extracted fractional delta-AMS features are denoted as fFD − AMS.

$$ \Delta {f}_{Tim}\left( tim,\boldsymbol{freq}\right)={E}^{\sigma}\left[\Delta f\left( tim,\boldsymbol{freq}\right)\right] $$
(17)
$$ {\displaystyle \begin{array}{c}\Delta {f}_{Tim}\left( tim,\boldsymbol{freq}\right)=\Delta f\left( tim,\boldsymbol{freq}\right)-\frac{1}{2}\Delta f\left( tim-1,\boldsymbol{freq}\right)-\\ {}\frac{1}{6}\left(1-\sigma \right){E}^{\sigma}\left[\Delta f\left( tim-2,\boldsymbol{freq}\right)\right]-\\ {}\frac{1}{24}\sigma \left(1-\sigma \right)\left(2-\sigma \right)\left[\Delta f\left( tim-3,\boldsymbol{freq}\right)\right]\end{array}} $$
(18)

The LSTM network is trained with the extracted features fFD − AMS.
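The time-delta of Eqs. (14)–(15) can be sketched as below; `delta_time` is a hypothetical helper name, and the fractional weighting of Eq. (18) is deliberately omitted here:

```python
import numpy as np

def delta_time(f):
    """First-order delta of AMS features across time, cf. Eqs. (14)-(15).
    f has shape (Tim, Freq); row 0 uses the forward difference of Eq. (15)."""
    d = np.empty_like(f)
    d[1:] = f[1:] - f[:-1]   # Eq. (14): f(tim) - f(tim-1) for tim = 2..Tim
    d[0] = f[1] - f[0]       # Eq. (15): boundary case tim = 1
    return d
```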

4.4 LSTM network

For speech enhancement, the extracted features fFD − AMS are fed to the LSTM. The LSTM setup uses a chain of repeating LSTM cells, each made up of three multiplicative units that represent the “forget gate, input gate, and output gate” [31]. These units enable LSTM memory cells to store and transfer data over longer periods of time. Let the variables M and C denote the hidden and cell states, respectively. The standard LSTM cell performs the following operations while generating the output ηtuned:

$$ {I}_t=\sigma \left({J}_I{X}_t+{K}_I{M}_{t-1}+{B}_I\right) $$
(19)
$$ {F}_t=\sigma \left({J}_F{X}_t+{K}_F{M}_{t-1}+{B}_F\right) $$
(20)
$$ {O}_t=\sigma \left({J}_O{X}_t+{K}_O{M}_{t-1}+{B}_O\right) $$
(21)
$$ {C}_t={F}_t{C}_{t-1}+{I}_t{G}_t $$
(22)
$$ {G}_t=\mathit{\tanh}\left({J}_G{X}_t+{K}_G{M}_{t-1}+{B}_G\right) $$
(23)
$$ {M}_t={O}_t\mathit{\tanh}\left({C}_t\right) $$
(24)

Here, It, Ft, and Ot are the input, forget, and output gates at time t. The weights that map the input to the input, forget, and output gates are JI, JF, and JO, while the recurrent weight matrices that map the previous hidden state to the gates are KI, KF, and KO. BI, BF, BO, and BG are the bias vectors, and the sigmoid function σ is used as the gate activation function. Furthermore, the candidate cell input and the layer output are denoted by Gt and Mt, respectively. The architecture of the LSTM is shown in Fig. 2.
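The cell equations (19)–(24) can be sketched for a single time step as follows; the parameter-dictionary layout and the dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X, M_prev, C_prev, p):
    """One LSTM cell update following Eqs. (19)-(24).
    p holds input weights J_*, recurrent weights K_* and biases B_*
    for the gates I, F, O and the candidate input G."""
    I = sigmoid(p["JI"] @ X + p["KI"] @ M_prev + p["BI"])   # input gate, Eq. (19)
    F = sigmoid(p["JF"] @ X + p["KF"] @ M_prev + p["BF"])   # forget gate, Eq. (20)
    O = sigmoid(p["JO"] @ X + p["KO"] @ M_prev + p["BO"])   # output gate, Eq. (21)
    G = np.tanh(p["JG"] @ X + p["KG"] @ M_prev + p["BG"])   # candidate, Eq. (23)
    C = F * C_prev + I * G                                   # cell state, Eq. (22)
    M = O * np.tanh(C)                                       # hidden state, Eq. (24)
    return M, C
```

Since the output gate lies in (0, 1) and tanh(C) in (−1, 1), the hidden state M is always bounded in (−1, 1).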

Fig. 2
figure 2

The architecture of LSTM

4.5 Modified wiener filtering

The importance of the tuning factor ηtuned has been well established in this research work. Based on b(u) (the bark frequency) of the NMF-based filtered EMD signal, the tuning factor of the Wiener filter is estimated and fine-tuned by the LSTM. Mathematically, b(u) can be expressed by Eq. (25).

$$ b(u)=13\mathit{\arctan}(0.76u)+3.5\mathit{\arctan}\left[{(0.33u)}^2\right] $$
(25)
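Eq. (25) can be evaluated directly; here u is assumed to be the frequency in kHz. For context, the classical Zwicker bark mapping is 13 arctan(0.00076 f) + 3.5 arctan((f/7500)²) with f in Hz, of which Eq. (25) appears to be a rescaled variant:

```python
import numpy as np

def bark(u):
    """Bark frequency b(u) as printed in Eq. (25); u assumed to be in kHz."""
    return 13.0 * np.arctan(0.76 * u) + 3.5 * np.arctan((0.33 * u) ** 2)
```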

For tuning η in a more precise manner, we introduce a new modified Wiener filter model, which overcomes the drawbacks of the existing Wiener filter: it cannot estimate the power spectra efficiently, it struggles to achieve perfect restoration given the random nature of the noise, and it is comparatively slow to apply since it requires working in the frequency domain. The newly developed modified Wiener filter overcomes these drawbacks and is formulated as per Eq. (26).

$$ H\left(\omega \right)=\frac{R\left(\omega \right)}{R\left(\omega \right)+ En/\left({E}_y- En\right).\alpha .R\left(\omega \right)} $$
(26)

Here, En denotes the energy of the noise-free speech, Ey denotes the energy of the noisy speech, and R(ω) is the noisy speech spectrum. In addition, α represents the noise suppression factor.
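Eq. (26) can be sketched as below; the helper name and ε guards are illustrative. Note that, as printed, the R(ω) terms cancel, so the gain reduces to the frequency-independent value 1/(1 + α·En/(Ey − En)) controlled by the suppression factor α:

```python
import numpy as np

def modified_wiener_gain(R, E_n, E_y, alpha, eps=1e-12):
    """Modified Wiener transfer function of Eq. (26):
    H(w) = R(w) / (R(w) + (E_n / (E_y - E_n)) * alpha * R(w)).
    R is the noisy spectrum; E_n, E_y, alpha follow Section 4.5."""
    factor = E_n / (E_y - E_n + eps)
    return R / (R + factor * alpha * R + eps)
```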

The properly estimated tuning factor ηtuned acquired from the LSTM is fed as input to the Wiener filter, instead of a constant η. The outcome of the modified Wiener filter is the filtered signal \( \overline{F_u(t)} \). Again, \( \overline{F_u(t)} \) is decomposed using EMD, and the result is the enhanced denoised signal \( \overline{\overline{S(t)}} \).

In the training process, the training library is constructed by giving the known b(u) and tuning factor ηtuned as inputs. The testing process is said to be the online process, while the training process is an offline process: the appropriate tuning factor for diverse noises is identified offline, and the LSTM is trained with this information. In the online mechanism, where the real enhancement takes place, the tuning factor is supplied by the trained network.

5 Results and discussion

MATLAB was used to implement the proposed speech enhancement model. The data for the current study were obtained from [4]. The five noise categories in this database, namely “airport noise, exhibition noise, restaurant noise, station noise, and street noise,” are added to speech signals at differing SNR levels (0 dB, 5 dB, 10 dB, and 15 dB) to measure the efficacy of the suggested work for speech enhancement. Memory bandwidth is the amount of memory that can be used to process files in a second. The total memory of the system utilized is 12 GB, and the memory bandwidth of the proposed method is 1.6 GB. The time step for each sequence is given in Table 2; here, the time step represents the number of samples.
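The construction of a noisy mixture at a prescribed SNR level can be sketched as follows; the energy-matching scaling rule is a standard one and an assumption about how the mixtures were formed:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB,
    as when forming R(t) = S(t) + W(t) at 0/5/10/15 dB."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```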

Table 2 Time step for sequence

Figure 3 depicts the spectra of the clean signal, noised signal (mixture of clean and noise signals), and de-noised signal for the airport, exhibition hall, restaurant, train station, and street cases, respectively. Signal-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), Signal-to-Noise Ratio (SNR), Root-Mean-Square Error (RMSE), Correlation (CORR), Extended STOI (ESTOI), Short-Time Objective Intelligibility (STOI), and Cumulative Squared Euclidean Distance (CSED) are all used to analyze the performance of the proposed work. The comparative evaluation is made between the proposed model and existing models such as multi-features + DCNN-based speech enhancement [15], Diminished Empirical Mean Curve Decomposition (D-EMCD) [14], Neural Network (NN) + autocorrelation, spectral subtraction [7], Optimal Modified Minimum Mean Square Error Log-Spectral Amplitude (OMLSA) [6], Two-Step Noise Reduction (TSNR) [24], Harmonic Regeneration Noise Reduction (HRNR) [25], and Regularized Nonnegative Matrix Factorization (RNMF) [9].
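Of the listed measures, the waveform-level ones can be sketched directly as below (PESQ, STOI, and ESTOI require their own reference implementations and are not reproduced here; the function names are illustrative):

```python
import numpy as np

def rmse(clean, est):
    """Root-mean-square error between clean and estimated signals."""
    return np.sqrt(np.mean((clean - est) ** 2))

def snr_db(clean, est, eps=1e-12):
    """Output SNR in dB, treating (clean - est) as the residual noise."""
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum((clean - est) ** 2) + eps))

def corr(clean, est):
    """Pearson correlation coefficient between the two waveforms."""
    return np.corrcoef(clean, est)[0, 1]
```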

Fig. 3
figure 3

The spectrum of the clean signal, noised signal (mixture of clean and noise signal), and de-noised signal for (a) airport, (b) exhibition hall, (c) restaurant, (d) train-station, and (e) street noise

5.1 Influence on airport noise under varying SNR

  • In order to validate the proposed work even under varying noise conditions, the airport noise signal Wair(t) is added to the clean speech signal S(t). The formulated noisy speech signal Rair(t) = S(t) + Wair(t) is validated at varying SNR levels; Wair(t) is added to S(t) at 0 dB, 5 dB, 10 dB, and 15 dB. The formulated airport noisy signal Rair(t) is then evaluated against the existing models, namely multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The obtained results are tabulated in Tables 3, 4, 5 and 6, which correspond to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. The results show that the proposed work delivered the best performance, with higher SDR, PESQ, CORR, ESTOI, STOI, and SNR, as well as a lower RMSE. Initially, when Wair(t) is added at 0 dB, the proposed work achieved the highest SDR value of 6.89, which is the best score compared to multi-features + DCNN based speech enhancement = 5.98, D-EMCD = 4.83, NN + auto-correlation = −43.19, spectral subtraction = −8.80, OMLSA = −23.11, TSNR = −7.41, HRNR = −7.42, and RNMF = 5.82. In addition, the PESQ of the proposed work is 2.17 at SNR = 0 dB, which is 4.9%, 10.43%, 75.06%, 74.7%, 40.78%, 40.78%, 34.95%, 36.325% and 11.09% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. When Rair(t) is added at SNR = 5 dB, the PESQ of the proposed work is 2.57, which is the maximal value when compared to multi-features + DCNN based speech enhancement = 2.41, D-EMCD = 2.36, NN + auto-correlation = 0.53, spectral subtraction = 0.82, OMLSA = 1.32, TSNR = 1.86, HRNR = 1.85, and RNMF = 2.31.
In addition, the proposed work achieved the maximal ESTOI of 0.71, which is 13.83%, 8.6%, 99.9%, 71.9%, 31.4%, 31.4%, 34.8%, 37.65%, and 13.5% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively, at SNR = 5 dB. When Wair(t) is added at 10 dB, the proposed work attained favorable outcomes, as shown in Table 5. The CORR of the proposed work at SNR = 10 dB is 0.97, which is 0.18%, 0.14%, 99.9%, 92%, 91.85%, 91.85%, 97.6%, 97.7% and 8.4% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Moreover, observing the outcomes in Table 5, the proposed work achieved the least RMSE of 0.007, which is 50.2%, 34.1%, 82.8%, 83.2%, 98.6%, 98.6%, 84.5%, 84.4%, and 69.2% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. The modification is made to the Wiener filter: the LSTM model is used to accurately estimate the tuning factor of the Wiener filter for all input signals. The extracted features (EMD) were used to train the LSTM model, and the modified Wiener filter was applied during the testing phase. Thus, the proposed work enhanced the quality of the speech signal even under the airport environment.
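The paper's modified Wiener filter relies on an LSTM-estimated tuning factor η. The exact gain function is defined in the method section; purely as an illustration, a common parametric Wiener gain with a tuning factor takes the form G(k) = ξ(k)/(ξ(k) + η), where ξ(k) is the a-priori SNR of frequency bin k. The sketch below assumes that form together with a crude spectral-subtraction estimate of ξ; both are our assumptions, not the authors' exact formulation:

```python
import numpy as np

def parametric_wiener_gain(noisy_psd, noise_psd, eta=1.0):
    """Illustrative parametric Wiener gain G = xi / (xi + eta).

    xi (a-priori SNR) is estimated here by simple spectral subtraction;
    in the paper, the LSTM supplies eta, which is a fixed scalar in this sketch.
    """
    xi = np.maximum(noisy_psd - noise_psd, 1e-10) / noise_psd
    return xi / (xi + eta)

def enhance_frame(noisy_spectrum, noise_psd, eta):
    """Apply the gain to one STFT frame, keeping the noisy phase."""
    gain = parametric_wiener_gain(np.abs(noisy_spectrum) ** 2, noise_psd, eta)
    return gain * noisy_spectrum
```

Under this form, a larger η suppresses more noise at the cost of more speech attenuation, which is why a per-signal estimate of η (here provided by the LSTM) is preferable to a fixed constant.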

Table 3 Performance evaluation of proposed model over existing for Airport Noise at varying SNR = 0 dB
Table 4 Performance evaluation of proposed model over existing for Airport Noise at varying SNR = 5 dB
Table 5 Performance evaluation of proposed model over existing for Airport Noise at varying SNR = 10 dB
Table 6 Performance evaluation of proposed model over existing for Airport Noise at varying SNR = 15 dB

5.2 Influence on exhibition hall noise under varying SNR

The noise created in the exhibition hall, Whall(t), is added to S(t) at varying SNR rates. The formulated noisy signal R(t) = S(t) + Whall(t) is evaluated in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The acquired results are tabulated in Tables 7, 8, 9 and 10, corresponding to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. When adding Whall(t) at 0 dB to S(t), the proposed work achieved the highest SNR of 34.27, which is better than the existing models multi-features + DCNN based speech enhancement = 5.24, D-EMCD = 5.16, NN + auto-correlation = −0.006, spectral subtraction = −0.31, OMLSA = −22.20, TSNR = −0.99, HRNR = −0.87, and RNMF = 2.81. In addition, the RMSE of the proposed work is 18.7%, 17.7%, 53.5%, 55.18%, 96.4%, 96.4%, 58.5%, 57.9% and 36.3% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, when Whall(t) is added to S(t) at 5 dB, the SNR of the proposed work is 36.81, which is 77.16%, 77.85%, 99.9%, 99.3%, 39.7%, 39.7%, 97.3%, 97.52%, and 89.5% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, the CORR of the proposed work is 2.1%, 0.4%, 99.9%, 99.56%, 91.8%, 91.8%, 97.9%, 97.9%, and 6.9% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively.
Moreover, when Whall(t) is added at 10 dB to S(t), the PESQ acquired from the proposed work is 2.663, which is better than the existing models multi-features + DCNN based speech enhancement = 2.49, D-EMCD = 2.58, NN + auto-correlation = 0.39, spectral subtraction = 0.92, OMLSA = 1.49, TSNR = 2.25, HRNR = 2.26, and RNMF = 2.48. Moreover, when Whall(t) is applied at 15 dB to S(t), the proposed work is 0.3%, 0.7%, 84.3%, 64.4%, 44.2%, 44.2%, 8.6%, 8.15%, 8.15% and 6.3% better than the existing works in terms of PESQ. In the case of SNR for Whall(t) applied at 15 dB, the proposed work is 67.8%, 71.7%, 99.9%, 99.43%, 48.3%, 48.3%, 97.8%, 97.9%, and 87% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. The final result is a speech-enhanced signal with negligible noise. The Wiener filter has been modified: for all input signals, the LSTM model is used to accurately estimate the Wiener filter tuning factor. The extracted features (EMD) were used to train the LSTM model, and the modified Wiener filter was applied during the testing phase. Therefore, the evaluation makes it clear that the proposed work is applicable even in the exhibition hall environment.
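The "X% better" figures quoted throughout these sections appear consistent with a relative-improvement convention: (proposed − baseline)/proposed for higher-is-better metrics (SNR, PESQ, CORR, ESTOI, STOI, SDR), and (baseline − proposed)/baseline for lower-is-better metrics such as RMSE. A sketch under that assumption, with a function name of our choosing:

```python
def pct_improvement(proposed, baseline, higher_is_better=True):
    """Relative improvement of the proposed score over a baseline, in percent.

    Assumes the convention inferred from the reported tables; this is an
    illustration, not the authors' stated formula.
    """
    if higher_is_better:
        return (proposed - baseline) / abs(proposed) * 100.0
    return (baseline - proposed) / abs(baseline) * 100.0
```

For example, an RMSE of 0.007 against a baseline RMSE of 0.014 gives a 50% improvement under this convention.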

Table 7 Performance evaluation of proposed model over existing for exhibition Hall Noise at varying SNR = 0 dB
Table 8 Performance evaluation of proposed model over existing for exhibition Hall Noise at varying SNR = 5 dB
Table 9 Performance evaluation of proposed model over existing for exhibition Hall Noise at varying SNR = 10 dB
Table 10 Performance evaluation of proposed model over existing for exhibition Hall Noise at varying SNR = 15 dB

5.3 Influence on restaurant noise under varying SNR

The restaurant noise Wrest(t) is added to S(t) at varying SNR rates. The formulated noisy signal R(t) = S(t) + Wrest(t) is evaluated in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The acquired results are tabulated in Tables 11, 12, 13 and 14, corresponding to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. When adding Wrest(t) at 0 dB, the SNR of the proposed work is 33.90, which is better than the existing models multi-features + DCNN based speech enhancement = 5.99, D-EMCD = 4.55, NN + auto-correlation = −0.009, spectral subtraction = −0.17, OMLSA = −22.21, TSNR = −1.05, HRNR = −0.92, and RNMF = 2.58. In addition, for Wrest(t) at 0 dB, the RMSE of the proposed work is 0.02, which is 26.4%, 20.6%, 52.3%, 53.4%, 96.3%, 96.3%, 57.9%, 57.3% and 36.5% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Moreover, when adding Wrest(t) at 5 dB, the CORR of the proposed work is 0.93016, which is better than multi-features + DCNN based speech enhancement = 0.91, D-EMCD = 0.92, NN + auto-correlation = −0.000008, spectral subtraction = −0.005, OMLSA = 0.08, TSNR = 0.02, HRNR = 0.02, and RNMF = 0.86. In addition, the PESQ of the proposed work is 0.12%, 4.7%, 77.01%, 66.4%, 45.1%, 45.1%, 22.7%, 22.49% and 8.1% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Moreover, when Wrest(t) at 10 dB is added to the clean input speech signal, the proposed model generated a higher-quality speech signal.
Here, when Wrest(t) is added at 10 dB, the RMSE of the proposed work is 33.6%, 31.9%, 78.3%, 78.8%, 98.3%, 98.3%, 80.5%, 80.37% and 64.6% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, the ESTOI of the proposed work is 0.79, which is 0.15%, 3.9%, 99.9%, 68.9%, 25.3%, 25.3%, 25.79%, 26.8% and 11.45% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Moreover, when Wrest(t) at 15 dB is added to the clean signal, the SNR of the proposed work is 6.4%, 72.1%, 99.9%, 99.4%, 46.2%, 46.28%, 97.7%, 97.86% and 88.04% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. The modification is made to the Wiener filter; the improved Wiener filter is used rather than the conventional one. Thus, the betterment of the proposed work over the other existing models has been proved.

Table 11 Performance evaluation of proposed model over existing for restaurant noise at varying SNR = 0 dB
Table 12 Performance evaluation of proposed model over existing for restaurant noise at varying SNR = 5 dB
Table 13 Performance evaluation of proposed model over existing for restaurant noise at varying SNR = 10 dB
Table 14 Performance evaluation of proposed model over existing for restaurant noise at varying SNR = 15 dB

5.4 Influence on railway station noise under varying SNR

To the clean speech signal S(t), the railway station noise Wrail(t) is added at varying SNR rates, and the outcomes acquired after de-noising are evaluated in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The acquired results are tabulated in Tables 15, 16, 17 and 18, corresponding to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. On applying Wrail(t) at SNR = 0 dB, the SNR of the proposed work is 35.03, which is the highest value, being 83.26%, 84.2%, 99.9%, 99.9%, 36.6%, 36.6%, 97.6%, 97.9%, and 92.65% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, when Wrail(t) is applied to S(t) at 5 dB, the proposed work achieved the least RMSE of 0.01, compared to multi-features + DCNN based speech enhancement = 0.02, D-EMCD = 0.02, NN + auto-correlation = 0.04, spectral subtraction = 0.04, OMLSA = 0.56, TSNR = 0.05, HRNR = 0.05, and RNMF = 0.03. In addition, the RMSE of the proposed work after the application of Wrail(t) at SNR = 10 dB is 0.0083884, which is 32.6%, 35.3%, 80.4%, 80.9%, 98.4%, 98.49%, 82.4%, 82.2% and 67.6% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, while applying Wrail(t) at SNR = 15 dB, the proposed work achieved the highest SNR value of 40.21, while the existing models recorded SNR values of multi-features + DCNN based speech enhancement = 9.49, D-EMCD = 11.6, NN + auto-correlation = −0.0004, spectral subtraction = −0.23, OMLSA = −22.19, TSNR = −0.93, HRNR = −0.87, and RNMF = 5.05. Furthermore, on analyzing the proposed work with Wrail(t) at 15 dB, the proposed work obtained the best results.
The extracted features (EMD) were used to train the LSTM model, and the modified Wiener filter was applied during the testing phase. As a result of the evaluation, it is clear that the proposed study is effective in improving the speech signal even when station noise is present.

Table 15 Performance evaluation of proposed model over existing for station noise at varying SNR = 0 dB
Table 16 Performance evaluation of proposed model over existing for station noise at varying SNR = 5 dB
Table 17 Performance evaluation of proposed model over existing for station noise at varying SNR = 10 dB
Table 18 Performance evaluation of proposed model over existing for station noise at varying SNR = 15 dB

5.5 Influence on street noise under varying SNR

The street noise Wstreet(t) is added to the clean signal S(t) at varying SNR levels, and the results obtained after de-noising are measured in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The obtained results are tabulated in Tables 19, 20, 21 and 22, which correspond to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. From the acquired outcomes, the RMSE of the proposed work is found to be lower under every variation of the applied Wstreet(t) rate. At SNR = 0 dB, 5 dB, 10 dB, and 15 dB, the proposed work achieved the least RMSE values of 0.02, 0.01, 0.009 and 0.006, respectively. Moreover, on analyzing the other outcomes, the proposed work recorded the highest SDR, PESQ, SNR, CORR, ESTOI, and STOI, which are the appropriate values for speech enhancement. In addition, on adding Wstreet(t) at 5 dB, the SNR of the proposed work is 39.26, which is better than the existing models multi-features + DCNN based speech enhancement = 8.91, D-EMCD = 8.65, NN + auto-correlation = −0.003, spectral subtraction = −0.18, OMLSA = −22.19, TSNR = −0.89, HRNR = −0.81, and RNMF = 3.49. Moreover, when Wstreet(t) at 10 dB is applied to the clean speech signal, the RMSE of the proposed work is 0.009, which is better than the existing models multi-features + DCNN based speech enhancement = 0.01, D-EMCD = 0.01, NN + auto-correlation = 0.04, spectral subtraction = 0.04, OMLSA = 0.48, TSNR = 0.05, HRNR = 0.05, and RNMF = 0.02. Moreover, for all input signals, the LSTM model is used to accurately estimate the Wiener filter tuning factor. The extracted features (EMD) were used to train the LSTM model, and the modified Wiener filter was applied during the testing phase. Therefore, from the evaluation, it is clear that the proposed work is highly significant for enhancing the speech signal.

Table 19 Performance evaluation of proposed model over existing for street noise at varying SNR = 0 dB
Table 20 Performance evaluation of proposed model over existing for street noise at varying SNR = 5 dB
Table 21 Performance evaluation of proposed model over existing for street noise at varying SNR = 10 dB
Table 22 Performance evaluation of proposed model over existing for street noise at varying SNR = 15 dB

5.6 Statistical analysis

Table 23 shows the statistical analysis of the proposed model against the existing methods. Considering the SDR measure, the STD value of the proposed model is 4.23%, 10.20%, 5.42%, 82.84%, 86.82%, 92.29%, 92.29%, and 58.46% better than the existing multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF models. In the RMSE measure, the best value of the proposed model is 40%, 40%, 85%, 85%, 98.75%, 85%, 85%, and 70% superior to the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Further, considering the SNR measure, the best value of the proposed model is 33.90, which is better than the existing models multi-features + DCNN based speech enhancement = 5.24, D-EMCD = 4.55, NN + auto-correlation = −0.01, spectral subtraction = −0.31, OMLSA = −22.21, TSNR = −1.05, HRNR = −0.92, and RNMF = 2.57. Likewise, the other measures also show better performance. Therefore, from the analysis, the proposed model is proven to be a suitable model for speech enhancement.
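The per-metric statistics reported in Table 23 (best value, mean, and STD across the SNR conditions) can be reproduced from per-SNR scores as sketched below; the score values shown are hypothetical placeholders for illustration, not the paper's data:

```python
import numpy as np

# Hypothetical per-SNR (0/5/10/15 dB) SDR scores; NOT the paper's reported data.
scores = {
    "proposed": [6.89, 7.50, 8.10, 8.60],
    "RNMF":     [5.82, 6.10, 6.40, 6.70],
}

def summarize(values):
    """Best value, mean, and population STD across SNR conditions."""
    arr = np.asarray(values, dtype=float)
    return {"best": arr.max(), "mean": arr.mean(), "std": arr.std(ddof=0)}

for name, vals in scores.items():
    print(name, summarize(vals))
```

A lower STD here indicates more consistent performance across SNR conditions, which is the sense in which the proposed model's SDR spread is compared against the baselines.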

Table 23 Statistical analysis of proposed model over existing methods

5.7 Discussions

The major goal of this study is to enhance speech signals corrupted by various noise sources. The results section evaluated the proposed model with different noise sources, namely "airport noise, exhibition noise, restaurant noise, station noise, and street noise". The various noise sources are analyzed under different SNR values in terms of speech-quality measures. By utilizing the modified Wiener filter and the LSTM model assisted by the extracted features, the de-noised speech signal is obtained. Compared to the existing models, the proposed method achieves higher SDR, PESQ, CORR, ESTOI, STOI, and SNR, as well as lower RMSE values. Moreover, the proposed model overcomes drawbacks such as reduced speech intelligibility [43], lower PESQ [40], lower robustness [37], unsuitability for complex noise environments [37], lower speech quality [10], and low SNR [45, 29]. However, the proposed method falls short for some noise sources, and it does not address spectral magnitude and spectral phase estimation.

5.8 Practical implication

The main potential applications of the proposed model are given below:

  • Hearing aids

  • Automatic speech recognition

  • Mobile communications

  • Video captioning for teleconferences

  • Voice over Internet protocol

  • Hands-free communications

This research provides better outcomes and suits many potential application fields.

6 Conclusion

In this modern world, there is a need to improve speech signals in which the target speech is disturbed by different noise sources. This research considered various noise problems for speech enhancement that resemble real-world situations, where many noise sources simultaneously diminish the quality and intelligibility of speech. In this work, a novel speech signal enhancement model was introduced with the assistance of a deep learning model. The main contribution of this research lies in the proper estimation of the tuning factor η of the Wiener filter for all input signals; the training of η was done using the LSTM model. The experimental outcomes at various input SNRs verified the supremacy of the proposed model with respect to SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. In particular, for airport noise, the PESQ of the proposed work is 2.17 at SNR = 0 dB, which is 4.9%, 10.43%, 75.06%, 74.7%, 40.78%, 40.78%, 34.95%, 36.325%, and 11.09% better than the existing multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF models. Additionally, in the RMSE measure, the best value of the proposed model is 40%, 40%, 85%, 85%, 98.75%, 85%, 85%, and 70% superior to the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Thus, the superiority of the proposed model has been proven in complex noise environments. In the future, we plan to address these issues and develop the speech enhancement model with an advanced GAN.