1 Introduction

Speech is typically distorted in real-world environments by both room resonances and background noise [10]. The goal of speech enhancement is to remove noise from a noisy speech signal while retaining the speech component and keeping speech distortion as low as possible [29]. Speech enhancement is required in a variety of applications, including mobile communication and speech recognition [46]. Speech is one of the most common ways for humans to share information [11] and remains the most natural mode of interaction in today’s technological society; it is the tool that allows us to communicate with each other [34]. Recently, the speech signal has even been used to detect the presence or absence of Covid-19 [5, 26]. The goal of single-channel speech enhancement is to improve the quality and intelligibility of speech corrupted by environmental noise, which degrades many real-world applications such as speech recognition, hearing aids, and speech telephony [32]. Speech enhancement is a common technique for improving speech quality [8]. Such disruptions diminish speech quality and intelligibility, particularly when the Signal-to-Noise Ratio (SNR) is low. Binaural speech enhancement strategies are of particular interest for assistive listening devices, such as hearing aids or headsets, where the end user expects both high speech quality and speech clarity [38]. The noise-corrupted signal is improved by using spatial or temporal modifications [20]. The speech signal is regarded as the quickest and most natural way for humans to communicate [12]. Understanding distorted speech can be complicated for both Normal Hearing (NH) and Hearing Impaired (HI) listeners, and many voice-related applications, such as Automatic Speech Recognition (ASR) and Speaker Identification (SID), perform poorly in the presence of noise [39]. Therefore, speech enhancement is essential.

There has been a great deal of study on improving speech in noisy environments [2]. Speech recognition in background noise is particularly difficult for people with hearing loss [3]. The goal of speech enhancement algorithms is to remove additive background noise from a noisy speech signal in order to improve its quality or intelligibility [9]. In general, speech enhancement refers to the processing of noisy speech signals in order to improve signal perception through better decoding by systems or humans [14]. Noises such as airport noise, train noise, and street noise frequently distort speech signals. These noises have a negative impact on the quality of the speech signal, especially in voice communication, automatic speech recognition, and speaker identification [15]. Feature selection is also an important step in improving systems that recognise emotions in speech [2]. Background noise is the primary source of speech degradation, particularly in hands-free scenarios [22]. The field of speech enhancement (SE) is concerned with the enhancement of speech signals that have been degraded by noise [16]. Speech enhancement in non-stationary noise environments is a difficult area of study [13], and it aims to improve the clarity and intelligibility of noisy speech [28]. The major intention behind speech enhancement is to suppress the noise and boost the SNR of noisy speech signals in challenging environments. Well-known techniques such as spectral subtraction, Minimum Mean Square Error (MMSE), Log-MMSE, OM-LSA, and Wiener filtering are commonly preferred for speech enhancement [19]. Speech enhancement is thus the problem of estimating clean-speech signals from noisy single-channel or multi-channel audio recordings [27].
Speech enhancement techniques have been studied for several decades with a variety of promising applications, such as telecommunications and hearing aid systems, to mitigate the harmful effects of background noise and interference [33]. Background noise acoustically added to speech can degrade the performance of digital voice processors used for applications such as speech compression, recognition, transcription, and authentication [7]. The key aim of these speech enhancement methods is to improve the speech SNR. Techniques have been introduced for boosting speech quality and compacting speech bandwidth by suppressing the additive background noise [6]. Deep Neural Networks (DNNs) are widely deployed in speech enhancement [41]. Such methods generally produce a time-frequency mask that is employed to estimate the clean speech spectrum [21]. Optimal mask generation was also introduced in a traditional method [44], but this masking strategy generally leaves residual and musical noise in the enhanced speech. Kalman Filtering (KF)-based speech enhancement is introduced in [17], where the Linear Prediction Coefficients (LPCs) are calculated using a DNN; however, under non-stationary noise settings the noise covariance is computed during speech gaps, which is ineffective. In addition, a deep audio-visual speech enhancement has been suggested [35], but this approach might break down at low SNR values.

Recently, Generative Adversarial Network (GAN)-based speech enhancement has been utilized to overcome these traditional difficulties. In particular, Speech Enhancement GAN (SEGAN) [30], conditional GAN (cGAN) [23], Wasserstein GAN (WGAN) [18], and Relativistic Standard GAN (RSGAN) [1] techniques were introduced. Despite the success of GAN-based speech enhancement techniques, two major difficulties remain: training instability and a lack of consideration for varied speech characteristics [42]. Researchers have therefore been making significant contributions to this field for decades; however, the accuracy and intelligibility of the outcomes were not always adequate.

Thus, to overcome the existing issues, an LSTM with trained speech features and an adaptive Wiener filter is introduced in this work. The major contribution of this research is listed below:

  • A modified Wiener filter is introduced for decomposing the speech spectral signal.

  • In addition, an LSTM model is introduced to properly estimate the tuning factor of the Wiener filter for each input signal.

  • In the training phase, the LSTM model is trained with the features extracted (via EMD) from the output of the modified Wiener filter.

The rest of this paper is organized as follows: Section 2 addresses the literature on speech enhancement. Section 3 gives an architectural description of the proposed speech enhancement model, and Section 4 describes its processing steps. The results acquired with the proposed work are discussed in Section 5, and Section 6 concludes the paper.

1.1 Problem statement

Most studies have shown that reducing signal noise without distorting speech is a difficult challenge, which is one of the main reasons why perfect enhancement systems are not available. In this research, we focus on issues such as low robustness and unsuitability for complex noise conditions [37], residual noise and low SNR [45], reduced speech intelligibility [43], weak denoising effect and low PESQ [40], and limited consideration of speech quality measures [10].

Compared to the existing models, the proposed work introduces a Wiener filter-assisted deep learning LSTM model. The LSTM model estimates the tuning factor of the Wiener filter with the aid of extracted features to obtain the de-noised speech signal. For simulation, the proposed model considers speech quality measures such as SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The proposed model attains higher SNR, PESQ, and robustness, and is well suited to complex noisy environments.

2 Literature review

In 2017, Zou et al. [46] introduced two speech enhancement frameworks with super-Gaussian speech modeling. Under the assumption that the Discrete Cosine Transform (DCT) coefficients of clean speech follow a Laplacian or Gamma distribution while the DCT coefficients of the noise are Gaussian distributed, the clean speech components were calculated using the MMSE estimator. Then, under the condition of speech presence uncertainty, MMSE estimators were derived. Correct estimators of the speech statistical parameters were also recommended, and a modern decision-directed approach was used to approximate the speech Laplacian component. According to the simulation data, the suggested algorithm generates very little residual distortion and has higher speech quality than Gaussian-based speech enhancement algorithms.

In 2020, Zhang et al. [37] developed an LSTM-Convolutional-BLSTM Encoder-Decoder (LCLED) for enhancing the speech signal. Transpose convolution and skip connections were both included in the LCLED. Besides that, the a priori SNR was used as the learning objective of the LCLED to achieve a higher level of enhanced speech, and post-processing was done using the MMSE method. The findings indicate that the suggested LCLED increases the accuracy and intelligibility of enhanced speech. Furthermore, the running time of the LCLED model was 130 sec.

In 2021, Khattak et al. [29] proposed a “phase compensated perceptually weighted -order Bayesian estimator” that modifies both the magnitude and phase spectra to improve noisy speech. First, the phase of the noisy speech spectra alone was compensated. Second, the magnitude spectra were manipulated using a perceptually motivated -order Bayesian estimator; to obtain a stronger gain function, this estimator combines the benefits of the perceptually-weighted and -order spectral amplitude estimators. The compensated phase spectra and estimated magnitude spectra were then merged to reconstruct the noise-attenuated speech signals. Using the NOIZEUS and AURORA corpora, the proposed speech enhancement strategy was tested over various noise ranges (0 dB to +10 dB) in terms of quantitative quality and intelligibility tests. In both non-stationary and stationary noisy settings, the proposed approach significantly enhanced performance while ensuring intelligibility.

In 2020, Tan et al. [45] introduced a Fully Convolutional Neural Network (FCNN) to achieve end-to-end speech enhancement. The encoder and decoder, together with an extra Convolutional-Based Short-Time Fourier Transform (CSTFT) layer and an inverse CSTFT (CISTFT) layer, were applied to simulate the forward and inverse STFT operations, respectively. Since the fundamental phonetic information of speech is presented more clearly by Time-Frequency (T-F) representations, these layers seek to incorporate frequency-domain knowledge into the proposed model. In addition, a Temporal Convolutional Module (TCM), which is effective for processing the long-term correlations of speech signals, was integrated between the encoder and decoder. According to the experimental findings, the suggested model consistently outperforms other competitive speech enhancement models.

In 2020, Zhu et al. [43] used a “Deep Neural Network (DNN)-augmented colored-noise Kalman filter” to develop a novel speech enhancement system. The authors modelled both the noise and the clean speech signal as Autoregressive (AR) processes. A multi-objective DNN was trained to map the noisy acoustic features to the Line Spectrum Frequencies (LSFs), from which the LPCs are obtained. Denoising was applied to the noisy speech using the colored-noise Kalman filter with DNN-estimated parameters. Finally, residual noise in the Kalman-filtered speech was removed using a post-subtraction procedure. The proposed work achieved the best estimation accuracy for street noise and produced better outcomes on unseen noise.

In 2021, Wei et al. [40] proposed the Constant Q Transform (CQT) to enhance the resolution of lower-frequency speech components, with the NMF/Sparse NMF (SNMF) algorithm used at the back end. The experimental results demonstrate that, at low SNR, the proposed approach outperforms the Short-Time Fourier Transform (STFT) baseline in terms of PESQ and STOI.

In 2020, Zhou et al. [32] suggested a modified bark spectral distortion loss mechanism, which can be thought of as an auditory perception-based MSE, to replace the traditional MSE in DNN-based speech enhancement approaches and further increase objective perceptual quality. Compared to DNN-based methods using the traditional MSE criterion, experiments demonstrated that the proposed method can boost speech enhancement performance, particularly in terms of objective perceptual quality, in all experimental settings.

In 2021, Chen et al. [10] proposed a multi-objective multi-channel speech enhancement approach. To deal with noise and reverberation, the work used a Bidirectional Long Short-Term Memory (BiLSTM) network. The Log-Power Spectra (LPS) of the noisy speech from each channel of the microphone array were given as input to the BiLSTM network to predict the LPS and Ideal Ratio Mask (IRM) of clean speech. The intermediate LPS and IRM features obtained from both channels were then fused into a single LPS using a fusion layer. Moreover, the relation between the fused single-channel features and the clean speech LPS was learned via a DNN. Experimental findings showed the suggested method’s viability and adaptability.

The advantages, as well as the challenges of the existing literature works discussed in the literature section, are manifested in Table 1.

Table 1 Review on Speech Enhancement models

3 Proposed speech enhancement model: An architectural description

The proposed speech enhancement model’s design is shown in Fig. 1, with the overall mechanism divided into “two main phases (i) Training Phase (ii) Testing Phase”.

Fig. 1
figure 1

Schematic overview of the proposed work

The proposed model is constructed in three major phases: (a) noise spectrum and signal spectrum estimation, (b) feature extraction, and (c) speech enhancement. In the training phase, the noise signal W(t) (“airport noise, exhibition noise, restaurant noise, station noise, and street noise”) is added to the clean speech signal S(t). The formulated noisy speech signal is shown in Eq. (1)

$$ R(t)=S(t)+W(t) $$
(1)

Then, for this R(t), the NMF-based spectrum is estimated to find the noise spectrum SpeN(n) and signal spectrum SpeS(n), respectively. The obtained spectra (noise and signal) are given as input to the statistical Wiener filter, from which the filtered signal F(n) is generated. Since the tuning factor η plays a key role in the Wiener filter, it has to be determined for each signal and is learned by the LSTM algorithm. The filtered signals F(n) are subjected to EMD, from which the denoised signal is obtained. Then, from the denoised signal acquired via EMD, the bark frequency b(n) is evaluated, and from the computed bark frequency, the fractional delta AMS-based features are extracted. Subsequently, with these extracted features fFD − AMS, the LSTM algorithm (a deep learning model) is trained. The LSTM provides the suited tuning factor ηtuned for each input signal in the modified Wiener filter.

This ηtuned is fed as input to the modified Wiener filter, whose output spectral signal is decomposed by EMD; the output of EMD is the denoised signal.

4 Processing steps of proposed speech enhancement model

This is the initial step, where the noise spectrum SpeN(n) and signal spectrum SpeS(n) are extracted from the noisy signal R(t). The NMF model has higher physical significance and is easier to implement than traditional matrix decomposition algorithms, which is the reason for utilizing NMF in this research work. In voice applications, a priori information can be obtained by applying NMF to training data instead of the clean signal.

To improve the speech signal, the noisy signal and the speech signal in the time-frequency (γ, p) domain are computed using the STFT, as defined in Eq. (2), where the clean speech STFT S(p, γ), the distorted speech STFT R(p, γ), and the noise signal STFT W(p, γ) are given for the pth frequency bin of the γth frame. Eq. (3) shows the magnitude-spectrum approximation of the noisy speech, which is the most often used assumption in NMF-based speech and audio signal processing.

$$ R\left(p,\gamma \right)=S\left(p,\gamma \right)+W\left(p,\gamma \right) $$
(2)
$$ \mid R\left(p,\gamma \right)\mid =\mid S\left(p,\gamma \right)+W\left(p,\gamma \right)\mid $$
(3)

Eq. (4) shows the magnitude spectrum matrix of a signal, where j_{p,γ} is the magnitude spectral value of the pth bin in the γth frame. The frequency bin count is denoted by H, and the number of time frames by I.

$$ J=\left[{j}_{p,\gamma}\right]\in {N}_{+}^{H\times I} $$
(4)

Eq. (5) is applied separately in the training stage to the training data \( {J}_S\in {N}_{+}^{H\times {I}_S} \) and \( {J}_W\in {N}_{+}^{H\times {I}_W} \), yielding the basis matrices of clean speech \( {F}_S=\left[{r}_{Hl}^S\right]\in {N}_{+}^{H\times {L}_S} \) and noise \( {F}_W=\left[{r}_{Hl}^W\right]\in {N}_{+}^{H\times {L}_W} \), respectively, where L denotes the total number of basis vectors. In Eq. (5), T′ denotes the transpose and ζ is an H × I matrix whose entries are all equal to one. In the enhancement stage, the basis matrix is fixed as \( F=\left[{F}_S{F}_W\right]\in {N}_{+}^{H\times \left({L}_S+{L}_W\right)} \). The activation matrix \( {E}_{\hat{T}}={\left[{E}_S^{T\prime }{E}_W^{T\prime}\right]}^{T\prime}\in {N}_{+}^{\left({L}_S+{L}_W\right)\times {I}_{\hat{T}}} \) corresponding to the noisy speech is estimated from \( {J}_{\hat{T}}\in {N}_{+}^{H\times {I}_{\hat{T}}} \) by employing the NMF activation update only. Once the activation matrix is obtained, the clean speech spectrum is evaluated with the help of the Wiener Filter (WF), as per Eq. (6). In Eq. (6), the estimated Power Spectral Density (PSD) matrices corresponding to clean speech are denoted by \( V{\prime}_S=\left[V{\prime}_S\left(p,\gamma \right)\right] \), while those corresponding to the noise are denoted by \( V{\prime}_W=\left[V{\prime}_W\left(p,\gamma \right)\right]\in {N}_{+}^{H\times {I}_{\hat{T}}} \). These PSDs are obtained by temporal smoothing, as seen in Eqs. (7) and (8), where ρS and ρW are the temporal smoothing factors of speech and noise, respectively.

$$ {\displaystyle \begin{array}{c}F\leftarrow F\otimes \frac{\left(J/F.E\right)E}{\zeta E},\\ {}E\leftarrow E\otimes \frac{F\left(J/F.E\right)}{F^{T^{\prime }}\zeta}\end{array}} $$
(5)
$$ Q=\frac{V{\prime}_S}{V{\prime}_S+V{\prime}_W}\otimes \hat{T} $$
(6)
$$ V{\prime}_S\left(p,\gamma \right)={\rho}_SV{\prime}_S\left(p,\gamma -1\right)+\left(1-{\rho}_S\right){\left({\left[{F}_S{E}_S\right]}_{p\gamma}\right)}^2 $$
(7)
$$ V{\prime}_W\left(p,\gamma \right)={\rho}_WV{\prime}_W\left(p,\gamma -1\right)+\left(1-{\rho}_W\right){\left({\left[{F}_W{E}_W\right]}_{p\gamma}\right)}^2 $$
(8)

The signal spectrum SpecS and the noise spectrum SpecN are obtained as the outcomes, and both are then filtered using the Wiener filtering method.
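As a sketch of this stage, the multiplicative updates of Eq. (5) for a magnitude spectrogram J ≈ F·E can be written as below. The function name, iteration count, random initialization, and small ε guards are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def nmf_kl(J, L, n_iter=200, seed=0):
    """Factor a nonnegative magnitude spectrogram J (H x I) as J ~ F @ E
    using the multiplicative updates of Eq. (5), with zeta the all-ones matrix."""
    rng = np.random.default_rng(seed)
    H, I = J.shape
    F = rng.random((H, L)) + 1e-6   # basis matrix
    E = rng.random((L, I)) + 1e-6   # activation matrix
    ones = np.ones_like(J)          # the all-ones H x I matrix "zeta"
    for _ in range(n_iter):
        F *= ((J / (F @ E + 1e-12)) @ E.T) / (ones @ E.T + 1e-12)
        E *= (F.T @ (J / (F @ E + 1e-12))) / (F.T @ ones + 1e-12)
    return F, E
```

In the training stage this factorization would be run separately on clean-speech and noise spectrograms to obtain F_S and F_W; in the enhancement stage the basis is fixed to [F_S F_W] and only the activation update for E is iterated.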

4.1 Wiener filter

“The Wiener filter’s purpose is to compute a statistical estimate of an unknown signal by taking a similar signal as an input and filtering it to create the estimate as an output”. The Wiener filter is being used on a wide scale in signal amplification techniques [36]. The Wiener filter is premised on the idea of estimating the clean signal from the distorted noise signal. The major goal of the Wiener filter is to diminish the noise from the corrupted signal. Thus, the approximation is done by reducing the MSE between the target signal and the noise distorted signal.

The estimated noise spectrum SpecN and signal spectrum SpecS are fed as input to statistical Wiener filtering, in which both are filtered. The solution to this frequency-domain optimization problem is given by the filter transfer function shown in Eq. (9); to arrive at this equation, the signal spectrum SpecS and the noise spectrum SpecN are treated as uncorrelated and stationary signals. SpecS has a power spectral density of pdfS(ω), while SpecN has a power spectral density of pdfW(ω). Eq. (10) shows the statistical formula for the SNR, and Eq. (11) shows how the SNR can be used in the filter transfer function. Here, GW(ω) denotes the estimated noise spectrum.

$$ F\left(\omega \right)=\frac{pdf_S\left(\omega \right)}{pdf_S\left(\omega \right)+{pdf}_W\left(\omega \right)} $$
(9)
$$ SNR=\frac{pdf_S\left(\omega \right)}{G_W\left(\omega \right)} $$
(10)
$$ F\left(\omega \right)={\left[1+\frac{1}{SNR}\right]}^{-1} $$
(11)

At the end of filtration, the filtered signal F(n) is generated. Then, from these filtered signals, the features like the EMD, bark frequency, and delta AMS are extracted.
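The gain of Eqs. (9)–(11) can be sketched directly from the two PSD estimates; the helper name and the small ε guard are assumptions added for illustration and numerical safety:

```python
import numpy as np

def wiener_gain(psd_speech, psd_noise, eps=1e-12):
    """Wiener transfer function F(w) = P_S / (P_S + P_W), cf. Eq. (9).
    Equivalent to (1 + 1/SNR)^(-1) with SNR = P_S / P_W, cf. Eqs. (10)-(11)."""
    return psd_speech / (psd_speech + psd_noise + eps)
```

The two forms agree term by term: dividing numerator and denominator of Eq. (9) by the speech PSD gives exactly the (1 + 1/SNR)⁻¹ expression of Eq. (11).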

4.2 Empirical mode curve decomposition

The EMD features are extracted from F(n). Huang proposed EMD as an adaptive strategy in which a finite number of Intrinsic Mode Functions (IMFs) represent complex data. The IMFs ye(n) and residue q(n) are decomposed from the data F(n). Eq. (12) describes the formula corresponding to this decomposition.

$$ F(n)=\sum \limits_e{y}_e(n)+q(n) $$
(12)

The steps are given below:

  • Step 1: Initialization: set d ≔ 1 and the initial residue q0(n) ≔ F(n).

  • Step 2: The dth IMF is extracted using the steps below.

    (a) Let k0(n) ≔ qd − 1(n) and m ≔ 1.

    (b) All local maxima and minima of km − 1(n) are identified.

    (c) Using cubic spline interpolation, the upper envelope UBm − 1(n) of km − 1(n) is defined by the maxima, and the lower envelope LBm − 1(n) by the minima.

    (d) The mean zm − 1(n) of both envelopes of km − 1(n) is calculated as \( {z}_{m-1}(n)=\frac{1}{2}\left({UB}_{m-1}(n)+{LB}_{m-1}(n)\right) \). This moving mean is called the low-frequency local trend, while the high-frequency local detail is assessed through the sifting process.

    (e) The mth sifting component is formed as km(n) ≔ km − 1(n) − zm − 1(n).

      • If km(n) does not meet all of the IMF conditions, the sifting process is continued from step (b) with m ≔ m + 1.

      • If km(n) satisfies all of the IMF conditions, then set yd(n) ≔ km(n) and qd(n) ≔ qd − 1(n) − yd(n).

  • Step 3: If qd(n) is a residuum, the sifting process is stopped; otherwise, the process resumes from Step 2 with d ≔ d + 1.

Furthermore, the EMD algorithm directly satisfies completeness of the decomposition, since \( F(n)=\sum \limits_{d=1}^v{y}_d(n)+q(n) \) is an identity. Since neighbouring IMFs can occupy equivalent frequencies at different time points, the locally orthogonal IMFs provided by the EMD algorithm do not guarantee global orthogonality. From the denoised signal, the bark frequency b(u) is then obtained.
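A minimal sketch of the sifting procedure above, using NumPy and SciPy cubic splines. The extrema detection, the simple stopping threshold, and the fixed IMF count are simplifying assumptions, not the paper's exact IMF criteria:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def _envelope(x, idx, n):
    # cubic-spline envelope through the extrema, anchored at the endpoints
    pts = np.concatenate(([0], idx, [n - 1]))
    return CubicSpline(pts, x[pts])(np.arange(n))

def sift_imf(x, max_sift=50):
    """Extract one IMF from x following steps (b)-(e) of Section 4.2."""
    k = x.copy()
    n = len(k)
    for _ in range(max_sift):
        d = np.diff(k)
        maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
        minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
        if len(maxima) < 2 or len(minima) < 2:
            break
        ub = _envelope(k, maxima, n)
        lb = _envelope(k, minima, n)
        z = 0.5 * (ub + lb)          # mean of upper and lower envelopes
        if np.mean(np.abs(z)) < 1e-3 * np.mean(np.abs(k)):
            break                    # crude IMF stopping criterion (assumption)
        k = k - z
    return k

def emd(x, n_imfs=3):
    """Decompose x into IMFs plus a residue: x = sum(imfs) + residue (Eq. (12))."""
    residue = x.copy()
    imfs = []
    for _ in range(n_imfs):
        imf = sift_imf(residue)
        imfs.append(imf)
        residue = residue - imf
    return imfs, residue
```

Because each residue is formed by subtraction, the identity x = Σ IMFs + residue holds exactly by construction, mirroring the completeness property noted above.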

4.3 Fractional DeltaAMS feature

The AMS features are the spectral amplitudes of b(n). The delta features capture the fine variations across the frequency and time domains; let f(tim, freq) be the AMS feature vector of b(n). The augmented feature vector is defined in Eq. (13) – Eq. (15).

$$ f\left( tim,\boldsymbol{freq}\right)=\left[f\left( tim,\boldsymbol{freq}\right),\Delta {f}_{Tim}\left( tim,\boldsymbol{freq}\right),\Delta {f}_{freq}\left( tim,\boldsymbol{freq}\right)\right] $$
(13)
$$ {\displaystyle \begin{array}{c}\Delta {f}_{Tim}\left( tim,\boldsymbol{freq}\right)=f\left( tim,\boldsymbol{freq}\right)-f\left( tim-1,\boldsymbol{freq}\right);\\ {}\mathrm{where}\; tim=2,\dots, Tim\end{array}} $$
(14)
$$ \Delta {f}_{Tim}\left( tim=1,\boldsymbol{freq}\right)=f\left( tim=2,\boldsymbol{freq}\right)-f\left( tim=1,\boldsymbol{freq}\right) $$
(15)

The delta feature vector computed across frequency and time is denoted by ΔfTim(tim, freq). Fractional calculus is used to retain the most important information in the delta-AMS features; incorporating it improves the convergence speed while reducing the computing load. As a result, Eq. (15) can be rewritten as

$$ \Delta f\left( tim,\boldsymbol{freq}\right)=f\left( tim,\boldsymbol{freq}\right)-f\left( tim-1,\boldsymbol{freq}\right)\cong {E}^{\sigma}\left[\Delta f\left( tim,\boldsymbol{freq}\right)\right] $$
(16)

Here, Eσ[Δf(tim, freq)] denotes the fractional calculus term with fractional order σ. The incorporation of Eσ[Δf(tim, freq)] into the delta-AMS features plays a key role in enhancing the speech signal. The formulated FD-AMS features are given by Eq. (17) and Eq. (18), respectively, and the extracted fractional delta-AMS features are denoted as fFD − AMS.

$$ \Delta {f}_{Tim}\left( tim,\boldsymbol{freq}\right)={E}^{\sigma}\left[\Delta f\left( tim,\boldsymbol{freq}\right)\right] $$
(17)
$$ {\displaystyle \begin{array}{c}\Delta {f}_{Tim}\left( tim,\boldsymbol{freq}\right)=\Delta f\left( tim,\boldsymbol{freq}\right)-\frac{1}{2}\Delta f\left( tim-1,\boldsymbol{freq}\right)-\\ {}\frac{1}{6}\left(1-\sigma \right){E}^{\sigma}\left[\Delta f\left( tim-2,\boldsymbol{freq}\right)\right]-\\ {}\frac{1}{24}\sigma \left(1-\sigma \right)\left(2-\sigma \right)\left[\Delta f\left( tim-3,\boldsymbol{freq}\right)\right]\end{array}} $$
(18)

The LSTM network is trained with the extracted features fFD − AMS.
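The time-delta of Eqs. (14)–(15) can be sketched as below; `delta_time` is a hypothetical helper name, and the fractional weighting of Eq. (18) is deliberately omitted here:

```python
import numpy as np

def delta_time(f):
    """First-order delta of AMS features across time, cf. Eqs. (14)-(15).
    f has shape (Tim, Freq); row 0 uses the forward difference of Eq. (15)."""
    d = np.empty_like(f)
    d[1:] = f[1:] - f[:-1]   # Eq. (14): f(tim) - f(tim-1) for tim = 2..Tim
    d[0] = f[1] - f[0]       # Eq. (15): boundary case tim = 1
    return d
```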

4.4 LSTM network

For speech enhancement, the extracted features fFD − AMS are fed to the LSTM. The LSTM setup uses a chain of repeating LSTM cells, each made up of three multiplicative units that represent the “forget gate, input gate, and output gate” [31]. These units enable LSTM memory cells to store and transfer data over longer periods of time. Let the variables M and C denote the hidden and cell states, respectively. The standard LSTM cell performs the following operations while generating the output ηtuned:

$$ {I}_t=\sigma \left({J}_I{X}_t+{K}_I{M}_{t-1}+{B}_I\right) $$
(19)
$$ {F}_t=\sigma \left({J}_F{X}_t+{K}_F{M}_{t-1}+{B}_F\right) $$
(20)
$$ {O}_t=\sigma \left({J}_O{X}_t+{K}_O{M}_{t-1}+{B}_O\right) $$
(21)
$$ {C}_t={F}_t{C}_{t-1}+{I}_t{G}_t $$
(22)
$$ {G}_t=\mathit{\tanh}\left({J}_G{X}_t+{K}_G{M}_{t-1}+{B}_G\right) $$
(23)
$$ {M}_t={O}_t\mathit{\tanh}\left({C}_t\right) $$
(24)

Here, It, Ft, and Ot are the input, forget, and output gates at time t. The weights that map the input to the input, forget, and output gates are JI, JF, and JO, while the recurrent weight matrices that map the previous hidden state to the gates are KI, KF, and KO. BI, BF, BO, and BG are the bias vectors, and the sigmoid function σ is used as the gate activation function. Furthermore, the candidate cell input and the layer output are denoted by Gt and Mt, respectively. The architecture of the LSTM is shown in Fig. 2.
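The cell equations (19)–(24) can be sketched for a single time step as follows; the parameter-dictionary layout and the dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X, M_prev, C_prev, p):
    """One LSTM cell update following Eqs. (19)-(24).
    p holds input weights J_*, recurrent weights K_* and biases B_*
    for the gates I, F, O and the candidate input G."""
    I = sigmoid(p["JI"] @ X + p["KI"] @ M_prev + p["BI"])   # input gate, Eq. (19)
    F = sigmoid(p["JF"] @ X + p["KF"] @ M_prev + p["BF"])   # forget gate, Eq. (20)
    O = sigmoid(p["JO"] @ X + p["KO"] @ M_prev + p["BO"])   # output gate, Eq. (21)
    G = np.tanh(p["JG"] @ X + p["KG"] @ M_prev + p["BG"])   # candidate, Eq. (23)
    C = F * C_prev + I * G                                   # cell state, Eq. (22)
    M = O * np.tanh(C)                                       # hidden state, Eq. (24)
    return M, C
```

Since the output gate lies in (0, 1) and tanh(C) in (−1, 1), the hidden state M is always bounded in (−1, 1).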

Fig. 2
figure 2

The architecture of LSTM

4.5 Modified wiener filtering

The importance of the tuning factor ηtuned has been well established in this research work. Based on b(u) (the bark frequency) of the NMF-based filtered EMD signal, the tuning factor of the Wiener filter is estimated and fine-tuned by the LSTM. Mathematically, b(u) can be expressed by Eq. (25).

$$ b(u)=13\mathit{\arctan}(0.76u)+3.5\mathit{\arctan}\left[{(0.33u)}^2\right] $$
(25)
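Eq. (25) can be evaluated directly; here u is assumed to be the frequency in kHz. For context, the classical Zwicker bark mapping is 13 arctan(0.00076 f) + 3.5 arctan((f/7500)²) with f in Hz, of which Eq. (25) appears to be a rescaled variant:

```python
import numpy as np

def bark(u):
    """Bark frequency b(u) as printed in Eq. (25); u assumed to be in kHz."""
    return 13.0 * np.arctan(0.76 * u) + 3.5 * np.arctan((0.33 * u) ** 2)
```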

For tuning η in a more precise manner, we introduce a new modified Wiener filter model, which overcomes the drawbacks of the existing Wiener filter: it cannot estimate the power spectra efficiently, it struggles to achieve perfect restoration given the random nature of the noise, and it is comparatively slow to apply since it requires working in the frequency domain. The newly developed modified Wiener filter overcomes these drawbacks and is formulated as per Eq. (26).

$$ H\left(\omega \right)=\frac{R\left(\omega \right)}{R\left(\omega \right)+ En/\left({E}_y- En\right).\alpha .R\left(\omega \right)} $$
(26)

Here, En denotes the energy of the noise-free speech, Ey denotes the energy of the noisy speech, and R(ω) is the noisy speech spectrum. In addition, α represents the noise suppression factor.
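Eq. (26) can be sketched as below; the helper name and ε guards are illustrative. Note that, as printed, the R(ω) terms cancel, so the gain reduces to the frequency-independent value 1/(1 + α·En/(Ey − En)) controlled by the suppression factor α:

```python
import numpy as np

def modified_wiener_gain(R, E_n, E_y, alpha, eps=1e-12):
    """Modified Wiener transfer function of Eq. (26):
    H(w) = R(w) / (R(w) + (E_n / (E_y - E_n)) * alpha * R(w)).
    R is the noisy spectrum; E_n, E_y, alpha follow Section 4.5."""
    factor = E_n / (E_y - E_n + eps)
    return R / (R + factor * alpha * R + eps)
```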

The properly estimated tuning factor ηtuned acquired from the LSTM is fed as input to the Wiener filter, instead of a constant η. The outcome of the modified Wiener filter is the filtered signal \( \overline{F_u(t)} \). Again, \( \overline{F_u(t)} \) is decomposed using EMD, and the result is the enhanced denoised signal \( \overline{\overline{S(t)}} \).

In the training process, the training library is constructed by giving the known b(u) and tuning factor ηtuned as inputs. The testing process is said to be the online process, while the training process is an offline process: the appropriate tuning factor for diverse noises is identified offline, and the LSTM is trained with this information. In the online mechanism, where the real enhancement takes place, the tuning factor is supplied by the trained network.

5 Results and discussion

MATLAB was used to implement the proposed speech enhancement model. The data for the current study were obtained from [4]. The five noise categories in this database, namely “airport noise, exhibition noise, restaurant noise, station noise, and street noise,” are added to speech signals at differing SNR levels (0 dB, 5 dB, 10 dB, and 15 dB) to measure the efficacy of the suggested work for speech enhancement. Memory bandwidth is the amount of memory that can be used to process files in a second. The total memory of the system utilized is 12 GB, and the memory bandwidth of the proposed method is 1.6 GB. The time step for each sequence is given in Table 2; here, the time step represents the number of samples.
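The construction of a noisy mixture at a prescribed SNR level can be sketched as follows; the energy-matching scaling rule is a standard one and an assumption about how the mixtures were formed:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB,
    as when forming R(t) = S(t) + W(t) at 0/5/10/15 dB."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```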

Table 2 Time step for sequence

Figure 3 depicts the spectra of the clean signal, noised signal (mixture of clean and noise signals), and de-noised signal for the airport, exhibition hall, restaurant, train station, and street cases, respectively. Signal-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), Signal-to-Noise Ratio (SNR), Root-Mean-Square Error (RMSE), Correlation (CORR), Extended STOI (ESTOI), Short-Time Objective Intelligibility (STOI), and Cumulative Squared Euclidean Distance (CSED) are all used to analyze the performance of the proposed work. The comparative evaluation is made between the proposed model and existing models such as multi-features + DCNN-based speech enhancement [15], Diminished Empirical Mean Curve Decomposition (D-EMCD) [14], Neural Network (NN) + autocorrelation, spectral subtraction [7], Optimal Modified Minimum Mean Square Error Log-Spectral Amplitude (OMLSA) [6], Two-Step Noise Reduction (TSNR) [24], Harmonic Regeneration Noise Reduction (HRNR) [25], and Regularized Nonnegative Matrix Factorization (RNMF) [9].
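Of the listed measures, the waveform-level ones can be sketched directly as below (PESQ, STOI, and ESTOI require their own reference implementations and are not reproduced here; the function names are illustrative):

```python
import numpy as np

def rmse(clean, est):
    """Root-mean-square error between clean and estimated signals."""
    return np.sqrt(np.mean((clean - est) ** 2))

def snr_db(clean, est, eps=1e-12):
    """Output SNR in dB, treating (clean - est) as the residual noise."""
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum((clean - est) ** 2) + eps))

def corr(clean, est):
    """Pearson correlation coefficient between the two waveforms."""
    return np.corrcoef(clean, est)[0, 1]
```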

Fig. 3
figure 3

The spectrum of the clean signal, noised signal (mixture of clean and noise signal), and de-noised signal for (a) airport, (b) exhibition hall, (c) restaurant, (d) train-station, and (e) street noise

5.1 Influence on airport noise under varying SNR

  • In order to validate the proposed work even under varying noise conditions, the airport noise signal Wair(t) is added to the clean speech signal S(t). The formulated noisy speech signal Rair(t) = S(t) + Wair(t) is validated at varying SNR levels; Wair(t) is added to S(t) at 0 dB, 5 dB, 10 dB, and 15 dB. The formulated airport noisy signal Rair(t) is then evaluated against the existing models, namely multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The obtained results are tabulated in Tables 3, 4, 5 and 6, which correspond to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. The results show that the proposed work delivered the best performance, with higher SDR, PESQ, CORR, ESTOI, STOI, and SNR, as well as a lower RMSE. Initially, when Wair(t) is added at 0 dB, the proposed work achieved the highest SDR value of 6.89, which is the best score compared to multi-features + DCNN based speech enhancement = 5.98, D-EMCD = 4.83, NN + auto-correlation = −43.19, spectral subtraction = −8.80, OMLSA = −23.11, TSNR = −7.41, HRNR = −7.42, and RNMF = 5.82. In addition, the PESQ of the proposed work is 2.17 at SNR = 0 dB, which is 4.9%, 10.43%, 75.06%, 74.7%, 40.78%, 40.78%, 34.95%, 36.325% and 11.09% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. When Rair(t) is added at SNR = 5 dB, the PESQ of the proposed work is 2.57, which is the maximal value when compared to multi-features + DCNN based speech enhancement = 2.41, D-EMCD = 2.36, NN + auto-correlation = 0.53, spectral subtraction = 0.82, OMLSA = 1.32, TSNR = 1.86, HRNR = 1.85, and RNMF = 2.31.
In addition, the proposed work achieved the maximal ESTOI of 0.71, which is 13.83%, 8.6%, 99.9%, 71.9%, 31.4%, 31.4%, 34.8%, 37.65%, and 13.5% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively, at SNR = 5 dB. When Wair(t) is added at 10 dB, the proposed work attained favorable outcomes, as shown in Table 5. The CORR of the proposed work at SNR = 10 dB is 0.97, which is 0.18%, 0.14%, 99.9%, 92%, 91.85%, 91.85%, 97.6%, 97.7% and 8.4% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Moreover, observing the outcomes in Table 5, the proposed work achieved the least RMSE of 0.007, which is 50.2%, 34.1%, 82.8%, 83.2%, 98.6%, 98.6%, 84.5%, 84.4%, and 69.2% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. The modification is made to the Wiener filter: the LSTM model is used to accurately estimate the tuning factor of the Wiener filter for all input signals. The extracted features (EMD) were used to train the LSTM model, and the modified Wiener filter was applied during the testing phase. Thus, the proposed work enhanced the quality of the speech signal even under the airport environment.
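The paper's modified Wiener filter relies on an LSTM-estimated tuning factor η. The exact gain function is defined in the method section; purely as an illustration, a common parametric Wiener gain with a tuning factor takes the form G(k) = ξ(k)/(ξ(k) + η), where ξ(k) is the a-priori SNR of frequency bin k. The sketch below assumes that form together with a crude spectral-subtraction estimate of ξ; both are our assumptions, not the authors' exact formulation:

```python
import numpy as np

def parametric_wiener_gain(noisy_psd, noise_psd, eta=1.0):
    """Illustrative parametric Wiener gain G = xi / (xi + eta).

    xi (a-priori SNR) is estimated here by simple spectral subtraction;
    in the paper, the LSTM supplies eta, which is a fixed scalar in this sketch.
    """
    xi = np.maximum(noisy_psd - noise_psd, 1e-10) / noise_psd
    return xi / (xi + eta)

def enhance_frame(noisy_spectrum, noise_psd, eta):
    """Apply the gain to one STFT frame, keeping the noisy phase."""
    gain = parametric_wiener_gain(np.abs(noisy_spectrum) ** 2, noise_psd, eta)
    return gain * noisy_spectrum
```

Under this form, a larger η suppresses more noise at the cost of more speech attenuation, which is why a per-signal estimate of η (here provided by the LSTM) is preferable to a fixed constant.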

Table 3 Performance evaluation of proposed model over existing for Airport Noise at varying SNR = 0 dB
Table 4 Performance evaluation of proposed model over existing for Airport Noise at varying SNR = 5 dB
Table 5 Performance evaluation of proposed model over existing for Airport Noise at varying SNR = 10 dB
Table 6 Performance evaluation of proposed model over existing for Airport Noise at varying SNR = 15 dB

5.2 Influence on exhibition hall noise under varying SNR

The noise created in the exhibition hall, Whall(t), is added to S(t) at varying SNR rates. The formulated noisy signal R(t) = S(t) + Whall(t) is evaluated in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The acquired results are tabulated in Tables 7, 8, 9 and 10, corresponding to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. When adding Whall(t) at 0 dB to S(t), the proposed work achieved the highest SNR of 34.27, which is better than the existing models multi-features + DCNN based speech enhancement = 5.24, D-EMCD = 5.16, NN + auto-correlation = −0.006, spectral subtraction = −0.31, OMLSA = −22.20, TSNR = −0.99, HRNR = −0.87, and RNMF = 2.81. In addition, the RMSE of the proposed work is 18.7%, 17.7%, 53.5%, 55.18%, 96.4%, 96.4%, 58.5%, 57.9% and 36.3% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, when Whall(t) is added to S(t) at 5 dB, the SNR of the proposed work is 36.81, which is 77.16%, 77.85%, 99.9%, 99.3%, 39.7%, 39.7%, 97.3%, 97.52%, and 89.5% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, the CORR of the proposed work is 2.1%, 0.4%, 99.9%, 99.56%, 91.8%, 91.8%, 97.9%, 97.9%, and 6.9% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively.
Moreover, when Whall(t) is added at 10 dB to S(t), the PESQ acquired from the proposed work is 2.663, which is better than the existing models multi-features + DCNN based speech enhancement = 2.49, D-EMCD = 2.58, NN + auto-correlation = 0.39, spectral subtraction = 0.92, OMLSA = 1.49, TSNR = 2.25, HRNR = 2.26, and RNMF = 2.48. Moreover, when Whall(t) is applied at 15 dB to S(t), the proposed work is 0.3%, 0.7%, 84.3%, 64.4%, 44.2%, 44.2%, 8.6%, 8.15%, 8.15% and 6.3% better than the existing works in terms of PESQ. In the case of SNR for Whall(t) applied at 15 dB, the proposed work is 67.8%, 71.7%, 99.9%, 99.43%, 48.3%, 48.3%, 97.8%, 97.9%, and 87% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. The final result is a speech-enhanced signal with negligible noise. The Wiener filter has been modified: for all input signals, the LSTM model is used to accurately estimate the Wiener filter tuning factor. The extracted features (EMD) were used to train the LSTM model, and the modified Wiener filter was applied during the testing phase. Therefore, the evaluation makes it clear that the proposed work is applicable even in the exhibition hall environment.
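The "X% better" figures quoted throughout these sections appear consistent with a relative-improvement convention: (proposed − baseline)/proposed for higher-is-better metrics (SNR, PESQ, CORR, ESTOI, STOI, SDR), and (baseline − proposed)/baseline for lower-is-better metrics such as RMSE. A sketch under that assumption, with a function name of our choosing:

```python
def pct_improvement(proposed, baseline, higher_is_better=True):
    """Relative improvement of the proposed score over a baseline, in percent.

    Assumes the convention inferred from the reported tables; this is an
    illustration, not the authors' stated formula.
    """
    if higher_is_better:
        return (proposed - baseline) / abs(proposed) * 100.0
    return (baseline - proposed) / abs(baseline) * 100.0
```

For example, an RMSE of 0.007 against a baseline RMSE of 0.014 gives a 50% improvement under this convention.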

Table 7 Performance evaluation of proposed model over existing for exhibition Hall Noise at varying SNR = 0 dB
Table 8 Performance evaluation of proposed model over existing for exhibition Hall Noise at varying SNR = 5 dB
Table 9 Performance evaluation of proposed model over existing for exhibition Hall Noise at varying SNR = 10 dB
Table 10 Performance evaluation of proposed model over existing for exhibition Hall Noise at varying SNR = 15 dB

5.3 Influence on restaurant noise under varying SNR

The restaurant noise Wrest(t) is added to S(t) at varying SNR rates. The formulated noisy signal R(t) = S(t) + Wrest(t) is evaluated in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The acquired results are tabulated in Tables 11, 12, 13 and 14, corresponding to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. When adding Wrest(t) at 0 dB, the SNR of the proposed work is 33.90, which is better than the existing models multi-features + DCNN based speech enhancement = 5.99, D-EMCD = 4.55, NN + auto-correlation = −0.009, spectral subtraction = −0.17, OMLSA = −22.21, TSNR = −1.05, HRNR = −0.92, and RNMF = 2.58. In addition, for Wrest(t) at 0 dB, the RMSE of the proposed work is 0.02, which is 26.4%, 20.6%, 52.3%, 53.4%, 96.3%, 96.3%, 57.9%, 57.3% and 36.5% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Moreover, when adding Wrest(t) at 5 dB, the CORR of the proposed work is 0.93016, which is better than multi-features + DCNN based speech enhancement = 0.91, D-EMCD = 0.92, NN + auto-correlation = −0.000008, spectral subtraction = −0.005, OMLSA = 0.08, TSNR = 0.02, HRNR = 0.02, and RNMF = 0.86. In addition, the PESQ of the proposed work is 0.12%, 4.7%, 77.01%, 66.4%, 45.1%, 45.1%, 22.7%, 22.49% and 8.1% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Moreover, when Wrest(t) at 10 dB is added to the clean input speech signal, the proposed model generated a higher-quality speech signal.
Here, when Wrest(t) is added at 10 dB, the RMSE of the proposed work is 33.6%, 31.9%, 78.3%, 78.8%, 98.3%, 98.3%, 80.5%, 80.37% and 64.6% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, the ESTOI of the proposed work is 0.79, which is 0.15%, 3.9%, 99.9%, 68.9%, 25.3%, 25.3%, 25.79%, 26.8% and 11.45% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Moreover, when Wrest(t) at 15 dB is added to the clean signal, the SNR of the proposed work is 6.4%, 72.1%, 99.9%, 99.4%, 46.2%, 46.28%, 97.7%, 97.86% and 88.04% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. The modification is made to the Wiener filter; the improved Wiener filter is used rather than the conventional one. Thus, the betterment of the proposed work over the other existing models has been proved.

Table 11 Performance evaluation of proposed model over existing for restaurant noise at varying SNR = 0 dB
Table 12 Performance evaluation of proposed model over existing for restaurant noise at varying SNR = 5 dB
Table 13 Performance evaluation of proposed model over existing for restaurant noise at varying SNR = 10 dB
Table 14 Performance evaluation of proposed model over existing for restaurant noise at varying SNR = 15 dB

5.4 Influence on railway station noise under varying SNR

To the clean speech signal S(t), the railway station noise Wrail(t) is added at varying SNR rates, and the outcomes acquired after de-noising are evaluated in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The acquired results are tabulated in Tables 15, 16, 17 and 18, corresponding to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. On applying Wrail(t) at SNR = 0 dB, the SNR of the proposed work is 35.03, which is the highest value, being 83.26%, 84.2%, 99.9%, 99.9%, 36.6%, 36.6%, 97.6%, 97.9%, and 92.65% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, when Wrail(t) is applied to S(t) at 5 dB, the proposed work achieved the least RMSE of 0.01, compared to multi-features + DCNN based speech enhancement = 0.02, D-EMCD = 0.02, NN + auto-correlation = 0.04, spectral subtraction = 0.04, OMLSA = 0.56, TSNR = 0.05, HRNR = 0.05, and RNMF = 0.03. In addition, the RMSE of the proposed work after the application of Wrail(t) at SNR = 10 dB is 0.0083884, which is 32.6%, 35.3%, 80.4%, 80.9%, 98.4%, 98.49%, 82.4%, 82.2% and 67.6% better than the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. In addition, while applying Wrail(t) at SNR = 15 dB, the proposed work achieved the highest SNR value of 40.21, while the existing models recorded SNR values of multi-features + DCNN based speech enhancement = 9.49, D-EMCD = 11.6, NN + auto-correlation = −0.0004, spectral subtraction = −0.23, OMLSA = −22.19, TSNR = −0.93, HRNR = −0.87, and RNMF = 5.05. Furthermore, on analyzing the proposed work with Wrail(t) at 15 dB, the proposed work obtained the best results.
The extracted features (EMD) were used to train the LSTM model, and the modified Wiener filter was applied during the testing phase. As a result of the evaluation, it is clear that the proposed study is effective in improving the speech signal even when station noise is present.

Table 15 Performance evaluation of proposed model over existing for station noise at varying SNR = 0 dB
Table 16 Performance evaluation of proposed model over existing for station noise at varying SNR = 5 dB
Table 17 Performance evaluation of proposed model over existing for station noise at varying SNR = 10 dB
Table 18 Performance evaluation of proposed model over existing for station noise at varying SNR = 15 dB

5.5 Influence on street noise under varying SNR

The street noise Wstreet(t) is added to the clean signal S(t) at varying SNR levels, and the results obtained after de-noising are measured in terms of SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. The obtained results are tabulated in Tables 19, 20, 21 and 22, which correspond to the SNR rates of 0 dB, 5 dB, 10 dB, and 15 dB, respectively. From the acquired outcomes, the RMSE of the proposed work is found to be lower under every variation of the applied Wstreet(t) rate. At SNR = 0 dB, 5 dB, 10 dB, and 15 dB, the proposed work achieved the least RMSE values of 0.02, 0.01, 0.009 and 0.006, respectively. Moreover, on analyzing the other outcomes, the proposed work recorded the highest SDR, PESQ, SNR, CORR, ESTOI, and STOI, which are the appropriate values for speech enhancement. In addition, on adding Wstreet(t) at 5 dB, the SNR of the proposed work is 39.26, which is better than the existing models multi-features + DCNN based speech enhancement = 8.91, D-EMCD = 8.65, NN + auto-correlation = −0.003, spectral subtraction = −0.18, OMLSA = −22.19, TSNR = −0.89, HRNR = −0.81, and RNMF = 3.49. Moreover, when Wstreet(t) at 10 dB is applied to the clean speech signal, the RMSE of the proposed work is 0.009, which is better than the existing models multi-features + DCNN based speech enhancement = 0.01, D-EMCD = 0.01, NN + auto-correlation = 0.04, spectral subtraction = 0.04, OMLSA = 0.48, TSNR = 0.05, HRNR = 0.05, and RNMF = 0.02. Moreover, for all input signals, the LSTM model is used to accurately estimate the Wiener filter tuning factor. The extracted features (EMD) were used to train the LSTM model, and the modified Wiener filter was applied during the testing phase. Therefore, from the evaluation, it is clear that the proposed work is highly significant for enhancing the speech signal.

Table 19 Performance evaluation of proposed model over existing for street noise at varying SNR = 0 dB
Table 20 Performance evaluation of proposed model over existing for street noise at varying SNR = 5 dB
Table 21 Performance evaluation of proposed model over existing for street noise at varying SNR = 10 dB
Table 22 Performance evaluation of proposed model over existing for street noise at varying SNR = 15 dB

5.6 Statistical analysis

Table 23 shows the statistical analysis of the proposed model against the existing methods. Considering the SDR measure, the STD value of the proposed model is 4.23%, 10.20%, 5.42%, 82.84%, 86.82%, 92.29%, 92.29%, and 58.46% better than the existing multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF models. In the RMSE measure, the best value of the proposed model is 40%, 40%, 85%, 85%, 98.75%, 85%, 85%, and 70% superior to the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Further, considering the SNR measure, the best value of the proposed model is 33.90, which is better than the existing models multi-features + DCNN based speech enhancement = 5.24, D-EMCD = 4.55, NN + auto-correlation = −0.01, spectral subtraction = −0.31, OMLSA = −22.21, TSNR = −1.05, HRNR = −0.92, and RNMF = 2.57. Likewise, the other measures also show better performance. Therefore, from the analysis, the proposed model is proven to be a suitable model for speech enhancement.
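The per-metric statistics reported in Table 23 (best value, mean, and STD across the SNR conditions) can be reproduced from per-SNR scores as sketched below; the score values shown are hypothetical placeholders for illustration, not the paper's data:

```python
import numpy as np

# Hypothetical per-SNR (0/5/10/15 dB) SDR scores; NOT the paper's reported data.
scores = {
    "proposed": [6.89, 7.50, 8.10, 8.60],
    "RNMF":     [5.82, 6.10, 6.40, 6.70],
}

def summarize(values):
    """Best value, mean, and population STD across SNR conditions."""
    arr = np.asarray(values, dtype=float)
    return {"best": arr.max(), "mean": arr.mean(), "std": arr.std(ddof=0)}

for name, vals in scores.items():
    print(name, summarize(vals))
```

A lower STD here indicates more consistent performance across SNR conditions, which is the sense in which the proposed model's SDR spread is compared against the baselines.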

Table 23 Statistical analysis of proposed model over existing methods

5.7 Discussions

The major goal of this study is to enhance speech signals corrupted by various noise sources. The results section evaluated the proposed model with different noise sources, namely "airport noise, exhibition noise, restaurant noise, station noise, and street noise". The various noise sources are analyzed under different SNR values in terms of speech-quality measures. By utilizing the modified Wiener filter and the LSTM model assisted by the extracted features, the de-noised speech signal is obtained. Compared to the existing models, the proposed method achieves higher SDR, PESQ, CORR, ESTOI, STOI, and SNR, as well as lower RMSE values. Moreover, the proposed model overcomes drawbacks such as reduced speech intelligibility [43], lower PESQ [40], lower robustness [37], unsuitability for complex noise environments [37], lower speech quality [10], and low SNR [45, 29]. However, the proposed method falls short for some noise sources, and it does not address spectral magnitude and spectral phase estimation.

5.8 Practical implication

The main potential applications of the proposed model are given below:

  • Hearing aids

  • Automatic speech recognition

  • Mobile communications

  • Video captioning for teleconferences

  • Voice over Internet protocol

  • Hands-free communications

This research provides better outcomes and suits many potential application fields.

6 Conclusion

In this modern world, there is a need to improve speech signals in which the target speech is disturbed by different noise sources. This research considered various noise problems for speech enhancement that resemble real-world situations, where many noise sources simultaneously diminish the quality and intelligibility of speech. In this work, a novel speech signal enhancement model was introduced with the assistance of a deep learning model. The main contribution of this research lies in the proper estimation of the tuning factor η of the Wiener filter for all input signals; the training of η was done using the LSTM model. The experimental outcomes at various input SNRs verified the supremacy of the proposed model with respect to SDR, PESQ, SNR, RMSE, CORR, ESTOI, and STOI. In particular, for airport noise, the PESQ of the proposed work is 2.17 at SNR = 0 dB, which is 4.9%, 10.43%, 75.06%, 74.7%, 40.78%, 40.78%, 34.95%, 36.325%, and 11.09% better than the existing multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF models. Additionally, in the RMSE measure, the best value of the proposed model is 40%, 40%, 85%, 85%, 98.75%, 85%, 85%, and 70% superior to the existing models multi-features + DCNN based speech enhancement, D-EMCD, NN + auto-correlation, spectral subtraction, OMLSA, TSNR, HRNR, and RNMF, respectively. Thus, the superiority of the proposed model has been proven in complex noise environments. In the future, we plan to address these issues and develop the speech enhancement model with an advanced GAN.