1 Introduction

Speech is the vocalized form of human communication. It is also one of the preferred modes of communication between humans and machines in a variety of speech processing applications, including speech recognition (Schultz et al., 2021), speech coding (Kleijn et al., 2018), and speech quality estimation of disordered tracheoesophageal speech (Ali et al., 2020). These applications perform well when the input speech signal is in its original (clean) form. In practice, however, the input signal is corrupted by background noise from the surroundings, which may be stationary (fan, white noise) or non-stationary (exhibition, station, drone, helicopter, airplane), among others. As a result, to achieve better performance, the speech must be enhanced before it is fed into any specific speech processing application.

Fig. 1 Single-channel speech enhancement system

Speech enhancement algorithms improve the quality of speech that is degraded by additive noise such as exhibition noise, drone noise, etc., as shown in Fig. 1. Speech identification (Sheft et al., 2008), speech recognition (Moore et al., 2017), and unmanned aerial vehicle (UAV)-based search and rescue (Deleforge et al., 2019) are among the applications of speech enhancement. Speech enhancement improves both the quality and the intelligibility of noisy speech signals (Das et al., 2020). It can be used as a pre-processing block as well as a filter to remove background noise, as in cellular phones (Ogunfunmi et al., 2015). Further, the spectral characteristics of the noise, as well as the number of available microphones, affect speech enhancement algorithms. A single microphone, or channel, is generally preferred for mobile applications because of cost and size constraints (Loizou, 2013). Moreover, real-world noise can be stationary or non-stationary, with varying spectral properties. A speech enhancement algorithm is effective when the noise in a noisy speech signal is estimated accurately without introducing perceptible distortion; an inaccurate noise estimate can distort the speech signal and introduce musical noise (Loizou, 2013).

Spectral subtraction (SS) is a traditional and widely used speech enhancement algorithm for improving speech degraded by additive background noise (Boll, 1979; Loizou, 2013). The algorithm employs a forward and inverse Fourier transform of the input noisy speech signal. However, it suffers from the problem of “musical noise” (Loizou, 2013) in the enhanced speech. A multi-band spectral subtraction algorithm for improving fan-noise-degraded speech has been proposed in Saldanha and Shruthi (2016); however, the presence of non-stationary noise degrades its performance. In Yan et al. (2020), an iterative graph spectral subtraction (GSS) algorithm has been proposed that suppresses noise by leveraging the differences between the spectra of speech graph signals and noise graph signals. However, this algorithm is explored only with stationary noise, such as white and pink noise, and non-stationary babble noise; aerial environmental noise is not considered to investigate the suitability of the algorithm.

Wiener filtering (WF) follows a conceptually similar approach to speech enhancement. A Wiener filter-based speech enhancement algorithm for additive white Gaussian noise (AWGN) and colored noise has been presented in Abd El-Fattah et al. (2014). The filter transfer function in Abd El-Fattah et al. (2014) is based on speech signal statistics, namely the local mean and local variance; this algorithm, however, is implemented in the time domain. In the frequency domain, a speech enhancement algorithm that combines voiced speech probability with wavelet decomposition has been proposed in Bhowmick and Chandra (2017). This algorithm, however, cannot optimize the multi-taper spectrum, resulting in poor denoising of the speech signal. Similarly, a combination of the Wiener filter and the Karhunen–Loeve transform (KLT) for enhancing degraded speech has been introduced in Srinivasarao and Ghanekar (2020). A time-frequency mask-based parametric Wiener filter in Chiea et al. (2019) considers the trade-off between speech distortion and noise reduction to design and minimize a cost function that yields an optimal solution for noise suppression. However, these methods do not give the engineer the flexibility to control the estimate of the noise power spectral density (PSD) to suppress musical noise components.

To smooth the spectral estimate and thereby reduce musical noise during speech enhancement, Hu and Loizou (2004) employed low-variance spectral estimators based on wavelet thresholding of the multi-taper spectrum of the speech sample. Speech-shaped noise and car noise are investigated for the suitability of the algorithm; however, the effect of aerial environmental noises, which have different spectral characteristics, has not been analysed in Hu and Loizou (2004). In Charoenruengkit and Erdöl (2010), the effect of the spectral estimate variance on the quality of a speech enhancement system has been analysed: reducing the variance of the spectral estimator improves the system. Babble, car, jet airplane, and white noise are considered in Charoenruengkit and Erdöl (2010), but aerial environmental noises are not. A spectral subtraction algorithm with an SNR-dependent noise estimation approach has been considered in Islam et al. (2018) for speech enhancement, where the magnitude and phase of the noisy speech spectrum are modified to enhance the speech. Babble and street noise are considered, but aerial environmental noises are not, and the effect of musical noise on speech enhancement is also missing in Islam et al. (2018). In Kanehara et al. (2012), three different estimators are investigated to mathematically quantify the amount of musical noise generated during speech enhancement; a practical example is shown for white and babble noise, but not for aerial environmental noises. Moreover, none of the above algorithms provides a degree of flexibility in estimating the noise, which is the major parameter in speech processing.

To this end, a single-channel speech enhancement algorithm based on an implicit Wiener filter (IWF) with a recursive noise estimation technique was proposed in Jaiswal and Romero (2021), which gives the engineer a degree of flexibility in estimating the noise for speech enhancement. The algorithm is implemented in the frequency domain, but only for speech degraded by non-stationary noise; speech degraded by stationary noise is not considered in Jaiswal and Romero (2021). It also does not take into account the noisy environments created by aerial sources such as UAV, helicopter, and airplane noise. In addition, it lacks an embedded hardware implementation on a low-cost edge computing node such as the Raspberry Pi. Motivated by this, in this work we extend the implicit Wiener filter-based speech enhancement algorithm, which provides a degree of flexibility in suppressing the noise to enhance the speech. The key contributions of this paper are as follows:

  • The proposed implicit Wiener filter-based speech enhancement algorithm is analysed in the presence of stationary noise, namely AWGN, and real-world non-stationary noisy environments such as exhibition and station.

  • In addition, we consider novel aerial environmental noises such as drone, helicopter, and airplane noise to evaluate the performance of the proposed speech enhancement algorithm.

  • The proposed speech enhancement algorithm provides a degree of flexibility, via the noise smoothing parameter \(\alpha\) and the noise adjustable parameter \(\gamma\), to estimate the noise and thereby enhance the speech accurately.

  • Through extensive results, we compare the performance of the proposed speech enhancement algorithm with the spectral subtraction algorithm and achieve relatively better performance. The enhanced speech obtained with the proposed algorithm shows a smoother spectrum.

  • Finally, we implement the proposed speech enhancement algorithm on Raspberry Pi 4 Model B hardware. This shows that the proposed algorithm can be run on embedded platforms with limited computational resources.

The rest of the paper is organized as follows. The literature survey is described in Sect. 2. A review of the spectral subtraction algorithm is presented in Sect. 3, and Sect. 4 reviews the implicit Wiener filter in the frequency domain. Section 5 describes the noise estimation technique, and Sect. 6 describes the experimental dataset. The evaluation methodology for the speech enhancement algorithm is outlined in Sect. 7. The implementation of the speech enhancement algorithms on an edge computing system is described in Sect. 8. The results are presented and discussed in Sect. 9. Finally, Sect. 10 provides the concluding remarks and the scope for future work.

2 Related work

A deep neural network (DNN)-augmented colored-noise Kalman filter-based speech enhancement system has been proposed in Yu et al. (2020) that models both clean speech and noise as auto-regressive processes. The parameters of these processes, comprising linear prediction coefficients and driving noise variances, are obtained by training multi-objective DNNs (Yu et al., 2020). It is important to note that existing DNN-based speech enhancement models extract only local features from the noisy speech in a non-causal way. Thus, in Yuan (2020), a time-frequency smoothing neural network has been proposed that exploits time-frequency correlation in an improved minima controlled recursive averaging (MCRA)-based feature calculation using long short-term memory (LSTM) and convolutional neural networks (CNN) (Shrestha and Mahmood, 2019). A generative adversarial network (GAN) (Creswell et al., 2018)-based speech enhancement algorithm to estimate the clean signal from the corrupted signal has been proposed in Pascual et al. (2019). In You and Ma (2017), a modified scheme has been proposed that includes a smoothing adaptation of the frame signal-to-noise ratio (SNR) and a re-estimation of the previous SNR in order to reduce artifacts in speech enhancement. However, these DNN-based models require a huge amount of speech data for training. Therefore, we propose a speech enhancement algorithm that achieves good performance with a limited amount of data.

A multi-band extension of spectral subtraction is proposed in Kamath et al. (2002), where the speech spectrum is divided into multiple non-overlapping bands, spectral subtraction is performed independently in each band, and the modified bands are recombined to obtain the enhanced speech. This algorithm has been investigated on colored noise alone. The multi-band approach exploits the fact that noise does not affect the speech signal uniformly across frequency. In Asano et al. (2000), a speech enhancement algorithm based on the subspace approach has been proposed, which reduces ambient noise by eliminating the noise-dominant eigenvalues. The basic principle of the subspace approach is that the clean signal is confined to a subspace of the noisy Euclidean space; therefore, the vector space of the noisy signal is decomposed into “signal” and “noise” subspaces using the singular value decomposition (SVD) or eigenvector–eigenvalue factorization (Loizou, 2013). However, the above algorithms do not consider non-stationary noise or aerial environmental noise, which have different spectral characteristics, for speech enhancement.

The use of the Raspberry Pi for the experimental evaluation of speech signals has gained attention due to its low cost for edge computing applications. A UAV-based voice recognition system is considered for locating humans buried under rubble after a massive disaster (Yamazaki et al., 2019); here, the Raspberry Pi is mounted on the UAV to process the voice signals. A real-time implementation of an advanced binaural noise reduction algorithm is demonstrated on a Raspberry Pi in Azarpour et al. (2017). In Drakopoulos et al. (2019), a DNN-based noise suppression scheme for audio signals is demonstrated using a Raspberry Pi. Motivated by Yamazaki et al. (2019), Azarpour et al. (2017), and Drakopoulos et al. (2019), in this work the experimental evaluations of both algorithms, that is, the proposed speech enhancement algorithm and the spectral subtraction algorithm, are performed on a Raspberry Pi 4 Model B board.

3 Basic spectral subtraction algorithm

Spectral subtraction is one of the most widely used algorithms for single-channel speech enhancement (Loizou, 2013). It estimates the clean speech spectrum from the noisy speech spectrum by subtracting an estimate of the noise spectrum. The spectral subtraction algorithm is implemented under the following assumptions: (i) the speech signal is assumed to be stationary within each analysis frame; (ii) speech and noise are uncorrelated; and (iii) the phase of the noisy speech is unchanged. Let y[n] represent the noisy speech, which is defined as

$$\begin{aligned} y[n] = s[n] + d[n], \end{aligned}$$
(1)

where s[n] denotes the clean speech signal and d[n] is the noise signal. Since our focus is on speech enhancement in the frequency domain, the discrete short-time Fourier transforms (STFT) of y[n], s[n], and d[n] are denoted \(Y[\omega ,k]\), \(S[\omega ,k]\), and \(D[\omega ,k]\), respectively, that is, the noisy, clean, and noise signals in the frequency domain. \(Y[\omega ,k]\) is obtained as

$$\begin{aligned} Y[\omega ,k] = S[\omega ,k] + D[\omega ,k], \end{aligned}$$
(2)

where \(k \in \{1, 2, \ldots , N\}\) denotes the frame index with N being the number of frames, and \(\omega\) denotes the discrete angular frequency of a frame. With \(P_{yy}[\omega ,k]\) and \(\hat{P}{_{dd}}[\omega ,k]\) being the noisy speech power spectrum and the estimated noise power spectrum, respectively, the estimate of the clean speech power spectrum, \({\hat{P}_{ss}[\omega ,k]}\), is obtained as Jaiswal and Romero (2021)

$$\begin{aligned} {\hat{P}_{ss}[\omega ,k]} = {P_{yy}[\omega ,k]} - {\hat{P}{_{dd}}[\omega ,k]} \end{aligned}$$
(3)

Finally, the enhanced speech signal is obtained by computing the inverse short-time Fourier transform (Inverse STFT) of the square root of \({\hat{P}_{ss}[\omega ,k]}\), using the phase of the noisy speech signal. A detailed derivation of the spectral subtraction algorithm is presented in Jaiswal and Romero (2021).

Spectral subtraction involves a trade-off between preserving speech information and removing interference, and must therefore be performed carefully to avoid speech distortion. If too much is subtracted, some clean speech information may be lost; if too little is subtracted, much of the interfering noise (musical noise) may remain. Musical noise is noise with increasing variance that remains in the estimated speech signal and may cause listening fatigue (Vaseghi, 2008).
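The subtraction in Eq. (3) can produce negative power values, which, after reconstruction, manifest as musical noise. A minimal sketch in Python follows (the paper's experiments use MATLAB; the flooring rule and the function name here are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np

def spectral_subtract(P_yy, P_dd_hat, floor=1e-3):
    """Power spectral subtraction (Eq. 3) with spectral flooring.

    P_yy:     noisy-speech power spectrum of one frame
    P_dd_hat: estimated noise power spectrum
    floor:    fraction of P_yy retained where the subtraction goes
              negative (a common remedy for musical noise)
    """
    P_ss_hat = P_yy - P_dd_hat
    # Replace negative values with a small floor proportional to the
    # noisy spectrum instead of letting them turn into musical noise.
    return np.maximum(P_ss_hat, floor * P_yy)

# Example: in the second bin the noise estimate exceeds the noisy
# spectrum, so that bin is floored rather than left negative.
P_yy = np.array([4.0, 1.0, 9.0])
P_dd = np.array([1.0, 2.0, 3.0])
print(spectral_subtract(P_yy, P_dd))
```

The enhanced frame is then \(\sqrt{\hat{P}_{ss}}\) recombined with the noisy phase, as described above.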

Fig. 2 Block diagram of the proposed implicit Wiener filter-based speech enhancement algorithm

4 Implicit Wiener filter in frequency domain

The estimate of the clean speech obtained from the spectral subtraction algorithm is affected by musical noise due to negative subtraction values. The Wiener filter, a linear estimator, instead minimizes the mean square error between the original (clean) and the estimated speech signal, thereby avoiding the direct subtraction (Haykin, 1996; Loizou, 2013).

Given the clean speech power spectrum, \(P_{ss}[\omega ]\), and noise power spectrum, \(P_{dd}[\omega ]\), the condition for optimal transfer function of the Wiener filter \(H_{WF}[\omega ]\)Footnote 1 for speech enhancement in frequency domain is given as Loizou (2013), Jaiswal and Romero (2021)

$$\begin{aligned} H_{WF}[\omega ] = \frac{P_{ss}[\omega ]}{P_{ss}[\omega ] + P_{dd}[\omega ]} \end{aligned}$$
(4)

For estimating the clean speech, \(P_{ss}[\omega ]\) and \(P_{dd}[\omega ]\) must be estimated accurately. Therefore, additional flexibility is provided to these estimates by introducing two adjustable parameters, \(\beta\) and \({\gamma }\). Introducing these parameters yields the so-called modified or parametric Wiener filter, resulting in the following implicit estimator (Lim & Oppenheim, 1979):

$$\begin{aligned} H[\omega ] = \left[ \frac{P_{ss}[\omega ]}{P_{ss}[\omega ] + {\gamma }P_{dd}[\omega ]} \right] ^{\beta } \end{aligned}$$
(5)

Here, \({\beta }\) represents the noise suppression factor and \({\gamma }\) is the noise adjustable parameter that controls the amount of perceived noise. The value of \({\gamma }\) is calculated from the segmental SNR (SNRseg) (see Eq. 6) of each frame of the noisy speech signal as \(\gamma = 4 - 0.15\,\text {SNRseg}\). For a frame with high segmental SNR, \({\gamma }\) should be small, which is the case when speech is present; for a frame with low segmental SNR, \({\gamma }\) should be large, which is the case for low-energy frames or pauses. The segmental SNR is defined as Loizou (2013)

$$\begin{aligned} \text {SNRseg} = \frac{10}{M} \sum _{m=0}^{M-1} \log \left( 1+\frac{\sum _{n=Nm}^{Nm+N-1} s^2[n]}{\sum _{n=Nm}^{Nm+N-1} {(s[n]-\hat{s}[n])}^2}\right) , \end{aligned}$$
(6)

where s[n] and \(\hat{s}[n]\) are the clean and the enhanced speech signal, respectively. N is the frame length and M is the number of frames in the speech signal.
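The mapping from Eq. (6) to the per-frame \(\gamma\) can be sketched as follows. This is an illustrative Python sketch (the paper's implementation is in MATLAB); taking the logarithm as base-10 is an assumption, since Eq. (6) writes only \(\log\):

```python
import numpy as np

def snr_seg(s, s_hat, N=200):
    """Segmental SNR of Eq. (6) over M complete frames of length N
    (N = 200 samples corresponds to 25 ms at 8 kHz)."""
    M = len(s) // N
    total = 0.0
    for m in range(M):
        sig = s[m * N:(m + 1) * N]
        err = sig - s_hat[m * N:(m + 1) * N]
        total += np.log10(1.0 + np.sum(sig ** 2) / np.sum(err ** 2))
    return 10.0 * total / M

def gamma_from_snrseg(snrseg):
    """Noise adjustable parameter: small gamma for high-SNR (speech)
    frames, large gamma for low-SNR frames and pauses."""
    return 4.0 - 0.15 * snrseg
```

For example, a frame region with SNRseg of 10 dB gives \(\gamma = 2.5\), while a pause with SNRseg near 0 gives \(\gamma\) close to 4.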

Furthermore, to accommodate the non-stationarity of the speech signal, it is convenient to introduce the following approximation (Lim & Oppenheim, 1979)

$$\begin{aligned} P_{ss}[\omega ] \approx |\hat{S}[\omega ]|^2. \end{aligned}$$
(7)

Here, we approximate the true power spectral density of s[n] by its spectral energy. After substituting (7) in (5) with \({\beta }= 1/2\), \(\hat{S}[\omega ]\)Footnote 2 is obtained as Jaiswal and Romero (2021)

$$\begin{aligned} {|\hat{S}[\omega ]|} = \left[ {|Y[\omega ]|}^2 - {\gamma } P_{dd}[\omega ] \right] ^{1/2}. \end{aligned}$$
(8)

Finally, the enhanced speech signal is estimated by taking the inverse short-time Fourier transform of \(|\hat{S}[\omega ]|\) on a frame-by-frame basis, using the phase of the noisy speech signal. We use the overlap-add method (Daher et al., 2010) to recombine the spectra of the individual frames. The block diagram of the proposed implicit Wiener filter-based speech enhancement algorithm is shown in Fig. 2.
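Applied per frequency bin, Eq. (8) can be sketched as follows (illustrative Python; clamping the difference at zero before the square root is our assumption, needed because the subtraction in Eq. (8) can go negative):

```python
import numpy as np

def iwf_magnitude(Y_mag, P_dd, gamma):
    """Implicit Wiener filter magnitude estimate, Eq. (8):
    |S_hat| = sqrt(|Y|^2 - gamma * P_dd), clamped at zero so the
    square root stays real."""
    return np.sqrt(np.maximum(Y_mag ** 2 - gamma * P_dd, 0.0))

# One frame: the first bin is noise-dominated and is driven to zero.
Y_mag = np.array([2.0, 3.0, 5.0])
P_dd = np.array([5.0, 4.0, 9.0])
print(iwf_magnitude(Y_mag, P_dd, gamma=1.0))
```

The enhanced frame is then reconstructed by the inverse STFT with the noisy phase and overlap-add, as described above.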

5 Noise estimation

Different types of noise are encountered in daily life, each with distinct behavior and spectral characteristics. For example, AWGN is stationary, whereas exhibition, station, drone, helicopter, and airplane noise are non-stationary due to their constantly changing spectral characteristics. Accurate noise estimation from the noisy speech results in improved speech quality. To estimate the noise spectrum, the speech enhancement algorithm employs a first-order recursive equation (Loizou, 2013), which averages the previous noise estimate and the current noisy speech spectrum as

$$\begin{aligned} \hat{P}{_{dd}}[\omega ,k]= \alpha \hat{P}{_{dd}}[\omega ,k-1] +(1 - \alpha )P_{yy}[\omega ,k] \end{aligned}$$
(9)

where \(\alpha\) (\(0 \le \alpha \le 1\)) denotes the smoothing parameter. \(P_{yy}[\omega ,k]\), \(\hat{P}{_{dd}}[\omega ,k]\), and \(\hat{P}{_{dd}}[\omega ,k-1]\) denote the short-time power spectrum of the noisy speech, the estimate of the noise power spectrum in the \(\omega {^{th}}\) frequency bin of the current frame, and the estimate of the noise power spectrum of the previous frame, respectively. The value of \(\alpha\) is selected analytically: for different values of \(\alpha\), segmental SNRs (Loizou, 2013) are calculated for each frame of the noisy speech sample, and the transition in the segmental SNRs is observed to select the value of \(\alpha\). Usually, the segmental SNR decreases as \(\alpha\) increases.
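The recursion of Eq. (9) across frames can be sketched as follows (illustrative Python; the noise-only initialization from the first five frames follows the description in Sect. 7, while the function and variable names are our own):

```python
import numpy as np

def estimate_noise_psd(P_yy_frames, alpha=0.9, n_init=5):
    """First-order recursive noise PSD estimate, Eq. (9).

    P_yy_frames: array of shape (num_frames, num_bins) holding the
                 short-time power spectrum of the noisy speech
    alpha:       smoothing parameter, 0 <= alpha <= 1
    n_init:      initial frames treated as noise-only (Sect. 7 uses 5)
    """
    # Initialize from the assumed noise-only leading frames.
    P_dd = P_yy_frames[:n_init].mean(axis=0)
    history = []
    for P_yy in P_yy_frames[n_init:]:
        # Eq. (9): weighted average of past estimate and current spectrum.
        P_dd = alpha * P_dd + (1.0 - alpha) * P_yy
        history.append(P_dd)
    return P_dd, np.array(history)
```

With a larger \(\alpha\), the estimate relies more on its past values and adapts more slowly to spectral changes, which is the trade-off the segmental-SNR-based selection of \(\alpha\) balances.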

6 Experimental dataset

The performance of the speech enhancement algorithms is evaluated in the presence of both stationary and non-stationary noise. We generate AWGN at SNRs of 0 dB, 2.5 dB, and 5 dB for the stationary scenario, and use exhibition, station, drone, helicopter, and airplane noise at SNRs of 0 dB, 2.5 dB, and 5 dB for the non-stationary scenarios. Noisy samples for the exhibition and station environments are taken from the noisy speech corpus NOIZEUS (Hu & Loizou, 2006), whose noise samples come from the Aurora dataset (Hirsch & Pearce, 2000). The NOIZEUS corpus was created by having three male and three female speakers utter phonetically balanced IEEE English sentences. The drone noise is taken from the drone audio dataset (Al-Emadi et al., 2019), and the helicopter and airplane noise from the environmental sound classification (ESC) dataset (Piczak, 2015).

We also consider two clean speech utterances, one pronounced by a male speaker and one by a female speaker. The male utterance is “A good book informs of what we ought to know” and the female utterance is “Let us all join as we sing the last chorus”. For the stationary scenario, we generate AWGN at SNRs of 0 dB, 2.5 dB, and 5 dB and combine it with each clean utterance to obtain the noisy speech samples. Similarly, for the non-stationary scenarios, exhibition and station noise from the Aurora dataset, as well as drone, helicopter, and airplane noise, are added to each clean utterance at 0 dB, 2.5 dB, and 5 dB SNR to obtain the corresponding noisy speech samples. The speech samples are narrow-band, with a sampling frequency of 8 kHz and an average duration of 3 seconds, and are saved in .WAV (16-bit PCM, mono) format.

7 Evaluation methodology of the speech enhancement algorithm

After generating the noisy speech samples as discussed in Sect. 6, each speech sample is divided into frames of 25 ms duration with 50% overlap, and each frame is weighted by a 25 ms Hamming window. The windowed speech frames are analyzed using a 256-point short-time Fourier transform (STFT). The noise is estimated using Eq. (9). Since stationary and non-stationary noises have different time-frequency distributions and spectral characteristics, they affect the speech signals differently. Consequently, we calculate the frame-wise segmental SNR (see Eq. (6)) of each noisy speech sample (uttered by a male and a female speaker) to obtain the best value of the smoothing parameter, \(\alpha\), for each noise type. In the implicit Wiener filtering, we also treat the initial five frames of the noisy speech sample as noise/silence to initialize the noise power spectral density (PSD) estimate using Eq. (9). A simple voice activity detector is used to update the noise PSD.
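The analysis front end described above (25 ms Hamming-windowed frames, 50% overlap, 256-point STFT at 8 kHz) can be sketched as (illustrative Python; the paper's implementation is in MATLAB):

```python
import numpy as np

def stft_frames(x, fs=8000, frame_ms=25, overlap=0.5, nfft=256):
    """Split x into 25 ms Hamming-windowed frames with 50% overlap and
    take a 256-point STFT of each (one-sided spectrum)."""
    N = int(fs * frame_ms / 1000)     # 200 samples per frame at 8 kHz
    hop = int(N * (1 - overlap))      # 100-sample hop for 50% overlap
    win = np.hamming(N)
    frames = [x[i:i + N] * win for i in range(0, len(x) - N + 1, hop)]
    return np.array([np.fft.rfft(f, nfft) for f in frames])

X = stft_frames(np.random.randn(8000))  # 1 s of noise at 8 kHz
print(X.shape)  # (num_frames, 129): 129 one-sided bins of a 256-point FFT
```

The noisy-speech power spectrum \(P_{yy}[\omega ,k]\) used in Eqs. (3) and (9) is then `np.abs(X) ** 2` per frame.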

The performance of the speech enhancement algorithms is evaluated with four objective speech quality measures: perceptual evaluation of speech quality (PESQ), log-likelihood ratio (LLR), cepstral distance (CD), and weighted spectral slope distance (WSS). The original (clean) speech sample is available for testing. PESQ (Hu & Loizou, 2006) compares the clean and the enhanced speech samples and generates a quality score ranging from -0.5 to 4.5; a higher PESQ value indicates better speech quality.

LLR is a spectral distance measure used to measure the mismatch between formants of the clean and the enhanced speech sample (Loizou, 2013). It usually ranges between 0 and 2, and is defined as Loizou (2013), Jaiswal and Romero (2021)

$$\begin{aligned} d_{LLR}(a_s,\bar{a}_{\hat{s}})= \log _{10} \left( \frac{{\bar{a}_{\hat{s}}^T}{R_s}{\bar{a}_{\hat{s}}}}{{a}_{s}^T {R_s}{a_s}}\right) , \end{aligned}$$
(10)

where \({a}_{s}\) and \(\bar{a}_{\hat{s}}\) denote the linear prediction coefficient (LPC) vectors of the clean and the enhanced speech sample, respectively, T denotes the transpose, and \(R_s\) represents the auto-correlation matrix of the clean speech sample.
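Given the LPC vectors and the clean-frame autocorrelation sequence, Eq. (10) can be sketched as follows (illustrative Python; computing the LPC vectors themselves, e.g. via Levinson–Durbin, is assumed to happen elsewhere):

```python
import numpy as np

def llr(a_s, a_s_hat, r_s):
    """Log-likelihood ratio of Eq. (10).

    a_s, a_s_hat: LPC coefficient vectors of the clean and enhanced frame
    r_s:          autocorrelation sequence of the clean frame, from which
                  the Toeplitz matrix R_s is built
    """
    p = len(a_s)
    # Symmetric Toeplitz autocorrelation matrix: R_s[i, j] = r_s[|i - j|].
    R_s = np.array([[r_s[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.log10((a_s_hat @ R_s @ a_s_hat) / (a_s @ R_s @ a_s))
```

Identical LPC vectors give a distance of zero; a mismatched enhanced-speech vector increases the numerator and hence the LLR.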

CD (Loizou, 2013; Hu & Loizou, 2007) provides an estimate of the log spectral distance between two spectra. It has the range of [0, 10] and it is calculated as Loizou (2013), Jaiswal and Romero (2021)

$$\begin{aligned} {d_{CD}(c_s,\bar{c}_{\hat{s}})= \frac{10}{\log _{e}10} \sqrt{ 2 \sum _{k=1}^{p} {{(c_s[k] - \bar{c}_{\hat{s}}[k]})}^2 }}, \end{aligned}$$
(11)

where \(c_s[k]\) and \(\bar{c}_{\hat{s}}[k]\) represent the cepstrum coefficients (obtained from LPC) of the clean and the enhanced speech sample, respectively. p denotes the maximum order of the LPC coefficients.
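Equation (11) in code form (illustrative Python; the LPC-to-cepstrum conversion producing \(c_s\) and \(\bar{c}_{\hat{s}}\) is assumed to happen elsewhere):

```python
import numpy as np

def cepstral_distance(c_s, c_s_hat):
    """Cepstral distance of Eq. (11) between two LPC-derived cepstrum
    vectors (indices k = 1..p; the k = 0 term is excluded, following
    the equation)."""
    diff = np.asarray(c_s, dtype=float) - np.asarray(c_s_hat, dtype=float)
    # 10 / ln(10) converts the natural-log cepstral difference to dB.
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))
```

Like LLR, the distance is zero for identical cepstra and grows with the spectral mismatch between the clean and the enhanced sample.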

WSS (Loizou, 2013) is based on the weighted difference between the spectral slopes in each band. It has the range of [0, 150]. With \(S_s[k]\), and \(\bar{S}_{\hat{s}}[k]\) being the spectral slope of the clean and the enhanced speech sample, respectively, and W[k] being the weight of the band k, WSS is obtained as Loizou (2013), Jaiswal & Romero (2021)

$$\begin{aligned} {d_{WSS}(S_s,\bar{S}_{\hat{s}})= \sum _{k=1}^{36} {W[k] (S_s[k] - \bar{S}_{\hat{s}}[k])}^2}. \end{aligned}$$
(12)

In addition to the objective performance measures, informal listening tests of the enhanced speech are also performed for both speech enhancement algorithms.

8 Implementation using the edge computing system

This section presents the implementation of speech enhancement algorithms on an edge computing system, that is, the Raspberry Pi (Azarpour et al., 2017).

For the experimental evaluation of the considered speech enhancement algorithms, we employ the Raspberry Pi 4 Model B, which acts as a mini computer, as shown in Fig. 3. It is a faster, more powerful, re-engineered, and completely upgraded model. It is equipped with a quad-core ARM Cortex-A72 processor running at 1.5 GHz and supports up to 8 GB of SDRAM. It has two USB 3.0 and two USB 2.0 ports, and supports Ethernet, WiFi at 2.4 GHz and 5 GHz, and Bluetooth. It also has camera, audio, and composite video ports.

Fig. 3 Experimental system demonstrating the Raspberry Pi 4 Model B board

Since the Raspberry Pi (RPi) works with C/C++, we install the MATLAB Support Package for Raspberry Pi Hardware. This package is used to configure the board and deploy MATLAB code onto it for the experimental evaluation, and to establish a network connection between the RPi and the computer; here, we connect the RPi to the computer over Ethernet. The package compiles MATLAB functions into C functions and deploys the C code onto the RPi. The Raspberry Pi 4 Model B is compatible with MATLAB release R2020a and higher. Finally, the Raspberry Pi Resource Monitor App is used to track the resource consumption of the speech enhancement algorithms.

For the simulation analysis in MATLAB, we use the audioread function to extract the sample values and sampling frequency from a given speech sample. However, this function is not supported for deployment on the RPi. Thus, for the experimental implementation of both speech enhancement algorithms, the sample values and sampling frequency are first extracted in MATLAB and then read by the deployed code. Table 1 lists the resources consumed on the RPi hardware for the experimental evaluation of both the spectral subtraction and the proposed speech enhancement algorithms.

Table 1 Resource consumption of Raspberry Pi 4 Model B for the experimental evaluation of both SS and IWF-based speech enhancement algorithms

9 Results and discussions

The speech enhancement algorithms are implemented in MATLAB R2021b on a Windows 10 laptop with an \(8^{th}\)-generation Intel Core i5 processor, Intel UHD Graphics 620, and 16 GB of memory.

Table 2 Segmental SNR of spectral subtraction with recursive noise estimation algorithm for AWGN noise at SNRs 0 dB, 2.5 dB, and 5 dB
Table 3 Segmental SNR of spectral subtraction with recursive noise estimation algorithm for exhibition noise at SNRs 0 dB, 2.5 dB, and 5 dB
Table 4 Segmental SNR of spectral subtraction with recursive noise estimation algorithm for station noise at SNRs 0 dB, 2.5 dB, and 5 dB
Table 5 Segmental SNR of spectral subtraction with recursive noise estimation algorithm for drone noise at SNRs 0 dB, 2.5 dB, and 5 dB
Table 6 Segmental SNR of spectral subtraction with recursive noise estimation algorithm for helicopter noise at SNRs 0 dB, 2.5 dB, and 5 dB
Table 7 Segmental SNR of spectral subtraction with recursive noise estimation algorithm for airplane noise at SNRs 0 dB, 2.5 dB, and 5 dB

Tables 2, 3, 4, 5, 6, and 7 present the segmental SNR of the spectral subtraction with recursive noise estimation algorithm for AWGN, exhibition, station, drone, helicopter, and airplane noise at SNRs of 0 dB, 2.5 dB, and 5 dB for different values of the smoothing parameter “\(\alpha\)”, for the speech uttered by both the male and the female speaker. It is observed from Tables 2, 3, 4, 5, 6, and 7 that, for each input SNR scenario, the SNRseg decreases as \(\alpha\) increases, dropping sharply at the extreme point \(\alpha =1\). Thus, from Tables 2, 3, 4, 5, 6, and 7, we select \(\alpha = 0.9\) as the most appropriate value.

Tables 8 and 9 present the PESQ, LLR, CD, and WSS results for the speech uttered by the male and female speakers, respectively, for both the spectral subtraction and the implicit Wiener filter-based speech enhancement algorithms. Here, the speech signal is degraded by stationary noise (AWGN) and non-stationary noise (exhibition, station, drone, helicopter, and airplane) at SNRs of 0 dB, 2.5 dB, and 5 dB. We also verified these performance metrics using the RPi experimental setup shown in Fig. 3.

It can be noticed from Table 8 that, for the speech uttered by the male speaker, the implicit Wiener filter-based speech enhancement algorithm performs better than the spectral subtraction algorithm. This holds for both stationary and non-stationary noise at each input SNR, except for station noise at 0 dB and helicopter noise at 2.5 dB, where the spectral subtraction algorithm has better PESQ, LLR, and CD. It is also observed from Table 8 that the WSS of the spectral subtraction algorithm is better for each noise type at 2.5 dB; at the other SNRs, however, the WSS of the implicit Wiener filter algorithm is better, indicating poor noise reduction by the spectral subtraction algorithm and, consequently, severe perceptual dissimilarity. The implicit Wiener filter algorithm performs best for airplane noise at 5 dB, giving the highest PESQ and the lowest LLR and WSS among the noise types, and for station noise at 5 dB, giving the lowest CD. This reflects that speech degraded by airplane noise at 5 dB is improved the most by the implicit Wiener filter algorithm. Finally, it is noticed that the implicit Wiener filter algorithm reduces non-stationary noise better than stationary noise.

Table 8 PESQ, LLR, CD and WSS of enhanced speech using Spectral subtraction (SS) and Implicit Wiener filter (IWF)-based speech enhancement algorithms for the speech pronounced by a Male speaker
Table 9 PESQ, LLR, CD and WSS of enhanced speech using Spectral subtraction (SS) and Implicit Wiener filter (IWF)-based speech enhancement algorithms for the speech pronounced by a Female speaker
Fig. 4
figure 4

PESQ, LLR and CD of the enhanced speech using the implicit Wiener filter-based speech enhancement algorithm, for a the speech uttered by a male speaker and degraded by each type of noise at 5 dB, and b the speech uttered by a female speaker and degraded by each type of noise at 5 dB

It can be noticed from Table 9 that, when a female speaker pronounces the speech utterance, the implicit Wiener filter-based speech enhancement algorithm outperforms the spectral subtraction algorithm. This can be observed for both stationary and non-stationary noise at each input SNR, except for the exhibition and station noise at 2.5 dB, where the spectral subtraction algorithm has better PESQ. The LLR, CD, and WSS of the spectral subtraction algorithm are better for the AWGN, exhibition, station, helicopter, and airplane noise at 2.5 dB. However, at the other SNRs, the LLR, CD, and WSS of the implicit Wiener filter algorithm are better, indicating poor noise reduction by the spectral subtraction algorithm and, consequently, severe perceptual dissimilarity. The implicit Wiener filter algorithm performs best for the helicopter noise at 5 dB, giving the highest PESQ and the lowest LLR and CD among the noise types, and for the airplane noise at 5 dB, giving the lowest WSS. This reflects that the speech degraded by the helicopter noise at 5 dB is improved the most accurately by the implicit Wiener filter algorithm. Finally, it is noticed that the implicit Wiener filter algorithm is more effective at reducing non-stationary noise than stationary noise.

Figure 4 compares the enhanced speech obtained with the implicit Wiener filter-based speech enhancement algorithm for the speech uttered by the male and female speakers, degraded by each type of noise at 5 dB. It is notable in Fig. 4a that the PESQ and LLR of the male-uttered speech degraded by airplane noise at 5 dB are better than those for the other noise types, whereas the CD is best for the station noise at 5 dB. This indicates that the speech degraded by the airplane and station noise at 5 dB is enhanced most accurately according to different measures. Similarly, it is notable in Fig. 4b that the PESQ, LLR, and CD of the female-uttered speech degraded by helicopter noise at 5 dB are better than those for the other noise types, indicating that this speech is enhanced most accurately. In addition, the PESQ and WSS of the male-uttered speech degraded by airplane noise at 5 dB are better than those of the female-uttered speech, whereas the LLR and CD of the female-uttered speech degraded by helicopter noise at 5 dB are better than those of the male-uttered speech. This indicates that the male- and female-uttered speeches are enhanced best with different quality measures in different noisy environments, which motivates the design of a noise-specific or context-specific speech enhancement algorithm. Overall, it again supports deploying the implicit Wiener filter-based speech enhancement algorithm to reduce non-stationary noise and perform speech enhancement tasks.

Fig. 5
figure 5

Time domain and spectrogram representations of the clean, noisy, and enhanced speech using the implicit Wiener filter-based speech enhancement algorithm, for the speech pronounced by a male speaker and degraded by AWGN noise at 5 dB

Fig. 6
figure 6

Time domain and spectrogram representations of the clean, noisy, and enhanced speech using the implicit Wiener filter-based speech enhancement algorithm, for the speech pronounced by a male speaker and degraded by airplane noise at 5 dB

Figures 5 and 6 show the time domain and spectrogram representations of the clean, noisy, and enhanced speech signals using the implicit Wiener filter-based speech enhancement algorithm, for the speech uttered by a male speaker and degraded by the AWGN and the airplane noise at 5 dB, respectively. It can be observed from Figs. 5 and 6 that the implicit Wiener filter-based speech enhancement algorithm yields an improved estimate of the speech signal, with superior performance for non-stationary noise compared to stationary noise. Moreover, it is worth noting that the spectral components filtered out by the proposed speech enhancement algorithm introduce a negligible amount of musical noise in some of the speech samples, which is barely perceptible and therefore acceptable.
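Spectrograms such as those in Figs. 5 and 6 can be produced from a Hann-windowed short-time FFT; a minimal NumPy sketch (the frame and hop sizes are illustrative assumptions, not the paper's analysis parameters):

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=128):
    """Magnitude spectrogram in dB from a Hann-windowed short-time FFT."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # one row per frame
    return 20 * np.log10(mag + 1e-10)           # floor avoids log(0)
```

Plotting the resulting matrix against time (frames) and frequency (bins) gives the time–frequency views compared in the figures.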

10 Conclusions and future work

In this paper, we have proposed and extended the implicit Wiener filter-based algorithm for speech enhancement in the presence of non-stationary noise (exhibition, station, drone, helicopter, and airplane) and stationary noise (additive white Gaussian noise). The proposed speech enhancement algorithm employs a first-order recursive noise estimation equation to estimate the noise from the degraded speech. The equation uses a smoothing parameter to continuously update the noise estimate in each frame. To obtain the most appropriate value of the smoothing parameter, the segmental SNR is calculated in each frame of the noisy speech spectrum. The implementation results show that, for the various types of noise degradation tested, the proposed speech enhancement algorithm outperforms the spectral subtraction algorithm. Moreover, the envelope of the estimated/enhanced speech signal is close to the envelope of the clean speech spectrum, and the enhanced speech signal is shown to have perceptual quality similar to that of the clean speech signal. The spectrogram of the enhanced speech is also similar to the clean speech spectrogram. In addition, an informal listening test demonstrates that the speech enhanced by the proposed algorithm sounds clearer than that enhanced by the spectral subtraction algorithm. The amount of musical noise introduced in the enhanced speech signal is negligible and therefore acceptable. Furthermore, it is also shown that the proposed speech enhancement algorithm runs on a low-power edge computing device, such as the Raspberry Pi.
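The procedure summarized above can be sketched in a few lines of Python. This is a minimal single-channel illustration only: the frame size, the segmental-SNR threshold, and the smoothing values (0.98 and 0.8) are hypothetical choices, not the paper's exact parameters.

```python
import numpy as np

def enhance(noisy, n_fft=256, hop=128):
    """Wiener-type gain with first-order recursive noise PSD estimation."""
    win = np.hanning(n_fft)
    noise_psd = None
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - n_fft + 1, hop):
        spec = np.fft.rfft(noisy[start:start + n_fft] * win)
        psd = np.abs(spec) ** 2
        if noise_psd is None:
            noise_psd = psd.copy()   # assume the first frame is noise-only
        # Frame-wise (segmental) SNR selects the smoothing parameter:
        # high SNR suggests speech presence, so update the noise slowly.
        snr_db = 10 * np.log10(np.sum(psd) / (np.sum(noise_psd) + 1e-12) + 1e-12)
        alpha = 0.98 if snr_db > 3.0 else 0.8         # illustrative mapping
        noise_psd = alpha * noise_psd + (1 - alpha) * psd  # first-order recursion
        gain = np.maximum(1.0 - noise_psd / (psd + 1e-12), 0.0)  # Wiener-type gain
        out[start:start + n_fft] += np.fft.irfft(gain * spec)    # overlap-add
    return out
```

The key design point is the SNR-dependent smoothing: during speech activity the noise estimate is updated slowly, so speech energy does not leak into the noise spectrum.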

As a result, the proposed speech enhancement algorithm may be a promising candidate for enhancing the speech signal in various speech processing applications, such as speech quality prediction metrics, speech recognition systems, speech identification systems, and speech coding systems, where crystal-clear speech is required for efficient communication. These speech enhancement algorithms can easily be integrated as a pre-processing block to improve system performance when implemented on an edge computing system. In addition, we also intend to examine different noise estimation techniques for designing and developing speech enhancement algorithms.