1 Introduction

Speech enhancement refers to the processing of speech corrupted by noise, echo, reverberation, etc. to improve its quality and intelligibility. In this paper, by speech enhancement, we refer to the problem of noise reduction. Noise reduction is relevant in several scenarios: mobile telephony in noisy environments, such as restaurants and busy traffic, suffers from degraded communication, and speech recognition systems [1] and hearing aids [2] require speech enhancement as a preprocessing step.

Speech enhancement algorithms can be broadly classified into single- and multi-channel algorithms based on the number of microphones used to acquire the input noisy speech. Multi-channel algorithms exhibit superior performance because of the additional spatial information available about the noise and speech sources. However, the need for single-channel speech enhancement cannot be ignored. For example, single microphone systems are preferred in low-cost mobile units. In addition, multi-channel methods include a single-channel algorithm as a post-processing step to suppress diffuse noise. In this paper, we focus on single-channel speech enhancement.

Single-channel speech enhancement has been a challenging research problem for the last four decades. Several techniques have been devised to arrive at efficient solutions for the problem. Among these, spectral subtraction is one of the earliest and simplest techniques [3]. Herein, an estimate of the noise magnitude spectrum is subtracted from the observed noisy magnitude spectrum to obtain an estimate of the clean speech magnitude spectrum. Several variations of this technique have been developed over the years [4]-[7]. Methods based on a statistical model of speech to estimate the speech spectral amplitude such as the minimum mean square error short-time spectral amplitude estimator (MMSE-STSA) method have been found to be successful [8]-[10]. The statistical approach explicitly uses the probability density function (pdf) of the speech and noise DFT coefficients. Also, it allows consideration of non-Gaussian prior distributions [11] and different ways of modeling the spectral data [12],[13]. Subspace-based algorithms [14] assume the clean speech to be confined to a subspace of the noisy space. The noisy vector space is decomposed into noise-only and speech-plus-noise subspaces. The noise subspace components are suppressed, and the speech-plus-noise subspace components are further processed. A comprehensive survey of these techniques is provided in [15]. However, most of these methods depend on an accurate estimate of the noise power spectrum, for example, estimation of the noise magnitude spectrum during silent segments in [3], or a priori signal-to-noise ratio (SNR) estimation in [9], or estimation of the noise covariance matrix in the subspace-based methods.

Noise estimation algorithms mainly include voice activity detection (VAD) based methods [16],[17] and buffer-based methods [18]-[20]. While VADs are unreliable at low SNRs, the buffer-based methods are not fast enough to track quickly varying noise in nonstationary conditions. Thus, while these algorithms perform well in stationary noise, their accuracy deteriorates under nonstationary conditions. An improvement over these algorithms is provided in [21], wherein a recursive approach is employed for online noise power spectral density (PSD) tracking by analytically retrieving the prior and posterior probabilities of speech absence, and the noise statistics, using a maximum likelihood-based criterion. A low-complexity, fast noise tracking algorithm is proposed in [22],[23].

Speech enhancement algorithms which employ trained models, such as codebooks [24]-[28], hidden Markov models (HMM) [29]-[31], Gaussian mixture models (GMM) [32], non-negative matrix factorization (NMF) models [33], dictionaries [34], etc., for speech and noise data are able to process noisy speech with sufficient accuracy even under nonstationary noise conditions. For example, codebook-based speech enhancement (CBSE) algorithms [25],[26] estimate the noise power spectrum for short segments of noisy speech, thus tracking nonstationary noise better than the buffer-based methods [18]. However, model-based methods typically employ a priori speech models which are trained on speech data from multiple speakers. For applications where the input noisy speech predominantly originates from a particular speaker, such as in mobile telephony, it is desirable to exploit the speaker dependency for better speech enhancement. Similarly, it might be beneficial to consider models trained on or adapted to a specific acoustic environment or language. In this paper, we introduce the notion of context-dependent (CD) models, where by the word ‘context’, we refer to one or more aspects such as the speaker, acoustic environment, emotion, language, speaking style, etc. of the input noisy speech. By employing CD models, improved enhancement of noisy speech can be expected. These models can be adapted online from a context-independent (CI) model during high SNR regions of the input signal. In this paper, we assume the availability of such adapted CD models and focus on the enhancement using the converged models.

When the context of the noisy input matches the context of the data used to train the model, CD models are expected to result in better speech enhancement than CI models. We refer to such scenarios as context match scenarios. However, in practice, the modeled and observed contexts may not always match, leading to a context mismatch. In such scenarios, a CD model may lead to poorer results, and so the CI model would be preferred. Thus, what is required is a method that retains the benefits of both the CD and CI models and provides robust results irrespective of the scenario at hand.

In this paper, we introduce a Bayesian framework to optimally combine the estimates from the CD and CI models to achieve robust speech enhancement under varying contexts. As different aspects of context can be expected to remain constant for an extended duration in the input noisy signal, the framework considers past information to improve the estimation process. Also, in practice, different aspects of context may occur at the same time, so the framework is designed to accommodate several models simultaneously.

As an example of the model-based algorithm, we use the CBSE technique that employs trained models of speech and noise linear predictive (LP) coefficients as priors [26]. A part of this work has been presented in [35]. This paper extends [35] by incorporating memory-based estimation, considering the use of multiple CD models, and presenting a detailed experimental analysis for different noise types, input SNRs, and aspects of context. The framework developed is general and can be used for other representations such as mel-frequency cepstrum coefficients, higher resolution PSDs, as well as other models such as GMMs, HMMs, and NMF.

The remainder of the paper is organized as follows. In the next section, a brief outline of the CBSE techniques [25],[26] is provided. Following this, we derive the memory-based Bayesian framework to optimally combine estimates from several codebooks (CD/CI). Thereafter, we present the experimental results for the proposed framework under varying contexts, noise types, and input SNRs. Finally, we summarize the conclusions.

2 Codebook-based speech enhancement

Consider an additive noise model of the observed noisy speech y(n):

y(n)=x(n)+w(n),
(1)

where n is the time index, x(n) is the clean speech signal, and w(n) is the noise signal.

We assume that speech and noise are statistically independent and follow zero-mean Gaussian distributions. Under these assumptions, Equation 1 leads to the following relation in the frequency domain:

P_y(\omega) = P_x(\omega) + P_w(\omega),
(2)

where P_y(ω), P_x(ω), and P_w(ω) are the PSDs of the observed noisy speech, clean speech, and noise, respectively, and ω is the angular frequency.

Consider a short-time segment of the observed noisy speech given by a vector y = [y(1), …, y(N)]^T, where N is the size of the segment. Let the vectors x and w be defined analogously. Let a_x = [a_{x_0}, …, a_{x_p}] denote the vector of LP coefficients for the short-time speech segment x corresponding to y, with a_{x_0} = 1 and p the speech LP model order. Similarly, let a_w = [a_{w_0}, …, a_{w_q}] denote the LP coefficient vector for the short-time noise segment w corresponding to y, with a_{w_0} = 1 and q the noise LP model order. Then, the speech and noise PSDs can be written as:

P_x(\omega) = \frac{g_x}{|A_x(\omega)|^2} \quad \text{and} \quad P_w(\omega) = \frac{g_w}{|A_w(\omega)|^2},
(3)

where g_x and g_w denote the variances of the prediction error for speech and noise, respectively; A_x(\omega) = \sum_{k=0}^{p} a_{x_k} e^{-j\omega k}; and A_w(\omega) = \sum_{k=0}^{q} a_{w_k} e^{-j\omega k}. Let

m_x = \{\mathbf{a}_x, g_x\}, \qquad m_w = \{\mathbf{a}_w, g_w\}.
(4)

m_x is a model describing the speech PSD, and m_w describes the noise PSD. Codebook-driven speech enhancement techniques [25],[26] estimate m_x and m_w for each short-time segment: a_x and a_w are selected from trained codebooks of speech and noise LP coefficient vectors, C_x and C_w, respectively, and the gain terms g_x and g_w are computed online, resulting in good performance in nonstationary noise. A maximum likelihood approach is adopted in [25] and a Bayesian minimum mean squared error (MMSE) approach in [26].

The estimates m̂_x and m̂_w are used to construct a Wiener filter to enhance the noisy speech in the frequency domain:

H(\omega) = \frac{\hat{P}_x(\omega)}{\hat{P}_x(\omega) + \hat{P}_w(\omega)},
(5)

where P̂_x(ω) and P̂_w(ω) are the estimates of the speech and noise PSDs described by m̂_x and m̂_w, respectively. The Wiener filter is one example of a gain function, and any other gain function can be employed using the obtained speech and noise PSD estimates.
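As an illustration, the following sketch (Python, assuming NumPy; all names are ours, not from [26]) evaluates the LP-based PSDs of Equation 3 on a uniform frequency grid and applies the Wiener gain of Equation 5 to a single windowed frame. Windowing, overlap-add, and the circular-convolution effects of frequency-domain filtering are omitted for brevity.

```python
import numpy as np

def lp_psd(a, g, n_fft=256):
    """PSD of a Gaussian LP model (Eq. 3): P(w) = g / |A(w)|^2,
    with A(w) = sum_k a_k exp(-j w k) and a_0 = 1."""
    A = np.fft.rfft(a, n_fft)
    return g / (np.abs(A) ** 2 + 1e-12)

def wiener_enhance(y_frame, a_x, g_x, a_w, g_w, n_fft=256):
    """Apply the Wiener gain of Eq. 5 built from the estimated speech and noise models."""
    Px = lp_psd(a_x, g_x, n_fft)
    Pw = lp_psd(a_w, g_w, n_fft)
    H = Px / (Px + Pw)                       # Wiener gain
    Y = np.fft.rfft(y_frame, n_fft)
    return np.fft.irfft(H * Y, n_fft)[:len(y_frame)]
```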

3 Bayesian estimation under varying contexts

In this section, we develop a Bayesian framework to obtain estimates of the speech and noise LP parameters, m x and m w , using one or more CD codebooks and a CI speech codebook. The CD codebooks improve estimation accuracy in the event of a context match, and the CI codebook provides robustness in the event of a context mismatch. The Bayesian framework needs to optimally combine the estimates from the various codebooks with no prior knowledge on whether or not the observed context matches the context modeled by the codebooks.

Consider K speech codebooks C_x^1, …, C_x^K, which include one or more CD codebooks and a CI codebook, depending on the contexts modeled. We consider a single noise codebook, C_w, corresponding to the encountered noise type. Robustness to different noise types can be provided by extending the notion of context dependency to the noise codebooks as well. To maintain the focus on context dependency in speech, we only consider a single noise codebook.

As m_x is a model for the speech PSD and m_w is a model for the noise PSD, m = [m_x, m_w] is a model for the noisy PSD, given by the sum of the corresponding speech and noise PSDs. We consider m to be a random variable and seek its MMSE estimate, given the noisy observation, the speech codebooks, and the noise codebook. Let ℳ_1 denote the collection of all models of the noisy PSD corresponding to the speech codebook C_x^1 and the noise codebook C_w. The set ℳ_1 consists of quadruplets {a_x^{1i}, g_x, a_w^j, g_w}, where a_x^{1i} is the i-th vector from the speech codebook C_x^1, a_w^j is the j-th vector from the noise codebook C_w, and the gain terms g_x and g_w are computed online for each combination of a_x^{1i} and a_w^j. Thus, ℳ_1 contains N_x^1 × N_w models, where N_x^1 is the number of vectors in C_x^1 and N_w is the number of vectors in C_w. The sets ℳ_2, …, ℳ_K are defined similarly, corresponding to the speech codebooks C_x^2, …, C_x^K. Let ℳ be the collection of all the models m contained in all the K speech codebooks and the noise codebook, i.e.,

\mathcal{M} = \mathcal{M}_1 \cup \mathcal{M}_2 \cup \cdots \cup \mathcal{M}_K.
(6)
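A minimal sketch of how a model set ℳ_k can be enumerated is given below. Here compute_gains is a hypothetical placeholder for the online gain computation of [26] (the likelihood-maximizing gains mentioned later in this section) and is not spelled out.

```python
from itertools import product

def build_model_set(speech_codebook, noise_codebook, y_frame, compute_gains):
    """Enumerate M_k: one quadruplet (a_x, g_x, a_w, g_w) for every pair of
    speech and noise codebook entries, with the gain terms computed online
    from the current noisy frame."""
    models = []
    for a_x, a_w in product(speech_codebook, noise_codebook):
        g_x, g_w = compute_gains(y_frame, a_x, a_w)   # placeholder, see [26]
        models.append((a_x, g_x, a_w, g_w))
    return models
```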

We consider the following K hypotheses:

H_k: speech codebook C_x^k best models the speech context for the current segment, 1 ≤ k ≤ K.

At a given time T, one of the K hypotheses is valid. This corresponds to a state, and we write S_T = H_k to denote that at time T, the most appropriate speech codebook for the observed noisy segment is C_x^k.

As mentioned in the introductory section, various aspects of context such as speaker, language, etc. can be expected to remain constant over multiple short-time segments, which can be exploited to improve estimation accuracy. The MMSE estimate of m for the T-th short-time segment is thus obtained using not just the current noisy segment y_T but a sequence that includes the current as well as past noisy segments, [y_1, …, y_T], where t is the segment index and y_t, 1 ≤ t ≤ T, is a vector containing N noisy speech samples. The MMSE estimate of m can be written as

\hat{m} = E[m \mid y_1, y_2, \ldots, y_T] = \sum_{k=1}^{K} p(S_T = H_k \mid y_1, y_2, \ldots, y_T) \times E[m \mid y_1, y_2, \ldots, y_T, S_T = H_k].
(7)

The two terms in the last line of (7) lend themselves to an intuitive interpretation. The second term, E[m | y_1, y_2, …, y_T, S_T = H_k], corresponds to an MMSE estimate of m assuming that the context is best described by H_k. The first term provides a relative importance score for this estimate, based on the likelihood that C_x^k is indeed the most appropriate speech codebook. The weighted summation corresponds to a soft estimation, which allows the coexistence of multiple contexts, e.g., speaker and language, each modeled by a separate codebook. Next, we derive expressions for both these terms.
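In code, the combination in Equation 7 is a weighted average. The sketch below assumes that each codebook-conditioned estimate has already been mapped to a fixed-length parameter vector (e.g., line spectral frequencies or sampled PSDs, as discussed at the end of this section).

```python
import numpy as np

def combine_estimates(posteriors, cond_estimates):
    """Soft combination of Eq. 7: weight each per-codebook MMSE estimate
    E[m | y_1..y_T, S_T = H_k] by p(S_T = H_k | y_1..y_T)."""
    posteriors = np.asarray(posteriors)          # shape (K,), non-negative, sums to 1
    cond_estimates = np.asarray(cond_estimates)  # shape (K, D) parameter vectors
    return posteriors @ cond_estimates           # weighted sum over hypotheses
```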

First, we consider the term p(S_T = H_k | y_1, y_2, …, y_T). Let

\alpha_T(k) = p(y_1, y_2, \ldots, y_T, S_T = H_k), \quad k = 1, 2, \ldots, K,
(8)

represent the forward probability as in standard HMM theory [36]. It can be recursively obtained as follows:

Basis step:

\alpha_1(k) = p(H_k)\, p(y_1 \mid H_k), \quad k = 1, 2, \ldots, K.
(9)

The prior probabilities in the absence of any observation can be assumed to be equal in Equation 9. Thus, p(H_k) = 1/K, i.e., all hypotheses are equally likely.

Induction step: The state S_T of the current noisy observation y_T could have been reached from any of the states of the previous frame with a particular transition probability. This can be modeled as

\alpha_{t+1}(k) = \left[ \sum_{l=1}^{K} \alpha_t(l)\, a_{lk} \right] p(y_{t+1} \mid H_k),
(10)

where 1 ≤ t ≤ T−1, l, k = 1, 2, …, K, and a_{lk} represents the transition probability of reaching state k from state l. We assume the a priori transition probabilities to be known beforehand for a given set of speech codebooks. In this paper, we assume them to be fixed such that a_{lk} takes higher values when l = k than otherwise, to capture the intuition that contexts such as speaker and language typically do not switch rapidly. Note that only the a priori transition probabilities are assumed to be fixed. The data-dependent part in Equation 10 is captured by the term p(y_{t+1} | H_k), whose computation is addressed in the following. Using Equation 8,

p(S_T = H_k \mid y_1, y_2, \ldots, y_T) = \frac{p(y_1, y_2, \ldots, y_T, S_T = H_k)}{p(y_1, y_2, \ldots, y_T)} = \frac{\alpha_T(k)}{\sum_{k'=1}^{K} \alpha_T(k')}.
(11)
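The recursion in Equations 9 to 11 can be implemented as sketched below. In practice, the forward probabilities would be kept in the log domain or renormalized every frame to avoid numerical underflow, a detail omitted here.

```python
import numpy as np

def forward_init(frame_likelihoods):
    """Basis step (Eq. 9) with equal priors p(H_k) = 1/K."""
    K = len(frame_likelihoods)
    return np.full(K, 1.0 / K) * frame_likelihoods

def forward_step(alpha_prev, trans, frame_likelihoods):
    """One induction step (Eq. 10) followed by normalization (Eq. 11).

    alpha_prev        : alpha_t(l), shape (K,)
    trans             : transition probabilities a_lk, shape (K, K)
    frame_likelihoods : p(y_{t+1} | H_k) from Eq. 18, shape (K,)
    """
    alpha_new = (alpha_prev @ trans) * frame_likelihoods
    posteriors = alpha_new / np.sum(alpha_new)   # p(S_T = H_k | y_1..y_T)
    return alpha_new, posteriors
```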

Next, we consider the term E[m | y_1, y_2, …, y_T, S_T = H_k] in Equation 7. In this section, we are interested in exploiting memory to ensure that the codebook that is most relevant to the current context receives a high likelihood, and this is captured by Equation 11. For a given codebook, E[m | y_1, y_2, …, y_T, S_T = H_k] provides an improved estimate of m by exploiting not only the current noisy observation y_T but also the past noisy segments. An expression for this term can be derived as in [26], where memory was restricted to the previous frame in view of the signal nonstationarity. Here, to retain the focus on selecting the appropriate context, we assume

E[m \mid y_1, y_2, \ldots, y_T, S_T = H_k] = E[m \mid y_T, S_T = H_k].
(12)

In the following, we ignore the term S_T and write E[m | y_T, S_T = H_k] as E[m | y_T, H_k] for brevity. For a given hypothesis H_k, we have

E[m \mid y_T, H_k] = \sum_{m \in \mathcal{M}} m\, p(m \mid y_T, H_k) = \sum_{m \in \mathcal{M}} m\, \frac{p(y_T \mid m, H_k)\, p(m \mid H_k)}{p(y_T \mid H_k)}.
(13)

Under a Gaussian LP model, m corresponds to an autocorrelation matrix R_y of y_T, which fully characterizes the pdf p(y_T | m) as in

p(y_T \mid m) = \frac{1}{(2\pi)^{N/2}\, |R_x + R_w|^{1/2}} \exp\!\left( -\frac{y_T^\top (R_x + R_w)^{-1} y_T}{2} \right),
(14)

where ⊤ denotes transpose, R_y = R_x + R_w, R_x = g_x (B_x^⊤ B_x)^{-1}, R_w = g_w (B_w^⊤ B_w)^{-1}, B_x is an N×N lower triangular Toeplitz matrix with [a_x, 0, …, 0] as the first column, and B_w is an N×N lower triangular Toeplitz matrix with [a_w, 0, …, 0] as the first column. Thus, given a model m, y_T is conditionally independent of H_k, and we have

p(y_T \mid m, H_k) = p(y_T \mid m), \quad k = 1, 2, \ldots, K.
(15)

The logarithm of the likelihood p(y_T | m) in Equation 14 can be efficiently computed in the frequency domain following the approach of [26]. The gain terms that maximize the likelihood can also be computed as in [26].
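For completeness, the sketch below evaluates the log of Equation 14 directly in the time domain (assuming NumPy and SciPy). This O(N^3) form is only illustrative; the paper computes the likelihood efficiently in the frequency domain following [26].

```python
import numpy as np
from scipy.linalg import toeplitz

def lp_covariance(a, g, N):
    """Covariance of an N-sample segment under a Gaussian LP model:
    R = g (B^T B)^{-1}, with B lower triangular Toeplitz built from a."""
    col = np.zeros(N)
    col[:len(a)] = a
    B = toeplitz(col, np.zeros(N))            # first column a, zeros above the diagonal
    return g * np.linalg.inv(B.T @ B)

def log_likelihood(y, a_x, g_x, a_w, g_w):
    """log p(y_T | m) of Eq. 14 for one model m = (a_x, g_x, a_w, g_w)."""
    N = len(y)
    R = lp_covariance(a_x, g_x, N) + lp_covariance(a_w, g_w, N)
    _, logdet = np.linalg.slogdet(R)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(R, y))
```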

Next, we consider the term p(m|H_k) in Equation 13. Under hypothesis H_k, the speech signal in the observed segment is best described by the codebook C_x^k. We assume all the models resulting from a given codebook are equally likely. This assumption is valid, in general, if the codebook is large and derived from a large, phonetically balanced training set.

Thus, assuming all the models resulting from C_x^k are equally likely, we have

p(m \mid H_k) = \begin{cases} \dfrac{1}{|\mathcal{M}_k|}, & m \in \mathcal{M}_k, \\ 0, & \text{otherwise}, \end{cases}
(16)

where |ℳ_k| is the cardinality of ℳ_k. From Equations 13 and 16, we have

E[m \mid y_T, H_k] = \frac{1}{|\mathcal{M}_k|} \sum_{m \in \mathcal{M}_k} m\, \frac{p(y_T \mid m)}{p(y_T \mid H_k)},
(17)

where

p(y_T \mid H_k) = \frac{1}{|\mathcal{M}_k|} \sum_{m \in \mathcal{M}_k} p(y_T \mid m),
(18)

and p(y_T | m) is given by Equation 14. Equation 18 is used in Equations 9 and 10 to obtain the forward probabilities. Finally, the required MMSE estimate m̂ is obtained by using Equations 11 and 17 in Equation 7. The speech and noise PSDs corresponding to m̂ can be obtained using Equation 3 and the Wiener filter from Equation 5. To ensure stability of the estimated LP parameters, the weighted sum in Equation 7 can be performed in the line spectral frequency domain. Note that the weights are non-negative and add up to unity, as is evident from Equation 11. Alternatively, as we are finally interested in the speech and noise PSDs to be used in a Wiener filter, the weighted sum can be performed in the power spectral domain.
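Putting Equations 17 and 18 together for one codebook, and performing the averaging in the power spectral domain as suggested above, gives the sketch below; lp_psd and log_likelihood refer to the earlier sketches, and a log-sum-exp formulation would be used in practice for p(y_T | H_k).

```python
import numpy as np

def per_hypothesis_estimate(models, y_frame, n_fft=256):
    """Eqs. 17-18 for one speech codebook C_x^k: likelihood-weighted average
    of the speech and noise PSDs over all models in M_k.

    models : list of quadruplets (a_x, g_x, a_w, g_w) forming M_k
    Returns (P_x_hat, P_w_hat, p(y_T | H_k)).
    """
    log_liks = np.array([log_likelihood(y_frame, *m) for m in models])
    w = np.exp(log_liks - np.max(log_liks))       # proportional to p(y_T | m)
    w /= np.sum(w)                                # normalization implied by Eq. 17
    Px = sum(wi * lp_psd(a_x, g_x, n_fft) for wi, (a_x, g_x, _, _) in zip(w, models))
    Pw = sum(wi * lp_psd(a_w, g_w, n_fft) for wi, (_, _, a_w, g_w) in zip(w, models))
    p_y_given_Hk = np.mean(np.exp(log_liks))      # Eq. 18; use log-sum-exp in practice
    return Px, Pw, p_y_given_Hk
```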

We conclude this section with some remarks on the calculation of the forward probabilities α_T, which for each codebook capture how well that codebook matches the context of the T-th input segment. As mentioned earlier, the proposed framework can be used to model context in speech as well as noise. When context is modeled by the speech codebooks, it was found to be beneficial to calculate α_T during speech-dominated segments, and during noise-dominated segments when modeling the noise context. The goal in computing α_T is to assess how well a given speech codebook matches the underlying context for a given input segment. If this computation is performed during speech-dominated frames, we obtain accurate values for α_T. However, inaccurate weight values may result when the computation is based on segments that lack sufficient information about the speech, such as silence or low-energy segments dominated by noise. In such situations, it is preferable to use the value of α_T computed in the last speech-dominated segment. This, in other words, assumes that the context of the current segment is the same as that of the past segment. This assumption is valid in general, as the context of speech is not expected to change rapidly from one speech burst to another. Thus, updating α_T only during speech-dominated segments does not affect performance. However, estimating α_T only during speech-dominated segments suffers from the disadvantage that there may not be a sufficient number of such segments in highly noisy conditions. Introducing a preliminary noise reduction step, e.g., using the long-term noise estimate from [18], and estimating α_T from the enhanced signal was seen to address this problem. Importantly, the estimation of the speech and noise PSDs and the resulting Wiener filter occurs for each short-time segment, providing good performance under nonstationary noise conditions.
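A simple realization of this update rule is sketched below; forward_step refers to the earlier sketch. The speech-dominance test here is a crude a posteriori SNR threshold against a long-term noise PSD estimate (e.g., obtained as in [18]); the 5 dB threshold is an illustrative choice of ours and is not taken from the paper.

```python
import numpy as np

def is_speech_dominated(y_frame, noise_psd_longterm, n_fft=256, thresh_db=5.0):
    """Crude test: frame is speech-dominated if its average a posteriori SNR
    (noisy power over a long-term noise PSD estimate) exceeds a threshold."""
    Y = np.abs(np.fft.rfft(y_frame, n_fft)) ** 2
    snr_db = 10.0 * np.log10(np.mean(Y) / (np.mean(noise_psd_longterm) + 1e-12))
    return snr_db > thresh_db

def update_codebook_posteriors(alpha, trans, frame_likelihoods,
                               speech_dominated, last_posteriors):
    """Update alpha_T only on speech-dominated frames; otherwise reuse the
    posteriors from the last speech-dominated frame."""
    if speech_dominated:
        return forward_step(alpha, trans, frame_likelihoods)
    return alpha, last_posteriors
```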

4 Experimental results

Experiments were performed to verify the robustness of the proposed framework under varying contexts. The context modeled by a trained CD codebook may or may not match that of the observed noisy input signal, leading to two scenarios:

Context match: the best-case scenario for a CD codebook

Context mismatch: the worst-case scenario for a CD codebook

The robustness of the proposed framework, employing both CD and CI codebooks, was tested under both scenarios. Two different sets of experiments were performed, which differed in the number of codebooks employed and the aspects of context modeled. The first set consisted of experiments with two speech codebooks, a CI speech codebook and a CD speech codebook, modeling the speaker and acoustic environment as aspects of context. The second set consisted of experiments with three speech codebooks, a CI speech codebook and two CD speech codebooks, to study the performance of the proposed framework with an increase in the number of codebooks employed. In addition to speaker and acoustic environment, this set modeled the speech type (normal, whispered, loud, etc.) of the input speech as an aspect of context.

In the following, we first describe the experimental setup and, thereafter, the various experiments along with the corresponding results.

4.1 Experimental setup

In all the experiments, the input noisy test utterances were enhanced under different context scenarios, using the CBSE technique [26] applied using the CD codebook alone, the CBSE technique applied using the CI codebook alone, and the proposed Bayesian scheme. We expect that in the context match scenarios, employing the CD codebook alone should lead to the best results. On the other hand, in the context mismatch scenarios, employing the CI codebook alone should lead to results better than those obtained using the CD codebook. The proposed method, however, is expected to provide robust results under varying contexts, i.e., results close to the best results in all scenarios. To serve as a reference for comparisons, we also include results when applying the Wiener filter (5) with a noise estimate obtained from a state-of-the-art noise estimation scheme [37].

The performance of these four processing schemes was compared using two measures: the improvement in segmental SNR (SSNR), referred to as Δ SSNR (in dB), and the improvement in the perceptual evaluation of speech quality (PESQ) score [38], referred to as Δ PESQ, averaged over all the enhanced utterances considered in a particular experiment.
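For reference, a common way to compute the segmental SNR is sketched below; the frame length and the per-frame clamping limits are typical choices and not necessarily those used in the reported experiments. Δ SSNR is the difference between the SSNR of the enhanced signal and that of the unprocessed noisy signal, both measured against the clean reference.

```python
import numpy as np

def segmental_snr(clean, processed, frame_len=256, floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR (dB) of 'processed' against the clean reference,
    with each frame's SNR clamped to [floor_db, ceil_db]."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - processed[i * frame_len:(i + 1) * frame_len]
        snr = 10.0 * np.log10((np.sum(s ** 2) + 1e-12) / (np.sum(e ** 2) + 1e-12))
        snrs.append(np.clip(snr, floor_db, ceil_db))
    return float(np.mean(snrs))

# Delta SSNR = segmental_snr(clean, enhanced) - segmental_snr(clean, noisy)
```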

The speech codebooks used in the experiments were trained using the Linde-Buzo-Gray (LBG) algorithm [39]. First, the clean speech training utterances, resampled to 8 kHz, were segmented into 50% overlapping Hann-windowed frames of 256 samples each, corresponding to a duration of 32 ms, over which the speech signal can be assumed stationary. Then, LP coefficient vectors of dimension 10, extracted from these frames, were clustered using the LBG algorithm to generate speech codebooks of 256 entries each, using the Itakura-Saito (IS) distortion [40] as the error criterion.
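The sketch below shows the two ingredients of this training step: per-frame LP coefficient extraction via the autocorrelation method (Levinson-Durbin), and clustering of the resulting vectors. A plain Euclidean k-means is used here as a simplified stand-in for the LBG algorithm with the Itakura-Saito distortion used in the paper.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def lp_coefficients(frame, order=10):
    """LP coefficients [1, a_1, ..., a_p] via Levinson-Durbin on the
    autocorrelation sequence of the (windowed) frame."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a[1:i + 1] += k * a[i - 1::-1][:i]      # reflection-coefficient update
        err *= (1.0 - k * k)
    return a

def train_codebook(frames, size=256, order=10):
    """Cluster the LP vectors of all training frames into a codebook."""
    vectors = np.array([lp_coefficients(f, order)[1:] for f in frames])  # drop a_0 = 1
    centroids, _ = kmeans2(vectors, size, minit='points')
    return centroids
```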

For training the CI speech codebook, 180 English language utterances of duration 3 to 4 seconds each were used, from 25 male and 25 female speakers from the WSJ speech database [41]. This codebook served as the CI codebook for all the experiments described in this section. The speakers whose utterances were used to train the CI codebook were not used in the test utterances. The different experiments use different CD codebooks and input noisy test data, which are discussed later along with the description of each experiment.

The different CD and CI speech codebooks considered in the experiments are of large size (256) and are derived from a large number of phonetically balanced sentences from the WSJ database. Moreover, the LBG algorithm used to generate the speech codebooks computes cluster centroids in an optimal fashion. All these factors ensure the validity of the assumption about equal probability of models in Equation 16.

Two noise codebooks for two different noise types, traffic and babble, with eight entries each, were trained similarly using LP coefficient vectors. For the traffic noise codebook, LP coefficient vectors of order 6 extracted from 2 min of nonstationary traffic noise were used. Since babble noise is speech-like, a higher LP model order of 10 was used when extracting LP coefficient training vectors from approximately 3 min of nonstationary babble noise. The same noise types were also used in the creation of test utterances at 0, 5, and 10 dB SNR for all the experiments; the actual noise samples were different from those used in training. The active speech level was computed using method B of ITU-T P.56 [42], and the noise was scaled and added to obtain the desired SNR.
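The noisy test material can be generated as sketched below; note that the speech level here is a plain RMS over the whole utterance, standing in for the active speech level per ITU-T P.56 method B used in the paper.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise and add it to the speech to reach the target SNR (dB)."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)          # stand-in for the P.56 active speech level
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```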

When processing the noisy files for a particular noise type, the appropriate noise codebook was used. In practice, a classified noise codebook scheme as discussed in [25] can be used. This scheme employs multiple noise codebooks, each trained for a particular noise type, and a maximum likelihood scheme is used to select the appropriate noise codebook for each short-time frame. This method was shown in [25] to perform as well as the case when the ideal noise codebook was used. We choose to use the ideal noise codebook to retain the focus on the performance of the proposed framework with regard to various aspects of the speech context.

4.2 Experiments with a single CD codebook

In this experiment, we test the proposed framework when two speech codebooks are employed, a CI and a CD codebook. The CD codebook models two aspects of context, ‘speaker’ and ‘acoustic environment’.

4.2.1 CD codebook training

For training the CD codebook, 180 English language utterances from a single speaker, of 3 to 4 s duration each, were used from the WSJ speech database. These utterances were convolved with an impulse response recorded at a distance of 50 cm from the microphone, in a reverberant room (T60 = 800 ms). This corresponds, for example, to hands-free mode on a mobile phone. In practice, this codebook is adapted during hands-free usage, making it dependent on both the speaker and acoustic environment.

4.2.2 Test utterances for the experiment

Two sets of ten clean speech utterances each were used to generate the noisy test data. Utterances for the first set were from the same speaker and acoustic environment as the data used to train the CD codebook, corresponding to the context match scenario and thus the best case for the CD codebook. The utterances themselves were different from those used in the training set.

The second set of clean utterances was from a speaker different from the one involved in training the CD codebook. These utterances were not convolved with the recorded impulse response (e.g., corresponding to hand-set mode on a mobile phone). Thus, both the speaker and acoustic environment were different from those used to train the CD codebook, corresponding to the context mismatch scenario and thus the worst case for the CD codebook.

4.2.3 Enhancement results

The test utterances were enhanced using the four schemes mentioned in Section 4.1. The transition probabilities a_{lk} were set to 0.99 when l = k and to 0.01 when l ≠ k, with l, k = 1, 2. Tables 1 and 2 provide the results for the best- and worst-case scenarios, respectively, in babble noise.

Table 1 Best-case scenario for a single CD codebook under babble noise
Table 2 Worst-case scenario for a single CD codebook under babble noise

As can be observed from Table 1, the best results are obtained for the CD codebook, as expected in a context match scenario. There is a significant difference between the results corresponding to the CD and CI codebooks, e.g., 0.19 for Δ PESQ and 1.3 dB for Δ SSNR, at 5 dB input SNR. Moreover, the standard deviation values indicate that the observed differences between the CD and CI results are statistically significant. This illustrates the benefit of employing CD codebooks. On the other hand, Table 2 demonstrates poorer performance when using the CD codebook compared to using the CI codebook, in a context mismatch scenario. The difference between their results is significant for Δ SSNR at all input SNRs, e.g., 1 dB at 0 dB input SNR, and for Δ PESQ at higher SNR, e.g., 0.22 at 10 dB input SNR. These results demonstrate the need for a scheme that appropriately combines the estimates obtained from the CD and CI codebooks, depending on the context at hand.

In Table 1, with increasing input SNR, there is an increase in Δ PESQ but a decrease in Δ SSNR for all schemes except the reference method. This can be explained by considering the trade-off between speech distortion and noise reduction.

In general, enhancement using a Wiener filter involves applying a gain (attenuation) function. When this gain is applied to the noisy speech, both the speech and noise components are attenuated. At lower input SNRs, the SSNR measure is dominated by the benefit of noise reduction and largely ignores the penalty due to speech distortion. In these scenarios, applying a stronger attenuation than is optimal can therefore increase the output SSNR, as it results in more noise attenuation (it also results in more speech attenuation, but that is not captured by the SSNR measure). This situation occurs when using a mismatched codebook, where the clean speech PSD is underestimated, resulting in more severe attenuation of the noisy speech. PESQ is closer to human perception, and we believe that the effect of speech distortion is better captured by PESQ, resulting in negative Δ PESQ values in these scenarios. At higher input SNRs, the SSNR measure also captures the effect of speech distortion. Since Δ PESQ captures well the decrease in speech distortion with increasing input SNR, Δ PESQ increases with input SNR in Table 1. The SSNR measure, on the other hand, is dominated at lower input SNRs by the benefit of noise reduction while ignoring the penalty due to speech distortion, so Δ SSNR is larger at lower input SNRs than at higher ones.

In contrast to the results obtained when using the CD and CI codebooks alone, the proposed framework achieves robust performance regardless of the observed context. For the best-case scenario (Table 1), its results are close to the CD results. For the worst-case scenario (Table 2), its results are close to the CI results. Thus, the proposed framework achieves results close to the best results for a given scenario, as desired. The reference scheme performs poorly due to the nonstationary nature of the noise. It may be noted that even using a mismatched codebook outperforms the reference scheme, highlighting the benefit of using a priori information for speech enhancement in nonstationary noise.

Tables 3 and 4 provide the results for the best- and worst-case scenarios, respectively, for the traffic noise case. Observations similar to those from Tables 1 and 2 can be made regarding the need for both the CI and CD codebooks for better performance and the robust performance of the proposed framework under varying contexts. Again, the reference method performs poorly due to the nonstationary nature of the noise.

Table 3 Best-case scenario for a single CD codebook under traffic noise
Table 4 Worst-case scenario for a single CD codebook under traffic noise

Comparing the Δ PESQ values for the best-case scenarios in Tables 1 and 3 for the two noise types shows a sharper drop from 5 to 0 dB input SNR for traffic noise (0.2) than for babble noise (0.06). A similar observation can be made for the Δ PESQ values for the worst-case scenarios in Tables 2 and 4. These observations indicate that the traffic noise case is more difficult to handle than babble noise at 0 dB input SNR. This is because the traffic noise considered in the experiments is highly nonstationary compared to the babble noise used.

4.2.4 Comparison of the proposed method with the MMSE-STSA method

In the above experiments, the reference method chosen for comparison with the proposed method uses the Wiener gain, as described by (5), computed using a state-of-the-art noise estimator [37]. This choice provides a fair comparison, since the proposed method also employs the Wiener gain function. The two approaches, however, differ in the computation of the speech and noise PSDs used to compute the Wiener gain.

Also of interest is a comparison of the proposed method with a popular statistical approach such as the MMSE-STSA method [9], the results of which are provided in Tables 5 and 6 for the babble noise case. Table 5 corresponds to the context match scenario, wherein the context of the CD codebook matches that of the input noisy speech. Here, the performance of the proposed method is superior to that of the MMSE-STSA technique, especially in terms of PESQ. The advantage of the proposed approach is larger at lower SNRs. For the mismatch scenario, the performance of the two methods is comparable, as shown in Table 6. Note that the Wiener filter is just one example of a gain function that can use the speech and noise PSDs estimated using the proposed method. The estimated speech and noise PSDs can also be used to compute the a priori and a posteriori SNRs for use in the MMSE-STSA gain function. This is, however, beyond the scope of this paper and is a topic for future work.

Table 5 Comparison of the proposed method with the MMSE-STSA technique for context match scenario corresponding to Table 1
Table 6 Comparison of the proposed method with the MMSE-STSA technique for context mismatch scenario corresponding to Table 2

4.3 Experiments with multiple CD codebooks

In the previous subsection, we tested the proposed framework under conditions when a single CD codebook was employed along with a CI codebook. Multiple aspects of context were modeled by the single CD codebook. In practice, different contexts will be modeled by different CD codebooks. In this subsection, we experiment with the case of two CD codebooks along with one CI codebook.

4.3.1 CD codebook training

The first CD codebook, referred to as CD-1, models a particular speaker and a speech type. The speech type considered is ‘whisper’ speech. The speech produced in the case of certain speech disorders (dysphonic speech) is similar to whispered speech. CD-1 was trained using around 10 min of whispered speech data from a single speaker from the CHAINS database [43].

The second CD codebook employed, referred to as CD-2, models normal speech in reverberant conditions for the same speaker as modeled by CD-1. CD-2 was trained using training utterances of duration around 10 min, convolved with the same impulse response as used in the previous experiments (corresponding to a distance of 50 cm from the microphone, in a reverberant room with T60 = 800 ms).

The two codebooks differ in terms of speaking style, whispered and normal, and also the acoustic environment. The separation in terms of acoustic environment is useful, e.g., to have different CD models for a particular user of the mobile phone to cater to hand-set and hands-free modes of operation. Note that the CI codebook is speaker-independent and corresponds to hand-set mode.

4.3.2 Test utterances for the experiment

Two sets of experiments were performed, pertaining to the matching codebook being CD-1 or CD-2. The first set consisted of test utterances generated by adding noise to ten clean ‘whispered’ speech utterances from the same speaker as used to train CD-1. Similarly, the second set of experiments had test utterances generated from ten clean ‘normal’ speech utterances from the same speaker as modeled by CD-2, convolved with the same recorded impulse response as used in training CD-2, constituting the context match scenario for CD-2. In both sets of experiments, the test utterances were different from those used in training the codebooks. The noisy test utterances were generated as described at the beginning of this section.

4.3.3 Enhancement results

Enhancement using multiple CD codebooks was performed by setting the transition probabilities a_{lk} to 0.9 when l = k and to 0.05 when l ≠ k, with l, k = 1, 2, 3. Tables 7 and 8 present the matching scenario results for CD-1 and CD-2, respectively, for the babble noise case. Similarly, Tables 9 and 10 present the matching scenario results for CD-1 and CD-2, respectively, for the traffic noise case. As can be observed from these tables, the best results in all the scenarios occur for the matching CD codebook. The difference between context match and mismatch (between CD-1 and CD-2/CI, and between CD-2 and CD-1/CI) is significant, especially in the Δ PESQ scores. The differences in Δ SSNR values are significant at higher input SNRs. As the number of codebooks employed by the proposed framework increases, there is a possibility of a negative influence from the inappropriate codebooks on the model estimate. However, from Tables 7, 8, 9, and 10, we observe that for the case of two CD codebooks and one CI codebook, the results for the proposed framework are close to those of the matched codebook at all input SNRs and for both noise types, confirming the robustness of the proposed framework under varying contexts.

Table 7 Results using two CD codebooks and one CI codebook, for context match scenario for CD-1 under babble noise
Table 8 Results using two CD codebooks and one CI codebook, for context match scenario for CD-2 under babble noise
Table 9 Results using two CD codebooks and one CI codebook, for context match scenario for CD-1 under traffic noise
Table 10 Results using two CD codebooks and one CI codebook, for context match scenario for CD-2 under traffic noise

5 Conclusions

In this paper, we have introduced the notion of context-dependent (CD) models for speech enhancement methods that use trained models of speech and noise parameters. CD speech models can be trained on one or more aspects of speech context such as speaker, acoustic environment, speaking style, etc., and CD noise models can be trained for specific noise types. Using CD models results in better speech enhancement performance compared to using context-independent (CI) models when the noisy speech shares the same context as the trained codebook. The risk, however, is degraded performance in the event of a context mismatch. Thus, the CD and CI models need to co-exist in a practical implementation. The Bayesian speech enhancement framework proposed in this paper obtains estimates of speech and noise parameters based on all available models, requires no prior information on the context at hand, and automatically obtains results close to those obtained when using the appropriate codebook for a given context scenario as seen from experiments with various aspects of speech context.

The improved performance of the proposed method is at the cost of increased computational complexity. As opposed to employing a single CI model, the proposed method involves computations with multiple models. The computations related to each model can, however, occur simultaneously, which allows for a parallel implementation.

The proposed method has been developed using the codebook-based speech enhancement system as an example of a data-driven model-based speech enhancement system. Other model-based schemes, such as those using HMMs, GMMs, and NMF, can benefit in a similar manner, and the extension is a topic for future work. The theory developed in this paper is directly applicable to context-dependent noise codebooks and can be used for robust noise estimation under varying noise conditions.

In this paper, context-dependent models are assumed to be available. In practice, they need to be trained online. For several aspects of context, a separate enrollment stage may not be meaningful and the models need to be progressively adapted during usage when the SNR is high. Distinguishing between different aspects of context and training separate models for them online is another topic for future work.

The codebooks considered in this paper consist of vectors of tenth-order LP coefficients, which model the smoothed spectral envelope. It will be worthwhile to investigate the suitability of other spectral representations such as higher resolution PSDs, mel-frequency cepstral coefficients, etc., to capture context-dependent information. Different features may be employed depending on which aspects of context are to be modeled and depending on the application, e.g., whether the enhancement is for speech communication, speaker identification, or for speech recognition.

Authors’ information

This work was performed when SS was with Philips Research Laboratories, Eindhoven, The Netherlands.