1 Review

1.1 Introduction

The problem of speech enhancement, or noise reduction as it is also sometimes called, is a well-known, longstanding problem with important applications in, for example, speech communication systems and hearing aids, where additive noise can, and often does, have a detrimental impact on the speech quality. Although the problem is a classical one and many solutions have been proposed throughout the years, it has arguably not been well understood, even for the comparably simple case of linear filters. Indeed, it is not until quite recently that steps have been taken to accurately formulate the problem and characterize the desirable properties of possible solutions. Simply put, the performance of speech enhancement methods can be assessed in terms of two quantities, namely noise reduction and speech distortion, and an optimal solution to the speech enhancement problem would, thus, explicitly take both into account. As an example that this has historically not been done, consider the classical and well-known Wiener filter. It is usually derived from a mean-square error (MSE) criterion, and it is not until recently that its properties in terms of noise reduction and speech distortion have been thoroughly analyzed [1]. Although it was shown in [1] that the Wiener filter indeed improves the signal-to-noise ratio (SNR) (this had not been shown before), it does so without any explicit control over the amount of distortion incurred on the speech signal. As a result, other filters are worth considering. The minimum variance distortionless response (MVDR) filter, in principle, guarantees that no distortion is incurred on the speech signal while the noise is reduced as much as possible. The maximum SNR filter, on the other hand, yields the highest possible output SNR but does so at the cost of a considerable amount of speech distortion. Methods that compete with linear filtering include spectral subtraction methods [2], subspace methods [3, 4], and statistical methods [5–7].

In this paper, we continue the research into methods for speech enhancement based on linear filtering. More specifically, we provide a brief overview of linear filters derived from the conventional approach and the recently introduced orthogonal decomposition approach. We do so in a more general framework than what is typical: the speech enhancement problem is stated as the problem of finding a rectangular filter matrix for estimating the speech signal vector from a noisy signal observation vector. Using the two aforementioned approaches, we derive the maximum SNR, Wiener, tradeoff, and MVDR filters and analyze and relate their properties. All of the derived filters are based on second-order statistics of the observed signal as well as the noise. While estimation of these statistics is not considered herein, there exist multiple, well-known methods for conducting this estimation in practice both in single-channel [8, 9] and multichannel [10–12] scenarios. Finally, we demonstrate and discuss the application of the filters in various settings, including time and frequency domain enhancement and single- and multichannel enhancement.

The rest of the paper is organized as follows. In Section 1.2, we introduce the signal model and basic assumptions and state the problem at hand. We then, in Section 1.3, address the problem using the conventional approach, define various useful performance measures, and derive and compare some optimal filters. We then proceed to present an alternative approach based on the orthogonal decomposition in Section 1.4, and we use this to derive optimal filters. These are then also compared in terms of their noise reduction and speech distortion properties. In Section 1.5, we show how the two approaches can be applied in various speech enhancement contexts before concluding on the work in Section 2.

1.2 Signal model and problem formulation

We consider the very general signal model of an observation signal vector of length M:

$$\mathbf{y} = \begin{bmatrix} y_1 & y_2 & \cdots & y_M \end{bmatrix}^T = \mathbf{x} + \mathbf{v},$$
(1)

where the superscript $^T$ is the transpose operator, and $\mathbf{x}$ and $\mathbf{v}$ are the speech and noise signal vectors, respectively, which are defined similarly to the noisy signal vector, $\mathbf{y}$. We assume that the components of the two vectors $\mathbf{x}$ and $\mathbf{v}$ are zero mean, stationary, and circular. We further assume that these two vectors are uncorrelated, i.e., $E(\mathbf{x}\mathbf{v}^H) = E(\mathbf{v}\mathbf{x}^H) = \mathbf{0}_{M \times M}$, where $E(\cdot)$ denotes mathematical expectation, the superscript $^H$ is the conjugate-transpose operator, and $\mathbf{0}_{M \times M}$ is a matrix of size $M \times M$ with all its elements equal to 0. In this context, the correlation matrix (of size $M \times M$) of the observations is:

$$\mathbf{\Phi}_{\mathbf{y}} = E(\mathbf{y}\mathbf{y}^H) = \mathbf{\Phi}_{\mathbf{x}} + \mathbf{\Phi}_{\mathbf{v}},$$
(2)

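where $\mathbf{\Phi}_{\mathbf{x}} = E(\mathbf{x}\mathbf{x}^H)$ and $\mathbf{\Phi}_{\mathbf{v}} = E(\mathbf{v}\mathbf{v}^H)$ are the correlation matrices of $\mathbf{x}$ and $\mathbf{v}$, respectively. In the rest of this paper, we assume that the rank of the speech correlation matrix, $\mathbf{\Phi}_{\mathbf{x}}$, is equal to $P \le M$ and the rank of the noise correlation matrix, $\mathbf{\Phi}_{\mathbf{v}}$, is equal to $M$.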

To derive appropriate performance measures and optimal linear filters that can achieve a clear objective according to these measures, it is of great importance to define, without any ambiguity, the desired signal that we want to estimate or extract from the observations. Also, in general, $\mathbf{y}$ should be written explicitly as a function of this desired signal. In some contexts, $x_1$, the first element of $\mathbf{x}$, is the desired signal; in other situations, the whole vector $\mathbf{x}$ or part of it is the desired signal vector. Therefore, in a general manner, our desired signal vector is defined as:

$$\mathbf{x}_Q = \begin{bmatrix} x_1 & x_2 & \cdots & x_Q \end{bmatrix}^T,$$
(3)

where $1 \le Q \le M$. In the same way, we define the vector $\mathbf{v}_Q$ as the first $Q$ components of $\mathbf{v}$. Then, the objective of speech enhancement (or noise reduction) is to estimate $\mathbf{x}_Q$ from $\mathbf{y}$. This should be done in such a way that the noise is reduced as much as possible with no or little distortion of the desired signal vector [1, 13–15]. In the rest of this study, we consider two important cases: without (conventional approach) and with the orthogonal decomposition of the speech signal vector.

1.3 Speech enhancement with the conventional approach

1.3.1 Principle

Our objective is to estimate $\mathbf{x}_Q$ from $\mathbf{y}$, even though $\mathbf{y}$ is not an explicit function of $\mathbf{x}_Q$. With linear filtering techniques [3, 4, 16–20], the desired signal vector is estimated as:

$$\mathbf{z} = \mathbf{H}\mathbf{y} = \mathbf{H}(\mathbf{x} + \mathbf{v}) = \mathbf{x}_{\mathrm{fd}} + \mathbf{v}_{\mathrm{rn}},$$
(4)

where z is supposed to be the estimate of x Q ,

$$\mathbf{H} = \begin{bmatrix} \mathbf{h}_1^H \\ \mathbf{h}_2^H \\ \vdots \\ \mathbf{h}_Q^H \end{bmatrix}$$
(5)

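is a rectangular filtering matrix of size $Q \times M$, $\mathbf{h}_q$, $q = 1, 2, \ldots, Q$, are complex-valued filters of length $M$, $\mathbf{x}_{\mathrm{fd}} = \mathbf{H}\mathbf{x}$ is the filtered desired signal, and $\mathbf{v}_{\mathrm{rn}} = \mathbf{H}\mathbf{v}$ is the residual noise. We deduce that the correlation matrix of $\mathbf{z}$ is: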

$$\mathbf{\Phi}_{\mathbf{z}} = \mathbf{\Phi}_{\mathbf{x}_{\mathrm{fd}}} + \mathbf{\Phi}_{\mathbf{v}_{\mathrm{rn}}},$$
(6)

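where $\mathbf{\Phi}_{\mathbf{x}_{\mathrm{fd}}} = \mathbf{H}\mathbf{\Phi}_{\mathbf{x}}\mathbf{H}^H$ and $\mathbf{\Phi}_{\mathbf{v}_{\mathrm{rn}}} = \mathbf{H}\mathbf{\Phi}_{\mathbf{v}}\mathbf{H}^H$.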

An interesting particular case is Q=P=1. In this scenario, Equation 4 simplifies to:

$$z = \mathbf{h}^H \mathbf{y},$$
(7)

where h is a complex-valued filter of length M. Since Φ x is a rank 1 matrix, it can be written as:

$$\mathbf{\Phi}_{\mathbf{x}} = \phi_{x_1} \mathbf{d}\mathbf{d}^H,$$
(8)

where $\phi_{x_1} = E(|x_1|^2)$ is the variance of $x_1$ and $\mathbf{d}$ is a vector of length $M$, whose first element is equal to 1.

1.3.2 Performance measures

We are now ready to define the most important performance measures in the context of linear filtering.

The input SNR is defined as:

$$\mathrm{iSNR} = \frac{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q})}{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{v}_Q})},$$
(9)

where tr(·) denotes the trace of a square matrix, and Φ x Q and Φ v Q are the correlation matrices (of size Q×Q) of x Q and v Q , respectively.

The output SNR, obtained from Equation 6, helps quantify the SNR after filtering. It is given by:

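$$\mathrm{oSNR}(\mathbf{H}) = \frac{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_{\mathrm{fd}}})}{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{v}_{\mathrm{rn}}})} = \frac{\mathrm{tr}(\mathbf{H}\mathbf{\Phi}_{\mathbf{x}}\mathbf{H}^H)}{\mathrm{tr}(\mathbf{H}\mathbf{\Phi}_{\mathbf{v}}\mathbf{H}^H)}.$$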
(10)

Then, the main objective of speech enhancement is to find an appropriate H that makes the output SNR greater than the input SNR. Consequently, the quality of the noisy signal may be enhanced.

The noise reduction factor quantifies the amount of noise being rejected by H. This quantity is defined as the ratio of the power of the original noise over the power of the noise remaining after filtering, i.e.,

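$$\xi_{\mathrm{nr}}(\mathbf{H}) = \frac{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{v}_Q})}{\mathrm{tr}(\mathbf{H}\mathbf{\Phi}_{\mathbf{v}}\mathbf{H}^H)}.$$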
(11)

Any good choice of H should lead to ξnr(H)≥1, in which case the noise has been attenuated.

The desired speech signal can be distorted by the rectangular filtering matrix. Therefore, the speech reduction factor is defined as:

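$$\xi_{\mathrm{sr}}(\mathbf{H}) = \frac{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q})}{\mathrm{tr}(\mathbf{H}\mathbf{\Phi}_{\mathbf{x}}\mathbf{H}^H)}.$$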
(12)

For optimal filters, we should have ξsr(H)≥1 as the optimal filter would otherwise amplify the desired signal.

By making the appropriate substitutions, one can derive the relationship among the measures defined so far, i.e.,

$$\frac{\mathrm{oSNR}(\mathbf{H})}{\mathrm{iSNR}} = \frac{\xi_{\mathrm{nr}}(\mathbf{H})}{\xi_{\mathrm{sr}}(\mathbf{H})}.$$
(13)

This fundamental expression indicates the equivalence between gain/loss in SNR and distortion (for both speech and noise).
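As a minimal numerical sketch of the measures in Equations 9 to 13, the following NumPy snippet evaluates them for an arbitrary filtering matrix. The function name `performance_measures`, the dimensions, and the randomly generated correlation matrices are illustrative assumptions and do not come from any reference implementation.

```python
import numpy as np

def performance_measures(H, Phi_x, Phi_v):
    """Return (iSNR, oSNR, xi_nr, xi_sr) for a Q x M filtering matrix H."""
    Q = H.shape[0]
    tr = lambda A: np.trace(A).real
    sig_in = tr(Phi_x[:Q, :Q])               # tr(Phi_{x_Q}): first Q components of x
    noise_in = tr(Phi_v[:Q, :Q])             # tr(Phi_{v_Q})
    sig_out = tr(H @ Phi_x @ H.conj().T)     # power of the filtered desired signal
    noise_out = tr(H @ Phi_v @ H.conj().T)   # power of the residual noise
    return (sig_in / noise_in, sig_out / noise_out,
            noise_in / noise_out, sig_in / sig_out)

# Hypothetical test data: random Hermitian correlation matrices.
rng = np.random.default_rng(0)
M, Q, P = 6, 2, 3
A = rng.standard_normal((M, P)) + 1j * rng.standard_normal((M, P))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_x = A @ A.conj().T                       # rank P <= M
Phi_v = B @ B.conj().T + np.eye(M)           # full rank M
H = rng.standard_normal((Q, M)) + 1j * rng.standard_normal((Q, M))

iSNR, oSNR, xi_nr, xi_sr = performance_measures(H, Phi_x, Phi_v)
print(oSNR / iSNR, xi_nr / xi_sr)            # the two ratios agree (Equation 13)
```

Because $\mathbf{x}_Q$ consists of the first $Q$ components of $\mathbf{x}$, its correlation matrix is simply the upper-left $Q \times Q$ block of $\mathbf{\Phi}_{\mathbf{x}}$, which is what the slicing above relies on.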

Another way to measure the distortion of the desired signal vector due to the filtering operation is via the speech distortion index defined as:

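$$\upsilon_{\mathrm{sd}}(\mathbf{H}) = \frac{E\left[(\mathbf{x}_{\mathrm{fd}} - \mathbf{x}_Q)^H (\mathbf{x}_{\mathrm{fd}} - \mathbf{x}_Q)\right]}{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q})}.$$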
(14)

The speech distortion index is always greater than or equal to 0 and should be upper bounded by 1 for optimal rectangular filtering matrices (see endnote a); so the higher the value of $\upsilon_{\mathrm{sd}}(\mathbf{H})$ is, the more the desired signal is distorted.

We define the error signal vector between the estimated and desired signals as:

$$\mathbf{e} = \mathbf{z} - \mathbf{x}_Q = \mathbf{H}\mathbf{y} - \mathbf{x}_Q,$$
(15)

which can also be expressed as the sum of two uncorrelated error signal vectors:

$$\mathbf{e} = \mathbf{e}_{\mathrm{ds}} + \mathbf{e}_{\mathrm{rs}},$$
(16)

where

$$\mathbf{e}_{\mathrm{ds}} = \mathbf{x}_{\mathrm{fd}} - \mathbf{x}_Q$$
(17)

is the signal distortion due to the rectangular filtering matrix and

$$\mathbf{e}_{\mathrm{rs}} = \mathbf{v}_{\mathrm{rn}}$$
(18)

represents the residual noise. Therefore, the MSE criterion is:

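$$J(\mathbf{H}) = \mathrm{tr}\left[E(\mathbf{e}\mathbf{e}^H)\right] = \mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q}) + \mathrm{tr}(\mathbf{H}\mathbf{\Phi}_{\mathbf{y}}\mathbf{H}^H) - \mathrm{tr}(\mathbf{H}\mathbf{\Phi}_{\mathbf{x}}\mathbf{I}_{\mathrm{i}}^T) - \mathrm{tr}(\mathbf{I}_{\mathrm{i}}\mathbf{\Phi}_{\mathbf{x}}\mathbf{H}^H),$$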
(19)

where

$$\mathbf{I}_{\mathrm{i}} = \begin{bmatrix} \mathbf{I}_Q & \mathbf{0}_{Q \times (M-Q)} \end{bmatrix}$$
(20)

is the identity filtering matrix, with $\mathbf{I}_Q$ being the $Q \times Q$ identity matrix. Using the fact that $E(\mathbf{e}_{\mathrm{ds}}\mathbf{e}_{\mathrm{rs}}^H) = \mathbf{0}_{Q \times Q}$, $J(\mathbf{H})$ can be expressed as the sum of two other MSEs, i.e.,

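$$J(\mathbf{H}) = \mathrm{tr}\left[E(\mathbf{e}_{\mathrm{ds}}\mathbf{e}_{\mathrm{ds}}^H)\right] + \mathrm{tr}\left[E(\mathbf{e}_{\mathrm{rs}}\mathbf{e}_{\mathrm{rs}}^H)\right] = J_{\mathrm{ds}}(\mathbf{H}) + J_{\mathrm{rs}}(\mathbf{H}),$$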
(21)

where

$$J_{\mathrm{ds}}(\mathbf{H}) = \mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q})\, \upsilon_{\mathrm{sd}}(\mathbf{H})$$
(22)

and

$$J_{\mathrm{rs}}(\mathbf{H}) = \frac{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{v}_Q})}{\xi_{\mathrm{nr}}(\mathbf{H})}.$$
(23)

We deduce that

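$$\frac{J_{\mathrm{ds}}(\mathbf{H})}{J_{\mathrm{rs}}(\mathbf{H})} = \mathrm{iSNR} \times \xi_{\mathrm{nr}}(\mathbf{H}) \times \upsilon_{\mathrm{sd}}(\mathbf{H}) = \mathrm{oSNR}(\mathbf{H}) \times \xi_{\mathrm{sr}}(\mathbf{H}) \times \upsilon_{\mathrm{sd}}(\mathbf{H}).$$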
(24)

We observe how the MSEs are related to the different performance measures.
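The decomposition of the MSE in Equations 19 to 24 can be checked numerically. The sketch below is only an illustration under assumed, randomly generated correlation matrices and an arbitrary filtering matrix; all variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
M, Q = 5, 2
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_x = A @ A.conj().T
Phi_v = B @ B.conj().T + np.eye(M)
Phi_y = Phi_x + Phi_v                        # Equation 2
H = rng.standard_normal((Q, M)) + 1j * rng.standard_normal((Q, M))
I_i = np.eye(Q, M)                           # [I_Q 0], Equation 20
tr = lambda X: np.trace(X).real

# Equation 19: total MSE.
J = (tr(Phi_x[:Q, :Q]) + tr(H @ Phi_y @ H.conj().T)
     - tr(H @ Phi_x @ I_i.T) - tr(I_i @ Phi_x @ H.conj().T))

# Distortion and residual-noise parts (Equation 21).
D = H - I_i                                  # e_ds = (H - I_i) x
J_ds = tr(D @ Phi_x @ D.conj().T)
J_rs = tr(H @ Phi_v @ H.conj().T)

v_sd = J_ds / tr(Phi_x[:Q, :Q])              # speech distortion index, Equation 22
xi_nr = tr(Phi_v[:Q, :Q]) / J_rs             # noise reduction factor, Equation 23
iSNR = tr(Phi_x[:Q, :Q]) / tr(Phi_v[:Q, :Q])
print(J, J_ds + J_rs)
```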

1.3.3 Optimal filters

Let $\lambda_{\max}$ be the maximum eigenvalue of the matrix $\mathbf{\Phi}_{\mathbf{v}}^{-1}\mathbf{\Phi}_{\mathbf{x}}$ with corresponding eigenvector $\mathbf{b}_{\max}$. It can be shown that the maximum SNR filtering matrix is given by [20]:

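$$\mathbf{H}_{\max} = \begin{bmatrix} \beta_1 \mathbf{b}_{\max}^H \\ \beta_2 \mathbf{b}_{\max}^H \\ \vdots \\ \beta_Q \mathbf{b}_{\max}^H \end{bmatrix},$$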
(25)

where $\beta_q$, $q = 1, 2, \ldots, Q$, are arbitrary complex numbers with at least one of them different from 0. The corresponding output SNR is:

$$\mathrm{oSNR}(\mathbf{H}_{\max}) = \lambda_{\max}.$$
(26)

The output SNR with the maximum SNR filtering matrix is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(\mathbf{H}_{\max}) \ge \mathrm{iSNR}$. We also have $\mathrm{oSNR}(\mathbf{H}) \le \lambda_{\max}$, $\forall \mathbf{H}$. The best way to find the $\beta_q$'s is by minimizing distortion. By substituting $\mathbf{H}_{\max}$ into the distortion-based MSE and minimizing with respect to the $\beta_q$'s, we get

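$$\mathbf{H}_{\max} = \mathbf{I}_{\mathrm{i}} \mathbf{\Phi}_{\mathbf{x}} \frac{\mathbf{b}_{\max}\mathbf{b}_{\max}^H}{\lambda_{\max}} = \mathbf{I}_{\mathrm{i}} \mathbf{\Phi}_{\mathbf{v}} \mathbf{b}_{\max}\mathbf{b}_{\max}^H.$$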
(27)
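A small sketch of the maximum SNR filtering matrix may help here. It assumes randomly generated correlation matrices and, as an explicit assumption on our part, the normalization $\mathbf{b}_{\max}^H \mathbf{\Phi}_{\mathbf{v}} \mathbf{b}_{\max} = 1$, under which the two forms in Equation 27 coincide. The helper name `osnr` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
M, Q, P = 6, 2, 3
A = rng.standard_normal((M, P)) + 1j * rng.standard_normal((M, P))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_x = A @ A.conj().T                       # rank P
Phi_v = B @ B.conj().T + np.eye(M)           # rank M
tr = lambda X: np.trace(X).real

def osnr(H):
    """Output SNR of Equation 10."""
    return tr(H @ Phi_x @ H.conj().T) / tr(H @ Phi_v @ H.conj().T)

# Eigen-decomposition of Phi_v^{-1} Phi_x; its eigenvalues are real and >= 0.
lam, vecs = np.linalg.eig(np.linalg.solve(Phi_v, Phi_x))
k = np.argmax(lam.real)
lam_max, b_max = lam[k].real, vecs[:, k]
# Normalization assumed here: b_max^H Phi_v b_max = 1.
b_max = b_max / np.sqrt((b_max.conj() @ Phi_v @ b_max).real)

I_i = np.eye(Q, M)
H_max = I_i @ Phi_v @ np.outer(b_max, b_max.conj())            # Equation 27, second form
H_alt = I_i @ Phi_x @ np.outer(b_max, b_max.conj()) / lam_max  # Equation 27, first form
H_rand = rng.standard_normal((Q, M)) + 1j * rng.standard_normal((Q, M))
print(osnr(H_max), lam_max, osnr(H_rand))
```

The last line compares the output SNR of the maximum SNR matrix with $\lambda_{\max}$ and with the output SNR of an arbitrary matrix, which cannot exceed $\lambda_{\max}$.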

If we differentiate the MSE criterion, J(H), with respect to H and equate the result to zero, we find the Wiener filtering matrix:

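$$\mathbf{H}_{\mathrm{W}} = \mathbf{I}_{\mathrm{i}} \mathbf{\Phi}_{\mathbf{x}} \mathbf{\Phi}_{\mathbf{y}}^{-1} = \mathbf{I}_{\mathrm{i}} \left( \mathbf{I}_M - \mathbf{\Phi}_{\mathbf{v}} \mathbf{\Phi}_{\mathbf{y}}^{-1} \right),$$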
(28)

where $\mathbf{I}_M$ is the $M \times M$ identity matrix. The output SNR with the Wiener filtering matrix is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(\mathbf{H}_{\mathrm{W}}) \ge \mathrm{iSNR}$. Obviously, we have

$$\mathrm{oSNR}(\mathbf{H}_{\mathrm{W}}) \le \mathrm{oSNR}(\mathbf{H}_{\max})$$
(29)

and, in general,

$$\upsilon_{\mathrm{sd}}(\mathbf{H}_{\mathrm{W}}) \le \upsilon_{\mathrm{sd}}(\mathbf{H}_{\max}).$$
(30)

To better compromise between noise reduction and speech distortion, we can minimize the speech distortion index with the constraint that the noise reduction factor is equal to a positive value that is greater than 1, i.e.,

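$$\min_{\mathbf{H}} J_{\mathrm{ds}}(\mathbf{H}) \quad \text{subject to} \quad J_{\mathrm{rs}}(\mathbf{H}) = \beta\, \mathrm{tr}(\mathbf{\Phi}_{\mathbf{v}_Q}),$$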
(31)

where $0 < \beta < 1$ to ensure that we get some noise reduction. The previous optimization leads to the tradeoff filter:

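$$\mathbf{H}_{\mathrm{T},\mu} = \mathbf{I}_{\mathrm{i}} \mathbf{\Phi}_{\mathbf{x}} \left( \mathbf{\Phi}_{\mathbf{x}} + \mu \mathbf{\Phi}_{\mathbf{v}} \right)^{-1},$$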
(32)

where $\mu > 0$ is a Lagrange multiplier. The output SNR with the tradeoff filtering matrix is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(\mathbf{H}_{\mathrm{T},\mu}) \ge \mathrm{iSNR}$, $\forall \mu > 0$. Usually, $\mu$ is chosen in a heuristic way, so that for

  • μ=1, HT,1=HW, which is the Wiener filtering matrix;

  • μ>1, results in a filtering matrix with low residual noise at the expense of high speech distortion (as compared to Wiener); and

  • μ<1, results in a filtering matrix with high residual noise and low speech distortion (as compared to Wiener).

We should have for μ≥1,

$$\mathrm{oSNR}(\mathbf{H}_{\mathrm{W}}) \le \mathrm{oSNR}(\mathbf{H}_{\mathrm{T},\mu}) \le \mathrm{oSNR}(\mathbf{H}_{\max}),$$
(33)
$$\upsilon_{\mathrm{sd}}(\mathbf{H}_{\mathrm{W}}) \le \upsilon_{\mathrm{sd}}(\mathbf{H}_{\mathrm{T},\mu}),$$
(34)

and for μ≤1,

$$\mathrm{oSNR}(\mathbf{H}_{\mathrm{T},\mu}) \le \mathrm{oSNR}(\mathbf{H}_{\mathrm{W}}) \le \mathrm{oSNR}(\mathbf{H}_{\max}),$$
(35)
$$\upsilon_{\mathrm{sd}}(\mathbf{H}_{\mathrm{T},\mu}) \le \upsilon_{\mathrm{sd}}(\mathbf{H}_{\mathrm{W}}).$$
(36)
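The role of $\mu$ in the tradeoff filtering matrix can be illustrated with a short sketch. The correlation matrices below are random placeholders and the function names `tradeoff`, `j_ds`, and `j_rs` are our own; the snippet only shows that $\mu = 1$ reproduces the Wiener matrix and how the two MSE terms move as $\mu$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
M, Q = 6, 2
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_x = A @ A.conj().T
Phi_v = B @ B.conj().T + np.eye(M)
I_i = np.eye(Q, M)
tr = lambda X: np.trace(X).real

def tradeoff(mu):
    """Equation 32: H_{T,mu} = I_i Phi_x (Phi_x + mu Phi_v)^{-1}."""
    return I_i @ Phi_x @ np.linalg.inv(Phi_x + mu * Phi_v)

def j_ds(H):
    """Distortion MSE, tr[(H - I_i) Phi_x (H - I_i)^H]."""
    D = H - I_i
    return tr(D @ Phi_x @ D.conj().T)

def j_rs(H):
    """Residual-noise MSE, tr(H Phi_v H^H)."""
    return tr(H @ Phi_v @ H.conj().T)

# Wiener matrix in the second form of Equation 28.
H_W = I_i @ (np.eye(M) - Phi_v @ np.linalg.inv(Phi_x + Phi_v))

for mu in (0.25, 1.0, 4.0):
    H = tradeoff(mu)
    print(mu, j_ds(H), j_rs(H))   # distortion grows and residual noise shrinks with mu
```

Since $\mathbf{H}_{\mathrm{T},\mu}$ minimizes $J_{\mathrm{ds}}(\mathbf{H}) + \mu J_{\mathrm{rs}}(\mathbf{H})$, a larger $\mu$ can only lower the residual noise at the price of more distortion, which matches the qualitative statements in the list above.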

Another filter can be derived by just minimizing Jds(H). We obtain the minimum distortion (MD) rectangular filtering matrix:

$$\mathbf{H}_{\mathrm{MD}} = \mathbf{I}_{\mathrm{i}} \mathbf{\Phi}_{\mathbf{x}} \mathbf{\Phi}_{\mathbf{x}}^{\dagger},$$
(37)

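where $\mathbf{\Phi}_{\mathbf{x}}^{\dagger}$ is the pseudoinverse of $\mathbf{\Phi}_{\mathbf{x}}$. If $\mathbf{\Phi}_{\mathbf{x}}$ is a full-rank matrix, $\mathbf{H}_{\mathrm{MD}}$ becomes the identity filter, $\mathbf{I}_{\mathrm{i}}$, which does not affect the observations. The MD filter is very close to the well-known MVDR filter.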

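For $Q = P = 1$, it is possible to derive the MVDR filter. Indeed, by minimizing the variance of the filter's output, $\phi_z = \mathbf{h}^H \mathbf{\Phi}_{\mathbf{y}} \mathbf{h}$, or the variance of the residual noise, $\phi_{v_{\mathrm{rn}}} = \mathbf{h}^H \mathbf{\Phi}_{\mathbf{v}} \mathbf{h}$, subject to the distortionless constraint, $\mathbf{h}^H \mathbf{d} = 1$, we easily get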

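$$\mathbf{h}_{\mathrm{MVDR}} = \frac{\mathbf{\Phi}_{\mathbf{y}}^{-1}\mathbf{d}}{\mathbf{d}^H \mathbf{\Phi}_{\mathbf{y}}^{-1} \mathbf{d}} = \frac{\mathbf{\Phi}_{\mathbf{v}}^{-1}\mathbf{d}}{\mathbf{d}^H \mathbf{\Phi}_{\mathbf{v}}^{-1} \mathbf{d}}.$$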
(38)

It can be checked that Jds(hMVDR)=0, proving that hMVDR is distortionless.
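The equivalence of the two forms in Equation 38 and the distortionless property can be verified numerically. The following is a sketch under the rank-1 model of Equation 8, with an arbitrary vector $\mathbf{d}$, an arbitrary variance `phi_x1`, and a random noise correlation matrix; the helper `mvdr` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(4)
M = 5
d = rng.standard_normal(M) + 1j * rng.standard_normal(M)
d = d / d[0]                                   # first element of d equals 1
phi_x1 = 2.0                                   # variance of x_1 (arbitrary value)
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = B @ B.conj().T + np.eye(M)
Phi_x = phi_x1 * np.outer(d, d.conj())         # rank-1 model, Equation 8
Phi_y = Phi_x + Phi_v

def mvdr(Phi, d):
    """Phi^{-1} d / (d^H Phi^{-1} d), as in Equation 38."""
    w = np.linalg.solve(Phi, d)
    return w / (d.conj() @ w)

h_v = mvdr(Phi_v, d)                           # form based on the noise statistics
h_y = mvdr(Phi_y, d)                           # form based on the observation statistics
# For Q = P = 1, x = d x_1, so the distortion MSE is phi_x1 |h^H d - 1|^2.
J_ds = phi_x1 * abs(h_v.conj() @ d - 1) ** 2
print(np.max(np.abs(h_v - h_y)), J_ds)
```

In practice, the form based on $\mathbf{\Phi}_{\mathbf{y}}$ is often the more convenient one, since the observation statistics can be estimated directly from the noisy signal.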

It is also possible to derive the MVDR (square) filtering matrix for Q=M. Using the well-known eigenvalue decomposition [21], the speech correlation matrix can be diagonalized as:

$$\mathbf{Q}_{\mathbf{x}}^H \mathbf{\Phi}_{\mathbf{x}} \mathbf{Q}_{\mathbf{x}} = \mathbf{\Lambda}_{\mathbf{x}},$$
(39)

where

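$$\mathbf{Q}_{\mathbf{x}} = \begin{bmatrix} \mathbf{q}_{\mathbf{x},1} & \mathbf{q}_{\mathbf{x},2} & \cdots & \mathbf{q}_{\mathbf{x},M} \end{bmatrix}$$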
(40)

is a unitary matrix, i.e., $\mathbf{Q}_{\mathbf{x}}^H \mathbf{Q}_{\mathbf{x}} = \mathbf{Q}_{\mathbf{x}} \mathbf{Q}_{\mathbf{x}}^H = \mathbf{I}_M$, and

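$$\mathbf{\Lambda}_{\mathbf{x}} = \mathrm{diag}\left( \lambda_{\mathbf{x},1}, \lambda_{\mathbf{x},2}, \ldots, \lambda_{\mathbf{x},M} \right)$$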
(41)

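is a diagonal matrix. The orthonormal vectors $\mathbf{q}_{\mathbf{x},1}, \mathbf{q}_{\mathbf{x},2}, \ldots, \mathbf{q}_{\mathbf{x},M}$ are the eigenvectors corresponding, respectively, to the eigenvalues $\lambda_{\mathbf{x},1}, \lambda_{\mathbf{x},2}, \ldots, \lambda_{\mathbf{x},M}$ of the matrix $\mathbf{\Phi}_{\mathbf{x}}$, where $\lambda_{\mathbf{x},1} \ge \lambda_{\mathbf{x},2} \ge \cdots \ge \lambda_{\mathbf{x},P} > 0$ and $\lambda_{\mathbf{x},P+1} = \lambda_{\mathbf{x},P+2} = \cdots = \lambda_{\mathbf{x},M} = 0$. Let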

$$\mathbf{Q}_{\mathbf{x}} = \begin{bmatrix} \mathbf{T}_{\mathbf{x}} & \mathbf{\Xi}_{\mathbf{x}} \end{bmatrix},$$
(42)

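where the $M \times P$ matrix $\mathbf{T}_{\mathbf{x}}$ contains the eigenvectors corresponding to the nonzero eigenvalues of $\mathbf{\Phi}_{\mathbf{x}}$ and the $M \times (M-P)$ matrix $\mathbf{\Xi}_{\mathbf{x}}$ contains the eigenvectors corresponding to the null eigenvalues of $\mathbf{\Phi}_{\mathbf{x}}$. It can be verified that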

$$\mathbf{I}_M = \mathbf{T}_{\mathbf{x}} \mathbf{T}_{\mathbf{x}}^H + \mathbf{\Xi}_{\mathbf{x}} \mathbf{\Xi}_{\mathbf{x}}^H.$$
(43)

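Notice that $\mathbf{T}_{\mathbf{x}} \mathbf{T}_{\mathbf{x}}^H$ and $\mathbf{\Xi}_{\mathbf{x}} \mathbf{\Xi}_{\mathbf{x}}^H$ are two orthogonal projection matrices of rank $P$ and $M-P$, respectively. Hence, $\mathbf{T}_{\mathbf{x}} \mathbf{T}_{\mathbf{x}}^H$ is the orthogonal projector onto the speech subspace (where all the energy of the speech signal is concentrated), i.e., the range of $\mathbf{\Phi}_{\mathbf{x}}$, and $\mathbf{\Xi}_{\mathbf{x}} \mathbf{\Xi}_{\mathbf{x}}^H$ is the orthogonal projector onto the null subspace of $\mathbf{\Phi}_{\mathbf{x}}$. Using Equation 43, we can write the speech vector as: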

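$$\mathbf{x} = \mathbf{Q}_{\mathbf{x}} \mathbf{Q}_{\mathbf{x}}^H \mathbf{x} = \mathbf{T}_{\mathbf{x}} \mathbf{T}_{\mathbf{x}}^H \mathbf{x}.$$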
(44)

We deduce from Equation 44 that the distortionless constraint is:

$$\mathbf{H} \mathbf{T}_{\mathbf{x}} = \mathbf{T}_{\mathbf{x}},$$
(45)

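since, in this case, $\mathbf{H}\mathbf{x} = \mathbf{H} \mathbf{T}_{\mathbf{x}} \mathbf{T}_{\mathbf{x}}^H \mathbf{x} = \mathbf{T}_{\mathbf{x}} \mathbf{T}_{\mathbf{x}}^H \mathbf{x} = \mathbf{x}$. Now, from the criterion: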

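$$\min_{\mathbf{H}} \mathrm{tr}(\mathbf{H}\mathbf{\Phi}_{\mathbf{v}}\mathbf{H}^H) \quad \text{subject to} \quad \mathbf{H} \mathbf{T}_{\mathbf{x}} = \mathbf{T}_{\mathbf{x}},$$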
(46)

we find the MVDR:

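$$\mathbf{H}_{\mathrm{MVDR}} = \mathbf{T}_{\mathbf{x}} \left( \mathbf{T}_{\mathbf{x}}^H \mathbf{\Phi}_{\mathbf{v}}^{-1} \mathbf{T}_{\mathbf{x}} \right)^{-1} \mathbf{T}_{\mathbf{x}}^H \mathbf{\Phi}_{\mathbf{v}}^{-1}.$$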
(47)

Equation 47 can also be expressed as:

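$$\mathbf{H}_{\mathrm{MVDR}} = \mathbf{T}_{\mathbf{x}} \left( \mathbf{T}_{\mathbf{x}}^H \mathbf{\Phi}_{\mathbf{y}}^{-1} \mathbf{T}_{\mathbf{x}} \right)^{-1} \mathbf{T}_{\mathbf{x}}^H \mathbf{\Phi}_{\mathbf{y}}^{-1}.$$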
(48)

Of course, for P=M, the MVDR filtering matrix simplifies to the identity matrix, i.e., HMVDR=I M . As a consequence, we can state that the higher the dimension of the null space of Φ x is, the more the MVDR is efficient in terms of noise reduction. The best scenario corresponds to P=1. We can verify that Jds(HMVDR)=0.
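The subspace construction of the square MVDR filtering matrix ($Q = M$) can be sketched as follows. The rank-$P$ speech correlation matrix and the noise correlation matrix are random placeholders, and `mvdr_matrix` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(5)
M, P = 6, 2
A = rng.standard_normal((M, P)) + 1j * rng.standard_normal((M, P))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_x = A @ A.conj().T                       # rank P < M
Phi_v = B @ B.conj().T + np.eye(M)
Phi_y = Phi_x + Phi_v

# np.linalg.eigh returns eigenvalues in ascending order, so the last P
# eigenvectors span the speech subspace (the matrix T_x of Equation 42).
_, Q_x = np.linalg.eigh(Phi_x)
T_x = Q_x[:, -P:]

def mvdr_matrix(Phi):
    """T_x (T_x^H Phi^{-1} T_x)^{-1} T_x^H Phi^{-1}, Equations 47 and 48."""
    W = T_x.conj().T @ np.linalg.inv(Phi)    # T_x^H Phi^{-1}
    return T_x @ np.linalg.solve(W @ T_x, W)

H_v = mvdr_matrix(Phi_v)                     # Equation 47
H_y = mvdr_matrix(Phi_y)                     # Equation 48
D = H_v - np.eye(M)
J_ds = np.trace(D @ Phi_x @ D.conj().T).real # distortion MSE, numerically zero
print(np.max(np.abs(H_v - H_y)), J_ds)
```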

1.4 Speech enhancement with the orthogonal decomposition of the speech signal vector

1.4.1 Principle

Another perspective for speech enhancement is to extract the desired signal vector, x Q , from x. This way, the observation signal vector, y, will be an explicit function of x Q . As a consequence, the objectives that we wish to achieve are much easier to handle.

In this section, we assume that the elements x q , q=1,2,…,Q are not fully coherent, so that Φ x Q is a full-rank matrix. To extract x Q from x, we need to decompose x into two orthogonal components: one correlated with (or is a linear transformation of) the desired signal vector and the other one orthogonal to x Q and, hence, will be considered as an interference signal vector. Specifically, the vector x is decomposed into the following form [22, 23]:

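$$\mathbf{x} = \mathbf{\Phi}_{\mathbf{x}\mathbf{x}_Q} \mathbf{\Phi}_{\mathbf{x}_Q}^{-1} \mathbf{x}_Q + \mathbf{x}_{\mathrm{i}} = \mathbf{x}_{\mathrm{d}} + \mathbf{x}_{\mathrm{i}},$$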
(49)

where

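$$\mathbf{x}_{\mathrm{d}} = \mathbf{\Phi}_{\mathbf{x}\mathbf{x}_Q} \mathbf{\Phi}_{\mathbf{x}_Q}^{-1} \mathbf{x}_Q = \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \mathbf{x}_Q$$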
(50)

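is a linear transformation of the desired signal vector, $\mathbf{\Phi}_{\mathbf{x}\mathbf{x}_Q} = E(\mathbf{x}\mathbf{x}_Q^H)$ is the cross-correlation matrix (of size $M \times Q$) between $\mathbf{x}$ and $\mathbf{x}_Q$, $\mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} = \mathbf{\Phi}_{\mathbf{x}\mathbf{x}_Q} \mathbf{\Phi}_{\mathbf{x}_Q}^{-1}$, and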

$$\mathbf{x}_{\mathrm{i}} = \mathbf{x} - \mathbf{x}_{\mathrm{d}}$$
(51)

is the interference signal vector. It is easy to see that $\mathbf{x}_{\mathrm{d}}$ and $\mathbf{x}_{\mathrm{i}}$ are orthogonal (see endnote b), i.e.,

$$E(\mathbf{x}_{\mathrm{d}} \mathbf{x}_{\mathrm{i}}^H) = \mathbf{0}_{M \times M}.$$
(52)

We observe that the first $Q$ elements of $\mathbf{x}_{\mathrm{d}}$ and $\mathbf{x}_{\mathrm{i}}$ are equal to $\mathbf{x}_Q$ and $\mathbf{0}_{Q \times 1}$, respectively. Now, we can express the observation signal vector as an explicit function of $\mathbf{x}_Q$, i.e.,

$$\mathbf{y} = \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \mathbf{x}_Q + \mathbf{x}_{\mathrm{i}} + \mathbf{v}.$$
(53)
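The orthogonal decomposition in Equations 49 to 53 only requires second-order statistics, which the following sketch makes concrete. The speech correlation matrix is a random placeholder and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(6)
M, Q = 6, 2
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_x = A @ A.conj().T
I_i = np.eye(Q, M)

Phi_xQ = Phi_x[:Q, :Q]                       # correlation matrix of x_Q (full rank)
Phi_xxQ = Phi_x[:, :Q]                       # E(x x_Q^H), size M x Q
Gamma = Phi_xxQ @ np.linalg.inv(Phi_xQ)      # Gamma_{x x_Q}
Phi_xd = Gamma @ Phi_xQ @ Gamma.conj().T     # correlation matrix of x_d (rank Q)
Phi_xi = Phi_x - Phi_xd                      # correlation matrix of x_i

# E(x_d x_i^H) = Gamma E(x_Q x^H) - Gamma Phi_xQ Gamma^H, which should vanish.
cross = Gamma @ (I_i @ Phi_x) - Phi_xd
print(np.max(np.abs(cross)))
print(np.round(Gamma[:Q], 6))                # the first Q rows of Gamma form I_Q
```

The two printed quantities reflect the statements in the text: $\mathbf{x}_{\mathrm{d}}$ and $\mathbf{x}_{\mathrm{i}}$ are uncorrelated, and the first $Q$ elements of $\mathbf{x}_{\mathrm{d}}$ coincide with $\mathbf{x}_Q$.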

With this approach, the estimator is:

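$$\mathbf{z} = \mathbf{H}' (\mathbf{x}_{\mathrm{d}} + \mathbf{x}_{\mathrm{i}} + \mathbf{v}) = \mathbf{x}'_{\mathrm{fd}} + \mathbf{x}'_{\mathrm{ri}} + \mathbf{v}'_{\mathrm{rn}},$$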
(54)

where

$$\mathbf{H}' = \begin{bmatrix} \mathbf{h}_1'^H \\ \mathbf{h}_2'^H \\ \vdots \\ \mathbf{h}_Q'^H \end{bmatrix}$$
(55)

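is a rectangular filtering matrix of size $Q \times M$, $\mathbf{h}'_q$, $q = 1, 2, \ldots, Q$, are complex-valued filters of length $M$, $\mathbf{x}'_{\mathrm{fd}} = \mathbf{H}'\mathbf{x}_{\mathrm{d}}$ is the filtered desired signal, $\mathbf{x}'_{\mathrm{ri}} = \mathbf{H}'\mathbf{x}_{\mathrm{i}}$ is the residual interference, and $\mathbf{v}'_{\mathrm{rn}} = \mathbf{H}'\mathbf{v}$ is the residual noise. The correlation matrix of $\mathbf{z}$ is then: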

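$$\mathbf{\Phi}_{\mathbf{z}} = \mathbf{\Phi}_{\mathbf{x}'_{\mathrm{fd}}} + \mathbf{\Phi}_{\mathbf{x}'_{\mathrm{ri}}} + \mathbf{\Phi}_{\mathbf{v}'_{\mathrm{rn}}},$$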
(56)

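where $\mathbf{\Phi}_{\mathbf{x}'_{\mathrm{fd}}} = \mathbf{H}' \mathbf{\Phi}_{\mathbf{x}_{\mathrm{d}}} \mathbf{H}'^H$, with $\mathbf{\Phi}_{\mathbf{x}_{\mathrm{d}}} = \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \mathbf{\Phi}_{\mathbf{x}_Q} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H$ being the correlation matrix (whose rank is equal to $Q$) of $\mathbf{x}_{\mathrm{d}}$, $\mathbf{\Phi}_{\mathbf{x}'_{\mathrm{ri}}} = \mathbf{H}' \mathbf{\Phi}_{\mathbf{x}_{\mathrm{i}}} \mathbf{H}'^H$, with $\mathbf{\Phi}_{\mathbf{x}_{\mathrm{i}}} = E(\mathbf{x}_{\mathrm{i}} \mathbf{x}_{\mathrm{i}}^H)$ being the correlation matrix of $\mathbf{x}_{\mathrm{i}}$, and $\mathbf{\Phi}_{\mathbf{v}'_{\mathrm{rn}}} = \mathbf{H}' \mathbf{\Phi}_{\mathbf{v}} \mathbf{H}'^H$.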

1.4.2 Performance measures

The input SNR is identical to the definition given in Equation 9.

From Equation 56, we deduce the output SNR:

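$$\mathrm{oSNR}(\mathbf{H}') = \frac{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}'_{\mathrm{fd}}})}{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}'_{\mathrm{ri}}} + \mathbf{\Phi}_{\mathbf{v}'_{\mathrm{rn}}})} = \frac{\mathrm{tr}(\mathbf{H}' \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \mathbf{\Phi}_{\mathbf{x}_Q} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{H}'^H)}{\mathrm{tr}(\mathbf{H}' \mathbf{\Phi}_{\mathrm{in}} \mathbf{H}'^H)},$$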
(57)

where

$$\mathbf{\Phi}_{\mathrm{in}} = \mathbf{\Phi}_{\mathbf{x}_{\mathrm{i}}} + \mathbf{\Phi}_{\mathbf{v}}$$
(58)

is the correlation matrix of the interference-plus-noise. The obvious objective is to find an appropriate $\mathbf{H}'$ in such a way that $\mathrm{oSNR}(\mathbf{H}') \ge \mathrm{iSNR}$.

The noise reduction factor is:

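$$\xi_{\mathrm{nr}}(\mathbf{H}') = \frac{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{v}_Q})}{\mathrm{tr}(\mathbf{H}' \mathbf{\Phi}_{\mathrm{in}} \mathbf{H}'^H)}.$$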
(59)

A reasonable choice of $\mathbf{H}'$ should give a value of the noise reduction factor greater than 1, meaning that the noise and interference have been attenuated by the filter.

The speech reduction factor is defined as:

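$$\xi_{\mathrm{sr}}(\mathbf{H}') = \frac{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q})}{\mathrm{tr}(\mathbf{H}' \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \mathbf{\Phi}_{\mathbf{x}_Q} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{H}'^H)}.$$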
(60)

A rectangular filtering matrix that does not affect the desired signal requires the constraint (see endnote c):

$$\mathbf{H}' \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} = \mathbf{I}_Q.$$
(61)

Hence, $\xi_{\mathrm{sr}}(\mathbf{H}') = 1$ in the absence of (correlated) distortion and $\xi_{\mathrm{sr}}(\mathbf{H}') > 1$ in the presence of distortion.

Again, we have the fundamental relationship:

$$\frac{\mathrm{oSNR}(\mathbf{H}')}{\mathrm{iSNR}} = \frac{\xi_{\mathrm{nr}}(\mathbf{H}')}{\xi_{\mathrm{sr}}(\mathbf{H}')}.$$
(62)

When no distortion occurs, the gain in SNR coincides with the noise reduction factor.

We can also quantify the distortion with the speech distortion index:

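$$\upsilon_{\mathrm{sd}}(\mathbf{H}') = \frac{\mathrm{tr}\left[ (\mathbf{H}' \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} - \mathbf{I}_Q) \mathbf{\Phi}_{\mathbf{x}_Q} (\mathbf{H}' \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} - \mathbf{I}_Q)^H \right]}{\mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q})}.$$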
(63)

The speech distortion index is always greater than or equal to 0 and should be upper bounded by 1 for optimal filtering matrices, which corresponds to the case where the filtering matrix is just a matrix of zeros; so the higher the value of υsd(H) is, the more the desired signal is distorted.

The error signal is:

$$\mathbf{e}' = \mathbf{z} - \mathbf{x}_Q = \mathbf{H}'\mathbf{y} - \mathbf{x}_Q.$$
(64)

It can be written as the sum of two orthogonal error signal vectors:

$$\mathbf{e}' = \mathbf{e}'_{\mathrm{ds}} + \mathbf{e}'_{\mathrm{rs}},$$
(65)

where

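$$\mathbf{e}'_{\mathrm{ds}} = \mathbf{x}'_{\mathrm{fd}} - \mathbf{x}_Q = (\mathbf{H}' \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} - \mathbf{I}_Q)\, \mathbf{x}_Q$$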
(66)

is the signal distortion due to the rectangular filtering matrix and

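$$\mathbf{e}'_{\mathrm{rs}} = \mathbf{x}'_{\mathrm{ri}} + \mathbf{v}'_{\mathrm{rn}} = \mathbf{H}'\mathbf{x}_{\mathrm{i}} + \mathbf{H}'\mathbf{v}$$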
(67)

represents the residual interference-plus-noise. Having defined the error signal, we can now write the MSE criterion:

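$$J(\mathbf{H}') = \mathrm{tr}\left[E(\mathbf{e}'\mathbf{e}'^H)\right] = \mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q}) + \mathrm{tr}(\mathbf{H}'\mathbf{\Phi}_{\mathbf{y}}\mathbf{H}'^H) - \mathrm{tr}(\mathbf{H}'\mathbf{\Phi}_{\mathbf{x}}\mathbf{I}_{\mathrm{i}}^T) - \mathrm{tr}(\mathbf{I}_{\mathrm{i}}\mathbf{\Phi}_{\mathbf{x}}\mathbf{H}'^H) = J_{\mathrm{ds}}(\mathbf{H}') + J_{\mathrm{rs}}(\mathbf{H}'),$$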
(68)

where

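$$J_{\mathrm{ds}}(\mathbf{H}') = \mathrm{tr}(\mathbf{\Phi}_{\mathbf{x}_Q}) + \mathrm{tr}(\mathbf{H}'\mathbf{\Phi}_{\mathbf{x}_{\mathrm{d}}}\mathbf{H}'^H) - \mathrm{tr}(\mathbf{H}'\mathbf{\Phi}_{\mathbf{x}_{\mathrm{d}}}\mathbf{I}_{\mathrm{i}}^T) - \mathrm{tr}(\mathbf{I}_{\mathrm{i}}\mathbf{\Phi}_{\mathbf{x}_{\mathrm{d}}}\mathbf{H}'^H)$$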
(69)

and

$$J_{\mathrm{rs}}(\mathbf{H}') = \mathrm{tr}(\mathbf{H}' \mathbf{\Phi}_{\mathrm{in}} \mathbf{H}'^H).$$
(70)

We deduce that

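$$\frac{J_{\mathrm{ds}}(\mathbf{H}')}{J_{\mathrm{rs}}(\mathbf{H}')} = \mathrm{iSNR} \times \xi_{\mathrm{nr}}(\mathbf{H}') \times \upsilon_{\mathrm{sd}}(\mathbf{H}') = \mathrm{oSNR}(\mathbf{H}') \times \xi_{\mathrm{sr}}(\mathbf{H}') \times \upsilon_{\mathrm{sd}}(\mathbf{H}'),$$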
(71)

showing how the MSEs are related to the different performance measures.

1.4.3 Optimal filters

Let $\lambda'_{\max}$ be the maximum eigenvalue of the matrix $\mathbf{\Phi}_{\mathrm{in}}^{-1} \mathbf{\Phi}_{\mathbf{x}_{\mathrm{d}}}$ with corresponding eigenvector $\mathbf{b}'_{\max}$. We easily find that the maximum SNR filtering matrix with minimum distortion is:

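$$\mathbf{H}'_{\max} = \mathbf{I}_{\mathrm{i}} \mathbf{\Phi}_{\mathbf{x}_{\mathrm{d}}} \frac{\mathbf{b}'_{\max} \mathbf{b}_{\max}'^H}{\lambda'_{\max}} = \mathbf{I}_{\mathrm{i}} \mathbf{\Phi}_{\mathrm{in}} \mathbf{b}'_{\max} \mathbf{b}_{\max}'^H$$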
(72)

and

$$\mathrm{oSNR}(\mathbf{H}'_{\max}) = \lambda'_{\max}.$$
(73)

The output SNR with the maximum SNR filtering matrix is always greater than or equal to the input SNR, i.e., $\mathrm{oSNR}(\mathbf{H}'_{\max}) \ge \mathrm{iSNR}$. We also have $\mathrm{oSNR}(\mathbf{H}') \le \lambda'_{\max}$, $\forall \mathbf{H}'$.

The minimization of the MSE criterion leads to the Wiener filtering matrix:

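$$\mathbf{H}'_{\mathrm{W}} = \mathbf{I}_{\mathrm{i}} \mathbf{\Phi}_{\mathbf{x}} \mathbf{\Phi}_{\mathbf{y}}^{-1} = \mathbf{H}_{\mathrm{W}},$$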
(74)

which is identical to the Wiener filter obtained with the classical approach. Even though the Wiener filter obtained with the two different approaches is the same, its evaluation with the performance measures is slightly different due to the conceptual difference between the two methods. We always have $\mathrm{oSNR}(\mathbf{H}'_{\mathrm{W}}) \ge \mathrm{iSNR}$.

We can rewrite the Wiener filtering matrix as:

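$$\mathbf{H}'_{\mathrm{W}} = \left( \mathbf{I}_Q + \mathbf{\Phi}_{\mathbf{x}_Q} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathrm{in}}^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \right)^{-1} \mathbf{\Phi}_{\mathbf{x}_Q} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathrm{in}}^{-1} = \left( \mathbf{\Phi}_{\mathbf{x}_Q}^{-1} + \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathrm{in}}^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \right)^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathrm{in}}^{-1}.$$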
(75)

This form is interesting because it shows an obvious link with some other optimal filtering matrices, as will be verified later.

Another way to express the Wiener filter is:

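$$\mathbf{H}'_{\mathrm{W}} = \mathbf{I}_{\mathrm{i}} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \mathbf{\Phi}_{\mathbf{x}_Q} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathbf{y}}^{-1} = \mathbf{I}_{\mathrm{i}} \left( \mathbf{I}_M - \mathbf{\Phi}_{\mathrm{in}} \mathbf{\Phi}_{\mathbf{y}}^{-1} \right).$$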
(76)

The MVDR (see endnote d) rectangular filtering matrix is obtained by minimizing the MSE of the residual interference-plus-noise, $J_{\mathrm{rs}}(\mathbf{H}')$, subject to the constraint that the desired signal vector is not distorted. Mathematically, this is equivalent to:

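$$\min_{\mathbf{H}'} \mathrm{tr}(\mathbf{H}' \mathbf{\Phi}_{\mathrm{in}} \mathbf{H}'^H) \quad \text{subject to} \quad \mathbf{H}' \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} = \mathbf{I}_Q.$$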
(77)

The solution to the above optimization problem is:

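$$\mathbf{H}'_{\mathrm{MVDR}} = \left( \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathrm{in}}^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \right)^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathrm{in}}^{-1},$$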
(78)

which is interesting to compare to $\mathbf{H}'_{\mathrm{W}}$ [Equation 75]. We can rewrite the MVDR as:

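$$\mathbf{H}'_{\mathrm{MVDR}} = \left( \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathbf{y}}^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \right)^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathbf{y}}^{-1}.$$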
(79)
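The relations between Equations 74, 75, 78, and 79 can be checked with a few lines of NumPy. The snippet below is a sketch with random placeholder statistics; `mvdr` is a hypothetical helper, and the interference-plus-noise matrix is formed from Equation 58 using $\mathbf{\Phi}_{\mathbf{x}_{\mathrm{i}}} = \mathbf{\Phi}_{\mathbf{x}} - \mathbf{\Phi}_{\mathbf{x}_{\mathrm{d}}}$.

```python
import numpy as np

rng = np.random.default_rng(7)
M, Q = 6, 2
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_x = A @ A.conj().T
Phi_v = B @ B.conj().T + np.eye(M)
Phi_y = Phi_x + Phi_v

Phi_xQ = Phi_x[:Q, :Q]
Gamma = Phi_x[:, :Q] @ np.linalg.inv(Phi_xQ)                # Gamma_{x x_Q}
Phi_in = Phi_x - Gamma @ Phi_xQ @ Gamma.conj().T + Phi_v    # Equation 58

def mvdr(Phi):
    """(Gamma^H Phi^{-1} Gamma)^{-1} Gamma^H Phi^{-1}."""
    W = Gamma.conj().T @ np.linalg.inv(Phi)
    return np.linalg.solve(W @ Gamma, W)

H_in = mvdr(Phi_in)                                         # Equation 78
H_y = mvdr(Phi_y)                                           # Equation 79

# Wiener matrix in the second form of Equation 75 and in the form of Equation 74.
W_in = Gamma.conj().T @ np.linalg.inv(Phi_in)
H_W75 = np.linalg.solve(np.linalg.inv(Phi_xQ) + W_in @ Gamma, W_in)
H_W74 = np.eye(Q, M) @ Phi_x @ np.linalg.inv(Phi_y)
print(np.max(np.abs(H_in - H_y)), np.max(np.abs(H_W75 - H_W74)))
```

Comparing `H_W75` with `H_in` makes the link between the two filters explicit: the Wiener matrix differs from the MVDR matrix only by the term $\mathbf{\Phi}_{\mathbf{x}_Q}^{-1}$ inside the inverse.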

We should always have

$$\mathrm{oSNR}(\mathbf{H}'_{\mathrm{MVDR}}) \le \mathrm{oSNR}(\mathbf{H}'_{\mathrm{W}}) \le \mathrm{oSNR}(\mathbf{H}'_{\max}).$$
(80)

By minimizing the speech distortion index with the constraint that the noise reduction factor is equal to a positive value that is greater than 1, we find the tradeoff filtering matrix:

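$$\mathbf{H}'_{\mathrm{T},\mu} = \mathbf{\Phi}_{\mathbf{x}_Q} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \left( \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \mathbf{\Phi}_{\mathbf{x}_Q} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H + \mu \mathbf{\Phi}_{\mathrm{in}} \right)^{-1},$$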
(81)

which can be rewritten as:

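$$\mathbf{H}'_{\mathrm{T},\mu} = \left( \mu \mathbf{\Phi}_{\mathbf{x}_Q}^{-1} + \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathrm{in}}^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q} \right)^{-1} \mathbf{\Gamma}_{\mathbf{x}\mathbf{x}_Q}^H \mathbf{\Phi}_{\mathrm{in}}^{-1},$$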
(82)

where $\mu \ge 0$ is a Lagrange multiplier. Usually, $\mu$ is chosen in an ad hoc way, so that for

  • $\mu = 1$, $\mathbf{H}'_{\mathrm{T},1} = \mathbf{H}'_{\mathrm{W}}$, which is the Wiener filtering matrix;

  • $\mu = 0$ [from Equation 82], $\mathbf{H}'_{\mathrm{T},0} = \mathbf{H}'_{\mathrm{MVDR}}$, which is the MVDR filtering matrix;

  • μ>1, results in a filtering matrix with low residual noise (as compared to Wiener) at the expense of high speech distortion; and

  • μ<1, results in a filtering matrix with high residual noise and low speech distortion (as compared to Wiener).

We always have $\mathrm{oSNR}(\mathbf{H}'_{\mathrm{T},\mu}) \ge \mathrm{iSNR}$, $\forall \mu \ge 0$. We should also have, for $\mu \ge 1$,

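$$\mathrm{oSNR}(\mathbf{H}'_{\mathrm{MVDR}}) \le \mathrm{oSNR}(\mathbf{H}'_{\mathrm{W}}) \le \mathrm{oSNR}(\mathbf{H}'_{\mathrm{T},\mu}) \le \mathrm{oSNR}(\mathbf{H}'_{\max})$$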
(83)

and for μ≤1,

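$$\mathrm{oSNR}(\mathbf{H}'_{\mathrm{MVDR}}) \le \mathrm{oSNR}(\mathbf{H}'_{\mathrm{T},\mu}) \le \mathrm{oSNR}(\mathbf{H}'_{\mathrm{W}}) \le \mathrm{oSNR}(\mathbf{H}'_{\max}).$$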
(84)
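The special cases of the tradeoff filtering matrix can be illustrated with the sketch below, which again uses random placeholder statistics. The names `tradeoff_82` and `tradeoff_81` are our own and simply implement Equations 82 and 81, respectively.

```python
import numpy as np

rng = np.random.default_rng(8)
M, Q = 6, 2
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_x = A @ A.conj().T
Phi_v = B @ B.conj().T + np.eye(M)

Phi_xQ = Phi_x[:Q, :Q]
Gamma = Phi_x[:, :Q] @ np.linalg.inv(Phi_xQ)
Phi_xd = Gamma @ Phi_xQ @ Gamma.conj().T
Phi_in = Phi_x - Phi_xd + Phi_v
tr = lambda X: np.trace(X).real

def tradeoff_82(mu):
    """Equation 82, usable for mu >= 0."""
    W = Gamma.conj().T @ np.linalg.inv(Phi_in)
    return np.linalg.solve(mu * np.linalg.inv(Phi_xQ) + W @ Gamma, W)

def tradeoff_81(mu):
    """Equation 81, which requires mu > 0 for the inverse to exist."""
    return Phi_xQ @ Gamma.conj().T @ np.linalg.inv(Phi_xd + mu * Phi_in)

def j_rs(H):
    """Residual interference-plus-noise MSE, tr(H Phi_in H^H)."""
    return tr(H @ Phi_in @ H.conj().T)

H_mvdr = tradeoff_82(0.0)                                     # mu = 0: MVDR
H_wiener = np.eye(Q, M) @ Phi_x @ np.linalg.inv(Phi_x + Phi_v)
for mu in (0.0, 1.0, 4.0):
    print(mu, j_rs(tradeoff_82(mu)))   # residual interference-plus-noise shrinks with mu
```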

The case Q=M is interesting because for both approaches, performance measures and optimal square filtering matrices are identical. We can draw the same conclusions for the case Q=P=1.

1.5 Application examples

In this section, we show how the two approaches can be applied to different applications of speech enhancement.

1.5.1 Single-channel noise reduction in the time domain

The single-channel noise reduction problem in the time domain consists of recovering the desired signal (or clean speech) x(t), t being the discrete-time index, of zero mean from the noisy observation (microphone signal) [1]:

$$y(t) = x(t) + v(t),$$
(85)

where v(t), assumed to be a zero-mean random process, is the unwanted additive noise that can be either white or colored but is uncorrelated with x(t).

The signal model given in Equation 85 can be put into a vector form by considering the L most recent successive time samples, i.e.,

$$\mathbf{y}(t) = \begin{bmatrix} y(t) & y(t-1) & \cdots & y(t-L+1) \end{bmatrix}^T = \mathbf{x}(t) + \mathbf{v}(t),$$
(86)

where x(t) and v(t) are defined in a similar way to y(t). We define the desired signal vector as:

$$\mathbf{x}_Q(t) = \begin{bmatrix} x(t) & x(t-1) & \cdots & x(t-Q+1) \end{bmatrix}^T,$$
(87)

that we can estimate from y(t) with either of the two methods. Estimating the desired signal using conventional, rectangular filters was considered in [24], while [22] considers the orthogonal decomposition approach. Simulation results showing the performance of the two filtering methods for single-channel noise reduction are also found in [22, 24].
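As a toy illustration of the time-domain setting, the sketch below builds the vectors of Equation 86 from a synthetic signal and applies the Wiener filtering matrix of Equation 28. Everything about the data is an assumption made for the example: the "speech" is a first-order autoregressive process, the noise is white at 0 dB input SNR, and the noise statistics are computed from the noise itself (an oracle that is not available in practice).

```python
import numpy as np

rng = np.random.default_rng(9)
N, L, Q = 20000, 8, 1

# Hypothetical signals: AR(1) "speech" and white noise at 0 dB input SNR.
w = rng.standard_normal(N)
x = np.zeros(N)
for n in range(1, N):
    x[n] = 0.9 * x[n - 1] + w[n]
v = np.std(x) * rng.standard_normal(N)
y = x + v

def stack(s):
    """Columns are [s(t), s(t-1), ..., s(t-L+1)]^T for t = L-1, ..., N-1."""
    return np.stack([s[L - 1 - k : N - k] for k in range(L)])

Y, X, V = stack(y), stack(x), stack(v)
n_frames = Y.shape[1]
Phi_y = Y @ Y.T / n_frames            # sample correlation matrix of y(t)
Phi_v = V @ V.T / n_frames            # oracle noise statistics, for illustration only
Phi_x = Phi_y - Phi_v                 # from Equation 2

H_W = np.eye(Q, L) @ Phi_x @ np.linalg.inv(Phi_y)   # Wiener matrix, Equation 28
Z = H_W @ Y                                          # estimate of x_Q(t)

mse_in = np.mean((Y[:Q] - X[:Q]) ** 2)               # error of the unprocessed signal
mse_out = np.mean((Z - X[:Q]) ** 2)                  # error after filtering
print(mse_in, mse_out)
```

In a real system, the noise correlation matrix would have to be estimated, e.g., during speech pauses or with one of the noise tracking methods cited in the introduction.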

1.5.2 Single-channel noise reduction in the time-frequency domain

Using the short-time Fourier transform (STFT), Equation 85 can be rewritten in the time-frequency domain as [13, 25]:

$$Y(k,n) = X(k,n) + V(k,n),$$
(88)

where the zero-mean complex random variables Y(k,n), X(k,n), and V(k,n) are the STFTs of y(t), x(t), and v(t), respectively, at frequency bin k∈{0,1,…,K−1} and time frame n. Here, the sample X(k,n) is the desired signal.

The simplest way to estimate X(k,n) is by applying a positive gain to Y(k,n) with the conventional approach. However, the noise reduction performance may be limited.

A more general approach to estimate the desired signal is by filtering the observation signal vector of length $P$ [25]:

$$\mathbf{y}(k,n) = \begin{bmatrix} Y(k,n) & Y(k,n-1) & \cdots & Y(k,n-P+1) \end{bmatrix}^T,$$
(89)

and using the orthogonal decomposition to extract X(k,n) from x(k,n), which is defined in a similar way to y(k,n). Thanks to this approach, the non-negligible interframe correlation is taken into account, which is not the case when just a gain is used. As a consequence, we can better compromise between noise reduction and speech distortion.

The STFT-based filtering methods for single-channel noise reduction were considered in, e.g., [26–28], where experimental results can also be found.

1.5.3 Multichannel noise reduction in the time domain

In the multichannel scenario, we have a microphone array with M sensors that captures a convolved source signal in some noise field. In the time domain, the received signals are expressed as [29, 30]:

$$y_m(t) = g_m(t) \ast s(t) + v_m(t) = x_m(t) + v_m(t), \quad m = 1, 2, \ldots, M,$$
(90)

where $g_m(t)$ is the acoustic impulse response from the location of the unknown speech source, $s(t)$, to the $m$th microphone, $\ast$ stands for linear convolution, and $x_m(t)$ and $v_m(t)$ are, respectively, the convolved speech and additive noise at microphone $m$. We assume that the signals $x_m(t) = g_m(t) \ast s(t)$ and $v_m(t)$ are uncorrelated, zero mean, real, and broadband. By definition, $x_m(t)$ is coherent across the array. The noise signals, $v_m(t)$, are typically only partially coherent across the array.

By processing the data by blocks of L time samples, the signal model given in Equation 90 can be put into a vector form as:

$$\mathbf{y}_m(t) = \mathbf{x}_m(t) + \mathbf{v}_m(t), \quad m = 1, 2, \ldots, M,$$
(91)

where

$$\mathbf{y}_m(t) = \begin{bmatrix} y_m(t) & y_m(t-1) & \cdots & y_m(t-L+1) \end{bmatrix}^T$$
(92)

is a vector of length L, and x m (t) and v m (t) are defined similarly to y m (t). It is more convenient to concatenate the M vectors y m (t), m=1,2,…,M together as:

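$$\underline{\mathbf{y}}(t) = \begin{bmatrix} \mathbf{y}_1^T(t) & \mathbf{y}_2^T(t) & \cdots & \mathbf{y}_M^T(t) \end{bmatrix}^T = \underline{\mathbf{x}}(t) + \underline{\mathbf{v}}(t),$$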
(93)

where the vectors $\underline{\mathbf{x}}(t)$ and $\underline{\mathbf{v}}(t)$ of length $ML$ are defined in a similar way to $\underline{\mathbf{y}}(t)$.

We consider $\mathbf{x}_1(t)$ as the desired signal vector. Our problem then may be stated as follows: given $\underline{\mathbf{y}}(t)$, our aim is to preserve $\mathbf{x}_1(t)$ while minimizing the contribution of $\underline{\mathbf{v}}(t)$. Both approaches can be used, but the one based on the orthogonal decomposition is more appropriate since it will better exploit the correlation among the convolved speech signals at the microphones for noise reduction. The orthogonal decomposition approach for multichannel noise reduction was considered in, e.g., [31, 32], where experimental results can also be found.

1.5.4 Multichannel noise reduction in the frequency domain

In the frequency domain, at the frequency index f, Equation 90 can be expressed as:

$$Y_m(f) = G_m(f) S(f) + V_m(f) = X_m(f) + V_m(f), \quad m = 1, 2, \ldots, M,$$
(94)

where Y m (f), G m (f), S(f), V m (f), and X m (f) are the frequency-domain representations of y m (t), g m (t), s(t), v m (t), and x m (t), respectively.

It is more convenient to write the M frequency-domain microphone signals in a vector notation:

$$\mathbf{y}(f) = \mathbf{g}(f) S(f) + \mathbf{v}(f) = \mathbf{x}(f) + \mathbf{v}(f) = \mathbf{d}(f) X_1(f) + \mathbf{v}(f),$$
(95)

where

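$$\begin{aligned} \mathbf{y}(f) &= \begin{bmatrix} Y_1(f) & Y_2(f) & \cdots & Y_M(f) \end{bmatrix}^T, \\ \mathbf{x}(f) &= \begin{bmatrix} X_1(f) & X_2(f) & \cdots & X_M(f) \end{bmatrix}^T = S(f)\, \mathbf{g}(f), \\ \mathbf{g}(f) &= \begin{bmatrix} G_1(f) & G_2(f) & \cdots & G_M(f) \end{bmatrix}^T, \\ \mathbf{v}(f) &= \begin{bmatrix} V_1(f) & V_2(f) & \cdots & V_M(f) \end{bmatrix}^T, \end{aligned}$$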

and

$$\mathbf{d}(f) = \begin{bmatrix} 1 & \dfrac{G_2(f)}{G_1(f)} & \cdots & \dfrac{G_M(f)}{G_1(f)} \end{bmatrix}^T = \frac{\mathbf{g}(f)}{G_1(f)}.$$
(96)

The expression in Equation 95 depends explicitly on the desired signal, $X_1(f)$, that we want to estimate from $\mathbf{y}(f)$.

There is another interesting way to write Equation 95. First, it is easy to see that

$$X_m(f) = \gamma_{X_m X_1}(f)\, X_1(f), \quad m = 1, 2, \ldots, M,$$
(97)

where

$$\gamma_{X_m X_1}(f) = \frac{E\left[ X_m(f) X_1^{*}(f) \right]}{E\left[ |X_1(f)|^2 \right]} = \frac{G_m(f)}{G_1(f)}, \quad m = 1, 2, \ldots, M$$
(98)

is the partially normalized [with respect to X1(f)] coherence function between X m (f) and X1(f). Using Equation 97, we can rewrite Equation 95 as:

$$\mathbf{y}(f) = \boldsymbol{\gamma}_{\mathbf{x} X_1}(f)\, X_1(f) + \mathbf{v}(f),$$
(99)

where

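$$\boldsymbol{\gamma}_{\mathbf{x} X_1}(f) = \begin{bmatrix} 1 & \gamma_{X_2 X_1}(f) & \cdots & \gamma_{X_M X_1}(f) \end{bmatrix}^T = \frac{E\left[ \mathbf{x}(f) X_1^{*}(f) \right]}{E\left[ |X_1(f)|^2 \right]} = \mathbf{d}(f)$$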
(100)

is the partially normalized [with respect to X1(f)] coherence vector (of length M) between x(f) and X1(f). This shows that the two approaches for noise reduction are identical. More details on multichannel noise reduction in the frequency domain as well as experimental results can be found in [25, 33].
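The identity between the coherence vector and $\mathbf{d}(f)$ in Equation 100 can be checked at a single frequency with a few lines of code. The transfer functions, the source variance `phi_S`, and the noise correlation matrix below are arbitrary placeholders chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(10)
M = 4
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)    # G_m(f) at one frequency
phi_S = 1.5                                                 # variance of S(f), arbitrary
Phi_x = phi_S * np.outer(g, g.conj())                       # E[x(f) x^H(f)], rank 1

d = g / g[0]                                                # Equation 96
gamma = Phi_x[:, 0] / Phi_x[0, 0].real                      # E[x X_1^*] / E[|X_1|^2]

B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_v = B @ B.conj().T + np.eye(M)
w = np.linalg.solve(Phi_v, d)
h = w / (d.conj() @ w)                                      # MVDR filter, Equation 38

S = rng.standard_normal() + 1j * rng.standard_normal()      # one realization of S(f)
x = g * S                                                   # speech at the M microphones
X1_hat = h.conj() @ x                                       # filter output without noise
print(np.max(np.abs(gamma - d)), abs(X1_hat - x[0]))
```

Both printed values are at the level of rounding errors: the coherence vector equals $\mathbf{d}(f)$, and the MVDR filter passes $X_1(f)$ without distortion.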

1.5.5 Binaural noise reduction

Binaural noise reduction [34] consists of the estimation of the received speech signal at two different microphones with a sensor array of $M$ microphones. One estimate is for the left ear and the other for the right ear. This way, and thanks to our binaural hearing system, we are able to localize the speech source in space. In the frequency domain, we can estimate, for example, $X_1(f)$ and $X_2(f)$. As explained above, the two methods are the same. In the time domain, we can estimate, for example, $x_1(t)$ and $x_2(t)$. The method based on the orthogonal decomposition is more appropriate since it may distort the signals less. Distortion in binaural noise reduction is problematic since it may affect the cues for localization and separation. Experimental results and further theoretical details on binaural noise reduction using the approaches mentioned herein are found in [35, 36].

2 Conclusions

In this paper, we have given a brief overview of linear filtering methods for speech enhancement based on two approaches: a so-called conventional approach and an approach based on the orthogonal decomposition. In the context of these two different approaches, various optimal filters (e.g., MVDR, maximum SNR, and Wiener filters) have been derived and their properties in terms of different performance measures have been assessed and compared. These performance measures, simply put, quantify the properties of the filters and approaches in terms of noise reduction and speech distortion and show how they offer different tradeoffs between the two. We have also demonstrated how the approaches can be applied in various speech enhancement contexts, including single- and multichannel enhancement in both the time and frequency domains and in binaural noise reduction.

Endnotes

a The upper bound comes from the fact that this distortion is obtained when the filtering matrix only contains zeros, which should be the maximum expected distortion.

b It is legitimate to consider xi as an interference, since the desired signal is entirely in xd, and xi and xd are uncorrelated.

c Here, the distortionless constraint is in the sense that we can perfectly recover the desired signal vector, even though the residual interference can add some uncorrelated distortion to the desired signal.

d We use the terminology MVDR because we can completely extract the desired signal with this filter.