Deep neural network techniques for monaural speech enhancement and separation: state of the art analysis

Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision, where they have achieved great success in tasks such as machine translation and image generation. Due to this success, these data-driven techniques have also been applied in the audio domain. More specifically, DNN models have been applied in speech enhancement and separation to perform speech denoising, dereverberation, speaker extraction and speaker separation. In this paper, we review the current DNN techniques being employed to achieve speech enhancement and separation. The review covers the whole pipeline of speech enhancement and separation: feature extraction, how DNN-based tools model both global and local features of speech, model training (supervised and unsupervised), and how the tools address the label ambiguity (permutation) problem. The review also covers the use of domain adaptation techniques and pre-trained models to boost the speech enhancement process. By this, we hope to provide an all-inclusive reference of the state-of-the-art DNN-based techniques being applied in the domain of speech separation and enhancement. We further discuss future research directions. This survey can be used by both academic researchers and industry practitioners working in the speech separation and enhancement domain.


Introduction
Techniques for monaural speech intelligibility improvement can be categorised as either speech enhancement or separation. Speech enhancement involves isolating a target speech either from noise [1] or from a mixed speech [2], and covers tasks such as dereverberation, denoising and speaker extraction. Speaker separation, on the other hand, seeks to estimate the independent speech signals composed in a mixed speech [3]. Speech enhancement and separation have applications in multiple domains such as automatic speech recognition, mobile speech communication and the design of hearing aids [4]. Initial research on speech enhancement and separation exploited techniques such as non-negative matrix factorization [5] [6] [7], probabilistic models [8] and computational auditory scene analysis (CASA) [9]. However, these techniques are tailored for closed-set speakers (i.e., they do not work well with mixtures containing unknown speakers), which significantly restricts their applicability in real environments. Due to the recent success of deep learning models in domains such as natural language processing and computer vision, these data-driven techniques have been introduced to process audio data. In particular, DNN models have become popular in speech enhancement and separation and have achieved great performance in terms of boosting speech intelligibility and their ability to enhance speech with unknown speakers [10] [11]. To be effective in speech enhancement and separation, DNN models must extract important features of speech, maintain the order of audio frames, and exploit both local and global contextual information to achieve coherent separation of speech data. This necessitates that DNN models include techniques tailored to meet these requirements. Discussion of these techniques is the core subject of this review. Further, in the computer vision and text domains, large pre-trained models are used to extract universal representations that are beneficial to downstream tasks. The review
discusses the impact of pre-trained models on the speech enhancement and separation domain. It also discusses DNN techniques being adopted by speech enhancement and separation tools to reduce computational complexity and enable them to work in low-latency and resource-constrained environments. The review therefore covers the whole pipeline of DNN application to speech enhancement and separation, i.e., from feature extraction to model implementation, training and evaluation. Our goal is to uncover the dominant techniques at each level of DNN implementation. In each section, we highlight key emerging features and challenges that exist. A recent review [12] only looked at supervised techniques of performing speech separation; in this review, we discuss both supervised and unsupervised methods. Moreover, with the fast-growing field of deep learning, new techniques have emerged that necessitate a new look into how they are implemented in speech enhancement and separation. The review is constrained to discussing how DNN techniques are applied to monaural speech enhancement, so we do not focus on multi-channel speech separation (which has been covered in [13]).
The paper first explains the types of speech enhancement and separation (section 2) by highlighting their key elements and the tools that focus on each type. It discusses the key speech features used by speech enhancement and separation tools in section 3. This section looks at how the features are derived and how they are used to train DNN models with supervised learning techniques. Section 5 discusses the techniques the tools use to model long dependencies that exist in speech. The paper discusses model size compression techniques in section 6. In section 7, the paper discusses some of the popular objective functions used in speech enhancement and separation. Section 8 discusses how some tools implement unsupervised techniques to achieve speech enhancement and separation. Section 9 discusses how speech separation and enhancement tools are adapted to the target environment. In section 10, the paper looks at how pre-trained models are utilized in the speech enhancement and separation pipeline. Finally, section 11 looks at future directions. Figure 1 gives the overall organization and topics covered by the paper.
2 Types of speech separation and enhancement

Speech separation
Scenarios arise where more than one target speech signal is composed in a given speech mixture and the goal is to isolate each independent speech signal in the mixture. This problem is known as speech separation. For a mixture composed of C independent speech signals x_c(n) with c = 1, · · · , C, a recording y(n) composed of the C speech signals can be represented as:

y(n) = Σ_{c=1}^{C} x_c(n)    (1)

Here, n indexes time. The goal of speech separation is to estimate each independent speech signal x_c composed in y(n). Separating speech from another speech is a daunting task by virtue of the fact that all speakers belong to the same class and share similar characteristics [14]. Some models such as [15] and [16] lessen this by performing speech separation on a mixed speech signal based on the gender of the voices present. They exploit the fact that there is a large discrepancy between male and female voices in terms of vocal tract, fundamental frequency contour, timing, rhythm, dynamic range, etc. This results in a large spectral distance between male and female speakers in most cases, facilitating a good gender segregation. For speech separation where the mixture involves speakers of the same gender, the separation task is much more difficult since the pitch of the voices lies in the same range [14]. Most speech separation tools that solve this task, such as [17] [18] [19] [20] [14] and [10], cast the problem as a multi-class regression. In that case, training a DNN model involves comparing its output to a source speaker. DNN models output a dimension for each target class, and when multiple sources of the same type exist, the system needs to select arbitrarily which output dimension to map to each source; this raises a permutation problem (permutation ambiguity) [14]. Taking the case of a two-speaker separation, if the model outputs â_1 and â_2 as the estimates of the reference speech magnitudes a_1 and a_2 respectively, it is unclear in which order the model will output the estimates, i.e.
the order of output can either be {â_1, â_2} or {â_2, â_1}. A naive approach, shown in figure 2 [21], is to present the reference speech magnitudes in a fixed order and hope that this is the same order in which the system will output its estimates.
Figure 2: Naive approach of solving the label matching problem for a two-talker speech separation model

In case of a mismatch, the loss computation will be based on the wrong comparison, resulting in low quality of the separated speeches. Therefore, systems that perform speaker separation have an extra burden of designing mechanisms geared towards handling the permutation problem. There are several strategies implemented by speech separation tools to tackle the permutation problem. In [19], a number of DNN techniques are implemented that estimate the two clean speeches contained in a two-talker mixed speech. They employ supervised training to train DNN models to discriminate the two speeches based on average energy, pitch and instantaneous energy of a frame. Work in [22] and [21] introduces the permutation invariant training (PIT) technique of computing the permutation loss, in which the permutations of the reference labels are presented as a set to be compared with the output of the system. The permutation with the lowest loss is adopted as the correct order. For the two-speaker separation system introduced earlier, the reference source permutations will be {a_1, a_2} and {a_2, a_1}, such that the possible permutation losses are computed as:

L_1 = loss(â_1, a_1) + loss(â_2, a_2)
L_2 = loss(â_1, a_2) + loss(â_2, a_1)

The one that returns the lowest loss between the two is selected as the permutation loss to be minimized (see figure 3). For an S-speaker separation system, a total of S! permutations are generated. For a system that performs S-speaker separation where S is high (e.g. 10), implementation of PIT, which has a computational complexity of O(S!), is computationally expensive [23] [24]. Due to this, [24] casts the permutation problem as a linear sum assignment problem, where the Hungarian algorithm is exploited to find the permutation which minimizes the loss at a computational complexity of O(S^3). Work in [23] proposes the SinkPIT loss, which is based on Sinkhorn's matrix balancing algorithm. They utilize the loss to reduce the complexity of the PIT loss from O(S!) to O(kS^2). Work in [17] employs minimum loss permutation computation at each time step t. The best permutation (argmin) at each time step is exploited to re-order the embedding vectors to be consistent with the training labels. To evade the permutation problem, they train two separate DNN models, one for each of the two speakers to be identified. Another prominent technique of handling the permutation problem is to employ a DNN clustering technique [25] [26] [20] [27] [28] to identify the multiple speakers present in a mixed speech signal. The DNN f_θ accepts as its input the whole spectrogram X and generates a D-dimensional embedding vector V, i.e., V = f_θ(X) ∈ R^{N×D}. Here, the embedding V learns the features of the spectrogram X and is considered a permutation- and cardinality-independent encoding of the network's estimate of the signal partition. For the network f_θ to learn how to generate an embedding vector V given the input X, it is trained to minimize the cost function

C(θ) = ||V V^T − Y Y^T||^2_F    (2)
Here, Y = {y_{i,c}} represents the target partition that maps the spectrogram element S_i to each of the C clusters, such that y_{i,c} = 1 if element i is in cluster c. Y Y^T is taken here as a binary affinity matrix that represents the cluster assignment in a partition-independent way. The goal in equation 2 is to minimise the distance between the network-estimated affinity matrix V V^T and the true affinity matrix Y Y^T. The minimization is done over the training examples. ||A||^2_F is the squared Frobenius norm. Once V has been established, its rows are clustered into partitions that will represent the binary masks. To cluster the rows v_i of V, the K-means clustering algorithm is used. The resulting clusters of V are then used as binary masks to separate the sources by applying the masks on the mixed spectrogram X. The separated sources are then reconstructed using the inverse STFT. Even though PIT is popular in speech separation models, it is unable to handle the output dimension mismatch problem, where there is a mismatch between the number of speakers during training and inference [29]. For example, training a speech separation model on n-speaker mixtures but testing it on t ≠ n speaker mixtures. The PIT-based methods cannot directly deal with this problem due to their fixed output dimension. Most speech separation models such as [30] [21] [31] [32] deal with the problem by setting a maximum number of sources C that the model should output from any given mixture. If an inference mixture has K sources, where C > K, then C − K outputs are invalid, and the model needs techniques to handle the invalid sources. In the case of invalid sources, some models such as [31], [30], [21] design the model to output silence for invalid sources, while [32] outputs the mixture itself; these outputs are then discarded by comparing the energy level of the outputs relative to the mixture. The challenge with models that output silence for invalid sources is that they rely on a pre-defined energy threshold, which may
be problematic if the mixture itself has very low energy [32]. Some models handle the output dimension mismatch problem by generating a single speech in each iteration and subtracting it from the mixture until no speech is left [33], [34], [35] [36], [37]. The iterative technique, despite being trained on mixtures with a low number of sources, can generalize to mixtures with a higher number of sources [35]. It however faces the criticism that setting an iteration termination criterion is difficult, and that the separation performance decreases in later iterations due to degradations introduced in prior iterations [35]. Other speech separation models include [38] [39] [40] [31] [41] [19] [20] [42].
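The PIT loss and the deep-clustering objective described above can be sketched numerically. The sketch below is illustrative only: it assumes a mean-squared-error frame loss for PIT, and the embedding matrix V is a stand-in for a trained network's output rather than an actual model.

```python
import itertools
import numpy as np

def pit_loss(estimates, references):
    """Permutation invariant training loss (illustrative sketch).

    estimates, references: lists of S arrays (e.g. magnitude spectrograms).
    Returns the minimum total MSE over all S! label permutations,
    together with the best permutation of reference indices.
    """
    S = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(S)):  # S! candidate orderings
        loss = sum(np.mean((estimates[s] - references[p]) ** 2)
                   for s, p in zip(range(S), perm))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

def dc_loss(V, Y):
    """Deep clustering loss ||VV^T - YY^T||_F^2 over N T-F bins.

    V: (N, D) embedding matrix, one row per T-F bin.
    Y: (N, C) one-hot cluster assignment of each T-F bin.
    """
    return np.sum((V @ V.T - Y @ Y.T) ** 2)

# PIT toy usage: the model outputs happen to be in swapped order,
# so the best permutation maps output 0 to reference 1 and vice versa.
a1, a2 = np.ones((4, 3)), np.zeros((4, 3))
loss, perm = pit_loss([a2, a1], [a1, a2])   # perm == (1, 0), loss == 0.0

# Deep clustering toy usage: an embedding that exactly reproduces the
# target affinity matrix gives zero loss; any perturbation is positive.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
perfect = dc_loss(Y, Y)
noisy = dc_loss(Y + 0.1, Y)
```

Note the factorial cost: the `itertools.permutations` loop is exactly the O(S!) enumeration that motivates the Hungarian and SinkPIT alternatives discussed above.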

Speaker extraction
Some speech enhancement DNN models have been developed for the case where, given a mixed speech such as in equation 1, they design methods to extract a single target speech. These models focus only on a single target speech x_t and treat all other speeches as interfering signals; they therefore modify equation 1 as shown in equation 3:

y(n) = x_t(n) + Σ_{c ≠ t} x_c(n)    (3)
where x_t(n) is the target speech. By focusing on only a single target speech, the permutation ambiguity problem is avoided. These models formulate the speech extraction task as a binary classification problem, where the positive class is the target speech and the negative class is formed by the combination of all other speakers. A popular technique of speaker extraction is to give the DNN models additional speaker-dependent information as input that can be used to isolate a target speaker [43]. Speaker-dependent information can be injected into the DNN models either by concatenating speaker-dependent auxiliary clues with the input features or by adapting part of the DNN model parameters for each speaker [44]. This additional information about a speaker injects a bias that is necessary to differentiate the target speaker from the rest of the mixture [2]. Several auxiliary clues have been exploited by DNN models, including pre-recorded enrolment utterances of the target speaker [44] [2] [45] [46] [47] [48], electroglottographs (EGGs) of the target speaker [49] and i-vectors extracted at speaker level [50] [51]. The tool in [52] adapts parameters for each speaker by allocating a speaker-dependent module to a selected intermediate layer of the DNN. The speech extraction tool in [53] does not use auxiliary clues of the target speaker but designs attractor points that are compared with the mixed speech embeddings to generate the mask used to extract the target speech.
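The concatenation route for injecting the auxiliary clue can be sketched as below. This is illustrative only: the fixed-length speaker embedding would normally come from a speaker encoder (e.g. an i-vector or enrolment-utterance encoder), which is not implemented here, and the dimensions are hypothetical.

```python
import numpy as np

def condition_on_speaker(mixture_feats, speaker_emb):
    """Concatenate a speaker embedding to every time frame (sketch).

    mixture_feats: (T, F) spectrogram frames of the mixed speech.
    speaker_emb:   (E,) fixed-length embedding of the target speaker.
    Returns:       (T, F + E) conditioned input for the extraction DNN.
    """
    T = mixture_feats.shape[0]
    tiled = np.tile(speaker_emb, (T, 1))   # repeat the clue for each frame
    return np.concatenate([mixture_feats, tiled], axis=1)

frames = np.random.randn(100, 257)   # 100 frames, 257 frequency bins
emb = np.random.randn(128)           # hypothetical 128-d speaker embedding
conditioned = condition_on_speaker(frames, emb)   # shape (100, 385)
```

The extraction network then sees the same bias vector at every frame, which is what differentiates the target speaker from the interfering ones.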

Dereverberation
This is a speech enhancement technique that seeks to eliminate the effect of reverberation contained in speech. When speech is captured in an enclosed space by a microphone at a distance d from the talker, the observed signal consists of a superposition of many delayed and attenuated copies of the speech, resulting from reflections off the walls of the enclosed space and objects within the space (see figure 4) [54]. The signal received by the microphone consists of the direct sound, reflections that arrive shortly after the direct sound (within approximately 50 ms), i.e., early reverberation, and reflections that arrive after the early reverberation, i.e., late reverberation [55]. Normally, early reverberation does not affect speech intelligibility much [56], and much of the perceptual degradation of speech is attributed to late reverberation. Speech degradation due to reverberation can be attributed to two types of masking [57]: overlap masking, where the energy of a preceding phoneme overlaps with the one following it, and self-masking, which refers to time and frequency alterations within an individual phoneme. Reverberation therefore can be viewed as the convolution of the direct sound and the room impulse response (RIR). A reverberant speech can be formally represented according to equation 4:

y(t) = s(t) * h(t)    (4)

where s(t) is the clean speech, h(t) is the RIR and * denotes convolution. The goal of dereverberation is therefore to establish s(t) from y(t). Hence it can be viewed as a deconvolution between the speech signal and the RIR [59]. Dereverberation is considered a more challenging task than denoising for a number of reasons. First, it is difficult to pinpoint the direct speech from its copies, especially when the reverberation is strong. Secondly, the key underlying assumption of sparsity and orthogonality of speech representations in the feature domain, commonly used in monaural mask-based speech separation, does not hold for speech under reverberation [60]. Due to these unique features of reverberation, most tools designed for denoising or speaker separation are ill-suited to perform dereverberation [60]. The DNN tools for speaker separation and denoising mostly assume that they are working on reverberation-free speech, hence they make no special consideration for eliminating reverberation (with the exception of a few such as [61] [62]). For instance, [60] demonstrates this using SepFormer [11]. Most DNN dereverberation tools explore the elimination of late reverberation only, because early reverberation does not affect speech intelligibility much. Finally, the DNN dereverberation tools can be categorised based on the type of training technique used (supervised or unsupervised). Tools such as [64] and [18] perform speech dereverberation by implementing supervised training, where the DNN model is trained to directly estimate the features of the clean speech when given features from a reverberant speech.
Here D(k, f) and Y(k, f) are the STFTs of the clean speech and the reverberant speech, and M(k, f) is the ideal ratio mask, at time frame k and frequency channel f. Work in [76] exploits a conditional GAN to perform unsupervised dereverberation of a reverberant speech.
Dereverberation in the DFT magnitude domain: When dereverberation is to be performed in the DFT magnitude domain (see section 3.1), a DFT has to be applied to equation 4 such that

Y(t, f) = S(t, f) H(t, f)    (8)

The assumption in equation 8 is that the convolution of the clean signal s(t) with the RIR h(t) corresponds to the multiplication of their Fourier transforms in the T-F domain. However, this is only true if the extent of H(t, f) is smaller than the analysis window [60]. Therefore, when performing dereverberation in the T-F domain, the selection of the window is crucial to the performance of the DNN model [60].
Target selection in dereverberation: In dereverberation training, most tools use the direct speech as the target. This means that the estimated speech will be compared with the direct-path speech via a selected loss function. This has the potential of resulting in large prediction errors, which can cause speech distortion [59]. Due to this, recent work [75] proposes the use of a target that retains early reverberation. Experiments in [75] demonstrate that allowing early reverberation in the target speech improves the quality of the enhanced speech.
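The convolutive model of equation 4 can be simulated directly. In the sketch below, an exponentially decaying noise sequence is a crude stand-in for a measured room impulse response, and white noise stands in for the clean speech; the 50 ms early/late split follows the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000                                    # sample rate (Hz)
s = rng.standard_normal(sr)                   # 1 s stand-in for clean speech
t = np.arange(int(0.3 * sr)) / sr             # 300 ms RIR support
h = rng.standard_normal(t.size) * np.exp(-t / 0.05)  # exponential decay
h[0] = 1.0                                    # direct-path component

y = np.convolve(s, h)                         # reverberant observation y = s * h

# Early reverberation: first ~50 ms of the RIR; the rest is late reverberation,
# which is the part most dereverberation tools try to remove.
early = int(0.05 * sr)
h_early, h_late = h[:early], h[early:]
```

Using `s * h_early` as the training target (instead of `s` itself) is the target-selection choice advocated by [75].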

Speech denoising
This is a speech enhancement technique of separating background noise from the target speech. Formally, the noisy speech is represented as:

y_t = s_t + n_t    (9)

where y_t is the noisy speech, s_t is the target speech and n_t is the noise.
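Training data for denoising models is typically synthesised by scaling the noise so that the mixture has a chosen signal-to-noise ratio. A minimal sketch (the scaling rule is standard; the signals here are random placeholders):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Create y = s + n with the noise scaled to the requested SNR (sketch)."""
    p_s = np.mean(speech ** 2)                      # speech power
    p_n = np.mean(noise ** 2)                       # noise power
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(1)
s = rng.standard_normal(16000)    # placeholder clean speech, 1 s at 16 kHz
n = rng.standard_normal(16000)    # placeholder noise
y = mix_at_snr(s, n, snr_db=5.0)  # noisy mixture at 5 dB SNR
```

The pair (y, s) then forms one training example for a supervised denoising model.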

Speech separation and enhancement features
Speech enhancement and separation tools' input features can be categorised into two: 1. Fourier spectrum features, and 2. time-domain (raw waveform) features.

Fourier spectrum features
Speech enhancement and separation tools that use these features do not work directly on the raw signal (i.e., the signal in the time domain); rather, they incorporate the discrete Fourier transform (DFT) in their signal processing pipeline, mostly as the first step, to transform a time-domain signal into the frequency domain. These models recognise that speech signals are highly non-stationary and that their features vary in both time and frequency. Therefore, extracting their time-frequency features using the DFT better captures the representation of the speech signal [86]. To demonstrate the DFT process we use the noisy speech signal shown in equation 10; the same process can be applied in speech separation. A noisy raw waveform signal of speech, y(t), can be represented as in equation 10:

y(t) = x(t) + n(t)    (10)

where x(t) and n(t) represent the discrete clean speech and noise respectively. Since speech is assumed to be statistically stationary for a short period of time, it is analysed frame-wise using the DFT as shown in equation 11 [86] [87] [88]:

Y[t, k] = Σ_{n=0}^{L−1} y_t(n) w(n) e^{−j2πkn/L}    (11)

where y_t(n) denotes the samples of frame t.
Here, k represents the index (frequency bin) of the discrete frequency, L is the length of the frequency analysis and w(n) is the analysis window. In speech analysis, the Hamming window is mostly used as w(n) [89]. Once the DFT has been applied to the signal y(t), it is transformed into the time-frequency domain represented as:

Y[t, k] = X[t, k] + N[t, k]    (12)

where Y[t, k], X[t, k] and N[t, k] are the DFT representations of the noisy speech, clean speech and noise respectively. Each term in equation 12 can be expressed in terms of its DFT magnitude and phase spectrum. For example, the polar form (including magnitude and phase) of the noisy signal Y[t, k] is:

Y[t, k] = |Y[t, k]| e^{jθ_y[t,k]}    (13)

Both the magnitude and the phase are computed from the real and the imaginary parts of Y[t, k], i.e.

|Y[t, k]| = √(Re(Y[t, k])^2 + Im(Y[t, k])^2)    (15)

θ_y[t, k] = arctan(Im(Y[t, k]) / Re(Y[t, k]))    (16)
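The magnitude and phase computations above can be reproduced with an off-the-shelf STFT; a minimal sketch using scipy, with the 32 ms Hamming-window analysis discussed later set explicitly:

```python
import numpy as np
from scipy.signal import stft

sr = 16000
y = np.random.default_rng(2).standard_normal(sr)   # 1 s placeholder waveform
# 32 ms Hamming window (512 samples at 16 kHz), 50% overlap
f, t, Y = stft(y, fs=sr, window="hamming", nperseg=512, noverlap=256)

magnitude = np.abs(Y)      # eq. 15: sqrt(Re^2 + Im^2)
phase = np.angle(Y)        # eq. 16: arctan(Im / Re), quadrant-aware
# Polar form (eq. 13) recombines both without loss of information:
Y_rebuilt = magnitude * np.exp(1j * phase)
```

Magnitude-based tools keep only `magnitude` as the model input, while complex-feature tools keep the real and imaginary parts of `Y`.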
All models that work with Fourier spectrum features either use the DFT representations directly as the input of the model or further modify the DFT features. The features based on the Fourier spectrum include: 1. DFT magnitude features, 2. DFT complex features, 3. Mel-frequency cepstral coefficient (MFCC) features, 4. log-power spectrum features, and 5. complementary features.

DFT magnitude features: The use of the DFT magnitude as features [95] works with a high frequency resolution, hence necessitating the use of a larger time window, typically more than 32 ms [20] [21] for speech and more than 90 ms for music separation [96]. Due to this, these models must handle increased computational complexity [97]. This has motivated other speech separation models to work with lower-dimensional features compared to those of the DFT magnitude.

DFT complex features: Unlike the DFT magnitude features that only use the magnitude of the T-F representations, tools that use DFT complex features include both the magnitude and the phase of the noisy (mixed) speech signal in the estimation of the enhanced or separated speech. Therefore, each T-F unit of the complex features is a complex number with a real and an imaginary component (see equation 13). The magnitude and phase of a signal are computed according to equations 15 and 16 respectively. Tools that use DFT complex features include [98] [55] [99] [100].
Mel-frequency cepstral coefficient (MFCC) features: Given a mixed speech signal such as in equation 10, the following steps are executed to extract the Mel-frequency cepstral features:

1. Given the DFT features Y[n, k] of the input signal, define a filterbank of M triangular filters H_m[k], 1 ≤ m ≤ M, given by:

H_m[k] = 0                                    for k < f[m−1]
H_m[k] = (k − f[m−1]) / (f[m] − f[m−1])       for f[m−1] ≤ k ≤ f[m]
H_m[k] = (f[m+1] − k) / (f[m+1] − f[m])       for f[m] ≤ k ≤ f[m+1]
H_m[k] = 0                                    for k > f[m+1]    (17)

The filters are used to compute the average spectrum around centre frequencies with increasing bandwidths. Here, f[m] are uniformly spaced boundary points in the Mel scale, computed according to equation 18:

f[m] = (N / F_s) B^{−1}( B(f_l) + m (B(f_h) − B(f_l)) / (M + 1) )    (18)

The Mel scale B is given by equation 19 and its inverse B^{−1} is computed as shown in equation 20:

B(f) = 2595 log10(1 + f / 700)    (19)

B^{−1}(b) = 700 (10^{b/2595} − 1)    (20)

F_s is the sampling frequency, f_l and f_h represent the lowest and the highest frequencies of the filter bank in Hz, N is the size of the DFT and M is the number of filters.

3. Compute the log-energy output of each filter as shown in equation 21:

S[m] = ln( Σ_{k=0}^{N−1} |Y[n, k]|^2 H_m[k] )    (21)

for m = 1, · · · , M, where M is the number of filter banks.

4. Compute the Mel-frequency cepstral coefficients as the discrete cosine transform of the M filter outputs, as shown in equation 22:

c[n] = Σ_{m=1}^{M} S[m] cos( πn (m − 1/2) / M )    (22)
where 0 ≤ n < M. The motivation for working with MFCC is that it results in a reduced-resolution feature space compared to the DFT features. Fewer parameters are easier to learn and may generalise better to unseen speakers and noise [97]. The challenge, however, with working at a reduced resolution such as MFCC is that the DNN-estimated features must be extrapolated back to the DFT feature space. Due to working at a reduced resolution, the degrees of freedom are restricted by the dimensionality of the reduced-resolution feature space, which is much less than that of the DFT space. The low-rank approximation generates a sub-optimal Wiener filter which cannot account for all the added noise content and yields reduced SDR [97]. MFCC features have been exploited in tools such as [101].

Log-power spectra features: To compute these features, a short-time Fourier analysis is applied to the raw signal, computing the DFT of each overlapping waveform frame (see equation 11). The log-power spectra are then computed from the output of the DFT. Consider a noisy speech signal in the time-frequency domain, i.e., where the DFT has been applied to the signal (see equation 12). From equation 14, the power spectrum of the noisy signal can be represented as in equation 23. [105] summarises the process of log-power feature extraction.

Complementary features: Since different features strongly capture different acoustic properties of the speech signal, some DNN models exploit a combination of features to perform speech separation. This is based on works such as [108] and [109], which demonstrated that complementary features significantly improve performance in speech recognition. The complementary features used in [109] [110] [111] include perceptual linear prediction, amplitude modulation spectrogram (AMS), relative spectral transform and perceptual linear prediction (RASTA-PLP), Gammatone frequency cepstral coefficients, MFCC, and pitch-based features. The complementary features are combined by concatenation. Research in [111] reports that the use of complementary features registered better results compared to those of the DFT magnitude. The challenge with using complementary features is how to effectively combine the different features such that those complementing each other are retained while redundant ones are eliminated [110].
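The MFCC extraction steps above (filterbank, log-energies, DCT) can be sketched in plain numpy. This is an illustrative implementation of equations 17-22 under common default choices (26 filters, 512-point DFT, 16 kHz sampling), not a drop-in replacement for a production feature extractor.

```python
import numpy as np

def mel_filterbank(M=26, N=512, fs=16000, fl=0.0, fh=8000.0):
    """Triangular mel filterbank following equations 17-20 (sketch)."""
    B = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)        # eq. 19
    Binv = lambda b: 700.0 * (10.0 ** (b / 2595.0) - 1.0)   # eq. 20
    # Boundary points uniformly spaced on the mel scale (eq. 18)
    mels = np.linspace(B(fl), B(fh), M + 2)
    f_m = np.floor((N + 1) * Binv(mels) / fs).astype(int)
    H = np.zeros((M, N // 2 + 1))
    for m in range(1, M + 1):                               # eq. 17
        lo, c, hi = f_m[m - 1], f_m[m], f_m[m + 1]
        for k in range(lo, c):
            H[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            H[m - 1, k] = (hi - k) / max(hi - c, 1)
    return H

def mfcc(power_spectrum, H, n_ceps=13):
    """Log filterbank energies (eq. 21) followed by the DCT (eq. 22)."""
    S = np.log(H @ power_spectrum + 1e-10)   # (M,) log energies
    M = S.size
    n = np.arange(n_ceps)[:, None]
    m = np.arange(M)[None, :]
    return np.cos(np.pi * n * (m + 0.5) / M) @ S

H = mel_filterbank()
frame = np.random.default_rng(3).standard_normal(512)   # one analysis frame
spec = np.abs(np.fft.rfft(frame)) ** 2                  # power spectrum
coeffs = mfcc(spec, H)                                  # 13 cepstral coefficients
```

Keeping only 13 coefficients out of 257 frequency bins is exactly the dimensionality reduction (and the associated loss of degrees of freedom) discussed above.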

Supervised speech enhancement and separation training with Fourier spectrum features
DNN models that are trained via supervised learning using Fourier spectrum features employ several strategies to learn how to generate an estimated clean signal from a noisy (mixed) signal. These strategies can be classified into three categories based on the target of the model.

Spectral mapping techniques
These models fit a nonlinear function to learn a mapping from the mixed (noisy) signal features to the estimated clean signal features (see figure 7). The training dataset of these models consists of noisy speech signal (source) and clean speech (target) features. The process of training these models can be generalised in the following steps: 1. Given N raw waveforms of mixed (noisy) speech, convert the N raw waveforms to the desired representation (such as a spectrogram).
2. Convert the respective N clean speech waveforms in the time domain to the same representation as that of the noisy speech.
3. Create an annotated dataset consisting of pairs of noisy speech features and clean speech features, i.e., < noisy_speech_features_i, clean_speech_features_i > with i = 1, · · · , N.

4. Train a deep learning model g_θ to learn how to estimate the clean features clean_speech_features_i given a noisy speech feature noisy_speech_features_i as input, by minimizing an objective function.
5. Given new noisy speech features x_j, the trained model g_θ should estimate clean speech features y_j.
6. Using the estimated clean speech features y_j, reconstruct the raw waveform by performing the inverse of the feature generation process (such as using the inverse short-time Fourier transform if the features are in the time-frequency domain).
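The six steps above can be sketched end-to-end. In this illustrative sketch, an identity function stands in for the trained model g_θ (so the "enhanced" output simply round-trips through the feature space), and the noisy phase is reused for reconstruction, which is a common choice for magnitude-domain models.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy_waveform, model, sr=16000):
    """Spectral mapping pipeline sketch: STFT -> model -> iSTFT.

    `model` maps a magnitude spectrogram to an estimated clean one;
    the noisy phase is reused when inverting back to a waveform.
    """
    f, t, Y = stft(noisy_waveform, fs=sr, nperseg=512)   # steps 1-2: features
    magnitude, phase = np.abs(Y), np.angle(Y)
    clean_mag = model(magnitude)                         # step 5: estimation
    _, x_hat = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return x_hat                                         # step 6: waveform

y = np.random.default_rng(4).standard_normal(16000)
x_hat = enhance(y, model=lambda mag: mag)   # identity stand-in for g_theta
```

With the identity stand-in the pipeline reconstructs the input (up to STFT padding), which is a useful sanity check before plugging in a real trained model.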

The above generalisation has been exploited in [83] [92] [95] [106] [112] [113] [84]
[85] to achieve speech enhancement, and in [94] and [103] to perform speech separation and enhancement. Figure 8 gives a summary of the steps when the time-frequency representation (spectrogram) is used as the input of the speech enhancement model.

Spectral masking techniques
Here, the task of estimating clean speech features from noisy (mixed) speech input features is formulated as that of predicting real-valued or complex-valued masks [55]. The mask function is usually constrained to be in the range [0, 1], even though different types of soft masks have been proposed (see [21] [114] [115]). Source separation based on masks is predicated on the assumption of sparsity and orthogonality of the sources in the domain in which the masks are computed [60].
Due to the sparsity assumption, the dominant signal in a given range (such as a time-frequency bin) is taken to be the only signal in that range (i.e., all other signals are ignored except the dominant signal).
In that case, the role of the DNN-estimated mask is to estimate the dominant source in a given range. To do this, the mask is applied on the input features such that it eliminates portions of the signal (where the mask has a value of 0) while letting others pass (mask value of 1) [116] [117]. The masks are established by computing the signal-to-noise ratio (SNR) within each T-F bin against a threshold or a local criterion [116]. It has been demonstrated experimentally that the use of masks significantly improves speech intelligibility when the original speech is corrupted by noise or a masker speech signal. Two types of cost functions are used when training mask-based models. The first type compares the estimated mask with a target mask directly; its drawback is that silence regions make the target mask undefined [21]. This cost function also focuses on minimizing the disparity between the masks instead of between the features of the estimated signal and the target clean signal [21]. The second type of cost function seeks to minimize the difference between the features of the estimated signal Ŝ_t = m_t ⊗ Y[t, f] and those of the target clean signal S directly, as shown in equation 25.
The sum is over all the sources u and time-frequency bins (t, f). Here, Y and S represent the noisy (mixed) and clean (target) speech respectively. So, for DNN tools using indirect estimation of the clean signal features, instead of estimating the clean features directly from the noisy input features, the models first estimate binary masks. The binary masks are then applied to the noisy features to separate the sources (see figure 5; here, the features are the T-F spectrogram). This technique has been applied in [3]. Generative modelling techniques such as GANs and variational autoencoders (VAE) have also been applied to speech enhancement [127]. Like GAN, VAE is mainly used for denoising, i.e., where the mixture is modelled as:

x_{fn} = √(g_n) s_{fn} + b_{fn}    (27)

Here, x_{fn} denotes the mixture at frequency index f and time-frame index n, g_n ∈ R_+ is a frequency-independent but frame-dependent gain, while s_{fn} and b_{fn} represent the clean speech and the noise respectively at frequency index f and time-frame index n. We first give a brief overview of VAE before we discuss how it is adapted for speech enhancement. Mathematically, given an observable sample s, the goal of a generative VAE model is to model the true data distribution p(s).
To do this, VAE assumes that the observed samples s are generated by an associated latent variable z, and that their joint distribution is p(s, z). The model therefore seeks to maximize the likelihood p(s) over all observed data.

p(s) = ∫ p(s, z) dz
Integrating out the latent variable z in the above equation is intractable. However, the log-likelihood of the observed data p(s) can be estimated using the evidence lower bound (ELBO). The ELBO is given in equation 28 (refer to [81] for the derivation of the relationship between p(s) and the ELBO):

ELBO = E_{q_φ(z | s)} [ log ( p(s, z) / q_φ(z | s) ) ]    (28)
Here, q_φ(z | s) is a flexible variational distribution with parameters φ over which the model maximizes the ELBO. Equation 28 can be written as equation 29 using Bayes' theorem.
Equation 29 can be expanded as equation 30, and equation 30 in turn as equation 31. The second term on the right of equation 31 seeks to learn the prior p(z) via q_φ(z | s), while the first term reconstructs the data based on the learned latent variable z. q_φ(z | s) is usually modelled by a DNN referred to as the encoder, and the reconstruction term by another DNN referred to as the decoder.
Both the encoder and decoder are trained simultaneously. The encoder is normally chosen to model a multivariate Gaussian with diagonal covariance and the prior is often selected to be a standard multivariate Gaussian. To estimate clean speech based on variational-autoencoder pre-training, the tools execute several techniques that can be generalised into the following steps:
1. Pre-train a VAE on clean speech. The posterior estimator q_φ(z | s) is a Gaussian distribution with parameters µ_d and δ_d. These parameters are established by the encoder deep neural network such that µ_d : R^F → R and δ_d : R^F → R^+.
2. Set up a noise model using unsupervised techniques such as NMF [128]. For example, in the case of NMF, the noise b_fn in equation 27 can be modelled with a Gaussian distribution N(0, δ) with zero mean and variance δ.
3. Set up a mixture model such that p(x | z, θ_s, θ_u) is maximised. Here x is the noisy speech signal, θ_s are the parameters from the pre-trained model in step 1, i.e., φ and θ, and θ_u = {g_n, (W_b, H_b)_{f,n}} represents the parameters to be optimised. The parameters θ_u are optimised by an appropriate Bayesian inference technique.
4. Reconstruct the clean speech ŝ such that p(ŝ | θ_u, θ_s, x) is maximised, based on the parameters θ_u, θ_s from steps 3 and 1 respectively and the observed mixed speech x.
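The reparameterization trick and the two ELBO terms (reconstruction and KL divergence) underlying step 1 can be sketched numerically. The linear "encoder" and "decoder" below are hypothetical stand-ins for the DNNs, not any published architecture; the KL term uses the closed form for a diagonal Gaussian against a standard Gaussian prior.

```python
import numpy as np

rng = np.random.default_rng(0)
F, D = 8, 2                        # feature dim (one spectral frame) and latent dim
s = rng.normal(size=F)             # one observed clean-speech frame (toy data)

# hypothetical linear "encoder": maps s to posterior mean and log-variance
W_mu, W_lv = rng.normal(size=(D, F)), rng.normal(size=(D, F)) * 0.1
mu, log_var = W_mu @ s, W_lv @ s

# reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
eps = rng.normal(size=D)
z = mu + np.exp(0.5 * log_var) * eps

# hypothetical linear "decoder": reconstructs the frame from z
W_dec = rng.normal(size=(F, D))
s_hat = W_dec @ z

# ELBO = reconstruction term - KL(q(z|s) || N(0, I)), Gaussian closed form
recon = -0.5 * np.sum((s - s_hat) ** 2)   # log-likelihood up to a constant
kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
elbo = recon - kl
```

In training, the ELBO would be maximized over the encoder/decoder weights by gradient ascent; here the weights are random and only the objective computation is shown.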

Works that exploit different versions of the variational auto-encoder technique include [77] [78] [1] [79] [129]. Another generative modelling technique that has been used in speech enhancement is the variational diffusion model (VDM) [130]. VDM is composed of two processes, i.e., a diffusion process and a reverse process. The diffusion process perturbs data to noise and the reverse process seeks to recover data from noise. The goal of diffusion therefore is to transform a given data distribution into a simple prior distribution, mostly a standard Gaussian, while the reverse process recovers data by learning a decoder parameterised by a DNN. Formally, representing true data samples and latent variables as x_t, where t = 0 represents true data and 1 ≤ t ≤ T represents a sequence of latent variables, the VDM posterior is represented as:

q(x_{1:T} | x_0) = ∏_{t=1}^{T} q(x_t | x_{t−1})

The VDM encoder q(x_t | x_{t−1}), unlike that of VAE, is not learned; rather it is a predefined linear Gaussian model. The Gaussian encoder is parameterized with mean u_t(x_t) = √(α_t) x_{t−1} and variance (1 − α_t)I. Therefore, the encoder q(x_t | x_{t−1}) can mathematically be represented as:

q(x_t | x_{t−1}) = N(x_t; √(α_t) x_{t−1}, (1 − α_t)I)

α_t evolves over time such that the final distribution of the latent p(x_T) is a standard Gaussian. The reverse process seeks to train a decoder that starts from the standard Gaussian distribution p(x_T).
Formally, the reverse process can be represented as:

p(x_{0:T}) = p(x_T) ∏_{t=1}^{T} p_θ(x_{t−1} | x_t)

Here p(x_T) = N(x_T; 0, I). The reverse process seeks to set up a decoder p_θ(x_{t−1} | x_t) that optimizes the parameters θ such that the conditionals p_θ(x_{t−1} | x_t) are established. Once the VDM is optimized, a sample from the Gaussian noise p(x_T) can iteratively be denoised through the transitions p_θ(x_{t−1} | x_t) for T steps to generate a simulated x_0. Using the reparameterization trick, x_t in equation 34 can be rewritten as:

x_t = √(α_t) x_{t−1} + √(1 − α_t) ε, where ε ∼ N(0, I)

Based on this and through iterative derivation of equation 34, it can be shown that:

x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε, where ᾱ_t = ∏_{i=1}^{t} α_i

In the reverse process in equation 36, the transition probability p_θ(x_{t−1} | x_t) can be represented by two parameters µ_θ and δ_θ as N(x_{t−1}; µ_θ(x_t, t), δ_θ(x_t, t)² I), with θ being the learnable parameters.
It has been shown in [131] that µ_θ(x_t, t) can be established as:

µ_θ(x_t, t) = (1/√(α_t)) (x_t − ((1 − α_t)/√(1 − ᾱ_t)) ε_θ(x_t, t))

Based on equation 40, to estimate µ_θ(x_t, t) the DNN ε_θ(x_t, t) needs to estimate the Gaussian noise in x_t which was injected during the diffusion process. Like VAE, VDM uses the ELBO objective for optimization. Please see [131] for a thorough discussion of VDM. In speech denoising, work in [81] uses a conditional diffusion process to model the encoder q(x_t | x_{t−1}). In the conditional encoder, instead of q(x_t | x_{t−1}), they define it as q(x_t | x_0, y), i.e., q(x_t | x_0, y) = N(x_t; (1 − m_t)√(ᾱ_t) x_0 + m_t √(ᾱ_t) y, δ_t I). Here x_0 and y represent the clean speech and the noisy speech respectively. The encoder is modeled as a linear interpolation between the clean speech x_0 and the noisy speech y with interpolation ratio m_t. The reverse process p_θ(x_{t−1} | x_t, y) is conditioned on the noisy speech y. Here, µ_θ(x_t, y, t) is the mean of the conditional reverse process. Similar to equation 40, µ_θ(x_t, y, t) is estimated as c_xt x_t + c_yt y − c_t ε_θ(x_t, y, t), where ε_θ(x_t, y, t) is a DNN model that estimates the combination of Gaussian and non-Gaussian noise. The coefficients c_xt, c_yt and c_t are established via the ELBO optimization. Other generative modelling techniques for speech enhancement (denoising) have also been proposed.
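The closed-form diffusion step x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε can be sketched directly; the linear noise schedule below is illustrative, not the schedule of any cited work.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
betas = np.linspace(1e-4, 0.2, T)       # illustrative noise schedule beta_t
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bar = np.cumprod(alphas)          # \bar{alpha}_t = prod_i alpha_i

x0 = rng.normal(size=1000)              # toy "clean speech" samples

def diffuse(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form (no step-by-step chain needed)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x_last = diffuse(x0, T - 1, rng)
# as t -> T the signal coefficient shrinks and x_t approaches N(0, I)
```

A trained reverse model would then start from such near-Gaussian latents and iteratively denoise them; only the fixed forward process is shown here.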

Highlights on Fourier spectrum features
1. When performing a DFT on the input signal, an optimum window length must be selected. The choice of the window has a direct impact on the frequency resolution and the latency of the system. To achieve good performance, most systems use 32 ms. This may limit the use of DFT-based models in environments which require short latency [132].
2. DFT is a generic method for signal transformation that may not be optimised for waveform transformation in speech separation. It is therefore important to know to what extent it places an upper bound on the performance level of speech enhancement techniques.
3. Accurate reconstruction of the estimated clean speech from the estimated features is not easy, and erroneous reconstruction places an upper bound on the accuracy of the reconstructed audio.
4. Perhaps the biggest challenge when working in the frequency domain is how to handle the phase. Most DNN models only use the magnitude spectrum of the noisy signal to train the DNN, then factor in the phase of the noisy signal during reconstruction. Recent works such as [89] have shown that this technique does not generate optimum results.
5. While working in the frequency domain, experimental research has demonstrated that spectral masking generates better results in terms of enhanced speech quality as compared to the spectral mapping method [119].

Handling of phase in frequency domain
The assumption made by most DNN models that use Fourier spectrum features is that phase information is not crucial for human auditory perception. Therefore, they exploit only the magnitude or power of the input speech to train the DNN models to learn the magnitude spectrum of the clean signal, and factor in the phase during the reconstruction of the signal (see figure 10) [113] [133] [105] [134] [135]. The use of the phase from the noisy signal to estimate the clean signal is based on works such as [136] that demonstrated that the phase of the noisy signal is the optimal estimator of the clean signal's phase. Further, most speech separation models work on frames of size between 20-40 ms and assume that the short-time phase contains little information [137] [138] [139] [140] and is therefore not crucial when estimating clean speech. However, recent research [89] has demonstrated through experiments that further improvements in the quality of estimated clean speech can be attained by processing both the short-time phase and magnitude spectra. Further, the factoring in of the noisy input phase during reconstruction has been noted to be a problem, since the phase errors in the input interact with the amplitude of the estimated clean signal, causing the amplitude of the estimated clean signal to differ from the amplitude of the actual clean signal being estimated [115], [64]. Based on this, phase-sensitive objective functions have been proposed (equation 43). Here, D is a selected objective function such as MSE, and θ_y and θ_s represent the phase of the noisy and clean (target) speech respectively. The sum is over all the speeches u and time-frequency bins (t, f).
Experiments conducted based on the objective function in equation 43 show superior results in terms of signal-to-distortion ratio (SDR) [115]. Work in [111] trains a DNN model to generate masks that are composed of both a real and an imaginary part (see equation 44). The complex mask is then applied to a complex representation of the noisy signal to generate the estimated clean signal. By learning a mask that has both a real and an imaginary part, they integrate the phase as part of the learning.
O_r is the real part of the mask estimated by the DNN model while O_i is the imaginary part. M_r is the real part of the target mask while M_i is the imaginary part. N is the number of frames and (t, f) is a given TF bin. The complex mask implementation has been exploited in [55] [115] [141], where the targets are formulated in the complex coordinate system, i.e., the magnitude and phase are composed as part of the learning process. Work in [42] proposes a model that learns the phase during training via the multiple input spectrogram inversion (MISI) algorithm [142]. Work in [143] proposes a generative adversarial network (GAN) [126] based technique of learning the phase during training. Other works that learn the phase during training include [63] and [144]. Techniques that include phase as part of the training face the difficulty of processing a phase spectrogram which is randomly distributed and highly unstructured [145]. To mitigate this problem and derive highly structured phase-aware target masks, [145] employs instantaneous frequency (IF) [146] to extract structured patterns from phase spectrograms.
Post-processing phase update: The models that use this technique train the DNN using only the magnitude spectrum. Once the model has been trained to estimate the magnitude spectrum of the clean signal, they iteratively update the phase of the noisy signal to be as close as possible to that of the target clean signal. The algorithm exploited by models performing post-processing phase update is based on the Griffin-Lim algorithm proposed in [147]. For example, in [64], they exploit the magnitude X_0 of the target clean signal to iteratively obtain an optimal phase φ from the phase of the noisy signal (see algorithm 1). The obtained phase is then used in the reconstruction of the estimated clean signal together with the magnitude X estimated by the DNN. The technique is also used in [148]. Techniques that implement the Griffin-Lim algorithm, such as algorithm 1, perform iterative phase reconstruction of each source independently and may not be effective for multiple source separation where the sources must sum up to the mixture [42]. Work in [42] proposes to jointly reconstruct the phase of all sources in a given mixture by exploiting their estimated magnitudes and the noisy phase using the multiple input spectrogram inversion (MISI) algorithm [142]. They ensure that the sum of the reconstructed time-domain signals after each iteration sums to the mixture signal. Works [149] and [62] also use post-processing to update the phase of the noisy signal.

Algorithm 1 Iteratively updating the phase of a noisy signal
Require: target clean magnitude X_0, noisy phase φ_0, number of iterations N.
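The iterative phase update of algorithm 1 follows the Griffin-Lim scheme: alternately impose the target magnitude and project back onto the set of consistent spectrograms via iSTFT/STFT. A minimal NumPy sketch with a hand-rolled STFT/iSTFT pair (window, hop and iteration count are illustrative, not the settings of [64]):

```python
import numpy as np

def stft(x, n=256, hop=128):
    """Hann-windowed frames -> real FFT; returns a (frames, n//2+1) spectrogram."""
    w = np.hanning(n)
    frames = [x[i:i + n] * w for i in range(0, len(x) - n + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(Z, n=256, hop=128):
    """Windowed overlap-add inverse with squared-window normalisation."""
    w = np.hanning(n)
    frames = np.fft.irfft(Z, n=n, axis=1)
    x = np.zeros((len(frames) - 1) * hop + n)
    norm = np.zeros_like(x)
    for i, f in enumerate(frames):
        x[i * hop:i * hop + n] += f * w
        norm[i * hop:i * hop + n] += w ** 2
    return x / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
clean = rng.normal(size=4096)                  # toy stand-in for clean speech
X0 = np.abs(stft(clean))                       # target clean magnitude
phi = rng.uniform(-np.pi, np.pi, X0.shape)     # stand-in for the noisy phase

for _ in range(10):                            # iterative phase update
    x = istft(X0 * np.exp(1j * phi))           # impose the target magnitude
    phi = np.angle(stft(x))                    # keep only the consistent phase
```

Each pass keeps the target magnitude X_0 fixed and replaces φ with the phase of a spectrogram that actually corresponds to a time-domain signal, so successive iterations make the pair (X_0, φ) increasingly consistent.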

Time-domain features
Due to the challenges, highlighted in section 3.1.2, of working in the time-frequency domain, different models such as [132] [157] explore the idea of designing deep learning models for speech separation that accept the speech signal in the time domain. The fundamental concept of these models is to replace the DFT-based input with a data-driven representation that is jointly learned during model training. The models therefore accept the mixed raw waveform as their input and then generate either the estimated clean sources or masks that are applied to the noisy waveform to generate the clean sources. By working on the raw waveform, these models address two key limitations of DFT-based models. First, the models are designed to fully learn the magnitude and phase information of the input signal during training [150]. Secondly, they avoid the reconstruction challenges faced when working with DFT features. The time-domain methods can broadly be classified into two categories [150].

Adaptive front-end based method
The models in this category can roughly be described as composed of three key modules, i.e., the encoder, separation and decoder modules (see figure 11).
1. Encoder: The encoder can be regarded as an adaptive front-end which seeks to replace the STFT with a differentiable transform that is jointly trained (learned) with the separation model. It accepts as its input a time-domain mixture signal, then learns an STFT-like representation [11] [155]. By working directly with the time-domain signal, these models avoid decoupling the magnitude and phase of the input signal [10]. Most systems employ a 1-dimensional convolution as the encoder to learn the features of the input signal. The transform generated by the encoder is then passed to the separation module. Work in [158] demonstrates that bases learned from raw data produce better results for speech/non-speech separation.
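The learned front-end amounts to framing the waveform and applying a learnable basis followed by a non-linearity, in place of a fixed DFT. A minimal sketch with hypothetical dimensions (filter length, basis size and hop are illustrative; the basis here is randomly initialised rather than trained):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 40, 64                        # filter length (~2.5 ms at 16 kHz) and basis size
U = rng.normal(size=(N, L)) * 0.1    # learnable encoder basis (random init here)

def encode(wave, hop=20):
    """Adaptive front-end: framing + learned 1-D conv basis + ReLU."""
    frames = np.array([wave[i:i + L] for i in range(0, len(wave) - L + 1, hop)])
    return np.maximum(frames @ U.T, 0.0)   # (time, N) non-negative, STFT-like

mix = rng.normal(size=16000)             # 1 s toy mixture at 16 kHz
rep = encode(mix)                        # representation fed to the separator
```

In a full system, U would be a `Conv1d` layer trained jointly with the separation and decoder modules; the ReLU keeps the representation non-negative, loosely analogous to a magnitude spectrogram.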

Waveform Mapping
The second category of systems implements end-to-end models which utilise deep learning to fit a regression function that maps an input mixed signal to its constituent estimated clean signals without an explicit front-end encoder (see figure 12). The models are trained using pairs of mixed (noisy) and clean speech. The model is fed with features of the mixed signal in order to estimate the clean speech. The training involves minimising an objective function, such as the minimum mean square error (MMSE), between the features of the clean signal and the estimated clean signal generated by the model. This approach has been implemented in [159] [160] [161].
Figure 12: Direct approach training of DNN models using raw waveform.
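The direct mapping approach reduces to fitting a regression by gradient descent on the MSE. As a toy sketch, a single linear layer (a hypothetical stand-in for a deep network) is trained to map noisy frames to clean frames; all sizes and the learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
F = 32                                    # frame length (toy)
clean = rng.normal(size=(500, F))         # paired training frames
noisy = clean + 0.3 * rng.normal(size=(500, F))

W = np.zeros((F, F))                      # single linear "network" (toy stand-in)
lr = 0.01
losses = []
for _ in range(200):
    est = noisy @ W                       # model's estimated clean frames
    err = est - clean
    losses.append(np.mean(err ** 2))      # the MMSE training objective
    W -= lr * noisy.T @ err / len(noisy)  # gradient step on the MSE loss
```

The loss falls from roughly the clean-signal power toward the residual error of the best linear denoiser; a deep network plays the role of W in real end-to-end systems.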

Generative modelling
SEGAN [162] is a GAN-based model for speech denoising that conditions both G and D of equation 26 on extra information z representing a latent representation of the input. To solve the problem of vanishing gradients associated with optimizing the objective in equation 26, they replace the cross-entropy loss with the least-squares function in equation 45.
Here, x̃ is the noisy speech, x is the clean speech, z is the extra input latent representation and ||·||_1 is the l_1 norm distance between the clean sample x and the generated sample G(z, x̃), included to encourage the generator G to generate more realistic audio. Work in [163] improves SEGAN to handle a more generalised speech-distortion case which involves distortions such as chunk removal, band reduction, clipping and whispered speech. Work [164] improves SEGAN by implementing multiple generators as opposed to one and demonstrates that by doing so the quality of the enhanced speech is better than when a single generator is used. Work in [165] proposes a variation of SEGAN that is more tailored towards speech synthesis rather than ASR. They replace the original loss function used in SEGAN with the Wasserstein distance with gradient penalty (WGAN) [166]. They also exploit the gated linear unit as an activation function, which has been shown in [167] to be more robust in generating realistic speech. Other GAN-based models for speech enhancement in the time domain include [168].
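The least-squares swap of equation 45 can be illustrated with dummy discriminator scores: real pairs are pushed toward 1, fakes toward 0, and the generator pushes its fakes toward 1 while an L1 term (weight λ is hypothetical) keeps the enhanced waveform close to the clean one.

```python
import numpy as np

rng = np.random.default_rng(0)
# dummy discriminator scores standing in for D(clean, noisy) and D(G(z, noisy), noisy)
d_real = rng.uniform(0.6, 1.0, size=16)
d_fake = rng.uniform(0.0, 0.4, size=16)
clean = rng.normal(size=(16, 100))                 # toy clean waveforms
enhanced = clean + 0.1 * rng.normal(size=(16, 100))  # toy generator outputs

# least-squares GAN losses (real -> 1, fake -> 0; generator pushes fake -> 1)
loss_d = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
lam = 100.0                                        # hypothetical L1 weight
loss_g = 0.5 * np.mean((d_fake - 1.0) ** 2) \
         + lam * np.mean(np.abs(enhanced - clean)) # l1 term toward realistic audio
```

Unlike the cross-entropy objective, the squared penalties keep gradients informative even for confidently classified samples, which is the motivation given for the swap.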

Challenges of working with time-domain features
1. Time-domain features lack a direct frequency representation; this hinders the features from capturing speech phonetics that are present in the frequency domain. Due to this, artefacts are always introduced in the speech reconstructed in the time domain [172].
2. The time-domain waveform has a large input space. Because of this, models working with raw waveforms are often deep and complex in order to effectively model the dependencies in the waveform. This is computationally expensive [72] [162] [11] [173].
Which feature produces superior quality of enhanced speech?
We performed an analysis of 500 papers that exploit DNNs to perform speech enhancement (i.e., multi-talker speech separation, denoising or dereverberation). We selected papers published from 2018 to 2022. We were interested in answering the question: which features are more popular with these tools? The summary is presented in figure 13. Based on the analysis, the popularity of time-domain features grew rapidly from 2018 to 2022. The use of DFT features dropped slightly but remains popular over the five years. The popularity of MFCC and LPS has diminished. The popularity of features that are computationally expensive, such as time-domain and DFT features, may be attributed to the improved computational power of computers and efficient sequence modelling techniques such as transformers and temporal convolutional networks (see section 5 for discussion). Features such as MFCC are becoming less popular due to their reduced resolution, which must be extrapolated during reconstruction, hence placing an upper bound on the quality of enhanced speech. We also investigated whether DFT or time-domain features produce the highest quality enhanced speech. Several works have conducted experiments with the goal of answering this question; notable works include [174] and [175]. For example, [174] investigates Conv-TasNet's [10] performance under different input types in the encoder and decoder. Conv-TasNet uses a frame length of 4 ms, a stride of 2 ms and an overlap of 2 ms. Sample results from [174] are presented in table 1, where the evaluation parameters include scale-invariant signal-to-distortion ratio (si_SDR), signal-to-distortion ratio (SDR) and word error rate (WER). The results in table 1 show that the Conv-TasNet model gives marginally better results in terms of si_SDR, SDR and WER when the input is in the time domain, where the signal representation is learned by the encoder and the output by the decoder. The results are significantly reduced in all three parameters if the STFT is used as the input and its inverse is used in the decoder. For instance, the Conv-TasNet model achieves an SDR of 14.7 when time-domain features are used; this drops to 12.8 when DFT features are used. This shows that working in the time domain may be better for this setting as compared to the frequency domain. Work in [175] also shows the same trend, where working in the time domain provides better results as compared to the frequency domain. However, for mixed speech with reverberation, the use of a time-domain signal does not yield the same improvement as compared to the frequency domain, and further investigation of the behaviour of both time and frequency features in the presence of reverberation is needed [175].

Long term dependencies modelling
To effectively perform speech separation, speech separation tools need to model both long and short sequences within the audio signal. To do this, existing tools have employed several techniques:

Use of RNN
The initial speech separation models such as [85] [106] [176] relied on feedforward DNNs to estimate clean speech from a noisy one. However, feedforward DNN models are ill-suited for speech data since they are unable to effectively model the long dependencies across time that are present in speech. Due to this, researchers progressively introduced recurrent neural networks (RNN), which have a feedback structure such that the representation at a given time step t is a function of the data at time t and the hidden state and memory at time t − 1. One such RNN that has been exploited in speech separation is the long short-term memory (LSTM) [177]. LSTM has memory blocks that are composed of a memory cell to remember the temporary state and several gates to control the information and gradient flow. LSTM structures can be used to model sequential prediction networks which can exploit long-term contextual information [177]. Works in [103] [120] [178] exploit LSTM to perform speech separation, while [115] uses bidirectional long short-term memory (BLSTM) networks to make use of contextual information from both sides of the sequence. Due to their inherently sequential nature, RNN models are unable to support parallelization of computation. This limits their use when working with large datasets with long sequences due to slow training [11]. Moreover, in speech separation, a typical frame (input features) is usually 25 ms, which corresponds to 400 samples at a 16 kHz sampling rate; for LSTM to work directly on the raw waveform, it would require unrolling the LSTM for an unrealistically large number of time steps to cover an audio signal of modest length [179].
Other models that use different versions of RNN include [180]. Models such as [181] use the gated recurrent unit (GRU) [182] to perform speech denoising.

Use of temporal convolution network
Conventional convolutional neural networks (CNN) have been used to design speech separation models [94] [183]. However, CNNs are limited in their ability to model long-range dependencies due to their limited receptive fields [184]. They are therefore mainly tailored to learn local features.
They exploit a local window which maintains translation equivariance to learn a shared position-based kernel [185]. For a CNN to capture long-range dependencies (i.e., to enlarge the receptive field), there is a need to stack many layers. This increases computation cost due to the large number of parameters. These shortcomings of CNN and RNN have motivated the use of the dilated temporal convolution network (TCN) in speech separation to encode long-range dependencies using hierarchical convolutional layers [186] [187] [188] [17] [189]. TCN has two key distinguishing characteristics: 1) the convolution in the model must be causal, i.e., a given activation of a certain layer l at time t is only influenced by activations of the previous layer l − 1 from time steps no later than t; and 2) the model takes a sequence of any length and maps it into an output sequence of the same length. To achieve the second characteristic, TCN models are implemented using a 1-dimensional convolutional network such that each hidden layer is the same length as the input layer. To ensure the same length, zero padding of length (filter size − 1) is added to keep subsequent layers the same length as previous ones [190] (see figure 14). The first property is achieved through the use of causal convolutions, i.e.
where an output at time t is convolved only with elements from time t and earlier in the previous layer. To increase the receptive field, models implement dilated TCN. Dilated convolution is where the filter is applied to a region larger than its own size [191]. This is achieved by skipping inputs with certain specified steps (see figure 10). More formally, for a 1-D sequence such as a speech signal, with input x ∈ R^n and kernel f : {0, · · · , k − 1} → R, the dilated convolution operation F on an element s of the sequence is defined according to equation 20 [190]:

F(s) = Σ_{i=0}^{k−1} f(i) · x_{s − d·i}
where x is the 1-D input signal, k is the kernel size and d is the dilation factor. The effect of this is to expand the receptive field without loss of resolution and without drastically increasing the number of parameters. Stacked dilated convolutions expand the receptive field with only a few layers. The expanded receptive field allows the network to capture the temporal dependence of various resolutions within the input sequences [152]. In effect, TCN introduces the idea of a time hierarchy where the upper layers of the network model longer input sequences on larger timescales while local information is modelled by the lower layers and is mainly maintained in the network through residual and skip connections [152]. TCN also uses causal convolution where a given output at layer l in time step t is computed based only on time steps up to t in the previous layer. The dilated TCN is exploited by [10] to model the sequences that exist within the input speech signal. They implement TCN such that each layer is composed of 1-D dilated convolution blocks. The layers have 1-D CNN blocks with increasing dilation factors. This is to uncover the long-range dependencies that exist in the audio input.
The dilation factors increase exponentially over the layers in order to cover a large temporal context window to exploit the long-range dependencies that exist within a speech signal.
Here, y(m, n) is the output of a given layer of dilated convolution, x(m, n) is the input and w(i, j) is the filter with length and width M and N respectively. The parameter r is the dilation rate. Note that if r = 1, the dilated convolution becomes the normal convolution.
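The 1-D dilated causal convolution described above can be sketched directly; with d = 1 it reduces to an ordinary causal convolution, while d = 2 skips every other sample, enlarging the receptive field at no extra parameter cost. The kernel below is illustrative.

```python
import numpy as np

def causal_dilated_conv1d(x, f, d):
    """y[s] = sum_i f[i] * x[s - d*i], zero-padded on the left (causal)."""
    k = len(f)
    pad = (k - 1) * d                  # keeps the output the same length as x
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(f[i] * xp[pad + s - d * i] for i in range(k))
                     for s in range(len(x))])

x = np.arange(8, dtype=float)
f = np.array([1.0, 1.0])               # kernel of size 2 (illustrative)
y1 = causal_dilated_conv1d(x, f, d=1)  # ordinary causal conv: x[s] + x[s-1]
y2 = causal_dilated_conv1d(x, f, d=2)  # dilated: x[s] + x[s-2]
```

Stacking such layers with dilation 1, 2, 4, … grows the receptive field exponentially with depth, which is exactly the property the TCN relies on.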

Use of transformers
A transformer [192] is an attention-based deep learning technique that has been successful in modelling sequences and allows uncovering of dependencies that exist within an input without regard to the distance between any two values of the input.Transformers consist only of feed-forward layers which allows them to exploit the parallel processing capabilities of GPUs leading to fast training [192].
In speech separation, [11] introduces a speech separation system that fully relies on transformers to model the dependencies that exist in the mixed audio signal. This is used to extract a mask for each of the speakers in the audio mixture. The transformer is used to uncover both short-term dependencies (within a frame) and long-term dependencies (between frames). Work in [184] also exploits transformers in the encoder to model the dependencies that exist in the mixed audio, while [67] uses transformers to perform speech dereverberation. Despite their ability to model long-range dependencies and their suitability for parallelization, the attention mechanism of transformers has O(N²) complexity, which creates a major memory bottleneck [193]. For a sequence of length N, the transformer needs to compare N² element pairs, which results in a computational bottleneck, especially for long signals such as speech. Transformers also use many parameters, aggravating the memory problem further. Several versions of transformers such as Longformer [194], Linformer [195] and Reformer [196] have been proposed with the goal of reducing the computational complexity of transformers. Work in [197] investigates the performance of these three transformer variants in speech separation and concludes that they are suitable for speech separation applications since they achieve a highly favourable trade-off between performance and computational requirements. Work in [198] proposes a parameter-sharing technique to reduce the computational complexity of the transformer, while [193] reduces complexity by avoiding frame overlap. In [199], a teacher-student speech separation model based on the transformer is proposed; the student model, which is much smaller than the teacher model, is used to reduce computational complexity. Other transformer-based speech enhancement tools include [173] [200]. Another key limitation of the transformer is that while it can model long-range global context, it does not extract fine-grained local feature patterns well. Based on this, transformer-based speech separation tools apply attention within a frame (chunk) to capture local features and between frames (chunks) to capture global features [11] [201].
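The core operation behind both the within-chunk and between-chunk attention is the same scaled dot-product attention; the O(N²) cost is visible in the (N, N) score matrix below. Dimensions are toy values.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N) pairwise scores: O(N^2)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # each row is a distribution
    return w @ V, w

rng = np.random.default_rng(0)
n, d = 10, 16                                      # sequence length, model dim (toy)
X = rng.normal(size=(n, d))                        # e.g. encoded speech frames
out, w = attention(X, X, X)                        # self-attention over the sequence
```

Chunked (dual-path style) variants simply apply this operation to short chunks (local features) and across chunk-aligned positions (global features), trading the single N² matrix for several much smaller ones.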

Model size reduction techniques
To achieve high performance, i.e., generate speech with high intelligibility, DNN models for speech enhancement are becoming large, exploiting millions of parameters [202]. A high number of parameters increases the memory requirements, computational complexity and latency. To reduce these parameters significantly without compromising quality, and to make speech enhancement tools work on resource-constrained platforms, several techniques are being exploited. The techniques include: Use of dilated convolution: To increase the receptive field of a 1-D CNN, and subsequently increase the temporal window and model long-range dependencies within speech, speech separation models such as [132] and [186] implement dilated CNNs. Dilated convolution, initially introduced by [167], involves a convolution where a kernel is applied to an area that is larger than itself. This is achieved by skipping input values by a defined step. It is like implementing a sparse kernel (i.e., dilating the kernel with zeros). When dilated convolution is applied in a stacked network, it enables the network to increase its receptive field with few layers, hence minimizing parameters and reducing computation [188] (see figure 14). This ensures that the models can capture long-range dependencies while keeping the number of parameters at a minimum. The dilation factors are made to increase exponentially per layer (see figure 10).
Parameter quantization: To reduce the computation and inference complexity of DNN models and to scale down the number of parameters, models such as [203] [204] [205] [206] [207] use parameter quantization. In quantization, the objective is to reduce the precision of model parameters and activation values to a low precision with minimal effect on the generalization capability of the DNN model. To achieve this, a quantization operator Q is defined that maps a floating-point value to a quantized one [208].
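One possible form of the operator Q is a uniform affine quantizer to int8, sketched below. This is illustrative, not the scheme of any cited work; real int8 pipelines typically also round the zero-point to an integer.

```python
import numpy as np

def quantize(w, bits=8):
    """Uniform affine quantization: float -> int grid; returns codes and params."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)    # step size of the grid
    zero = qmin - w.min() / scale                  # (non-integer here, for clarity)
    q = np.clip(np.round(w / scale + zero), qmin, qmax).astype(np.int8)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Map int codes back to (approximate) float weights."""
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)       # toy weight tensor
q, scale, zero = quantize(w)
w_hat = dequantize(q, scale, zero)
# the round-trip error is bounded by half a quantization step
```

Storing `q` (8 bits) instead of `w` (32 bits) cuts weight memory by roughly 4x, at the cost of an error of at most scale/2 per weight.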
Use of depthwise separable convolution: This type of convolution decouples the convolution process into two steps, i.e., depthwise convolution, where a single filter is applied to each input channel, and pointwise convolution, which is applied to the output of the depthwise convolution to achieve a linear combination of the depthwise layer's outputs. Depthwise separable convolution has been shown to reduce the number of parameters as compared to the conventional one [209] [210]. Speech enhancement tools that exploit depthwise separable convolution include [10] [26] [67].
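The parameter saving is easy to quantify. For a 1-D convolution with kernel size k mapping C_in to C_out channels, the counts below (layer sizes are hypothetical) show the ratio collapsing to roughly 1/C_out + 1/k:

```python
# parameter counts for a 1-D conv layer: kernel k, C_in -> C_out channels
k, c_in, c_out = 3, 256, 512               # hypothetical layer sizes
standard = k * c_in * c_out                # full convolution
separable = k * c_in + c_in * c_out        # depthwise filters + 1x1 pointwise
ratio = separable / standard               # ~ 1/c_out + 1/k
```

For these sizes the separable variant needs about a third of the parameters, and the saving grows with the kernel size.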
Knowledge distillation: Knowledge distillation involves training a large teacher model which can easily extract the structure of data then the knowledge learned by the teacher is distilled down to a smaller model called the student.Here, the student is trained under the supervision of the teacher [211] [212] [213].The student model must mimic the teacher and by doing so achieve superior or similar performance but at reduced computation cost due to reduced parameters.Knowledge distillation technique has been exploited to reduce latency in speech enhancement tool [199] [214].
Parameter pruning: In order to reduce the number of parameters and hence speed up computation, some speech enhancement tools use parameter pruning [206] [215]. Pruning involves converting a dense DNN model into a sparse one by significantly scaling down the number of parameters without compromising the quality of the model's output. In [216], they train a speech enhancement DNN model to obtain an initial parameter set Θ; they then prune the parameters by dropping the weights whose absolute values are below a set pruning threshold. The sparse network is then re-trained to obtain the final parameters. Work in [203] estimates the sparsity S(k) of a given channel F_jk. If the sparsity S(k) > θ, where θ is a predefined threshold, the weights within F_jk are set to zero and the model is retrained. After several iterations, the channel F_jk is dropped.
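The magnitude-pruning step of [216] can be sketched in a few lines; the weight matrix and threshold below are illustrative, not taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(64, 64))        # trained parameter set (toy stand-in)

threshold = 0.5                          # hypothetical pruning threshold
mask = np.abs(theta) >= threshold        # keep only large-magnitude weights
theta_pruned = theta * mask              # small weights dropped to exactly zero

sparsity = 1.0 - mask.mean()             # fraction of weights set to zero
# the sparse network would then be re-trained with this mask held fixed
```

For Gaussian-like weights and this threshold, roughly a third of the weights are zeroed; in practice the threshold is tuned so that the re-trained sparse model matches the dense model's quality.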
Weight sharing: This involves identifying clusters of weights that share a common value. The clusters are normally identified using the K-means algorithm. Instead of storing each weight value, only the indexes of the shared values are stored. Through this, the memory requirements of the model are reduced [217]. Speech enhancement tools that use weight sharing include [204], [218].
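A minimal sketch of K-means-based weight sharing: cluster a flattened weight vector, then store only the small codebook plus one index per weight. The codebook size and iteration count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                      # flattened layer weights (toy)

k = 16                                         # codebook size (hypothetical)
centers = np.linspace(w.min(), w.max(), k)     # init cluster centres over the range
for _ in range(10):                            # plain 1-D K-means
    idx = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
    for j in range(k):
        if np.any(idx == j):                   # guard against empty clusters
            centers[j] = w[idx == j].mean()

w_shared = centers[idx]                        # every weight snapped to its centre
# storage: k floats (codebook) + one small integer index per weight
```

With 16 clusters each index needs only 4 bits instead of a 32-bit float, while the reconstruction error stays far below the weights' variance.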

Objective functions for speech enhancement and separation
Most DNN monaural speech enhancement and separation models, especially those working on features in the frequency domain, exploit the mean square error (MSE) as the training objective [21] [38] [136] [151]. The DNN models that have the mask as the target use the training objective to minimise the MSE between the estimated mask and the ideal mask target. For models that predict the estimated features (such as the T-F spectrogram) of the clean source speech, the MSE is used to minimise the difference between the target features and the features estimated by the model. Despite the dominance of MSE as an objective function in speech enhancement tools, it has been criticised since it is not closely related to human auditory perception [93]. Its major weakness is that it treats estimation elements independently and equally. For instance, it treats each time-frequency unit separately rather than the spectrum as a whole [106]. This leads to muffled sound quality and compromises intelligibility [106]. MSE also treats every estimation element with equal importance, which is not the case [219]. It also does not discriminate between positive and negative differences between the clean and estimated spectra.
A positive difference between the clean and estimated spectra represents attenuation distortion, while a negative spectral difference represents amplification distortion. MSE treats the effects of these two distortions on speech intelligibility as equivalent, which is problematic [220] [221]. Moreover, MSE is usually defined on the linear frequency scale, while human auditory perception operates on the Mel-frequency scale. To avoid the problem of treating every estimation element with equal importance, [222] and [223] propose a weighted MSE. Due to the shortcomings of MSE, objective functions that are more closely related to human auditory perception have been introduced to train DNNs [219] [224] [225] [226] [120]. Some of these perceptually motivated training objectives are also used as metrics for perceptual evaluation. They include: 1. Short-time objective intelligibility (STOI) [227].
Scale-invariant signal-to-distortion ratio: Work in [228] proposes an intelligibility measure such that, given the target signal s and the model-estimated signal ŝ, either s or ŝ is re-scaled so that the residual, (s − βŝ) after scaling ŝ or (αs − ŝ) after scaling the target s, is orthogonal to the target: (s − βŝ) · s = 0 or (ŝ − αs) · s = 0. Based on this, α can be computed as α = (ŝ · s)/(s · s). Short-time objective intelligibility: This objective has been used in [225] [226] [219] [93]. STOI [227] [231] is a speech intelligibility measure computed by executing the following steps: 4. Perform a one-third octave band analysis on both the clean and enhanced speech by grouping DFT bins, i.e., the complex-valued STFT coefficients X(n, k) are combined into J one-third octave bands by summing the energy of the T-F units within each band.
Here, k_1 and k_2 represent the one-third octave band edges. The same octave analysis is performed on the enhanced speech, whose one-third octave representation is defined in a similar manner. 5. Define short temporal envelopes of both the enhanced and clean speech, each collecting the N most recent band energies, with N = 30. STOI exploits correlation coefficients to compare the temporal envelopes of clean and enhanced speech over a short time region. 6. Normalise the short temporal envelopes of the enhanced speech. Let y_{j,m}(n) denote the n-th element of an envelope of the enhanced speech; it is normalised so that its level matches that of the corresponding clean envelope. The intuition behind normalising the enhanced speech is to reduce global level differences between clean and enhanced speech, since such global level differences should not have a strong effect on speech intelligibility. Here, µ(·) refers to the sample mean of the corresponding vector. 9. Compute the average intermediate intelligibility over all frames, where M represents the total number of frames and J the number of one-third octave bands.
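The scale-invariant rescaling of [228] described above can be sketched directly: computing α = (ŝ · s)/(s · s) makes the measure insensitive to the overall gain of the estimate. The function name and test signals below are hypothetical.

```python
import numpy as np

def si_sdr(s, s_hat, eps=1e-8):
    # Rescale the reference by alpha = <s_hat, s> / <s, s>, so the residual
    # (s_hat - alpha * s) is orthogonal to the target, then take the
    # energy ratio in dB.
    alpha = np.dot(s_hat, s) / (np.dot(s, s) + eps)
    target = alpha * s
    residual = s_hat - target
    return 10 * np.log10(np.sum(target ** 2) / (np.sum(residual ** 2) + eps))

t = np.linspace(0, 1, 8000)
s = np.sin(2 * np.pi * 440 * t)   # stand-in clean target
scaled = 0.5 * s                  # a merely rescaled copy: near-perfect SI-SDR
noisy = s + 0.1 * np.random.default_rng(2).normal(size=s.shape)
```

Note that `si_sdr(s, 0.5 * s)` is very high despite the halved amplitude, which is exactly the scale invariance the orthogonality condition buys.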
Equation 36 represents the mean square error between the single-sided amplitude spectra of the clean speech x and the DNN-estimated speech. Equation 36 is not sensitive to the phase spectra of the two signals. Perceptual metric for speech quality evaluation (PMSQE): This is an objective function based on an adaptation of the perceptual evaluation of speech quality (PESQ) algorithm [232]. It starts from the MSE loss in the log-power spectrum with mean and variance normalisation, i.e.,
(55) Here, |X[n, k]|² and |X̂[n, k]|² represent the power spectra of the clean and enhanced speech respectively, µ_k is the mean log-power spectrum and δ_k is its standard deviation. The indices n and k represent the frame and frequency, while K is the number of frequency bins. From equation 55, the MSE depends entirely on the power spectra across frequency bands and hence does not factor in perceptual effects such as loudness differences, masking and threshold effects [233]. To factor these in, PMSQE modifies the MSE loss by incorporating two disturbance terms (symmetrical disturbance and asymmetrical disturbance) based on the PESQ algorithm, both computed on a frame-by-frame basis [233].
Here, D_s^t and D_a^t represent the symmetrical and asymmetrical disturbances respectively. The parameters α and β are weighting factors determined experimentally. Work in [233] describes how to arrive at the values of D_s^t and D_a^t. Since PESQ is non-differentiable, the PMSQE objective function provides a way of approximating it. PMSQE is designed to be inversely proportional to PESQ, such that a low PMSQE value corresponds to a high PESQ value and vice versa. The key question here is: which objective function is superior? Work in [234] tries to answer this question by evaluating six objective functions. Their conclusion is that the evaluation metric should be a major factor in deciding which objective function to use in a speech enhancement model. If a given model targets a specific evaluation metric, then selecting an objective function related to that metric is advantageous.
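The contrast drawn in this section between the plain MSE, which weighs every time-frequency unit equally, and weighted variants in the spirit of [222] [223] can be sketched as follows. The weight values here are purely illustrative, not those of the cited works.

```python
import numpy as np

def mse(est, target):
    # Plain MSE: every time-frequency unit contributes equally.
    return np.mean((est - target) ** 2)

def weighted_mse(est, target, weights):
    # Weighted MSE: hypothetically "perceptually important" T-F units
    # contribute more to the loss.
    return np.sum(weights * (est - target) ** 2) / np.sum(weights)

target = np.array([[1.0, 0.2], [0.7, 0.0]])   # toy ideal mask (time x freq)
est = np.array([[0.9, 0.3], [0.5, 0.0]])      # toy estimated mask
w = np.array([[2.0, 1.0], [2.0, 1.0]])        # hypothetical per-unit weights
plain = mse(est, target)
weighted = weighted_mse(est, target, w)
```

Because the weights up-weight the units with the larger errors in this toy example, the weighted loss exceeds the plain one, steering training towards those units.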

Unsupervised techniques for speech enhancement
Although supervised techniques for speech enhancement and separation have achieved great success in improving speech intelligibility, the inherent problems associated with supervised learning still prohibit their application in all scenarios. First, collecting parallel clean and noisy (mixed) data remains costly and time consuming. This limits the amount of data that can be used to train these models. Consequently, the models are not exposed to enough recording variation during training, which hurts their generalizability to noise types and acoustic conditions not seen during training [235] [236]. Collecting clean audio is always difficult and requires a well-controlled studio, exacerbating the already high cost of data annotation [236]. Unsupervised learning offers an alternative. The existing unsupervised techniques for speech enhancement and separation can roughly be categorised into three: MixIT-based techniques, generative modelling techniques and teacher-student techniques. A few novel techniques have also been proposed that fall outside these three dominant categories. Work in [237] proposes mixture invariant training (MixIT) to perform unsupervised speech separation. Given a set X of mixed speech, i.e., X = {x_1, x_2, ..., x_n}, where each mixture x_i is composed of up to N sources, mixtures are drawn at random from X without replacement and a mixture of mixtures (MoM) is created by adding the drawn mixtures; for example, if two mixtures x_1 and x_2 are drawn from X, the MoM x is created as x = x_1 + x_2. The MoM x is the input to a DNN model that is trained to estimate the sources ŝ composing x_1 and x_2, by minimizing the loss function in equation 57.
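The MixIT assignment search can be sketched as follows. This is a toy illustration with an MSE stand-in for the loss (the original uses an SNR-based loss per equation 57); the function name and signals are hypothetical.

```python
import numpy as np
from itertools import product

def mixit_loss(mixtures, est_sources):
    # Enumerate all binary matrices A whose columns sum to 1 (each estimated
    # source assigned to exactly one mixture) and keep the assignment that
    # minimises the remix error between mixtures and A @ est_sources.
    n_mix = mixtures.shape[0]
    n_src = est_sources.shape[0]
    best = np.inf
    for assign in product(range(n_mix), repeat=n_src):
        A = np.zeros((n_mix, n_src))
        A[list(assign), range(n_src)] = 1.0
        remix = A @ est_sources
        best = min(best, np.mean((mixtures - remix) ** 2))
    return best

rng = np.random.default_rng(3)
s = rng.normal(size=(4, 100))            # four underlying sources
x1, x2 = s[0] + s[1], s[2] + s[3]        # two observed mixtures forming the MoM
loss = mixit_loss(np.stack([x1, x2]), s) # perfect estimates remix exactly
```

A perfect separator drives the loss to zero regardless of the order in which it emits the sources, since the minimum over assignments absorbs the permutation.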
For a case where the MoM is composed of only two mixtures, A ∈ B^{2×M} is the set of binary matrices in which each column sums to 1. The loss minimises the distance between the mixtures x_i and the remixed separated sources Aŝ. MixIT has been criticised for over-separation, where it outputs more estimated sources than the actual number of underlying sources in the mixtures x_i [238]. Further, MixIT does not work well for speech enhancement (i.e., denoising) [239]. A MixIT teacher-student unsupervised model has been proposed in [238] to tackle the over-separation problem: it trains a student model such that its output matches the number of sources in the mixed speech x. Another MixIT-based technique for solving the over-separation problem is discussed in MixCycle [240]. Work in [241] proposes to make MixIT more tailored for denoising by exploiting an ASR pre-trained model to modify MixIT's loss function. Work in [239] trains the enhancement model with the objective in equation 58. In equation 58, Q is a non-intrusive metric (i.e., it does not require reference clean speech) that is used to score the enhanced speech from the generator; the scores obtained by Q are used to optimize the model. In MetricGAN-U, DNSMOS [244] is used as Q. Another GAN-based technique for unsupervised learning is [245], which exploits CycleGAN [246] multi-objective learning to perform parallel-data-free speech enhancement. Tools in [235] and [247] propose unsupervised speech denoising techniques based on variations of VAE. Work in [236] proposes a speech denoising technique that uses only noisy speech. It exploits the idea first proposed in [248], which demonstrated that signals can be recovered from corrupted observations without ever observing clean signals. Predicating their work on these findings, given a noisy speech signal x and noise n, [236] creates an even noisier speech y = x + n. They then train a DNN model to predict an enhanced speech ŝ by using the noisier signal y as the input and the noisy speech x as the target. Consequently, the DNN is trained by minimizing the loss in equation 59.
Here, D is the objective function and M is the sample size. This technique works on the basis that a DNN cannot predict random noise, hence the noise component in the training targets is mapped to its expected value. Therefore, by assuming the noise is a zero-mean random variable, the objective function eliminates the noise [236]. Work in [16] proposes an unsupervised technique to perform speech separation based on gender. It exploits i-vectors to model the large discrepancies in vocal tract, fundamental frequency contour, timing, rhythm, dynamic range, etc., between speakers of different genders. In this case the DNN model can be viewed as a gender separator.
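The zero-mean-noise argument above can be illustrated numerically: the MSE-optimal prediction for a set of noisy targets is their expectation, which for zero-mean noise is the clean signal. The signal and noise level below are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * np.arange(200) / 50)   # stand-in clean signal
# Many independent noisy observations of the same clean signal
noisy = clean + rng.normal(scale=0.5, size=(1000, 200))
# Minimising MSE against the noisy targets yields their sample mean,
# which converges to the clean signal as observations accumulate.
mse_optimal = noisy.mean(axis=0)
err = np.max(np.abs(mse_optimal - clean))
```

Even though no clean target is ever used as a label, the averaged estimate sits far closer to the clean signal than any single noisy observation does.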

Domain adaptive techniques for speech enhancement and Separation
Training data used to train speech enhancement and separation tools mostly have acoustic features that differ significantly from those of the speech signals in the environments where the tools are deployed. This mismatch between the training data and target data degrades the tools' performance in their deployment environment [249]. The target environment's acoustic features may vary from the training data in noise type, speaker and signal-to-noise ratio [247]. Work in [250] utilizes domain adversarial training (DAT) [254] to train an encoder to extract noise-invariant features, using labelled source data and unlabelled target data. Through the feedback from the discriminator, which gives a probability distribution over multiple noise types, the encoder is trained to produce noise-invariant features, hence reducing the mismatch problem. Work in [251] also exploits unsupervised DAT for speaker mismatch resolution. Work in [247] exploits importance weighting (IW), using the networks' classifiers to separate source-domain samples from outliers and hence reduce the shift between the source and target domains.
Use of pre-trained models in speech separation and enhancement
Pre-trained models have become popular, especially in natural language processing (NLP) and computer vision. In NLP, for example, a large corpus of text can be used to learn universal language representations that benefit downstream NLP tasks. Due to this success, pre-trained models based on unsupervised learning have been introduced for audio data [255] [256] [257] [258] [259] [260]. Such pre-trained models are beneficial in several ways: 1. Pre-trained models are trained on large speech datasets, hence they can learn universal speech representations which can benefit speech separation by boosting the quality of the enhanced speech generated by these models.
2. Pre-trained models provide better initialization, which can result in better generalization and faster convergence during training of speech enhancement models.
3. Pre-trained speech models can act as regularizers, helping speech enhancement models avoid overfitting.
Work in [261] seeks to establish whether pre-trained speech models help generate more robust features for downstream speech denoising and separation tasks compared with features established without pre-trained models. To do this, they use 13 pre-trained speech models to generate features of a noisy speech, which are then passed through a three-layer BLSTM network that generates the speech denoising or separation mask. They compare the performance of these features with those of baseline STFT and mel filterbank (FBANK) features. Their experiments establish that the 13 pre-trained models do not significantly improve the feature representations compared with the baselines; the quality of enhanced and separated speech generated from pre-trained model features is only slightly better, or in some cases worse, than that generated from the baseline features. They attribute this to domain mismatch and information loss: since most of the pre-trained models were trained on clean speech, they do not port well to the noisy speech domain, and because pre-trained models are usually trained to capture global features and long-term dependencies, some local features of the noisy or mixed speech signal may be lost during feature extraction. Using the HuBERT Large model [262], they demonstrate that the last layer of the model does not produce the best feature representation for speech enhancement and separation; in fact, for speech separation, higher-layer features are of lower quality than lower-layer features. They show that weighted-sum representations over the different layers of a pre-trained model, with lower layers given more weight, generate better enhancement and separation results than representations from isolated layers. They hypothesise that this could be due to the loss, in deeper layers, of local signal information necessary for speech reconstruction. To address the problem of information loss in pre-trained models, [263] proposes two solutions: first, they utilize cross-domain features as model inputs to compensate for the lost information; second, they fine-tune the pre-trained model together with a speech enhancement model so that the extracted features are more tailored towards speech enhancement. Research in [264] seeks to synthesise clean speech directly from a noisy one using a pre-trained model and the HiFiGAN [265] speech synthesis model. It exploits the pre-trained model to extract features of the noisy speech, which are then used as input to HiFiGAN, which generates the estimated clean speech from these features. Based on the results reported in [261], which demonstrated that the final layer of a pre-trained model does not give an optimal representation for speech enhancement, they exploit a weighted average of the representations of all the layers of the pre-trained model to represent the noisy speech. The novelty of this work is that they do not use a model dedicated to speech denoising; rather, they show that, given features of a noisy speech, a speech synthesis model can perform denoising. In [266], a pre-trained model is exploited to design the loss function. Given a pre-trained model Φ with m layers, the tool uses a weighted L1 loss to compute the difference between the feature activations of clean and noisy speech generated by the different layers of the pre-trained model, according to equation 60.
Here, s and x are the clean and noisy speech respectively, g_θ is the denoising DNN model and λ_m are the weights governing the contribution of each pre-trained-model layer's features to the loss function. Work in [267] proposes a two-stage speech enhancement framework: they first pre-train a model using unpaired noisy and clean data and then utilize the pre-trained model to perform speech enhancement. Unlike previous works that use general-purpose audio pre-trained models, the pre-trained model in [267] is trained on a speech enhancement dataset. They report state-of-the-art results in speech enhancement and an ability to generalize to unseen noise.
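The weighted L1 feature loss of equation 60 can be sketched as follows. The "layers" here are hypothetical fixed transforms standing in for the activations of a real pre-trained model Φ, and the weights λ_m are arbitrary.

```python
import numpy as np

def deep_feature_loss(phi_layers, s, x_hat, lambdas):
    # Weighted L1 distance between the per-layer activations of a (stand-in)
    # pre-trained model for the clean speech s and the denoised output x_hat.
    total = 0.0
    for phi_m, lam in zip(phi_layers, lambdas):
        total += lam * np.mean(np.abs(phi_m(s) - phi_m(x_hat)))
    return total

# Hypothetical stand-in layers in place of real pre-trained activations
layers = [lambda z: z, lambda z: np.tanh(z), lambda z: z ** 2]
lam = [1.0, 0.5, 0.25]
s = np.linspace(-1.0, 1.0, 100)
loss_same = deep_feature_loss(layers, s, s, lam)        # identical signals
loss_diff = deep_feature_loss(layers, s, s + 0.1, lam)  # perturbed output
```

The loss vanishes only when the denoised output matches the clean speech in every layer's feature space, which is what pushes the denoiser towards perceptually relevant structure rather than raw waveform agreement.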
Future directions for research in speech enhancement
Unsupervised techniques for speech separation: The majority of speech enhancement tools use supervised learning. Those that use unsupervised learning, discussed in section 8, focus almost entirely on speech denoising rather than speaker separation and dereverberation. There is therefore a gap in extending unsupervised DNN techniques to multi-speaker speech separation and dereverberation. Dimension mismatch problem: Most speech separation tools assume a fixed number C of speakers and therefore cannot deal with an inference mixture of K sources where C ≠ K. Current tools deal with the dimension mismatch problem either by outputting silences if C > K or by performing speech separation through iteration. However, both techniques have been found to be inefficient (see the discussion in section 2.1). There is therefore a need to explore DNN techniques for speech separation that adapt dynamically to the number of speakers present in the inference mixture. Focus on data: Most model compression techniques for speech enhancement speed up models by reducing or optimizing model parameters; they do not consider the data-side impact on performance. For instance, what is the ideal sequence length and chunk overlap, when working in the time domain, that can speed up the speech enhancement process without compromising the quality of enhancement? More focus needs to turn towards exploring data modifications that can speed up the speech enhancement process. Dataset modification: In dereverberation, tools are beginning to explore the use of speech with early reverberation [75] as the target, as opposed to using an anechoic target. Experiments in [75] demonstrate that allowing early reverberation in the target speech improves the quality of the enhanced speech. There is a need to develop a standardized dataset where the target contains early reverberation, to allow for standardized evaluation of tools on this dataset.
Pre-trained models: The pre-trained models that have been utilized for speech enhancement or separation were trained on clean datasets, hence failing the portability test when used to generate features of a noisy speech signal [261]. There is a need to develop pre-trained models tailored for speech separation and enhancement.

Conclusion
This review discusses how DNN techniques are being exploited by monaural speech enhancement tools. The objective was to uncover the key trends and dominant techniques used by DNN tools at each stage of the speech enhancement process. The review therefore discusses the types of features exploited by these tools, the modelling of speech contextual information, how the models are trained (both supervised and unsupervised), key objective functions, how pre-trained speech models are utilized, and the dominant datasets for evaluating the performance of speech enhancement tools. In each section we highlight the standout challenges and how tools are dealing with them. Our aim is to provide an entry point into the speech enhancement domain through a thorough overview of its concepts and trends, and we hope the review gives a snapshot of current research on DNN application to speech enhancement.

Figure 1: Overall structure of the topics covered by this review.

Figure 3: Permutation invariant training implementation of label matching for a two-talker speech separation model

Figure 5: Triangular filters used in the computation of the Mel-cepstrum using equation 18. Here, θ represents the angle between the two complex variables |X[k]| and |N[k]|. Most models that exploit log-power spectra features ignore the last term (assume its value to be zero) and employ equation 24.

Figure 6: Demonstrating feature extraction. Here, Y_t represents the noisy signal in the time domain, Y_f the transformed signal in the frequency domain, and Y_l the log-power features of the input signal.

Figure 7: Supervised training of speech enhancement model using spectrogram as input and spectrogram as output.

Figure 9: DNN model for mask estimation from a noisy spectrogram.

Figure 10: Showing how DNN models exploit the phase of the noisy signal during reconstruction of the estimated signal

2. Separation module: This module is fed with the output of the encoder. It implements techniques to identify the different sources present in the input signal. 3. Decoder: It accepts input from the separation module and sometimes from the encoder (for residual implementations). It is mostly implemented as an inverse of the encoder, in order to reconstruct the separated signals [132] [10] [11].

1. Given discrete-time signals of the clean speech x(n) and enhanced speech y(n), perform a DFT on both, i.e., X(n, k) = DFT(x(n)) and Y(n, k) = DFT(y(n)). Here, k refers to the index of the discrete frequency. 2. Remove silences in both the clean and enhanced signals. Silences are removed by first identifying the frame with maximum energy (max_energy) in the clean signal; all frames with energy 40 dB below max_energy are dropped. 3. Reconstruct both the clean and enhanced speech signals.

Short-time spectral amplitude mean square error: Let X[n, k], with 1 ≤ n ≤ N and 1 ≤ k ≤ K, be an N-point DFT of x, where K is the number of frames. Let A[n, k] = |X[n, k]|, with n = 1, ..., N/2 + 1 and k = 1, ..., K, denote the single-sided amplitude spectrum of X[n, k], and let Â[n, k] be an estimate of A[n, k]. The short-time spectral amplitude mean square error (STSA-MSE) is then given by equation 36 [119]. For deep learning models working in the time-frequency domain, a model g_θ is designed such that, given a noisy or mixed speech spectrogram Y[t, n] at time frame t, it estimates the mask m_t at that time frame. The estimated mask m_t is then applied to the input spectrogram to estimate the target or denoised spectrogram, i.e., Ŝ_t = m_t ⊗ Y[t, n] (see figure 9). Here, Ŝ_t is the spectrogram estimate of the clean speech at time frame t and ⊗ denotes element-wise multiplication. To train the model g_θ, there are two key objective variants. The first minimizes an objective function D, such as the mean squared error (MSE), between the model-estimated mask and the target mask tm, i.e., D(tm_{u,t,f}, m_{u,t,f}). This approach, however, cannot effectively handle silences where |Y[t, n]| = 0 and |X[t, n]| = 0, because the target mask tm is undefined at the silent bins. The objective function in equation 26 is maximised with respect to D(·) and minimised with respect to G(·). In speech enhancement, GAN was first introduced by SEGAN (see section 3.2.3). SEGAN, which works in the time domain, uses a conditioned version of the objective function in equation 26. In a conditioned GAN, both the generator and the discriminator are given extra information, which allows the GAN to perform classification and mapping. SEGAN uses the least-squares GAN loss as opposed to the sigmoid cross-entropy loss for training. CGAN [104], just like SEGAN, uses a conditioned GAN but works in the T-F domain to generate denoised speech. Since most automatic speech recognition (ASR) tools work in the T-F domain, CGAN hypothesises that a generative model working in the T-F domain will be more robust for ASR than one working on the raw waveform; CGAN can therefore be seen as a version of SEGAN that accepts input in the T-F domain. To address the mismatch between the training objective used in CGAN and the evaluation metrics, MetricGAN [93] proposes to integrate the evaluation metric into the discriminator: instead of the discriminator giving discrete false (0) or true (1) values, it generates continuous values based on the evaluation metric. MetricGAN can therefore be trained to generate data according to the selected metric score, and through this modification it produces more robust enhanced speech. Another common group of generative models is the variational auto-encoder (VAE) technique: 1.
Train a model such that it maximises the likelihood p_θ(s | z). Here, s denotes the clean speech dataset composed of F-dimensional samples, i.e., s_t ∈ R^F, 1 ≤ t ≤ T. The variational auto-encoder assumes a D-dimensional latent variable z_t ∈ R^D. The latent variable z_t and the clean speech s_t have the following distributions: z_t ∼ N(0, I_D), s_t ∼ p(s_t | z_t). Here, N(µ, δ) denotes a Gaussian distribution with mean µ and variance δ. Essentially, a decoder p_θ(s_t | z_t), parameterized by θ and learned by a deep learning model during training, is trained to generate the clean speech s_t when given the latent variable z_t. The encoder is trained to estimate the posterior q_φ(z_t | s_t) using a DNN. The overall objective of variational auto-encoder training is to maximise equation 32.
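The VAE objective of equation 32 (the evidence lower bound) can be sketched numerically, assuming a Gaussian decoder with unit variance and a diagonal-Gaussian encoder posterior; the function name and dimensions are illustrative.

```python
import numpy as np

def elbo(mu_q, logvar_q, s, s_recon):
    # Reconstruction log-likelihood under a unit-variance Gaussian decoder
    # (up to an additive constant), minus the closed-form KL divergence
    # KL( N(mu_q, diag(exp(logvar_q))) || N(0, I) ).
    recon = -0.5 * np.sum((s - s_recon) ** 2)
    kl = 0.5 * np.sum(np.exp(logvar_q) + mu_q ** 2 - 1.0 - logvar_q)
    return recon - kl, kl

# Posterior exactly matching the prior and a perfect reconstruction
elbo_val, kl_val = elbo(np.zeros(16), np.zeros(16), np.ones(8), np.ones(8))
# Posterior mean shifted away from the prior: the KL term becomes positive
_, kl_shifted = elbo(np.ones(16), np.zeros(16), np.ones(8), np.ones(8))
```

Maximising this bound trades reconstruction fidelity against keeping the encoder posterior close to the N(0, I) prior, which is what makes the latent space usable for generation.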

Work in [76] seeks to improve MixIT for denoising by improving the loss function and noise augmentation scheme. RemixIT [242] is an unsupervised speech denoising tool that exploits a teacher-student DNN model. Given a batch of noisy speeches of size b, the teacher estimates the clean speech sources ŝ_i and noises n̂_i, where 1 ≤ i ≤ b. The teacher-estimated noises n̂_i are mixed at random to generate n_p. The mixed noise n_p together with the teacher-estimated sources are used to generate new mixtures m_i = ŝ_i + n_p. The new mixtures m_i are used as input to the student, which is optimised to generate ŝ_i and the noise n_p, i.e., ŝ_i and n_p are the targets. Through this, the teacher-student model is trained to denoise speech. In RemixIT, a pre-trained speech enhancement model is used as the teacher. Motivated by RemixIT, [243] also proposes an unsupervised speech denoising tool using a teacher-student DNN model, proposing various techniques for student training. MetricGAN-U [76] is an unsupervised GAN-based speech enhancement tool that trains a conditioned GAN discriminator without reference clean speech; MetricGAN-U employs the objective in equation 58 to train the speech enhancement model.
A potential way of tackling this problem [250] is to collect massive training data covering the different variations of the deployment environment; however, this is mostly not possible due to prohibitive cost. Due to this, some tools propose DNN-based techniques for domain adaptation. Domain adaptation seeks to exploit either labelled or unlabelled target-domain data to transfer a given tool from the training-data domain to the target-data domain; essentially, it seeks to reduce the covariate shift between the source-domain and target-domain data. The domain adaptation techniques in the literature for speech separation and enhancement tools can be categorised into two: unsupervised domain adaptation techniques, such as [250] [251], which use an unlabelled target-domain dataset to adapt a DNN model, and supervised domain adaptation techniques, such as [252] [247] [253], which exploit a limited labelled target-domain dataset to perform domain adaptation of a DNN model for speech enhancement or separation. To make speech enhancement tools portable to a new language, [252] proposes to use transfer learning. Transfer learning entails tailoring trained DNN models to apply knowledge acquired during training to a new domain where the type of task shares some commonality. The tool fine-tunes the top layers of a trained speech enhancement DNN model using labelled data of the new language, while freezing the lower layers, whose parameters were acquired during training on the original language. Work in [253] also uses transfer learning to show that a pre-trained SEGAN can achieve high performance in new languages with unseen speakers and noise after only a short training time. To make it more adaptable to different types of noise, the tool in [247] proposes to employ multiple encoders, where each encoder is trained in a supervised manner to focus on a single acoustic feature. The features are categorized into two groups: utterance-level features, such as the gender, age and accent of the speaker, the signal-to-noise ratio and the noise type, and signal-level features, such as the high- and low-frequency parts of the speech. Feature-focused encoders are trained to extract a given feature representation, such as the gender representation, from the speech. Through the feature-focused encoders, the experimental results show that the tool adapts better to unseen noise types than when using a single global encoder. To adapt the DNN speech enhancement model to unseen noise types, work in [250] utilizes domain adversarial training (DAT).