Databases
The experiments were conducted with the Wall Street Journal (WSJ) database. The setting is largely standard: the training part used the WSJ si284 training dataset, which involves 37,318 utterances or about 80 h of speech signals. The WSJ dev93 dataset (503 utterances) was used as the development set for parameter tuning and cross validation in DNN training. The WSJ eval92 dataset (333 utterances) was used to conduct evaluation.
Note that the WSJ database was recorded in a noise-free condition. In order to simulate noise-corrupted speech signals, the DEMAND noise database (http://parole.loria.fr/DEMAND/) was used to sample noise segments. This database contains 18 types of noises, from which we selected 7 types for this work: white noise and noises recorded in cafeteria, car, restaurant, train station, bus, and park environments.
Experimental settings
We used the Kaldi toolkit (http://kaldi.sourceforge.net/) to conduct the training and evaluation and largely followed the WSJ s5 recipe for Graphics Processing Unit (GPU)-based DNN training. Specifically, the training started from a monophone system with the standard 13-dimensional MFCCs plus the first- and second-order derivatives. Cepstral mean normalization (CMN) was employed to reduce the channel effect. A triphone system was then constructed based on the alignments derived from the monophone system, and a linear discriminant analysis (LDA) transform was employed to select the most discriminative dimensions from a large context (five frames to the left and right, respectively). A further refined system was then constructed by applying a maximum likelihood linear transform (MLLT) upon the LDA features, which is intended to reduce the correlation among feature dimensions so that the diagonal covariance assumption of the Gaussians is satisfied. This LDA+MLLT system involves 351 phones and 3,447 Gaussian mixtures and was used to generate state alignments.
The DNN system was then trained utilizing the alignments provided by the LDA+MLLT GMM system. The features used were 40-dimensional filter banks. A symmetric 11-frame window was applied to concatenate neighboring frames, and an LDA transform was used to reduce the feature dimension to 200. The LDA-transformed features were used as the DNN input.
The DNN architecture involves 4 hidden layers, each consisting of 1,200 units. The output layer is composed of 3,447 units, equal to the total number of Gaussian mixtures in the GMM system. Cross entropy was set as the objective function of the DNN training, and stochastic gradient descent (SGD) was employed to perform the optimization, with the mini-batch size set to 256 frames. The learning rate started from a relatively large value (0.008) and was then gradually reduced by halving it whenever no improvement in frame accuracy on the development set was obtained. The training stopped when the frame accuracy improvement on the cross-validation data was marginal (less than 0.001). Neither momentum nor regularization was used, and no pre-training was employed, since we did not observe a clear advantage from these techniques.
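The learning-rate policy just described can be sketched as follows. This is a hypothetical illustration only, not Kaldi's actual API: `train_epoch` and `dev_frame_acc` stand in for the real training and evaluation routines, and the function name is ours.

```python
def lr_halving_schedule(train_epoch, dev_frame_acc, lr=0.008, min_gain=0.001):
    """Sketch of the schedule described above: halve the learning rate
    whenever dev-set frame accuracy fails to improve, and stop once the
    improvement falls below `min_gain` (0.001 in the text).

    `train_epoch(lr)` runs one epoch at the given rate;
    `dev_frame_acc()` returns the current frame accuracy on the dev set.
    """
    best_acc = dev_frame_acc()
    while True:
        train_epoch(lr)
        acc = dev_frame_acc()
        gain = acc - best_acc
        if 0.0 < gain < min_gain:
            break              # marginal improvement: stop training
        if gain <= 0.0:
            lr *= 0.5          # no improvement: halve the learning rate
        best_acc = max(best_acc, acc)
    return lr
```

Starting from 0.008, the rate only shrinks, so the schedule is monotone; training terminates as soon as the dev-set gain becomes positive but marginal.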
In order to inject noises, the averaged energy was computed for each training/test utterance, and a noise segment was randomly selected and scaled according to the expected SNR; the speech and noise signals were then mixed by simple time-domain addition. Note that the noise injection was conducted before the utterance-based CMN. In the noisy training, the training data were corrupted by the selected noises, while the development data used for cross validation remained uncorrupted. The DNNs reported in this section were all initialized from scratch and were trained based on the same alignments provided by the LDA+MLLT GMM system. Note that the process of the model training is reproducible in spite of the randomness on noise injection and model initialization, since the random seed was hard-coded.
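The injection step above (random segment selection, energy-based scaling to the target SNR, time-domain addition) can be sketched as follows; the function and argument names are ours, not the paper's, and real pipelines would add resampling and clipping checks.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Corrupt `speech` with a randomly chosen segment of `noise`
    scaled so that the speech/noise energy ratio matches `snr_db`.

    Minimal sketch of the noise-injection procedure described above:
    average energies are computed per utterance, and the mixing is a
    simple time-domain addition (performed before utterance-based CMN).
    """
    rng = rng or np.random.default_rng()
    # Randomly select a noise segment of the same length as the utterance.
    start = rng.integers(0, len(noise) - len(speech) + 1)
    seg = noise[start:start + len(speech)].astype(np.float64)

    # Average energy (power) of speech and of the chosen noise segment.
    p_speech = np.mean(np.asarray(speech, dtype=np.float64) ** 2)
    p_noise = np.mean(seg ** 2)

    # Scale so that 10 * log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * seg
```

Because the scaling is exact for the selected segment, the realized per-utterance SNR equals the requested one by construction.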
In the test phase, the noise type and SNR are both fixed so that we can evaluate the system performance under a specific noise condition. This is different from the training phase, where both the noise type and the SNR level can be random. We choose the ‘big dict’ test case suggested in the Kaldi WSJ recipe, which is based on a large dictionary consisting of 150k English words and a corresponding 3-gram language model.
Table 1 presents the baseline results, where the DNN models were trained with clean speech data and the test data were corrupted with different types of noises at different SNRs. The results are reported as word error rates (WER) on the evaluation data. We observe that without noise, a rather high accuracy (a WER of 4.31%) can be obtained; with noise interference, the performance degrades dramatically, and more noise (a smaller SNR) results in more serious degradation. In addition, different types of noises impact the performance to different degrees: white noise is the most serious corruption, causing a tenfold WER increase at an SNR of 10 dB; in contrast, car noise is the least impactful, causing a relatively small WER increase (37% relative) even when the SNR drops below 5 dB.
Table 1: WER of the baseline system
The different behaviors in WER changes can be attributed to the different patterns of corruptions with different noises: white noise is broad-band and so it corrupts speech signals on all frequency components; in contrast, most of the color noises concentrate on a limited frequency band and so lead to limited corruptions. For example, car noise concentrates on low frequencies only, leaving most of the speech patterns uncorrupted.
Single noise injection
In the first set of experiments, we study the simplest configuration for the noisy training, which is a single noise injection at a particular SNR. This is simply attained by fixing the injected noise type and selecting a small σ_SNR so that the sampled SNRs concentrate on the particular level μ_SNR. In this section, we choose σ_SNR = 0.01.
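The SNR sampling above can be sketched as follows, assuming per-utterance SNRs are drawn from a Gaussian with mean μ_SNR and standard deviation σ_SNR; the function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def sample_snrs(mu_snr, sigma_snr, n_utts, rng=None):
    """Draw one SNR (in dB) per training utterance from
    a Gaussian N(mu_snr, sigma_snr**2).

    With a tiny sigma_snr (e.g., 0.01), the sampled SNRs are pinned
    to mu_snr (single-SNR injection); a large sigma_snr produces the
    diverse noise levels used in the later experiments.
    """
    rng = rng or np.random.default_rng()
    return rng.normal(mu_snr, sigma_snr, size=n_utts)
```

With σ_SNR = 0.01, essentially every sampled value equals μ_SNR; raising σ_SNR toward 10 or more spreads the injection over very quiet and very loud conditions.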
White noise injection
We first investigate the effect of white noise injection. Among all the noises, white noise is rather special: it is a common noise that we encounter every day, and it is broad-band and often leads to drastic performance degradation compared to other narrow-band noises, as shown in the previous section. Additionally, the noise injection theory discussed in Section 6 shows that white noise satisfies Equation 2 and hence leads to the regularized cost function of Equation 5. This means that injecting white noise would improve the generalization capability of the resulting DNN model; this is not necessarily the case for most other noises.
Figure 2 presents the WER results, where the white noise is injected during training at SNR levels varying from 5 to 30 dB, and each curve represents a particular SNR case. The first plot shows the WER results on the evaluation data that are corrupted by white noise at different SNR levels from 5 to 25 dB. For comparison, the results on the original clean evaluation data are also presented. It can be observed that injecting white noise generally improves ASR performance on noisy speech, and a matched noise injection (at the same SNR) leads to the most significant improvement. For example, injecting noise at an SNR of 5 dB is the most effective for the test speech at an SNR of 5 dB, while injecting noise at an SNR of 25 dB leads to the best performance improvement for the test speech at an SNR of 25 dB. A serious problem, however, is that the noise injection always leads to performance degradation on clean speech. For example, the injection at an SNR of 5 dB, although very effective for highly noisy speech (SNR < 10 dB), leads to a WER ten times higher than the original result on the clean evaluation data.
The second and third plots show the WER results on the evaluation data corrupted by car noise and cafeteria noise, respectively. In other words, the injected noise in training does not match the noise condition in the test. It can be seen that the white noise injection leads to some performance gains on the evaluation speech corrupted by cafeteria noise, as long as the injected noise is limited in magnitude. This demonstrates that white noise injection can improve the generalization capability of the DNN model, as predicted by the noise injection theory in Section 6. For the car noise corruption, however, the white noise injection does not show any benefit. This is perhaps attributable to the fact that the cost function (Equation 1) is not so bumpy with respect to car noise, and hence the regularization term introduced in Equation 3 is less effective. This conjecture is supported by the baseline results, which show very little performance degradation with the car noise corruption.
In both the car and cafeteria noise conditions, if the injected white noise is too strong, then the ASR performance is drastically degraded. This is because a strong white noise injection does not satisfy the small noise assumption of Equation 2, and hence, the regularized cost (Equation 3) does not hold anymore. This, on one hand, breaks the theory of noise injection so that the improved generalization capability is not guaranteed, and on the other hand, it results in biased learning towards the white noise-corrupted speech patterns that are largely different from the ones that are observed in speech signals corrupted by noises of cars and cafeterias.
In summary, white noise injection is effective in two ways: for white noise-corrupted test data, it can learn white noise-corrupted speech patterns and provide dramatic performance improvement, particularly at matched SNRs; for test data corrupted by other noises, it can deliver a more robust model if the injected noise is small in magnitude, especially for noises that cause a significant change in the DNN cost function. An aggressive white noise injection (with a large magnitude) usually degrades performance on test data corrupted by color noises.
Color noise injection
Besides white noise, in general, any noise can be used to conduct the noisy training. We choose the car noise and the cafeteria noise in this experiment to investigate the color noise injection. The results are shown in Figures 3 and 4, respectively.
For the car noise injection (Figure 3), we observe that it is not effective for the white noise-corrupted speech. However, for the test data corrupted by car noise and cafeteria noise, it indeed delivers performance gains. The results with the car noise-corrupted data show clear advantage with matched SNRs, i.e., with the training and test data corrupted by the same noise at the same SNR, the noise injection tends to deliver better performance gains. For the cafeteria noise-corrupted data, it shows that a mild noise injection (SNR=10 dB) performs the best. This indicates that there are some similarities between car noise and cafeteria noise, and learning patterns of car noise is useful to improve robustness of the DNN model against corruptions caused by cafeteria noise.
For the cafeteria noise injection (Figure 4), some improvement can be attained on data corrupted by both white noise and cafeteria noise. For the car noise-corrupted data, performance gains are found only with mild noise injections. This suggests that cafeteria noise possesses some similarities to both white noise and car noise: it involves some background noise which is generally white, and some low-frequency components that resemble car noise. Unsurprisingly, the best performance improvement is attained on data corrupted by cafeteria noise.
Multiple noise injection
In the second set of experiments, multiple noises are injected when performing noisy training. For simplicity, we fix the noise level at SNR = 15 dB, which is obtained by setting μ_SNR = 15 and σ_SNR = 0.01. The hyperparameters {α_i} in the noise-type sampling are all set to 10, which generates a distribution on noise types roughly concentrated around the uniform distribution but with a sufficiently large variation.
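The noise-type sampling just described can be sketched as follows, assuming (as the {α_i} hyperparameters suggest) a Dirichlet prior over the distribution on noise types; the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def sample_noise_types(noise_names, n_utts, alpha=10.0, rng=None):
    """Draw one noise type per training utterance.

    Sketch of the sampling described above: first draw a categorical
    distribution over the noise types from a Dirichlet prior with all
    hyperparameters alpha_i = 10 (roughly uniform, but with appreciable
    variation), then draw each utterance's noise type from it.
    """
    rng = rng or np.random.default_rng()
    probs = rng.dirichlet(alpha * np.ones(len(noise_names)))
    idx = rng.choice(len(noise_names), size=n_utts, p=probs)
    return [noise_names[i] for i in idx]
```

Involving clean speech, as in the later experiments, amounts to adding a special ‘no-noise’ entry to `noise_names`.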
The first configuration injects white noise and car noise, and the test data are corrupted by all seven noises. The results, in terms of absolute WER reduction, are presented in Figure 5a. It can be seen that with the noisy training, almost all the WER reductions (except in the clean speech case) are positive, which means that the multiple noise injection improves the system performance in almost all the noise conditions. An interesting observation is that this approach generally delivers good performance gains for the unknown noises, i.e., the noises other than white noise and car noise.
The second configuration injects white noise and cafeteria noise; again, the conditions with all the seven noises are tested. The results are presented in Figure 5b. We observe a similar pattern as in the case of white + car noise (Figure 5a): The performance on speech corrupted by any noise is significantly improved. The difference from Figure 5a is that the performance on the speech corrupted by cafeteria noise is more effectively improved, while the performance on the speech corrupted by car noise is generally decreased. This is not surprising as the cafeteria noise is now ‘known’ and the car noise becomes ‘unknown’. Interestingly, the performance on speech corrupted by the restaurant noise and that by the station noise are both improved in a more effective way than in Figure 5a. This suggests that the cafeteria noise shares some patterns with these two types of noises.
In summary, the noisy training based on multiple noise injection is effective in learning patterns of multiple noise types, and it usually leads to significant improvement in ASR performance on speech data corrupted by the noises that have been learned. This improvement, interestingly, generalizes well to unknown noises. Among the seven investigated noises, the behavior of car noise is anomalous, which suggests that car noise has unique properties and is best explicitly involved in noisy training.
Multiple noise injection with clean speech
An obvious problem in the previous experiments is that the performance on clean speech is generally degraded by noisy training. A simple approach to alleviate this problem is to involve clean speech in the training, which can be achieved by sampling a special ‘no-noise’ type together with the other noise types. The results are reported in Figure 6a, which presents the configuration with white + car noise, and in Figure 6b, which presents the configuration with white + cafeteria noise. We can see that with clean speech involved in the noisy training, the performance degradation on clean speech is largely eliminated.
Interestingly, involving clean speech in the noisy training improves performance not only on clean data but also on noise-corrupted data. For example, Figure 6b shows that involving clean speech leads to a general performance improvement on test data corrupted by car noise, which is quite different from the results shown in Figure 5b, where clean speech is not involved in the training and the performance on speech corrupted by car noise actually decreases. This improvement on noisy data is perhaps due to the ‘no-noise’ data providing information about the ‘canonical’ patterns of speech signals, which makes it easier for the noisy training to discover the invariant and discriminative patterns that are important for recognition on both clean and corrupted data.
We note that the noisy training with multiple noise injection resembles multi-condition training: both involve training speech data under multiple noise conditions. However, there is an evident difference between the two approaches: in multi-condition training, the training data are recorded under multiple noise conditions and the noise is unchanged across utterances of the same session; in noisy training, noisy data are synthesized by noise injection, so it is more flexible in noise selection and manipulation, and the training speech data can be utilized more efficiently.
Noise injection with diverse SNRs
The flexibility of noisy training in noise selection can be further extended by involving multiple SNR levels. By involving noise signals at various SNRs, more abundant noise patterns can be learned. More importantly, we hypothesize that the abundant noise patterns provide more negative learning examples for DNN training, so the ‘true speech patterns’ can be better learned.
The experimental setup is the same as in the previous experiment, i.e., fixing μ_SNR = 15 dB and injecting multiple noises including the ‘no-noise’ data. In order to introduce diverse SNRs, σ_SNR is set to a large value. In this study, σ_SNR varies from 0.01 to 50. A larger σ_SNR leads to more diverse noise levels and a higher possibility of loud noises. For simplicity, only the results with white + cafeteria noise injection are reported; other configurations were tested and led to similar conclusions.
Firstly, we examine the performance with ‘known noises’, i.e., data corrupted by white noise and cafeteria noise. The WER results are shown in Figure 7a, which presents the results on the data corrupted by white noise, and in Figure 7b, which presents the results on the data corrupted by cafeteria noise. We can observe that with a more diverse noise injection (a larger σ_SNR), the performance under both noise conditions is generally improved. However, if σ_SNR is too large, the performance might be degraded. This can be attributed to the fact that a very large σ_SNR results in a significant proportion of extremely large or small SNRs, which is not consistent with the test condition. The experimental results show that the best performance is obtained with σ_SNR = 10.
In another group of experiments, we examine the performance of the noisy-trained DNN model on data corrupted by ‘unknown noises’, i.e., noises different from the ones injected in training. The results are reported in Figure 8. We observe quite different patterns for different noise corruptions. For most noise conditions, we observe a similar trend as in the known-noise condition: when injecting noises at more diverse SNRs, the WER tends to decrease, but if the noise is overly diverse, the performance may be degraded. The maximum σ_SNR should not exceed 0.1 in most cases (restaurant noise, park noise, station noise). For the car noise condition, the optimal σ_SNR is 0.01, and for the bus noise condition, the optimal σ_SNR is 1.0. The smaller optimal σ_SNR in the car noise condition indicates again that this noise is significantly different from the injected white and cafeteria noises; conversely, the larger optimal σ_SNR in the bus noise condition suggests that the bus noise resembles the injected noises.
In general, the optimal values of σ_SNR under unknown noises are much smaller than those under known noises. This is somewhat expected, since injection of overly diverse/loud noises that differ from those observed in the test tends to cause acoustic mismatch between the training and test data, which may offset the improved generalization capability offered by the noisy training. Therefore, to obtain the largest possible gains with the noisy training, the best strategy is to involve as many noise types as possible in training so that (1) most of the noises in the test are known or partially known, i.e., similar noises are involved in training, and (2) a larger σ_SNR can be safely employed to obtain better performance. For a system that operates in unknown noise conditions, the most reasonable strategy is to involve some typical noise types (e.g., white noise, car noise, cafeteria noise) and choose a moderate noise corruption level, i.e., a middle-level μ_SNR not larger than 15 dB and a small σ_SNR not larger than 0.1.