Modeling State-Conditional Observation Distribution Using Weighted Stereo Samples for Factorial Speech Processing Models

Khademian, Mahdi; Homayounpour, Mohammad Mehdi

doi:10.1007/s00034-016-0310-y

Published: 06 April 2016

Circuits, Systems, and Signal Processing, Volume 36, pages 339–357 (2017)

An Erratum to this article was published on 27 April 2016

Abstract

This paper investigates the effectiveness of factorial speech processing models in noise-robust automatic speech recognition tasks. To this end, it proposes an idealistic approach, based on weighted stereo samples, for modeling the state-conditional observation distribution of factorial models. The approach extends the earlier single-pass retraining method for ideal model compensation so that it supports multiple audio sources. Non-stationary noise can be treated as one of these audio sources, with multiple states. Experiments on set A of the Aurora 2 dataset show that treating noise in this way improves recognition performance. The improvement is significant in low signal-to-noise ratio conditions, reaching up to 4% absolute word recognition accuracy. Besides representing the state-conditional observation distribution accurately, the proposed method has an important advantage over previous methods: the feature spaces for the source features and for the corrupted features can be selected independently. This opens the way to seeking feature spaces that suit noisy speech, independent of the features used for clean speech.

References

1. M. Afify, X. Cui, Y. Gao, Stereo-based stochastic mapping for robust speech recognition. IEEE Trans. Audio Speech Lang. Process. 17, 1325–1334 (2009)
2. J. Baker, L. Deng, J. Glass, S. Khudanpur, C. Lee, N. Morgan, D. O’Shaughnessy, Developments and directions in speech recognition and understanding, part 1 [DSP education]. IEEE Signal Process. Mag. 26, 75–80 (2009)
3. C.M. Bishop, Pattern Recognition and Machine Learning, vol. 4 (Springer, New York, 2007)
4. M. Brookes, Voicebox: Speech Processing Toolbox for MATLAB. (1997). http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
5. L. Deng, J. Droppo, A. Acero, Enhancement of log Mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise. IEEE Trans. Speech Audio Process. 12, 133–143 (2004)
6. P.S. Dwyer, Some applications of matrix derivatives in multivariate analysis. J. Am. Stat. Assoc. 62, 607–625 (1967)
7. B.J. Frey, L. Deng, A. Acero, T.T. Kristjansson, ALGONQUIN: iterating Laplace’s method to remove multiple types of acoustic distortion for robust speech recognition. Eurospeech 2001, 901–904 (2001)
8. M.J.F. Gales, Model-Based Techniques for Noise Robust Speech Recognition (University of Cambridge, Cambridge, 1995)
9. M.J.F. Gales, S.J. Young, Robust continuous speech recognition using parallel model combination. IEEE Trans. Speech Audio Process. 4, 352–359 (1996)
10. Z. Ghahramani, M.I. Jordan, Factorial hidden Markov models. Mach. Learn. 29, 245–273 (1997)
11. J.R. Hershey, S.J. Rennie, J. Le Roux, Factorial models for noise robust speech recognition, in Techniques for Noise Robustness in Automatic Speech Recognition, ed. by T. Virtanen, R. Singh, B. Raj (Wiley, New York, 2012), pp. 311–345. https://merl.com/publications/docs/TR2012-002.pdf
12. J.R. Hershey, S.J. Rennie, P.A. Olsen, T.T. Kristjansson, Super-human multi-talker speech recognition: a graphical modeling approach. Comput. Speech Lang. 24, 45–66 (2010)
13. H.G. Hirsch, D. Pearce, The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in ASR2000-Automatic Speech Recognition: Challenges for the New Millenium ISCA Tutorial and Research Workshop (ITRW) (2000)
14. V. Leutnant, R. Haeb-Umbach, An analytic derivation of a phase-sensitive observation model for noise robust speech recognition. Interspeech 2009, 2395–2398 (2009)
15. J. Li, L. Deng, Y. Gong, R. Haeb-Umbach, An overview of noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 745–777 (2014)
16. J. Li, L. Deng, D. Yu, Y. Gong, A. Acero, A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions. Comput. Speech Lang. 23, 389–405 (2009)
17. B. Logan, P.J. Moreno, Factorial Hidden Markov Models for Speech Recognition: Preliminary Experiments (Cambridge Research Laboratory, Cambridge, 1997)
18. D.J.C. Mackay, Introduction to Monte Carlo methods, in Learning in Graphical Models, vol. 89, ed. by M.I. Jordan (Springer, Dordrecht, 1998), pp. 175–204
19. M.H. Radfar, R.M. Dansereau, Single-channel speech separation using soft mask filtering. IEEE Trans. Audio Speech Lang. Process. 15, 2299–2310 (2007)
20. S.T. Roweis, Factorial models and refiltering for speech separation and denoising, in Eighth European Conference on Speech Communication and Technology (2003)
21. S.M. Siddiqi, G.J. Gordon, A.W. Moore, Fast state discovery for HMM model selection and learning, in International Conference on Artificial Intelligence and Statistics (2007), pp. 492–499
22. R.C. Van Dalen, Statistical Models for Noise-Robust Speech Recognition (University of Cambridge, Cambridge, 2011)
23. S.X. Wang, Maximum Weighted Likelihood Estimation (University of British Columbia, Vancouver, 2001)
24. S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev, P. Woodland, The HTK Book, vol. 3 (Cambridge University Engineering Department, Cambridge, 2009)

Acknowledgments

The authors would like to thank Mohammad Ali Keyvanrad, Dr. Omid Naghshineh Arjmand and Dr. Adel Mohammadpour for their valuable arguments and suggestions in this work.

Author information

Authors and Affiliations

Laboratory for Intelligent Multimedia Processing (LIMP), Amirkabir University of Technology, Tehran, Iran
Mahdi Khademian & Mohammad Mehdi Homayounpour

Corresponding author

Correspondence to Mohammad Mehdi Homayounpour.

Additional information

An erratum to this article is available at http://dx.doi.org/10.1007/s00034-016-0326-3.

Appendix: Extending the EM Algorithm for Modeling Mixture of Gaussians Based on Weighted Samples

In the E-step of the EM algorithm for weighted particles, the particle weights have no effect on the component responsibility equations. Interpreting each weight as the number of times its particle is replicated (similar to (24)) shows why: replicating a particle does not change the responsibility of any component for that particle. The responsibilities are therefore computed from the old parameter set without the particle weights, exactly as in the E-step of the standard EM algorithm for GMMs:

$$\gamma_{l}(k) \propto \pi_{k}^{\prime}\, \mathcal{N}\left(\boldsymbol{y}_{l}; \boldsymbol{\mu}_{k}^{\prime}, \boldsymbol{\Sigma}_{k}^{\prime}\right) \tag{33}$$

where the normalization constant is $\sum\nolimits_{k=1}^{K} \pi_{k}^{\prime}\, \mathcal{N}\left(\boldsymbol{y}_{l}; \boldsymbol{\mu}_{k}^{\prime}, \boldsymbol{\Sigma}_{k}^{\prime}\right)$.
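
As a minimal sketch of the E-step in (33), the following NumPy code computes the responsibilities from the old parameter set and never uses the particle weights. The function names `log_gaussian` and `responsibilities`, the array layout (particles as rows) and the log-domain normalization are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def log_gaussian(Y, mean, cov):
    """Log density of N(mean, cov) at each row of Y, where Y has shape (L, D)."""
    D = Y.shape[1]
    diff = Y - mean
    # Solve cov @ X = diff^T rather than forming the explicit inverse.
    maha = np.einsum("ld,dl->l", diff, np.linalg.solve(cov, diff.T))
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)


def responsibilities(Y, priors, means, covs):
    """E-step of (33): gamma[l, k] is proportional to pi_k' * N(y_l; mu_k', Sigma_k').

    Particle weights are deliberately not an argument: they do not enter (33).
    """
    K = len(priors)
    log_post = np.stack(
        [np.log(priors[k]) + log_gaussian(Y, means[k], covs[k]) for k in range(K)],
        axis=1,
    )
    # Subtract the row maximum before exponentiating for numerical stability.
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    # Divide by the normalization constant (sum over components).
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Because each row of the result depends only on its own particle and the old parameters, appending copies of a particle leaves the existing rows unchanged, which is the replication argument made above.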

For the M-step, the following optimization problem must be solved:

$$\begin{aligned}&\theta^{\mathrm{new}} = \mathop{\mathrm{argmax}}\limits_{\theta}\, \mathcal{Q}\left(\theta, \theta^{\prime}\right) \\&\text{s.t.}\quad \sum\nolimits_{k=1}^{K} \pi_{k} = 1 \end{aligned} \tag{34}$$

Using the method of Lagrange multipliers to enforce the constraint on the component priors, we obtain the following objective function:

$$g\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}\right) = \sum\nolimits_{l=1}^{L} w_{l} \sum\nolimits_{k=1}^{K} \gamma_{l}(k) \left[\ln p\left(\boldsymbol{y}_{l}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right) + \ln \pi_{k}\right] + \lambda \left(\sum\nolimits_{k=1}^{K} \pi_{k} - 1\right) \tag{35}$$

Taking the derivative of $g$ with respect to $\boldsymbol{\mu}_{k}$ gives:

$$\partial g/\partial \boldsymbol{\mu}_{k} = \sum\nolimits_{l=1}^{L} w_{l}\, \gamma_{l}(k) \left[\boldsymbol{\Sigma}_{k}^{-1} \left(\boldsymbol{y}_{l} - \boldsymbol{\mu}_{k}\right)\right] \tag{36}$$

Setting this derivative to zero yields the update (30) for $\boldsymbol{\mu}_{k}$. For estimating $\boldsymbol{\Sigma}_{k}$, according to [6] the derivative takes the following form:

$$\partial g/\partial \boldsymbol{\Sigma}_{k} = -\frac{1}{2} \sum\nolimits_{l=1}^{L} w_{l}\, \gamma_{l}(k) \left[\boldsymbol{\Sigma}_{k}^{-1} - \boldsymbol{\Sigma}_{k}^{-1} \left(\boldsymbol{y}_{l} - \boldsymbol{\mu}_{k}\right) \left(\boldsymbol{y}_{l} - \boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}\right] \tag{37}$$

in which $\boldsymbol{\mu}_{k}$ is estimated by (30). Setting this derivative to zero and multiplying from the left by $\boldsymbol{\Sigma}_{k}$, we obtain:

$$\sum\nolimits_{l=1}^{L} w_{l}\, \gamma_{l}(k) \left(\boldsymbol{y}_{l} - \boldsymbol{\mu}_{k}\right) \left(\boldsymbol{y}_{l} - \boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1} = \left(\sum\nolimits_{l=1}^{L} w_{l}\, \gamma_{l}(k)\right) \boldsymbol{I} \tag{38}$$

Solving for $\boldsymbol{\Sigma}_{k}$ yields (31). When the number of samples is large, the estimator needs no bias correction. Finally, for $\pi_{k}$ we have:

$$\partial g/\partial \pi_{k} = \sum\nolimits_{l=1}^{L} \left(w_{l}\, \gamma_{l}(k)\right) / \pi_{k} + \lambda = 0 \tag{39}$$

Multiplying (39) by $\pi_{k}$ and summing over $k$, then using the constraint $\sum\nolimits_{k=1}^{K} \pi_{k} = 1$ and the fact that $\gamma_{l}(k)$ is a valid conditional probability mass function (so that $\sum\nolimits_{k=1}^{K} \gamma_{l}(k) = 1$), $\lambda$ is given by:

$$\lambda = -\sum\nolimits_{l=1}^{L} w_{l} \tag{40}$$

Substituting (40) into (39) eliminates $\lambda$ and leads to (32) for updating $\pi_{k}$.
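
The M-step derivation can be sketched in NumPy as follows. Equations (30)–(32) are not reproduced in this appendix, so the updates in the code are the ones obtained here by setting (36), (37) and (39) to zero with $\lambda$ taken from (40); they are assumed, not confirmed, to match the paper's (30)–(32). The function name `weighted_m_step` and the array layout are hypothetical.

```python
import numpy as np


def weighted_m_step(Y, w, gamma):
    """Weighted M-step implied by (36)-(40).

    Y:     (L, D) particles
    w:     (L,)   non-negative particle weights
    gamma: (L, K) responsibilities from the E-step
    """
    K, D = gamma.shape[1], Y.shape[1]
    wg = w[:, None] * gamma               # w_l * gamma_l(k), shape (L, K)
    Nk = wg.sum(axis=0)                   # weighted effective count per component
    priors = Nk / w.sum()                 # (39) with lambda = -sum_l w_l from (40)
    means = (wg.T @ Y) / Nk[:, None]      # zero of (36)
    covs = np.empty((K, D, D))
    for k in range(K):
        diff = Y - means[k]
        # Solving (38) for Sigma_k: weighted scatter divided by Nk[k].
        covs[k] = (wg[:, k, None] * diff).T @ diff / Nk[k]
    return priors, means, covs
```

With integer weights, these updates return the same parameters as running the unweighted updates on a data set in which each particle is repeated as many times as its weight, which matches the replication view of the weights used for the E-step.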

About this article

Cite this article

Khademian, M., Homayounpour, M.M. Modeling State-Conditional Observation Distribution Using Weighted Stereo Samples for Factorial Speech Processing Models. Circuits Syst Signal Process 36, 339–357 (2017). https://doi.org/10.1007/s00034-016-0310-y

Received: 25 September 2015
Revised: 24 March 2016
Accepted: 25 March 2016
Published: 06 April 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s00034-016-0310-y
