International Journal of Speech Technology, Volume 14, Issue 3, pp 147–155

Robust features for multilingual acoustic modeling

Abstract

In this paper, we propose a technique to derive robust features for multilingual acoustic modeling using hidden Markov model–Gaussian mixture models (HMM-GMM). We achieve this by discriminatively combining the phonetic contexts of the target languages (languages in the multilingual system). Phonetic context is captured using wide temporal context of the features, and the dimensionality of the resulting feature set is reduced to suit the HMM-GMM implementation using a neural network with a bottle-neck in one of the hidden layers. The output before the non-linearity at the bottle-neck layer of the neural network is the new feature. Since the features are optimized for the target languages in the multilingual recognizer, they are referred to as Target Languages Oriented Features (TLOF).
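The feature pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the context of four frames on either side, the 13-dimensional input, the layer widths, the tanh non-linearity, and the random weights are all assumptions made for illustration, and the function names are hypothetical. In practice the network would first be trained to discriminate the phones of the target languages; only the layers up to the bottleneck are needed to extract features afterwards.

```python
import math
import random

def stack_context(frames, left=4, right=4):
    """Concatenate each frame with its neighbours; edge frames are repeated as padding."""
    padded = [frames[0]] * left + list(frames) + [frames[-1]] * right
    width = left + right + 1
    return [
        [value for frame in padded[t:t + width] for value in frame]
        for t in range(len(frames))
    ]

def affine(x, weights, bias):
    """Return W x + b for one input vector; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def bottleneck_output(x, layers, bottleneck_index):
    """Run the forward pass and return the bottleneck layer's linear output,
    i.e. the value before the non-linearity is applied."""
    h = x
    for i, (weights, bias) in enumerate(layers):
        z = affine(h, weights, bias)
        if i == bottleneck_index:
            return z
        h = [math.tanh(v) for v in z]
    raise ValueError("bottleneck_index is beyond the last layer")

def random_layer(rng, n_in, n_out):
    """Small random weights standing in for a trained layer (assumption)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# Placeholder input: 20 frames of 13-dimensional features (assumed sizes).
frames = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(20)]
stacked = stack_context(frames)  # 9 frames x 13 dims = 117 values per frame
# Hypothetical network: one wide hidden layer, then the bottleneck at index 1.
layers = [random_layer(rng, 117, 50), random_layer(rng, 50, 13)]
features = [bottleneck_output(x, layers, bottleneck_index=1) for x in stacked]
```

Each entry of `features` is a low-dimensional vector that could be passed to an HMM-GMM system in place of the original frame. Taking the output before the non-linearity leaves the values unbounded, whereas the tanh output would be confined to the interval (-1, 1).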

We perform our experiments for two of the most widely spoken Indian languages, Hindi and Tamil. TLOF offers significant performance improvements over both monolingual and multilingual phone recognizers using Mel frequency cepstral coefficients (MFCC). This emphasizes that TLOF can help share data across languages.

TLOF also improves the performance of monolingual acoustic models relative to systems using MFCC.

Keywords

Hidden Markov model (HMM); Neural networks (NN); Gaussian mixture models (GMM); Multilingual; Acoustic modeling; Robust features; Phone recognition; Speech recognition

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. ECE Department, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore, India
