Robust features for multilingual acoustic modeling

International Journal of Speech Technology

Abstract

In this paper, we propose a technique to derive robust features for multilingual acoustic modeling with hidden Markov model–Gaussian mixture model (HMM-GMM) systems. We achieve this by discriminatively combining the phonetic contexts of the target languages (the languages in the multilingual system). Phonetic context is captured using a wide temporal context of the features, and the dimensionality of the resulting feature set is reduced to suit the HMM-GMM implementation using a neural network with a bottle-neck in one of its hidden layers. The output of the bottle-neck layer, taken before the non-linearity, is the new feature. Since the features are optimized for the target languages in the multilingual recognizer, they are referred to as Target Languages Oriented Features (TLOF).
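The feature extraction path described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the context width (4 frames on each side), the layer sizes, the tanh non-linearity, and the random weights are all assumptions chosen for illustration, and the function names `stack_context` and `bottleneck_features` are hypothetical. In a real system the network would first be trained discriminatively on phone targets from the target languages.

```python
import numpy as np

def stack_context(feats, left=4, right=4):
    """Concatenate each frame with its neighbours (edge frames are repeated)."""
    n_frames, _ = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    # One shifted copy of the utterance per context position.
    windows = [padded[i:i + n_frames] for i in range(left + right + 1)]
    return np.hstack(windows)

def bottleneck_features(x, weights, biases, bottleneck_index):
    """Run the forward pass up to the bottle-neck layer and return its
    linear output, i.e. the activations before the non-linearity."""
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = h @ w + b
        if i == bottleneck_index:
            return z
        h = np.tanh(z)  # assumed non-linearity
    raise ValueError("bottleneck_index is beyond the last layer")

# Stand-in data: 100 frames of 39-dimensional MFCC-like features (assumed sizes).
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 39))

stacked = stack_context(mfcc, left=4, right=4)  # 9 frames x 39 dims per row

# Layers up to and just past the bottle-neck; the classification layer is
# omitted because it is not needed for feature extraction.
sizes = [stacked.shape[1], 500, 39, 500]
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

tlof_like = bottleneck_features(stacked, weights, biases, bottleneck_index=1)
print(stacked.shape, tlof_like.shape)
```

With these assumed sizes, the bottle-neck output has the same dimensionality as the original per-frame features, so it could be passed to an HMM-GMM trainer in place of MFCC without changing the rest of the pipeline.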

We perform our experiments on two of the most widely spoken Indian languages, Hindi and Tamil. TLOF offers significant performance improvements over both monolingual and multilingual phone recognizers using Mel frequency cepstral coefficients (MFCC), which indicates that TLOF can help share data across languages.

TLOF also enhances the performance of monolingual acoustic models compared to systems using MFCC.


Author information

Corresponding author

Correspondence to C. Santhosh Kumar.

About this article

Cite this article

Santhosh Kumar, C., Mohandas, V.P. Robust features for multilingual acoustic modeling. Int J Speech Technol 14, 147–155 (2011). https://doi.org/10.1007/s10772-011-9092-6

  • DOI: https://doi.org/10.1007/s10772-011-9092-6
