International Journal of Speech Technology, Volume 10, Issue 2–3, pp 63–74

Development of the compact English LVCSR acoustic model for embedded entertainment robot applications

  • Xavier Menéndez-Pidal
  • Ajay Patrikar
  • Lex Olorenshaw
  • Hitoshi Honda


In this paper we discuss two techniques for reducing the size of the acoustic model while maintaining or improving the accuracy of the recognition engine. The first technique, demiphone modeling, reduces the redundancy present in a context-dependent, state-clustered Hidden Markov Model (HMM). Three-state demiphones, optimally designed from the triphone decision tree, are introduced to drastically reduce the phone space of the acoustic model and to improve system accuracy. The second redundancy elimination technique is a more classical approach based on parameter tying: similar variance vectors in each HMM cluster are tied together to reduce the number of parameters. The closeness between variance vectors is measured using a Vector Quantizer (VQ) so that the information carried by the variance parameters is preserved. The paper also reports speech recognition improvements from assigning a variable number of Gaussians per cluster and from using gender-based HMMs. The main motivation behind these techniques is to improve the acoustic model while lowering its memory usage, which may help reduce memory and improve accuracy in an embedded Large Vocabulary Continuous Speech Recognition (LVCSR) application.
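
The variance-tying idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes diagonal-covariance Gaussians, uses plain k-means as a stand-in for whatever VQ design the paper employs, and clusters in the log-variance domain, which is a choice made here for illustration. The function name `tie_variances`, the codebook size, and the 39-dimensional feature size are all hypothetical.

```python
import numpy as np


def tie_variances(variances, codebook_size, iterations=20, seed=0):
    """Quantize variance vectors with k-means in the log domain.

    variances: array of shape (n_gaussians, dim), all entries > 0.
    Returns (codebook, index): codebook has shape (codebook_size, dim),
    and index[i] names the shared variance vector used by Gaussian i.
    """
    rng = np.random.default_rng(seed)
    log_var = np.log(variances)
    # Initialise the codebook from distinct, randomly chosen vectors.
    start = rng.choice(len(log_var), size=codebook_size, replace=False)
    codebook = log_var[start].copy()
    for _ in range(iterations):
        # Squared Euclidean distance from every vector to every codeword.
        dist = ((log_var[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        index = dist.argmin(axis=1)
        for k in range(codebook_size):
            members = log_var[index == k]
            if len(members):  # an empty cell keeps its previous codeword
                codebook[k] = members.mean(axis=0)
    # Final assignment against the final codebook.
    dist = ((log_var[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    index = dist.argmin(axis=1)
    return np.exp(codebook), index


# Synthetic data only: 200 Gaussians with 39-dimensional variance vectors.
rng = np.random.default_rng(1)
variances = rng.uniform(0.5, 2.0, size=(200, 39))
codebook, index = tie_variances(variances, codebook_size=16)
tied = codebook[index]  # variance vector actually used by each Gaussian
# Values stored before tying minus values stored after (codebook + indices).
saved = variances.size - (codebook.size + index.size)
```

In this sketch each Gaussian stores one small integer index instead of a full variance vector, so the saving grows with the ratio of Gaussians to codewords. How large a codebook can be used before accuracy degrades is an empirical question that the sketch does not address.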


Large Vocabulary Continuous Speech Recognition; Acoustic modeling; Hidden Markov Model; Embedded speech recognition systems; Redundancy elimination; HMM parameters optimization; HMM memory reduction; Triphones; Demiphones




Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Xavier Menéndez-Pidal (1, corresponding author)
  • Ajay Patrikar (2)
  • Lex Olorenshaw (2)
  • Hitoshi Honda (3)

  1. R&D Laboratory, SONY Computer Entertainment of America, Foster City, USA
  2. Former Spoken Language Technology Laboratory, SONY Electronics, San José, USA
  3. Information Technologies Laboratories, SONY Corporation, Tokyo, Japan