Development of the compact English LVCSR acoustic model for embedded entertainment robot applications

Menéndez-Pidal, Xavier; Patrikar, Ajay; Olorenshaw, Lex; Honda, Hitoshi

doi:10.1007/s10772-008-9012-6

Published: 10 January 2009

International Journal of Speech Technology, Volume 10, pages 63–74 (2007)

Abstract

In this paper we discuss two techniques for reducing the size of the acoustic model while maintaining or improving the accuracy of the recognition engine. The first technique, demiphone modeling, reduces the redundancy present in a context-dependent, state-clustered Hidden Markov Model (HMM). Three-state demiphones, optimally designed from the triphone decision tree, are introduced to drastically reduce the phone space of the acoustic model and to improve system accuracy. The second redundancy-elimination technique is a more classical approach based on parameter tying: similar variance vectors in each HMM cluster are tied together to reduce the number of parameters. The closeness between variance vectors is measured using a Vector Quantizer (VQ) so that the information carried by the variance parameters is preserved. The paper also reports speech recognition improvements from assigning a variable number of Gaussians per cluster and from gender-based HMMs. The main motivation behind these techniques is to improve the acoustic model while lowering its memory usage, which may benefit embedded Large Vocabulary Continuous Speech Recognition (LVCSR) applications.
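
The variance-tying idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Lloyd-style k-means quantizer with squared Euclidean distance, whereas the paper's actual VQ design and distance measure are not given in the abstract. The function name `tie_variances`, the seeding rule, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (an assumed closeness measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tie_variances(
    variances: List[Vector], codebook_size: int, max_iter: int = 50
) -> Tuple[List[List[float]], List[int]]:
    """Quantize variance vectors so that similar ones share a codeword.

    Returns (codebook, assignment), where assignment[i] is the index of
    the shared variance vector used by Gaussian i.
    """
    if not 1 <= codebook_size <= len(variances):
        raise ValueError("codebook_size must be in 1..len(variances)")

    # Deterministic seeding (assumption): evenly spaced input vectors.
    step = len(variances) / codebook_size
    codebook = [list(variances[int(i * step)]) for i in range(codebook_size)]
    assignment: List[int] = []

    for _ in range(max_iter):
        # Assignment step: nearest codeword for every variance vector.
        new_assignment = [
            min(range(codebook_size), key=lambda k: sq_dist(v, codebook[k]))
            for v in variances
        ]
        if new_assignment == assignment:
            break  # converged: codebook already matches this assignment
        assignment = new_assignment

        # Update step: each codeword becomes the mean of its members.
        for k in range(codebook_size):
            members = [v for v, a in zip(variances, assignment) if a == k]
            if members:
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]

    return codebook, assignment


# Toy data: four 2-dimensional variance vectors forming two obvious groups.
toy = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [5.2, 5.8]]
codebook, assignment = tie_variances(toy, codebook_size=2)
```

With the toy data above, the four original variance vectors are replaced by two shared ones plus one small index per Gaussian, which is where the memory saving comes from when the codebook is much smaller than the number of Gaussians.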

Author information

Authors and Affiliations

R&D Laboratory, SONY Computer Entertainment of America, 919 East Hillsdale Blvd, 2nd floor, Foster City, CA, 94404, USA
Xavier Menéndez-Pidal
Former Spoken Language Technology Laboratory, SONY Electronics, San José, CA, 94134, USA
Ajay Patrikar & Lex Olorenshaw
Information Technologies Laboratories, SONY Corporation, Tokyo, Japan
Hitoshi Honda

Corresponding author

Correspondence to Xavier Menéndez-Pidal.

About this article

Cite this article

Menéndez-Pidal, X., Patrikar, A., Olorenshaw, L. et al. Development of the compact English LVCSR acoustic model for embedded entertainment robot applications. Int J Speech Technol 10, 63–74 (2007). https://doi.org/10.1007/s10772-008-9012-6

Received: 08 July 2005
Accepted: 12 November 2008
Published: 10 January 2009
Issue Date: September 2007
DOI: https://doi.org/10.1007/s10772-008-9012-6
