Speech input implemented in a voice user interface (voice UI) plays an important role in enhancing the usability of small portable devices such as mobile phones. In these devices, more traditional modes of interaction (e.g. keyboard and display) are limited by small size, battery life and cost. Speech is considered a natural mode of interaction for man-machine interfaces. After decades of research and development, voice UIs are becoming widely deployed and accepted in commercial applications, and the global proliferation of embedded devices is expected to further strengthen this trend in the coming years. A core technology enabler of voice UIs is automatic speech recognition (ASR). Example applications in mobile phones relying on embedded ASR include name dialling, phone book search, command-and-control and, more recently, large vocabulary dictation. In the mobile context, several technological challenges have to be overcome: ambient noise in the environment, the constraints and cost limitations of available hardware platforms, and the need for wide language coverage. In addition, mobile ASR systems need to achieve a virtually perfect performance level to gain user acceptance. This chapter reviews the application of embedded ASR in mobile phones and describes specific issues related to language development, noise robustness, and embedded implementation and platforms. Several practical solutions are presented throughout the chapter with supporting experimental results.
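The noise-robustness issues mentioned above are commonly addressed in embedded ASR front-ends by normalizing the cepstral features before recognition. As a minimal illustrative sketch (not the chapter's specific method), recursive cepstral mean normalization subtracts a running estimate of the channel/cepstral mean from each feature frame; the function name and the forgetting factor value below are assumptions for illustration only:

```python
import numpy as np

def recursive_cmn(frames, alpha=0.995):
    """Recursive cepstral mean normalization (illustrative sketch).

    frames: (T, D) array of cepstral feature vectors, one row per frame.
    alpha:  forgetting factor of the running mean (hypothetical value).

    A running mean of the features is updated frame by frame and
    subtracted from each frame, attenuating slowly varying
    convolutive channel effects while preserving speech dynamics.
    """
    mean = np.zeros(frames.shape[1])
    out = np.empty_like(frames, dtype=float)
    for t, x in enumerate(frames):
        mean = alpha * mean + (1.0 - alpha) * x  # exponential running mean
        out[t] = x - mean                        # normalized frame
    return out
```

Because the mean estimate is recursive, the normalization runs in constant memory per frame, which suits the memory and latency constraints of embedded platforms; for a stationary input the normalized output decays toward zero as the running mean converges.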
© 2008 Springer-Verlag London Limited
Cite this chapter
Varga, I., Kiss, I. (2008). Speech Recognition in Mobile Phones. In: Automatic Speech Recognition on Mobile Devices and over Communication Networks. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-143-5_14
DOI: https://doi.org/10.1007/978-1-84800-143-5_14
Publisher Name: Springer, London
Print ISBN: 978-1-84800-142-8
Online ISBN: 978-1-84800-143-5
eBook Packages: Computer Science, Computer Science (R0)