Part of the book series: Advances in Pattern Recognition (ACVPR)

Speech input through a voice user interface (voice UI) plays an important role in enhancing the usability of small portable devices such as mobile phones. In these devices, more traditional means of interaction (e.g. keyboard and display) are limited by small size, battery life and cost. Speech is considered a natural way of interacting with man-machine interfaces. After decades of research and development, voice UIs are becoming widely deployed and accepted in commercial applications, and the global proliferation of embedded devices is expected to strengthen this trend further in the coming years. A core enabling technology of voice UIs is automatic speech recognition (ASR). Example applications of embedded ASR in mobile phones are name dialling, phone book search, command-and-control and, more recently, large-vocabulary dictation. In the mobile context, several technological challenges have to be overcome: ambient noise, the constraints of available hardware platforms, cost limitations, and the need for wide language coverage. In addition, mobile ASR systems need to achieve a virtually perfect level of performance to gain user acceptance. This chapter reviews the application of embedded ASR in mobile phones and describes specific issues related to language development, noise robustness, and embedded implementation and platforms. Several practical solutions are presented throughout the chapter, with supporting experimental results.

Copyright information

© 2008 Springer-Verlag London Limited

About this chapter

Cite this chapter

Varga, I., Kiss, I. (2008). Speech Recognition in Mobile Phones. In: Automatic Speech Recognition on Mobile Devices and over Communication Networks. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-143-5_14

  • DOI: https://doi.org/10.1007/978-1-84800-143-5_14

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-142-8

  • Online ISBN: 978-1-84800-143-5

  • eBook Packages: Computer Science, Computer Science (R0)
