Speech input implemented in a voice user interface (voice UI) plays an important role in enhancing the usability of small portable devices such as mobile phones, where more traditional means of interaction (e.g. keyboard and display) are limited by small size, battery life and cost. Speech is considered a natural way of interaction for man-machine interfaces. After decades of research and development, voice UIs are becoming widely deployed and accepted in commercial applications, and the global proliferation of embedded devices is expected to strengthen this trend further in the coming years. A core technology enabler of voice UIs is automatic speech recognition (ASR). Example applications in mobile phones relying on embedded ASR include name dialling, phone book search, command-and-control and, more recently, large vocabulary dictation. In the mobile context, several technological challenges have to be overcome: ambient noise in the environment, constraints of available hardware platforms and cost limitations, and the need for wide language coverage. In addition, mobile ASR systems need to achieve a virtually perfect performance level to gain user acceptance. This chapter reviews the application of embedded ASR in mobile phones and describes specific issues related to language development, noise robustness, and embedded implementation and platforms. Several practical solutions are presented throughout the chapter with supporting experimental results.

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Imre Varga, Corporate Technology, Siemens AG, Germany
  • Imre Kiss, Nokia, Finland