Handheld Speech to Speech Translation System

  • Yuqing Gao
  • Bowen Zhou
  • Weizhong Zhu
  • Wei Zhang
Part of the Advances in Pattern Recognition book series (ACVPR)

Recent advances in the processing capabilities of handheld devices (PDAs or mobile phones) have made it possible to run speech recognition systems, and even end-to-end speech translation systems, on these devices. However, two-way free-form speech-to-speech translation (as opposed to fixed-phrase translation) is a highly complex task, and a large amount of computation is involved in achieving reliable translation performance. Resource limitations go beyond CPU speed: the memory and storage requirements, as well as the audio input and output requirements, all tax current systems to their limits. When the resource demand exceeds the computational capability of available state-of-the-art handheld devices, a common technique for mobile speech-to-speech translation systems is the client-server approach, in which the handheld device (a mobile phone or PDA) is treated simply as a system client. While we briefly describe the client/server approach, we focus mainly on the approach in which the end-to-end speech-to-speech translation system is hosted entirely on the handheld device. We describe the challenges, and the algorithm and code optimization solutions, that we developed for the handheld MASTOR (Multilingual Automatic Speech-to-Speech Translator) systems for translation between English and Mandarin Chinese, and between English and Arabic, on embedded Linux and Windows CE operating systems. The system includes an HMM-based large-vocabulary continuous speech recognizer using statistical n-grams, a translation module, and a multi-language speech synthesis system.
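The statistical n-gram language model used by the recognizer can be illustrated with a minimal bigram model trained on a toy corpus. This is a simplified sketch with add-k smoothing, not the chapter's actual implementation; the MASTOR system uses large-vocabulary models that are heavily compressed for handheld memory budgets.

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized sentences,
    padding each sentence with start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w1, w2, vocab_size, k=1.0):
    """P(w2 | w1) with add-k smoothing, so unseen bigrams
    still receive nonzero probability."""
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)

corpus = [["where", "is", "the", "station"],
          ["where", "is", "the", "hospital"]]
uni, bi = train_bigram(corpus)
V = len({w for s in corpus for w in s}) + 2  # include <s> and </s>
p = bigram_prob(uni, bi, "where", "is", V)   # = (2+1)/(2+7) = 1/3
```

On-device systems typically trade some modeling accuracy for footprint, e.g. by quantizing probabilities and pruning low-count n-grams, which is where the optimization effort described in the chapter comes in.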


Keywords: Speech Recognition; Language Model; Personal Digital Assistant; Machine Translation; Target Language





Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Yuqing Gao (1)
  • Bowen Zhou (1)
  • Weizhong Zhu (1)
  • Wei Zhang (1)

  1. IBM T. J. Watson Research Center, USA
