Revealing Translators' Knowledge: Statistical Methods in Constructing Practical Translation Lexicons for Language and Speech Processing

International Journal of Speech Technology

Abstract

Parallel corpora encode extremely valuable linguistic knowledge about paired languages, both in terms of vocabulary and syntax. A professional translation of a text represents a series of linguistic decisions made by the translator in order to convey as faithfully as possible the meaning of the original text and to produce a “natural” text from the perspective of a native speaker of the target language. The “naturalness” of a translation implies not only the grammaticality of the translated text, but also style and cultural or social specificity.

We describe a program that exploits the knowledge embedded in parallel corpora and produces a set of translation equivalents (a translation lexicon). The program uses almost no linguistic knowledge, relying instead on statistical evidence and some simplifying assumptions. Our experiments were conducted on the MULTEXT-EAST multilingual parallel corpus (Orwell's “1984”), and the system's performance is evaluated in some detail in terms of precision, recall and processing time. We conclude by briefly mentioning some applications of the automatically extracted lexicons to text and speech processing.

Cite this article

Tufiş, D., Barbu, A.M. Revealing Translators' Knowledge: Statistical Methods in Constructing Practical Translation Lexicons for Language and Speech Processing. International Journal of Speech Technology 5, 199–209 (2002). https://doi.org/10.1023/A:1020284521742

  • DOI: https://doi.org/10.1023/A:1020284521742
