A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora

  • Pascale Fung
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1529)

Abstract

We address two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contributions in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Evaluations of DKvec on noisy parallel corpora in English/Japanese and English/Chinese show 55.35% precision on a small corpus and 89.93% precision on a larger corpus. Our major contribution is in the extraction of bilingual lexicons from non-parallel corpora. We present a first result in this area, obtained with a new method, Convec. Convec is based on the context information of the word to be translated. Precision ranges from 30% when only the top candidate is considered to 76% when the top 20 translation candidates are considered. Most of the top-20 candidates are either collocations or words related to the correct translation. Since non-parallel corpora contain many more polysemous words, many-to-many translations, and lexical items that differ between the two languages, we conclude that the output from Convec is reasonable and useful.
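The abstract says only that DKvec relies on the arrival distances of words. A minimal sketch of that idea follows, assuming that an arrival distance is the gap in tokens between consecutive occurrences of a word, and that two such sequences are compared by dynamic time warping. The function names `arrival_distances` and `dtw_cost` are hypothetical, and this is an illustration rather than the paper's implementation.

```python
from typing import List, Sequence


def arrival_distances(tokens: Sequence[str], word: str) -> List[int]:
    """Gaps, in tokens, between consecutive occurrences of `word` (assumed definition)."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def dtw_cost(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook dynamic time warping cost with absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return inf  # nothing to align
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


if __name__ == "__main__":
    # Toy token streams; a real noisy parallel corpus would be far larger.
    source = "a b a c c a".split()
    target = "x y x z z z x".split()
    src_vec = arrival_distances(source, "a")
    tgt_vec = arrival_distances(target, "x")
    print(src_vec, tgt_vec, dtw_cost(src_vec, tgt_vec))
```

Under these assumptions, a lower warping cost suggests that two words recur at similar intervals in their respective texts, which is the kind of signal a candidate translation pair would be ranked by.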

Keywords

Machine Translation, Dynamic Time Warping, Word Sense Disambiguation, Computational Linguistics, Unknown Word
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Pascale Fung (1)
  1. Human Language Technology Center, Department of Electrical and Electronic Engineering, University of Science and Technology (HKUST), Clear Water Bay, Hong Kong
