Large-scale automatic extraction of an English-Chinese translation lexicon

Machine Translation

Abstract

We report experimental results on the automatic extraction of an English-Chinese translation lexicon by statistical analysis of a large parallel corpus, using limited amounts of linguistic knowledge. To our knowledge, these are the first empirical results of this kind between an Indo-European and a non-Indo-European language for any significant vocabulary and corpus size. The learned vocabulary covers about 6,500 English words, with translation precision in the 86–96% range, and alignment proceeds at the paragraph, sentence, and word levels. Specifically, we report (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus, (2) experiments supporting the usefulness of restricted lexical cues for statistical paragraph and sentence alignment, and (3) experiments that question the role of hand-derived monolingual lexicons for automatic word translation acquisition. Using a hand-derived monolingual lexicon, the learned translation lexicon averages 2.33 Chinese translations per English entry, with a manually filtered precision of 95.1% and an automatically filtered weighted precision of 86.0%. We then introduce a fully automatic two-stage statistical methodology that is able to learn translations for collocations. A statistically learned monolingual Chinese lexicon is first used to segment the Chinese text; bilingual training is then applied to produce 6,429 English entries with 2.25 Chinese translations per entry. This method improves the manually filtered precision to 96.0% and the automatically filtered weighted precision to 91.0%, a 35.7% reduction in error rate relative to using a hand-derived monolingual lexicon.
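
The 35.7% figure can be checked from the two weighted precisions quoted in the abstract. The sketch below assumes that "error rate" means 100% minus precision and that the reduction is measured relative to the baseline error; the function name is hypothetical and is not taken from the article.

```python
def relative_error_reduction(baseline_precision: float, improved_precision: float) -> float:
    """Return the relative reduction in error rate, as a percentage.

    Assumption (not stated in the abstract): error rate = 100 - precision,
    with both precisions given as percentages.
    """
    baseline_error = 100.0 - baseline_precision
    improved_error = 100.0 - improved_precision
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Automatically filtered weighted precision: 86.0% with the hand-derived
# lexicon, 91.0% with the statistically learned lexicon.
reduction = relative_error_reduction(86.0, 91.0)
print(round(reduction, 1))  # 35.7
```

Under that reading, the error rate falls from 14.0% to 9.0%, and 5.0 / 14.0 is approximately 35.7%, which matches the reported figure.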

About this article

Cite this article

Wu, D., Xia, X. Large-scale automatic extraction of an English-Chinese translation lexicon. Mach Translat 9, 285–313 (1994). https://doi.org/10.1007/BF00980581

  • DOI: https://doi.org/10.1007/BF00980581
