Language Resources and Evaluation, Volume 43, Issue 1, pp 27–40

A cost-effective lexical acquisition process for large-scale thesaurus translation

  • Jimmy Lin
  • G. Craig Murray
  • Bonnie J. Dorr
  • Jan Hajič
  • Pavel Pecina


Thesauri and controlled vocabularies facilitate access to digital collections by explicitly representing the underlying principles of organization. Translating such resources into multiple languages is an important part of providing multilingual access. However, the specificity of vocabulary terms in most thesauri precludes fully automatic translation using general-domain lexical resources. In this paper, we present an efficient process for leveraging human translations to construct domain-specific lexical resources. This process is illustrated on a thesaurus of 56,000 concepts used to catalog a large archive of oral histories. We elicited human translations on a small subset of concepts, induced a probabilistic phrase dictionary from these translations, and used the resulting resource to automatically translate the rest of the thesaurus. Two separate evaluations demonstrate the acceptability of the automatic translations and the cost-effectiveness of our approach.
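The three steps named in the abstract (elicit translations for a subset, induce a probabilistic phrase dictionary, translate the remaining terms) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the human translations are already aligned at the phrase level, estimates probabilities by relative frequency, and translates a term by greedy longest-match lookup without any reordering. The function names and the toy English-Czech pairs are hypothetical and invented for this example.

```python
from collections import Counter, defaultdict


def induce_phrase_dictionary(aligned_pairs):
    """Estimate P(target | source) by relative frequency over aligned phrase pairs."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    return {
        source: {t: n / sum(targets.values()) for t, n in targets.items()}
        for source, targets in counts.items()
    }


def translate_term(term, phrase_dict):
    """Translate a term by greedy longest-match lookup, left to right.

    Returns None when some part of the term is not covered by the dictionary.
    """
    words = term.split()
    output = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest phrase first
            phrase = " ".join(words[i:j])
            if phrase in phrase_dict:
                candidates = phrase_dict[phrase]
                # Pick the most probable translation of the matched phrase.
                output.append(max(candidates, key=candidates.get))
                i = j
                break
        else:
            return None  # no dictionary entry starts at position i
    return " ".join(output)


# Toy phrase pairs (hypothetical, for illustration only).
aligned_pairs = [
    ("family life", "rodinný život"),
    ("school", "škola"),
    ("school", "škola"),
    ("school", "školství"),
    ("Prague", "Praha"),
]
phrase_dict = induce_phrase_dictionary(aligned_pairs)

covered = translate_term("school", phrase_dict)      # most frequent candidate wins
uncovered = translate_term("village", phrase_dict)   # None: not in the dictionary
```

Terms that come back as `None` are the ones a process like this would still need to route to human translators; a realistic system would also have to handle word order and morphology, which this sketch ignores.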


Thesauri, Controlled vocabularies, Manual translation process



Our thanks to Doug Oard for helpful discussions; to our Czech informants; and to Soumya Bhat for her programming efforts. This work was supported in part by NSF IIS Award 0122466 and NSF CISE RI Award EIA0130422. Additional support came from MSMT CR grants #1P05ME786, #LC536, and #MSM0021620838, and from Grant Agency of the Czech Republic grant #GA405/06/0589. The first author would like to thank Esther and Kiri for their kind support.



Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  • Jimmy Lin (1)
  • G. Craig Murray (1)
  • Bonnie J. Dorr (1)
  • Jan Hajič (2)
  • Pavel Pecina (2)

  1. University of Maryland, College Park, USA
  2. Charles University, Prague, Czech Republic
