Termhood-Based Comparability Metrics of Comparable Corpus in Special Domain

  • Conference paper
Chinese Lexical Semantics (CLSW 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7717)

Abstract

Resources for cross-language information retrieval (CLIR) and machine translation (MT), such as dictionaries and parallel corpora, are scarce and hard to come by for special domains. Moreover, such resources exist for only a few languages, such as English, French, and Spanish. Automatically obtaining comparable corpora for these domains is one effective answer to this problem. Comparable corpora, whose subcorpora are not translations of each other, can be obtained easily from the web. Building and using comparable corpora is therefore often a more feasible option in multilingual information processing.

Measuring comparability is one of the key issues in building and using comparable corpora. Currently, there is no widely accepted definition of corpus comparability or method of measuring it. In fact, different definitions or metrics may suit different natural language processing tasks. This paper proposes a new comparability metric, the termhood-based metric, oriented to the task of bilingual terminology extraction. In this method, words are ranked by termhood rather than frequency, and the cosine similarity calculated from the ranked lists of word termhood is used as the comparability score. Experimental results show that the termhood-based metric performs better than traditional frequency-based metrics.
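
The idea of scoring comparability as a cosine similarity over termhood values can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how termhood is computed, so the rank-difference score below (a word's normalized rank in a domain corpus minus its normalized rank in a background corpus) is an assumption, and the names `rank_scores`, `termhood`, and `cosine` are hypothetical. The sketch also assumes that both termhood vectors are already keyed by a shared vocabulary, for example after mapping one language to the other with a bilingual dictionary.

```python
import math


def rank_scores(counts):
    """Map each word to a normalized rank score in (0, 1]; the most frequent word gets 1."""
    # Ties are broken alphabetically so the ordering is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    n = len(ordered)
    return {w: (n - i) / n for i, w in enumerate(ordered)}


def termhood(domain_counts, background_counts):
    """Assumed termhood: rank score in the domain corpus minus rank score in the background corpus."""
    dom = rank_scores(domain_counts)
    bg = rank_scores(background_counts)
    # Words missing from the background corpus get a background score of 0.
    return {w: dom[w] - bg.get(w, 0.0) for w in dom}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy counts, invented purely for illustration.
background = {"the": 100, "cat": 50, "corpus": 1}
side_a = termhood({"corpus": 5, "the": 10}, background)
side_b = termhood({"corpus": 7, "the": 9}, background)
comparability = cosine(side_a, side_b)
```

A frequency-based baseline would pass the raw counts to `cosine` directly; the termhood variant differs only in weighting each word by how domain-specific it is before the comparison.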

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, S., Zhang, C. (2013). Termhood-Based Comparability Metrics of Comparable Corpus in Special Domain. In: Ji, D., Xiao, G. (eds) Chinese Lexical Semantics. CLSW 2012. Lecture Notes in Computer Science, vol 7717. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36337-5_15

  • DOI: https://doi.org/10.1007/978-3-642-36337-5_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36336-8

  • Online ISBN: 978-3-642-36337-5

  • eBook Packages: Computer Science, Computer Science (R0)
