Effects of Aligned Corpus Quality and Size in Corpus-Based CLIR

  • Tuomas Talvensaari
Conference paper

DOI: 10.1007/978-3-540-78646-7_13

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4956)
Cite this paper as:
Talvensaari T. (2008) Effects of Aligned Corpus Quality and Size in Corpus-Based CLIR. In: Macdonald C., Ounis I., Plachouras V., Ruthven I., White R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg


Aligned corpora are often-used resources in CLIR systems. The three qualities of translation corpora that most dramatically affect the performance of a corpus-based CLIR system are: (1) topical nearness to the translated queries, (2) the quality of the alignments, and (3) the size of the corpus. In this paper, the effects of these factors are studied and evaluated. Topics of two different domains (news and genomics) are translated with corpora of varying alignment quality, ranging from a clean parallel corpus to noisier comparable corpora. Also, the sizes of the corpora are varied. The results show that of the three qualities, topical nearness is the most crucial factor, outweighing both other factors. This indicates that noisy comparable corpora should be used as complementary resources when parallel corpora are not available for the domain in question.


Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Tuomas Talvensaari (1)
  1. Department of Computer Sciences, University of Tampere, Finland
