A comparable corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that documents in each subset are on a same topic as the documents in the others. The prototypical example of a comparable corpora is a collection of newspaper article written in different languages and reporting about the same events: while they will not be, strictly speaking, the translation of one another, they will share most of the semantic content. Some methods for cross-language text mining rely, totally or partially, on the statistical properties of comparable corpora.
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2011 Springer Science+Business Media, LLC
About this entry
Cite this entry
(2011). Comparable Corpus. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_144
Download citation
DOI: https://doi.org/10.1007/978-0-387-30164-8_144
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8
eBook Packages: Computer ScienceReference Module Computer Science and Engineering