Compilation of a Spanish Representative Corpus

  • Alexander Gelbukh
  • Grigori Sidorov
  • Liliana Chanona-Hernández
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2276)

Abstract

Due to Zipf's law, even a very large corpus contains very few occurrences (tokens) of most of its distinct words (types). Only a corpus containing enough occurrences of even rare words can provide the statistical information needed to study the contextual usage of words. We call such a corpus representative and suggest using the Internet to compile it. The corresponding algorithm and its application to Spanish are described. Different concepts of a representative corpus are discussed.
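
The type/token distinction can be illustrated with a minimal sketch. It is not the algorithm of the paper; the function name `rare_type_share`, the threshold `min_count`, and the toy sentence are assumptions made here for illustration. It measures what share of the distinct words (types) in a token list occur fewer than a chosen number of times.

```python
from collections import Counter

def rare_type_share(tokens, min_count):
    """Fraction of types that occur fewer than min_count times.

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter(tokens)          # type -> number of tokens
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c < min_count)
    return rare / len(counts)

# Toy input, invented for illustration only.
sample = "el gato y el perro y el ratón".split()

print(len(sample))                 # number of tokens
print(len(set(sample)))            # number of types
print(rare_type_share(sample, 2))  # share of types seen only once
```

In a real corpus the tokens would come from tokenized text rather than a hand-written sentence, and the threshold would be set by how many occurrences a study of contextual usage requires.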

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Alexander Gelbukh (1)
  • Grigori Sidorov (1)
  • Liliana Chanona-Hernández (1)
  1. Center for Computing Research, National Polytechnic Institute, USA
