On the Assessment of Text Corpora

  • David Pinto
  • Paolo Rosso
  • Héctor Jiménez-Salazar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5723)


Classifier-independent measures are important for assessing the quality of corpora. In this paper we present supervised and unsupervised measures and use them to analyse several data collections with respect to the following features: domain broadness, shortness, class imbalance, and stylometry. We found that the investigated assessment measures may make it possible to evaluate the quality of gold standards. Moreover, they could help classification systems make strategic decisions when tackling specific text collections.
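
As a rough illustration of what classifier-independent corpus measures can look like, the sketch below computes three simple statistics from labels and raw text alone. These are generic stand-ins chosen for this example, not the measures defined in the paper: the function names, the whitespace tokenisation, and the specific formulas (deviation of class proportions from a uniform share, mean token count per document, and type-token ratio) are all assumptions.

```python
import math
from collections import Counter


def class_imbalance(labels):
    """Standard deviation of the observed class proportions around the
    uniform share 1/k (assumed formula; 0.0 means perfectly balanced)."""
    counts = Counter(labels)
    k = len(counts)
    n = len(labels)
    return math.sqrt(sum((c / n - 1 / k) ** 2 for c in counts.values()) / k)


def mean_document_length(docs):
    """Average number of whitespace-separated tokens per document,
    used here as a crude proxy for shortness."""
    return sum(len(d.split()) for d in docs) / len(docs)


def type_token_ratio(docs):
    """Distinct lower-cased tokens divided by total tokens, a basic
    vocabulary-richness statistic often used in stylometry."""
    tokens = [t.lower() for d in docs for t in d.split()]
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    docs = ["the cat sat", "the dog barked loudly", "stocks fell"]
    labels = ["pets", "pets", "finance"]
    print(class_imbalance(labels))
    print(mean_document_length(docs))
    print(type_token_ratio(docs))
```

None of these functions trains or queries a classifier, which is the property the abstract highlights. A real assessment would need proper tokenisation and measures normalised for corpus size.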


Keywords (added by machine, not by the authors): Machine Translation; Class Imbalance; Text Corpus; Domain Broadness; Text Collection




Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • David Pinto (1)
  • Paolo Rosso (2)
  • Héctor Jiménez-Salazar (3)

  1. Faculty of Computer Science, B. Autonomous University of Puebla, Mexico
  2. Natural Language Engineering Lab. - ELiRF, Universidad Politécnica de Valencia, Spain
  3. Department of Information Technologies, Autonomous Metropolitan University, Mexico
