Evaluation of Internal Validity Measures in Short-Text Corpora

  • Diego Ingaramo
  • David Pinto
  • Paolo Rosso
  • Marcelo Errecalde
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4919)

Abstract

Short-text clustering is one of the most difficult tasks in natural language processing because document terms occur with low frequency. We are interested in analysing this kind of corpus in order to develop novel techniques that may improve the results obtained by classical clustering algorithms. In this paper we present an evaluation of different internal clustering validity measures in order to determine their possible correlation with the F-Measure, a well-known external clustering measure used to assess the performance of clustering algorithms. We used several short-text corpora in the experiments. The correlation obtained with a particular set of internal validity measures leads us to conclude that some of them may be used to improve the performance of text clustering algorithms.
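
The kind of comparison the abstract describes can be illustrated with a minimal sketch. It assumes a common class-weighted definition of the clustering F-Measure and uses Spearman rank correlation as the correlation statistic; the abstract names neither choice, and the function names below are hypothetical, not taken from the paper.

```python
from collections import Counter


def clustering_f_measure(labels_true, labels_pred):
    """Class-weighted F-Measure: for each gold class, take the best F1
    over all clusters, then weight by class size (assumed definition)."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Given one internal validity score and one F-Measure value per candidate clustering of a corpus, `spearman(internal_scores, f_scores)` indicates how well the internal measure orders the clusterings in the same way as the external one. A value near 1 suggests the internal measure could stand in for the F-Measure when gold labels are unavailable.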

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Diego Ingaramo (1)
  • David Pinto (2, 3)
  • Paolo Rosso (2)
  • Marcelo Errecalde (1)
  1. Development and Research Laboratory in Computational Intelligence (LIDIC), UNSL, Argentina
  2. Natural Language Engineering Lab., Department of Information Systems and Computation, Polytechnic University of Valencia, Spain
  3. Faculty of Computer Science (FCC), BUAP, Mexico
