ITSA⋆: An Effective Iterative Method for Short-Text Clustering Tasks

  • Marcelo Errecalde
  • Diego Ingaramo
  • Paolo Rosso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6096)

Abstract

The current tendency for people to use very short documents, e.g. blogs, text messages, news items and others, has produced an increasing interest in automatic processing techniques that are able to deal with documents with these characteristics. In this context, “short-text clustering” is a very important research field in which new clustering algorithms have recently been proposed to deal with this difficult problem. In this work, ITSA⋆, an iterative method based on the bio-inspired method PAntSA⋆, is proposed for this task. ITSA⋆ takes as input the results obtained by arbitrary clustering algorithms and refines them by iteratively applying the PAntSA⋆ algorithm. The proposal shows an interesting improvement in the results obtained with different algorithms on several short-text collections. Moreover, ITSA⋆ is not limited to serving as an effective improvement method: starting from random initial clusterings, ITSA⋆ outperforms well-known clustering algorithms in most of the experimental instances.
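
The refinement loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: PAntSA⋆ is not specified on this page, so a simple nearest-cluster reassignment (the hypothetical reassign_step) stands in for it, and stopping when the mean silhouette coefficient no longer improves is an assumption made for illustration. All function names and the toy data are likewise hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Labels = List[int]
Matrix = Sequence[Sequence[float]]


def group(labels: Labels) -> Dict[int, List[int]]:
    """Map each cluster label to the indices of its members."""
    clusters: Dict[int, List[int]] = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    return clusters


def mean_dist(i: int, members: List[int], dist: Matrix) -> float:
    return sum(dist[i][j] for j in members) / len(members)


def silhouette(labels: Labels, dist: Matrix) -> float:
    """Mean silhouette coefficient; singletons contribute 0."""
    clusters = group(labels)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, own in enumerate(labels):
        same = [j for j in clusters[own] if j != i]
        if not same:
            continue
        a = mean_dist(i, same, dist)
        b = min(mean_dist(i, m, dist) for c, m in clusters.items() if c != own)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def reassign_step(labels: Labels, dist: Matrix) -> Labels:
    """Stand-in for PAntSA* (assumption): move each item to the cluster
    whose other members are closest on average."""
    clusters = group(labels)
    new_labels = []
    for i, own in enumerate(labels):
        best_c, best_d = own, float("inf")
        for c, members in sorted(clusters.items()):
            others = [j for j in members if j != i]
            if others and mean_dist(i, others, dist) < best_d:
                best_c, best_d = c, mean_dist(i, others, dist)
        new_labels.append(best_c)
    return new_labels


def iterative_refine(
    labels: Labels,
    dist: Matrix,
    step: Callable[[Labels, Matrix], Labels] = reassign_step,
    max_iters: int = 50,
) -> Tuple[Labels, float]:
    """Apply `step` repeatedly while the silhouette strictly improves."""
    best, best_q = list(labels), silhouette(labels, dist)
    for _ in range(max_iters):
        candidate = step(best, dist)
        q = silhouette(candidate, dist)
        if q <= best_q:
            break
        best, best_q = candidate, q
    return best, best_q


# Toy data: six points on a line, two obvious groups, poor initial clustering.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(p - q) for q in points] for p in points]
initial = [0, 1, 0, 1, 0, 1]
refined, score = iterative_refine(initial, dist)
```

The initial clustering may come from any algorithm or be random, which mirrors the two uses mentioned in the abstract; only the refinement step and the quality measure would need to be replaced to approximate the actual method.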

Keywords

Cluster Algorithm · Initial Clusterings · Experimental Instance · Iterative Version · Improvement Percentage
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Marcelo Errecalde (1)
  • Diego Ingaramo (1)
  • Paolo Rosso (2)
  1. LIDIC, Universidad Nacional de San Luis, Argentina
  2. Natural Language Eng. Lab. ELiRF, DSIC, Universidad Politécnica de Valencia, Spain
