Clustering Short Text and Its Evaluation

  • Prajol Shrestha
  • Christine Jacquin
  • Béatrice Daille
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7182)


Recently there has been an increase in interest in clustering short text because it could be used in many NLP applications. Depending on the application, short texts can be characterized mainly by their length (e.g. sentences, paragraphs) and type (e.g. scientific papers, newspapers). Finding a clustering method that is able to cluster short text in general is difficult. In this paper, we cluster four corpora containing different types of text of varying length and evaluate the results against a gold standard. Based on these clustering experiments, we show how different similarity measures, clustering algorithms, and cluster evaluation methods affect the resulting clusters. We discuss four existing corpus-based similarity methods (Cosine similarity, Latent Semantic Analysis, Short text Vector Space Model, and Kullback-Leibler distance), four well-known clustering methods (Complete Link, Single Link, and Average Link hierarchical clustering, and Spectral clustering), and three evaluation methods (clustering F-measure, adjusted Rand Index, and V). Our experiments show that corpus-based similarity measures do not significantly affect the clusters and that spectral clustering performs better than hierarchical clustering. We also show that the values given by the evaluation methods do not always represent the usability of the clusters.
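As an illustration of two of the ingredients named in the abstract, the following is a minimal sketch of cosine similarity between two short texts and of the adjusted Rand Index between a gold-standard labeling and a predicted clustering. It is not the authors' implementation: the whitespace tokenization, the use of raw term counts without weighting, and the function names `cosine_similarity` and `adjusted_rand_index` are assumptions made for this example. The adjusted Rand Index is written in its usual pair-counting form.

```python
from collections import Counter
from math import comb, sqrt


def cosine_similarity(text_a, text_b):
    """Cosine between raw term-count vectors (assumed: whitespace tokens)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def adjusted_rand_index(gold, predicted):
    """Pair-counting adjusted Rand Index between two labelings."""
    if len(gold) != len(predicted):
        raise ValueError("labelings must have the same length")
    n = len(gold)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of items placed together in both, in gold, and in predicted.
    together_both = sum(comb(c, 2) for c in Counter(zip(gold, predicted)).values())
    together_gold = sum(comb(c, 2) for c in Counter(gold).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = together_gold * together_pred / comb(n, 2)
    maximum = (together_gold + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (together_both - expected) / (maximum - expected)
```

The index depends only on which items are grouped together, so renaming the cluster labels leaves it unchanged, and a clustering that agrees with the gold standard no better than chance scores around zero or below.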


Cluster Method; Spectral Cluster; Cosine Similarity; Latent Semantic Analysis; Hierarchical Agglomerative Cluster
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Prajol Shrestha (1)
  • Christine Jacquin (1)
  • Béatrice Daille (1)
  1. Laboratoire d’Informatique de Nantes-Atlantique (LINA), Université de Nantes, Nantes Cedex 3, France
