Knowledge and Information Systems, Volume 27, Issue 3, pp 345–365

Short text clustering by finding core terms

  • Xingliang Ni
  • Xiaojun Quan
  • Zhi Lu
  • Liu Wenyin (email author)
  • Bei Hua
Regular Paper

Abstract

A new clustering strategy, TermCut, is presented to cluster short text snippets by finding core terms in the corpus. We model the collection of short text snippets as a graph in which each vertex represents a short text snippet and each weighted edge measures the relationship between the two vertices it connects. TermCut is then applied to recursively select a core term and bisect the graph such that the snippets in one part contain the term, whereas those in the other part do not. We apply the proposed method to different types of short text snippets, including questions and search results. Experimental results show that the proposed method outperforms state-of-the-art clustering algorithms for clustering short text snippets.
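
The recursive bisection described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how a core term is scored, so everything specific here is an assumption made for illustration: snippets are represented as sets of terms, edge weights are Jaccard similarities, and the hypothetical `cut_score` function (cross-part similarity divided by within-part similarity) stands in for the paper's actual criterion. The names `term_bisect`, `cut_score`, and `max_score` are likewise illustrative and not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Edge weight between two snippets, each given as a set of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cut_score(snippets, term):
    """Assumed criterion (not the paper's): similarity across the cut
    divided by similarity inside the two parts. Lower is better.
    Returns None when the term does not split the snippets."""
    with_term = [s for s in snippets if term in s]
    without = [s for s in snippets if term not in s]
    if not with_term or not without:
        return None
    cross = sum(jaccard(a, b) for a in with_term for b in without)
    within = sum(jaccard(a, b) for a, b in combinations(with_term, 2))
    within += sum(jaccard(a, b) for a, b in combinations(without, 2))
    return cross / (within + 1e-9)  # small constant avoids division by zero


def term_bisect(snippets, max_score=0.5):
    """Recursively pick the best-scoring term and split on it.
    Stops when no term gives a cut with score <= max_score (assumed rule)."""
    if len(snippets) < 2:
        return [snippets]
    vocabulary = sorted(set().union(*snippets))
    scored = [(cut_score(snippets, t), t) for t in vocabulary]
    scored = [(score, t) for score, t in scored if score is not None]
    if not scored:
        return [snippets]
    best_score, core_term = min(scored)
    if best_score > max_score:
        return [snippets]
    with_term = [s for s in snippets if core_term in s]
    without = [s for s in snippets if core_term not in s]
    return term_bisect(with_term, max_score) + term_bisect(without, max_score)


if __name__ == "__main__":
    docs = [
        {"python", "list", "sort"},
        {"python", "dict", "sort"},
        {"java", "class", "error"},
        {"java", "thread", "error"},
    ]
    for cluster in term_bisect(docs):
        print(cluster)
```

On this toy input the two Python-related snippets and the two Java-related snippets end up in separate clusters, because a term such as "error" splits the collection with no similarity across the cut. The stopping threshold `max_score` is a design choice of this sketch; a real implementation would use the criterion and stopping rule defined in the paper.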

Keywords

Clustering, Short text clustering, TermCut

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • Xingliang Ni (1, 2, 3)
  • Xiaojun Quan (2)
  • Zhi Lu (2)
  • Liu Wenyin (1, 2, 3), email author
  • Bei Hua (1, 3)

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
  2. Department of Computer Science, City University of Hong Kong, HKSAR, China
  3. Joint Research Lab of Excellence, CityU-USTC Advanced Research Institute, Suzhou, China
