
Short text similarity based on probabilistic topics

  • Regular Paper
Knowledge and Information Systems


In this paper, we propose a new method for measuring the similarity between two short text snippets by comparing each of them with a set of probabilistic topics. Specifically, our method first finds the distinguishing terms between the two short text snippets and compares them with a series of probabilistic topics extracted by a Gibbs sampling algorithm. The relationship between the distinguishing terms of the short text snippets can be discovered by examining their probabilities under each topic. The similarity between the two short text snippets is then calculated based on their common terms and the relationship between their distinguishing terms. Extensive experiments on paraphrasing and question categorization show that the proposed method can calculate the similarity of short text snippets more accurately than other methods, including the pure TF-IDF measure.
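The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a tiny hand-written table of topic-word probabilities (in the paper these would be estimated by Gibbs sampling over a corpus), uses cosine similarity over per-topic probabilities as a stand-in for the "relationship" between distinguishing terms, and combines common and distinguishing terms with a hypothetical normalization. The names `TOPICS`, `relatedness`, and `similarity` are illustrative assumptions.

```python
import math

# Hypothetical toy values for P(word | topic); a real system would
# estimate these from a corpus with a topic model (e.g. Gibbs sampling).
TOPICS = [
    {"car": 0.30, "automobile": 0.25, "engine": 0.20, "fast": 0.05},
    {"banana": 0.35, "fruit": 0.30, "sweet": 0.15, "fast": 0.01},
]


def topic_vector(term, topics):
    """Probability of the term under each topic (0.0 if unseen)."""
    return [topic.get(term, 0.0) for topic in topics]


def relatedness(u, v, topics):
    """Cosine similarity of two terms' per-topic probabilities (assumed measure)."""
    a, b = topic_vector(u, topics), topic_vector(v, topics)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to 1.0 to absorb floating-point rounding.
    return min(1.0, dot / norm) if norm else 0.0


def best_match_total(terms, others, topics):
    """Sum, over each term, of its strongest relatedness to any term in others."""
    return sum(
        max((relatedness(t, o, topics) for o in others), default=0.0)
        for t in terms
    )


def similarity(text_a, text_b, topics=TOPICS):
    """Score in [0, 1] from common terms plus related distinguishing terms."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a or not b:
        return 0.0
    common = a & b                 # terms shared by both snippets
    only_a, only_b = a - b, b - a  # distinguishing terms of each snippet
    soft = (best_match_total(only_a, only_b, topics)
            + best_match_total(only_b, only_a, topics))
    # Hypothetical combination: exact matches count fully,
    # distinguishing terms count by their topic relatedness.
    return (2 * len(common) + soft) / (len(a) + len(b))
```

Under these toy probabilities, `similarity("fast car", "fast automobile")` scores higher than `similarity("fast car", "fast banana")`, because "car" and "automobile" concentrate their probability in the same topic while "car" and "banana" do not. A pure term-overlap measure would rate both pairs the same.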




Author information


Corresponding author

Correspondence to Liu Wenyin.


About this article

Cite this article

Quan, X., Liu, G., Lu, Z. et al. Short text similarity based on probabilistic topics. Knowl Inf Syst 25, 473–491 (2010).
