An effective coherence measure to determine topical consistency in user-generated content

  • Jiyin He (Email author)
  • Wouter Weerkamp
  • Martha Larson
  • Maarten de Rijke
Open Access
Original Paper


When searching for blogs on a specific topic, information seekers prefer blogs that place a central focus on that topic over blogs whose mention of the topic is diffuse or incidental. In order to present users with better blog feed search results, we developed a measure of topical consistency that is able to capture whether or not a blog is topically focused. The measure, called the coherence score, is inspired by the genetics literature and captures the tightness of the clustering structure of a data set relative to a background collection. In a set of experiments on synthetic data, the coherence score is shown to provide a faithful reflection of topic clustering structure. The properties that make the coherence score more appropriate than lexical cohesion, a common measure of topical structure, are discussed. Retrieval experiments show that integrating the coherence score as a prior in a language modeling-based approach to blog feed search improves retrieval effectiveness. The coherence score must, however, be used judiciously in order to avoid boosting the ranking of irrelevant but topically focused blogs. To this end, we experiment with a series of weighting schemes that adjust the contribution of the coherence score according to the relevance of a blog to the user query. An appropriate weighting scheme is able to improve retrieval performance. Finally, we show that the coherence score can be reliably estimated with a sample exceeding 20 posts in size. Consistent with this finding, experiments show that the best retrieval performance is achieved if coherence scores are used when a blog contains more than 20 posts.
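As a rough illustration of what "tightness of the clustering structure relative to a background collection" could mean in practice, the sketch below scores a blog by the fraction of its post pairs that are more similar than a threshold taken from a background collection. This is one plausible reading of the abstract, not necessarily the authors' exact formulation: the bag-of-words cosine similarity, the quantile-based threshold, and all function and parameter names are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarities(docs):
    """Similarities of all unordered document pairs (whitespace tokenization)."""
    vecs = [Counter(d.lower().split()) for d in docs]
    return [cosine(x, y) for x, y in combinations(vecs, 2)]


def background_threshold(background_docs, quantile=0.95):
    """Assumed thresholding rule: the similarity value at the given quantile
    of pairwise similarities in the background collection."""
    sims = sorted(pairwise_similarities(background_docs))
    if not sims:
        raise ValueError("need at least two background documents")
    idx = min(len(sims) - 1, int(quantile * len(sims)))
    return sims[idx]


def coherence_score(posts, threshold):
    """Fraction of post pairs whose similarity exceeds the threshold."""
    sims = pairwise_similarities(posts)
    if not sims:
        return 0.0
    return sum(s > threshold for s in sims) / len(sims)


# Toy data: a topically focused blog and a diffuse one.
focused = ["jazz guitar chords", "jazz guitar scales", "jazz guitar solos"]
diffuse = ["jazz guitar chords", "tax return forms", "apple pie recipe"]

focused_score = coherence_score(focused, threshold=0.5)
diffuse_score = coherence_score(diffuse, threshold=0.5)
```

Under this reading, a blog whose posts mostly resemble one another scores close to 1, while a blog whose posts are no more alike than random background pairs scores close to 0. In a real setting the threshold would come from `background_threshold` applied to a large sample of the collection rather than a hand-picked value.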


User-generated content, Topical structure, Information retrieval, Blog search, Coherence



We are very grateful to our anonymous reviewers for providing helpful feedback and suggestions. This research was supported by the DuOMAn project carried out within the STEVIN programme, which is funded by the Dutch and Flemish Governments, under project number STE-09-12, and by the Netherlands Organisation for Scientific Research (NWO) under project numbers 017.001.190, 640.001.501, 640.002.501, 612.066.512, 612.061.814, 612.061.815, 640.004.802.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.



Copyright information

© The Author(s) 2009

Authors and Affiliations

  • Jiyin He (1), Email author
  • Wouter Weerkamp (1)
  • Martha Larson (2)
  • Maarten de Rijke (1)
  1. ISLA, University of Amsterdam, Amsterdam, The Netherlands
  2. EEMCS, Delft University of Technology, Delft, The Netherlands
