Abstract
When searching for blogs on a specific topic, information seekers prefer blogs that place a central focus on that topic over blogs whose mention of the topic is diffuse or incidental. To present users with better blog feed search results, we developed a measure of topical consistency that captures whether a blog is topically focused. The measure, called the coherence score, is inspired by the genetics literature and captures the tightness of the clustering structure of a data set relative to a background collection. In a set of experiments on synthetic data, the coherence score is shown to faithfully reflect topic clustering structure. We discuss the properties that make the coherence score more appropriate than lexical cohesion, a common measure of topical structure. Retrieval experiments show that integrating the coherence score as a prior in a language modeling-based approach to blog feed search improves retrieval effectiveness. The coherence score must, however, be used judiciously to avoid boosting the ranking of irrelevant but topically focused blogs. To this end, we experiment with a series of weighting schemes that adjust the contribution of the coherence score according to the relevance of a blog to the user query, and find that an appropriate weighting scheme improves retrieval performance. Finally, we show that the coherence score can be reliably estimated from a sample of more than 20 posts. Consistent with this finding, experiments show that the best retrieval performance is achieved when coherence scores are applied only to blogs that contain more than 20 posts.
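The abstract gives no formula for the coherence score, so the following is only a minimal sketch of one plausible reading of "tightness of the clustering structure relative to a background collection": the fraction of post pairs in a blog whose similarity exceeds a threshold estimated from random pairs in a background collection. The bag-of-words representation, the cosine similarity, the 5% cut-off, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def background_threshold(background, top_fraction=0.05, n_pairs=2000, seed=0):
    """Similarity exceeded by roughly `top_fraction` of random background pairs.

    `background` is a list of Counters; the 5% default is an assumed value.
    """
    rng = random.Random(seed)
    sims = sorted(cosine(*rng.sample(background, 2)) for _ in range(n_pairs))
    index = min(len(sims) - 1, int((1.0 - top_fraction) * len(sims)))
    return sims[index]


def coherence_score(posts, threshold):
    """Fraction of post pairs whose similarity exceeds the threshold.

    Returns 0.0 when there are fewer than two posts (no pairs to compare).
    """
    pairs = list(combinations(posts, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) > threshold for a, b in pairs) / len(pairs)
```

Under this reading, a blog whose posts all share vocabulary scores close to 1 and a blog whose posts are mutually unrelated scores close to 0. Because the number of pairs grows quadratically with the number of posts, the score for a blog with only a handful of posts rests on very few comparisons, which is in line with the abstract's remark that a sample of more than 20 posts is needed for a reliable estimate.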
References
Allen, R.B., Obry, P., Littman, M.: An interface for navigating clustered document sets returned by queries. In: COCS’93, pp. 166–171. ACM, New York (1993)
Amitay, E., Carmel, D., Darlow, A., Herscovici, M., Lempel, R., Soffer, A., Kraft, R., Zien, J.: Juru at TREC 2003 - topic distillation using query-sensitive tuning and cohesiveness filtering. In: TREC’03 Working Notes (2003)
Balog, K., Azzopardi, L., de Rijke, M.: Formal models for expert finding in enterprise corpora. In: SIGIR’06, pp. 43–50. ACM Press, New York (2006)
Balog, K., de Rijke, M., Weerkamp, W.: Bloggers as experts. In: SIGIR’08, pp. 753–754. ACM (2008)
Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. In: Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS’97), ACL (1997)
Bock, H.-H.: On some significance tests in cluster analysis. J. Classif. 2(1), 77–108 (1985)
Cronen-Townsend, S., Croft, W.B.: Quantifying query ambiguity. In: HLT’02, pp. 94–98 (2002)
Cronen-Townsend, S., Zhou, Y., Croft, W.B.: Predicting query performance. In: SIGIR’02, pp. 299–306. ACM (2002)
Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: a cluster-based approach to browsing large document collections. In: SIGIR ’92, pp. 318–329. ACM, New York (1992)
Elsas, J., Arguello, J., Callan, J., Carbonell, J.: Retrieval and feedback models for blog distillation. In: TREC’07 Working Notes (2007)
Ernsting, B.J., Weerkamp, W., de Rijke, M.: The University of Amsterdam at the TREC 2007 Blog Track. In: TREC’07 Working Notes (2007)
Everitt, B.S.: Unresolved problems in cluster analysis. Biometrics 35(1), 169–181 (1979)
Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database (Language, Speech, and Communication). MIT Press, Cambridge (1998)
Fujimura, K., Toda, H., Inoue, T., Hiroshima, N., Kataoka, R., Sugizaki, M.: Blogranger—a multi-faceted blog search engine. In: WWW’06 (2006)
Halliday, M.A.K., Hasan, R.: Cohesion in English (English Language). Longman, London (1976)
Hartigan, J.A.: Statistical theory in clustering. J. Classif. 2(1), 63–76 (1985)
He, J., Larson, M., de Rijke, M.: On the topical structure of the relevance feedback set. In: FGIR Workshop Information Retrieval (WIR 2008), Würzburg, Germany (2008a)
He, J., Larson, M., de Rijke, M.: Using coherence-based measures to predict query difficulty. In: ECIR’08, pp. 689–694 (2008b)
He, J., Weerkamp, W., Larson, M., de Rijke, M.: Blogger, stick to your story: modeling topical noise in blogs with coherence measures. In: AND ’08: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, pp. 39–46. ACM (2008c)
Lavrenko, V., Croft, W.B.: Relevance-based language models. In: SIGIR ’01, pp. 120–127. ACM Press, New York (2001)
Lee, W.-L., Lommatzsch, A.: Feed distillation using adaboost and topic maps. In: TREC ’07 Working Notes (2007)
Macdonald, C., Ounis, I.: The TREC Blogs06 collection: creating and analyzing a blog test collection. Technical Report TR-2006-224, Department of Computer Science, University of Glasgow (2006)
Macdonald, C., Ounis, I.: Key blog distillation: ranking aggregates. In: CIKM ’08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 1043–1052. ACM, New York (2008)
Macdonald, C., Ounis, I., Soboroff, I.: Overview of the TREC 2007 Blog Track. In: TREC ’07 Working Notes, pp. 31–43 (2007)
MacKay, D.J.C., Peto, L.: A hierarchical Dirichlet language model. Nat. Lang. Eng. 1(3), 1–19 (1994)
Mishne, G.: Applied text analytics for blogs. PhD thesis, University of Amsterdam (2007)
Mishne, G., de Rijke, M.: A study of blog search. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR’06, vol. 3936, pp. 289–301 (2006)
Morris, J., Hirst, G.: Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Comput. Linguist. 17(1), 21–48 (1991)
Ounis, I., Macdonald, C., de Rijke, M., Mishne, G., Soboroff, I.: Overview of the TREC 2006 Blog Track. In: TREC ’06 Working Notes. NIST (2007)
Pilpel, Y., Sudarsanam, P., Church, G.M.: Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet. 29, 153–159 (2001)
Robertson, S., Callan, J.: Routing and filtering. In: TREC ’05, pp. 99–122. MIT (2005)
Seki, K., Kino, Y., Sato, S.: TREC 2007 Blog Track experiments at Kobe University. In: TREC ’07 Working Notes (2007)
Seo, J., Croft, W.B.: UMass at TREC 2007 Blog distillation task. In: TREC ’07 Working Notes (2007)
Stokes, N., Newman, E., Carthy, J., Smeaton, A.F.: Broadcast news gisting using lexical cohesion analysis. In: Proceedings of the 26th BCS-IRSG European Conference on Information Retrieval (ECIR-04), pp. 209–222. Springer (2004)
Weeds, J., Weir, D., McCarthy, D.: Characterising measures of lexical distributional similarity. In: Proceedings of CoLing 2004, pp. 1015–1021 (2004)
Weerkamp, W., Balog, K., de Rijke, M.: Finding key bloggers, one post at a time. In: ECAI’08 (2008)
Weerkamp, W., Balog, K., de Rijke, M.: A generative blog post retrieval model that uses query expansion based on external collections. In: Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009) (2009)
Weerkamp, W., de Rijke, M.: Credibility improves topical blog post retrieval. In: Proceedings of ACL-08: HLT, pp. 923–931. Association for Computational Linguistics, Columbus (2008)
Willett, P.: Recent trends in hierarchic document clustering: a critical review. Inf. Process. Manag. 24(5), 577–597 (1988)
Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22(2), 179–214 (2004)
Acknowledgments
We are very grateful to our anonymous reviewers for providing helpful feedback and suggestions. This research was supported by the DuOMAn project carried out within the STEVIN programme which is funded by the Dutch and Flemish Governments (http://www.stevin-tst.org) under project number STE-09-12, and by the Netherlands Organisation for Scientific Research (NWO) under project numbers 017.001.190, 640.001.501, 640.002.501, 612.066.512, 612.061.814, 612.061.815, 640.004.802.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Additional information
This paper is a revised and extended version of [19] (He et al. 2008c in the reference list above).
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
He, J., Weerkamp, W., Larson, M. et al. An effective coherence measure to determine topical consistency in user-generated content. IJDAR 12, 185–203 (2009). https://doi.org/10.1007/s10032-009-0089-5
DOI: https://doi.org/10.1007/s10032-009-0089-5