An effective coherence measure to determine topical consistency in user-generated content

He, Jiyin; Weerkamp, Wouter; Larson, Martha; de Rijke, Maarten

doi:10.1007/s10032-009-0089-5

Original Paper
Open access
Published: 13 August 2009

Volume 12, pages 185–203 (2009)

International Journal on Document Analysis and Recognition (IJDAR)

Abstract

When searching for blogs on a specific topic, information seekers prefer blogs that place a central focus on that topic over blogs whose mention of the topic is diffuse or incidental. In order to present users with better blog feed search results, we developed a measure of topical consistency that is able to capture whether or not a blog is topically focused. The measure, called the coherence score, is inspired by the genetics literature and captures the tightness of the clustering structure of a data set relative to a background collection. In a set of experiments on synthetic data, the coherence score is shown to provide a faithful reflection of topic clustering structure. The properties that make the coherence score more appropriate than lexical cohesion, a common measure of topical structure, are discussed. Retrieval experiments show that integrating the coherence score as a prior in a language modeling-based approach to blog feed search improves retrieval effectiveness. The coherence score must, however, be used judiciously in order to avoid boosting the ranking of irrelevant but topically focused blogs. To this end, we experiment with a series of weighting schemes that adjust the contribution of the coherence score according to the relevance of a blog to the user query. An appropriate weighting scheme is able to improve retrieval performance. Finally, we show that the coherence score can be reliably estimated with a sample exceeding 20 posts in size. Consistent with this finding, experiments show that the best retrieval performance is achieved if coherence scores are used when a blog contains more than 20 posts.
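The abstract does not give the formula for the coherence score, so the following is only a minimal sketch under stated assumptions. It takes coherence to be the fraction of post pairs whose cosine similarity reaches a threshold `tau`, picks `tau` as a high quantile of pairwise similarities in a background sample, and adds the log of the score as a prior to a query log-likelihood. The function names, the whitespace tokenization, the 0.95 quantile, the smoothing constant, and the decision to skip the prior for blogs with 20 or fewer posts are all illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarities(docs):
    """Similarities of all unordered document pairs (whitespace tokens: an assumption)."""
    vecs = [Counter(d.lower().split()) for d in docs]
    return [cosine(x, y) for x, y in combinations(vecs, 2)]


def background_threshold(background_docs, quantile=0.95):
    """Assumed threshold: a high quantile of background pairwise similarities."""
    sims = sorted(pairwise_similarities(background_docs))
    if not sims:
        raise ValueError("need at least two background documents")
    return sims[min(len(sims) - 1, int(quantile * len(sims)))]


def coherence_score(posts, tau):
    """Fraction of post pairs at least as similar as tau (assumed formulation)."""
    sims = pairwise_similarities(posts)
    if not sims:
        return 0.0
    return sum(s >= tau for s in sims) / len(sims)


def score_with_prior(query_log_likelihood, coherence, n_posts,
                     min_posts=20, epsilon=1e-6):
    """Add the log coherence prior only for blogs with more than min_posts posts."""
    if n_posts <= min_posts:
        return query_log_likelihood  # too few posts to estimate coherence reliably
    return query_log_likelihood + math.log(coherence + epsilon)
```

With a threshold of 0.5, three posts that share most of their vocabulary score 1.0 and three posts on unrelated subjects score 0.0. At equal query likelihood, the focused blog therefore ranks above the diffuse one once both exceed the 20-post cutoff.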

References

Allen, R.B., Obry, P., Littman, M.: An interface for navigating clustered document sets returned by queries. In: COCS’93, pp. 166–171. ACM, New York (1993)
Amitay, E., Carmel, D., Darlow, A., Herscovici, M., Lempel, R., Soffer, A., Kraft, R., Zien, J.: Juru at TREC 2003—topic distillation using query-sensitive tuning and cohesiveness filtering. In: TREC’03 Working Notes (2003)
Balog, K., Azzopardi, L., de Rijke, M.: Formal models for expert finding in enterprise corpora. In: SIGIR’06, pp. 43–50. ACM Press, New York (2006)
Balog, K., de Rijke, M., Weerkamp, W.: Bloggers as experts. In: SIGIR’08, pp. 753–754. ACM (2008)
Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. In: Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS’97), ACL (1997)
Bock, H.-H.: On some significance tests in cluster analysis. J. Classif. 2(1), 77–108 (1985)
Cronen-Townsend, S., Croft, W.B.: Quantifying query ambiguity. In: HLT’02, pp. 94–98. (2002)
Cronen-Townsend, S., Zhou, Y., Croft, W.B.: Predicting query performance. In: SIGIR’02, pp. 299–306. ACM (2002)
Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: a cluster-based approach to browsing large document collections. In: SIGIR ’92, pp. 318–329, ACM, NY (1992)
Elsas, J., Arguello, J., Callan, J., Carbonell, J.: Retrieval and feedback models for blog distillation. In: TREC’07 Working Notes (2007)
Ernsting, B.J., Weerkamp, W., de Rijke, M.: The University of Amsterdam at the TREC 2007 Blog Track. In: TREC’07 Working Notes (2007)
Everitt, B.S.: Unresolved problems in cluster analysis. Biometrics 35(1), 169–181 (1979)
Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database (Language, Speech, and Communication). MIT Press, Cambridge (1998)
Fujimura, K., Toda, H., Inoue, T., Hiroshima, N., Kataoka, R., Sugizaki, M.: Blogranger—a multi-faceted blog search engine. In: WWW’06 (2006)
Halliday, M.A.K., Hasan, R.: Cohesion in English (English Language). Longman, London (1976)
Hartigan, J.A.: Statistical theory in clustering. J. Classif. 2(1), 63–76 (1985)
He, J., Larson, M., de Rijke, M.: On the topical structure of the relevance feedback set. In: FGIR Workshop Information Retrieval (WIR 2008), Würzburg, Germany (2008a)
He, J., Larson, M., de Rijke, M.: Using coherence-based measures to predict query difficulty. In: ECIR’08, pp. 689–694 (2008b)
He, J., Weerkamp, W., Larson, M., de Rijke, M.: Blogger, stick to your story: modeling topical noise in blogs with coherence measures. In: AND ’08: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, pp. 39–46. ACM (2008c)
Lavrenko, V., Croft, W.B.: Relevance-based language models. In: SIGIR ’01, pp. 120–127, ACM Press, New York (2001)
Lee, W.-L., Lommatzsch, A.: Feed distillation using adaboost and topic maps. In: TREC ’07 Working Notes (2007)
Macdonald, C., Ounis, I.: The TREC Blogs06 collection: creating and analyzing a blog test collection. Technical Report TR-2006-224, Department of Computer Science, University of Glasgow (2006)
Macdonald, C., Ounis, I.: Key blog distillation: ranking aggregates. In: CIKM ’08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 1043–1052, ACM, New York (2008)
Macdonald, C., Ounis, I., Soboroff, I.: Overview of the TREC 2007 Blog Track. In: TREC ’07 Working Notes, pp. 31–43 (2007)
MacKay, D.J.C., Peto, L.: A hierarchical Dirichlet language model. Nat. Lang. Eng. 1(3), 1–19 (1994)
Mishne, G.: Applied text analytics for blogs. PhD thesis, University of Amsterdam (2007)
Mishne, G., de Rijke, M.: A study of blog search. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR’06, vol. 3936, pp. 289–301 (2006)
Morris, J., Hirst, G.: Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Comput. Linguist. 17(1), 21–48 (1991)
Ounis, I., Macdonald, C., de Rijke, M., Mishne, G., Soboroff, I.: Overview of the TREC 2006 Blog Track. In: TREC ’06 Working Notes. NIST (2007)
Pilpel, Y., Sudarsanam, P., Church, G.M.: Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet. 29, 153–159 (2001)
Robertson, S., Callan, J.: Routing and filtering. In: TREC ’05, pp. 99–122. MIT (2005)
Seki, K., Kino, Y., Sato, S.: TREC 2007 Blog Track experiments at Kobe University. In: TREC ’07 Working Notes (2007)
Seo, J., Croft, W.B.: UMass at TREC 2007 Blog distillation task. In: TREC ’07 Working Notes (2007)
Stokes, N., Newman, E., Carthy, J., Smeaton, A.F.: Broadcast news gisting using lexical cohesion analysis. In: Proceedings of the 26th BCS-IRSG European Conference on Information Retrieval (ECIR-04), pp. 209–222. Springer (2004)
Weeds, J., Weir, D., McCarthy, D.: Characterising measures of lexical distributional similarity. In: Proceedings of CoLing 2004, pp. 1015–1021 (2004)
Weerkamp, W., Balog, K., de Rijke, M.: Finding key bloggers, one post at a time. In: ECAI’08 (2008)
Weerkamp, W., Balog, K., de Rijke, M.: A generative blog post retrieval model that uses query expansion based on external collections. In: Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009) (2009)
Weerkamp, W., de Rijke, M.: Credibility improves topical blog post retrieval. In: Proceedings of ACL-08: HLT, pp. 923–931. Association for Computational Linguistics, Columbus (2008)
Willett, P.: Recent trends in hierarchic document clustering: a critical review. Inf. Process. Manag. 24(5), 577–597 (1988)
Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22(2), 179–214 (2004)

Acknowledgments

We are very grateful to our anonymous reviewers for providing helpful feedback and suggestions. This research was supported by the DuOMAn project carried out within the STEVIN programme which is funded by the Dutch and Flemish Governments (http://www.stevin-tst.org) under project number STE-09-12, and by the Netherlands Organisation for Scientific Research (NWO) under project numbers 017.001.190, 640.001.501, 640.002.501, 612.066.512, 612.061.814, 612.061.815, 640.004.802.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Author information

Authors and Affiliations

ISLA, University of Amsterdam, Science Park 107, 1098GX, Amsterdam, The Netherlands
Jiyin He, Wouter Weerkamp & Maarten de Rijke
EEMCS, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
Martha Larson

Corresponding author

Correspondence to Jiyin He.

Additional information

This paper is a revised and extended version of [19], the 19th entry in the reference list (He et al. 2008c).

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

He, J., Weerkamp, W., Larson, M. et al. An effective coherence measure to determine topical consistency in user-generated content. IJDAR 12, 185–203 (2009). https://doi.org/10.1007/s10032-009-0089-5

Received: 20 November 2008
Revised: 19 May 2009
Accepted: 09 July 2009
Published: 13 August 2009
Issue Date: September 2009
DOI: https://doi.org/10.1007/s10032-009-0089-5
