International Journal on Document Analysis and Recognition (IJDAR)

Volume 12, Issue 3, pp 185–203

An effective coherence measure to determine topical consistency in user-generated content


    • ISLA, University of Amsterdam
  • Wouter Weerkamp
    • ISLA, University of Amsterdam
  • Martha Larson
    • EEMCS, Delft University of Technology
  • Maarten de Rijke
    • ISLA, University of Amsterdam
Open Access, Original Paper

DOI: 10.1007/s10032-009-0089-5

Cite this article as:
He, J., Weerkamp, W., Larson, M. et al. IJDAR (2009) 12: 185. doi:10.1007/s10032-009-0089-5


When searching for blogs on a specific topic, information seekers prefer blogs that place a central focus on that topic over blogs whose mention of the topic is diffuse or incidental. In order to present users with better blog feed search results, we developed a measure of topical consistency that is able to capture whether or not a blog is topically focused. The measure, called the coherence score, is inspired by the genetics literature and captures the tightness of the clustering structure of a data set relative to a background collection. In a set of experiments on synthetic data, the coherence score is shown to provide a faithful reflection of topic clustering structure. The properties that make the coherence score more appropriate than lexical cohesion, a common measure of topical structure, are discussed. Retrieval experiments show that integrating the coherence score as a prior in a language modeling-based approach to blog feed search improves retrieval effectiveness. The coherence score must, however, be used judiciously in order to avoid boosting the ranking of irrelevant but topically focused blogs. To this end, we experiment with a series of weighting schemes that adjust the contribution of the coherence score according to the relevance of a blog to the user query. An appropriate weighting scheme is able to improve retrieval performance. Finally, we show that the coherence score can be reliably estimated with a sample exceeding 20 posts in size. Consistent with this finding, experiments show that the best retrieval performance is achieved if coherence scores are used when a blog contains more than 20 posts.
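The abstract does not give the formula for the coherence score, so the following is only a minimal sketch under stated assumptions: that the score is the proportion of post pairs in a blog whose similarity reaches a threshold, and that the threshold would be estimated from a background collection. Here the threshold is simply passed in as a number. The function names `cosine` and `coherence_score`, the whitespace tokenization, and the use of cosine similarity over raw term counts are illustrative choices, not details taken from the article.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def coherence_score(posts: list[str], threshold: float) -> float:
    """Fraction of post pairs whose similarity is at least `threshold`.

    Assumption: `threshold` stands in for a value derived from a
    background collection; this sketch does not estimate it.
    """
    vectors = [Counter(p.lower().split()) for p in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two posts: no pairs to compare
    return sum(cosine(a, b) >= threshold for a, b in pairs) / len(pairs)


focused = ["jazz piano trio", "jazz piano solo", "jazz piano album"]
diffuse = ["jazz piano trio", "tax law reform", "garden soil mix"]
print(coherence_score(focused, 0.5))
print(coherence_score(diffuse, 0.5))
```

Under these assumptions, a blog whose posts share vocabulary scores higher than one whose posts are unrelated, which is the behaviour the abstract attributes to the measure.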


Keywords: User-generated content, Topical structure, Information retrieval, Blog search, Coherence

Copyright information

© The Author(s) 2009