Original Paper

International Journal on Document Analysis and Recognition (IJDAR), Volume 12, Issue 3, pp 185-203

Open Access This content is freely available online to anyone, anywhere at any time.

An effective coherence measure to determine topical consistency in user-generated content

  • Jiyin He, ISLA, University of Amsterdam (corresponding author)
  • Wouter Weerkamp, ISLA, University of Amsterdam
  • Martha Larson, EEMCS, Delft University of Technology
  • Maarten de Rijke, ISLA, University of Amsterdam


When searching for blogs on a specific topic, information seekers prefer blogs that place a central focus on that topic over blogs whose mention of the topic is diffuse or incidental. In order to present users with better blog feed search results, we developed a measure of topical consistency that is able to capture whether or not a blog is topically focused. The measure, called the coherence score, is inspired by the genetics literature and captures the tightness of the clustering structure of a data set relative to a background collection. In a set of experiments on synthetic data, the coherence score is shown to provide a faithful reflection of topic clustering structure. The properties that make the coherence score more appropriate than lexical cohesion, a common measure of topical structure, are discussed. Retrieval experiments show that integrating the coherence score as a prior in a language modeling-based approach to blog feed search improves retrieval effectiveness. The coherence score must, however, be used judiciously in order to avoid boosting the ranking of irrelevant but topically focused blogs. To this end, we experiment with a series of weighting schemes that adjust the contribution of the coherence score according to the relevance of a blog to the user query. An appropriate weighting scheme is able to improve retrieval performance. Finally, we show that the coherence score can be reliably estimated with a sample exceeding 20 posts in size. Consistent with this finding, experiments show that the best retrieval performance is achieved if coherence scores are used when a blog contains more than 20 posts.


Keywords: User-generated content; Topical structure; Information retrieval; Blog search; Coherence
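
The abstract describes the coherence score only at a high level, so the following is a minimal sketch of the idea rather than the paper's definition. It assumes that coherence can be approximated as the fraction of post pairs in a blog whose cosine similarity reaches a threshold `tau`, and that `tau` has been estimated separately from a background collection. The function names, the whitespace tokenization, and the `weight` and `eps` parameters are all illustrative assumptions. The last function shows one way such a score might enter a language-modeling ranker as a log prior.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coherence_score(posts, tau):
    """Fraction of post pairs whose similarity is at least tau.

    Assumption: tau is a threshold derived from a background
    collection; here it is simply passed in by the caller.
    """
    vectors = [Counter(p.lower().split()) for p in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # a single post has no pairwise structure
    similar = sum(1 for a, b in pairs if cosine(a, b) >= tau)
    return similar / len(pairs)


def rank_score(query_log_likelihood, coherence, weight=1.0, eps=1e-6):
    """Combine log P(query | blog) with a coherence-based log prior.

    Assumptions: `weight` is a hypothetical knob for the prior's
    contribution, and `eps` avoids log(0) for incoherent blogs.
    """
    return query_log_likelihood + weight * math.log(coherence + eps)
```

With equal query likelihoods, a blog whose posts mostly resemble one another receives a higher `rank_score` than one whose posts share no vocabulary. Setting `weight` to 0 disables the prior, which loosely mirrors the abstract's point that the contribution of coherence should be adjusted rather than applied uniformly.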