Blog feed search with a post index

Weerkamp, Wouter; Balog, Krisztian; de Rijke, Maarten

doi:10.1007/s10791-011-9165-9


Open access
Published: 15 March 2011

Volume 14, pages 515–545, (2011)

Information Retrieval


Abstract

User generated content forms an important domain for mining knowledge. In this paper, we address the task of blog feed search: to find blogs that are principally devoted to a given topic, as opposed to blogs that merely happen to mention the topic in passing. The large number of blogs makes the blogosphere a challenging domain, both in terms of effectiveness and of storage and retrieval efficiency. We examine the effectiveness of an approach to blog feed search that is based on individual posts as indexing units (instead of full blogs). Working in the setting of a probabilistic language modeling approach to information retrieval, we model the blog feed search task by aggregating over a blogger’s posts to collect evidence of relevance to the topic and persistence of interest in the topic. This approach achieves state-of-the-art performance in terms of effectiveness. We then introduce a two-stage model where a pre-selection of candidate blogs is followed by a ranking step. The model integrates aggressive pruning techniques as well as very lean representations of the contents of blog posts, resulting in substantial gains in efficiency while maintaining effectiveness at a very competitive level.


1 Introduction

We increasingly live our lives online: we keep in touch with friends on Facebook, ^{Footnote 1} expand our network using LinkedIn, ^{Footnote 2} quickly post messages on Twitter, ^{Footnote 3} comment on news events on online newspaper sites, help others on forums, mailing lists, or community question-answer sites, and report on experiences or give our opinions in blogs. All of these activities involve the creation of content by the end users of these platforms, as opposed to editors or webmasters. This content, i.e., user generated content, is particularly valuable as it offers insight into what people do, think, need to know, or care about. Organizations look for ways of mining the information that is available in these user generated sources, and to do so, tools and techniques need to be developed that are capable of handling this type of content.

In this paper we focus on blogs. A blog is the unedited, unregulated voice of an individual (Mishne 2007), as published on a web page containing time-stamped entries (blog posts) in reverse chronological order (i.e., last entry displayed first). In most cases, bloggers (the authors of blog entries) offer readers the possibility to reply to entries in the blog (commenting), bloggers link to other blogs (blogroll), thereby creating a network of blogs, and many blogs are updated regularly. Blogs offer a unique insight into people’s minds: whether the blog is about their personal life (which products do people use? what are their needs or wishes?), personal interests (what are their opinions on X?), or a more professional view on topics (can they explain X to me?), getting access to this information is valuable for many others.

Accessing the blogosphere (the collection of all blogs) can be done in various ways, but usually revolves around one of two main tasks: (1) identifying relevant blog posts (blog post retrieval), and (2) identifying relevant blogs (blog feed search). In (1) the goal is to list single blog posts (“utterances”) that talk about a given topic; having constructed this list, one can present it to a user or use it in further downstream processing (e.g., sentiment analysis, opinion extraction, mood detection). In (2) the goal is not to return single posts, but to identify blogs that show a recurring interest in a given topic. Blogs that only mention the topic sporadically or in passing are considered non-relevant, but a blog (or: the person behind the blog) that talks about this topic regularly would be relevant. Again, one can simply return these blogs to an end user as is, but could also decide to use the results in further processing (e.g., recommending blogs to be followed, identifying networks of expert bloggers, detecting topic shifts in blogs). In this paper we specifically look at the second task, identifying relevant blogs given a topic, also known as blog feed search.

The total number of blogs in the world is not known exactly. Technorati, ^{Footnote 4} the largest blog directory, was tracking 112 million blogs in 2008, and counted 175,000 new blogs every day. These bloggers created about 1.6 million entries per day. Most of these blogs are written in English, but the majority of internet users are not English-speaking. The China Internet Network Information Center (CNNIC) ^{Footnote 5} released a news report in December 2007 stating that about 73 million blogs are being maintained in China, which means that, by now, the number of Chinese blogs is probably close to the number of blogs tracked by Technorati. Although we lack exact numbers on the size of the blogosphere, we can be sure that its size is significant, in terms of blogs, bloggers, and blog posts.

Given the size of the blogosphere and the growing interest in the information available in it, we need effective and efficient ways of accessing it. An important first step concerns indexing. When looking for relevant blog posts, it makes sense to do so on top of an index consisting of individual blog posts: the unit of retrieval is the same as the indexing unit, blog posts. When looking for blogs, however, two options present themselves. We could, again, opt for the “unit of retrieval coincides with the unit of indexing” approach; this would probably entail concatenating a blog’s posts into a single pseudo-document and indexing these pseudo-documents. In this paper, we want to pursue an alternative strategy, viz. to drop the assumption that the unit of retrieval and the unit of indexing need to coincide for blog feed search. Instead, we want to use a post-based index (i.e., the indexing unit is a blog post) to support a blog feed search engine (i.e., the unit of retrieval is a blog). This approach has a number of advantages. First, it allows us to support a blog post search engine and a blog feed search engine with a single index. Second, result presentation is easier using blog posts as they represent the natural utterances produced by a blogger. Third, a post index allows for simple incremental indexing and does not require frequent re-computations of pseudo-documents that are meant to represent an entire blog.
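To make the idea of a single post-based index serving both tasks concrete, the following is a minimal sketch in Python. It is not the index used in the paper: the class `PostIndex`, its methods, and the toy identifiers are all assumptions made for illustration. Posts are the indexing unit, and a post-to-blog association is enough to answer blog-level requests.

```python
from collections import defaultdict


class PostIndex:
    """Toy inverted index whose indexing unit is a single blog post."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of posts containing it
        self.blog_of = {}                 # post id -> id of the parent blog

    def add_post(self, post_id, blog_id, tokens):
        # Incremental indexing: a new post touches only its own entries,
        # no blog-level pseudo-document has to be rebuilt.
        self.blog_of[post_id] = blog_id
        for term in set(tokens):
            self.postings[term].add(post_id)

    def search_posts(self, term):
        # Blog post retrieval: the unit of retrieval equals the indexing unit.
        return sorted(self.postings.get(term, ()))

    def search_blogs(self, term):
        # Blog feed search: map matching posts to their parent blogs.
        return sorted({self.blog_of[p] for p in self.postings.get(term, ())})


index = PostIndex()
index.add_post("p1", "blogA", ["wine", "tasting", "notes"])
index.add_post("p2", "blogA", ["wine", "cellar"])
index.add_post("p3", "blogB", ["football", "scores"])
```

Here `index.search_posts("wine")` lists the two matching posts, while `index.search_blogs("wine")` collapses them to their single parent blog.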

We introduce two models, the Blogger model and the Posting model, that are able to rank blogs for a given query based on a post index. Both models use associations between posts and blogs to indicate to which blog their relevance score should contribute. Both models achieve highly competitive retrieval performance (on community-based benchmarks), although the Blogger model consistently outperforms the Posting model in terms of retrieval effectiveness while the Posting model needs to compute substantially fewer associations between posts and blogs and, hence, is more efficient. To improve the efficiency of the Blogger model we integrate our Blogger and Posting models in a single two-stage model which we subject to additional pruning techniques while we maintain (and even increase) effectiveness at a competitive level.

1.1 Research questions and contributions

Our main research question is whether we can effectively and efficiently use a blog post index for the task of blog feed search. The Blogger and Posting models that we introduce are tested on effectiveness using standard IR methodologies. To examine their efficiency, we identify core operations that need to be executed to perform blog feed search using either of those two models.

A second set of research questions is centered around a two-stage model that we introduce to combine the strengths of the Blogger and Posting models. Specifically, we introduce a number of pruning techniques aimed at improving efficiency while maintaining (or even improving) effectiveness. We study the impact of these techniques on retrieval effectiveness as well as the impact of integrating alternative blog post representations (title-only vs. full content) into our two-stage model.

Our main contribution is twofold. First, we show that blog feed search can be supported using a post-based index. Second, we propose an effective two-stage blog feed search model together with several techniques aimed at improving its efficiency.

The remainder of this paper is organized as follows. In Sect. 2 we discuss related work on blog feed search and language modeling. The retrieval models that we use in the paper are discussed in Sect. 3. Our experimental setup is detailed in Sect. 4 and our baseline results are established in Sect. 5. Results on our two-stage model and its refinements are presented in Sect. 6. A discussion (Sect. 7) and conclusion (Sect. 8) complete the paper.

2 Related work

In this paper related work comes in three flavors. We introduce previous research in information access in the blogosphere, we take a look at what has been done more specifically on blog feed search, and we briefly introduce language modeling for information retrieval, as this is the approach underlying our models.

2.1 Information access in the blogosphere

With the growth of the blogosphere comes the need to provide effective access to the knowledge and experience contained in the many tens of millions of blogs out there. Information needs in the blogosphere come in many flavors, addressing many aspects of blogs and thereby extending the notion of relevance, from “being about the same topic” to, for instance, “expressing opinions about the topic” or “sharing an experience around the topic.” In (Mishne and de Rijke 2006), both ad hoc and filtering queries are considered in the context of a blog search engine; the authors argue that blog searches have different intents than typical web searches, suggesting that the primary targets of blog searchers are tracking references to named entities, identifying posts that express a view on a certain concept and searching blogs that show evidence of a long-term interest in a concept.

In 2006, a blog track (Ounis et al. 2007) was launched by TREC, the Text REtrieval Conference, aimed at evaluating information access tasks in the context of the blogosphere. The first edition of the track focused mainly on finding relevant blog posts, i.e., on blog post retrieval, with a special interest in their opinionatedness. The 2007 and 2008 editions of the track featured a blog distillation or blog feed search task. It addresses a search scenario where the user aims to find a blog to follow or read in their RSS reader. This blog should be principally devoted to a given topic over a significant part of the timespan of the feed. Unlike blog post search tasks, the blog feed search task aims to rank blogs (i.e., aggregates of blog posts by the same blogger) instead of permalink documents.

2.2 Blog feed search

Some commercial blog search facilities provide an integrated blog search tool to allow users to easily find new blogs of interest. In (Fujimura et al. 2006), a multi-faceted blog search engine was proposed that allows users to search for blogs and posts. One of the options was to use a blogger filter: the search results (blog posts) are clustered by blog and the user is presented with a list of blogs that contain one or more relevant posts. Ranking of the blogs is done based on the EigenRumor algorithm (Fujimura et al. 2005); in contrast to the methods that we consider below, this algorithm is query-independent.

An important theme to emerge from the work on systems participating in the TREC 2007 blog feed search tasks is the indexing unit used (Macdonald et al. 2008). While the unit of retrieval is fixed for blog feed search (systems have to return blogs in response to a query), it is up to the individual systems to decide whether to produce a ranking based on a blog index or on a post index. The former views blogs as a single document, disregarding the fact that a blog is constructed from multiple posts. The latter takes samples of posts from blogs and combines the relevance scores of these posts into a single blog score. The most effective approaches to feed distillation at TREC 2007 were based on using the (aggregated) text of entire blogs as indexing units. For example, Elsas et al. (2008a, b) experiment with a “large document model” in which entire blogs are the indexing units and a “small document model” in which evidence of relevance of a blog is harvested from individual blog posts. They also experiment with combining the two models, obtaining best performance in terms of MAP (Arguello et al. 2008).

Participants in TREC 2007 and 2008 (Macdonald et al. 2009) explored various techniques for improving effectiveness on the blog feed search task: Query expansion using Wikipedia (Elsas et al. 2008), topic maps (Lee et al. 2008), and a particularly interesting approach—one that tries to capture the recurrence patterns of a blog—using the notion of time and relevance (Seki et al. 2007). Although some of the techniques used proved to be useful in both years (e.g., query expansion), most approaches did not lead to significant improvements over a baseline, or even led to a decrease in performance.

In the setting of blog feed search, authors have considered various ways of improving effectiveness: (1) index pruning techniques, (2) modeling topical noise in blogs to measure recurring interest, (3) using blog characteristics such as the number of comments, post length, or the posting time, and (4) mixing different document representations. We briefly sample from publications on each of these four themes.

Starting with index pruning, a pre-processing step in (Seo and Croft 2008b) consists of removing all blogs that consist of only one post, since retrieving these blogs would come down to retrieving posts and would ignore the requirement of retrieving blogs with a recurring interest. We use various types of index pruning in Sects. 5 and 6, including removing non-English blogs and blogs that consist of a single post.

As to capturing the central interest of a blog, several authors exploit information about topical patterns in blogs. The voting-model-based approach of Macdonald and Ounis (2008) is competitive with the TREC 2007 blog feed search results reported in (Macdonald et al. 2008) and formulates three possible topical patterns along with models that encode each into the blog retrieval model. In (He et al. 2009) the need to target individual topical patterns and to tune multiple topical-pattern-based scores is eliminated; their proposed use of a coherence score to encode the topical structure of blogs allows them to simultaneously capture the topical focus at the blog level and the tightness of the relatedness of sub-topics within the blog. A different approach is proposed in (Seo and Croft 2008a), where the authors use diversity penalties: blogs with a diverse set of posts receive a penalty. This penalty is integrated in various resource selection models, where a blog is seen as a resource (collection of posts), and given a query, the goal is to determine the best resource. Below, we capture the central interest of a blogger using the KL-divergence between a post and the blog to which it belongs.

The usage of blog-specific features like comments and recency has been shown to be beneficial in blog post retrieval (Mishne 2007, Weerkamp and de Rijke 2008). In blog feed search these features can be applied in the post retrieval stage of the Posting model, but they can also be used to estimate the importance of a post for its parent blog (Weerkamp et al. 2008); we use some of these features in Sects. 5 and 6 below.

Finally, blog posts can be represented in different ways. On several occasions people have experimented with using syndicated content (i.e., RSS or ATOM feeds) instead of permalinks (HTML content) (Elsas et al. 2008a, b, Mishne 2007); results on which representation works better are mixed. Other ways of representing documents are, for example, a title-only representation, or an (incoming) anchor text representation; combinations of various representations show increased effectiveness in other web retrieval tasks (e.g., ad hoc retrieval (Eiron and McCurley 2003, Jin et al. 2002)). We increase the efficiency of our most effective model by considering multiple content representations in Sect. 6.

2.3 Language modeling for information retrieval

At the TREC 2007 and 2008 blog tracks, participants used various retrieval platforms, with a range of underlying document ranking models (Macdonald et al. 2008, 2009). We base our ranking methods on probabilistic, generative language models. Here, documents are ranked by the probability of the query being observed when randomly sampling words from the document. Since their introduction to the area of information retrieval, language modeling techniques have attracted a lot of attention (Hiemstra 2001, Miller et al. 1999, Ponte and Croft 1998). They are attractive because of their foundations in statistical theory, the great deal of complementary work on language modeling in speech recognition and natural language processing, and the fact that very simple language modeling retrieval methods have performed quite well empirically.

Work on blog feed search shows great resemblance to expert finding: given a topic, identify people that are experts on the topic. Our approach to the blog feed search task is modeled after two well-known language modeling-based models from the expert finding literature. In particular, our Blogger model corresponds to Model 1 in (Balog et al. 2006, 2009), while our Posting model corresponds to their Model 2. These connections were first detailed in (Balog et al. 2008, Weerkamp et al. 2008) and are examined and compared in great detail in this paper.

3 Probabilistic models for blog feed search

In this section we introduce two models for blog feed search, i.e., for the following task: given a topic, identify blogs (that is, feeds) about the topic. The blogs that we are aiming to identify should not just mention the topic in passing but display a recurring central interest in the topic so that readers interested in the topic would add the feed to their feed reader.

To tackle the task of identifying such key blogs given a query, we take a probabilistic approach and formulate the task as follows: what is the probability of a blog (feed) being a key source given the query topic q? That is, we determine P(blog|q) and rank blogs according to this probability. Since the query is likely to consist of very few terms to describe the underlying information need, a more accurate estimate can be obtained by applying Bayes’ Theorem, and estimating:

$$ P(blog|q) = {\frac{P(q|blog) \cdot P(blog)}{P(q)}}, $$

(1)

where P(blog) is the probability of a blog and P(q) is the probability of a query. Since P(q) is constant (for a given query), it can be ignored for the purpose of ranking. Thus, the probability of a blog being a key source given the query q is proportional to the probability of a query given the blog P(q|blog), weighted by the a priori belief that a blog is a key source, P(blog):

$$ P(blog|q) \propto P(q|blog) \cdot P(blog). $$

(2)

Since we focus on a post-based approach to blog distillation, we assume the prior probability of a blog P(blog) to be uniform. The distillation task then boils down to estimating P(q|blog), the likelihood of a blog generating query q.

In order to estimate the probability P(q|blog), we adapt generative probabilistic language models used in Information Retrieval in two different ways. In our first model, the Blogger model (Sect. 3.1), we build a textual representation of a blog, based on posts that belong to the blog. From this representation we estimate the probability of the query topic given the blog’s model. Our second model, the Posting model (Sect. 3.2), first retrieves individual blog posts that are relevant to the query, and then considers the blogs from which these posts originate.

The Blogger model and Posting model originate from the field of expert finding and correspond to Model 1 and Model 2 (Balog et al. 2006, 2009). We opt for translating these models to the new setting of blog feed search, and focus on using blog specific associations, combining the models, and improving efficiency. In the remainder of this paper we use the open source implementation of both the Blogger and Posting model, called EARS: ^{Footnote 6} Entity and Association Retrieval System.

3.1 Blogger model

The Blogger model estimates the probability of a query given a blog by representing the blog as a multinomial probability distribution over the vocabulary of terms. Therefore, a blog model θ_blogger(blog) is inferred for each blog, such that the probability of a term given the blog model is P(t|θ_blogger(blog)). The model is then used to predict how likely a blog would produce a query q. Each query term is assumed to be sampled identically and independently. Thus, the query likelihood is obtained by taking the product across all terms in the query:

$$ P(q|\theta_{blogger}(blog)) = \prod_{t\in q} P(t|\theta_{blogger}(blog))^{n(t,q)}, $$

(3)

where n(t, q) denotes the number of times term t is present in query q.
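As an illustration, Eq. 3 can be evaluated in log space, which avoids numerical underflow when many small probabilities are multiplied. The sketch below is not taken from EARS; the function name `log_query_likelihood` and the toy term distribution are assumptions made for this example.

```python
import math
from collections import Counter


def log_query_likelihood(query_terms, term_prob):
    """Return log P(q|theta) = sum over t of n(t, q) * log P(t|theta)."""
    counts = Counter(query_terms)  # n(t, q): occurrences of t in the query
    total = 0.0
    for term, n in counts.items():
        p = term_prob(term)
        if p <= 0.0:
            # An unsmoothed zero probability makes the whole product zero.
            return float("-inf")
        total += n * math.log(p)
    return total


# Hypothetical blog model P(t|theta); the values are made up.
toy_model = {"blog": 0.2, "search": 0.1}
score = log_query_likelihood(
    ["blog", "search", "blog"], lambda t: toy_model.get(t, 0.0)
)
```

For this toy query the result corresponds to 0.2² · 0.1 = 0.004 in probability space; a query term missing from the unsmoothed toy model yields negative infinity, which is the situation smoothing is meant to prevent.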

To ensure that there are no zero probabilities due to data sparseness, it is standard to employ smoothing. That is, we first obtain an empirical estimate of the probability of a term given a blog P(t|blog), which is then smoothed with the background collection probabilities P(t):

$$ P(t|\theta_{blogger}(blog)) = (1-\lambda_{blog})\cdot P(t|blog) + \lambda_{blog}\cdot P(t). $$

(4)

In Eq. 4, P(t) is the probability of a term in the document repository. In this context, smoothing adds probability mass to the blog model according to how likely it is to be generated (i.e., published) by any blog.
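The effect of Eq. 4 on an unseen term can be checked with a few lines of Python. This is only a sketch of the linear interpolation itself; the function name and the numeric values are assumptions chosen for illustration.

```python
def smoothed_term_prob(p_t_given_blog, p_t_collection, lam):
    """Eq. 4: mix the empirical blog estimate with the background probability."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * p_t_given_blog + lam * p_t_collection


# A term that never occurs in the blog still receives non-zero probability.
unseen = smoothed_term_prob(0.0, 0.001, lam=0.2)
# A frequent term keeps most of its empirical probability mass.
seen = smoothed_term_prob(0.05, 0.001, lam=0.2)
```

With these made-up values the unseen term ends up at 0.2 · 0.001, so a single missing query term no longer forces the query likelihood of Eq. 3 to zero.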

To approximate P(t|blog) we use the blog’s posts as a proxy to connect the term t and the blog in the following way:

$$ P(t|blog) = \sum_{post\in blog} P(t|post,blog)\cdot P(post|blog). $$

(5)

We assume that terms are conditionally independent from the blog (given a post), thus P(t|post, blog) = P(t|post). We approximate P(t|post) with the standard maximum likelihood estimate, i.e., the relative frequency of the term in the post. Our first approach to setting the conditional probability P(post|blog) is to allocate the probability mass uniformly across posts, i.e., assuming that all posts of the blog are equally important. In Sect. 6 we explore other ways of estimating this probability.
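A minimal sketch of Eq. 5 under the uniform assumption for P(post|blog) could look as follows. Posts are represented as plain token lists; the function names and the toy blog are assumptions made for this example and do not reflect the EARS implementation.

```python
from collections import Counter


def term_prob_given_post(term, post_tokens):
    """Maximum likelihood estimate: relative frequency of the term in the post."""
    if not post_tokens:
        return 0.0
    return Counter(post_tokens)[term] / len(post_tokens)


def term_prob_given_blog(term, posts):
    """Eq. 5 with P(post|blog) = 1 / (number of posts in the blog)."""
    if not posts:
        return 0.0
    p_post = 1.0 / len(posts)
    return sum(term_prob_given_post(term, post) * p_post for post in posts)


toy_blog = [
    ["wine", "tasting", "notes"],       # P(wine|post) = 1/3
    ["wine", "wine", "cellar", "tips"], # P(wine|post) = 2/4
]
p_wine = term_prob_given_blog("wine", toy_blog)
```

For the toy blog this gives 0.5 · 1/3 + 0.5 · 2/4 = 5/12: every post contributes equally, regardless of its length.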

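We set the smoothing parameter as follows: λ_blog = β/(|blog| + β) and (1 − λ_blog) = |blog|/(|blog| + β), where |blog| is the size of the blog model, i.e.:

$$ |blog| = \sum_{post \in blog} |post| \cdot P(post|blog), $$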

(6)

where |post| denotes the length of the post. This way, the amount of smoothing is proportional to the information contained in the blog; blogs with fewer posts will rely more on the background probabilities. This method resembles Bayes smoothing with a Dirichlet prior (Mackay and Peto 1994). We set β to be the average blog length in the collection; see Table 4 for the actual values used in our experiments.
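A small numeric sketch of Eq. 6 follows, assuming the Dirichlet-style setting λ_blog = β/(|blog| + β). The function names and all numbers are hypothetical; in particular, the value used for β here is not one of the values from the experiments.

```python
def blog_length(post_lengths, post_weights=None):
    """Eq. 6: |blog| = sum over posts of |post| * P(post|blog)."""
    if not post_lengths:
        raise ValueError("a blog needs at least one post")
    if post_weights is None:
        # Uniform P(post|blog), as in the first approach described above.
        post_weights = [1.0 / len(post_lengths)] * len(post_lengths)
    return sum(n * w for n, w in zip(post_lengths, post_weights))


def dirichlet_lambda(length, beta):
    """Assumed smoothing weight: beta / (|blog| + beta)."""
    return beta / (length + beta)


size = blog_length([100, 300])             # 0.5 * 100 + 0.5 * 300
lam = dirichlet_lambda(size, beta=200.0)   # equal weight to blog and background
lam_small = dirichlet_lambda(20.0, beta=200.0)
```

The smaller blog model receives the larger smoothing weight, so it relies more on the background probabilities.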

3.2 Posting model

Our second model assumes a different perspective on the process of finding blog feeds. Instead of directly modeling the blog, individual posts are modeled and queried (hence the name, Posting model); after that, blogs associated with these posts are considered. Specifically, for each blog we sum up the relevance scores of individual posts (P(q|θ_posting(post))), weighted by their relative importance given the blog (P(post|blog)). Formally, this can be expressed as:

$$ P(q|blog) = \sum_{post \in blog} P(q|\theta_{posting}(post)) \cdot P(post|blog). $$

(7)

Assuming that query terms are sampled independently and identically, the probability of a query given an individual post is:

$$ P(q|\theta_{posting}(post)) = \prod_{t \in q} P(t|\theta_{posting}(post))^{n(t,q)}. $$

(8)

The probability of a term t given the post is estimated by inferring a post model P(t|θ_posting(post)) for each post following a standard language modeling approach:

$$ P(t|\theta_{posting}(post)) = (1-\lambda_{post})\cdot P(t|post) + \lambda_{post} \cdot P(t), $$

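(9)

where P(t) is the background collection probability of term t, as in Eq. 4, and λ_post is the smoothing parameter of the post model.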
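Eqs. 7 to 9 can be combined into a short scoring routine. The sketch below is illustrative only: it assumes a fixed value for λ_post and a uniform P(post|blog), and the function names, the toy collection, and the background probabilities are made up for the example.

```python
from collections import Counter


def post_query_likelihood(query, post_tokens, collection_prob, lam):
    """Eqs. 8 and 9: product of smoothed term probabilities for one post."""
    counts = Counter(post_tokens)
    length = len(post_tokens)
    likelihood = 1.0
    for term in query:  # a repeated query term contributes n(t, q) factors
        p_ml = counts[term] / length if length else 0.0
        likelihood *= (1.0 - lam) * p_ml + lam * collection_prob.get(term, 0.0)
    return likelihood


def posting_model(query, blogs, collection_prob, lam=0.1):
    """Eq. 7: sum post likelihoods, weighted by a uniform P(post|blog)."""
    scores = {}
    for blog_id, posts in blogs.items():
        if not posts:
            scores[blog_id] = 0.0
            continue
        p_post = 1.0 / len(posts)
        scores[blog_id] = sum(
            post_query_likelihood(query, post, collection_prob, lam) * p_post
            for post in posts
        )
    return scores


toy_blogs = {
    "a": [["wine", "tasting"], ["wine", "cellar"]],      # recurring interest
    "b": [["wine", "tasting"], ["football", "scores"]],  # mentions it once
}
scores = posting_model(["wine"], toy_blogs, {"wine": 0.1})
```

In this toy setting blog "a", which mentions the topic in both posts, scores higher than blog "b", which mentions it in only one of two posts.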

3.3 A two-stage model

We also consider a two-stage model that integrates the Posting model (the more efficient of the two, as we will see) and the Blogger model (which has a better representation of the blogger’s interests) into a single model. To achieve this goal, we use two separate stages:

Stage 1:
Use Eq. 8 to retrieve blog posts that match a given query and construct a truncated list B of blogs these posts belong to. We do not need to “store” the ranking of this stage.
Stage 2:
Given the list of blogs B, we use Eq. 3 to rank just the blogs that are present in this list.

By limiting the list of blogs B, in stage 1, that need to be ranked in stage 2, this two-stage approach aims at improving efficiency, while it maintains the ability to construct a ranking based on the complete profile of a blogger.
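Stage 1 can be sketched as a candidate selection step. The code below assumes that truncation is applied to the post ranking (the best `n_posts` posts are kept) and uses a stand-in scoring function instead of Eq. 8; the function name, the scorer, and the toy data are all hypothetical.

```python
def select_candidate_blogs(posts, score_post, n_posts):
    """Stage 1: return the blogs that own one of the n_posts best posts.

    posts is a list of (blog_id, tokens) pairs. Only membership in the
    result matters; the stage 1 ranking itself is not kept.
    """
    ranked = sorted(posts, key=lambda item: score_post(item[1]), reverse=True)
    return {blog_id for blog_id, _ in ranked[:n_posts]}


def wine_score(tokens):
    # Stand-in for the post score of Eq. 8: relative frequency of "wine".
    return tokens.count("wine") / len(tokens)


toy_posts = [
    ("a", ["wine", "tasting"]),
    ("b", ["football"]),
    ("a", ["wine", "wine"]),
    ("c", ["wine", "travel"]),
]
small_b = select_candidate_blogs(toy_posts, wine_score, n_posts=2)
larger_b = select_candidate_blogs(toy_posts, wine_score, n_posts=3)
```

Only the blogs in the returned set are passed on to stage 2, which is where the efficiency gain is expected to come from: blog "b" is never scored with the more expensive Blogger model.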

More precisely, let N, M be two natural numbers. Let f be a ranking function on blog posts: given a set of posts it returns a ranking of those posts; f could be recency, length, or it could be a topic dependent function, in which case the query q needs to be specified. We write $(f \upharpoonright N)(blog)$ for the list consisting of the first N posts ranked using f; if q is a query, we write f_q for the post ranking function defined by Eq. 8. Then,

$$ P(q|\theta_{two}(blog)) = \left\{ \begin{array}{ll} 0, & \hbox{if } (f_{q}\upharpoonright N) (blog)=\emptyset \\ \prod\limits_{t\in q}P(t|\theta_{two}(blog))^{n(t,q)}, & \hbox{otherwise,} \end{array}\right . $$

(10)

where $(f_{q}\upharpoonright N) (blog)$ denotes the set of top N relevant posts given the query and θ_two(blog) is defined as a mixture, just like Eq. 4:

$$ P(t|\theta_{two}(blog))= (1-\lambda_{blog})\cdot P_{two}(t|blog)+\lambda_{blog}\cdot P(t), $$

(11)

in which the key ingredient P_two(t|blog) is defined as a variation on Eq. 5, restricted to the top M posts of the blog:

$$ P_{two}(t|blog) = \sum_{post\in (f\upharpoonright M)(blog)}P(t|post)\cdot P(post|blog). $$

(12)
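Eqs. 10 to 12 can be read as the following procedure. This is a sketch under several assumptions that are not fixed by the equations themselves: f ranks posts by length (longest first), P(post|blog) is uniform over the M posts that are kept, and λ_blog is passed in as a fixed number. All names and values are hypothetical.

```python
from collections import Counter


def two_stage_likelihood(query, blog_posts, retrieved_ids, collection_prob,
                         lam, m):
    """Score one blog; blog_posts is a list of (post_id, tokens) pairs.

    retrieved_ids holds the ids of the top N posts returned in stage 1.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    # Eq. 10, first case: the blog has no post among the retrieved ones.
    if not any(post_id in retrieved_ids for post_id, _ in blog_posts):
        return 0.0
    # (f restricted to M)(blog): assumed here to be the M longest posts.
    top_m = sorted(blog_posts, key=lambda item: len(item[1]), reverse=True)[:m]
    p_post = 1.0 / len(top_m)  # assumption: uniform over the kept posts
    likelihood = 1.0
    for term in query:
        # Eq. 12: term probability from the top M posts only.
        p_two = sum(
            Counter(tokens)[term] / len(tokens) * p_post
            for _, tokens in top_m
            if tokens
        )
        # Eq. 11: smooth with the background probability.
        likelihood *= (1.0 - lam) * p_two + lam * collection_prob.get(term, 0.0)
    return likelihood


toy_posts = [
    ("p1", ["wine", "tasting", "notes", "today"]),
    ("p2", ["wine", "cellar"]),
    ("p3", ["hello"]),
]
kept = two_stage_likelihood(["wine"], toy_posts, {"p2"}, {"wine": 0.1},
                            lam=0.2, m=2)
pruned = two_stage_likelihood(["wine"], toy_posts, set(), {"wine": 0.1},
                              lam=0.2, m=2)
```

With M = 2 only `p1` and `p2` enter the blog model, giving P_two(wine|blog) = 0.375 and a smoothed likelihood of 0.32; with an empty stage 1 result the blog is pruned and receives a score of 0.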

Before examining the impact of the parameters N and M in Eqs. 10 and 12, and more generally, before comparing the models just introduced in terms of their effectiveness and efficiency on the blog feed search task, we detail the experimental setup used to answer our research questions.

4 Experimental setup

We use the test sets made available by the TREC 2007 and 2008 blog tracks for the blog feed search task. Those collections consist of (1) a task definition, (2) a document collection, (3) a set of test topics, (4) relevance judgments (“ground truth”), and (5) evaluation metrics. Below, we detail those as well as statistics on our indexes and smoothing parameter β.

4.1 Document collection

The experiments presented in this paper use the TREC Blog06 collection (Macdonald and Ounis 2006). Table 1 lists the original collection statistics. The collection comes with three document types: (1) feeds, (2) permalinks, and (3) homepages. For our experiments, we only use the permalinks, that is, the HTML version of a blog post. During preprocessing, we removed the HTML code, and kept only the page title, and block level elements longer than 15 words, as detailed in (Hofmann and Weerkamp 2008).
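The preprocessing step described above can be approximated with Python's standard `html.parser` module. The sketch below is not the procedure of Hofmann and Weerkamp (2008): the set of tags treated as block level, the class name, and the helper `extract` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: which tags count as block-level elements.
BLOCK_TAGS = {"p", "div", "li", "blockquote", "td", "h1", "h2", "h3"}


class PostTextExtractor(HTMLParser):
    """Keep the page title and text blocks longer than min_words words."""

    def __init__(self, min_words=15):
        super().__init__()
        self.min_words = min_words
        self.title = ""
        self.blocks = []
        self._in_title = False
        self._skip = False   # True while inside <script> or <style>
        self._buffer = []

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if len(text.split()) > self.min_words:
            self.blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip = True
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style"):
            self._skip = False
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data
        else:
            self._buffer.append(data)


def extract(html):
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing block that was never closed
    return parser.title.strip(), parser.blocks


sample = (
    "<html><head><title>My wine blog</title></head><body>"
    "<div>Menu</div>"
    "<p>Last weekend we visited a small vineyard in the hills and tasted "
    "five very different red wines there</p>"
    "</body></html>"
)
title, blocks = extract(sample)
```

On the sample page the short navigation block is dropped, while the title and the 18-word paragraph are kept.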

Table 1 Statistics of the TREC Blog06 collection
