1 Introduction

It is well-known that a vast number of web pages are created for the purpose of surreptitiously causing a search engine to deliver an unwelcome payload (Gyöngyi and Garcia-Molina 2005). Broadly speaking, this purpose is effected by two mechanisms: self-promotion and mutual promotion. A common form of self-promotion is keyword stuffing, in which gratuitous keywords (often invisible to the reader) are inserted to improve the retrieved rank of the page. A common form of mutual promotion is the link farm, in which a large number of plausible-looking pages reference one another, so as to improve a topic-independent quality score such as PageRank. Often, the content of both kinds of spam pages is mechanically generated or plagiarized.

We are concerned with measuring and mitigating the effect of spam pages on retrieval effectiveness, as illustrated by two TREC 2009 Footnote 1 tasks that use the new ClueWeb09 dataset. Footnote 2 The first of these tasks is the ad hoc task of the Web Track (Clarke et al. 2009). The second task is the Relevance Feedback Track task, which evaluated relevance feedback techniques through a two-phase experimental protocol. Relevance judgments were provided to the participants after Phase 1, for use in Phase 2.

Definitions of web spam typically focus on those pages that contain deceptive or harmful content. The authors of such spam pages may attempt to subvert the ranking algorithm of a search engine by presenting a false impression of relevance. However, “defining web spam is not as straightforward as it might seem” (Gyöngyi and Garcia-Molina 2005), and it is often unclear whether a page is genuinely malicious or merely of low quality. We avoid considerations of author intent by employing a broad definition of web spam that encompasses low-quality “junk” pages, those which are unlikely to be judged relevant to any query that might reasonably retrieve them.

The ClueWeb09 dataset was crawled from the general web in early 2009, and contains roughly 1 billion pages written in a number of languages. The TREC tasks were concerned only with the English subset of about 500 million pages. Furthermore, several sub-tasks used a “Category B” (or “Cat B”) subset containing about 50 million pages. The full set of 1 billion pages was dubbed “Category A” (or “Cat A”). To our knowledge, all TREC participants submitting Category A runs used at most the 500 million page, English subset of ClueWeb09. The Relevance Feedback Track specifically limited Category A runs to the English subset.

The literature lacks quantitative studies of the impact of spam and spam filtering on retrieval effectiveness. The AIRWeb Web Spam Challenge series Footnote 3 has measured the effectiveness of various methods at identifying spam hosts, but not the overall contribution of these methods to retrieval effectiveness. The Web Spam Challenge and other studies use two datasets prepared by Yahoo for the purpose: WEBSPAM-UK2006 and WEBSPAM-UK2007. Footnote 4 Each of these datasets consists of a crawl of a portion of the “.uk” web space, with spam or non-spam labels for a sample of a few thousand of the hosts represented in each crawl. To our knowledge, the only study of retrieval effectiveness using the WEBSPAM corpora (Jones et al. 2007) shows that users prefer spam-filtered results to unfiltered ones, but offers no relevance-based measurement of the impact of spam. The same authors investigate the properties that a collection should have to evaluate the effectiveness of “spam nullification” (Jones et al. 2009).

Previous TREC web IR evaluation efforts have used corpora largely devoid of spam. For example, the TREC Terabyte Track used a collection of government web pages (Büttcher et al. 2006). Spam has been identified as an issue in the 2006–2008 TREC Blog Tracks (Macdonald et al. 2007, 2009). From known spam blogs representing about 16% of the corpus, systems retrieved about 10% spam, but the removal of this spam did not substantially impact relative system rankings (Macdonald et al. 2009). The use of the ClueWeb09 dataset places the spam issue front and center. At least three participants (Hauff and Hiemstra 2009; Lin et al. 2009) used spam filters of some sort, one of them obtained from a commercial search provider. Other participants noted the impact of spam on their efforts, particularly for Category A tasks (He et al. 2009; Kaptein et al. 2009; McCreadie et al. 2009).

Our objectives in undertaking this work were twofold: (1) to develop a practical method of labeling every page in ClueWeb09 as spam or not, and (2) to quantify the quality of the labeling by its impact on the effectiveness of contemporary IR methods. Our results are:

  • Several complete sets of spam labels, available for download without restriction. Each label is a percentile score, which may be used in combination with a threshold to classify a page as “spam” or “not spam”, or may be used to rank the page with respect to others by “spamminess.”

  • A general process for labeling large web datasets, which requires minimal computation, and minimal training.

    • A variant of the process is essentially unsupervised, in that it uses automatically labeled training examples, with no human adjudication.

    • A variant uses training examples from a four-year-old, dissimilar, and much smaller collection.

    • A variant uses only 2.5 hours of human adjudication to label representative training examples.

    • A variant combines all of the above to yield a superior meta-ranking.

  • Measurements that show a significant and substantive positive impact on precision at fixed cutoff, when the labels are used to remove spammy documents from all runs officially submitted by participants to the TREC 2009 web ad hoc and relevance feedback tasks.

  • A method to automatically reorder, rather than simply to filter, the runs.

  • Measurements that use 50-fold cross-validation to show a significant and substantive positive impact on reordering at all cutoff levels, and on rank-based summary measures.

Our measurements represent the first systematic study of spam in a dataset of the magnitude of ClueWeb09, and the first quantitative results of the impact of spam filtering on IR effectiveness. Over and above the particular methods and measurements, our results serve as a baseline and benchmark for further investigation. New sets of labels may be compared to ours by the impact they have on effectiveness. Different methods of harnessing the labels—such as using them as a feature for learning to rank—may be compared to ours.

2 Context

Given the general consensus in the literature that larger collections yield higher precision at fixed cutoff (Hawking and Robertson 2003), we expected the precision at rank 10 (P@10) scores for the TREC 2009 web ad hoc task to be high (P@10 > 0.5), especially for the Category A dataset. Contrary to our expectations, the Category A results were poor (P@10 for all Cat A submitted runs: μ = 0.25, σ = 0.11, max = 0.41). The Category B results were better (P@10 for all Cat B submitted runs: μ = 0.38, σ = 0.07, max = 0.56), but still short of our expectation.

The authors represent two groups, X (Cormack and Mojdeh 2009) and Y (Smucker et al. 2009), that participated in TREC 2009, employing distinct retrieval methods. Both of these methods were based only on document content, and did not employ web-specific techniques such as anchor text or link analysis. In the course of developing their method, Group X composed a set of 67 pilot queries (Table 1) and, for each, informally adjudicated the top few results from Category A. The results were terrible, the vast majority being spam. At the time, we took this to be a shortcoming of our method and reworked it using pseudo-relevance feedback from Wikipedia to yield higher quality results, which we characterized as “not terrible.”

Table 1 Pilot queries composed prior to TREC 2009

Group Y achieved satisfactory results (P@10 = 0.52) in Phase 1 of the relevance feedback (RF) task, which mandated Category B. In the final results, Group X achieved strong performance on the Category A ad hoc task (P@10 = 0.38), while Group Y did not (P@10 = 0.16). These results were surprising, as Group Y used exactly the same search engine and parameters for their ad hoc submission as they did for their relevance feedback submission—the only difference was the use of the Category A dataset instead of the Category B dataset.

A plausible explanation for these observations is that the TREC submissions were, in general, adversely affected by spam, and that the Category A collection has a higher proportion of spam than Category B. To validate this explanation, we first sought to quantify our observation that the top-ranked pages returned by our methods were dominated by spam. We then sought to find an automatic method to abate spam, and to evaluate the impact of that method on retrieval effectiveness.

To quantify the amount of spam returned, we constructed a web-based evaluation interface (Fig. 1) and used it to adjudicate a number of the top-ranked pages returned during our preliminary investigation. As shown in the figure, the interface presents one page at a time for judging, displaying both rendered HTML and the HTML source. Links along the top allow the user to judge the page. A “spam” judgment indicates that the page appears to contain harmful or malicious content. A “crap” judgment indicates that the page appears to contain useless or junk content, which may not actually be malicious or harmful. A “ham” judgment indicates that the page appears to contain some useful content. A “pass” link allows the user to skip the page without making a judgment. For the purposes of this paper we merged the “spam” and “crap” judgments, since the distinction between them was often unclear.

Fig. 1
figure 1

User interface for adjudicating spamminess of ClueWeb09 pages

Group X adjudicated 756 pages in total, selected at random with replacement from the top ten results for each of the 67 pilot topics. Group Y adjudicated 461 pages, selected at random with replacement from the top ten results for their Category A and Category B relevance feedback runs. These efforts consumed 2 h. 20 min. and 1 h. 20 min. respectively. The results, shown with 95% confidence intervals in Table 2, indicate a high proportion of spam for the results of both groups in both categories. It follows that this high proportion of spam must have a substantial adverse effect on precision. As anticipated, the proportion of spam is higher in the Category A results.

Table 2 Group X and Group Y estimates of spam prevalence in top-ranked documents

3 Evaluation measures

The Group X and Group Y examples are entirely independent from each other, as they are derived from different topics and different retrieval methods, and assessed by different individuals. Furthermore, the Group X examples are independent of TREC, as the topics and pages were determined beforehand. For this reason, it is appropriate to use the Group X examples for training and tuning, and the Group Y examples for evaluation.

But evaluation on the Group Y examples answers only the narrow question, “How well can a spam filter identify spam in the pages returned by one particular method?” We are interested in the broader questions, “How well can a spam filter identify non-relevant pages?” and “What is the impact of removing or reranking these pages on retrieval effectiveness in general?”

For each of these questions, we require a suitable evaluation measure. For the first two—how well a filter identifies spam and non-relevant documents—we use AUC, the area under the receiver operating characteristic curve. AUC—a threshold-independent measure of classifier effectiveness—has been used as the primary measure in previous web and email spam filtering evaluation efforts (Cormack and Lynam 2005). Our choice of evaluation measure for the third question—how well a filter improves retrieval effectiveness—was constrained by the sparsity of relevance assessments for TREC 2009. For the Category A ad hoc and relevance feedback tasks, the top-ranked 12 documents for each submitted run were assessed, as well as a stratified random sample of the rest. Precision at cutoff 10 (P@10) was reported as the primary retrieval effectiveness measure. These reported results serve as the baseline for our evaluation of filter impact.

Once documents are eliminated due to spam filtering, it is no longer the case that there are assessments for the top-ranked 12 documents, as lower ranked ones (which may not be assessed) rise in the ranking. It is therefore necessary to estimate P@10 for the filtered results, in order to compare them to the baseline. Furthermore, the Category B baseline must be estimated, as only a sample of the documents submitted for Category B submissions was assessed. We considered four methods of estimating P@10:

  • unjudged-nrel. Unadjudicated documents are considered to be non-relevant. This method underestimates P@10.

  • unjudged-elided. Unadjudicated documents are elided, and P@10 is computed on the top-ranked 10 documents that are adjudicated (Sakai and Kando 2008). This method may underestimate P@10 because lower ranked documents are evaluated in place of higher ranked ones. Or it may overestimate P@10 because the elided documents are less likely to be relevant due to selection bias.

  • statPC10. An intermediate calculation in the statAP method, as described by Carterette et al. (2008):

    $$ statPC10={\frac{statrel10}{10}} $$
    (1)

    where statrel10 is a sample-based estimate of the number of relevant documents in the top-ranked 10. We determined experimentally that this method underestimates P@10 when the number of assessed documents is small. It yields an inexact estimate of P@10 even when the top 10 documents are fully adjudicated, and sometimes yields the absurd result statPC10 > 1.

  • estP10. A sparse set-based estimate used for the TREC Legal Track, as described by Tomlinson et al. (2007):

    $$ estP10={\frac{estrel10}{\max(estrel10+estnrel10,1)}} $$
    (2)

    where,

    $$ estrel10=\min(statrel10,10-nrel10) $$
    (3)

    and

    $$ estnrel10=\min(statnrel10,10-rel10). $$
    (4)

    Both statrel10 and statnrel10 are sample-based estimates of the number of relevant and non-relevant documents in the top-ranked 10, and rel10 and nrel10 are exact counts of the number of assessed relevant and non-relevant documents in the top-ranked 10. The estP10 measure yields more stable results than statPC10 and has the property that estP10 = P@10 when the top 10 documents are fully adjudicated. Moreover, it is nearly symmetric and therefore less likely to be biased: When none of the 10 documents is judged, estP10 = 0; otherwise, \(estP10=1-\overline{estP10}\) where \(\overline{estP10}\) is calculated by complementing all judgments.
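To make the estimator concrete, the following minimal C sketch computes estP10 from the quantities in Eqs. 2–4; the function and variable names are ours and are not taken from the TREC Legal Track evaluation software.

#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }
static double max2(double a, double b) { return a > b ? a : b; }

/* estP10 from Eqs. 2-4: statrel10 and statnrel10 are sample-based estimates;
   rel10 and nrel10 are exact counts of judged relevant and judged
   non-relevant documents in the top-ranked 10. */
double estP10(double statrel10, double statnrel10, int rel10, int nrel10) {
    double estrel10  = min2(statrel10,  10 - nrel10);    /* Eq. 3 */
    double estnrel10 = min2(statnrel10, 10 - rel10);     /* Eq. 4 */
    return estrel10 / max2(estrel10 + estnrel10, 1.0);   /* Eq. 2 */
}

int main(void) {
    /* hypothetical topic: 6 judged relevant and 2 judged non-relevant in the
       top 10, with sample-based estimates of 7.3 relevant and 2.7 non-relevant */
    printf("estP10 = %.3f\n", estP10(7.3, 2.7, 6, 2));
    return 0;
}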

The results reported here use estP10 as the primary retrieval effectiveness measure, computed using the TREC Legal Track evaluation software l07_eval version 2.0.Footnote 5 The other three measures produce similar results and lead to the same conclusions. We briefly compare estP10 with the other measures in Sect. 5.

The estP10 measure estimates the effectiveness of retrieval for one specific task: identifying ten likely relevant documents. While this view of effectiveness is not unreasonable for web search, we are also concerned with effectiveness at cutoff values other than ten, and more generally rank measures that summarize effectiveness over many cutoff values. The most commonly used rank measure is mean average precision (MAP), which is the mean of average precision (AP) over all topics:

$$ AP={\frac{1}{R}}\sum_{k}P\hbox{@}k\cdot rel(k) , $$
(5)

where rel(k) = 1 if the kth-ranked document in the run is relevant, and 0 if it is not; R is the total number of relevant documents in the collection.
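For reference, AP for a single topic can be computed as in the following C sketch; the names and 0-based array layout are ours.

/* Average precision (Eq. 5) for one topic: rel[k-1] = 1 if the document at
   rank k is relevant and 0 otherwise; R is the total number of relevant
   documents in the collection; n is the run length. */
double average_precision(const int *rel, int n, int R) {
    double sum = 0.0;
    int found = 0;
    for (int k = 1; k <= n; k++)
        if (rel[k - 1]) {
            found++;                       /* number relevant in the top k */
            sum += (double)found / k;      /* P@k at each relevant rank */
        }
    return R > 0 ? sum / R : 0.0;
}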

Unfortunately, R is unknown for the TREC 2009 tasks. Furthermore, rel(k) is unknown for most k, in particular most k > 12. Methods to estimate AP with incomplete knowledge of rel(k) have proven to be unreliable for the TREC 2009 tasks [private correspondence, TREC 2009 Web Track coordinators].

A more straightforward rank effectiveness measure is R-precision (RP), which is simply

$$ RP=P\hbox{@}R . $$
(6)

While R-precision depends on R, R is not a direct factor in the formula, so estimation errors have much lower impact. Furthermore, estRP is easily computed:

$$ estRP=estP\hbox{@}estR . $$
(7)

Regardless of whether AP or RP or some other rank effectiveness measure is used, if our reranking improves estPk for all values of k, it follows that the rank measure is improved. Therefore, as our primary evaluation of the effectiveness of spam reranking, we compute estPk for representative values of k. As our best effort to quantify the magnitude of the effect, we present estRP as well. Furthermore, we evaluate StatMAP, MAP (unjudged non-relevant), and MAP (unjudged elided).

4 Spam filter design

Our principal criteria in choosing a spam filter design were efficiency and effectiveness. By efficiency, we mean the end-to-end time and resource consumption (both human and computer) to label the corpus. By effectiveness, we mean the ability to identify spam (and hence non-relevant) pages among the retrieved documents, and thus to improve precision by deleting them.

Although the literature is dominated by graph-based methods for web spam filtering and static ranking (Becchetti et al. 2008; Richardson et al. 2006), content-based email spam filters were found to work as well as graph-based methods in the 2007 Web Spam Challenge (Cormack 2007). Furthermore, these filters are very fast, being able to classify thousands of documents per second. Our implementation required about 48 hours of elapsed time to decompress, decode, and score the 500 million English ClueWeb09 pages on a standard PC with an Intel dual-core E7400 CPU.

We used three different sets of training examples to create three filters, each of which was used to label the entire corpus; in addition, we created an ensemble filter using a naive Bayes metaclassifier to combine the results. The sets of training examples are:

  • UK2006. The WEBSPAM-UK2006 corpus, used for the AIRWeb Web Spam Challenge and other studies, contains spam and nonspam labels for 8,238 hosts. For each spam host and each nonspam host, we selected the first page in the corpus whose size was at least 5 K bytes. This approach tends to select an important page near the root of the host’s web space. Our training set consisted of 767 spam pages and 7,474 nonspam pages—one for each spam host and one for each nonspam host. Our aim in using this set of training examples was to investigate the efficacy of transfer learning from an older, smaller, less representative corpus.

  • Britney. Our second set of training examples was essentially generated automatically, requiring no manual labeling of spam pages. We asked ourselves, “If we were spammers, where would we find keywords to put into our pages to attract the most people?” We started looking for lists of popular searches and found an excellent source at a search engine optimization (SEO) site.Footnote 6 This particular SEO site collects the daily published “popular search queries” from the major web search engines, retailers, and social tagging sites. We used their collected Google Trends, Yahoo!Buzz, Ask, Lycos, and Ebay Pulse queries. We downcased all queries and took the top 1,000 for the year 2008, the period immediately before the ClueWeb09 corpus was crawled. The most popular query was “britney spears” and hence the name of this training data. Table 3 shows other query examples. We used the #combine operator in Indri (Strohman et al. 2005) to perform naive query likelihood retrievals from Category A with these 1,000 queries. We used the same index and retrieval parameters as Smucker et al. (2009). For each query, we took the top ten documents and summarily labeled them as spam, with no human adjudication. We fetched the Open Directory Project archiveFootnote 7 and intersected its links with the URIs found in ClueWeb09. From this intersection, we selected 10,000 examples which we summarily labeled as nonspam, with no human adjudication. Our rationale for using this set of training examples was derived from the observation that our naive methods retrieved almost all spam, using the queries that we composed prior to TREC. We surmised that popular queries would be targeted by spammers, and thus yield an even higher proportion of spam—high enough that any non-spam would be within the noise tolerance limits of our spam filter. In effect, the SEO queries acted as a “honeypot” to attract spam.

  • Group X. We used the 756 documents adjudicated by Group X as training examples (Table 2, column 1). Recall that these documents were selected without knowledge of the TREC topics or relevance assessments. Our objective was to determine how well a cursory labeling of pages selected from the actual corpus would work.

  • Fusion. The scores yielded by the three filters were interpreted as log-odds estimates and averaged, in effect yielding a naive Bayes combination of the three scores. This approach is known to be effective for both email and web spam filtering (Cormack 2007; Lynam and Cormack 2006).
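The fusion step amounts to averaging the per-filter log-odds scores, as in the following minimal C sketch (our illustration, not the code actually used):

/* Naive Bayes fusion: average the log-odds scores assigned to a page by the
   three base filters.  An additive constant does not affect the induced
   ranking, so the mean serves as the combined spamminess score. */
double fuse(const double score[], int nfilters) {
    double sum = 0.0;
    for (int i = 0; i < nfilters; i++)
        sum += score[i];
    return sum / nfilters;
}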

Table 3 Top 40 queries from the 1,000 queries used for the Britney training examples

We expected all the filters to identify spam better than chance, but had no prediction as to which set of training examples would work best. In particular, we did not know how well training on the UK2006 examples would transfer to ClueWeb09 due to the differences in the times of the crawls, the hosts represented, and the representativeness of the host-based labels. We did not know how well “pages retrieved by a naive method in response to a popular query” would act as proxies for spam, or how overfitted to the particular queries the results would be. Similarly, we did not know how well ODP pages would act as proxies for non-spam. We did not know if the Group X examples were sufficiently numerous, representative, or carefully labeled to yield a good classifier. We did have reason to think that the fusion filter might outperform all the rest, consistent with previously reported results.

4.1 Filter operation

A linear classifier was trained using on-line gradient-descent logistic regression in a single pass over the training examples (Goodman and Yih 2006). The classifier was then applied to the English portion of the ClueWeb09 dataset end-to-end, yielding a spamminess score for each successive page p. Owing to the use of logistic regression for training, the spamminess score may be interpreted as a log-odds estimate:

$$score(p) \approx \log {\frac{{\Pr [p\;is\;spam]}}{{\Pr [p\;is\;nonspam]}}}\,.$$
(8)

However, this estimate is likely to be biased by the mismatch between training and test examples. Nonetheless, a larger score indicates a higher likelihood of spam, and the sum of independent classifier scores is, modulo an additive constant, a naive Bayes estimate of the combined log-odds.

For the purpose of comparing effectiveness, we convert each score to a percentile rank over the 503,903,810 English pages:

$$ percentile(p)=\left\lfloor 100\,{\frac{|\{p' \mid score(p')\geq score(p)\}|} {503{,}903{,}810}}\right\rfloor. $$
(9)

That is, the set of pages with percentile(p) < t represents the spammiest t% of the corpus. In the results below, we measure effectiveness for \(t\in[0, 100]\), where t = 0 filters nothing and t = 100 filters everything.
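The score-to-percentile conversion of Eq. 9 can be sketched in C as follows, assuming the raw scores fit in memory; the in-memory sort and the names are our illustrative choices rather than a description of our production code.

#include <stdlib.h>

/* Descending order of score, so that the count of elements >= a given score
   can be found by binary search. */
static int cmp_desc(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x < y) - (x > y);
}

/* Eq. 9: percentile 0 marks the spammiest pages, higher values less spammy. */
void to_percentiles(const double *score, int *percentile, long n) {
    double *sorted = malloc(n * sizeof *sorted);
    for (long i = 0; i < n; i++)
        sorted[i] = score[i];
    qsort(sorted, n, sizeof *sorted, cmp_desc);
    for (long i = 0; i < n; i++) {
        long lo = 0, hi = n;             /* count pages with score >= score[i] */
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            if (sorted[mid] >= score[i]) lo = mid + 1; else hi = mid;
        }
        percentile[i] = (int)(100.0 * lo / n);   /* floor of 100 * rank / n */
    }
    free(sorted);
}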

4.2 Filter implementation

The implementation of the classifier and update rule is shown in Fig. 2. Apart from file I/O and other straightforward housekeeping code, this figure contains the full implementation of the filter.Footnote 8 The function train() should be called on each training example. After training, the function spamminess() returns a log-odds estimate of the probability that the page is spam.

Fig. 2
figure 2

C implementation of the filter. The function spamminess is the soft linear classifier used for spam filtering. The function train implements the online logistic regression gradient descent training function

Each page, including WARC and HTTP headers, was treated as flat text. No tokenization, parsing, or link analysis was done. Pages exceeding 35,000 bytes in length were arbitrarily truncated to this length. Overlapping byte 4-grams were used as features. That is, if the page consisted of “pq xyzzy”, the features would be simply “pq x”, “q xy”, “ xyz”, “xyzz”, and “yzzy”. Each feature was represented as a binary quantity indicating its presence or absence in the page. Term and document frequencies were not used. Finally, the feature space was reduced from \(4\times10^{9}\) to \(10^{6}\) dimensions using hashing and ignoring collisions. This brutally simple approach to feature engineering was used for one of the best filters in the TREC 2007 email spam filtering task (Cormack 2007), giving us reason to think it would work here.
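The feature mapping just described can be sketched in C as follows; the particular hash function and constants are illustrative assumptions on our part, consistent with the description above but not taken from Fig. 2.

#define NFEATURES 1000000   /* hashed feature space of 10^6 dimensions */
#define MAXLEN    35000     /* pages truncated to 35,000 bytes */

/* Mark which overlapping byte 4-grams occur in a (truncated) page.
   Each feature is binary, with hash collisions ignored; present[] is
   assumed to be zero-initialized by the caller. */
void extract_features(const unsigned char *page, long len,
                      unsigned char present[NFEATURES]) {
    if (len > MAXLEN)
        len = MAXLEN;
    for (long i = 0; i + 4 <= len; i++) {
        unsigned h = ((unsigned)page[i]   << 24) | ((unsigned)page[i+1] << 16)
                   | ((unsigned)page[i+2] <<  8) |  (unsigned)page[i+3];
        h = (h * 2654435761u) % NFEATURES;   /* illustrative multiplicative hash */
        present[h] = 1;
    }
}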

Given a page p represented by a feature vector \(X_p\), a linear classifier computes

$$ score(p)=\beta\cdot X_{p} $$
(10)

where the weight vector β is inferred from training examples. For the particular case of on-line gradient-descent logistic regression, the inference method is quite simple. β is initialized to \(\overline{0}\), and for each training document p in arbitrary order, the following update rule is applied:

$$ \beta\leftarrow\beta+\delta X_{p}\left(is\,spam(p)-{\frac{1}{1+e^{-score(p)}}}\right), \quad\hbox{where} $$
(11)
$$ is\,spam(p)=\left\{ \begin{array}{ll} 1 & p\;\hbox{is spam}\\ 0 & p\;\hbox{is nonspam} \end{array} \right. $$
(12)

We fixed the learning rate parameter δ = 0.002 based on prior experience with email spam and other datasets.
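In code, Eqs. 10–12 reduce to a few lines. The C sketch below paraphrases the approach of Fig. 2 using our own names and a per-page list of hashed feature indices; it is not a verbatim copy of that figure.

#include <math.h>

#define NFEATURES 1000000      /* hashed binary feature space */
#define DELTA     0.002        /* fixed learning rate */

static float beta[NFEATURES];  /* weight vector, implicitly initialized to 0 */

/* Eq. 10: score(p) = beta . X_p, where X_p is the binary feature vector.
   feature[] lists the indices of the features present in the page. */
double spamminess(const unsigned feature[], int nfeat) {
    double score = 0.0;
    for (int i = 0; i < nfeat; i++)
        score += beta[feature[i]];
    return score;              /* log-odds estimate that the page is spam */
}

/* Eqs. 11-12: one online gradient-descent step for a labeled training page. */
void train(const unsigned feature[], int nfeat, int isspam) {
    double p = 1.0 / (1.0 + exp(-spamminess(feature, nfeat)));
    double step = DELTA * (isspam - p);
    for (int i = 0; i < nfeat; i++)
        beta[feature[i]] += (float)step;
}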

5 Filter results

We first consider how well the four filters identify spam, and how well they rank for static relevance. We then consider the impact on the TREC 2009 web ad hoc submissions on average, and the impact on the individual submissions. Finally, we consider the impact on the TREC 2009 relevance feedback submissions.

Figure 3 shows the effectiveness of the UK2006 filter at identifying spam in the examples labeled by Group Y. The top and middle panels show the fraction of spam pages identified, and the fraction of nonspam pages identified, as a function of the percentile threshold. For a good filter, the lines should be far apart, indicating that a great deal of spam can be eliminated while losing little nonspam. The two panels indicate that the filter is effective, but do little to quantify how far apart the lines are. The bottom panel plots the first curve as a function of the second—it is a receiver operating characteristic (ROC) curve. The area under the curve (AUC)—a number between 0 and 1—indicates the effectiveness of the filter. The results (0.94 for Category A and 0.90 for Category B) are surprisingly good, comparable to the best reported for the 2007 AIRWeb Challenge. AUC results for this and the other three filters are shown, with 95% confidence intervals, in Table 4. These results indicate that all filters are strong performers, with UK2006 and Group X perhaps slightly better at identifying spam than Britney. The fusion filter, as predicted, is better still.

Fig. 3
figure 3

Effect of filtering on elimination of spam and nonspam. The top and centre plots show the fraction of spam and nonspam eliminated as a function of the fraction of the corpus that is labeled “spam.” The bottom plot shows the corresponding ROC curves

Table 4 ROC Area (AUC) and 95% confidence intervals for three base filters, plus the naive Bayes fusion of the three

Figure 4 shows the filter’s effectiveness at identifying non-relevant documents, as opposed to spam. To measure nonrelevance, we use the same documents discovered by Group Y, but the official TREC relevance assessments instead of Group Y’s spam labels. We see that the curves are well separated and the AUC scores are only slightly lower than those for spam identification. Table 4 summarizes the AUC results for both spam and static relevance.

Fig. 4
figure 4

Effect of filtering on elimination of non-relevant and relevant pages. The top and centre plots show the fraction of non-relevant and relevant pages eliminated as a function of the fraction of the corpus that is labeled “spam.” The bottom plot shows the corresponding ROC curves

We would expect high correlation between documents identified as spam and documents assessed to be non-relevant, but were surprised nonetheless by how well the filter worked for this purpose. These results suggest that spam is a strong predictor—perhaps the principal predictor—of nonrelevance in the results returned by the search engine in this study.

To evaluate the impact of filtering on retrieval effectiveness, we acquired from the TREC organizers all submissions for the TREC 2009 web ad hoc and relevance feedback tasks. These runs employed a wide variety of retrieval methods, including many web-specific techniques such as anchor text, HTML field weighting, and link analysis (Chandar et al. 2009; Dou et al. 2009; Guan et al. 2009; He et al. 2009; Kaptein et al. 2009; McCreadie et al. 2009). The retrieval techniques differed greatly from group to group, and we refer you to their TREC 2009 reports for complete details. However, we note that at least three groups employed spam filtering techniques when generating their runs (Hauff and Hiemstra 2009; Lin et al. 2009).

We applied the four filters—and also a random control—to the TREC 2009 runs with threshold settings of \(t\in\{0,10,20,30,40,50,60,70,80,90\}\). The random control simply labeled t% of the corpus at random to be spam. Our prediction was that for an effective filter, estP10 should increase with t and eventually fall off. For the random control, estP10 should either remain flat or fall off slightly, assuming the submissions obey the probability ranking principle.
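Concretely, filtering a run at threshold t amounts to the following C sketch, which assumes the percentile scores are indexed by document identifier (a layout of our choosing, for illustration). estP10 is then computed over the documents that survive the filter.

/* Remove from a ranked list all documents in the spammiest t percent of the
   corpus (percentile < t), preserving the original order of the survivors.
   Returns the number of documents kept. */
int filter_run(const int *docid, int n, const int *percentile, int t,
               int *kept) {
    int m = 0;
    for (int i = 0; i < n; i++)
        if (percentile[docid[i]] >= t)
            kept[m++] = docid[i];
    return m;
}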

Figure 5 shows estP10, averaged over all official TREC submissions, as a function of t for each of the filters. All (except the control) rise substantially and then fall off as predicted. The control appears to rise insubstantially, and then fall off. It is entirely possible that the rise is due to chance, or that the probability ranking is compromised by the presence of very highly ranked spam. 95% confidence intervals are given for the control, but omitted for the other filters as their superiority is overwhelmingly significant (p ≪ 0.001).

Fig. 5
figure 5

Effect of spam filtering on the average effectiveness of all Web Track ad hoc submissions, Category A and Category B. Effectiveness is shown as precision at 10 documents returned (estP@10) as a function of the fraction of the corpus that is labeled “spam”

All filters behave as predicted. The value of estP10 increases to t = 50, at which point the UK2006 filter starts to fall off. Beyond t = 50, the other filters continue to improve for Category A, and plateau for Category B. As expected, the fusion filter is superior to the rest, reaching peak effectiveness at t = 70 for Category A and t = 50 for Category B. The fusion filter with these threshold settings is used to illustrate the impact on individual TREC submissions.

The UK2006 filter is trained on documents from a different corpus, which lack corpus-specific information such as WARC and HTTP headers. The strong performance of this filter—notwithstanding the fall-off at high thresholds—is evidence that spam, and not some artifact of the data itself, is responsible for the results presented here.

Figure 6 shows scatterplots comparing unfiltered with filtered estP10 results. Nearly every submission is improved by filtering. The top scatterplot panel (Category A) is particularly remarkable as it shows no significant correlation between filtered and unfiltered results for particular runs (95% confidence interval: −0.24 to 0.40). That is, the effect of spam filtering overwhelms any other differences among the submissions. Tables 5 and 6 respectively report the results of the fusion filter for the individual Category A and Category B ad hoc runs.

Fig. 6
figure 6

Effect of spam filtering on the effectiveness of individual Web Track ad hoc submissions, Category A and Category B. The top scatterplot shows estP@10 with 70% of the corpus labeled “spam” by the fusion method; the bottom scatterplot shows estP@10 with 50% of the corpus labeled “spam” by the fusion method. The correlations with 95% confidence intervals for the top and bottom plots respectively are 0.09 (−0.24 to 0.40) and 0.60 (0.34–0.79)

Table 5 Effect of spam filtering on the TREC web ad hoc submissions, Category A
Table 6 Effect of spam filtering on the TREC web ad hoc submissions, Category B

Figure 7 illustrates the dramatic impact of spam filtering on our simple query likelihood method. In Category A, our submission is improved from the worst unfiltered result to better than the best unfiltered result. In Category B, the same method (which was not an official submission to TREC) sees a less dramatic but substantial improvement.

Fig. 7
figure 7

Effect of filtering on naive query likelihood language model runs

Figure 8 shows the effect of filtering on the relevance feedback runs. The baseline results are stronger, but still improved substantially by filtering.

Fig. 8
figure 8

Effect of spam filtering on the average effectiveness of all relevance feedback task Phase 2 submissions, Category A and Category B. Effectiveness is shown as precision at 10 documents returned (estP@10) as a function of the fraction of the corpus that is labeled “spam”

Figure 9 recasts the superior and inferior curves from Fig. 5 in terms of the other three measures. The overall effect is the same for all measures: filtering substantially improves P@10 over baseline for a wide range of threshold settings.

Fig. 9
figure 9

Other P@10 estimates for Category A ad hoc runs, fusion filter

6 Reranking method

In the experiments reported in the previous section, we used the score returned by our classifier in the crudest possible way: as a brick wall filter that effectively eliminates some fraction of the corpus from consideration. In this section, we consider instead the problem of using the spam scores to reorder the ranked list of documents returned by a search engine.

In reranking, we do not eliminate documents with high scores; instead we move them lower in the ranking. Presumably, documents with extreme scores should be moved more than others, but by how much? Our approach is to use supervised learning to compute the best new ranking, given the original ranking and the spam percentile scores.

Supervised learning requires training examples. In a real-world deployment, the training examples would be constructed by adjudicating the results of historical queries presented to the same search engine. For our experiments, we have no historical results—only those from the 50 topics used at TREC 2009, which have no particular historical order. We therefore use 50-fold cross-validation (i.e., leave-one-topic-out), using one topic at a time for evaluation and treating the remaining 49 as historical examples. The evaluation results from these 50 separate experiments—each reranking the results for one topic—are then averaged.

Our learning method, more properly a learning-to-rank method, consists of exhaustive enumeration to compute, for all k, the threshold \(t_k\) that optimizes estPk:

$$ t_{k}=\arg\max_{t}estPk . $$
(13)

Then we proceed in a greedy fashion to build the new ranked list r′ from the original r:

$$ r'[1]=r[\min\{i \mid score(r[i])\ge t_{1}\}] $$
(14)
$$ r'[k>1]=r[\min\{i \mid score(r[i])\ge t_{k}\ \hbox{and}\ r[i]\notin r'[1,k-1]\}] $$
(15)

except when Eq. 15 is undefined, in which case

$$ r'[k>1]=r[k]. $$
(16)

The special case is occasioned by the fact that \(t_k\) is not necessarily monotonic in k due to noise in the training examples.
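For concreteness, the greedy construction of Eqs. 14–16 can be sketched in C as follows; the 0-based arrays and names are ours.

#include <stdlib.h>

/* Greedy reranking (Eqs. 14-16): at each output rank k, take the highest
   original-ranked unused document whose percentile score meets the learned
   threshold t[k]; if no such document exists, keep the document originally
   at rank k.  neworder[k] receives the original rank of the document placed
   at new rank k (all indices 0-based). */
void rerank(const double *score, const double *t, int n, int *neworder) {
    unsigned char *used = calloc(n, 1);
    for (int k = 0; k < n; k++) {
        int pick = -1;
        for (int i = 0; i < n; i++)
            if (!used[i] && score[i] >= t[k]) { pick = i; break; }
        if (pick < 0)
            pick = k;            /* Eq. 16: fall back to the original document */
        neworder[k] = pick;
        used[pick] = 1;
    }
    free(used);
}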

This reranking method was applied independently to each TREC ad hoc Category A submission, using our fusion filter’s percentile scores. Table 7 shows estP30, estP300, and estRP for each pair of original and reranked results, along with the average over all runs and the p-value for the difference. Table 8 shows StatMAP, MAP (unjudged not relevant), and MAP (unjudged elided). All measures show a substantial improvement for nearly all runs. One notable exception is the run labeled watwp, which, ironically, is a submission by one of the authors. This run consists entirely of Wikipedia documents, so it is not surprising that spam filtering does not improve it.

Table 7 Effect of spam reranking on TREC ad hoc submissions, Category A
Table 8 Effect of spam reranking on TREC ad hoc submissions, Category A

7 Discussion

While it is common knowledge that the purpose of web spam is to subvert information retrieval methods, a relevance-based quantitative assessment of its impact—or of methods to mitigate its impact—has not previously been reported. Measurements of the prevalence of spam in retrieved results and the ability of filters to identify this spam give some indication but do not tell the whole story. The bottom line is: How much does spam hurt, and how much is this hurt salved by spam filtering? For both questions, we offer a lower bound which is substantial. A simple on-line logistic regression filter dramatically improves the effectiveness of systems participating in the TREC 2009 web ad hoc and relevance feedback tasks, including those from major web search providers, some of which employ their own spam filters. One may infer from the improvement that the impact of spam is similarly substantial. Unless, that is, the spam filter is learning some aspect of page quality apart from spamminess. We find this explanation unlikely, as the AUC scores indicate that the filters indeed identify spam. In any event the distinction is moot: If the filters have serendipitously discovered some other aspect of static relevance, what is the harm?

Several of the TREC 2009 submissions already incorporated some form of spam filtering. The yhooumd00BGM run, for example (Lin et al. 2009), used Yahoo’s spam labels and a learning method to improve its P@10 score from 0.1420 to 0.4040. Our reranking improves it further to 0.4724. The authors of uvamrftop also paid particular attention to spam; our method improves its result from 0.4100 to 0.4855; twJ48rsU (Hauff and Hiemstra 2009) is similarly improved from 0.2380 to 0.2801.

For those who wish to experiment with any of these approaches, or simply to apply the filters as described here, we make the four sets of percentile scores available for download. Using a custom compressor/decompressor, each is about 350 MB compressed and 16 GB uncompressed. Footnote 9

These spam rankings were made available to participants in the TREC 2010 Web Track, where they were widely adopted (Clarke et al. 2010). The TREC 2010 Web Track also included a spam task, which required participants to rank the documents in ClueWeb09 according to spamminess, with our spam rankings used as a baseline for the task. While several groups made reasonable efforts to apply standard graph-based quality and spam metrics to the problem, no group was able to outperform this baseline.