Quantifying retrieval bias in Web archive search

Samar, Thaer; Traub, Myriam C.; van Ossenbruggen, Jacco; Hardman, Lynda; de Vries, Arjen P.

doi:10.1007/s00799-017-0215-9

Quantifying retrieval bias in Web archive search

Open access
Published: 18 April 2017

Volume 19, pages 57–75, (2018)
Cite this article

Download PDF

You have full access to this open access article

International Journal on Digital Libraries Aims and scope Submit manuscript

Quantifying retrieval bias in Web archive search

Download PDF

Thaer Samar ORCID: orcid.org/0000-0002-9872-5258¹,
Myriam C. Traub¹,
Jacco van Ossenbruggen^1,3,
Lynda Hardman^1,4 &
…
Arjen P. de Vries²

4345 Accesses
6 Citations
1 Altmetric
Explore all metrics

Abstract

A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.

Detecting off-topic pages within TimeMaps in Web archives

Article 18 July 2016

Profiling Web Archive Coverage for Top-Level Domain and Content Language

Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Indexing and retrieving documents from a Web archive can be challenging. Web archive collections are different from conventional static Web collections. The main reasons are the continuously increasing size of Web archives and the existence of multiple versions of the same document collected at different moments in time. The different versions may appear multiple times in search results and thereby render other documents inaccessible for a user. Despite these challenges, Web archive initiatives make an effort to make their collections better accessible. For example, Gomes et al. [27] conducted a survey in 2010 on 42 Web archive initiatives around the world (26 countries). They found that $89\%$ of the initiatives support access to the Web archive by a given URL, $79\%$ support searching metadata, and $67\%$ provide full-text search over their archives. The same survey was conducted again in 2014 in order to observe the change in Web archiving since 2010 [18]. They noticed an increase in the number of initiatives (68) and the number of countries involved in Web archiving (33 countries). However, in terms of access methods, the results of 2014 are the same as those for 2010.

Previous studies showed that applying existing Information Retrieval (IR) models on Web archives leads to unsatisfactory results [17, 21]. Measuring the effectiveness of IR systems can be done using test collections. A test collection consists of a set of topics (queries), a document collection, and a set of relevance assessments. Costa and Silva [21] extended this approach by taking the characteristics of Web archives into account. Their approach includes the design of a test collection and constructing topics from the users’ query log of a functioning Web archive search system. Getting relevance judgments, however, is a costly process. An additional complication is the dependency on query logs, as they are seldom available.

To complement standard methods of IR evaluation, that focus on the assessment of efficiency and effectiveness of IR systems, Azzopardi et al. introduced retrievability as a measure for potential bias in the access of documents in a collection [7]. The retrievability score of a document counts how often the document is retrieved when a large representative set of queries is issued to the retrieval system. The overall bias in the scores among all documents in the collection induced by a retrieval system can be quantified using measures such as the Lorenz Curve [26] and the Gini Coefficient [26]. While the Lorenz Curve can be used to visualize the bias, the Gini Coefficient can be used to quantify the extent of bias for different experimental conditions.

We follow an approach similar to [7] to study how retrievability can be used to quantify retrieval bias induced by different retrieval systems on a subset of the Dutch Web archive collection from the National Library of The Netherlands^{Footnote 1} (KB).

Our main goal is to investigate how to use retrievability to evaluate a Web archive retrieval system, and how the number of document versions and the method of aggregation of crawls influence the retrieval bias in the Web archive.

Specifically, we address the following research questions:

RQ1 :: Is access to the Web archive collection influenced by a retrievability bias? Can we evaluate and compare retrieval systems on the Web archive collection using the retrievability measure to quantify their retrieval bias?

We follow the approach of [7] to quantify the overall bias imposed by different retrieval systems using the Gini Coefficient and the Lorenz Curve constructed using retrievability scores of documents in the collection.

RQ2 :: How does the number of versions of documents in the Web archive collection influence the retrievability bias of a retrieval system?

The number of versions per document in the archive varies, for example, because documents have been crawled with different frequencies or because they were added to the crawler’s seed list at different points in time. We show how multiple versions impact the retrieval bias when the granularity of retrieval in the search results is the document’s version (each version of a document is considered an independent document). We compute the retrievability score of a document by accumulating the retrievability score of its versions: a document with more versions is assigned a higher retrievability score. We show the change in bias when the multiple versions are handled by the retrieval system using two approaches to collapse documents’ versions: first, based on their content similarity; second, based on their URLs.

RQ3 :: Does a retrieval system favor specific subsets of the collection?

The Web archive collection of the KB consists of snapshots of websites from different points in time spanning 4 years. Therefore, we investigate what subset of the archive is most affected by retrieval bias.

The remainder of the paper is organized as follows. After discussing related work (Sect. 2), we describe our approach to answer the research questions introduced in this section (Sect. 3). We discuss the experimental setup in detail in (Sect. 4) and answer research questions RQ1-3 in Sects. 5, 6, 7, respectively. Finally, we discuss conclusions drawn from our findings (Sect. 8).

2 Related work

Understanding the information needs of Web archive users is an important step toward developing good access methods for Web archives. Several studies showed that full-text search is preferred [19, 20, 27, 41]. This shift from single URL search to search interfaces was described as a turning point in the history of Web archives [13].

Research in temporal IR aims to exploit temporal information in documents and queries for better query understanding and time-based ranking [1, 16, 32]. Costa and Silva [21] created a temporal test collection from the Portuguese Web Archive [28], to enable evaluation of temporal methods in IR. A test collection consists of queries (topics), documents, and the judgments by users of their relevance to the queries. When a new system is built then its effectiveness can be measured based on the test collection using evaluation metrics such as precision (for example P@10). The collection developed by Costa and Silva consists of crawls in the period from 1996 to 2009. The queries (topics) were selected from query logs, and the documents retrieved by the retrieval system were manually judged. Their method extends the Cranfield paradigm with consideration of the temporal aspect of Web archive collections. Other studies used crowdsourcing to collect relevance judgments. For example, Berberich et al. [14] used Amazon Mechanical Turk to collect queries and relevance assessments.

Retrievability was introduced to measure how likely a document is to be retrieved given an IR system [5,6,7]. Computing the retrievability scores requires the availability of a large query set, but without the need for relevance judgments. Queries can be simulated by drawing them from the content of documents in the collection. The retrievability score of a document r(d) gives an indication of how retrievable the document is compared to other documents in the collection. It is computed by accumulating the number of times this document appears in the ranked list provided for all queries, at a given cutoff rank. In order to quantify the retrievability bias across all documents in the collection, the Lorenz Curve [26] is used to visualize the bias and the Gini Coefficient [26] is used to summarize the bias. In economics, the Lorenz Curve is used to visualize the distribution of wealth or income of a population. If the wealth or income is equally distributed in the population, the accumulative distribution is a diagonal line (called the line of equality). The larger the inequality is within a population, the more the curve deviates from the equality line. The Gini Coefficient summarizes the overall inequality into a value which ranges from zero (perfect equality) to one (perfect inequality). The Gini Coefficient quantifies the retrievability inequality among documents. In the context of retrievability, the population corresponds to the document collection and wealth corresponds to the retrievability scores.

Retrievability has been used to compare different retrieval models based on the bias they impose on a given collection, and to study whether the retrieval system favors documents with particular features. For example, the system might favor long documents over shorter documents. In the following, we discuss a few studies that used retrievability. Retrievability was applied in the patent search domain [8, 11], which is recall-oriented, to quantify the retrieval bias of retrieval systems on the patent collection. The correlation between retrievability and the query set was considered in several studies. Based on a limited set of queries, the correlation between retrievability score and query relevance to the document^{Footnote 2} was analyzed [9]. Their experimental results showed that $90\%$ of highly retrievable documents when all queries were considered are not highly retrievable considering only their relevant queries. The influence of query characteristics on retrieval bias was explored in [12]. They showed that different query characteristics increase or decrease the retrieval bias differently. Query expansion was used to improve document’s retrievability [10].

Other studies investigated the relation between a system’s retrieval bias and its effectiveness. For example, Azzopardi et al. [2] showed that a positive relation exists between effectiveness and retrievability. Measuring effectiveness using precision at 10 (P@10) & Mean Average Precision (MAP), the results showed that as the effectiveness increases, the retrievability bias tends to decrease. This relationship between retrievability and effectiveness has been used to tune systems [44]. Bashir and Rauber [10] investigated the impact of query expansion on the retrievability bias. They showed that standard query expansion methods caused an increase in effectiveness and retrieval bias. They explained the increase in retrieval bias due to the assumption of query expansion methods that the top-ranked documents are relevant. However, some documents in the top-ranked results might be noise. Therefore, in order to decrease the retrieval bias, they proposed a query expansion approach based on document clustering, and they showed that their approach reduces the bias.

3 Approach

We explore how we can use retrievability to assess the retrieval bias of retrieval systems providing access to 4 years of the Dutch Web archive. In order to investigate our first research question, RQ1, we use three well-known IR models and two large query sets. For every model and query set, we compute the retrievability score (r(d)) for document versions at different rank cutoffs c. Parameter c represents the willingness of the user to explore a certain number of documents in the search results; therefore, it is independent from the retrieval model. In our study, we experiment with $c=10$, 20, 30, 40, 50, 100, and 1000. Users are known to rarely evaluate more than the first 10 search results; however, we also consider high values for c to find out whether the inequality bias would still exist if the users were willing to explore higher numbers of results. In order to allow the comparison of the retrieval models in terms of retrieval bias they impose on the documents, we need a measure to quantify the overall bias given a collection, a query set, and a retrieval system. We use the Gini Coefficient to summarize the retrieval bias, and the Lorenz Curve to visualize the retrieval bias, following [7].

A certain fraction of documents is not-retrieved by any of the retrieval models. This fraction is especially high for smaller c’s and has a strong influence on the overall bias measured by the Gini Coefficient. Therefore, we compute two variants of the Gini Coefficient. In the first variant, all documents in the collection are included; if a document is not retrieved by the model, its retrievability score is zero ($r(d)=0$). Here, the number of documents is the same for all models at all c’s (number of retrieved documents plus number of not-retrieved documents $=$ whole collection). In the second variant, only documents that are retrieved using at least one of the three retrieval models at a given c are considered. We do this by creating a union set of unique documents retrieved using at least one of the three models at the given c (3Models_union_c) for each query set. If a document was retrieved using model A, but not with model B, then the retrievability score of that document given model B is assigned a value of 0 ($r_B(d)=0$). The number of documents will be the same for all models at the same c (num. retrieved plus num. not-retrieved $=$ 3Models_union_c). Therefore, this can still be considered to provide a fair comparison across the retrieval models for a given c. Using the second variant will reduce the impact of a high fraction of documents with $r(d)=0$. A model that does not retrieve a large number of documents that were retrieved using other models will get a higher Gini Coefficient, that is, it is considered to be more biased.

In order to understand the relation between the retrievability scores and the ability to find a document in the collection, we use a known-item-search setup based on the approach proposed in [3, 4].

We quantify the impact of multiple versions of the same document on the inequality of retrieval bias, RQ2. First, we investigate the retrieval of all versions of a document. At indexing and retrieval time, we consider the document’s version as an independent document. In order to check how that affects the document’s retrievability, we compute the retrievability of a document by aggregating the retrievability scores of its versions retrieved at a given c, and thus the overall bias imposed by the model. Second, we collapse similar versions of the same document and again compute the retrievability score and the overall bias. Third, to explore the impact of the number of versions on the bias, we linearly combine the scores given by the models with a prior based on the number of versions. This allows us to measure retrieval bias at the granularity of the document, instead of a specific version.

Finally, we address our last research question, RQ3. Our Web archive collection is an accumulation of several crawls over time. We are interested in whether the bias imposed by a given retrieval system, among subsets based on the time of crawling, correlates with the number of crawled documents in that year. To explore this research question, we focus on the documents retrieved using the BM25 model; as we show in the results, it provides the least bias. Using the timestamps of the crawling time associated with documents, we split the search results for BM25 into four subsets at different c’s and then measure the retrieval bias per subset.

4 Experimental setup

In Sect. 4.1, we describe the components used to measure retrievability on the Web archive collection. In Sect. 4.2, we describe the known-item search setting to investigate the relation between retrievability score of a document and the difficulty level of finding that document.

4.1 Retrievability experimental setup

First, we introduce the Dutch Web archive collection (Sect. 4.1.1). Then, we describe how we preprocessed and indexed the collection (Sect. 4.1.2). After that, we discuss how we designed the query sets that are used to retrieve documents from the collection (Sect. 4.1.3). Finally, we discuss how to measure retrievability scores and how to quantify the overall bias imposed by a given retrieval model (Sect. 4.1.4).

4.1.1 Data set

In their Web archive, the KB preserves a growing seed set of currently more than 10,000 websites [40]. For our research, the KB provided us with a subset of the Dutch Web archive that has been harvested between February 2009 and December 2012, consisting of 76,828 Archive (ARC)^{Footnote 3} files. Each ARC file contains multiple archived records (content plus the response header), which yields a total of 148M documents. Table 1 shows the total number of archived objects, raw count and the percentage of text/html. We refer to text/html content-type objects as documents. These documents form our collection D on which we focus our analysis. Every crawled document has its own URL and the timestamp of the crawling time in addition to its content on the Web at the time of the crawl. Every document d may have multiple versions crawled at different points in time $t_i$,

$$\begin{aligned} d := \left\{ d_{v}^{t_1}, d_{v}^{t_2},\ldots , d_{v}^{t_n}\right\} \end{aligned}$$

where $d_{v}^{t_1}$ is the document’s version crawled at time $t_1$. The mean value of number of versions (total number of versions over the number of unique documents based on URLs) increases over the years, as more crawls have been added to the archive (see Table 1). The distribution of the number of versions per document is skewed (see Fig. 1 in a log scale).

Table 1 Summary of the archived objects over the years, with more details on documents of text/html content-type

Quantifying retrieval bias in Web archive search

Abstract

Similar content being viewed by others

Detecting off-topic pages within TimeMaps in Web archives

Profiling Web Archive Coverage for Top-Level Domain and Content Language

Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts

1 Introduction

2 Related work

3 Approach

4 Experimental setup

4.1 Retrievability experimental setup

4.1.1 Data set

4.1.2 Preprocessing and indexing

4.1.3 Query set

4.1.4 Retrievability assessment

4.2 Known-item search setup based on retrievability scores

5 Retrievability bias

5.1 Retrievability and findability

6 Impact of number of versions on the retrievability bias

6.1 Collapsing similar versions

6.2 Collapsing versions (URL-based)

7 Quantification of retrieval bias over the years

7.1 Time-based subsets based on time-based queries

8 Discussion and conclusions

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation