The method we propose in this contribution, which extends an earlier version of this work (Van den Bosch et al. 2015), does not rely on the overlap between search engines obtained from a near-uniform sampling of web documents; instead, it uses a sampled web page collection as the basis for extrapolating an estimate of the number of documents in a search engine’s index.
In the following subsections we describe in detail how we estimated word frequencies from a fully known training corpus in order to estimate the unknown size of a test corpus, given the frequency of a word in the test corpus. We introduce DMOZ, the selected training corpus. We explain how we took the arithmetic mean over 28 selected pivot words, and we describe how we started the longitudinal (and still running) experiment in March 2006.
Estimating word frequencies for corpus size extrapolation
On the basis of a textual corpus that is fully available, both the number of documents and the term and document frequencies of individual terms can be counted. In the context of Web search engines, however, we only have reported hit counts (or document counts), and we are usually not informed about the total number of indexed documents. Since it is the latter we are interested in, we want to estimate the number of documents indexed by a search engine indirectly from the reported document counts.
We can base such estimates on a training corpus for which we have full information on document frequencies of words and on the total number of documents. From the training corpus we can extrapolate a size estimation of any other corpus for which document counts are given. Suppose that, for example, we collect a training corpus T of 500,000 web pages, i.e. \(|T| = 500,000\). For all words w occurring on these pages we can count the number of documents they occur in, or their document count, \(d_T(w)\). A frequent word such as are may occur in 250,000 documents, i.e. it occurs in about one out of every two documents; \(d_T(are) = 250,000\). Now if the same word are is reported to occur in 1 million documents in another corpus C, i.e. its document count \(d_C(are) = 1,000,000\), we can estimate by extrapolation that this corpus will contain about \(|C| = \frac{d_C(are) \times |T|}{d_T(are)}\), i.e. 2 million documents.
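The extrapolation above can be sketched in a few lines of code; the function and variable names are ours, and the numbers are the hypothetical ones from the example:

```python
def estimate_corpus_size(d_C: int, d_T: int, T_size: int) -> float:
    """Extrapolate the size |C| of an unknown corpus from a word's document
    count d_C in that corpus, its document count d_T in the training corpus,
    and the training corpus size |T| = T_size."""
    return d_C * T_size / d_T

# Worked example from the text: |T| = 500,000, d_T(are) = 250,000,
# d_C(are) = 1,000,000, giving an estimate of 2 million documents.
print(estimate_corpus_size(1_000_000, 250_000, 500_000))  # 2000000.0
```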
There are two crucial requirements for this extrapolation to be sound. First, the training corpus needs to be representative of the corpus whose size we want to estimate. Second, the selection of words that we use as the basis for extrapolation needs to be such that the extrapolations based on their frequencies are statistically sound. We should not base our estimates on a small selection of words, let alone a single word, as frequencies of both high-frequency and low-frequency words may differ significantly among corpora. Following the most basic statistical guidelines, it is better to repeat this estimation for several words and average over the extrapolations. The number of words n should be high enough that the arithmetic means of different selections of n words are normally distributed around an average, in accordance with the Central Limit Theorem (Rice 2006).
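As a toy illustration of this point, consider simulating noisy per-word extrapolations and comparing single-word estimates with means over 28 words. The lognormal noise model and the true size of 2 million documents are our own assumptions, used only to show that averaging narrows the spread:

```python
import random
import statistics

random.seed(0)
TRUE_SIZE = 2_000_000  # assumed true corpus size (illustration only)

def one_word_estimate() -> float:
    # Assumed noise model: each word's extrapolation errs multiplicatively.
    return TRUE_SIZE * random.lognormvariate(0.0, 0.5)

def mean_estimate(n_words: int) -> float:
    return statistics.mean(one_word_estimate() for _ in range(n_words))

single = [one_word_estimate() for _ in range(1000)]
averaged = [mean_estimate(28) for _ in range(1000)]
# Averaging over 28 words shrinks the spread by roughly sqrt(28).
print(statistics.stdev(averaged) < statistics.stdev(single))  # True
```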
A random selection of word types is likely to produce a selection with relatively low frequencies, as Zipf’s second law predicts (Zipf 1935). A well-known issue in corpus linguistics is that when any two corpora are different in genre or domain, very large differences are likely to occur in the two corpora’s word frequencies and document frequencies, especially in the lower frequency bands (or the long tails) of the term distributions. It is not uncommon that half of the word types in a corpus occur only once; many of these terms will not occur in another disjoint corpus, even if it represents the same genre. This implies that extrapolations should not be based on a random selection of terms, many of which will have a low frequency of occurrence. The selection of words should sample several high-frequency words but preferably also several other words with frequencies spread across the other frequency bands.
It should be noted that Zipf’s law concerns word frequencies, not document frequencies. Words with a higher frequency tend to recur more than once within single documents; the higher a word’s frequency, the further its document frequency falls below its word frequency. A ceiling effect thus occurs with the most frequent words if the corpus contains documents of sufficient size: they tend to occur in nearly all documents, making their document frequencies fall within the same order of magnitude as the number of documents in the corpus, while their word token frequencies still differ to the degree predicted by Zipf’s law (Zipf 1935). This fact is not problematic for our estimation goal, but it hinges on the assumption that the training corpus and the new corpus whose frequencies are unknown contain documents of about the same average size.
Selecting a representative corpus: DMOZ
As our purpose is to estimate the size of a Web search engine’s index, we must make sure that our training corpus is representative of the web, containing documents with a representative average size. This is quite an ambitious goal. We chose to generate a randomly filtered selection of 531,624 web pages from the DMOZ web directory. We made this selection in the spring of 2006. To arrive at this selection, first a random selection was made of 761,817 DMOZ URLs, which were crawled. Besides non-existing pages, we also filtered out pages with frames, server redirects beyond two levels, and client redirects. In total, the resulting DMOZ selection of 531,624 documents contains 254,094,395 word tokens (4,395,017 unique word types); the average DMOZ document contains 478 words.
We then selected a sequence of pivot words by their frequency rank, starting with the most frequent word in the DMOZ data and selecting an exponential series in which we increase the selection rank number by a low exponent, viz. 1.6. We ended up with a selection of the following 28 pivot words, the first ten being high-frequency function words and auxiliary verbs: the, and, of, to, for, on, are, was, can, do, people, very, show, photo, headlines, william, basketball, spread, nfl, preliminary, definite, psychologists, vielfalt, illini, chèque, accordée, reticular, rectificació. The DMOZ directory is multilingual, but English dominates; it is not surprising that the tail of this list contains words from several languages.
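The exact rank sequence is not spelled out above, but one plausible reconstruction of the exponential selection with factor 1.6 is the following sketch (the function name and the rounding scheme are our assumptions):

```python
def pivot_ranks(n_words: int = 28, factor: float = 1.6) -> list:
    """Generate frequency ranks spaced by a fixed exponential factor,
    starting at rank 1 (the most frequent word)."""
    ranks = []
    r = 1.0
    while len(ranks) < n_words:
        rank = round(r)
        if rank not in ranks:  # skip duplicates among the lowest ranks
            ranks.append(rank)
        r *= factor
    return ranks

# The 28 ranks span the whole frequency spectrum, from rank 1 down to
# ranks deep in the low-frequency tail.
print(pivot_ranks()[:8])  # [1, 2, 3, 4, 7, 10, 17, 27]
```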
Our estimation method then consists of retrieving document counts for all 28 pivot words from the search engine whose index size we wish to estimate, obtaining an extrapolated estimate for each word, and taking the arithmetic mean over the 28 estimations. If a word is not reported to occur in any document (which hardly ever happens with Google or Bing), it is not included in the average. In preliminary experiments we tested different selections of 28 words, using different starting words but the same exponential rank factor of 1.6, and found closely matching averages of the computed extrapolations.
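A minimal sketch of this procedure, with invented document counts (the real counts come from the search engine and from the DMOZ statistics):

```python
def estimate_index_size(hit_counts: dict, train_doc_counts: dict,
                        train_size: int) -> float:
    """Mean of per-word extrapolations, skipping words with zero hits."""
    estimates = [
        hit_counts[w] * train_size / train_doc_counts[w]
        for w in train_doc_counts
        if hit_counts.get(w, 0) > 0  # a zero hit count is left out of the mean
    ]
    return sum(estimates) / len(estimates)

# Invented counts for two pivot words, with |T| = 531,624 (the DMOZ selection).
train = {"the": 400_000, "basketball": 5_000}
hits = {"the": 8_000_000_000, "basketball": 120_000_000}
print(estimate_index_size(hits, train, 531_624))  # 11695728000.0
```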
To stress-test the assumption that the DMOZ document frequencies of our 28 pivot words yield sensible estimates of corpus size, we estimated the size of a range of corpora: the New York Times part of the English Gigaword corpus (newspaper articles published between 1993 and 2001), the Reuters RCV1 corpus (newswire articles), the English Wikipedia (encyclopedic articles, excluding pages that redirect or disambiguate), and a held-out sample of random DMOZ pages (not overlapping with the training set, but drawn from the same source). If our assumptions are correct, the size of the latter test corpus should be fairly accurate. Table 1 provides an overview of the estimations on these widely different corpora. The size of the New York Times corpus is overestimated by a large margin of 126 %. The size of the Wikipedia corpus is only mildly overestimated by 3.6 %. The sizes of the Reuters and DMOZ corpora are underestimated. The size of the DMOZ sample is indeed relatively accurately estimated, with a small underestimation of 1.3 %.
Table 1 Real versus estimated numbers (with standard deviations) of documents on four textual corpora, based on the DMOZ training corpus statistics: two news resources (top two) and two collections of web pages (bottom two)
Taking the arithmetic mean
The standard deviations of the averages listed in Table 1, computed over the 28 pivot words, indicate that the per-word estimates are dispersed over quite a large range. Figure 1 illustrates this for the case of the Wikipedia corpus (the third data line of Table 1). There is a tendency for the pivot words in the highest frequency range (the, of, to, and especially was) to cause overestimations, but this is offset by relatively accurate estimates from pivot words with a mid-range frequency such as very, basketball, and definite, and underestimations from low-frequency words such as vielfalt and chèque. The DMOZ frequency of occurrence and the estimated number of documents in Wikipedia are only weakly correlated, with a Pearson’s \(R = 0.48\), but there is an observable trend of low frequencies causing underestimations and high frequencies causing overestimations. The log-linear regression function with the smallest residual sum of squares is \(y = 204{,}224 \times \ln(x) - 141{,}623\), where x is the DMOZ document frequency and y the estimated number of documents; it is visualized as the slanted dotted line in Fig. 1. Arguably, selecting exponentially spaced pivot words across the whole frequency spectrum leads to a large standard deviation, but a reasonably accurate mean estimate on collections of web pages.
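Such a log-linear fit can be reproduced with an ordinary least-squares fit on log-transformed frequencies. The data points below are synthetic placeholders generated from the fitted function itself, not the actual per-word estimates:

```python
import numpy as np

def fit_log_linear(freqs, estimates):
    """Least-squares fit of estimate = a * ln(freq) + b; returns (a, b)."""
    a, b = np.polyfit(np.log(freqs), estimates, 1)
    return a, b

# Synthetic data generated exactly from the fitted function in the text,
# so the fit should recover the coefficients 204,224 and -141,623.
freqs = np.array([10.0, 100.0, 1_000.0, 10_000.0, 100_000.0])
ests = 204_224 * np.log(freqs) - 141_623
a, b = fit_log_linear(freqs, ests)
print(round(a), round(b))  # 204224 -141623
```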
Setting up the longitudinal experiment
After having designed this experiment in March 2006, we started to run it on a daily basis on March 13, 2006, and have done so ever since. Each day we send the 28 DMOZ words as queries to two search engines: Bing and Google. We retrieve the reported number of indexed pages on which each word occurs (i.e., the hit counts) as returned by the web interface of both search engines, not their APIs. These hit counts were extracted from the first page of results using regular expressions. The reported hit count is typically rounded: it retains three or four significant digits, with the rest padded with zeroes. For each word we use the reported document count to extrapolate an estimate of the search engine’s index size, and we average over the extrapolations of all words. The web interfaces of the search engines have gone through some changes, and the time required to adapt to these changes sometimes caused lags of several days in our measurements. For Google, 3027 data points were logged, which is 93.6 % of the 3235 days between March 13, 2006 and January 20, 2015. For Bing, this percentage is 92.8 % (3002 data points).
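The rounding behaviour of the reported hit counts can be mimicked as follows; whether the engines round or truncate internally is unknown to us, so the use of round-half rounding here is an assumption:

```python
import math

def round_hit_count(n: int, sig_digits: int = 3) -> int:
    """Keep the first sig_digits significant digits of n, padding the
    remaining positions with zeroes, as search engines appear to do."""
    if n == 0:
        return 0
    magnitude = int(math.floor(math.log10(abs(n)))) + 1
    factor = 10 ** max(magnitude - sig_digits, 0)
    return round(n / factor) * factor

print(round_hit_count(12_345_678))     # 12300000
print(round_hit_count(12_345_678, 4))  # 12350000
```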