Estimating deep web data source size by capture–recapture method
This paper addresses the problem of estimating the size of a deep web data source that is accessible only through queries. Since most deep web data sources are non-cooperative, the size of a data source can only be estimated by sending queries and analyzing the returned results. We propose an efficient estimator based on the capture–recapture method. First, we derive an equation relating the overlapping rate to the percentage of the data examined when random samples are drawn from a uniform distribution. This equation is conceptually simple and leads to the derivation of an estimator for samples obtained by random queries. Because random queries do not produce random documents, traditional methods based on random queries are well known to underestimate the size, i.e., those estimators have negative bias. Starting from the simple estimator for random samples, we adjust the equation so that it can handle the samples returned by random queries. We conduct both simulation studies and experiments on corpora including Gov2, Reuters, Newsgroups, and Wikipedia. The results show that our method has small bias and standard deviation.
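To make the capture–recapture idea concrete, the following is a minimal sketch using the classical two-sample Lincoln-Petersen estimator, which is not the estimator proposed in this paper. It contrasts uniform random samples with samples whose capture probability varies across documents, a rough stand-in for the way random queries favor some documents over others. The function names, the Zipf-like weights, and the corpus and sample sizes are all assumptions chosen for illustration.

```python
import random


def lincoln_petersen(first_sample, second_sample):
    """Classical two-sample capture-recapture estimate of population size.

    This is the textbook estimator n1 * n2 / m, not the paper's estimator.
    """
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)  # documents seen in both samples
    if recaptured == 0:
        raise ValueError("no overlap between samples; size cannot be estimated")
    return len(first) * len(second) / recaptured


def estimate_uniform(true_size=100_000, sample_size=5_000, seed=0):
    """Both samples are uniform random draws of document ids."""
    rng = random.Random(seed)
    ids = range(true_size)
    first = rng.sample(ids, sample_size)
    second = rng.sample(ids, sample_size)
    return lincoln_petersen(first, second)


def estimate_skewed(true_size=10_000, draws=2_000, seed=0):
    """Both samples favor the same documents (hypothetical Zipf-like weights).

    Unequal capture probabilities inflate the overlap, so the classical
    estimate falls below the true size.
    """
    rng = random.Random(seed)
    ids = list(range(true_size))
    weights = [1.0 / (rank + 1) for rank in ids]  # assumed skew, illustration only
    first = rng.choices(ids, weights=weights, k=draws)
    second = rng.choices(ids, weights=weights, k=draws)
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    print("uniform samples, true size 100000:", round(estimate_uniform()))
    print("skewed samples,  true size 10000: ", round(estimate_skewed()))
```

With uniform samples the estimate should land close to the true size, while the skewed samples should yield a clear underestimate, which is the negative bias the abstract attributes to estimators based on random queries.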