Information Retrieval, Volume 13, Issue 1, pp 70–95

Estimating deep web data source size by capture–recapture method

Abstract

This paper addresses the problem of estimating the size of a deep web data source that is accessible by queries only. Since most deep web data sources are non-cooperative, the size of a data source can only be estimated by sending queries and analyzing the returned results. We propose an efficient estimator based on the capture–recapture method. First, we derive an equation relating the overlapping rate to the percentage of the data examined when random samples are retrieved from a uniform distribution. This equation is conceptually simple and leads to the derivation of an estimator for samples obtained by random queries. Since random queries do not produce random documents, it is well known that traditional methods based on random queries underestimate the size, i.e., those estimators have negative bias. Starting from the simple estimator for random samples, we adjust the equation so that it can handle the samples returned by random queries. We conduct both simulation studies and experiments on corpora including Gov2, Reuters, Newsgroups, and Wikipedia. The results show that our method has small bias and standard deviation.
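As a rough illustration of the capture–recapture idea the abstract builds on, the following minimal sketch applies the textbook two-sample (Lincoln–Petersen) estimate to uniform random samples. It is not the estimator proposed in the paper, whose equation is not given here. The function names, the population size N, and the sample sizes are illustrative assumptions, and the overlapping rate is assumed to mean the total number of retrieved items divided by the number of distinct items.

```python
import random


def lincoln_petersen(first, second):
    """Textbook two-sample capture-recapture estimate: n1 * n2 / m.

    n1 and n2 are the numbers of distinct items in each sample and
    m is the number of items that appear in both.
    """
    first_ids, second_ids = set(first), set(second)
    recaptured = len(first_ids & second_ids)
    if recaptured == 0:
        raise ValueError("no overlap between samples; estimate undefined")
    return len(first_ids) * len(second_ids) / recaptured


def overlapping_rate(samples):
    """Assumed definition: total items retrieved / distinct items retrieved."""
    total = sum(len(s) for s in samples)
    unique = len(set().union(*samples))
    return total / unique


# Simulated data source of N documents, sampled uniformly at random.
# This is the idealized case; random queries do not behave this way.
rng = random.Random(0)
N = 10_000
first = rng.sample(range(N), 1_000)
second = rng.sample(range(N), 1_000)

estimate = lincoln_petersen(first, second)
rate = overlapping_rate([first, second])
print(f"estimated size: {estimate:.0f}, overlapping rate: {rate:.3f}")
```

With uniform samples the estimate should land reasonably close to N. The abstract's point is that documents returned by random queries are not uniform samples, which is why an unadjusted estimate of this kind tends to be too low.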

Keywords

Deep web, Estimators, Capture–recapture

Notes

Acknowledgments

We would like to thank the reviewers for their insightful comments, and Jie Liang for providing the query interface for the corpora. The research is supported by NSERC (Natural Sciences and Engineering Research Council of Canada), SSHRC (Social Sciences and Humanities Research Council of Canada), and the State Key Laboratory for Novel Software Technology, Nanjing University.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. School of Computer Science, University of Windsor, Windsor, Canada
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  3. Department of Economics, University of Windsor, Windsor, Canada
