Abstract

A large portion of today’s Web consists of pages populated with information from a myriad of online databases. This part of the Web, known as the deep Web, remains relatively unexplored, and even basic characteristics, such as the number of searchable databases on the Web or the distribution of databases by subject, are still disputed. In this paper, we revisit a problem of deep Web characterization: how can the total number of online databases on the Web be estimated? We propose the Host-IP clustering sampling method to address the drawbacks of existing approaches to deep Web characterization, and report our findings from a survey of the Russian Web. The estimates obtained, together with the proposed sampling technique, could be useful in further studies of how to handle data in the deep Web.
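The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea, assuming that hosts sharing an IP address are merged into clusters and that a simple random sample of those clusters is inspected for databases. The function names (`cluster_hosts_by_ip`, `estimate_total`), the input layout, and the use of the textbook cluster-sampling estimator are illustrative assumptions, not the paper's actual procedure.

```python
from collections import defaultdict
from statistics import mean, variance


def cluster_hosts_by_ip(host_ips):
    """Merge hosts that share at least one IP address into one cluster.

    host_ips: dict mapping hostname -> iterable of IP strings
    (a hypothetical input format, e.g. built from DNS lookups).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every host to each of its IPs; hosts reachable through a
    # shared IP (virtual hosting, DNS load balancing) end up together.
    for host, ips in host_ips.items():
        find(("host", host))
        for ip in ips:
            union(("host", host), ("ip", ip))

    clusters = defaultdict(set)
    for host in host_ips:
        clusters[find(("host", host))].add(host)
    return list(clusters.values())


def estimate_total(sampled_counts, total_clusters):
    """Estimate a population total from a simple random sample of clusters.

    sampled_counts: number of databases found in each sampled cluster.
    total_clusters: number of clusters in the whole population (N).
    Returns (estimate, standard_error) using the standard estimator
    N * mean(y) with finite population correction.
    """
    n = len(sampled_counts)
    if n == 0:
        raise ValueError("need at least one sampled cluster")
    estimate = total_clusters * mean(sampled_counts)
    if n < 2:
        return estimate, float("nan")
    s2 = variance(sampled_counts)  # sample variance of per-cluster counts
    var = total_clusters ** 2 * (1 - n / total_clusters) * s2 / n
    return estimate, var ** 0.5
```

For example, calling `estimate_total` with the per-cluster database counts from an inspected sample and the total number of clusters in the surveyed domain yields a point estimate and its standard error. Sampling whole clusters rather than individual hosts is meant to keep hosts that serve the same content from being counted as independent units.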

Keywords

deep Web, web databases, web characterization, DNS load balancing, virtual hosting, Host-IP clustering, random sampling, national web domain

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Denis Shestakov (1)
  1. Department of Media Technology, Aalto University, Espoo, Finland
