Information Retrieval, Volume 17, Issue 3, pp 203–228

Discover hidden web properties by random walk on bipartite graph

  • Yan Wang
  • Jie Liang
  • Jianguo Lu
Article

Abstract

This paper proposes using random walk (RW) to discover the properties of deep web data sources that are hidden behind searchable interfaces. These properties, such as the average degree and population size of both documents and terms, are of interest to the general public and have applications in business intelligence, data integration and deep web crawling. We show that a simple RW can outperform uniform random (UR) sampling, even when the high cost of obtaining UR samples is disregarded. We prove that in the idealized case where the degrees follow Zipf's law, the sample size of UR sampling needs to grow in the order of $O(N/\ln^2 N)$ with the corpus size $N$, while the sample size of RW sampling grows logarithmically. The Reuters corpus is used to demonstrate that term degrees resemble a power-law distribution, so RW is better than UR sampling for terms. Document degrees, on the other hand, follow a lognormal distribution and exhibit a smaller variance, so UR sampling is slightly better for documents.
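As an illustration of the idea, the following is a minimal sketch of a simple random walk on a document-term bipartite graph, used to estimate the average term degree. It is not the authors' implementation: the function names, the toy corpus format, and the harmonic-mean correction are assumptions made for this example. The correction relies on a standard fact about simple random walks on connected graphs, namely that nodes are visited with probability proportional to their degree, so the plain average of sampled degrees is biased upward.

```python
import random
from collections import defaultdict


def build_bipartite(docs):
    """Build adjacency lists from a dict mapping doc id -> set of terms.

    The input format is an assumption for this sketch; a real hidden
    source would only be reachable through its search interface.
    """
    doc_adj = {d: sorted(terms) for d, terms in docs.items()}
    term_adj = defaultdict(list)
    for d, terms in doc_adj.items():
        for t in terms:
            term_adj[t].append(d)
    return doc_adj, dict(term_adj)


def random_walk_term_degrees(doc_adj, term_adj, start_term, steps, rng):
    """Walk term -> document -> term and record each visited term's degree.

    Moving from a term to a document mimics issuing the term as a query
    and picking one matching document uniformly at random; moving back
    picks one term of that document uniformly at random.
    """
    degrees = []
    term = start_term
    for _ in range(steps):
        doc = rng.choice(term_adj[term])
        term = rng.choice(doc_adj[doc])
        degrees.append(len(term_adj[term]))
    return degrees


def harmonic_mean_degree(degrees):
    """Estimate the average degree from degree-biased samples.

    Terms are sampled with probability proportional to degree, so the
    harmonic mean (not the arithmetic mean) targets the true average.
    """
    return len(degrees) / sum(1.0 / d for d in degrees)


if __name__ == "__main__":
    # Hypothetical toy corpus: one frequent term and four rare ones.
    corpus = {
        "d1": {"the", "a"},
        "d2": {"the", "b"},
        "d3": {"the", "c"},
        "d4": {"the", "d"},
    }
    doc_adj, term_adj = build_bipartite(corpus)
    sample = random_walk_term_degrees(
        doc_adj, term_adj, "the", 20000, random.Random(0)
    )
    print("naive mean:", sum(sample) / len(sample))
    print("harmonic-mean estimate:", harmonic_mean_degree(sample))
```

In the toy corpus the true average term degree is 8/5 = 1.6. The naive mean of the walk's samples is pulled toward the high-degree term, while the harmonic-mean estimate should settle near the true value as the walk lengthens.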

Keywords

Hidden data source, Deep web, Random walk, Graph sampling, Estimator, Zipf's law

Notes

Acknowledgements

The authors thank Dingding Li for helpful discussions. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the State Key Laboratory for Novel Software Technology at Nanjing University.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. School of Information, Central University of Finance and Economics, Beijing, China
  2. BiblioCommons Inc, Toronto, Canada
  3. School of Computer Science, University of Windsor, Windsor, Canada
  4. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
