Discover hidden web properties by random walk on bipartite graph

Abstract

This paper proposes using random walk (RW) to discover the properties of deep web data sources that are hidden behind searchable interfaces. These properties, such as the average degree and population size of both documents and terms, are of interest to the general public and find applications in business intelligence, data integration and deep web crawling. We show that simple RW can outperform uniform random (UR) sampling even when the high cost of UR sampling is disregarded. We prove that in the idealized case where the degrees follow Zipf’s law, the sample size of UR sampling needs to grow in the order of O(N/ln² N) with the corpus size N, while the sample size of RW sampling grows logarithmically. The Reuters corpus is used to demonstrate that the term degrees resemble a power-law distribution, so RW is better than UR sampling. Document degrees, on the other hand, follow a lognormal distribution and exhibit a smaller variance, so UR sampling is slightly better.

Notes

  1. http://www.worldwidewebsize.com/

  2. Available at http://qwone.com/~jason/20Newsgroups/.

  3. http://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge. Our data contains 17,118 publication venues and the 980,039 keywords that occurred in the venues.

  4. By Eq. 38 when γ = 0.

References

  1. Amstrup, S., McDonald, T. & Manly, B. (2005). Handbook of capture–recapture analysis. Princeton, NJ: Princeton University Press.

  2. Bar-Yossef, Z. & Gurevich, M. (2006). Random sampling from a search engine’s index. In Proceedings of the 15th international conference on World Wide Web (pp. 367–376). Edinburgh, Scotland: ACM.

  3. Bar-Yossef, Z. & Gurevich, M. (2008). Random sampling from a search engine’s index. Journal of the ACM, 55(5), 1–74.

  4. Bar-Yossef, Z. & Gurevich, M. (2011). Efficient search engine measurements. ACM Transactions on the Web (TWEB), 5(4), 1–48.

  5. Bergman, M. K. (2001). White paper: The deep web: Surfacing hidden value. Journal of Electronic Publishing, 7(1).

  6. Bharat, K., & Broder, A. (1998). A technique for measuring the relative size and overlap of public web search engines. Computer Networks and ISDN Systems, 30(1–7), 379–388.

  7. Broder, A., et al. (2006). Estimating corpus size via queries. In CIKM (pp. 594–603). ACM.

  8. Callan, J., & Connell, M. (2001). Query-based sampling of text databases. ACM Transactions on Information Systems (TOIS), 19(2), 97–130.

  9. Callan, J., Connell, M., & Du, A. (1999). Automatic discovery of language models for text databases. ACM SIGMOD Record, 28(2), 479–490.

  10. Chao, A., Lee, S. & Jeng, S. (1992). Estimating population size for capture–recapture data when capture probabilities vary by time and individual animal. Biometrics, 48(1), 201–216.

  11. Cochran, W. (1977). Sampling techniques. New York: Wiley.

  12. Darroch, J. (1958). The multiple-recapture census: I. Estimation of a closed population. Biometrika, 45(3/4), 343–359.

  13. Dasgupta, A., Das, G. & Mannila, H. (2007). A random walk approach to sampling hidden databases. In SIGMOD (pp. 629–640). ACM.

  14. Dasgupta, A., Jin, X., Jewell, B., Zhang, N. & Das, G. (2010). Unbiased estimation of size and other aggregates over hidden web databases. In SIGMOD (pp. 855–866). ACM.

  15. Gjoka, M., Kurant, M., Butts, C. & Markopoulou, A. (2009). A walk in Facebook: Uniform sampling of users in online social networks. arXiv preprint arXiv:0906.0060.

  16. Gjoka, M., Kurant, M., Butts, C., & Markopoulou, A. (2011). Practical recommendations on crawling online social networks. IEEE Journal on Selected Areas in Communications, 29(9), 1872–1892.

  17. Gulli, A., & Signorini, A. (2005). The indexable web is more than 11.5 billion pages. In Special interest tracks and posters of the 14th international conference on World Wide Web (pp. 902–903). ACM.

  18. Haas, P. J., Naughton, J. F., Seshadri, S., & Stokes, L. (1995). Sampling-based estimation of the number of distinct values of an attribute. In VLDB (pp. 311–322).

  19. Hansen, M. & Hurwitz, W. (1943). On the theory of sampling from finite populations. The Annals of Mathematical Statistics, 14(4), 333–362.

  20. Henzinger, M., Heydon, A., Mitzenmacher, M., & Najork, M. (2000). On near-uniform URL sampling. Computer Networks, 33(1–6), 295–308.

  21. Ipeirotis, P. G., Gravano, L., & Sahami, M. (2001). Probe, count, and classify: categorizing hidden web databases. In SIGMOD (pp. 67–78). ACM.

  22. Katzir, L., Liberty, E., & Somekh, O. (2011). Estimating sizes of social networks via biased sampling. In WWW (pp. 597–606). ACM.

  23. Kurant, M., Markopoulou, A., & Thiran, P. (2011). Towards unbiased bfs sampling. IEEE Journal on Selected Areas in Communications, 29(9), 1799–1809.

  24. Lawrence, S., & Giles, C. L. (1998). Searching the world wide web. Science, 280(5360), 98–100.

  25. Leskovec, J., & Faloutsos, C. (2006). Sampling from large graphs. In SIGKDD (pp. 631–636). ACM.

  26. Liu, J. (2008). Monte Carlo strategies in scientific computing. New York: Springer.

  27. Lovász, L. (1993). Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 2(1), 1–46.

  28. Lu, J. (2008). Efficient estimation of the size of text deep web data source. In Proceedings of the 17th ACM conference on Information and knowledge management (pp. 1485–1486). ACM.

  29. Lu, J. (2010). Ranking bias in deep web size estimation using capture recapture method. Data & Knowledge Engineering, 69(8), 866–879.

  30. Lu, J., & Li, D. (2010). Estimating deep web data source size by capture–recapture method. Information Retrieval, 13(1), 70–95.

  31. Lu, J., & Li, D. (2012). Sampling online social networks by random walk. In ACM SIGKDD workshop on hot topics in online social networks (pp. 33–40). ACM.

  32. Lu, J. & Li, D. (2013, in press). Bias correction in small sample from big data. IEEE Transactions on Knowledge and Data Engineering, TKDE.

  33. Lu, J., Wang, Y., Liang, J., Chen, J., & Liu, J. (2008). An approach to deep web crawling by sampling. In IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, 2008. WI-IAT’08 (Vol. 1, pp. 718–724).

  34. Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., & Halevy, A. (2008). Google’s deep web crawl. Proceedings of the VLDB Endowment 1(2), 1241–1252.

  35. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087.

  36. Montemurro, M. (2001). Beyond the Zipf–Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and Its Applications, 300(3), 567–578.

  37. Newman, M. (2010). Networks: An introduction. Oxford: Oxford University Press.

  38. Olston, C., & Najork, M. (2010). Web Crawling. Foundations and Trends in Information Retrieval, 4(3), 175–246.

  39. Papagelis, M., Das, G., & Koudas, N. (2011). Sampling online social networks. IEEE Transactions on Knowledge and Data Engineering, 99, 1–1.

  40. Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden web. In VLDB (pp. 129–138). Morgan Kaufmann Publishers Inc.

  41. Rasti, A., Torkjazi, M., Rejaie, R., Duffield, N., Willinger, W. & Stutzbach, D. (2009). Respondent-driven sampling for characterizing unstructured overlays. In INFOCOM, IEEE (pp. 2701–2705).

  42. Reuters, T. (2008). Reuters corpus. http://about.reuters.com/researchandstandards/corpus/, December 2008.

  43. Salganik, M., & Heckathorn, D. (2004). Sampling and estimation in hidden populations using respondent-driven sampling. Sociological methodology, 34(1), 193–240.

  44. Shokouhi, M., & Si, L. (2011). Federated search. Hanover, MA: Now Publishers.

  45. Shokouhi, M., Zobel, J., Scholer, F., & Tahaghoghi, S. M. M. (2006). Capturing collection size for distributed non-cooperative retrieval. In SIGIR (pp. 316–323). ACM.

  46. Si, L., Jin, R., Callan, J., & Ogilvie, P. (2002). A language modeling framework for resource selection and results merging. In Proceedings of the 11th CIKM (pp. 391–397). ACM.

  47. Thompson, S. (2012). Sampling. New York: Wiley.

  48. Wang, Y., Lu, J., Liang, J., Chen, J. & Liu, J. (2012). Selecting queries from sample to crawl deep web data sources. Web Intelligence and Agent Systems, 10(1), 75–88.

  49. Wejnert, C., & Heckathorn, D. (2008). Web-based network sampling. Sociological Methods & Research, 37(1), 105–134.

  50. Wu, P., Wen, J., Liu, H., & Ma, W. (2006). Query selection techniques for efficient crawling of structured web sources. In ICDE, IEEE.

  51. Ye, S., & Wu, S. (2011). Estimating the size of online social networks. International Journal of Social Computing and Cyber-Physical Systems, 1(2), 160–179.

  52. Zhang, M., Zhang, M. N., & Das, G. (2011). Mining a search engine’s corpus: efficient yet unbiased sampling and aggregate estimation. In SIGMOD (pp. 793–804). ACM.

  53. Zhou, J., Li, Y., Adhikari, V., & Zhang, Z. (2011). Counting youtube videos via random prefix sampling. In SIGCOMM (pp. 371–380). ACM.

  54. Zipf, G. (1949). Human behavior and the principle of least effort.

Acknowledgements

The authors thank Dingding Li for helpful discussions. The work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the State Key Laboratory for Novel Software Technology at Nanjing University.

Author information

Corresponding author

Correspondence to Jianguo Lu.

Appendix

Both Theorems 1 and 2 assume that the degrees follow the Zipf–Mandelbrot law (Montemurro 2001), which states that if the term degrees \(d_i\) are sorted in descending order, then

$$d_i = \frac{A}{\alpha+i},$$
(26)

where α and A are constants and α ≪ N. All the degrees sum to τ, i.e.,

$$\sum_{1}^{N} d_i \approx \int\limits_{1}^{N} \frac{A}{\alpha+x} dx \approx A \ln (\frac{\alpha + N}{\alpha+1}) =A\ln B=\tau,$$
(27)

where we use \(B = (\alpha + N)/(\alpha + 1)\) to make our derivations more concise. Therefore the normalizing constant is \(A=\tau/\ln B\). In addition, since N is a very large number, \(\sum_{i=1}^{N} d_i^2\) can be approximated by the following:

$$\sum_{1}^{N} d_i^2 \approx \int\limits_{1}^{N} \frac{A^2}{(\alpha+x)^2} dx \approx \frac{A^2}{\alpha+1}.$$
(28)
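The integral approximations in Eqs. 27 and 28 can be checked numerically. The following is a minimal sketch, not part of the original derivation: the values of N and α are illustrative assumptions, and the helper names `zipf_mandelbrot_degrees` and `closed_forms` are hypothetical. It compares the exact sums over the degrees of Eq. 26 with the two closed forms.

```python
import math

def zipf_mandelbrot_degrees(N, alpha, A=1.0):
    """Degrees d_i = A / (alpha + i) for ranks i = 1..N (Eq. 26)."""
    return [A / (alpha + i) for i in range(1, N + 1)]

def closed_forms(N, alpha, A=1.0):
    """Closed-form approximations of sum(d_i) (Eq. 27) and sum(d_i^2) (Eq. 28)."""
    B = (alpha + N) / (alpha + 1)
    return A * math.log(B), A * A / (alpha + 1)

# Illustrative parameters only; they are not taken from the paper.
N, alpha = 100_000, 10
d = zipf_mandelbrot_degrees(N, alpha)

exact_sum = math.fsum(d)
exact_sum_sq = math.fsum(x * x for x in d)
approx_sum, approx_sum_sq = closed_forms(N, alpha)

# Relative errors of the two integral approximations.
rel_err_sum = abs(exact_sum - approx_sum) / exact_sum
rel_err_sum_sq = abs(exact_sum_sq - approx_sum_sq) / exact_sum_sq

print(f"sum d_i:   exact={exact_sum:.4f}  approx={approx_sum:.4f}  rel.err={rel_err_sum:.2%}")
print(f"sum d_i^2: exact={exact_sum_sq:.4f}  approx={approx_sum_sq:.4f}  rel.err={rel_err_sum_sq:.2%}")
```

Both closed forms replace a sum by an integral, so they are approximations. The error of Eq. 28 is the larger of the two and grows as α becomes small.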

Proof of Theorem 1

Based on Eqs. 27 and 28, the variance of all the degrees is

$$\begin{aligned} \sigma^2 &=\langle d^2 \rangle-\langle d \rangle^2 = \langle d \rangle^2 \left[ N \frac{\sum_{1}^{N} d_i^2}{ (\sum_{1}^{N} d_i)^2} -1\right]\\ &\approx \langle d \rangle^2 \left[ \frac{N}{ (\alpha+1) \ln^2 B}-1 \right]. \end{aligned}$$
(29)

Using Eq. 14 the variance of \(\widehat{\langle d \rangle}_{SM}\) is

$$var(\widehat{\langle d \rangle}_{SM})= \frac{\langle d \rangle^2}{n} \left[ \frac{N}{ (\alpha+1) \ln^2 B}-1 \right].$$
(30)

Proof of Theorem 2

When nodes are sampled with simple RW, the asymptotic probability of node i being visited is \(p_i = d_i/\tau\). When n nodes \((x_1, x_2, \ldots, x_n)\) are sampled, where each \(x_i \in \{1, \ldots, N\}\), the Hansen–Hurwitz estimator of the population size N is (Thompson 2012):

$$\widehat{N}_{H}=\frac{1}{n}\sum_{i=1}^{n} \frac{1}{p_{x_i}}=\frac{\tau}{n} \sum_{1}^{n} \frac{1}{d_{x_i}},$$
(31)

and the variance of \(\widehat{N_H}\) is (Thompson 2012):

$$var(\widehat{N_H})=\frac{1}{n}\sum_{i=1}^{N}p_i\left(\frac{1}{p_i}-N\right)^2.$$
(32)
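The estimator of Eq. 31 can be illustrated on a toy population. The sketch below makes two assumptions that are not in the text: the nodes are drawn independently with probability \(p_i = d_i/\tau\), which stands in for the stationary distribution of the walk (a real random walk produces correlated samples), and τ is taken as known. The function name `hansen_hurwitz_size` and all parameter values are hypothetical.

```python
import random

def hansen_hurwitz_size(sampled_degrees, tau):
    """Eq. 31: N_hat = (tau / n) * sum(1 / d_x) over the n sampled nodes."""
    n = len(sampled_degrees)
    return tau / n * sum(1.0 / d for d in sampled_degrees)

# Toy population with Zipf-Mandelbrot degrees (Eq. 26); values are illustrative.
N, alpha, A = 10_000, 10, 1000.0
degrees = [A / (alpha + i) for i in range(1, N + 1)]
tau = sum(degrees)  # assumed known here

# Assumption: independent draws with p_i = d_i / tau approximate the
# asymptotic visiting probabilities of a simple random walk.
rng = random.Random(0)
sample = rng.choices(degrees, weights=degrees, k=5_000)

N_hat = hansen_hurwitz_size(sample, tau)
print(f"N_hat = {N_hat:.0f} (true N = {N})")
```

Each sampled node contributes \(1/p_{x_i}\), so frequently visited high-degree nodes are down-weighted and the average is an unbiased estimate of N under the independence assumption.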

Expanding the square with \(\sum_{i=1}^{N} p_i = 1\), replacing \(p_i\) with \(d_i/\tau\), and expanding \(d_i\) as \(A/(\alpha + i)\) with α ≪ N, we have

$$var(\widehat{N_H})=\frac{1}{n} \left( \frac{\tau}{A} \sum_{1}^{N} i-N^2 \right) \approx \frac{N^2}{n} \left( \frac{\ln B}{2} -1\right).$$
(33)

The Taylor expansion of \(\widehat{\langle d \rangle}_H\) around N is

$$\widehat{\langle d \rangle}_H= \frac{\tau}{\widehat{N_H}} = \tau \left( \frac{1}{N}-\frac{\widehat{N_H}-N}{N^2}+\cdots \right).$$
(34)

By the Delta method, the variance of \(\widehat{\langle d \rangle}_H\) is

$$var(\widehat{\langle d \rangle}_H) =\tau^2 \frac{var(\widehat{N_H})}{N^4} = \frac{\langle d \rangle^2}{n}\left( \frac{\ln B}{2} -1 \right).$$
(35)
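The bracketed factors of Eqs. 30 and 35 can be compared directly. The sketch below is an illustration under stated assumptions rather than a result from the paper: the values of α and of the target relative standard error are arbitrary, and `sample_size` is a hypothetical helper that rearranges var/⟨d⟩² = factor/n to obtain the n that reaches a given relative standard error.

```python
import math

def log_B(N, alpha):
    """ln B with B = (alpha + N) / (alpha + 1), as defined after Eq. 27."""
    return math.log((alpha + N) / (alpha + 1))

def ur_factor(N, alpha):
    """Bracketed term of Eq. 30 (uniform random sampling)."""
    return N / ((alpha + 1) * log_B(N, alpha) ** 2) - 1

def rw_factor(N, alpha):
    """Bracketed term of Eq. 35 (random walk sampling)."""
    return log_B(N, alpha) / 2 - 1

def sample_size(factor, rse):
    """Smallest n for which sqrt(factor / n) does not exceed the target RSE."""
    return math.ceil(factor / rse ** 2)

# Illustrative values, not taken from the paper.
alpha, rse = 10, 0.1
for N in (10**4, 10**5, 10**6, 10**7):
    n_ur = sample_size(ur_factor(N, alpha), rse)
    n_rw = sample_size(rw_factor(N, alpha), rse)
    print(f"N={N:>10,}  n_UR={n_ur:>9,}  n_RW={n_rw:>5,}")
```

Under these assumptions the UR sample size grows roughly as N/ln² N while the RW sample size grows with ln N. The RW factor is only meaningful when ln B > 2, because Eq. 33 is an approximation for large N.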

Population size estimation

Nodes are selected during RW. When two nodes are selected, the probability that the same node i is visited twice is \(p_i^2\). Among all the nodes, the probability of having a collision is \(p = \sum_{i=1}^{N} p_i^2\). Since there are \({\left(\begin{array}{l} n \\ 2 \end{array}\right)}\) pairs in a sample of size n, the number of collisions follows the binomial distribution \(B(n(n-1)/2, p)\), whose mean is

$$E(C) =\left(\begin{array}{l} n \\ 2 \end{array}\right) p.$$
(36)

The collision probability p can be translated into the heterogeneity of the data measured by γ using the definition of γ in Eq. 12:

$$p=\sum_{i=1}^{N} p_i^2 =\frac{1}{\tau^2}\sum_{i=1}^{N} d_i^2 =\frac{1}{N}\frac{\langle d^2\rangle} {\langle d\rangle^2} =\frac{1}{N}(\gamma^2+1).$$
(37)

Combining Eqs. 36 and 37, we obtain the expected number of collisions:

$$E(C) =\left(\begin{array}{l} n \\ 2 \end{array}\right)\frac{\gamma^2+1}{N}.$$
(38)

Hence the population size can be described by

$$N=(\gamma^2+1)\left(\begin{array}{l}n \\ 2 \end{array}\right) \frac{1}{E(C)}.$$
(39)

Since E(C) is unknown, it can be estimated by the observed collisions C. This gives us the estimator

$$\widehat{N}=(\gamma^2+1)\left(\begin{array}{l} n \\ 2 \end{array}\right) \frac{1}{C}.$$
(40)
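The collision-based estimator of Eq. 40 can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: samples are drawn independently with probability proportional to degree, and γ² is computed from the full degree list of a toy population, which a real sampler would not have and would need to estimate. The function names and parameter values are hypothetical.

```python
import random
from collections import Counter
from math import comb

def count_collisions(sample):
    """Number of unordered pairs in the sample that hit the same node."""
    return sum(comb(c, 2) for c in Counter(sample).values())

def estimate_size(sample, gamma_sq):
    """Eq. 40: N_hat = (gamma^2 + 1) * C(n, 2) / C."""
    C = count_collisions(sample)
    if C == 0:
        raise ValueError("no collisions observed; a larger sample is needed")
    return (gamma_sq + 1) * comb(len(sample), 2) / C

# Toy population with Zipf-Mandelbrot degrees (Eq. 26); values are illustrative.
N, alpha = 10_000, 10
degrees = [1.0 / (alpha + i) for i in range(1, N + 1)]

# gamma^2 + 1 = <d^2> / <d>^2 (Eq. 37), computed here from the whole population.
mean_d = sum(degrees) / N
mean_d2 = sum(d * d for d in degrees) / N
gamma_sq = mean_d2 / mean_d ** 2 - 1

# Assumption: independent degree-proportional draws stand in for the walk.
rng = random.Random(0)
sample = rng.choices(range(N), weights=degrees, k=5_000)

N_hat = estimate_size(sample, gamma_sq)
print(f"collisions = {count_collisions(sample)}, N_hat = {N_hat:.0f} (true N = {N})")
```

When all degrees are equal, γ = 0 and the estimator reduces to the number of pairs divided by the number of collisions.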

About this article

Cite this article

Wang, Y., Liang, J. & Lu, J. Discover hidden web properties by random walk on bipartite graph. Inf Retrieval 17, 203–228 (2014). https://doi.org/10.1007/s10791-013-9230-7

Keywords

  • Hidden data source
  • Deep web
  • Random walk
  • Graph sampling
  • Estimator
  • Zipf’s law