Information Retrieval, Volume 17, Issue 3, pp 203–228

Discover hidden web properties by random walk on bipartite graph

Article

Abstract

This paper proposes using random walk (RW) sampling to discover properties of deep web data sources that are hidden behind searchable interfaces. These properties, such as the average degree and population size of both documents and terms, are of interest to the general public and have applications in business intelligence, data integration, and deep web crawling. We show that a simple RW can outperform uniform random (UR) sampling even when the high cost of obtaining UR samples is disregarded. We prove that, in the idealized case where the degrees follow Zipf’s law, the sample size required by UR sampling grows on the order of O(N/ln² N) with the corpus size N, whereas the sample size required by RW sampling grows logarithmically. Experiments on the Reuters corpus show that term degrees resemble a power-law distribution, so RW sampling is better than UR sampling for terms. Document degrees, on the other hand, follow a lognormal distribution with smaller variance, so UR sampling is slightly better for documents.
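To make the sampling idea concrete, here is a minimal Python sketch of a simple random walk on a document-term bipartite graph that estimates the average term degree. The abstract does not state which estimator the paper uses. The sketch assumes a harmonic-mean estimator, a common correction for the degree bias of random-walk samples, and the function names and the toy corpus are hypothetical.

```python
import random
from collections import defaultdict


def build_bipartite(docs):
    """Build adjacency lists for a document-term bipartite graph.

    docs maps a document id to the set of terms it contains.
    Neighbours are stored as sorted lists so the walk is reproducible.
    """
    adj = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            adj[("doc", doc_id)].add(("term", term))
            adj[("term", term)].add(("doc", doc_id))
    return {node: sorted(neighbours) for node, neighbours in adj.items()}


def random_walk_term_degrees(adj, start, steps, rng):
    """Run a simple random walk and record the degree of every term visited.

    In the long run a simple random walk visits a node with probability
    proportional to its degree, so these samples are degree-biased.
    """
    node = start
    degrees = []
    for _ in range(steps):
        node = rng.choice(adj[node])
        if node[0] == "term":
            degrees.append(len(adj[node]))
    return degrees


def harmonic_mean(values):
    """Harmonic mean; assumed here as the correction for degree bias."""
    return len(values) / sum(1.0 / v for v in values)


if __name__ == "__main__":
    # Hypothetical toy corpus: term "a" is frequent, "d" and "e" are rare.
    toy_docs = {
        "d1": {"a", "b", "c"},
        "d2": {"a", "b"},
        "d3": {"a", "d"},
        "d4": {"a", "e", "c"},
    }
    graph = build_bipartite(toy_docs)
    sample = random_walk_term_degrees(graph, ("doc", "d1"), 40000, random.Random(0))
    print("naive mean of sampled degrees:", sum(sample) / len(sample))
    print("harmonic-mean estimate:", harmonic_mean(sample))
```

In the toy corpus the five terms have degrees 4, 2, 2, 1 and 1, so the true average term degree is 2. The plain mean of the sampled degrees overestimates it because frequent terms are visited more often, while the harmonic mean compensates for that bias.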

Keywords

Hidden data source, Deep web, Random walk, Graph sampling, Estimator, Zipf’s law

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. School of Information, Central University of Finance and Economics, Beijing, China
  2. BiblioCommons Inc, Toronto, Canada
  3. School of Computer Science, University of Windsor, Windsor, Canada
  4. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
