Abstract
In this paper, we present a novel approach that uses attribute correlation to sample nonuniform hidden databases. We propose a method for calculating attribute dependency and construct a sampling template from that dependency. We then use the sampling template to generate the initial sampling queries and propose a bottom-up algorithm for searching the template. Extensive experiments on real deep Web sites and controlled databases show that our sampling method performs well in both quality and efficiency.
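The abstract does not state which dependency measure the method uses. As a rough illustration only, the sketch below scores the dependency between two categorical attributes with empirical mutual information, a common stand-in and not necessarily the measure proposed in the paper. The function names, the attribute names, and the `SAMPLE_ROWS` records are all hypothetical.

```python
from collections import Counter
from math import log2

# Hypothetical records, as might be returned by a few probe queries.
SAMPLE_ROWS = [
    {"make": "A", "body": "sedan", "color": "red"},
    {"make": "A", "body": "sedan", "color": "blue"},
    {"make": "B", "body": "suv", "color": "red"},
    {"make": "B", "body": "suv", "color": "blue"},
]


def mutual_information(rows, a, b):
    """Empirical mutual information (in bits) between attributes a and b.

    Assumption: mutual information stands in for the paper's
    attribute-dependency measure, which the abstract does not specify.
    """
    n = len(rows)
    count_a = Counter(r[a] for r in rows)
    count_b = Counter(r[b] for r in rows)
    count_ab = Counter((r[a], r[b]) for r in rows)
    mi = 0.0
    for (va, vb), c in count_ab.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * log2(c * n / (count_a[va] * count_b[vb]))
    return mi


def dependency_matrix(rows, attributes):
    """Pairwise dependency scores for every ordered pair of distinct attributes."""
    return {
        (a, b): mutual_information(rows, a, b)
        for a in attributes
        for b in attributes
        if a != b
    }


if __name__ == "__main__":
    scores = dependency_matrix(SAMPLE_ROWS, ["make", "body", "color"])
    for pair, score in sorted(scores.items()):
        print(pair, round(score, 3))
```

In this toy sample, `make` fully determines `body`, so that pair scores 1 bit, while `make` and `color` are independent and score 0. A template built from such scores could, for example, group strongly dependent attributes together, though how the paper actually constructs its template is not described in the abstract.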
Additional information
Foundation item: Supported by the National Natural Science Foundation of China (60970018)
Biography: TIAN Jianwei(1982–), male, Ph. D. candidate, research direction: Web mining.
About this article
Cite this article
Tian, J., Li, S. & Tang, X. Web database sampling approach based on attribute correlation. Wuhan Univ. J. Nat. Sci. 15, 297–302 (2010). https://doi.org/10.1007/s11859-010-0655-1
DOI: https://doi.org/10.1007/s11859-010-0655-1