Web database sampling approach based on attribute correlation

Wuhan University Journal of Natural Sciences

Abstract

In this paper, we present a novel approach that uses attribute correlation to sample nonuniform hidden databases. We propose a method for calculating attribute dependency and construct a sampling template from the resulting dependencies. We then use the sampling template to generate the initial sampling queries and propose a bottom-up algorithm to search the sampling template. Extensive experiments over real deep Web sites and controlled databases show that our sampling method performs well in both quality and efficiency.

Author information

Corresponding author

Correspondence to Shijun Li.

Additional information

Foundation item: Supported by the National Natural Science Foundation of China (60970018)

Biography: TIAN Jianwei (1982–), male, Ph.D. candidate, research direction: Web mining.

About this article

Cite this article

Tian, J., Li, S. & Tang, X. Web database sampling approach based on attribute correlation. Wuhan Univ. J. Nat. Sci. 15, 297–302 (2010). https://doi.org/10.1007/s11859-010-0655-1

  • DOI: https://doi.org/10.1007/s11859-010-0655-1
