Web database sampling approach based on attribute correlation

Tian, Jianwei; Li, Shijun; Tang, Xiaoyue

doi:10.1007/s11859-010-0655-1

Published: 05 August 2010

Volume 15, pages 297–302 (2010)

Wuhan University Journal of Natural Sciences

Jianwei Tian¹, Shijun Li¹ & Xiaoyue Tang¹

Abstract

In this paper, we present a novel approach that uses attribute correlation to sample nonuniform hidden databases. We propose a method for calculating attribute dependency and construct a sampling template according to that dependency. We then use the sampling template to generate the initial sampling queries and propose a bottom-up algorithm to search the sampling template. Extensive experiments over real deep Web sites and controlled databases show that our sampling method performs well in both quality and efficiency.
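
The abstract does not say how attribute dependency is calculated, so the following is only an illustrative sketch under a stated assumption: dependency is measured as the fraction of one attribute's entropy that is removed by knowing another attribute (an uncertainty coefficient). The function names `entropy` and `dependency` and the toy `rows` table are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def dependency(rows, a, b):
    """Assumed measure, not the paper's: how strongly attribute b depends on a.

    Returns (H(b) - H(b | a)) / H(b), a value in [0, 1], where
    H(b | a) = H(a, b) - H(a).
    """
    h_b = entropy([r[b] for r in rows])
    if h_b == 0:
        return 0.0  # b is constant, so there is nothing to explain
    h_a = entropy([r[a] for r in rows])
    h_ab = entropy([(r[a], r[b]) for r in rows])
    return (h_b - (h_ab - h_a)) / h_b


# Hypothetical tuples already retrieved from a hidden database.
rows = [
    {"make": "Honda", "model": "Civic", "color": "red"},
    {"make": "Honda", "model": "Civic", "color": "blue"},
    {"make": "Honda", "model": "Accord", "color": "red"},
    {"make": "Ford", "model": "Focus", "color": "blue"},
    {"make": "Ford", "model": "Focus", "color": "red"},
    {"make": "Ford", "model": "Fiesta", "color": "blue"},
]

strong = dependency(rows, "model", "make")  # model determines make
weak = dependency(rows, "make", "color")    # make says little about color
print(f"make given model: {strong:.3f}")
print(f"color given make: {weak:.3f}")
```

In this toy table the model fully determines the make, so that pair scores far higher than the make and color pair. A score of this kind is one plausible input for deciding which attributes to group when building a sampling template, but the actual construction used by the authors is described only in the full article.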

Author information

Authors and Affiliations

School of Computer, Wuhan University, Wuhan, 430072, Hubei, China
Jianwei Tian, Shijun Li & Xiaoyue Tang

Corresponding author

Correspondence to Shijun Li.

Additional information

Foundation item: Supported by the National Natural Science Foundation of China (60970018)

Biography: TIAN Jianwei (1982–), male, Ph.D. candidate, research direction: Web mining.

About this article

Cite this article

Tian, J., Li, S. & Tang, X. Web database sampling approach based on attribute correlation. Wuhan Univ. J. Nat. Sci. 15, 297–302 (2010). https://doi.org/10.1007/s11859-010-0655-1

Received: 16 January 2010
Published: 05 August 2010
Issue Date: August 2010
DOI: https://doi.org/10.1007/s11859-010-0655-1

CLC number

TP 311.13
