Range Query Estimation for Dirty Data Management System

Zhang, Yan; Yang, Long; Wang, Hongzhi

doi:10.1007/978-3-642-32281-5_15

Yan Zhang, Long Yang & Hongzhi Wang

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7418)

Abstract

In recent years, data quality issues have attracted wide attention. Data quality problems are mainly caused by dirty data. Many methods for dirty data management have been proposed; one of them is the entity-based relational database, in which one tuple represents an entity. Traditional query optimization techniques, which estimate the execution cost of a query plan, are not suitable for this new entity-based model, so new query optimization techniques need to be developed. In this paper, we propose new histogram-based query selectivity estimation methods and focus on solving the overestimation that traditional methods lead to. We prove that our approaches are unbiased. Experimental results on both real and synthetic data sets show that our approaches give good estimates with low error.
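
For readers unfamiliar with histogram-based selectivity estimation, the following is a minimal sketch of the traditional baseline that the abstract refers to, not the estimator proposed in the paper. It assumes an equi-width histogram over a single numeric attribute and a uniform distribution of values inside each bucket; the function names and parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def build_equi_width_histogram(
    values: Sequence[float], lo: float, hi: float, buckets: int
) -> List[int]:
    """Count the values falling into each of `buckets` equal-width buckets over [lo, hi)."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        if lo <= v < hi:
            # Clamp the index to guard against floating-point rounding near `hi`.
            idx = min(int((v - lo) / width), buckets - 1)
            counts[idx] += 1
    return counts


def estimate_range_selectivity(
    counts: Sequence[int], lo: float, hi: float, a: float, b: float
) -> float:
    """Estimate the fraction of values in [a, b).

    Assumption (illustrative only): values are spread uniformly inside each
    bucket, so a bucket contributes in proportion to its overlap with [a, b).
    """
    total = sum(counts)
    if total == 0 or b <= a:
        return 0.0
    width = (hi - lo) / len(counts)
    covered = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        covered += count * overlap / width
    return covered / total


# Usage: ten values in [0, 10), five buckets of width 2, query range [2, 5).
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
hist = build_equi_width_histogram(values, lo=0.0, hi=10.0, buckets=5)
selectivity = estimate_range_selectivity(hist, lo=0.0, hi=10.0, a=2.0, b=5.0)
```

The estimate is exact only when the uniformity assumption holds within each bucket. How the entity-based model breaks this baseline, and how the paper corrects the resulting overestimation, is not specified in the abstract and is not modeled here.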

This paper was partially supported by NGFR 973 grant 2012CB316200, NSFC grants 61003046 and 6111113089, and the Doctoral Fund of the Ministry of Education of China (No. 20102302120054).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, China
Yan Zhang, Long Yang & Hongzhi Wang

Editor information

Editors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, No. 92, West Dazhi Street, 150001, Heilongjiang, Harbin, China
Hong Gao
Information and Computer Science Department, University of Hawaii, 1680 East West Road, 96822, Honolulu, HI, USA
Lipyeow Lim
School of Computer Science, Fudan University, No. 220, Handan Road, 200433, Shanghai, China
Wei Wang
School of Computer Science and Technology, Sichuan University, No. 29 Jiuyanqiao Wangjing Road, 610064, Chengdu, Sichuan, China
Chuan Li
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Lei Chen

About this paper

Cite this paper

Zhang, Y., Yang, L., Wang, H. (2012). Range Query Estimation for Dirty Data Management System. In: Gao, H., Lim, L., Wang, W., Li, C., Chen, L. (eds) Web-Age Information Management. WAIM 2012. Lecture Notes in Computer Science, vol 7418. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32281-5_15

DOI: https://doi.org/10.1007/978-3-642-32281-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32280-8
Online ISBN: 978-3-642-32281-5
eBook Packages: Computer Science, Computer Science (R0)