Frontiers of Computer Science, Volume 12, Issue 5, pp 984–999

Efficient histogram-based range query estimation for dirty data

  • Yan Zhang
  • Hongzhi Wang
  • Long Yang
  • Jianzhong Li
Research Article

Abstract

In recent years, data quality issues have attracted wide attention. Data quality problems are mainly caused by dirty data. Many methods for managing dirty data have been proposed; one of them is the entity-based relational database, in which each tuple represents an entity. Traditional query optimization techniques are not suitable for this entity-based model, so new ones need to be developed. In this paper, we propose a new histogram-based strategy for query selectivity estimation, focusing on the overestimation that traditional methods produce. We prove that our approaches are unbiased. Experimental results on both real and synthetic data sets show that our approaches give good estimates with low error.
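
As a rough illustration of the overestimation mentioned in the abstract, the sketch below builds a conventional equi-width histogram and estimates a range count under the usual assumption that values are uniform within each bucket. It is not the estimator proposed in the paper. The toy data, the function names, and the premise that an entity may carry several candidate values for one attribute are all assumptions made for this example only.

```python
def build_equi_width_histogram(values, lo, hi, num_buckets):
    """Count values per equal-width bucket over [lo, hi)."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that a value equal to hi falls in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts, width


def estimate_range_count(counts, lo, width, a, b):
    """Estimate how many values fall in [a, b), assuming values are
    spread uniformly inside each bucket."""
    total = 0.0
    for i, c in enumerate(counts):
        b_lo = lo + i * width
        b_hi = b_lo + width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += c * overlap / width
    return total


# Hypothetical entity-based table: each entity (one tuple) holds several
# candidate values for the same attribute. This layout is an assumption.
entities = {
    "e1": [10, 12],
    "e2": [11],
    "e3": [40, 42, 45],
    "e4": [70],
}

# A conventional histogram counts every value, not every entity.
all_values = [v for vals in entities.values() for v in vals]
counts, width = build_equi_width_histogram(all_values, 0, 100, 10)

value_estimate = estimate_range_count(counts, 0, width, 10, 50)
true_entities = sum(
    1 for vals in entities.values() if any(10 <= v < 50 for v in vals)
)

print(value_estimate, true_entities)
```

In this toy case the value-level histogram reports 6 matches for the range [10, 50) while only 3 entities qualify, because entities with several in-range values are counted more than once. An entity-aware estimator would need to correct for that kind of double counting.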

Keywords

query estimation, data quality, histogram, dirty data management

Notes

Acknowledgements

This paper was partially supported by the National Natural Science Foundation of China (Grant Nos. U1509216 and 61472099), National Sci-Tech Support Plan (2015BAH10F01), the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (LC2016026), and MOE–Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, China.

Supplementary material

11704_2016_5551_MOESM1_ESM.ppt (186 kb)
Efficient histogram-based range query estimation for dirty data

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  • Yan Zhang (1)
  • Hongzhi Wang (1)
  • Long Yang (1)
  • Jianzhong Li (1)
  1. Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
