Advertisement

Journal of Computer Science and Technology

, Volume 31, Issue 4, pp 720–740 | Cite as

Determining the Real Data Completeness of a Relational Dataset

  • Yong-Nan LiuEmail author
  • Jian-Zhong Li
  • Zhao-Nian Zou
Regular Paper

Abstract

Low quality of data is a serious problem in the new era of big data, which can severely reduce the usability of data, mislead or bias the querying, analyzing and mining, and leads to huge loss. Incomplete data is common in low quality data, and it is necessary to determine the data completeness of a dataset to provide hints for follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work views all missing values as unknown values. In this paper, we study how to determine real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values by other tuples and capture the really missing attribute cells. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound of the time complexity of this problem. Two optimal algorithms to determine the data completeness of a dataset for different cases are proposed. We empirically show the effectiveness and the scalability of our algorithms on both real-world data and synthetic data.

Keywords

data quality data completeness functional dependency data completeness model optimal algorithm 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. [1]
    Rahm E, Do H H. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 2000, 23(4): 3-13.Google Scholar
  2. [2]
    Eckerson W W. Data warehousing special report: Data quality and the bottom line. Application Development Trends, 2002, (5): 1-9.Google Scholar
  3. [3]
    Poleto F Z, Singer J M, Paulino C D. Missing data mechanisms and their implications on the analysis of categorical data. Statistics and Computing, 2011, 21(1): 31-43.MathSciNetCrossRefzbMATHGoogle Scholar
  4. [4]
    Chen K, Chen H, Conway N, Hellerstein J M, Parikh T S. Usher: Improving data quality with dynamic forms. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(8): 1138-1153.CrossRefGoogle Scholar
  5. [5]
    Arocena P C, Glavic B, Miller R J. Value invention in data exchange. In Proc. the 2013 International Conference on Management of Data, June 2013, pp.157-168.Google Scholar
  6. [6]
    Dong X L, Gabrilovich E, Murphy K, Dang V, Horn W, Lugaresi C, Sun S, Zhang W. Knowledge-based trust: Estimating the trustworthiness of web sources. Proceedings of the VLDB Endowment, 2015, 8(9): 938-949.CrossRefGoogle Scholar
  7. [7]
    Wolf G, Khatri H, Chokshi B, Fan J, Chen Y, Kambhampati S. Query processing over incomplete autonomous databases. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.651-662.Google Scholar
  8. [8]
    Motro A. Integrity = Validity + Completeness. ACM Transactions on Database Systems, 1989, 14(4): 480-502.CrossRefGoogle Scholar
  9. [9]
    Yang K, Li J, Wang C. Missing values estimation in microarray data with partial least squares regression. In Proc. the 6th International Conference on Computational Science, May 2006, pp.662-669.Google Scholar
  10. [10]
    Beskales G, Ilyas I F, Golab L. Sampling the repairs of functional dependency violations under hard constraints. Proceedings of the VLDB Endowment, 2010, 3(1/2): 197-207.CrossRefGoogle Scholar
  11. [11]
    Li P, Dong X, Maurino A, Srivastava D. Linking temporal records. Proceedings of the VLDB Endowment, 2011, 4(11): 956-967.zbMATHGoogle Scholar
  12. [12]
    Motro A, Rakov I. Not all answers are equally good: Estimating the quality of database answers. In Proc. the 1997 Flexible Query Answering Systems, June 1997, pp.1-21.Google Scholar
  13. [13]
    Naumann F, Freytag J C, Leser U. Completeness of integrated information sources. Information Systems, 2004, 29(7): 583-615.CrossRefGoogle Scholar
  14. [14]
    Biswas J, Naumann F, Qiu Q. Assessing the completeness of sensor data. In Proc. the 11th International Conference on Database Systems for Advanced Applications, April 2006, pp.717-732.Google Scholar
  15. [15]
    Levy A Y. Obtaining complete answers from incomplete databases. In Proc. the 22nd International Conference on Very Large Data Bases, Sept. 1996, pp.402-412.Google Scholar
  16. [16]
    Razniewski S, Nutt W. Completeness of queries over incomplete databases. Proceedings of the VLDB Endowment, 2011, 4(11): 749-760.Google Scholar
  17. [17]
    FanW, Geerts F. Capturing missing tuples and missing values. In Proc. the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2010, pp.169-178.Google Scholar
  18. [18]
    Fan W, Geerts F. Relative information completeness. ACM Transactions on Database Systems, 2010, 35(4): Article No. 27.Google Scholar
  19. [19]
    Prokoshyna N, Szlichta J, Chiang F, Miller R J, Srivastava D. Combining quantitative and logical data cleaning. Proceedings of the VLDB Endowment, 2015, 9(4): 300-311.CrossRefGoogle Scholar
  20. [20]
    Abiteboul S, Hull R, Vianu V. Foundations of Databases: The Logical Level (1st edition). Addison-Wesley Longman Publishing Co., Inc., 1995.Google Scholar
  21. [21]
    Silberschatz A, Korth H, Sudarshan S. Database System Concepts (4th edition). McGraw-Hill Education, 2001.Google Scholar
  22. [22]
    Cheng S, Li J. Sampling based (epsilon, delta)-approximate aggregation algorithm in sensor networks. In Proc. the 29th International Conference on Distributed Computing Systems, June 2009, pp.273-280.Google Scholar
  23. [23]
    Khalefa M E, Mokbel M F, Levandoski J J. Skyline query processing for incomplete data. In Proc. the 24th International Conference on Data Engineering, April 2008, pp.556-565.Google Scholar
  24. [24]
    Salloum M, Dong X L, Srivastava D, Tsotras V J. Online ordering of overlapping data sources. Proceedings of the VLDB Endowment, 2013, 7(3): 133-144.CrossRefGoogle Scholar
  25. [25]
    Zhao B, Rubinstein B I, Gemmell J, Han J. A Bayesian approach to discovering truth from conflicting sources for data integration. Proceedings of the VLDB Endowment, 2012, 5(6): 550-561.CrossRefGoogle Scholar
  26. [26]
    Dong X L, Saha B, Srivastava D. Less is more: Selecting sources wisely for integration. In Proc. the 39th International Conference on Very Large Data Bases, August 2013, pp.37-48.Google Scholar

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. 1.School of Computer Science and EngineeringHarbin Institute of TechnologyHarbinChina

Personalised recommendations