Abstract

Real-life data are often dirty: inconsistent, inaccurate, incomplete, stale and duplicated. Dirty data have been a longstanding issue, and the prevalent use of the Internet has been increasing, on an unprecedented scale, the risks of creating and propagating dirty data. Dirty data are reported to cost US industry billions of dollars each year, and there is no reason to believe that the scale of the problem is any different in any other society that depends on information technology. With this comes the need for improving data quality, a topic as important as traditional data management tasks for coping with the quantity of data.

We aim to provide an overview of recent advances in the area of data quality, from theory to practical techniques. We promote a conditional dependency theory for capturing data inconsistencies, a new form of dynamic constraints for data deduplication, a theory of relative information completeness for characterizing incomplete data, and a data currency model for answering queries with current values from possibly stale data in the absence of reliable timestamps. We also discuss techniques for automatically discovering data quality rules, detecting errors in real-life data, and correcting errors with performance guarantees.
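
To make the idea of capturing inconsistencies with conditional dependencies concrete, the following is a minimal sketch, not the paper's own algorithm. It assumes a conditional functional dependency is given as a functional dependency X → A plus a pattern that binds some attributes to constants and leaves others as a wildcard. The attribute names (CC, AC, zip, street, city), the sample rows and the helper names `matches` and `cfd_violations` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

WILDCARD = "_"  # pattern entry that matches any value


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the dependency (lhs -> rhs, pattern).

    The dependency applies only to rows matching the pattern on lhs.
    A constant on rhs is checked per row; a wildcard on rhs requires
    rows that agree on lhs to agree on rhs as well.
    """
    lhs_pattern = {a: pattern[a] for a in lhs}
    violating = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if not matches(row, lhs_pattern):
            continue  # the condition does not apply to this row
        if pattern[rhs] != WILDCARD:
            if row[rhs] != pattern[rhs]:
                violating.add(i)  # single-row violation of a constant
        else:
            groups[tuple(row[a] for a in lhs)].append(i)
    for idxs in groups.values():
        if len({rows[i][rhs] for i in idxs}) > 1:
            violating.update(idxs)  # rows agree on lhs but differ on rhs
    return sorted(violating)


# Hypothetical customer records (country code, area code, zip, street, city).
ROWS = [
    {"CC": "44", "AC": "131", "zip": "EH4 8LE", "street": "Mayfield", "city": "EDI"},
    {"CC": "44", "AC": "131", "zip": "EH4 8LE", "street": "Crichton", "city": "EDI"},
    {"CC": "01", "AC": "908", "zip": "07974", "street": "Mtn Ave", "city": "MH"},
    {"CC": "01", "AC": "908", "zip": "07974", "street": "Main St", "city": "MH"},
    {"CC": "44", "AC": "131", "zip": "EH8 9AB", "street": "Crichton", "city": "LDN"},
]

# Only when CC is 44 does zip determine street.
ZIP_RULE = {"CC": "44", "zip": WILDCARD, "street": WILDCARD}
# When CC is 44 and AC is 131, city must be EDI.
CITY_RULE = {"CC": "44", "AC": "131", "city": "EDI"}

print(cfd_violations(ROWS, ["CC", "zip"], "street", ZIP_RULE))
print(cfd_violations(ROWS, ["CC", "AC"], "city", CITY_RULE))
```

In this sketch the first rule flags only the two rows with CC 44 that share a zip but differ on street, whereas the same dependency without the condition would also flag the two rows with CC 01. The second rule shows that a single row can violate a conditional dependency on its own, which is not possible with a traditional functional dependency.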

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Wenfei Fan (1)
  1. University of Edinburgh and Harbin Institute of Technology, China
