Advertisement

Cleaning data with Llunatic

  • 104 Accesses

Abstract

Data cleaning (or data repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a given set of constraints. In recent years, repairing methods have been proposed for several classes of constraints. These methods, however, tend to hard-code the strategy to repair conflicting values and are specialized toward specific classes of constraints. In this paper, we develop a general chase-based repairing framework, referred to as Llunatic, in which repairs can be obtained for a large class of constraints and by using different strategies to select preferred values. The framework is based on an elegant formalization in terms of labeled instances and partially ordered preference labels. In this context, we revisit concepts such as upgrades, repairs and the chase. In Llunatic, various repairing strategies can be slotted in, without the need for changing the underlying implementation. Furthermore, Llunatic is the first data repairing system which is DBMS-based. We report experimental results that confirm its good scalability and show that various instantiations of the framework result in repairs of good quality.

This is a preview of subscription content, log in to check access.

Access options

Buy single article

Instant unlimited access to the full article PDF.

US$ 39.95

Price includes VAT for USA

Subscribe to journal

Immediate online access to all issues from 2019. Subscription will auto renew annually.

US$ 99

This is the net price. Taxes to be calculated in checkout.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Notes

  1. 1.

    Typically, to make this step deterministic, an ordering on null values is assumed and the smaller null value is replaced by the larger one. We assume that \(\bot _0<\bot _1< \bot _2<\bot _3 <\cdots \).

  2. 2.

    Of course, here the universe of discourse of the first-order structure being \({\textsc {consts}}\cup {\textsc {nulls}}\cup \textsc {lluns}\) (and \(\textsc {Tids}\) for the \({\textsf {Tid}}\)-attributes). Similarly to constants and nulls, lluns are treated as constants.

  3. 3.

    https://github.com/donatellosantoro/Llunatic.

  4. 4.

    http://www.medicare.gov/hospitalcompare/.

  5. 5.

    https://datasets.imdbws.com/.

References

  1. 1.

    Abedjan, Z., Chu, X., Deng, D., Fernandez, R.C., Ilyas, I.F., Ouzzani, M., Papotti, P., Stonebraker, M., Tang, N.: Detecting data errors: Where are we and what needs to be done? PVLDB 9(12), 993–1004 (2016)

  2. 2.

    Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)

  3. 3.

    Arocena, P.C., Glavic, B., Mecca, G., Miller, R.J., Papotti, P., Santoro, D.: Messing up with BART: error generation for evaluating data-cleaning algorithms. PVLDB 9(2), 36–47 (2015)

  4. 4.

    Beeri, C., Vardi, M.: A proof procedure for data dependencies. J. ACM 31(4), 718–741 (1984)

  5. 5.

    Benedikt, M., Konstantinidis, G., Mecca, G., Motik, B., Papotti, P., Santoro, D., Tsamoura, E.: Benchmarking the chase. In: PODS, pp. 37–52 (2017)

  6. 6.

    Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)

  7. 7.

    Bertossi, L.: Database Repairing and Consistent Query Answering. Morgan & Claypool, San Rafael (2011)

  8. 8.

    Bertossi, L., Kolahi, S., Lakshmanan, L.: Data cleaning and query answering with matching dependencies and matching functions. In: ICDT, pp. 268–279 (2011)

  9. 9.

    Beskales, G., Ilyas, I.F., Golab, L.: Sampling the repairs of functional dependency violations under hard constraints. PVLDB 3, 197–207 (2010)

  10. 10.

    Bleifuß, T., Kruse, S., Naumann, F.: Efficient denial constraint discovery with Hydra. Proc. VLDB Endow. 11(3), 311–323 (2017)

  11. 11.

    Bohannon, P., Flaster, M., Fan, W., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD, pp. 143–154 (2005)

  12. 12.

    Cao, Y., Fan, W., Yu, W.: Determining the relative accuracy of attributes. In: SIGMOD, pp. 565–576 (2013)

  13. 13.

    Caroprese, L., Greco, S., Zumpano, E.: Active integrity constraints for database consistency maintenance. IEEE Trans. Knowl. Data Eng. 21(7), 1042–1058 (2009)

  14. 14.

    Chiang, F., Miller, R.J.: A unified model for data and constraint repair. In: ICDE (2011)

  15. 15.

    Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: Overview and emerging challenges. In: SIGMOD, pp. 2201–2206 (2016)

  16. 16.

    Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE, pp. 458–469 (2013)

  17. 17.

    Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., Ye, Y.: KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In: SIGMOD, pp. 1247–1261 (2015)

  18. 18.

    Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: VLDB, pp. 315–326 (2007)

  19. 19.

    Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A.K., Ilyas, I., Ouzzani, M., Tang, N.: Nadeef: a commodity data cleaning system. In: SIGMOD, pp. 541–552 (2013)

  20. 20.

    Deng, D., Tao, W., Abedjan, Z., Elmagarmid, A.K., Ilyas, I.F., Madden, S., Ouzzani, M., Stonebraker, M., Tang, N.: Entity consolidation: the golden record problem. CoRR arXiv:1709.10436 (2017)

  21. 21.

    Experian: White paper: The data quality benchmark report (2015)

  22. 22.

    Fagin, R., Kolaitis, P., Miller, R., Popa, L.: Data exchange: semantics and query answering. TCS 336(1), 89–124 (2005)

  23. 23.

    Fan, W., Gao, H., Jia, X., Li, J., Ma, S.: Dynamic constraints for record matching. VLDB J. 20(4), 495–520 (2011)

  24. 24.

    Fan, W., Geerts, F.: Foundations of Data Quality Management. Morgan & Claypool, San Rafael (2012)

  25. 25.

    Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. ACM TODS 33, 6 (2008)

  26. 26.

    Fan, W., Geerts, F., Li, J., Xiong, M.: Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. 23(5), 683–698 (2011)

  27. 27.

    Fan, W., Geerts, F., Ma, S., Müller, H.: Detecting inconsistencies in distributed data. In: Proceedings of the 26th International Conference on Data Engineering, ICDE, pp. 64–75 (2010)

  28. 28.

    Fan, W., Geerts, F., Wijsen, J.: Determining the currency of data. In: PODS, pp. 71–82 (2011)

  29. 29.

    Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. PVLDB 3(1), 173–184 (2010)

  30. 30.

    Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Interaction between record matching and data repairing. In: SIGMOD, pp. 469–480 (2011)

  31. 31.

    Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The llunatic data-cleaning framework. PVLDB 6(9), 625–636 (2013)

  32. 32.

    Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Mapping and cleaning. In: ICDE, pp. 232–243 (2014)

  33. 33.

    He, J., Veltri, E., Santoro, D., Li, G., Mecca, G., Papotti, P., Tang, N.: Interactive and deterministic data cleaning. In: SIGMOD, pp. 893–907 (2016)

  34. 34.

    Hernández, M., Koutrika, G., Krishnamurthy, R., Popa, L., Wisnesky, R.: Hil: a high-level scripting language for entity integration. In: EDBT, pp. 549–560 (2013)

  35. 35.

    Huhtala, Y., Kärkkäinen, J., Pasi Porkka, P., Toivonen, H.: TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)

  36. 36.

    Ilyas, I.F.: Effective data cleaning with continuous evaluation. IEEE Data Eng. Bull. 39(2), 38–46 (2016)

  37. 37.

    Ilyas, I.F., Chu, X.: Trends in cleaning relational data: consistency and deduplication. Found. Trends Databases 5(4), 281–393 (2015)

  38. 38.

    Imieliński, T., Lipski, W.: Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984)

  39. 39.

    Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J., Tang, N., Yin, S.: Bigdansing: a system for big data cleansing. In: SIGMOD, pp. 1215–1230 (2015)

  40. 40.

    Kimelfeld, B., Livshits, E., Peterfreund, L.: Detecting ambiguity in prioritized database repairing. In: ICDT, pp. 17:1–17:20 (2017)

  41. 41.

    Kolahi, S., Lakshmanan, L.V.S.: On approximating optimum repairs for functional dependency violations. In: ICDT, pp. 53–62 (2009)

  42. 42.

    Koudas, N., Saha, A., Srivastava, D., Venkatasubramanian, S.: Metric functional dependencies. In: ICDE, pp. 1275–1278 (2009)

  43. 43.

    Loshin, D.: Master Data Management. Knowl. Integrity, Inc., Washington, DC (2009)

  44. 44.

    Marnette, B., Mecca, G., Papotti, P., Raunich, S., Santoro, D.: ++Spicy: an opensource tool for second-generation schema mapping and data exchange. PVLDB 4(12), 1438–1441 (2011)

  45. 45.

    Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency discovery. In: SIGMOD, pp. 821–833 (2016)

  46. 46.

    Rammelaere, J., Geerts, F.: Revisiting conditional functional dependency discovery: splitting the “c” from the “fd”. In: ECML/PKDD, pp. 552–568 (2018)

  47. 47.

    Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean: holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)

  48. 48.

    Saha, B., Srivastava, D.: Data quality: the other face of big data. In: ICDE, pp. 1294–1297 (2014)

  49. 49.

    Song, S., Chen, L.: Differential dependencies: reasoning and discovery. ACM Trans. Database Syst. 36(3), 16 (2011)

  50. 50.

    Staworko, S., Chomicki, J., Marcinkowski, J.: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2–3), 209–246 (2012)

  51. 51.

    Volkovs, M., Chiang, F., Szlichta, J., Miller, R.J.: Continuous data cleaning. In: ICDE (2014)

  52. 52.

    Wang, J., Tang, N.: Towards dependable data repairing with fixing rules. In: SIGMOD (2014)

  53. 53.

    Wijsen, J.: Database repairing using updates. ACM Trans. Database Syst. 30(3), 722–768 (2005)

  54. 54.

    Yakout, M., Berti-Équille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: SIGMOD, pp. 553–564 (2013)

  55. 55.

    Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. PVLDB 4(5), 279–289 (2011)

Download references

Author information

Correspondence to Paolo Papotti.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Geerts, F., Mecca, G., Papotti, P. et al. Cleaning data with Llunatic. The VLDB Journal (2019) doi:10.1007/s00778-019-00586-5

Download citation

Keywords

  • Data quality
  • Data cleaning
  • Data repairing
  • Chase
  • Data repairing system
  • Constraints
  • Rules
  • Repair algorithm
  • Cleaning rules
  • Dependencies
  • Error detection