Abstract
Data cleaning (or data repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a given set of constraints. In recent years, repairing methods have been proposed for several classes of constraints. These methods, however, tend to hard-code the strategy to repair conflicting values and are specialized toward specific classes of constraints. In this paper, we develop a general chase-based repairing framework, referred to as Llunatic, in which repairs can be obtained for a large class of constraints and by using different strategies to select preferred values. The framework is based on an elegant formalization in terms of labeled instances and partially ordered preference labels. In this context, we revisit concepts such as upgrades, repairs and the chase. In Llunatic, various repairing strategies can be slotted in, without the need for changing the underlying implementation. Furthermore, Llunatic is the first data repairing system which is DBMS-based. We report experimental results that confirm its good scalability and show that various instantiations of the framework result in repairs of good quality.
Similar content being viewed by others
Notes
Typically, to make this step deterministic, an ordering on null values is assumed and the smaller null value is replaced by the larger one. We assume that \(\bot _0<\bot _1< \bot _2<\bot _3 <\cdots \).
Of course, here the universe of discourse of the first-order structure being \({\textsc {consts}}\cup {\textsc {nulls}}\cup \textsc {lluns}\) (and \(\textsc {Tids}\) for the \({\textsf {Tid}}\)-attributes). Similarly to constants and nulls, lluns are treated as constants.
References
Abedjan, Z., Chu, X., Deng, D., Fernandez, R.C., Ilyas, I.F., Ouzzani, M., Papotti, P., Stonebraker, M., Tang, N.: Detecting data errors: Where are we and what needs to be done? PVLDB 9(12), 993–1004 (2016)
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)
Arocena, P.C., Glavic, B., Mecca, G., Miller, R.J., Papotti, P., Santoro, D.: Messing up with BART: error generation for evaluating data-cleaning algorithms. PVLDB 9(2), 36–47 (2015)
Beeri, C., Vardi, M.: A proof procedure for data dependencies. J. ACM 31(4), 718–741 (1984)
Benedikt, M., Konstantinidis, G., Mecca, G., Motik, B., Papotti, P., Santoro, D., Tsamoura, E.: Benchmarking the chase. In: PODS, pp. 37–52 (2017)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Bertossi, L.: Database Repairing and Consistent Query Answering. Morgan & Claypool, San Rafael (2011)
Bertossi, L., Kolahi, S., Lakshmanan, L.: Data cleaning and query answering with matching dependencies and matching functions. In: ICDT, pp. 268–279 (2011)
Beskales, G., Ilyas, I.F., Golab, L.: Sampling the repairs of functional dependency violations under hard constraints. PVLDB 3, 197–207 (2010)
Bleifuß, T., Kruse, S., Naumann, F.: Efficient denial constraint discovery with Hydra. Proc. VLDB Endow. 11(3), 311–323 (2017)
Bohannon, P., Flaster, M., Fan, W., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD, pp. 143–154 (2005)
Cao, Y., Fan, W., Yu, W.: Determining the relative accuracy of attributes. In: SIGMOD, pp. 565–576 (2013)
Caroprese, L., Greco, S., Zumpano, E.: Active integrity constraints for database consistency maintenance. IEEE Trans. Knowl. Data Eng. 21(7), 1042–1058 (2009)
Chiang, F., Miller, R.J.: A unified model for data and constraint repair. In: ICDE (2011)
Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: Overview and emerging challenges. In: SIGMOD, pp. 2201–2206 (2016)
Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE, pp. 458–469 (2013)
Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., Ye, Y.: KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In: SIGMOD, pp. 1247–1261 (2015)
Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: VLDB, pp. 315–326 (2007)
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A.K., Ilyas, I., Ouzzani, M., Tang, N.: Nadeef: a commodity data cleaning system. In: SIGMOD, pp. 541–552 (2013)
Deng, D., Tao, W., Abedjan, Z., Elmagarmid, A.K., Ilyas, I.F., Madden, S., Ouzzani, M., Stonebraker, M., Tang, N.: Entity consolidation: the golden record problem. CoRR arXiv:1709.10436 (2017)
Experian: White paper: The data quality benchmark report (2015)
Fagin, R., Kolaitis, P., Miller, R., Popa, L.: Data exchange: semantics and query answering. TCS 336(1), 89–124 (2005)
Fan, W., Gao, H., Jia, X., Li, J., Ma, S.: Dynamic constraints for record matching. VLDB J. 20(4), 495–520 (2011)
Fan, W., Geerts, F.: Foundations of Data Quality Management. Morgan & Claypool, San Rafael (2012)
Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. ACM TODS 33, 6 (2008)
Fan, W., Geerts, F., Li, J., Xiong, M.: Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. 23(5), 683–698 (2011)
Fan, W., Geerts, F., Ma, S., Müller, H.: Detecting inconsistencies in distributed data. In: Proceedings of the 26th International Conference on Data Engineering, ICDE, pp. 64–75 (2010)
Fan, W., Geerts, F., Wijsen, J.: Determining the currency of data. In: PODS, pp. 71–82 (2011)
Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. PVLDB 3(1), 173–184 (2010)
Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Interaction between record matching and data repairing. In: SIGMOD, pp. 469–480 (2011)
Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The llunatic data-cleaning framework. PVLDB 6(9), 625–636 (2013)
Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Mapping and cleaning. In: ICDE, pp. 232–243 (2014)
He, J., Veltri, E., Santoro, D., Li, G., Mecca, G., Papotti, P., Tang, N.: Interactive and deterministic data cleaning. In: SIGMOD, pp. 893–907 (2016)
Hernández, M., Koutrika, G., Krishnamurthy, R., Popa, L., Wisnesky, R.: Hil: a high-level scripting language for entity integration. In: EDBT, pp. 549–560 (2013)
Huhtala, Y., Kärkkäinen, J., Pasi Porkka, P., Toivonen, H.: TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)
Ilyas, I.F.: Effective data cleaning with continuous evaluation. IEEE Data Eng. Bull. 39(2), 38–46 (2016)
Ilyas, I.F., Chu, X.: Trends in cleaning relational data: consistency and deduplication. Found. Trends Databases 5(4), 281–393 (2015)
Imieliński, T., Lipski, W.: Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984)
Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J., Tang, N., Yin, S.: Bigdansing: a system for big data cleansing. In: SIGMOD, pp. 1215–1230 (2015)
Kimelfeld, B., Livshits, E., Peterfreund, L.: Detecting ambiguity in prioritized database repairing. In: ICDT, pp. 17:1–17:20 (2017)
Kolahi, S., Lakshmanan, L.V.S.: On approximating optimum repairs for functional dependency violations. In: ICDT, pp. 53–62 (2009)
Koudas, N., Saha, A., Srivastava, D., Venkatasubramanian, S.: Metric functional dependencies. In: ICDE, pp. 1275–1278 (2009)
Loshin, D.: Master Data Management. Knowl. Integrity, Inc., Washington, DC (2009)
Marnette, B., Mecca, G., Papotti, P., Raunich, S., Santoro, D.: ++Spicy: an opensource tool for second-generation schema mapping and data exchange. PVLDB 4(12), 1438–1441 (2011)
Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency discovery. In: SIGMOD, pp. 821–833 (2016)
Rammelaere, J., Geerts, F.: Revisiting conditional functional dependency discovery: splitting the “c” from the “fd”. In: ECML/PKDD, pp. 552–568 (2018)
Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean: holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)
Saha, B., Srivastava, D.: Data quality: the other face of big data. In: ICDE, pp. 1294–1297 (2014)
Song, S., Chen, L.: Differential dependencies: reasoning and discovery. ACM Trans. Database Syst. 36(3), 16 (2011)
Staworko, S., Chomicki, J., Marcinkowski, J.: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2–3), 209–246 (2012)
Volkovs, M., Chiang, F., Szlichta, J., Miller, R.J.: Continuous data cleaning. In: ICDE (2014)
Wang, J., Tang, N.: Towards dependable data repairing with fixing rules. In: SIGMOD (2014)
Wijsen, J.: Database repairing using updates. ACM Trans. Database Syst. 30(3), 722–768 (2005)
Yakout, M., Berti-Équille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: SIGMOD, pp. 553–564 (2013)
Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. PVLDB 4(5), 279–289 (2011)
Funding
Paolo Papotti has been partially supported by Agence Nationale de la Recherche (Grant No. ANR-18-CE23-0019).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Geerts, F., Mecca, G., Papotti, P. et al. Cleaning data with Llunatic. The VLDB Journal 29, 867–892 (2020). https://doi.org/10.1007/s00778-019-00586-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-019-00586-5