Abstract
Relaxed functional dependencies (rfds) are properties expressing important relationships among data. By introducing approximations in data comparison and/or validity, they can capture constraints useful for several purposes, such as identifying data inconsistencies or patterns of semantically related data. Nevertheless, rfds provide benefits only if they can be automatically discovered from data. In this paper we present an rfd discovery algorithm that relies on a lattice-structured search space previously used for fd discovery, together with new pruning strategies and a new candidate rfd validation method. An experimental evaluation demonstrates the discovery performance of the proposed algorithm on real datasets and compares it with other algorithms.
Notes
The software and the datasets used in the evaluation are available online at https://dastlab.github.io/dime/.
References
Abedjan Z, Schulze P, Naumann F (2014) DFD: efficient functional dependency discovery. In: Proceedings of the 23rd ACM international conference on information and knowledge management, CIKM ’14, pp 949–958. https://doi.org/10.1145/2661829.2661884
Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J 24(4):557–581. https://doi.org/10.1007/s00778-015-0389-y
Arenas M, Libkin L (2004) A normal form for XML documents. ACM Trans Database Syst 29(1):195–232. https://doi.org/10.1145/974750.974757
Berti-Équille L, Harmouch H, Naumann F, Novelli N, Thirumuruganathan S (2018) Discovery of genuine functional dependencies from relational data with missing values. Proc VLDB Endowment 11(8):880–892. https://doi.org/10.14778/3204028.3204032
Blake CL, Merz CJ (1998) UCI repository of machine learning databases. https://archive.ics.uci.edu/ml/index.php
Bohannon P, Fan W, Geerts F, Jia X, Kementsietsidis A (2007) Conditional functional dependencies for data cleaning. In: Proceedings of the 25th international conference on data engineering, ICDE ’07, pp 746–755. https://doi.org/10.1109/ICDE.2007.367920
Caruccio L, Deufemia V, Polese G (2016a) On the discovery of relaxed functional dependencies. In: Proceedings of the 20th international database engineering & applications symposium, IDEAS ’16, pp 53–61. https://doi.org/10.1145/2938503.2938519
Caruccio L, Deufemia V, Polese G (2016b) Relaxed functional dependencies–a survey of approaches. IEEE Trans Knowl Data Eng 28(1):147–165. https://doi.org/10.1109/TKDE.2015.2472010
Caruccio L, Deufemia V, Polese G (2019) Visualization of (multimedia) dependencies from big data. Multimedia Tools and Applications 78(23):33151–33167. https://doi.org/10.1007/s11042-019-07951-0
Chang SK, Deufemia V, Polese G, Vacca M (2007) A normalization framework for multimedia databases. IEEE Trans Knowl Data Eng 19(12):1666–1679. https://doi.org/10.1109/TKDE.2007.190651
Chardin B, Coquery E, Pailloux M, Petit JM (2017) RQL: a query language for rule discovery in databases. Theoret Comput Sci 658:357–374. https://doi.org/10.1016/j.tcs.2016.11.004
Chiang F, Miller RJ (2008) Discovering data quality rules. Proc VLDB Endowment 1(1):1166–1177. https://doi.org/10.14778/1453856.1453980
Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16. https://doi.org/10.1109/TKDE.2007.250581
Fan W, Geerts F, Lakshmanan LVS, Xiong M (2009) Discovering conditional functional dependencies. In: Proceedings of the 25th international conference on data engineering, ICDE ’09, pp 1231–1234. https://doi.org/10.1109/ICDE.2009.208
Fan W, Gao H, Jia X, Li J, Ma S (2011) Dynamic constraints for record matching. VLDB J 20(4):495–520. https://doi.org/10.1007/s00778-010-0206-6
Flach PA, Savnik I (1999) Database dependency discovery: a machine learning approach. AI Commun 12(3):139–160
Giannella C, Robertson E (2004) On approximation measures for functional dependencies. Inf Syst 29(6):483–507. https://doi.org/10.1016/j.is.2003.10.006
Golab L, Karloff H, Korn F, Srivastava D, Yu B (2008) On generating near-optimal tableaux for conditional functional dependencies. Proc VLDB Endowment 1(1):376–390. https://doi.org/10.14778/1453856.1453900
Huhtala Y, Kärkkäinen J, Porkka P, Toivonen H (1999) TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput J 42(2):100–111. https://doi.org/10.1093/comjnl/42.2.100
Ilyas IF, Markl V, Haas P, Brown P, Aboulnaga A (2004) CORDS: automatic discovery of correlations and soft functional dependencies. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data, SIGMOD ’04, pp 647–658. https://doi.org/10.1145/1007568.1007641
Johnson DS, Garey MR (1979) Computers and intractability: a guide to the theory of NP-completeness. WH Freeman, New York
Kim M, Candan KS (2016) Decomposition-by-normalization (DBN): leveraging approximate functional dependencies for efficient CP and tucker decompositions. Data Min Knowl Disc 30(1):1–46. https://doi.org/10.1007/s10618-015-0401-6
King RS, Legendre JJ (2003) Discovery of functional and approximate functional dependencies in relational databases. Adv Decis Sci 7(1):49–59. https://doi.org/10.1155/S117391260300004X
Kivinen J, Mannila H (1995) Approximate inference of functional dependencies from relations. Theoret Comput Sci 149(1):129–149. https://doi.org/10.1016/0304-3975(95)00028-U
Kleinberg J, Tardos E (2006) Algorithm design. Pearson Education India, New Delhi
Kwashie S, Liu J, Li J, Ye F (2014) Mining differential dependencies: a subspace clustering approach. In: Proceedings of the 25th Australasian database conference, ADC ’14, pp 50–61. https://doi.org/10.1007/978-3-319-08608-8_5
Kwashie S, Liu J, Li J, Ye F (2015) Efficient discovery of differential dependencies through association rules mining. In: Proceedings of the 26th Australasian database conference, ADC ’15, pp 3–15. https://doi.org/10.1007/978-3-319-19548-3_1
Lee ML, Ling TW, Low WL (2002) Designing functional dependencies for XML. In: Proceedings of the 8th international conference on extending database technology, EDBT ’02, pp 124–141. https://doi.org/10.1007/3-540-45876-X_10
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710
Li J (2006) On optimal rule discovery. IEEE Trans Knowl Data Eng 18(4):460–471. https://doi.org/10.1109/TKDE.2006.1599385
Liu J, Li J, Liu C, Chen Y (2012) Discover dependencies from data–a review. IEEE Trans Knowl Data Eng 24(2):251–264. https://doi.org/10.1109/TKDE.2010.197
Lopes S, Petit JM, Lakhal L (2000) Efficient discovery of functional dependencies and Armstrong relations. In: Proceedings of the 7th international conference on extending database technology, EDBT ’00, pp 350–364. https://doi.org/10.1007/3-540-46439-5_24
Nambiar U, Kambhampati S (2004) Mining approximate functional dependencies and concept similarities to answer imprecise queries. In: Proceedings of the 7th international workshop on the web and databases, WebDB ’04, pp 73–78. https://doi.org/10.1145/1017074.1017093
Novelli N, Cicchetti R (2001) FUN: an efficient algorithm for mining functional and embedded dependencies. In: Proceedings of the 8th international conference database theory, ICDT ’01, pp 189–203. https://doi.org/10.1007/3-540-44503-X_13
Papenbrock T, Naumann F (2016) A hybrid approach to functional dependency discovery. In: Proceedings of the 2016 ACM SIGMOD international conference on management of data, SIGMOD ’16, pp 821–833. https://doi.org/10.1145/2882903.2915203
Papenbrock T, Bergmann T, Finke M, Zwiener J, Naumann F (2015a) Data profiling with Metanome. Proc VLDB Endowment 8(12):1860–1863. https://doi.org/10.14778/2824032.2824086
Papenbrock T, Ehrlich J, Marten J, Neubert T, Rudolph JP, Schönberg M, Zwiener J, Naumann F (2015b) Functional dependency discovery: an experimental evaluation of seven algorithms. Proc VLDB Endowment 8(10):1082–1093. https://doi.org/10.14778/2794367.2794377
Raju KVSVN, Majumdar AK (1988) Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Trans Database Syst 13(2):129–166. https://doi.org/10.1145/42338.42344
Sánchez D, Serrano JM, Blanco I, Martín-Bautista MJ, Vila MA (2008) Using association rules to mine for strong approximate dependencies. Data Min Knowl Disc 16(3):313–348. https://doi.org/10.1007/s10618-008-0092-3
Song S (2010) Data dependencies in the presence of difference. PhD thesis, The Hong Kong University
Song S, Chen L (2011) Differential dependencies: reasoning and discovery. ACM Trans Database Syst 36:16. https://doi.org/10.1145/2000824.2000826
Song S, Chen L (2013) Efficient discovery of similarity constraints for matching dependencies. Data Knowl Eng 87:146–166. https://doi.org/10.1016/j.datak.2013.06.003
Song S, Chen L, Cheng H (2014) Efficient determination of distance thresholds for differential dependencies. IEEE Trans Knowl Data Eng 26(9):2179–2192. https://doi.org/10.1109/TKDE.2013.84
Song S, Sun Y, Zhang A, Chen L, Wang J (2018) Enriching data imputation under similarity rule constraints. To appear in IEEE transactions on knowledge and data engineering
Szlichta J, Golab L, Srivastava D (2015) On axiomatization and inference complexity over a hierarchy of functional dependencies. In: Proceedings of the 9th Alberto Mendelzon international workshop on foundations of data management, AMW ’15
Vianu V (1987) Dynamic functional dependencies and database aging. J ACM 34(1):28–59. https://doi.org/10.1145/7531.7918
Wyss C, Giannella C, Robertson E (2001) FastFDs: a heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances extended abstract. In: Proceedings of the 3rd international conference on data warehousing and knowledge discovery, DaWaK ’01, pp 101–110. https://doi.org/10.1007/3-540-44801-2_11
Yao H, Hamilton HJ (2007) Mining functional dependencies from data. Data Min Knowl Disc 16(2):197–219. https://doi.org/10.1007/s10618-007-0083-9
Yao H, Hamilton HJ, Butz CJ (2002) FD_Mine: discovering functional dependencies in a database using equivalences. In: Proceedings of the 2nd international conference on data mining, ICDM ’02, pp 729–732. https://doi.org/10.1109/ICDM.2002.1184040
Additional information
Responsible editor: Toon Calders.
Appendix–The DiM \(\varepsilon \) algorithm
The DiM\(\varepsilon \) discovery algorithm follows the process of column-based discovery algorithms: the candidate rfds for a given instance r of a relation schema R are generated through a level-by-level visit of a lattice structure, considering all the possible k-combinations of attributes and increasing k at each level in the range [2, n], where n is the number of attributes in R. At each level only un-pruned candidates (see Sect. 5.2) are considered, and each of them undergoes a validation process. For this reason, DiM\(\varepsilon \) can be considered a column-based algorithm. However, since it aims to discover rfds, it has to employ a new validation strategy, and new data configurations to highlight similarities between tuples.
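The level-by-level visit can be sketched in Python as follows. This is only an illustrative sketch, not the DiM\(\varepsilon \) implementation: the function name `candidates_by_level`, the `is_pruned` predicate, and the choice of splitting each k-combination into a left-hand side and a single right-hand-side attribute are assumptions made here, and the actual pruning rules are those of Sect. 5.2.

```python
from itertools import combinations
from typing import Callable, Iterator

# A candidate is a pair (LHS attribute set X, RHS attribute A).
Candidate = tuple[frozenset[str], str]


def candidates_by_level(
    attributes: list[str],
    is_pruned: Callable[[frozenset[str], str], bool] = lambda lhs, rhs: False,
) -> Iterator[tuple[int, Candidate]]:
    """Yield (level k, candidate) pairs for k = 2 .. n.

    `is_pruned` is a hypothetical placeholder for the pruning strategies;
    by default nothing is pruned.
    """
    n = len(attributes)
    for k in range(2, n + 1):
        for combo in combinations(attributes, k):
            for rhs in combo:
                # Assumption: every attribute of the combination is tried as RHS.
                lhs = frozenset(combo) - {rhs}
                if not is_pruned(lhs, rhs):
                    yield k, (lhs, rhs)


if __name__ == "__main__":
    for level, (lhs, rhs) in candidates_by_level(["A", "B", "C"]):
        print(level, sorted(lhs), "->", rhs)
```

With three attributes and no pruning, the sketch yields six candidates at level 2 and three at level 3.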
Among the phases of the DiM\(\varepsilon \) discovery process described in Sect. 5, in what follows we provide the pseudo-code for those that introduce some novelty with respect to other column-based discovery algorithms.
Generating similar pattern subsets
The procedure for generating the set of similar pattern subsets for an attribute A is shown in Algorithm 1. It takes as input a relation instance r, an attribute A, and a constraint \(\phi _A\), in order to construct and return \({I}_{A}\). In particular, for each tuple \(t_i \in r\) (Line 4), it obtains the similar pattern \(\tau ^{t_i}_{A}\) by means of Algorithm 2 (Line 5), adds it to \({I}_{A}\), and associates \(t_i\) with it (Lines 6–7). Notice that it is not necessary to explicitly store difference matrices. Indeed, Algorithm 2 directly constructs stripped similar pattern subsets from differences on tuple pairs. It takes as input a relation instance r, an attribute A, a tuple \(t_i\), and a constraint \(\phi _A\), in order to construct and return \(\tau ^{t_i}_{A}\). In particular, it analyzes each tuple \(t_j \in r\) in order to evaluate the difference between the values of the tuple pair \((t_i, t_j)\) on attribute A (Line 4); it calculates this difference according to the constraint \(\phi _A\) (Line 5), and it inserts \(t_j\) into \(\tau ^{t_i}_{A}\) if and only if the computed difference satisfies \(\phi _A\) (Lines 6–8).
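The following Python sketch illustrates the idea behind Algorithms 1 and 2 for a single attribute. It is written under several assumptions that are not taken from the pseudo-code: tuples are identified by their index in a list of attribute values, the constraint \(\phi _A\) is modelled as "distance at most a threshold", and the names `similar_pattern` and `similar_pattern_subsets` are hypothetical. The sketch does not attempt the stripping mentioned above.

```python
from typing import Callable, Sequence, TypeVar

V = TypeVar("V")


def similar_pattern(
    values: Sequence[V],
    i: int,
    distance: Callable[[V, V], float],
    threshold: float,
) -> frozenset[int]:
    """In the spirit of Algorithm 2: tuples t_j whose difference from t_i
    on the attribute satisfies the constraint (assumed: distance <= threshold)."""
    return frozenset(
        j for j, v in enumerate(values) if distance(values[i], v) <= threshold
    )


def similar_pattern_subsets(
    values: Sequence[V],
    distance: Callable[[V, V], float],
    threshold: float,
) -> dict[frozenset[int], set[int]]:
    """In the spirit of Algorithm 1: map each similar pattern to the tuples
    associated with it. No difference matrix is stored."""
    subsets: dict[frozenset[int], set[int]] = {}
    for i in range(len(values)):
        pattern = similar_pattern(values, i, distance, threshold)
        subsets.setdefault(pattern, set()).add(i)
    return subsets


if __name__ == "__main__":
    ages = [10, 11, 20, 30, 31]
    print(similar_pattern_subsets(ages, lambda a, b: abs(a - b), 1))
```

On the sample values, tuples 0 and 1 share the pattern {0, 1}, tuple 2 has the pattern {2}, and tuples 3 and 4 share the pattern {3, 4}.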
The procedure for generating the set of similar pattern subsets for an attribute set \(X \cup A\) is shown in Algorithm 3. It generates \({I}_{X \cup A}\) from \({I}_{X}\) and \({I}_{A}\). In particular, for each similar pattern of \({I}_{X}\), it obtains the tuples associated with it by means of the procedure GetTuples (Line 5). Then, for each associated tuple \(t_i\), it obtains the similar pattern \(\tau ^{t_i}_{A}\) from \({I}_{A}\) and constructs the similar pattern subset for \({I}_{{X \cup A}}\) by intersecting the two patterns taken from \({I}_{X}\) and \({I}_{A}\) (Line 8), adding the result to \({I}_{{X \cup A}}\) and associating \(t_i\) with it (Lines 9–10).
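The intersection step of Algorithm 3 can be sketched as follows. As an assumption made only for this illustration, each set \({I}\) is represented as a Python dictionary mapping a similar pattern (a frozenset of tuple indices) to the set of tuples associated with it; the helper `pattern_by_tuple` and the function name `combine` are hypothetical.

```python
PatternSubsets = dict[frozenset[int], set[int]]


def pattern_by_tuple(subsets: PatternSubsets) -> dict[int, frozenset[int]]:
    """Inverse lookup: the similar pattern associated with each tuple."""
    return {t: pattern for pattern, tuples in subsets.items() for t in tuples}


def combine(i_x: PatternSubsets, i_a: PatternSubsets) -> PatternSubsets:
    """Build I_{X u A} from I_X and I_A by intersecting patterns per tuple."""
    a_pattern_of = pattern_by_tuple(i_a)
    i_xa: PatternSubsets = {}
    for x_pattern, tuples in i_x.items():  # tuples plays the role of GetTuples
        for t in tuples:
            # A tuple is similar to t on X u A iff it is similar on X and on A.
            xa_pattern = x_pattern & a_pattern_of[t]
            i_xa.setdefault(xa_pattern, set()).add(t)
    return i_xa


if __name__ == "__main__":
    i_x = {frozenset({0, 1, 2}): {0, 1, 2}, frozenset({3, 4}): {3, 4}}
    i_a = {frozenset({0, 1}): {0, 1}, frozenset({2, 3}): {2, 3}, frozenset({4}): {4}}
    print(combine(i_x, i_a))
```

In this example tuples 0 and 1 remain similar on \(X \cup A\), while tuples 2, 3 and 4 each end up in a singleton pattern.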
Validating a candidate rfd
The general DiM\(\varepsilon \) validation phase is described in Algorithm 4. It first computes \(v_{X \rightarrow A}\) and \(\epsilon _c\); if \(v_{X \rightarrow A} > \epsilon _c\), it discards the candidate rfd (Lines 5–8). Otherwise, it verifies the disjointness property: if the property holds, it invokes Algorithm 5 for an exact computation of the g3-error in polynomial time; if not, it invokes Algorithm 7 for an approximate computation of the g3-error.
In the following, we provide the details of the two algorithms for calculating the g3-error.
Computing the g3-error for disjoint similar pattern subsets
The procedure that DiM\(\varepsilon \) follows to calculate the g3-error is shown in Algorithm 5. It takes as input the sets \(I_{X}\) and \(I_{X \cup A}\), plus a relation instance r, and calculates the fraction of tuples that must be removed to make the candidate rfd \(X_{\varPhi _1}\xrightarrow []{\varPsi \le \varepsilon }A_{\varPhi _2}\) valid. In particular, it analyzes each \(similar_X\in I_{X}\) (Lines 5–9), and it obtains the maximum cardinality \(max{|G_{X\rightarrow A}|_{S_{X}}}\) (Algorithm 6), according to the tuples associated with \(similar_X\) (Lines 6–7). The maximum cardinalities are summed; the sum is then normalized by the total number of tuples and subtracted from 1, as required by the definition of the g3-error (Line 10).
Algorithm 6 takes as input a set similar\(_X\), its associated tuples, and a set \(I_{X \cup A}\), in order to calculate the maximum cardinality \(max{|G_{X\rightarrow A}|_{S_{X}}}\). In particular, for each similar\(_{X \cup A}\), it identifies all similar\(_{X \cup A}\)\(\equiv \) similar\(_X\) (Line 7), aiming to obtain the maximum cardinality among them (Lines 8–10).
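A minimal Python sketch of the computation described for Algorithms 5 and 6 is given below. It assumes the same dictionary representation as before (similar pattern mapped to its associated tuples), and it interprets the relation \(\equiv \) as "the subset of \(I_{X \cup A}\) is associated with at least one tuple of similar\(_X\)"; this interpretation and the function names are assumptions of this sketch, not statements about the original pseudo-code.

```python
PatternSubsets = dict[frozenset[int], set[int]]


def max_cardinality(x_tuples: set[int], i_xa: PatternSubsets) -> int:
    """In the spirit of Algorithm 6: largest X u A subset among those
    related to the given similar_X (assumed: sharing an associated tuple)."""
    best = 0
    for xa_pattern, xa_tuples in i_xa.items():
        if not xa_tuples.isdisjoint(x_tuples):
            best = max(best, len(xa_pattern))
    return best


def g3_error_disjoint(i_x: PatternSubsets, i_xa: PatternSubsets, n: int) -> float:
    """In the spirit of Algorithm 5: 1 - (sum of maximum cardinalities) / n,
    where n is the number of tuples in the relation instance."""
    kept = sum(max_cardinality(x_tuples, i_xa) for x_tuples in i_x.values())
    return 1 - kept / n


if __name__ == "__main__":
    i_x = {frozenset({0, 1, 2}): {0, 1, 2}, frozenset({3, 4}): {3, 4}}
    i_xa = {
        frozenset({0, 1}): {0, 1},
        frozenset({2}): {2},
        frozenset({3}): {3},
        frozenset({4}): {4},
    }
    print(g3_error_disjoint(i_x, i_xa, n=5))
```

In the example, the first subset of \(I_{X}\) keeps two tuples and the second keeps one, so three of the five tuples survive and the error is two fifths.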
Computing the g3-error for intersecting similar pattern subsets
In what follows, we show the greedy solution to calculate the g3-error for intersecting similar pattern subsets. It is derived from a factor-2 approximation algorithm for the minimum vertex cover problem (Johnson and Garey 1979).
To calculate the g3-error for a candidate rfd \(X\rightarrow A\), Algorithm 7 takes as input the sets \(I_{X}\) and \(I_{X \cup A}\), and a relation instance r. It first selects all tuple pairs that are similar on \(I_{X}\) but not on \(I_{X \cup A}\) (Line 3). Then, in Lines 4–8, it iteratively selects a violating tuple pair \(p=(t_1,t_2)\) and removes all violating pairs involving \(t_1\) or \(t_2\). Finally, the algorithm normalizes the number of violating tuple pairs selected on Line 5 (errorSet) by the size of the relation instance r (Line 9).
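The greedy procedure can be sketched in Python as shown below. The sketch reuses the dictionary representation assumed earlier (similar pattern mapped to its associated tuples), treats similarity as symmetric, and picks the next violating pair in sorted order; these choices, like the function names, are assumptions of this illustration. Following the description above, the count that is normalized is the number of selected pairs.

```python
PatternSubsets = dict[frozenset[int], set[int]]


def pattern_by_tuple(subsets: PatternSubsets) -> dict[int, frozenset[int]]:
    """Inverse lookup: the similar pattern associated with each tuple."""
    return {t: pattern for pattern, tuples in subsets.items() for t in tuples}


def violating_pairs(i_x: PatternSubsets, i_xa: PatternSubsets) -> set[tuple[int, int]]:
    """Tuple pairs that are similar on X but not on X u A."""
    x_of, xa_of = pattern_by_tuple(i_x), pattern_by_tuple(i_xa)
    pairs: set[tuple[int, int]] = set()
    for t1, x_pattern in x_of.items():
        for t2 in x_pattern:
            if t1 != t2 and t2 not in xa_of[t1]:
                pairs.add((min(t1, t2), max(t1, t2)))
    return pairs


def g3_error_greedy(i_x: PatternSubsets, i_xa: PatternSubsets, n: int) -> float:
    """Greedy estimate: repeatedly select a violating pair, then drop every
    violating pair that involves either of its two tuples."""
    remaining = sorted(violating_pairs(i_x, i_xa))
    error_set: list[tuple[int, int]] = []
    while remaining:
        t1, t2 = remaining[0]
        error_set.append((t1, t2))
        remaining = [p for p in remaining if t1 not in p and t2 not in p]
    return len(error_set) / n


if __name__ == "__main__":
    i_x = {frozenset({0, 1, 2, 3}): {0, 1, 2, 3}}
    i_xa = {
        frozenset({0, 1}): {0},
        frozenset({0, 1, 2}): {1},
        frozenset({1, 2, 3}): {2},
        frozenset({2, 3}): {3},
    }
    print(sorted(violating_pairs(i_x, i_xa)))
    print(g3_error_greedy(i_x, i_xa, n=4))
```

The selected pairs share no tuple, so they form a maximal matching of the graph whose edges are the violating pairs. The size of a maximal matching is at most the size of a minimum vertex cover, which in turn is at most twice the matching size; this is the standard argument behind the factor-2 guarantee.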
Cite this article
Caruccio, L., Deufemia, V. & Polese, G. Mining relaxed functional dependencies from data. Data Min Knowl Disc 34, 443–477 (2020). https://doi.org/10.1007/s10618-019-00667-7
DOI: https://doi.org/10.1007/s10618-019-00667-7