
Mining relaxed functional dependencies from data

Published in Data Mining and Knowledge Discovery

Abstract

Relaxed functional dependencies (rfds) are properties expressing important relationships among data. Thanks to the introduction of approximations in data comparison and/or validity, they can capture constraints useful for several purposes, such as the identification of data inconsistencies or patterns of semantically related data. Nevertheless, rfds can provide benefits only if they can be automatically discovered from data. In this paper we present an rfd discovery algorithm relying on a lattice structured search space, previously used for fd discovery, new pruning strategies, and a new candidate rfd validation method. An experimental evaluation demonstrates the discovery performances of the proposed algorithm on real datasets, also providing a comparison with other algorithms.


Notes

  1. https://hpi.de/naumann/projects/data-profiling-and-analytics/metanome-data-profiling.html.

  2. The software and the datasets used in the evaluation are available online https://dastlab.github.io/dime/.

  3. http://csxstatic.ist.psu.edu/.



Corresponding author

Correspondence to Vincenzo Deufemia.

Additional information

Responsible editor: Toon Calders.


Appendix: The DiM\(\varepsilon \) algorithm

The DiM\(\varepsilon \) discovery algorithm follows the process of column-based discovery algorithms, in which candidate rfds for a given instance r of a relation schema R are generated through a level-by-level visit of a lattice structure: all possible k-combinations of attributes are considered, with k increasing at each level over the range [2, n], where n is the number of attributes in R. Moreover, at each level only un-pruned candidates (see Sect. 5.2) are considered, and each of them undergoes a validation process. For this reason, DiM\(\varepsilon \) can be considered a column-based algorithm. However, since it aims to discover rfds, it has to employ a new validation strategy, along with new data configurations to highlight similarities between tuples.
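As an illustration, the level-wise lattice traversal described above can be sketched as follows. This is only a sketch, not DiM\(\varepsilon \)'s actual implementation: the `is_pruned` predicate is a hypothetical stand-in for the pruning strategies of Sect. 5.2.

```python
from itertools import combinations

def levelwise_candidates(attributes, is_pruned):
    """Visit the attribute lattice level by level, yielding the
    un-pruned k-combinations for k in [2, n]."""
    n = len(attributes)
    for k in range(2, n + 1):            # one lattice level per k
        for combo in combinations(attributes, k):
            if not is_pruned(combo):     # skip pruned candidates
                yield combo

# With three attributes and no pruning, levels 2 and 3 are visited
cands = list(levelwise_candidates(["A", "B", "C"], lambda c: False))
```

In the real algorithm the pruning predicate is evaluated per level, so a pruned combination also removes its supersets from consideration; the sketch above only shows the visiting order.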

Among the phases of the DiM\(\varepsilon \)’s discovery process described in Sect. 5, in what follows we provide the pseudo-code for those introducing some novelty with respect to other column-based discovery algorithms.

Generating similar pattern subsets. The procedure for generating the set of similar pattern subsets for an attribute A is shown in Algorithm 1. It takes as input a relation instance r, an attribute A, and a constraint \(\phi _A\), and constructs and returns \({I}_{A}\). In particular, for each tuple \(t_i \in r\) (Line 4), it obtains the similar pattern \(\tau ^{t_i}_{A}\) by means of Algorithm 2 (Line 5), adds it to \({I}_{A}\), and associates \(t_i\) to it (Lines 6–7). Notice that it is not necessary to explicitly store difference matrices; indeed, Algorithm 2 directly constructs stripped similar pattern subsets from the differences on tuple pairs. Algorithm 2 takes as input a relation instance r, an attribute A, a tuple \(t_i\), and a constraint \(\phi _A\), and constructs and returns \(\tau ^{t_i}_{A}\). In particular, it analyzes each tuple \(t_j \in r\) in order to evaluate the difference between the values of the tuple pair \((t_i, t_j)\) on attribute A (Line 4); it computes such difference according to the constraint \(\phi _A\) (Line 5), and inserts \(t_j\) into \(\tau ^{t_i}_{A}\) if and only if the computed difference satisfies \(\phi _A\) (Lines 6–8).
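Under the simplifying assumptions of numeric attribute values and a single distance predicate, the construction of \({I}_{A}\) (Algorithms 1–2) can be sketched as follows. Representing r as a list of dicts and the constraint \(\phi _A\) as a predicate `phi` over absolute differences are illustrative choices, not the paper's actual data structures.

```python
def similar_pattern_subsets(r, attr, phi):
    """Build I_A: map each similar pattern (a frozenset of tuple ids
    whose value on `attr` is within the constraint) to the set of
    tuples associated with it."""
    I_A = {}
    for i, ti in enumerate(r):
        # tau^{t_i}_A: all t_j whose difference from t_i satisfies phi
        tau = frozenset(
            j for j, tj in enumerate(r)
            if phi(abs(ti[attr] - tj[attr]))
        )
        # associate t_i to its similar pattern (no difference matrix kept)
        I_A.setdefault(tau, set()).add(i)
    return I_A

# Example: threshold 1 groups the first two tuples, isolates the third
r = [{"A": 1}, {"A": 2}, {"A": 10}]
I_A = similar_pattern_subsets(r, "A", lambda d: d <= 1)
```

Grouping tuples that share the same similar pattern mirrors the "stripped" storage mentioned above: the quadratic tuple-pair comparison never materializes a full difference matrix.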

[Pseudocode: Algorithms 1–3]

The procedure for generating the set of similar pattern subsets for an attribute set \(X \cup A\) is shown in Algorithm 3. It generates \({I}_{X \cup A}\) from \({I}_{X}\) and \({I}_{A}\). In particular, for each similar pattern of \({I}_{X}\), it obtains the tuples associated with it by means of the procedure GetTuples (Line 5). Then, for each associated tuple \(t_i\), it obtains the similar pattern \(\tau ^{t_i}_{A}\) from \({I}_{A}\) and constructs a similar pattern subset for \({I}_{{X \cup A}}\) by intersecting the two considered patterns from \({I}_{X}\) and \({I}_{A}\) (Line 8), adding it to \({I}_{{X \cup A}}\) and associating \(t_i\) to it (Lines 9–10).
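A minimal sketch of Algorithm 3's intersection step, under the same illustrative representation (each \(I\) maps a similar pattern, a frozenset of tuple ids, to the set of tuples associated with it):

```python
def combine_pattern_subsets(I_X, I_A):
    """Build I_{X u A} by intersecting, for each tuple t_i, its
    similar pattern on X with its similar pattern on A."""
    # invert the maps: tuple id -> the pattern it is associated with
    pat_X = {i: tau for tau, ids in I_X.items() for i in ids}
    pat_A = {i: tau for tau, ids in I_A.items() for i in ids}
    I_XA = {}
    for i, tau_X in pat_X.items():
        tau = tau_X & pat_A[i]           # intersect the two patterns
        I_XA.setdefault(tau, set()).add(i)
    return I_XA

# One X-pattern of three tuples combined with A-patterns {0,1} and {2}
I_X = {frozenset({0, 1, 2}): {0, 1, 2}}
I_A = {frozenset({0, 1}): {0, 1}, frozenset({2}): {2}}
I_XA = combine_pattern_subsets(I_X, I_A)
```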

Validating a candidate rfd. The general DiM\(\varepsilon \) validation phase is described in Algorithm 4. First of all, it computes \(v_{X \rightarrow A}\) and \(\epsilon _c\), and if \(v_{X \rightarrow A} > \epsilon _c\) it discards the candidate rfd (Lines 5–8). Otherwise, it verifies the disjointness property: if it holds, Algorithm 5 is invoked for an exact computation of the g3-error in polynomial time; otherwise, Algorithm 7 is invoked for an approximate computation of the g3-error.

In the following, we provide the details of the two algorithms for calculating the g3-error.

[Pseudocode: Algorithm 4]

Computing the g3-error for disjoint similar pattern subsets. The procedure that DiM\(\varepsilon \) follows to calculate the g3-error is shown in Algorithm 5. It takes as input the sets \(I_{X}\) and \(I_{X \cup A}\), plus a relation instance r, to calculate the fraction of tuples that must be removed to make the candidate rfd \(X_{\varPhi _1}\xrightarrow []{\varPsi \le \varepsilon }A_{\varPhi _2}\) valid. In particular, it analyzes each \(similar_X\in I_{X}\) (Lines 5–9) and obtains the maximum cardinality \(max{|G_{X\rightarrow A}|_{S_{X}}}\) (Algorithm 6), according to the tuples associated with \(similar_X\) (Lines 6–7). Then, the resulting sum is normalized with respect to the total number of tuples and subtracted from 1, as prescribed by the definition of the g3-error (Line 10).

Algorithm 6 takes as input a set similar\(_X\), its associated tuples, and a set \(I_{X \cup A}\), in order to calculate the maximum cardinality \(max{|G_{X\rightarrow A}|_{S_{X}}}\). In particular, for each similar\(_{X \cup A}\), it identifies all similar\(_{X \cup A}\) \(\equiv \) similar\(_X\) (Line 7), so as to obtain the maximum cardinality among them (Lines 8–10).
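Using the same illustrative maps as before, the exact computation (Algorithms 5–6) can be sketched as follows. Reading the equivalence test of Algorithm 6 as containment of the associated tuples is our assumption, made only to keep the sketch self-contained.

```python
def g3_exact_disjoint(I_X, I_XA, n):
    """Exact g3-error for disjoint similar pattern subsets:
    g3 = 1 - (sum of the largest consistent group per X-pattern) / n."""
    kept = 0
    for similar_X, ids_X in I_X.items():
        # Algorithm 6 (assumed reading): among the X u A groups whose
        # associated tuples fall inside this X-pattern's tuples, take
        # the maximum cardinality
        best = max(
            (len(ids_XA) for ids_XA in I_XA.values() if ids_XA <= ids_X),
            default=0,
        )
        kept += best
    return 1 - kept / n

# One X-pattern of 3 tuples splitting into X u A groups of sizes 2 and 1:
# removing one tuple restores validity, so g3 = 1/3
err = g3_exact_disjoint(
    {frozenset({0, 1, 2}): {0, 1, 2}},
    {frozenset({0, 1}): {0, 1}, frozenset({2}): {2}},
    3,
)
```

Keeping the largest consistent group per cluster is exactly what makes the disjoint case solvable in polynomial time: no group choice in one cluster constrains the choices in another.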

[Pseudocode: Algorithms 5–7]

Computing the g3-error for intersecting similar pattern subsets. In what follows, we show the greedy solution to calculate the g3-error for intersecting similar pattern subsets. It is derived from a factor-2 approximation algorithm for the minimum vertex cover problem (Garey and Johnson 1979).

To calculate the g3-error for a candidate rfd \(X\rightarrow A\), Algorithm 7 takes as input the sets \(I_{X}\) and \(I_{X \cup A}\), and a relation instance r. In particular, it selects all tuple pairs that are similar on \(I_{X}\) but not on \(I_{X \cup A}\) (Line 3), whereas in Lines 4–8 it iteratively selects a violating tuple pair \(p=(t_1,t_2)\) and removes all violating pairs involving \(t_1\) or \(t_2\). Finally, the algorithm normalizes the number of violating tuple pairs selected on Line 5 (errorSet) by the size of the relation instance r (Line 9).
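The greedy step can be sketched as follows; representing the violating pairs as a set of tuple-id pairs is an illustrative choice. Since each selected pair covers every violation it touches, the selection is a maximal matching, which is what yields the factor-2 guarantee.

```python
def g3_greedy(violating_pairs, n):
    """Greedy (matching-based, factor-2 vertex cover) estimate of the
    g3-error: repeatedly pick a violating pair, then drop every pair
    sharing a tuple with it."""
    remaining = set(violating_pairs)
    error_set = []
    while remaining:
        t1, t2 = remaining.pop()         # pick any violating pair
        error_set.append((t1, t2))
        # remove all violating pairs involving t1 or t2
        remaining = {p for p in remaining if t1 not in p and t2 not in p}
    return len(error_set) / n            # normalize by |r|

# Pairs (0,1) and (1,2) share tuple 1, so exactly two pairs are picked
est = g3_greedy({(0, 1), (1, 2), (3, 4)}, 5)
```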


Cite this article

Caruccio, L., Deufemia, V. & Polese, G. Mining relaxed functional dependencies from data. Data Min Knowl Disc 34, 443–477 (2020). https://doi.org/10.1007/s10618-019-00667-7
