
Mining relaxed functional dependencies from data

Published in Data Mining and Knowledge Discovery

Abstract

Relaxed functional dependencies (rfds) are properties expressing important relationships among data. Thanks to the introduction of approximations in data comparison and/or validity, they can capture constraints useful for several purposes, such as the identification of data inconsistencies or patterns of semantically related data. Nevertheless, rfds can provide benefits only if they can be automatically discovered from data. In this paper we present an rfd discovery algorithm relying on a lattice structured search space, previously used for fd discovery, new pruning strategies, and a new candidate rfd validation method. An experimental evaluation demonstrates the discovery performances of the proposed algorithm on real datasets, also providing a comparison with other algorithms.


Notes

  1. https://hpi.de/naumann/projects/data-profiling-and-analytics/metanome-data-profiling.html.

  2. The software and the datasets used in the evaluation are available online https://dastlab.github.io/dime/.

  3. http://csxstatic.ist.psu.edu/.



Corresponding author

Correspondence to Vincenzo Deufemia.

Additional information

Responsible editor: Toon Calders.


Appendix: The DiM\(\varepsilon \) algorithm

The DiM\(\varepsilon \) discovery algorithm follows the process of column-based discovery algorithms, in which candidate rfds for a given instance r of a relation schema R are generated through a level-by-level visit of a lattice structure: all possible k-combinations of attributes are considered, with k increasing at each level over the range [2, n], where n is the number of attributes in R. Moreover, at each level only un-pruned candidates (see Sect. 5.2) are considered, and each of them undergoes a validation process. For this reason, DiM\(\varepsilon \) can be considered a column-based algorithm. However, since it aims to discover rfds, it has to employ a new validation strategy, along with new data configurations to highlight similarities between tuples.
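As an illustration, the level-wise lattice traversal described above can be sketched as follows. This is only a sketch, not DiM\(\varepsilon \)'s actual implementation: the `is_pruned` predicate is a hypothetical stand-in for the pruning strategies of Sect. 5.2.

```python
from itertools import combinations

def levelwise_candidates(attributes, is_pruned):
    """Visit the attribute lattice level by level, yielding the
    un-pruned k-combinations for k in [2, n]."""
    n = len(attributes)
    for k in range(2, n + 1):            # one lattice level per k
        for combo in combinations(attributes, k):
            if not is_pruned(combo):     # skip pruned candidates
                yield combo

# With three attributes and no pruning, levels 2 and 3 are visited
cands = list(levelwise_candidates(["A", "B", "C"], lambda c: False))
```

In the real algorithm the pruning predicate is evaluated per level, so a pruned combination also removes its supersets from consideration; the sketch above only shows the visiting order.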

Among the phases of the DiM\(\varepsilon \)’s discovery process described in Sect. 5, in what follows we provide the pseudo-code for those introducing some novelty with respect to other column-based discovery algorithms.

Generating similar pattern subsets. The procedure for generating the set of similar pattern subsets for an attribute A is shown in Algorithm 1. It takes as input a relation instance r, an attribute A, and a constraint \(\phi _A\), and constructs and returns \({I}_{A}\). In particular, for each tuple \(t_i \in r\) (Line 4), it obtains the similar pattern \(\tau ^{t_i}_{A}\) by means of Algorithm 2 (Line 5), adds it to \({I}_{A}\), and associates \(t_i\) to it (Lines 6–7). Notice that it is not necessary to explicitly store difference matrices; indeed, Algorithm 2 directly constructs stripped similar pattern subsets from the differences on tuple pairs. Algorithm 2 takes as input a relation instance r, an attribute A, a tuple \(t_i\), and a constraint \(\phi _A\), and constructs and returns \(\tau ^{t_i}_{A}\). In particular, it analyzes each tuple \(t_j \in r\) in order to evaluate the difference between the values of the tuple pair \((t_i, t_j)\) on attribute A (Line 4); it computes such difference according to the constraint \(\phi _A\) (Line 5), and inserts \(t_j\) into \(\tau ^{t_i}_{A}\) if and only if the computed difference satisfies \(\phi _A\) (Lines 6–8).
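Under the simplifying assumptions of numeric attribute values and a single distance predicate, the construction of \({I}_{A}\) (Algorithms 1–2) can be sketched as follows. Representing r as a list of dicts and the constraint \(\phi _A\) as a predicate `phi` over absolute differences are illustrative choices, not the paper's actual data structures.

```python
def similar_pattern_subsets(r, attr, phi):
    """Build I_A: map each similar pattern (a frozenset of tuple ids
    whose value on `attr` is within the constraint) to the set of
    tuples associated with it."""
    I_A = {}
    for i, ti in enumerate(r):
        # tau^{t_i}_A: all t_j whose difference from t_i satisfies phi
        tau = frozenset(
            j for j, tj in enumerate(r)
            if phi(abs(ti[attr] - tj[attr]))
        )
        # associate t_i to its similar pattern (no difference matrix kept)
        I_A.setdefault(tau, set()).add(i)
    return I_A

# Example: threshold 1 groups the first two tuples, isolates the third
r = [{"A": 1}, {"A": 2}, {"A": 10}]
I_A = similar_pattern_subsets(r, "A", lambda d: d <= 1)
```

Grouping tuples that share the same similar pattern mirrors the "stripped" storage mentioned above: the quadratic tuple-pair comparison never materializes a full difference matrix.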

[Pseudocode: Algorithms 1–3]

The procedure for generating the set of similar pattern subsets for an attribute set \(X \cup A\) is shown in Algorithm 3. It generates \({I}_{X \cup A}\) from \({I}_{X}\) and \({I}_{A}\). In particular, for each similar pattern of \({I}_{X}\), it obtains the tuples associated with it by means of the procedure GetTuples (Line 5). Then, for each associated tuple \(t_i\), it obtains the similar pattern \(\tau ^{t_i}_{A}\) from \({I}_{A}\) and constructs a similar pattern subset for \({I}_{{X \cup A}}\) by intersecting the two considered patterns from \({I}_{X}\) and \({I}_{A}\) (Line 8), adding it to \({I}_{{X \cup A}}\) and associating \(t_i\) to it (Lines 9–10).
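A minimal sketch of Algorithm 3's intersection step, under the same illustrative representation (each \(I\) maps a similar pattern, a frozenset of tuple ids, to the set of tuples associated with it):

```python
def combine_pattern_subsets(I_X, I_A):
    """Build I_{X u A} by intersecting, for each tuple t_i, its
    similar pattern on X with its similar pattern on A."""
    # invert the maps: tuple id -> the pattern it is associated with
    pat_X = {i: tau for tau, ids in I_X.items() for i in ids}
    pat_A = {i: tau for tau, ids in I_A.items() for i in ids}
    I_XA = {}
    for i, tau_X in pat_X.items():
        tau = tau_X & pat_A[i]           # intersect the two patterns
        I_XA.setdefault(tau, set()).add(i)
    return I_XA

# One X-pattern of three tuples combined with A-patterns {0,1} and {2}
I_X = {frozenset({0, 1, 2}): {0, 1, 2}}
I_A = {frozenset({0, 1}): {0, 1}, frozenset({2}): {2}}
I_XA = combine_pattern_subsets(I_X, I_A)
```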

Validating a candidate rfd. The general DiM\(\varepsilon \) validation phase is described in Algorithm 4. First of all, it computes \(v_{X \rightarrow A}\) and \(\epsilon _c\), and if \(v_{X \rightarrow A} > \epsilon _c\) it discards the candidate rfd (Lines 5–8). Otherwise, it verifies the disjointness property: if it holds, Algorithm 5 is invoked for an exact computation of the g3-error in polynomial time; otherwise, Algorithm 7 is invoked for an approximate computation of the g3-error.

In the following, we provide the details of the two algorithms for calculating the g3-error.

[Pseudocode: Algorithm 4]

Computing the g3-error for disjoint similar pattern subsets. The procedure that DiM\(\varepsilon \) follows to calculate the g3-error is shown in Algorithm 5. It takes as input the sets \(I_{X}\) and \(I_{X \cup A}\), plus a relation instance r, to calculate the fraction of tuples that must be removed to make the candidate rfd \(X_{\varPhi _1}\xrightarrow []{\varPsi \le \varepsilon }A_{\varPhi _2}\) valid. In particular, it analyzes each \(similar_X\in I_{X}\) (Lines 5–9) and obtains the maximum cardinality \(max{|G_{X\rightarrow A}|_{S_{X}}}\) (Algorithm 6), according to the tuples associated with \(similar_X\) (Lines 6–7). Then, the resulting sum is normalized with respect to the total number of tuples and subtracted from 1, as prescribed by the definition of the g3-error (Line 10).

Algorithm 6 takes as input a set similar\(_X\), its associated tuples, and a set \(I_{X \cup A}\), in order to calculate the maximum cardinality \(max{|G_{X\rightarrow A}|_{S_{X}}}\). In particular, for each similar\(_{X \cup A}\), it identifies all similar\(_{X \cup A}\) \(\equiv \) similar\(_X\) (Line 7), so as to obtain the maximum cardinality among them (Lines 8–10).
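Using the same illustrative maps as before, the exact computation (Algorithms 5–6) can be sketched as follows. Reading the equivalence test of Algorithm 6 as containment of the associated tuples is our assumption, made only to keep the sketch self-contained.

```python
def g3_exact_disjoint(I_X, I_XA, n):
    """Exact g3-error for disjoint similar pattern subsets:
    g3 = 1 - (sum of the largest consistent group per X-pattern) / n."""
    kept = 0
    for similar_X, ids_X in I_X.items():
        # Algorithm 6 (assumed reading): among the X u A groups whose
        # associated tuples fall inside this X-pattern's tuples, take
        # the maximum cardinality
        best = max(
            (len(ids_XA) for ids_XA in I_XA.values() if ids_XA <= ids_X),
            default=0,
        )
        kept += best
    return 1 - kept / n

# One X-pattern of 3 tuples splitting into X u A groups of sizes 2 and 1:
# removing one tuple restores validity, so g3 = 1/3
err = g3_exact_disjoint(
    {frozenset({0, 1, 2}): {0, 1, 2}},
    {frozenset({0, 1}): {0, 1}, frozenset({2}): {2}},
    3,
)
```

Keeping the largest consistent group per cluster is exactly what makes the disjoint case solvable in polynomial time: no group choice in one cluster constrains the choices in another.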

[Pseudocode: Algorithms 5–7]

Computing the g3-error for intersecting similar pattern subsets. In what follows, we show the greedy solution to calculate the g3-error for intersecting similar pattern subsets. It is derived from a factor-2 approximation algorithm for the minimum vertex cover problem (Garey and Johnson 1979).

To calculate the g3-error for a candidate rfd \(X\rightarrow A\), Algorithm 7 takes as input the sets \(I_{X}\) and \(I_{X \cup A}\), and a relation instance r. In particular, it selects all tuple pairs that are similar on \(I_{X}\) but not on \(I_{X \cup A}\) (Line 3), whereas in Lines 4–8 it iteratively selects a violating tuple pair \(p=(t_1,t_2)\) and removes all violating pairs involving \(t_1\) or \(t_2\). Finally, the algorithm normalizes the number of violating tuple pairs selected on Line 5 (errorSet) by the size of the relation instance r (Line 9).
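The greedy step can be sketched as follows; representing the violating pairs as a set of tuple-id pairs is an illustrative choice. Since each selected pair covers every violation it touches, the selection is a maximal matching, which is what yields the factor-2 guarantee.

```python
def g3_greedy(violating_pairs, n):
    """Greedy (matching-based, factor-2 vertex cover) estimate of the
    g3-error: repeatedly pick a violating pair, then drop every pair
    sharing a tuple with it."""
    remaining = set(violating_pairs)
    error_set = []
    while remaining:
        t1, t2 = remaining.pop()         # pick any violating pair
        error_set.append((t1, t2))
        # remove all violating pairs involving t1 or t2
        remaining = {p for p in remaining if t1 not in p and t2 not in p}
    return len(error_set) / n            # normalize by |r|

# Pairs (0,1) and (1,2) share tuple 1, so exactly two pairs are picked
est = g3_greedy({(0, 1), (1, 2), (3, 4)}, 5)
```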


Cite this article

Caruccio, L., Deufemia, V. & Polese, G. Mining relaxed functional dependencies from data. Data Min Knowl Disc 34, 443–477 (2020). https://doi.org/10.1007/s10618-019-00667-7
