Algorithms for the discovery of embedded functional dependencies

Abstract

Embedded functional dependencies (eFDs) advance data management applications through data completeness and integrity requirements. We show that the discovery problem of eFDs is \(\mathsf{NP}\)-complete, \(\mathsf{W}[2]\)-complete in the output, and has a minimum solution space that is larger than the maximum solution space for functional dependencies. Nevertheless, we use novel data structures and search strategies to develop row-efficient, column-efficient, and hybrid algorithms for eFD discovery. Our experiments demonstrate that the algorithms scale well in terms of their design targets, and that ranking the eFDs by the number of redundant data values they cause can provide useful guidance in identifying meaningful eFDs for applications. We further demonstrate the benefits of introducing completeness requirements and ranking by the number of redundant data values for other variants of functional dependencies. Finally, we show how to compute informative Armstrong samples and illustrate the performance of our algorithms on benchmark data. The informative Armstrong samples can be used to find eFDs that are meaningful for the application domain but violated by a given data set due to inconsistencies.
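To make the two central notions of the abstract concrete, the following minimal sketch checks a single eFD on a small table with missing values and counts the redundant data values it causes. It rests on assumptions that are not spelled out on this page: an eFD is taken to be an embedding E together with an FD X → Y over E, it is taken to hold when X → Y holds on the rows that have no missing value on E, and a value occurrence in Y is counted as redundant when at least one other such row agrees with it on X. The names `complete_rows` and `check_efd` and the toy data are hypothetical and are not the paper's algorithms, which discover all eFDs rather than validate one.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[str]]  # None stands for a missing value


def complete_rows(rows: Sequence[Row], embedding: Sequence[str]) -> List[Row]:
    """Keep only the rows with no missing value on any attribute of the embedding."""
    return [r for r in rows if all(r[a] is not None for a in embedding)]


def check_efd(rows: Sequence[Row], embedding: Sequence[str],
              lhs: Sequence[str], rhs: Sequence[str]) -> Tuple[bool, int]:
    """Return (holds, redundant) for the assumed eFD  embedding: lhs -> rhs.

    holds     -- True if lhs -> rhs holds on the rows complete on the embedding.
    redundant -- number of rhs value occurrences in groups of at least two
                 rows that agree on lhs (0 if the eFD does not hold).
    """
    groups = defaultdict(list)
    for r in complete_rows(rows, embedding):
        groups[tuple(r[a] for a in lhs)].append(tuple(r[a] for a in rhs))
    holds = all(len(set(vals)) == 1 for vals in groups.values())
    if not holds:
        return False, 0
    redundant = sum(len(vals) * len(rhs) for vals in groups.values() if len(vals) > 1)
    return True, redundant


# Hypothetical toy data: the fourth row conflicts on city but has no name.
rows = [
    {"name": "Ann", "city": "Oslo", "zip": "0150"},
    {"name": "Bob", "city": "Oslo", "zip": "0150"},
    {"name": "Cy", "city": None, "zip": "0150"},
    {"name": None, "city": "Drammen", "zip": "0150"},
    {"name": "Dee", "city": "Bergen", "zip": "5003"},
]

print(check_efd(rows, ["city", "zip"], ["zip"], ["city"]))          # (False, 0)
print(check_efd(rows, ["name", "city", "zip"], ["zip"], ["city"]))  # (True, 2)
```

In this toy example the same FD zip → city fails under the embedding {city, zip} but holds under the stricter embedding {name, city, zip}, because the conflicting row is incomplete on name. This is one way to see why the choice of completeness requirement enlarges the search space compared with plain functional dependencies.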

Notes

  1. https://archive.ics.uci.edu/ml/

References

  1. Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data Profiling. Synthesis Lectures on Data Management. Morgan & Claypool, New York (2018)

  2. Abedjan, Z., Schulze, P., Naumann, F.: DFD: efficient functional dependency discovery. In: CIKM, pp. 949–958 (2014)

  3. Berti-Équille, L., Harmouch, H., Naumann, F., Novelli, N., Thirumuruganathan, S.: Discovery of genuine functional dependencies from relational data with missing values. PVLDB 11(8), 880–892 (2018)

  4. Bläsius, T., Friedrich, T., Schirneck, M.: The parameterized complexity of dependency detection in relational databases. In: LIPIcs-Leibniz International Proceedings in Informatics, volume 63. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2017)

  5. Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for data cleaning. In: ICDE, pp. 746–755 (2007)

  6. Bravo, L., Fan, W., Geerts, F., Ma, S.: Increasing the expressivity of conditional functional dependencies without extra complexity. In: ICDE, pp. 516–525 (2008)

  7. Caruccio, L., Deufemia, V., Polese, G.: Relaxed functional dependencies—a survey of approaches. IEEE Trans. Knowl. Data Eng. 28(1), 147–165 (2016)

  8. Demetrovics, J., Katona, G.O.H., Miklós, D., Thalheim, B.: On the number of independent functional dependencies. In: FoIKS, pp. 83–91 (2006)

  9. Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33(2), 6:1–6:48 (2008)

  10. Fan, W., Geerts, F., Lakshmanan, L.V.S., Xiong, M.: Discovering conditional functional dependencies. In: ICDE, pp. 1231–1234 (2009)

  11. Fan, W., Geerts, F., Li, J., Xiong, M.: Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng. 23(5), 683–698 (2011)

  12. Flach, P.A., Savnik, I.: Database dependency discovery. AI Commun. 12(3), 139–160 (1999)

  13. Gallier, J.: Discrete Mathematics. Universitext. Springer, New York (2011)

  14. Giannella, C., Wyss, C.: Finding minimal keys in a relation instance. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7086 (1999)

  15. Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)

  16. Kruse, S., Naumann, F.: Efficient discovery of approximate dependencies. PVLDB 11(7), 759–772 (2018)

  17. Link, S., Wei, Z.: Logical schema design that quantifies update inefficiency and join efficiency. In: SIGMOD, pp. 1169–1181 (2021)

  18. Lopes, S., Petit, J., Lakhal, L.: Efficient discovery of functional dependencies and Armstrong relations. In: EDBT, pp. 350–364 (2000)

  19. Mannila, H., Räihä, K.: Design by example: an application of Armstrong relations. J. Comput. Syst. Sci. 33(2), 126–141 (1986)

  20. Mannila, H., Räihä, K.: Dependency inference. In: VLDB, pp. 155–158 (1987)

  21. Marchi, F.D., Petit, J.: Semantic sampling of existing databases through informative Armstrong databases. Inf. Syst. 32(3), 446–457 (2007)

  22. Novelli, N., Cicchetti, R.: Functional and embedded dependency inference: a data mining point of view. Inf. Syst. 26(7), 477–506 (2001)

  23. Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J., Schönberg, M., Zwiener, J., Naumann, F.: Functional dependency discovery: an experimental evaluation of seven algorithms. PVLDB 8(10), 1082–1093 (2015)

  24. Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency discovery. In: SIGMOD, pp. 821–833 (2016)

  25. Papenbrock, T., Naumann, F.: Data-driven schema normalization. In: EDBT, pp. 342–353 (2017)

  26. Sismanis, Y., Brown, P., Haas, P.J., Reinwald, B.: GORDIAN: efficient and scalable discovery of composite keys. In: VLDB, pp. 691–702 (2006)

  27. Stănică, P.: Good lower and upper bounds on binomial coefficients. JIPAM. J. Inequal. Pure Appl. Math. 2(3), Article 30,5 (2001)

  28. Visengeriyeva, L., Abedjan, Z.: Anatomy of metadata for data curation. ACM J. Data Inf. Qual. 12(3), 16:1–16:30 (2020)

  29. Wei, Z., Hartmann, S., Link, S.: Discovery algorithms for embedded functional dependencies. In: SIGMOD, pp. 833–843 (2020)

  30. Wei, Z., Leck, U., Link, S.: Discovery and ranking of embedded uniqueness constraints. PVLDB 12(13), 2339–2352 (2019)

  31. Wei, Z., Link, S.: Embedded cardinality constraints. In: CAiSE, pp. 523–538 (2018)

  32. Wei, Z., Link, S.: DataProf: semantic profiling for iterative data cleansing and business rule acquisition. In: SIGMOD, pp. 1793–1796 (2018)

  33. Wei, Z., Link, S.: Discovery and ranking of functional dependencies. In: ICDE, pp. 1526–1537 (2019)

  34. Wei, Z., Link, S.: Embedded functional dependencies and data-completeness tailored database design. PVLDB 12(11), 1458–1470 (2019)

  35. Wei, Z., Link, S.: Embedded functional dependencies and data-completeness tailored database design. ACM Trans. Database Syst. 46(2), 7:1–7:46 (2021)

  36. Wyss, C.M., Giannella, C., Robertson, E.L.: FastFDs: a heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In: DaWaK, pp. 101–110 (2001)

  37. Yao, H., Hamilton, H.J., Butz, C.J.: FD_Mine: discovering functional dependencies in a database using equivalences. In: ICDM, pp. 729–732 (2002)

Author information

Corresponding author

Correspondence to Sebastian Link.

Additional information

Some results were presented at SIGMOD 2020 [29].

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 168 KB)

About this article

Cite this article

Wei, Z., Hartmann, S. & Link, S. Algorithms for the discovery of embedded functional dependencies. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00684-3

Keywords

  • Algorithm
  • Armstrong sample
  • Completeness requirement
  • Data redundancy
  • Discovery
  • Integrity requirement
  • Intractability
  • Missing data
  • Functional dependency