Distributed and Parallel Databases, Volume 37, Issue 1, pp 133–176

Incrementally updating unary inclusion dependencies in dynamic data

  • Nuhad Shaabani
  • Christoph Meinel
Part of the following topical collections:
  1. Special Issue on Scientific and Statistical Data Management

Abstract

Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution, and schema matching. Their discovery in an unknown dataset is at the core of any data-analysis effort. Several research approaches have therefore focused on discovering them efficiently in a given, static dataset. However, none of these approaches is suited to dynamic datasets, where a discovery technique should efficiently update the inclusion dependencies after each change to the dataset without reprocessing the entire dataset. We present the first approach for incrementally updating unary inclusion dependencies. Our approach is based on the concept of attribute clustering, from which the unary inclusion dependencies can be derived efficiently. We incrementally update the clusters after each update of the dataset. Updating the clusters does not require access to the dataset, because special data structures are designed to support the updating process efficiently. We evaluated our approach exhaustively on large datasets with several hundred attributes and more than 116.2 million tuples. The results show that incremental discovery significantly reduces the runtime compared with static discovery: the reduction is up to 99.9996% for both insertions and deletions.
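
To make the idea of attribute clustering concrete, the following is a minimal Python sketch and not the authors' algorithm or data structures. It assumes that a cluster is the set of attributes sharing a given value, and that a unary inclusion dependency A ⊆ B holds exactly when B appears in every cluster that contains A. The class name `UnaryIndIndex`, its methods, and the per-value reference counts used to handle deletions without rereading the dataset are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class UnaryIndIndex:
    """Illustrative index: value -> {attribute: occurrence count}."""

    def __init__(self):
        # Assumption: reference counts let a deletion decide, without
        # touching the dataset, whether a value has left an attribute.
        self._counts = defaultdict(lambda: defaultdict(int))

    def insert(self, row):
        """Register one inserted tuple, given as {attribute: value}."""
        for attr, value in row.items():
            self._counts[value][attr] += 1

    def delete(self, row):
        """Unregister one tuple that was previously inserted."""
        for attr, value in row.items():
            per_attr = self._counts[value]
            per_attr[attr] -= 1
            if per_attr[attr] == 0:
                del per_attr[attr]          # value no longer occurs in attr
            if not per_attr:
                del self._counts[value]     # value no longer occurs anywhere

    def clusters(self):
        """Distinct sets of attributes that share at least one value."""
        return {frozenset(per_attr) for per_attr in self._counts.values()}

    def unary_inds(self):
        """Return pairs (A, B) meaning A is included in B."""
        referenced = {}
        for cluster in self.clusters():
            for attr in cluster:
                others = cluster - {attr}
                if attr in referenced:
                    # B survives only if it is in every cluster containing A.
                    referenced[attr] &= others
                else:
                    referenced[attr] = set(others)
        return {(a, b) for a, refs in referenced.items() for b in refs}


index = UnaryIndIndex()
index.insert({"customer.id": 1})
index.insert({"customer.id": 2})
index.insert({"order.customer_id": 1})
inds_before = index.unary_inds()

dangling = {"order.customer_id": 3}   # value 3 is not a customer id
index.insert(dangling)
inds_broken = index.unary_inds()

index.delete(dangling)
inds_restored = index.unary_inds()
```

In this sketch an insertion can only invalidate dependencies and a deletion can only restore them, and both are handled by touching the entries of the affected values alone. Attributes that currently hold no values do not appear in any cluster, so the sketch reports no dependencies for them.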

Keywords

Algorithms, Data profiling, Incremental discovery, Dynamic data, Change data capture

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Hasso-Plattner-Institut, Potsdam, Germany
