Heuristic Strategies for Inclusion Dependency Discovery

  • Conference paper
On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE (OTM 2004)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 3291))

Abstract

Inclusion dependencies (INDs) between databases assert subset relationships between sets of attributes (dimensions) in two relations. Such dependencies are useful for several purposes in information integration, such as database similarity discovery and foreign key discovery.
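As an illustration of the definition above, the following minimal sketch tests whether a single IND holds between two small in-memory relations. The function name `holds_ind`, the dictionary-based row representation, and the sample data are assumptions made for this example; they are not taken from the paper.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def holds_ind(r_rows: List[Row], r_attrs: Sequence[str],
              s_rows: List[Row], s_attrs: Sequence[str]) -> bool:
    """Return True if R[r_attrs] is a subset of S[s_attrs].

    Hypothetical helper: the attribute lists are matched position by
    position, so both must have the same length k (the IND's arity).
    """
    if len(r_attrs) != len(s_attrs):
        raise ValueError("both sides of an IND need the same number of attributes")
    # Project each relation onto its attribute list, keeping distinct tuples.
    left = {tuple(row[a] for a in r_attrs) for row in r_rows}
    right = {tuple(row[b] for b in s_attrs) for row in s_rows}
    return left <= right


# Made-up sample relations for illustration only.
orders = [
    {"cust": 1, "city": "Boston"},
    {"cust": 2, "city": "Worcester"},
]
customers = [
    {"id": 1, "town": "Boston"},
    {"id": 2, "town": "Worcester"},
    {"id": 3, "town": "Lowell"},
]

unary = holds_ind(orders, ["cust"], customers, ["id"])
binary = holds_ind(orders, ["cust", "city"], customers, ["id", "town"])
reverse = holds_ind(customers, ["id"], orders, ["cust"])
```

In this sample data the unary and binary dependencies from `orders` into `customers` hold, while the reverse direction does not, which is the pattern one would expect of a foreign key.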

An exhaustive approach to discovering INDs between two relations suffers from the curse of dimensionality, since the number of potential mappings of size k between the attributes of two relations is exponential in k. For this reason, levelwise (Apriori-like) discovery approaches do not scale beyond a k of 8 to 10. Approaches that model the similarity space as a hypergraph (with the hyperedges representing sets of related attributes) are promising, but also do not scale very well.
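To give a feel for this growth, the sketch below counts k-ary IND candidates under one common counting convention: choose k distinct attributes from the first relation, then map them injectively onto k distinct attributes of the second. This convention and the function name `count_candidates` are assumptions for illustration; the paper's own counting may differ in detail.

```python
from math import comb, perm


def count_candidates(n: int, m: int, k: int) -> int:
    """Count k-ary IND candidates between relations with n and m attributes.

    Assumed convention: an unordered choice of k attributes on the left
    (comb) combined with an ordered, repetition-free choice of k
    attributes on the right (perm). Returns 0 when k exceeds n or m.
    """
    return comb(n, k) * perm(m, k)


# Candidate counts per arity for two hypothetical 10-attribute relations.
growth = {k: count_candidates(10, 10, k) for k in range(1, 11)}
```

Even for two relations with only ten attributes each, the count rises from 100 unary candidates to several million at k = 5 under this convention, which suggests why exhaustive enumeration becomes impractical well before k reaches the number of attributes.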

This paper discusses approaches to scale discovery algorithms for INDs. The major obstacle to scalability is the exponentially growing size of the data structure representing potential INDs. Therefore, the focus of our solution is on heuristic techniques that reduce the number of IND candidates considered by the algorithm. Despite the use of heuristics, the accuracy of the results is good for real-world data.

We present experiments that assess the quality of the discovery results against the runtime savings. We conclude that the heuristic approach is useful and improves scalability significantly. It is particularly applicable to relations whose attributes have few distinct values.

This work was supported in part by the NSF NYI grant #IRI 97–96264, the NSF CISE Instrumentation grant #IRIS 97–29878, and the NSF grant #IIS 9988776.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Koeller, A., Rundensteiner, E.A. (2004). Heuristic Strategies for Inclusion Dependency Discovery. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30469-2_5

  • DOI: https://doi.org/10.1007/978-3-540-30469-2_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23662-7

  • Online ISBN: 978-3-540-30469-2

  • eBook Packages: Springer Book Archive
