Heuristic Strategies for the Discovery of Inclusion Dependencies and Other Patterns

Koeller, Andreas; Rundensteiner, Elke A.

doi:10.1007/11617808_7

Andreas Koeller²¹ &
Elke A. Rundensteiner²²

Part of the book series: Lecture Notes in Computer Science ((JODS,volume 3870))

259 Accesses
4 Citations

Abstract

Inclusion dependencies (INDs) between databases are assertions of subset-relationships between sets of attributes (dimensions) in two relations. Such dependencies are useful for a number of purposes related to information integration, such as database similarity discovery and foreign key discovery.

An exhaustive approach at discovering INDs between two relations suffers from the dimensionality curse, since the number of potential mappings between the attributes of two relations is exponential in the number of attributes. For this reason, levelwise (Apriori-like) approaches at discovery do not scale beyond relations with 8 to 10 attributes. Approaches modeling the similarity space as graphs or hypergraphs are promising, but also do not scale very well.

This paper discusses approaches to scale discovery algorithms for INDs and some other similarity patterns in databases. The major obstacle to scalability is the exponentially growing size of the data structure representing potential INDs. Therefore, the focus of our solution is on heuristic techniques that reduce the number of IND candidates considered by the algorithm. Despite the use of heuristics, the accuracy of the results is good for real-world data.

Experiments are presented assessing the quality of the discovery results versus the runtime savings. We conclude that the heuristic approach is useful and improves scalability significantly. It is particularly applicable for relations that have attributes with few distinct values.

This work was supported in part by NSF grant #IIS9988776.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Casanova, M.A., Fagin, R., Papadimitriou, C.H.: Inclusion dependencies and their interaction with functional dependencies. In: Proceedings of ACM Conference on Principles of Database Systems (PODS), pp. 171–176 (1982)
Google Scholar
Kantola, M., Mannila, H., Räihä, K.J., Siirtola, H.: Discovering functional and inclusion dependencies in relational databases. International J. Of Intelligent Systems 7, 591–607 (1992)
Article MATH Google Scholar
Mannila, H., Räihä, K.J.: Algorithms for inferring functional-dependencies from relations. Data & Knowledge Engineering 12, 83–99 (1994)
Article MATH Google Scholar
de Marchi, F., Lopes, S., Petit, J.M., Toumani, F.: Analysis of existing databases at the logical level: the DBA companion project. SIGMOD Record (ACM Special Interest Group on Management of Data) 32, 47–52 (2003)
Google Scholar
Lee, A.J., Nica, A., Rundensteiner, E.A.: The EVE approach: View synchronization in dynamic distributed environments. IEEE Transactions on Knowledge and Data Engineering (TKDE) 14, 931–954 (2002)
Article Google Scholar
Gryz, J.: Query folding with inclusion dependencies. In: Proc. Intl. Conf. on Data Engineering, pp. 126–133. IEEE Computer Society, Los Alamitos (1998)
Chapter Google Scholar
Calì, A., Lembo, D., Rosati, R.: Query rewriting and answering under constraints in data integration systems. In: Proceedings of IJCAI, pp. 16–21 (2003)
Google Scholar
Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal: Very Large Data Bases 10, 334–350 (2001)
Article MATH Google Scholar
Koeller, A.: Integration of Heterogeneous Databases: Discovery of Meta-Information and Maintenance of Schema-Restructuring Views. Ph.D thesis, Worcester Polytechnic Institute, Worcester, MA, USA (2001)
Google Scholar
de Marchi, F., Lopes, S., Petit, J.M.: Efficient algorithms for mining inclusion dependencies. In: Jensen, C.S., Jeffery, K., Pokorný, J., Šaltenis, S., Bertino, E., Böhm, K., Jarke, M. (eds.) EDBT 2002. LNCS, vol. 2287, pp. 464–476. Springer, Heidelberg (2002)
Chapter Google Scholar
Aggarwal, C.C., Yu, P.S.: Online generation of association rules. In: Proceedings of IEEE International Conference on Data Engineering, pp. 402–411 (1998)
Google Scholar
Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1, 241–258 (1997)
Article Google Scholar
Koeller, A., Rundensteiner, E.A.: Discovery of high-dimensional inclusion dependencies. In: Proceedings of IEEE International Conference on Data Engineering, Bangalore, India, pp. 683–685. IEEE, Los Alamitos (2003)
Google Scholar
de Marchi, F., Petit, J.M.: Zigzag: A new algorithm for mining large inclusion dependencies in databases. In: 3rd Intl. Conf. on Data Mining, Melbourne, Florida, pp. 27–34. IEEE, Los Alamitos (2003)
Chapter Google Scholar
Mitra, P., Wiederhold, G., Jannink, J.: Semi-automatic integration of knowledge sources. In: Proc. of the 2nd Int. Conf. On Information Fusion (FUSION 1999), Sunnyvale, California (1999)
Google Scholar
Beneventano, D., Bergamaschi, S., Castano, S., et al.: Information integration: The MOMIS project demonstration. In: International Conference on Very Large Data Bases, pp. 611–614 (2000)
Google Scholar
Koeller, A., Rundensteiner, E.A.: Heuristic Strategies for Inclusion Dependency Discovery. In: Proceedings of 3rd International Conference on Ontologies, Databases and Applications of Semantics (ODBASE), pp. 891–908 (2004)
Google Scholar
Koeller, A., Rundensteiner, E.A.: Discovery of high-dimensional inclusion dependencies. Technical Report WPI-CS-TR-02-15, Worcester Polytechnic Institute, Dept. of Computer Science (2002)
Google Scholar
Demetrovics, J., Thi, V.D.: Some remarks on generating armstrong and inferring functional dependencies relation. Acta Cybernetica 12, 167–180 (1995)
MATH MathSciNet Google Scholar
Rice, J.A.: Mathematical Statistics and Data Analysis, 2nd edn. Duxbury Press, Boston (1995)
MATH Google Scholar
Lim, W., Harrison, J.: Discovery of constraints from data for information system reverse engineering. In: Proc. of Australian Software Engineering Conference (ASWEC 1997), Sydney, Australia (1997)
Google Scholar
Zaki, M.J.: Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering (TKDE) 12, 372–390 (2000)
Article Google Scholar
Chandra, A.K., Vardi, M.Y.: The implication problem for functional and inclusion dependencies is undecidable. SIAM J. Comput. 3, 671–677 (1985)
Article MathSciNet Google Scholar
Mitchell, J.C.: Inference rules for functional and inclusion dependencies. In: Proceedings of ACM Symposium on Principles of Database Systems, Atlanta, Georgia, pp. 58–69 (1983)
Google Scholar
Mannino, M.V., Chu, P., Sager, T.: Statistical profile estimation in database systems. ACM Computing Surveys 20 (1988)
Google Scholar
Hon, W.C., Zhang, Z., Zhou, N.: Statistical inference of unknown attribute values in databases. In: Proceedings of International Conference on Information and Knowledge Management, pp. 21–30 (1993)
Google Scholar
Batini, C., Lenzerini, M., Navathe, S.: A comparative analysis of methodologies for database schema integration. ACM Computing Surveys 18, 323–364 (1986)
Article Google Scholar
Larson, J.A., Navathe, S.B., Elmasri, R.: A theory of attribute equivalence in databases with application to schema integration. IEEE Transactions on Software Engineering 15, 449–463 (1989)
Article MATH Google Scholar
Li, W., Clifton, C.: SemInt: A tool for identifying attribute correspondences in heterogeneous databases using neural networks. Data and Knowledge Engineering 33(1), 49–84 (2000)
Article MATH Google Scholar
Doan, A., Domingos, P., Halevy, A.: Learning source description for data integration. In: Proceedings of the Third International Workshop on the Web and Databases (WebDB), Dallas, pp. 81–86 (2000)
Google Scholar
Kang, J., Naughton, J.F.: On schema matching with opaque column names and data values. Proceedings of SIGMOD pp. 205–216 (2003)
Google Scholar

Download references

Author information

Authors and Affiliations

Oracle Corporation, NEDC, Nashua, NH, 03062, USA
Andreas Koeller
Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA, 01609, USA
Elke A. Rundensteiner

Authors

Andreas Koeller
View author publications
You can also search for this author in PubMed Google Scholar
Elke A. Rundensteiner
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

EPFL-IC-IIF-LBD, Station 14 - INJ 236, 1015, Lausanne, Switzerland
Stefano Spaccapietra
Informatica e Automazione, Università Roma Tre, Via Vasca Navale 79, 00146, Roma, Italy
Paolo Atzeni
Computer Science Department, UCLA, 90095, Los Angeles, CA, USA
Wesley W. Chu
Dipartimento di Informatica e Sistemistica, Universitá di Roma “La Sapienza”, Via Ariosto 25, 00185, Roma, Italy
Tiziana Catarci
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Katia P. Sycara

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Koeller, A., Rundensteiner, E.A. (2006). Heuristic Strategies for the Discovery of Inclusion Dependencies and Other Patterns. In: Spaccapietra, S., Atzeni, P., Chu, W.W., Catarci, T., Sycara, K.P. (eds) Journal on Data Semantics V. Lecture Notes in Computer Science, vol 3870. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11617808_7

Download citation

DOI: https://doi.org/10.1007/11617808_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31426-4
Online ISBN: 978-3-540-31427-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics