Database Schema Matching Using Machine Learning with Feature Selection

  • Jacob Berlin
  • Amihai Motro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2348)

Abstract

Schema matching, the problem of finding mappings between the attributes of two semantically related database schemas, is an important aspect of many database applications such as schema integration, data warehousing, and electronic commerce. Unfortunately, schema matching remains largely a manual, labor-intensive process. Furthermore, the effort required is typically linear in the number of schemas to be matched; the next pair of schemas to match is not any easier than the previous pair. In this paper we describe a system, called Automatch, that uses machine learning techniques to automate schema matching. Based primarily on Bayesian learning, the system acquires probabilistic knowledge from examples that have been provided by domain experts. This knowledge is stored in a knowledge base called the attribute dictionary. When presented with a pair of new schemas that need to be matched (and their corresponding database instances), Automatch uses the attribute dictionary to find an optimal matching. We also report initial results from the Automatch project.
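
The pipeline the abstract outlines (learn probabilistic knowledge from expert examples, store it in an attribute dictionary, then pick the best overall matching for a new schema) can be illustrated with a minimal sketch. This is not the Automatch implementation: the function names, the value-level naive Bayes model with Laplace smoothing, and the brute-force search over assignments are all assumptions made for illustration, and the toy data is invented.

```python
import math
from collections import Counter
from itertools import permutations


def train_dictionary(examples):
    """Build a toy 'attribute dictionary': value counts per known attribute.

    `examples` maps an attribute name to example values supplied by an expert
    (hypothetical format, assumed for this sketch).
    """
    return {attr: Counter(values) for attr, values in examples.items()}


def log_likelihood(column_values, counts, vocab_size, alpha=1.0):
    """Naive Bayes log-likelihood of a column's values under one attribute,
    with Laplace smoothing so unseen values do not yield log(0)."""
    total = sum(counts.values())
    return sum(
        math.log((counts[v] + alpha) / (total + alpha * vocab_size))
        for v in column_values
    )


def match(schema_columns, dictionary):
    """Return the one-to-one column -> attribute assignment with the highest
    summed log-likelihood.

    Brute force over permutations is a stand-in that is only practical for a
    handful of attributes. Returns None if there are more columns than
    dictionary attributes.
    """
    vocab = {v for counts in dictionary.values() for v in counts}
    vocab |= {v for values in schema_columns.values() for v in values}
    cols = list(schema_columns)
    best, best_score = None, -math.inf
    for perm in permutations(dictionary, len(cols)):
        score = sum(
            log_likelihood(schema_columns[c], dictionary[a], len(vocab))
            for c, a in zip(cols, perm)
        )
        if score > best_score:
            best, best_score = dict(zip(cols, perm)), score
    return best


# Invented toy data: two known attributes, then a new schema with other names.
dictionary = train_dictionary({
    "city": ["Fairfax", "Boston", "Austin", "Boston"],
    "state": ["VA", "MA", "TX", "MA"],
})
new_schema = {"town": ["Boston", "Fairfax"], "region": ["VA", "TX"]}
result = match(new_schema, dictionary)
```

Scoring the assignment as a whole, rather than picking the best attribute for each column independently, is what makes the result a matching: two columns cannot both claim the same dictionary attribute.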

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Jacob Berlin (1)
  • Amihai Motro (1)
  1. Information and Software Engineering Department, George Mason University, Fairfax, Virginia
