Machine Learning, Volume 50, Issue 3, pp 279–301

Learning to Match the Schemas of Data Sources: A Multistrategy Approach

  • AnHai Doan
  • Pedro Domingos
  • Alon Halevy


The problem of integrating data from multiple data sources—either on the Internet or within enterprises—has received much attention in the database and AI communities. The focus has been on building data integration systems that provide a uniform query interface to the sources. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the query interface and the source schemas. Examples of mappings are “element location maps to address” and “price maps to listed-price”. We propose a multistrategy learning approach to automatically find such mappings. The approach applies multiple learner modules, where each module exploits a different type of information either in the schemas of the sources or in their data, then combines the predictions of the modules using a meta-learner. Learner modules employ a variety of techniques, ranging from Naive Bayes and nearest-neighbor classification to entity recognition and information retrieval. We describe the LSD system, which employs this approach to find semantic mappings. To further improve matching accuracy, LSD exploits domain integrity constraints, user feedback, and nested structures in XML data. We test LSD experimentally on several real-world domains. The experiments validate the utility of multistrategy learning for data integration and show that LSD proposes semantic mappings with a high degree of accuracy.
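The combination step described above can be illustrated with a minimal sketch. It is not the LSD implementation: the two base learners (a name matcher using token overlap and a Naive Bayes learner over data values), the training examples, the fixed meta-learner weights, and every identifier below are assumptions chosen for illustration. In particular, a real meta-learner would learn its weights from training data rather than use hard-coded values.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase alphanumeric tokens, e.g. 'listed-price' -> ['listed', 'price']."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(scores):
    """Scale scores to sum to 1; fall back to uniform if all are zero."""
    total = sum(scores.values())
    if total == 0:
        return {label: 1.0 / len(scores) for label in scores}
    return {label: s / total for label, s in scores.items()}


class NameLearner:
    """Hypothetical base learner: Jaccard overlap between element-name tokens."""

    def fit(self, examples):
        self.names = defaultdict(set)
        for name, _value, label in examples:
            self.names[label].update(tokens(name))
        return self

    def predict(self, name, _value):
        query = set(tokens(name))
        scores = {}
        for label, seen in self.names.items():
            union = query | seen
            scores[label] = len(query & seen) / len(union) if union else 0.0
        return normalize(scores)


class ContentLearner:
    """Hypothetical base learner: Naive Bayes over data-value tokens (add-one smoothing)."""

    def fit(self, examples):
        self.counts = defaultdict(Counter)
        self.label_counts = Counter()
        for _name, value, label in examples:
            self.counts[label].update(tokens(value))
            self.label_counts[label] += 1
        self.vocab = {t for counter in self.counts.values() for t in counter}
        return self

    def predict(self, _name, value):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, counter in self.counts.items():
            denom = sum(counter.values()) + len(self.vocab) + 1
            log_p = math.log(self.label_counts[label] / total)
            for t in tokens(value):
                log_p += math.log((counter[t] + 1) / denom)
            log_scores[label] = log_p
        top = max(log_scores.values())  # subtract the max for numerical stability
        return normalize({l: math.exp(s - top) for l, s in log_scores.items()})


# Illustrative training data: (source element name, sample value, mediated-schema label).
TRAINING = [
    ("location", "Seattle, WA", "address"),
    ("house-addr", "Portland, OR", "address"),
    ("listed-price", "$250,000", "price"),
    ("price", "$310,000", "price"),
]

LEARNERS = {
    "name": NameLearner().fit(TRAINING),
    "content": ContentLearner().fit(TRAINING),
}

# Assumed fixed weights standing in for a trained meta-learner.
WEIGHTS = {"name": 0.4, "content": 0.6}


def match(name, value):
    """Combine base-learner scores by weighted sum; return (best label, scores)."""
    combined = defaultdict(float)
    for learner_name, learner in LEARNERS.items():
        for label, score in learner.predict(name, value).items():
            combined[label] += WEIGHTS[learner_name] * score
    scores = normalize(combined)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(match("area", "Seattle, WA"))
    print(match("asking-price", "$199,000"))
```

In this sketch the element `area` shares no name tokens with the training data, so the name learner is uninformative and the content learner decides the match. This is the kind of complementary evidence that motivates combining several learners.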

Keywords: schema matching, multistrategy learning, data integration



Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • AnHai Doan (1)
  • Pedro Domingos (2)
  • Alon Halevy (2)

  1. Department of Computer Science, University of Illinois, Urbana-Champaign, USA
  2. Department of Computer Science and Engineering, University of Washington, Seattle, USA
