The VLDB Journal, Volume 22, Issue 5, pp 689–710

Schema matching prediction with applications to data source discovery and dynamic ensembling

Special Issue Paper

Abstract

Web-scale data integration involves fully automated efforts that lack knowledge of the exact match between data descriptions. In this paper, we introduce schema matching prediction, an assessment mechanism to support schema matchers in the absence of an exact match. Given attribute pair-wise similarity measures, a predictor predicts the success of a matcher in identifying correct correspondences. We present a comprehensive framework in which predictors can be defined, designed, and evaluated. We formally define schema matching evaluation and schema matching prediction using similarity spaces and discuss four desirable properties of predictors, namely correlation, robustness, tunability, and generalization. We present a method for constructing predictors that supports generalization, and introduce prediction models as a means of tuning prediction toward various quality measures. We define the empirical properties of correlation and robustness and provide concrete measures for their evaluation. We illustrate the usefulness of schema matching prediction with three use cases. First, we propose a method for ranking the relevance of deep Web sources with respect to given user needs. Second, we show how predictors can assist in the design of schema matching systems. Finally, we show how prediction can support dynamic weight setting of matchers in an ensemble, thus improving upon current state-of-the-art weight setting methods. An extensive empirical evaluation shows the value of predictors in these use cases and demonstrates that prediction models can increase the performance of schema matching.
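
To make the idea of a predictor concrete, the following is a minimal sketch and not the paper's actual method. It assumes a similarity matrix is a list of rows of attribute pair-wise similarities in [0, 1]. The predictor `avg_row_max`, the correlation helper `pearson`, and the weighting scheme in `dynamic_weights` are all hypothetical names and choices introduced here for illustration only.

```python
from statistics import mean

def avg_row_max(sim):
    """Hypothetical predictor: mean of the highest similarity in each row.
    A matcher that is decisive about each attribute scores higher."""
    return mean(max(row) for row in sim)

def pearson(xs, ys):
    """Pearson correlation between predictor values and an observed quality
    measure (e.g. precision). One possible way to quantify 'correlation'."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def dynamic_weights(matrices, predictor):
    """Assumed weighting scheme: weight each matcher in proportion to its
    predicted quality on the current schema pair."""
    scores = [predictor(m) for m in matrices]
    total = sum(scores)
    return [s / total for s in scores]

def combine(matrices, weights):
    """Weighted sum of the matchers' similarity matrices."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(cols)]
        for i in range(rows)
    ]

# Two hypothetical matchers applied to the same 2x2 schema pair.
matcher_a = [[0.9, 0.1], [0.2, 0.8]]  # decisive matcher
matcher_b = [[0.5, 0.5], [0.4, 0.6]]  # indecisive matcher
weights = dynamic_weights([matcher_a, matcher_b], avg_row_max)
ensemble = combine([matcher_a, matcher_b], weights)
```

Here the decisive matcher receives the larger weight without any reference to an exact match, which is the setting the abstract describes. Whether a given predictor is useful would still need to be checked empirically, for example by correlating its values with measured precision or recall on schema pairs where an exact match is known.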

Keywords

Data integration, Schema matching, Prediction

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Technion-Israel Institute of Technology, Haifa, Israel
