YAM: A Step Forward for Generating a Dedicated Schema Matcher

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9620)

Abstract

Discovering correspondences between schema elements is a crucial task for data integration. Most schema matching tools are semi-automatic: an expert must tune certain parameters (thresholds, weights, etc.). These tools mainly use aggregation methods to combine similarity measures. The tuning of a matcher, especially of its aggregation function, has a strong impact on the quality of the resulting correspondences, and it makes it difficult to integrate a new similarity measure or to match schemas from a specific domain. In this paper, we present YAM (Yet Another Matcher), a matcher factory which enables the generation of a dedicated schema matcher for a given schema matching scenario. For this purpose, we have formulated the schema matching task as a classification problem. Based on this machine learning framework, YAM automatically selects and tunes the best method to combine similarity measures (e.g., a decision tree, an aggregation function). In addition, we describe how user inputs, such as a preference between recall and precision, can be closely integrated during the generation of the dedicated matcher. Numerous experiments comparing matchers generated by YAM with traditional matching tools confirm the benefits of a matcher factory and the significant impact of user preferences.
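The idea of a matcher factory can be illustrated with a minimal Python sketch. It is not YAM's implementation: the two similarity measures, the candidate combiners (a thresholded average standing in for an aggregation function, and a one-level decision stump standing in for a decision tree), the threshold grid, and all function names are assumptions chosen for illustration. Each pair of element names becomes a vector of similarity values, each candidate combiner classifies the vector as match or non-match, and the factory keeps the candidate with the best F-measure on labelled correspondences. The hypothetical `beta` parameter models a user preference: values above 1 favour recall, values below 1 favour precision.

```python
from difflib import SequenceMatcher


def trigram_sim(a, b):
    """Jaccard similarity over character trigrams (illustrative measure)."""
    def grams(s):
        return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)


def edit_sim(a, b):
    """Similarity ratio from difflib (illustrative measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


MEASURES = [trigram_sim, edit_sim]


def features(pair):
    """Turn a pair of element names into a vector of similarity values."""
    return [measure(*pair) for measure in MEASURES]


def f_beta(predicted, expected, beta):
    """F-measure weighted by beta (beta > 1 favours recall)."""
    tp = sum(1 for p, e in zip(predicted, expected) if p and e)
    fp = sum(1 for p, e in zip(predicted, expected) if p and not e)
    fn = sum(1 for p, e in zip(predicted, expected) if not p and e)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_combiner(threshold):
    """Aggregation function: match if the mean similarity reaches the threshold."""
    return lambda feats: sum(feats) / len(feats) >= threshold


def stump_combiner(index, threshold):
    """One-level decision tree: match if a single measure reaches the threshold."""
    return lambda feats: feats[index] >= threshold


def generate_matcher(training, beta=1.0):
    """Select the candidate combiner with the best F-measure on the training data.

    training: list of ((name_a, name_b), is_match) labelled correspondences.
    Simplification: candidates are scored on the training data itself,
    without a held-out set or cross-validation.
    """
    vectors = [features(pair) for pair, _ in training]
    labels = [label for _, label in training]
    thresholds = [t / 10 for t in range(1, 10)]
    candidates = {f"average>={t}": average_combiner(t) for t in thresholds}
    for i, measure in enumerate(MEASURES):
        for t in thresholds:
            candidates[f"{measure.__name__}>={t}"] = stump_combiner(i, t)

    def score(name):
        return f_beta([candidates[name](v) for v in vectors], labels, beta)

    best = max(candidates, key=score)
    return best, candidates[best]


if __name__ == "__main__":
    labelled = [
        (("customerName", "customer_name"), True),
        (("orderDate", "date_of_order"), True),
        (("zipCode", "phone"), False),
        (("price", "firstName"), False),
    ]
    name, matcher = generate_matcher(labelled, beta=2.0)
    print(name, matcher(features(("custName", "customerName"))))
```

Under these assumptions, changing `beta` can change which combiner and threshold are selected, which is one simple way a precision or recall preference could influence the generated matcher.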

Notes

  1. The name of the tool refers to a discussion during a panel session at XSym 2007.

  2. Extensible Markup Language (XML) (November 2015).

  3. JavaScript Object Notation (JSON) (November 2015).

  4. Second String (November 2015).

  5. We focus on supervised classification, i.e., all training data are labelled with a class.

  6. The two schemas of a pair may be necessary to compute similarity values, for instance with structural or contextual measures.

  7. Other stop conditions may be used, for instance “all training data have been correctly classified”.

  8. If the user has not provided a sufficient number of correspondences, YAM will extract others from the repository.

  9. Second String (November 2015).

  10. Free Web Services (November 2015).

  11. Only the F-measure plot is provided since the plots for precision and recall follow the same trend as the F-measure.

  12. This classifier is named instance-based since the correspondences (included in the training scenarios) are considered as instances during learning. Our approach does not currently use schema instances.

  13. Some GUIs already exist to facilitate this task by suggesting the most probable correspondences.

Author information

Corresponding author

Correspondence to Fabien Duchateau.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Duchateau, F., Bellahsene, Z. (2016). YAM: A Step Forward for Generating a Dedicated Schema Matcher. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV. Lecture Notes in Computer Science, vol 9620. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49534-6_5

  • DOI: https://doi.org/10.1007/978-3-662-49534-6_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-49533-9

  • Online ISBN: 978-3-662-49534-6

  • eBook Packages: Computer Science, Computer Science (R0)
