YAM: A Step Forward for Generating a Dedicated Schema Matcher

Duchateau, Fabien; Bellahsene, Zohra

doi:10.1007/978-3-662-49534-6_5


Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9620)

Abstract

Discovering correspondences between schema elements is a crucial task for data integration. Most schema matching tools are semi-automatic: for example, an expert must tune certain parameters (thresholds, weights, etc.). These tools mainly use aggregation methods to combine similarity measures. The tuning of a matcher, especially of its aggregation function, strongly affects the quality of the resulting correspondences, and makes it difficult to integrate a new similarity measure or to match schemas from a specific domain. In this paper, we present YAM (Yet Another Matcher), a matcher factory which enables the generation of a dedicated schema matcher for a given schema matching scenario. For this purpose, we have formulated the schema matching task as a classification problem. Based on this machine learning framework, YAM automatically selects and tunes the best method to combine similarity measures (e.g., a decision tree, an aggregation function). In addition, we describe how user inputs, such as a preference for recall or precision, can be closely integrated during the generation of the dedicated matcher. Many experiments comparing matchers generated by YAM with traditional matching tools confirm the benefits of a matcher factory and the significant impact of user preferences.
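The idea of treating matching as a classification problem and selecting the combination method automatically can be illustrated with a small sketch. It is not YAM's implementation: the two similarity measures, the candidate combiners (thresholded average and thresholded maximum), the training pairs, and every function name below are assumptions chosen for illustration. Each pair of element names is turned into a vector of similarity values, each candidate combiner classifies the pair as a correspondence or not, and the candidate with the highest F-beta score on the labelled pairs is kept. The `beta` parameter stands in for the user preference: values above 1 favour recall, values below 1 favour precision.

```python
from difflib import SequenceMatcher

def edit_sim(a, b):
    """Character-level similarity in [0, 1], based on difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def trigram_sim(a, b):
    """Jaccard similarity over character trigrams (whole string if too short)."""
    def grams(s):
        return {s[i:i + 3] for i in range(len(s) - 2)} or {s}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)

def features(pair):
    """Similarity vector for one pair of element names."""
    return (edit_sim(*pair), trigram_sim(*pair))

def f_beta(predicted, actual, beta):
    """F-beta score; beta > 1 weights recall more, beta < 1 precision more."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def candidate_combiners():
    """Yield (name, predict) pairs; a hypothetical, deliberately tiny search space."""
    for t in (0.3, 0.5, 0.7, 0.9):
        yield f"average>={t}", lambda f, t=t: sum(f) / len(f) >= t
        yield f"max>={t}", lambda f, t=t: max(f) >= t

def generate_matcher(training, beta=1.0):
    """Return the (name, predict) combiner with the best F-beta on the training pairs."""
    vectors = [features(pair) for pair, _ in training]
    labels = [label for _, label in training]
    return max(
        candidate_combiners(),
        key=lambda c: f_beta([c[1](v) for v in vectors], labels, beta),
    )

# Hypothetical labelled pairs: True marks an expert-validated correspondence.
TRAINING = [
    (("custName", "customer_name"), True),
    (("phone", "phoneNumber"), True),
    (("price", "unit_price"), True),
    (("price", "phone"), False),
    (("address", "orderDate"), False),
    (("id", "description"), False),
]

if __name__ == "__main__":
    name, predict = generate_matcher(TRAINING, beta=2.0)  # recall-oriented
    print(name, predict(features(("custAddr", "customer_address"))))
```

A real matcher factory would search a much richer space (for instance decision trees or other learned classifiers) and would evaluate candidates on held-out data rather than on the training pairs themselves.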

Notes

1. The name of the tool refers to a discussion during a panel session at XSym 2007.
2. Extensible Markup Language (XML) (November 2015).
3. JavaScript Object Notation (JSON) (November 2015).
4. Second String (November 2015).
5. We focus on supervised classification, i.e., all training data are labelled with a class.
6. The two schemas of a pair may be necessary to compute similarity values, for instance with structural or contextual measures.
7. Other stop conditions may be used, for instance “all training data have been correctly classified”.
8. If the user has not provided a sufficient number of correspondences, YAM will extract others from the repository.
9. Second String (November 2015).
10. Free Web Services (November 2015).
11. Only the F-measure plot is provided since the plots for precision and recall follow the same trend as the F-measure.
12. This classifier is named instance-based since the correspondences (included in the training scenarios) are considered as instances during learning. Our approach does not currently use schema instances.
13. Some GUIs already exist to facilitate this task by suggesting the most probable correspondences.


Author information

Authors and Affiliations

Université Lyon 1, LIRIS UMR 5205, Lyon, France
Fabien Duchateau
Université Montpellier, LIRMM, Montpellier, France
Zohra Bellahsene

Corresponding author

Correspondence to Fabien Duchateau.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner

About this chapter

Cite this chapter

Duchateau, F., Bellahsene, Z. (2016). YAM: A Step Forward for Generating a Dedicated Schema Matcher. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV. Lecture Notes in Computer Science, vol 9620. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49534-6_5

DOI: https://doi.org/10.1007/978-3-662-49534-6_5
Published: 20 February 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49533-9
Online ISBN: 978-3-662-49534-6
eBook Packages: Computer Science, Computer Science (R0)
