Abstract
The growing number of collaborating organizations has made clear the need for interoperability infrastructures that enable organizations to share and exchange data. Schema matching and schema integration are crucial components of such infrastructures, and semi-automating them is desirable for interrelating or integrating heterogeneous and autonomous databases in collaborative networks. The Semi-Automatic Schema Matching and INTegration (SASMINT) system introduced in this paper identifies and resolves several important syntactic, semantic, and structural conflicts among the schemas of relational databases in order to find their likely matches automatically. After the user validates the matching results, it proposes an integrated schema. For schema matching, SASMINT combines a variety of metrics and algorithms from the Natural Language Processing and Graph Theory domains. For schema integration, it applies a number of derivation rules defined as part of the research described in this paper. In addition, a derivation language called the SASMINT Derivation Markup Language (SDML) is defined to capture and formulate the results of both matching and integration, which can then be used further, for example in federated query processing over independent databases. In summary, the paper addresses: (1) conflicts among schemas that make automatic schema matching and integration difficult, (2) the main components of the SASMINT approach and system, (3) an in-depth exploration of SDML, (4) heuristic rules designed and implemented as part of the schema integration component of the SASMINT system, and (5) an experimental evaluation of SASMINT.
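To make the idea of combining several metrics for schema matching more concrete, the following is a minimal sketch and not SASMINT's actual implementation. The abstract does not say which metrics are combined or how they are weighted, so everything here is an assumption chosen for illustration: Levenshtein edit distance and Jaccard similarity over character bigrams as the two metrics, equal weights, a threshold of 0.6, and the hypothetical function names `name_similarity` and `likely_matches`.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard coefficient over character bigrams."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


def name_similarity(a: str, b: str, w_edit: float = 0.5,
                    w_jaccard: float = 0.5) -> float:
    """Weighted combination of the two metrics (weights are assumed)."""
    a, b = a.lower(), b.lower()
    return (w_edit * edit_similarity(a, b)
            + w_jaccard * jaccard_similarity(a, b))


def likely_matches(source: list, target: list, threshold: float = 0.6) -> dict:
    """Map each source name to its best-scoring target name, if the
    combined score reaches the (assumed) threshold."""
    result = {}
    if not target:
        return result
    for s in source:
        best = max(target, key=lambda t: name_similarity(s, t))
        score = name_similarity(s, best)
        if score >= threshold:
            result[s] = (best, score)
    return result


if __name__ == "__main__":
    # Hypothetical column names from two relational schemas.
    print(likely_matches(["cust_name", "zip"],
                         ["customer_name", "order_id", "zip_code"]))
```

In a semi-automatic setting such as the one described above, the candidate pairs returned by a function like `likely_matches` would be presented to the user for validation rather than accepted directly.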
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Unal, O., Afsarmanesh, H. Semi-automated schema integration with SASMINT. Knowl Inf Syst 23, 99–128 (2010). https://doi.org/10.1007/s10115-009-0217-z