Schema matching based on SQL statements

Ding, Guohui; Sun, Shasha; Wang, Guoren

doi:10.1007/s10619-019-07268-9

Published: 09 May 2019

Distributed and Parallel Databases, Volume 38, pages 193–226 (2020)

Guohui Ding¹,
Shasha Sun¹ &
Guoren Wang²

Abstract

Schema matching is a critical step in numerous database applications, such as integrating web data sources, loading data warehouses, and exchanging information among several authorities. In this paper, we propose exploiting the similarities among the SQL statements in query logs to find correspondences between attributes of the schemas to be matched. We identify three kinds of similarity that benefit schema matching: the similarity of the clauses themselves, the similarity of the frequency with which clauses occur in different SQL statements, and the similarity of statistics about the relationships among clauses. We combine the clauses related to these similarities into a graph and thereby transform the task of matching attributes into the problem of matching graphs. Matching the graphs yields a set of attribute sequence pairs, each with a similarity score; each sequence pair represents a set of correspondences. Next, we use techniques from quadratic programming to decompose the sequence pairs into individual correspondences, that is, to obtain a similarity score for each correspondence. Finally, an efficient method chooses the best correspondence for each attribute from the candidate set. The experimental study shows that the proposed approach is effective and that it performs well in combination with other matchers.
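To make the clause-frequency idea concrete, the following Python sketch counts how often each attribute appears in each kind of SQL clause in a query log, then pairs attributes across two schemas by the cosine similarity of those counts. It is only an illustration under stated assumptions, not the algorithm of the paper: the abstract does not specify the clause set, the graph construction, or the quadratic programming step, so the names `CLAUSES`, `clause_profile`, `cosine`, and `best_matches`, the regex-based parsing, and the toy logs are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical clause set; the abstract does not list the clauses actually used.
CLAUSES = ("SELECT", "WHERE", "GROUP BY", "ORDER BY")
# Words that look like identifiers but are not attribute names.
STOPWORDS = {"and", "or", "not", "asc", "desc"}
KEYWORD_RE = re.compile(r"\b(SELECT|FROM|WHERE|GROUP BY|ORDER BY)\b", re.IGNORECASE)

def clause_profile(log):
    """Map each attribute to a Counter of the clauses it occurs in."""
    profile = defaultdict(Counter)
    for stmt in log:
        # Splitting on a capturing group yields: text, keyword, body, keyword, body, ...
        parts = KEYWORD_RE.split(stmt)
        for keyword, body in zip(parts[1::2], parts[2::2]):
            keyword = keyword.upper()
            if keyword == "FROM":
                continue  # table names are not attributes
            for name in set(re.findall(r"[A-Za-z_]\w*", body)):
                if name.lower() not in STOPWORDS:
                    profile[name.lower()][keyword] += 1
    return profile

def cosine(a, b):
    """Cosine similarity of two clause-count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in CLAUSES)
    na = math.sqrt(sum(a[k] ** 2 for k in CLAUSES))
    nb = math.sqrt(sum(b[k] ** 2 for k in CLAUSES))
    return dot / (na * nb) if na and nb else 0.0

def best_matches(log_s, log_t):
    """For each source attribute, pick the target attribute with the closest profile."""
    ps, pt = clause_profile(log_s), clause_profile(log_t)
    return {a: max(pt, key=lambda b: cosine(ps[a], pt[b])) for a in ps}

# Toy query logs for two schemas (made up for this illustration).
log_s = [
    "SELECT name, salary FROM emp WHERE dept = 3",
    "SELECT name FROM emp WHERE dept = 5 ORDER BY salary",
    "SELECT dept FROM emp GROUP BY dept",
]
log_t = [
    "SELECT ename, pay FROM staff WHERE division = 3",
    "SELECT ename FROM staff WHERE division = 7 ORDER BY pay",
    "SELECT division FROM staff GROUP BY division",
]

print(best_matches(log_s, log_t))
```

On these toy logs the attribute names share no characters worth comparing, yet their usage patterns line up. That usage-based signal is the kind the abstract describes, although the paper itself combines several such signals in a graph rather than comparing per-attribute vectors.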

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grant No. 61303016) and the Normal Project Foundation of Education Department of LiaoNing Province (Grant No. L2012045).

Author information

Authors and Affiliations

School of Computer, Shenyang Aerospace University, Shenyang, China
Guohui Ding & Shasha Sun
School of Computer Science & Technology, Beijing Institute of Technology, Beijing, China
Guoren Wang

Corresponding author

Correspondence to Guohui Ding.

About this article

Cite this article

Ding, G., Sun, S. & Wang, G. Schema matching based on SQL statements. Distrib Parallel Databases 38, 193–226 (2020). https://doi.org/10.1007/s10619-019-07268-9

Published: 09 May 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10619-019-07268-9
