Abstract
Schema matching is a critical step in numerous database applications, such as integrating web data sources, loading data warehouses, and exchanging information among several authorities. In this paper, we propose exploiting the similarities of the SQL statements in query logs to find correspondences between attributes in the schemas to be matched. We identify three kinds of similarity that benefit schema matching: the similarity of the clauses themselves, the similarity of the frequencies with which clauses occur in different SQL statements, and the similarity of statistics about the relationships among clauses. We combine the clauses related to these similarities into a graph, and then transform the task of matching attributes into the problem of matching graphs. By matching the graphs, we obtain a set of attribute sequence pairs, each with a similarity score. Each sequence pair represents a set of correspondences. Next, we use techniques from quadratic programming to decompose the sequence pairs into correspondences, that is, to obtain the similarity score of each correspondence. Finally, an efficient method chooses the best correspondence for each attribute from the candidate set. The experimental study shows that the proposed approach is effective and that it performs well when combined with other matchers.
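As an illustration of the final step only, the following minimal sketch picks one correspondence per attribute from a scored candidate set. It is not the selection method of the paper, whose details the abstract does not give: it assumes a simple greedy one-to-one rule, and the function name `select_correspondences`, the `threshold` parameter, the attribute names, and the scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# (source attribute, target attribute, similarity score); all values are illustrative
Candidate = Tuple[str, str, float]


def select_correspondences(candidates: List[Candidate],
                           threshold: float = 0.0) -> Dict[str, str]:
    """Greedy one-to-one selection (an assumed rule, not the paper's method).

    Candidates are visited from highest to lowest score; a candidate is
    accepted only if neither of its attributes has been matched already.
    """
    chosen: Dict[str, str] = {}
    used_targets = set()
    # Sort by descending score; names break ties so the result is deterministic.
    ordered = sorted(candidates, key=lambda c: (-c[2], c[0], c[1]))
    for src, tgt, score in ordered:
        if score < threshold:
            break  # every remaining candidate scores lower still
        if src in chosen or tgt in used_targets:
            continue  # one of the two attributes is already matched
        chosen[src] = tgt
        used_targets.add(tgt)
    return chosen


# Hypothetical candidate set with made-up similarity scores.
candidates = [
    ("emp.name", "staff.full_name", 0.91),
    ("emp.name", "staff.dept", 0.35),
    ("emp.salary", "staff.pay", 0.78),
    ("emp.dept_id", "staff.dept", 0.66),
    ("emp.salary", "staff.dept", 0.70),
]
result = select_correspondences(candidates, threshold=0.5)
print(result)
```

A greedy rule is used here only because it is short and easy to follow; an optimal one-to-one assignment would require a different algorithm.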
Acknowledgements
This research was supported by the National Natural Science Foundation of China (Grant No. 61303016) and the Normal Project Foundation of Education Department of LiaoNing Province (Grant No. L2012045).
Cite this article
Ding, G., Sun, S. & Wang, G. Schema matching based on SQL statements. Distrib Parallel Databases 38, 193–226 (2020). https://doi.org/10.1007/s10619-019-07268-9
DOI: https://doi.org/10.1007/s10619-019-07268-9