Schema matching based on SQL statements

Distributed and Parallel Databases

Abstract

Schema matching is a critical step in numerous database applications, such as integrating web data sources, loading data warehouses, and exchanging information among several authorities. In this paper, we propose to exploit the similarities among the SQL statements in query logs to find correspondences between the attributes of the schemas to be matched. We identify three kinds of similarity that benefit schema matching: the similarity of the clauses themselves, the similarity of the frequency with which clauses occur in different SQL statements, and the similarity of statistics about the relationships among clauses. We combine the clauses related to these similarities into a graph, and thereby transform the task of matching attributes into the problem of matching graphs. Matching the graphs yields a set of attribute sequence pairs, each with a similarity score; each sequence pair represents a set of correspondences. Next, we use techniques from quadratic programming to decompose the sequence pairs into individual correspondences, that is, to obtain a similarity score for each correspondence. Finally, an efficient method chooses the best correspondence for each attribute from the candidate set. The experimental study shows that the proposed approach is effective and that it performs well when combined with other matchers.
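
To make the second kind of similarity concrete (how often attributes occur in the clauses of logged SQL statements), the following is a minimal sketch and not the method of the paper: it skips the graph matching and quadratic programming steps entirely. The regex-based clause splitting, the cosine measure, the helper names `clause_profile`, `cosine` and `best_matches`, and the two toy query logs are all assumptions made for illustration.

```python
import re
from collections import Counter
from math import sqrt

# Clause types used as the dimensions of each attribute's profile (an assumption).
CLAUSES = ("select", "where", "group by", "order by")
# Keywords that may appear inside clause bodies and are not attribute names.
NOISE = {"and", "or", "not", "in", "is", "null", "like", "between", "as", "asc", "desc"}
SPLIT = re.compile(r"\b(select|from|where|group\s+by|order\s+by)\b")

def clause_profile(log):
    """Map each attribute to the fraction of statements using it in each clause type."""
    counts = {}
    for stmt in log:
        text = re.sub(r"'[^']*'", "", stmt.lower())  # drop string literals
        parts = SPLIT.split(text)  # ['', 'select', ' a ', 'from', ' t ', ...]
        for kw, body in zip(parts[1::2], parts[2::2]):
            kw = " ".join(kw.split())  # normalise inner whitespace
            if kw == "from":
                continue  # table names are not attributes
            for attr in set(re.findall(r"[a-z_][a-z0-9_]*", body)) - NOISE:
                counts.setdefault(attr, Counter())[kw] += 1
    n = len(log)
    return {a: [c[k] / n for k in CLAUSES] for a, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def best_matches(log_a, log_b):
    """For each attribute seen in log_a, pick the log_b attribute with the closest profile."""
    pa, pb = clause_profile(log_a), clause_profile(log_b)
    return {a: max(pb, key=lambda b: cosine(va, pb[b])) for a, va in pa.items()}

# Toy query logs over two hypothetical schemas.
log_a = [
    "SELECT name FROM emp WHERE salary > 1000",
    "SELECT name FROM emp WHERE salary < 500 ORDER BY name",
    "SELECT dept FROM emp GROUP BY dept",
]
log_b = [
    "SELECT ename FROM staff WHERE pay > 2000",
    "SELECT ename FROM staff WHERE pay < 100 ORDER BY ename",
    "SELECT unit FROM staff GROUP BY unit",
]

print(best_matches(log_a, log_b))
# {'name': 'ename', 'salary': 'pay', 'dept': 'unit'}
```

Even though the attribute names share no common tokens, their usage profiles across clause types line up, which is the intuition behind using query logs as evidence. A greedy per-attribute choice like this can map two attributes to the same target; the abstract describes a separate selection step for that, which this sketch does not attempt.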


Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grant No. 61303016) and the Normal Project Foundation of Education Department of LiaoNing Province (Grant No. L2012045).

Author information

Corresponding author

Correspondence to Guohui Ding.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ding, G., Sun, S. & Wang, G. Schema matching based on SQL statements. Distrib Parallel Databases 38, 193–226 (2020). https://doi.org/10.1007/s10619-019-07268-9

  • DOI: https://doi.org/10.1007/s10619-019-07268-9
