Abstract
Schema matching aims to identify the correspondences among attributes of database schemas. It is frequently considered as the most challenging and decisive stage existing in many contemporary web semantics and database systems. Low-quality algorithmic matchers fail to provide improvement while manually annotation consumes extensive human efforts. Further complications arise from data privacy in certain domains such as healthcare, where only schema-level matching should be used to prevent data leakage. For this problem, we propose SMAT, a new deep learning model based on state-of-the-art natural language processing techniques to obtain semantic mappings between source and target schemas using only the attribute name and description. SMAT avoids directly encoding domain knowledge about the source and target systems, which allows it to be more easily deployed across different sites. We also introduce a new benchmark dataset, OMAP, based on real-world schema-level mappings from the healthcare domain. Our extensive evaluation of various benchmark datasets demonstrates the potential of SMAT to help automate schema-level matching tasks.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Alexe, B., Hernández, M., Popa, L., Tan, W.C.: Mapmerge: correlating independent schema mappings. Proc. VLDB Endow. 3(1–2), 81–92 (2010)
Arenas, M., Barceló, P., Libkin, L., Murlak, F.: Foundations of Data Exchange. Cambridge University Press, Cambridge (2014)
Atzeni, P., Bellomarini, L., Papotti, P., Torlone, R.: Meta-mappings for schema mapping reuse. Proc. VLDB Endow. 12(5), 557–569 (2019). https://doi.org/10.14778/3303753.3303761
Cappuzzo, R., Papotti, P., Thirumuruganathan, S.: Creating embeddings of heterogeneous relational datasets for data integration tasks. In: Proceedings of SIGMOD, pp. 1335–1349 (2020)
Ten Cate, B., Kolaitis, P.G., Qian, K., Tan, W.C.: Active learning of GAV schema mappings. In: Proceedings of SIGMOD/PODS, pp. 355–368 (2018)
Chen, C., Golshan, B., Halevy, A.Y., Tan, W.C., Doan, A.: Biggorilla: an open-source ecosystem for data preparation and integration. IEEE Data Eng. Bull. 41(2), 10–22 (2018)
Centers for medicare & medicaid services (cms). https://www.cms.gov/OpenPayments/Explore-the-Data/Data-Overview.html
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of EMNLP, pp. 670–680 (2017)
Cui, Y., Chen, Z., Wei, S., Wang, S., Liu, T., Hu, G.: Attention-over-attention neural networks for reading comprehension. In: Proceedings of ACL (2017)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of of NAACL-HLT, pp. 4171–4186 (2019)
Do, H.H., Rahm, E.: Coma–a system for flexible combination of schema matching approaches. In: Proceedings of VLDB, pp. 610–621 (2002)
Dong, Q., Gong, S., Zhu, X.: Imbalanced deep learning by minority class incremental rectification. IEEE Trans. Pattern Analy. Mach. Intell. 41(6), 1367–1381 (2019). https://doi.org/10.1109/TPAMI.2018.2832629
Fagin, R., Haas, L.M., Hernández, M., Miller, R.J., Popa, L., Velegrakis, Y.: Clio: schema mapping creation and data exchange. In: Borgida, A.T., Chaudhri, V.K., Giorgini, P., Yu, E.S. (eds.) Conceptual Modeling: Foundations and Applications. LNCS, vol. 5600, pp. 198–236. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02463-4_12
Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Schema mapping evolution through composition and inversion. In: Schema Matching and Mapping, pp. 191–222. Springer (2011)
Fernandez, R.C., et al.: Seeping semantics: linking datasets using word embeddings for data discovery. In: Proceedings of ICDE, pp. 989–1000 (2018)
Gal, A.: Uncertain schema matching. Synth. Lect. Data Manag. 3(1), 1–97 (2011)
Gal, A., Roitman, H., Shraga, R.: Learning to rerank schema matches. IEEE Trans. Knowl. Data Eng. (2019)
Halevy, A., Nemes, E., Dong, X., Madhavan, J., Zhang, J.: Similarity search for web services. In: Proceedings of the 30th VLDB Conference, pp. 372–383 (2004)
Han, L., Kashyap, A.L., Finin, T., Mayfield, J., Weese, J.: Umbc\_ebiquity-core: semantic textual similarity systems. In: Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pp. 44–52 (2013)
He, B., Chang, K.C.C.: Statistical schema matching across web query interfaces. In: Proceedings of SIGMOD, pp. 217–228 (2003)
Hernandez, M., Ho, H., Naumann, F., Popa, L.: Clio: a schema mapping tool for information integration. In: 8th International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN 2005), p. 1. IEEE (2005)
Johnson, A.E., et al.: Mimic-iii, a freely accessible critical care database. Sci. Data 3, 160035 (2016)
Kettouch, M.S., Luca, C., Hobbs, M., Dascalu, S.: Using semantic similarity for schema matching of semi-structured and linked data. In: 2017 Internet Technologies and Applications (ITA), pp. 128–133. IEEE (2017)
Kolyvakis, P., Kalousis, A., Kiritsis, D.: Deepalignment: unsupervised ontology matching with refined word vectors. In: Proceedings of NAACL-HLT, pp. 787–798 (2018)
Koutras, C., Fragkoulis, M., Katsifodimos, A., Lofi, C.: Rema: graph embeddings-based relational schema matching. In: EDBT/ICDT Workshops (2020)
Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.C.: Deep entity matching with pre-trained language models. arXiv preprint arXiv:2004.00584 (2020)
Liu, Y., et al.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Mecca, G., Papotti, P., Santoro, D.: Schema mappings: from data translation to data cleaning. In: Flesca, S., Greco, S., Masciari, E., Saccà, D. (eds.) A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years. SBD, vol. 31, pp. 203–217. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-61893-7_12
Mudgal, S., Kumar, S.: Deep learning for entity matching: A design space exploration. Tech. rep. (2018)
Nguyen, Q.V.H., Weidlich, M., Nguyen, T.T., Miklós, Z., Aberer, K., Gal, A.: Reconciling matching networks of conceptual models. Tech. rep. (2019)
Observational Health Data Sciences and Informatics: The book of OHDSI. Independently published (2019)
Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of EMNLP, pp. 1532–1543 (2014)
Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)
Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)
Shraga, R., Gal, A., Roitman, H.: Adnev: cross-domain schema matching using deep similarity matrix adjustment and evaluation. Proc. VLDB 13(9), 1401–1415 (2020)
Toan, N.T., Cong, P.T., Thang, D.C., Hung, N.Q.V., Stantic, B.: Bootstrapping uncertainty in schema covering. In: Wang, J., Cong, G., Chen, J., Qi, J. (eds.) ADC 2018. LNCS, vol. 10837, pp. 336–342. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-92013-9_29
Walonoski, J., et al.: Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 25(3), 230–238 (2017)
Wu, W., Yu, C., Doan, A., Meng, W.: An interactive clustering-based approach to integrating source query interfaces on the deep web. In: Proceedings of SIGMOD, pp. 95–106 (2004)
Acknowledgements
This work was supported by the National Science Foundation award IIS-#1838200, National Institute of Health award 1K01LM012924, and Google Cloud Platform research credits.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, J., Shin, B., Choi, J.D., Ho, J.C. (2021). SMAT: An Attention-Based Deep Learning Solution to the Automation of Schema Matching. In: Bellatreche, L., Dumas, M., Karras, P., Matulevičius, R. (eds) Advances in Databases and Information Systems. ADBIS 2021. Lecture Notes in Computer Science(), vol 12843. Springer, Cham. https://doi.org/10.1007/978-3-030-82472-3_19
Download citation
DOI: https://doi.org/10.1007/978-3-030-82472-3_19
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82471-6
Online ISBN: 978-3-030-82472-3
eBook Packages: Computer ScienceComputer Science (R0)