SMAT: An Attention-Based Deep Learning Solution to the Automation of Schema Matching

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12843)

Abstract

Schema matching aims to identify the correspondences among attributes of database schemas. It is frequently considered the most challenging and decisive stage in many contemporary web semantics and database systems. Low-quality algorithmic matchers fail to provide improvement, while manual annotation consumes extensive human effort. Further complications arise from data privacy in certain domains such as healthcare, where only schema-level matching should be used to prevent data leakage. To address this problem, we propose SMAT, a new deep learning model based on state-of-the-art natural language processing techniques that obtains semantic mappings between source and target schemas using only the attribute name and description. SMAT avoids directly encoding domain knowledge about the source and target systems, which allows it to be more easily deployed across different sites. We also introduce a new benchmark dataset, OMAP, based on real-world schema-level mappings from the healthcare domain. Our extensive evaluation on various benchmark datasets demonstrates the potential of SMAT to help automate schema-level matching tasks.

Notes

  1. https://github.com/JZCS2018/SMAT

Acknowledgements

This work was supported by the National Science Foundation award IIS-#1838200, National Institutes of Health award 1K01LM012924, and Google Cloud Platform research credits.

Author information

Corresponding author

Correspondence to Jing Zhang.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, J., Shin, B., Choi, J.D., Ho, J.C. (2021). SMAT: An Attention-Based Deep Learning Solution to the Automation of Schema Matching. In: Bellatreche, L., Dumas, M., Karras, P., Matulevičius, R. (eds) Advances in Databases and Information Systems. ADBIS 2021. Lecture Notes in Computer Science, vol 12843. Springer, Cham. https://doi.org/10.1007/978-3-030-82472-3_19

  • DOI: https://doi.org/10.1007/978-3-030-82472-3_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82471-6

  • Online ISBN: 978-3-030-82472-3

  • eBook Packages: Computer Science, Computer Science (R0)
