PROCLAIM: An Unsupervised Approach to Discover Domain-Specific Attribute Matchings from Heterogeneous Sources

  • Conference paper
Advanced Information Systems Engineering (CAiSE 2020)

Abstract

Schema matching is a critical problem in many applications where the main goal is to match attributes coming from heterogeneous sources. In this paper, we propose PROCLAIM (PROfile-based Cluster-Labeling for AttrIbute Matching), an automatic, unsupervised, clustering-based approach to match the attributes of a large number of heterogeneous sources. We define the concept of an attribute profile to characterize the main properties of an attribute using (i) the statistical distribution and the dimension of the attribute’s values and (ii) the name and textual descriptions related to the attribute. The attribute matchings produced by PROCLAIM give the best representation of the heterogeneous sources thanks to the cluster-labeling function we define. We evaluate PROCLAIM on 45,000 different data sources from an oil and gas authority open data website (the data is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license). The results we obtain are promising and validate our approach.
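As a rough illustration of the attribute-profile idea described in the abstract, the following minimal Python sketch builds a profile from value statistics and name/description tokens and compares two profiles. It is not the PROCLAIM implementation: the class `AttributeProfile`, the functions `build_profile` and `similarity`, the choice of statistics, the Jaccard token overlap, and the equal 0.5/0.5 weighting are all assumptions made for this example, and the clustering and cluster-labeling steps are not shown.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AttributeProfile:
    """Hypothetical profile: value statistics plus name/description tokens."""
    name: str
    stats: tuple        # (mean, standard deviation, min, max) of the values
    tokens: frozenset   # lowercase words from the name and description


def build_profile(name, values, description=""):
    # Summarize the value distribution with a few simple statistics
    # (an assumed stand-in for "statistical distribution" in the abstract).
    stats = (mean(values), pstdev(values), min(values), max(values))
    # Split the name and description into lowercase word tokens.
    text = f"{name} {description}".replace("_", " ").lower()
    return AttributeProfile(name, stats, frozenset(text.split()))


def similarity(a, b):
    # Jaccard overlap of the textual tokens.
    union = a.tokens | b.tokens
    text_sim = len(a.tokens & b.tokens) / len(union) if union else 0.0
    # Relative gap between each pair of statistics, each in [0, 1).
    gaps = [abs(x - y) / (abs(x) + abs(y) + 1e-9)
            for x, y in zip(a.stats, b.stats)]
    stat_sim = 1.0 - sum(gaps) / len(gaps)
    # Equal weighting of the two parts is an arbitrary choice for this sketch.
    return 0.5 * text_sim + 0.5 * stat_sim


# Made-up attributes from three imaginary sources.
depth_a = build_profile("well_depth", [1000.0, 1500.0, 2000.0],
                        "depth of the well in metres")
depth_b = build_profile("depth", [1100.0, 1400.0, 2100.0], "total well depth")
temp = build_profile("temperature", [60.0, 75.0, 90.0],
                     "bottom hole temperature")

print(similarity(depth_a, depth_b) > similarity(depth_a, temp))
```

A pairwise score of this kind could serve as input to a density-based clustering algorithm, so that attributes with similar profiles end up in the same cluster; how PROCLAIM actually computes and combines these signals is described in the full paper.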

Notes

  1. Data from: https://www.kaggle.com/.

  2. The data is published under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Author information

Corresponding author

Correspondence to Molood Arman.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Arman, M., Wlodarczyk, S., Bennacer Seghouani, N., Bugiotti, F. (2020). PROCLAIM: An Unsupervised Approach to Discover Domain-Specific Attribute Matchings from Heterogeneous Sources. In: Herbaut, N., La Rosa, M. (eds) Advanced Information Systems Engineering. CAiSE 2020. Lecture Notes in Business Information Processing, vol 386. Springer, Cham. https://doi.org/10.1007/978-3-030-58135-0_2

  • DOI: https://doi.org/10.1007/978-3-030-58135-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58134-3

  • Online ISBN: 978-3-030-58135-0

  • eBook Packages: Computer Science, Computer Science (R0)
