PROCLAIM: An Unsupervised Approach to Discover Domain-Specific Attribute Matchings from Heterogeneous Sources

Arman, Molood; Wlodarczyk, Sylvain; Bennacer Seghouani, Nacéra; Bugiotti, Francesca

doi:10.1007/978-3-030-58135-0_2

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 386)

Included in the following conference series:

International Conference on Advanced Information Systems Engineering

Abstract

Schema matching is a critical problem in many applications, where the main goal is to match attributes coming from heterogeneous sources. In this paper, we propose PROCLAIM (PROfile-based Cluster-Labeling for AttrIbute Matching), an automatic, unsupervised, clustering-based approach to match the attributes of a large number of heterogeneous sources. We define the concept of an attribute profile, which characterizes the main properties of an attribute using (i) the statistical distribution and the dimension of the attribute’s values, and (ii) the name and textual descriptions related to the attribute. The attribute matchings produced by PROCLAIM give the best representation of heterogeneous sources thanks to the cluster-labeling function we define. We evaluate PROCLAIM on 45,000 different data sources from an oil and gas authority's open data website (the data is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, CC BY-NC-SA 4.0). The results are promising and validate our approach.
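
As a rough illustration of the idea described in the abstract, the following minimal sketch builds a toy attribute profile from value statistics and name tokens, groups similar profiles, and labels each group. It is not the PROCLAIM implementation: the feature set, the similarity function, the greedy clustering, the 0.5 threshold, the most-frequent-token labeling rule, and the sample attributes are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev
import re


def profile(name, values):
    """Toy attribute profile: value statistics plus name tokens (hypothetical features)."""
    return {
        "name": name,
        "stats": (mean(values), pstdev(values), min(values), max(values)),
        "tokens": set(re.findall(r"[a-z]+", name.lower())),
    }


def similarity(p, q):
    """Equal-weight mix of statistical closeness and Jaccard overlap of name tokens."""
    # Relative closeness of each statistic, clamped to [0, 1].
    stat_sim = mean(
        max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b), 1e-9))
        for a, b in zip(p["stats"], q["stats"])
    )
    union = p["tokens"] | q["tokens"]
    name_sim = len(p["tokens"] & q["tokens"]) / len(union) if union else 0.0
    return 0.5 * stat_sim + 0.5 * name_sim


def cluster(profiles, threshold=0.5):
    """Greedy single pass: join the first cluster whose first member is similar enough."""
    clusters = []
    for p in profiles:
        for members in clusters:
            if similarity(p, members[0]) >= threshold:
                members.append(p)
                break
        else:
            clusters.append([p])
    return clusters


def label(members):
    """Label a cluster with its most frequent name token (assumed labeling rule)."""
    counts = Counter(t for p in members for t in sorted(p["tokens"]))
    return counts.most_common(1)[0][0]


# Made-up attributes from two imaginary sources.
attributes = [
    ("well_depth_m", [1200.0, 1500.0, 1800.0]),
    ("depth", [1250.0, 1450.0, 1750.0]),
    ("porosity_pct", [12.0, 15.0, 18.0]),
    ("porosity", [11.0, 16.0, 19.0]),
]
clusters = cluster([profile(n, v) for n, v in attributes])
labels = [label(c) for c in clusters]
print(labels)
```

Combining value statistics with name tokens lets two attributes match even when their names differ, provided their values are distributed similarly. A density-based clustering method would be a natural replacement for the greedy pass used here.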

Notes

1. Data from: https://www.kaggle.com/.
2. The data is published under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Author information

Authors and Affiliations

Services Pétroliers Schlumberger, 34000, Montpellier, France
Molood Arman & Sylvain Wlodarczyk
Université Paris-Saclay, CNRS, Laboratoire de Recherche en Informatique, 91405, Orsay, France
Molood Arman, Nacéra Bennacer Seghouani & Francesca Bugiotti

Corresponding author

Correspondence to Molood Arman.

Editor information

Editors and Affiliations

Université Paris1 Panthéon-Sorbonne, Paris, France
Nicolas Herbaut
University of Melbourne, Melbourne, VIC, Australia
Marcello La Rosa

About this paper

Cite this paper

Arman, M., Wlodarczyk, S., Bennacer Seghouani, N., Bugiotti, F. (2020). PROCLAIM: An Unsupervised Approach to Discover Domain-Specific Attribute Matchings from Heterogeneous Sources. In: Herbaut, N., La Rosa, M. (eds) Advanced Information Systems Engineering. CAiSE 2020. Lecture Notes in Business Information Processing, vol 386. Springer, Cham. https://doi.org/10.1007/978-3-030-58135-0_2

DOI: https://doi.org/10.1007/978-3-030-58135-0_2
Published: 28 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58134-3
Online ISBN: 978-3-030-58135-0
eBook Packages: Computer Science, Computer Science (R0)
