Metadata Discovery Using Data Sampling and Exploratory Data Analysis

  • Hiba Khalid
  • Robert Wrembel
  • Esteban Zimányi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11815)


Metadata discovery is a prominent contributor towards understanding the semantics of data, relationships between data, and fundamental data features for the purpose of data management, query processing, and data integration. Metadata discovery is constantly evolving with the help of data profiling and manual annotators, which has produced a variety of good-quality data profiling techniques and tools. Even though different metadata standards are specified for distinct fields such as finance, biology, experimental physics, and medicine, there is no generic method that discovers metadata automatically or presents them in a unified way. In this paper, we present a technique for discovering and generating metadata for data sources that do not provide explicit metadata. To this end, we apply exploratory data analysis to produce two kinds of metadata, i.e., administrative and technical, in order to find similarities between resources with respect to their structures and contents. Our technique was evaluated experimentally. The results show that the technique makes it possible to identify similar data sources and to compute their similarity measures.
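The idea of profiling a sample of each source and comparing the resulting metadata can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sample_rows`, `infer_type`, `profile`, `structural_similarity`), the choice of simple random sampling, the three per-column statistics, and the Jaccard measure over (column name, inferred type) pairs are all assumptions made for illustration, and the two tiny CSV sources are invented.

```python
import csv
import io
import random


def sample_rows(rows, k, seed=0):
    """Draw a simple random sample of up to k rows (assumed sampling step)."""
    rng = random.Random(seed)
    return rows if len(rows) <= k else rng.sample(rows, k)


def infer_type(values):
    """Guess a coarse column type from sampled string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def profile(csv_text, sample_size=100):
    """Build a small technical-metadata record per column from a row sample."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    sample = sample_rows(rows, sample_size)
    meta = {}
    for col in (rows[0].keys() if rows else []):
        values = [r[col] for r in sample]
        meta[col.strip().lower()] = {
            "type": infer_type(values),
            "null_ratio": sum(v == "" for v in values) / len(values),
            "distinct_ratio": len(set(values)) / len(values),
        }
    return meta


def structural_similarity(meta_a, meta_b):
    """Jaccard similarity over (column name, inferred type) pairs."""
    a = {(c, m["type"]) for c, m in meta_a.items()}
    b = {(c, m["type"]) for c, m in meta_b.items()}
    return len(a & b) / len(a | b) if a | b else 0.0


# Two invented sources without explicit metadata (illustrative data only).
source_a = "accident_id,speed_limit,severity\n1,30,slight\n2,60,serious\n"
source_b = "accident_id,speed_limit,weather\n7,30,rain\n8,70,\n"

similarity = structural_similarity(profile(source_a), profile(source_b))
print(similarity)
```

In this sketch the two sources share two of four distinct (name, type) pairs, so the structural score is 0.5. A content-level comparison, for example over value distributions of matching columns, would need additional statistics that are not shown here.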


Data profiling · Metadata management · Discovery · Enrichment



The work of Hiba Khalid is supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

The work of Robert Wrembel is supported by a grant from the Polish National Agency for Academic Exchange, within the Bekker programme.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Hiba Khalid (1, 2)
  • Robert Wrembel (2)
  • Esteban Zimányi (1)

  1. Université Libre de Bruxelles, Brussels, Belgium
  2. Poznan University of Technology, Poznań, Poland
