Abstract
In enterprises, data is usually distributed across multiple data sources and stored in heterogeneous formats. Harmonizing and integrating this data is a prerequisite for leveraging it in AI initiatives. Recently, data catalogs have emerged as a promising solution to semantically classify and organize data sources across different environments and to enrich raw data with metadata. Data catalogs therefore make it possible to create a single, clear, and easily accessible interface for training and testing computational models. Despite a lively discussion among practitioners, there is little research on data catalogs. In this paper, we systematically review the existing literature and answer the following questions: (1) What are the conceptual components of a data catalog? and (2) Which guidelines can be recommended for implementing a data catalog? The results benefit practitioners in implementing a data catalog to accelerate any AI initiative and provide researchers with a compilation of future research directions.
The research in this paper has been funded by BMK, BMDW, and the Province of Upper Austria in the frame of the COMET Programme managed by FFG.
Notes
1. VizieR is an online data catalog for astronomical data: http://vizier.u-strasbg.fr.
2. https://www.w3.org/TR/vocab-dcat (Apr. 2021).
3. https://www.w3.org/TR/prov-o (Apr. 2021).
4. https://www.go-fair.org/fair-principles (Apr. 2021).
5.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ehrlinger, L., Schrott, J., Melichar, M., Kirchmayr, N., Wöß, W. (2021). Data Catalogs: A Systematic Literature Review and Guidelines to Implementation. In: Kotsis, G., et al. Database and Expert Systems Applications - DEXA 2021 Workshops. DEXA 2021. Communications in Computer and Information Science, vol 1479. Springer, Cham. https://doi.org/10.1007/978-3-030-87101-7_15
DOI: https://doi.org/10.1007/978-3-030-87101-7_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87100-0
Online ISBN: 978-3-030-87101-7
eBook Packages: Computer Science, Computer Science (R0)