Data Catalogs: A Systematic Literature Review and Guidelines to Implementation

  • Conference paper
Database and Expert Systems Applications - DEXA 2021 Workshops (DEXA 2021)

Abstract

In enterprises, data is usually distributed across multiple data sources and stored in heterogeneous formats. Harmonizing and integrating this data is a prerequisite for leveraging it in AI initiatives. Data catalogs have recently emerged as a promising solution for semantically classifying and organizing data sources across different environments and for enriching raw data with metadata. They thereby make it possible to create a single, clear, and easily accessible interface for training and testing computational models. Despite a lively discussion among practitioners, there is little research on data catalogs. In this paper, we systematically review the existing literature and answer the following questions: (1) What are the conceptual components of a data catalog? and (2) Which guidelines can be recommended for implementing a data catalog? The results benefit practitioners implementing a data catalog to accelerate an AI initiative, and provide researchers with a compilation of future research directions.

The research in this paper has been funded by BMK, BMDW, and the Province of Upper Austria in the frame of the COMET Programme managed by FFG.

Notes

  1. VizieR is an online data catalog for astronomical data: http://vizier.u-strasbg.fr.

  2. https://www.w3.org/TR/vocab-dcat (Apr. 2021).

  3. https://www.w3.org/TR/prov-o (Apr. 2021).

  4. https://www.go-fair.org/fair-principles (Apr. 2021).

  5. https://github.com/lisehr/dq-meerkat.

Author information

Corresponding author

Correspondence to Lisa Ehrlinger.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ehrlinger, L., Schrott, J., Melichar, M., Kirchmayr, N., Wöß, W. (2021). Data Catalogs: A Systematic Literature Review and Guidelines to Implementation. In: Kotsis, G., et al. Database and Expert Systems Applications - DEXA 2021 Workshops. DEXA 2021. Communications in Computer and Information Science, vol 1479. Springer, Cham. https://doi.org/10.1007/978-3-030-87101-7_15

  • DOI: https://doi.org/10.1007/978-3-030-87101-7_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87100-0

  • Online ISBN: 978-3-030-87101-7

  • eBook Packages: Computer Science, Computer Science (R0)
