Abstract
The increased flexibility brought by Data Lake technologies, along with size and heterogeneity of quickly changing data sources, bring novel challenges to their management. Making sense of disparate data and supporting users to identify the most relevant sources for a given analytic request are indeed critical requirements to make data actionable. This is particularly relevant in data science applications, where users want to analyse statistical measures from a variety of data sources. To this aim, in the paper we introduce a knowledge-based approach for a Semantic Data Lake, capable of supporting efficient integration of data sources and their alignment to a Knowledge Graph representing indicators of interest, their mathematical formulas and dimensions of analysis. By leveraging manipulation of indicator formulas, a query-driven discovery approach is exploited to dynamically identify the sources, along with the needed transformations, to respond a given .
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
KPIOnto specifications are available at http://w3id.org/kpionto.
- 6.
- 7.
Under this assumption, the set containment is equivalent to the overlap (or Szymkiewicz–Simpson) coefficient, i.e., \(\frac{|X \cap Y|}{|min(X,Y)|}\).
- 8.
References
Broder, A.Z .: On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 21–29. IEEE (1997)
Diamantini, C., Lo Giudice, P., Potena, D., Storti, E., Ursino, D.: An approach to extracting topic-guided views from the sources of a data lake. Inf. Syst. Front. 23, 243–262 (2021)
Diamantini, C., Potena, D., Storti, E.: Analytics for citizens: a linked open data model for statistical data exploration. Concurr. Comput. Pract. Exp. 33(8), e4186 (2021)
Diamantini, C., Potena, D., Storti, E.: A semantic data lake model for analytic query-driven discovery. In: The 23rd International Conference on Information Integration and Web Intelligence, iiWAS2021, pp. 183–186. Association for Computing Machinery, New York, NY, USA (2021)
Farid, M., Roatis, A., Ilyas, I.F., Hoffmann, H., Chu, X.: CLAMS: bringing quality to Data Lakes. In: Proceedings of the International Conference on Management of Data (SIGMOD/PODS 2016), pp. 2089–2092. ACM, San Francisco, CA, USA (2016)
Fernandez, R.C.: Seeping semantics: linking datasets using word embeddings for data discovery. In: 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 989–1000. IEEE (2018)
Giebler, C., Gröger, C., Hoos, E., Schwarz, H., Mitschang, B.: Leveraging the data lake: current state and challenges. In: Ordonez, C., Song, I., Anderst-Kotsis, G., Tjoa, A.M., Khalil, I. (eds.) Big Data Analytics and Knowledge Discovery. pp, pp. 179–188. Springer International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-27520-4_13
Hai, R., Geisler, S., Quix, C.: Constance: an intelligent data lake system. In: Proceedings of the International Conference on Management of Data (SIGMOD 2016), pp. 2097–2100. ACM, San Francisco, CA, USA (2016)
Hale, T., Webster, S., Petherick, A., Phillips, T., Kira, B.: Oxford COVID-19 government response tracker. Technical report, Blavatnik School of Government (2020)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613 (1998)
Microsoft. Covid-19 data lake. https://docs.microsoft.com/en-us/azure/open-datasets/dataset-covid-19-data-lake. Accessed 23 Feb 2022
Miller, R.J.: Open data integration. Proc. VLDB Endow. 11(12), 2130–2139 (2018)
Nargesian, F., Zhu, E., Miller, R.J., Pu, K.Q., Arocena, P.C.: Data lake management: challenges and opportunities. Proc. VLDB Endow. 12(12), 1986–1989 (2019)
Oram, A.: Managing the Data Lake. O’Reilly, Sebastopol (2015)
Sawadogo, P., Darmont, J.: On data lake architectures and metadata management. J. Intell. Inf. Syst. 56(1), 97–120 (2020). https://doi.org/10.1007/s10844-020-00608-7
Zhu, E., Nargesian, F., Pu, K.Q., Miller, R.J.: LSH ensemble: internet-scale domain search. Proc. VLDB Endow. 9(12), 1185–1196 (2016)
Zhu, E., Pu, K.Q., Nargesian, F., Miller, R.J.: Interactive navigation of open data linkages. Proc. VLDB Endow. 10(12), 1837–1840 (2017)
Zhu, E., Markovtsev, V.: ekzhu/datasketch: first stable release, February 2017. https://doi.org/10.5281/zenodo.290602
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Diamantini, C., Potena, D., Storti, E. (2022). A Knowledge-Based Approach to Support Analytic Query Answering in Semantic Data Lakes. In: Chiusano, S., Cerquitelli, T., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2022. Lecture Notes in Computer Science, vol 13389. Springer, Cham. https://doi.org/10.1007/978-3-031-15740-0_14
Download citation
DOI: https://doi.org/10.1007/978-3-031-15740-0_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15739-4
Online ISBN: 978-3-031-15740-0
eBook Packages: Computer ScienceComputer Science (R0)