Abstract
In 2010, the concept of the data lake emerged as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although trendy in both industry and academia, the concept is still maturing, and there are few methodological approaches to data lake design. Thus, we introduce a new approach to designing a data lake and propose an extensive metadata system that enables richer features than those usually supported in data lake approaches. We implement our approach in the AUDAL data lake, where we jointly exploit both textual documents and tabular data, in contrast with the structured and/or semi-structured data typically processed in data lakes from the literature. Furthermore, we innovate by leveraging metadata to enable both data retrieval and content analysis, including Text-OLAP and SQL querying. Finally, we show the feasibility of our approach using both a real-world use case and a benchmark.
Notes
- 1. AURA-PMI is a multidisciplinary project in Management and Computer Sciences, aiming at studying the digital transformation, servicization and business model mutation of industrial SMEs in the French Auvergne-Rhône-Alpes (AURA) Region.
Acknowledgments
P. N. Sawadogo’s Ph.D. is funded by the Auvergne-Rhône-Alpes Region through the AURA-PMI project.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Sawadogo, P.N., Darmont, J., Noûs, C. (2021). Joint Management and Analysis of Textual Documents and Tabular Data Within the AUDAL Data Lake. In: Bellatreche, L., Dumas, M., Karras, P., Matulevičius, R. (eds) Advances in Databases and Information Systems. ADBIS 2021. Lecture Notes in Computer Science, vol 12843. Springer, Cham. https://doi.org/10.1007/978-3-030-82472-3_8
DOI: https://doi.org/10.1007/978-3-030-82472-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82471-6
Online ISBN: 978-3-030-82472-3
eBook Packages: Computer Science, Computer Science (R0)