Joint Management and Analysis of Textual Documents and Tabular Data Within the AUDAL Data Lake

Sawadogo, Pegdwendé N.; Darmont, Jérôme; Noûs, Camille

doi:10.1007/978-3-030-82472-3_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12843)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

In 2010, the data lake concept emerged as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although popular in both industry and academia, the concept is still maturing, and there are few methodological approaches to data lake design. Thus, we introduce a new approach to designing a data lake and propose an extensive metadata system that enables richer features than those usually supported in data lake approaches. We implement our approach in the AUDAL data lake, where we jointly exploit textual documents and tabular data, in contrast with the structured and/or semi-structured data typically processed in data lakes from the literature. Furthermore, we innovate by leveraging metadata to enable both data retrieval and content analysis, including Text-OLAP and SQL querying. Finally, we show the feasibility of our approach using a real-world use case on the one hand, and a benchmark on the other hand.
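To make the two analysis styles named in the abstract more concrete, here is a minimal sketch, not AUDAL's implementation. It assumes a toy in-memory catalog (`CATALOG`) in which raw objects are stored as-is alongside descriptive metadata; the helper names `term_counts_by` and `query_tables`, the sample data, and the use of an in-memory SQLite database are all hypothetical choices made for illustration. The first helper aggregates term frequencies of textual documents along a metadata dimension, in the spirit of Text-OLAP; the second imposes a relational schema on tabular data only at query time (schema-on-read) and runs plain SQL on it.

```python
import sqlite3
from collections import Counter, defaultdict

# Hypothetical catalog: raw objects kept as-is, described by metadata.
CATALOG = [
    {"id": "doc1", "kind": "text", "year": 2019, "content": "data lake metadata"},
    {"id": "doc2", "kind": "text", "year": 2020, "content": "metadata for text analysis"},
    {"id": "doc3", "kind": "text", "year": 2020, "content": "data lake text analysis"},
    {"id": "tab1", "kind": "table", "name": "sales",
     "columns": ["region", "amount"],
     "rows": [("north", 10), ("south", 25), ("north", 5)]},
]


def term_counts_by(dimension):
    """Group text documents by a metadata field and count terms per group."""
    groups = defaultdict(Counter)
    for entry in CATALOG:
        if entry["kind"] == "text":
            groups[entry[dimension]].update(entry["content"].split())
    return dict(groups)


def query_tables(sql):
    """Load catalogued tables into SQLite at query time, then run the SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        for entry in CATALOG:
            if entry["kind"] == "table":
                cols = ", ".join(entry["columns"])
                marks = ", ".join("?" for _ in entry["columns"])
                # Table and column names come from the trusted toy catalog above.
                conn.execute(f"CREATE TABLE {entry['name']} ({cols})")
                conn.executemany(
                    f"INSERT INTO {entry['name']} VALUES ({marks})", entry["rows"]
                )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(term_counts_by("year"))
    print(query_tables(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ))
```

In this sketch the metadata does the work in both cases: the `year` field selects and groups documents before any text is analysed, and the `columns` entry is what allows a table to be materialised only when a query arrives.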

Notes

1. AURA-PMI is a multidisciplinary project in Management and Computer Sciences, aiming at studying the digital transformation, servicization and business model mutation of industrial SMEs in the French Auvergne-Rhône-Alpes (AURA) Region.
2. https://github.com/Pegdwende44/AUDAL

Acknowledgments

P. N. Sawadogo’s Ph.D. is funded by the Auvergne-Rhône-Alpes Region through the AURA-PMI project.

Author information

Authors and Affiliations

Université de Lyon, Lyon 2, UR ERIC, 5 avenue Pierre Mendès France, 69676 Bron Cedex, France
Pegdwendé N. Sawadogo & Jérôme Darmont
Université de Lyon, Lyon 2, Laboratoire Cogitamus, Bron, France
Camille Noûs

Corresponding author

Correspondence to Pegdwendé N. Sawadogo.

Editor information

Editors and Affiliations

LIAS/ISAE-ENSMA, Futuroscope Chasseneuil Cedex, France
Ladjel Bellatreche
University of Tartu, Tartu, Estonia
Marlon Dumas
Aarhus University, Aarhus, Denmark
Panagiotis Karras
University of Tartu, Tartu, Estonia
Raimundas Matulevičius

About this paper

Cite this paper

Sawadogo, P.N., Darmont, J., Noûs, C. (2021). Joint Management and Analysis of Textual Documents and Tabular Data Within the AUDAL Data Lake. In: Bellatreche, L., Dumas, M., Karras, P., Matulevičius, R. (eds) Advances in Databases and Information Systems. ADBIS 2021. Lecture Notes in Computer Science, vol 12843. Springer, Cham. https://doi.org/10.1007/978-3-030-82472-3_8

DOI: https://doi.org/10.1007/978-3-030-82472-3_8
Published: 16 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82471-6
Online ISBN: 978-3-030-82472-3
eBook Packages: Computer Science, Computer Science (R0)
