Joint Management and Analysis of Textual Documents and Tabular Data Within the AUDAL Data Lake

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12843)

Abstract

In 2010, the concept of data lake emerged as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although popular in both industry and academia, the concept is still maturing, and few methodological approaches to data lake design exist. We therefore introduce a new approach to designing a data lake and propose an extensive metadata system that enables richer features than those usually supported in data lake approaches. We implement our approach in the AUDAL data lake, where we jointly exploit textual documents and tabular data, in contrast with the structured and/or semi-structured data typically processed in data lakes from the literature. We also innovate by leveraging metadata to enable both data retrieval and content analysis, including Text-OLAP and SQL querying. Finally, we show the feasibility of our approach using a real-world use case on the one hand and a benchmark on the other.

Notes

  1. AURA-PMI is a multidisciplinary project in Management and Computer Sciences, aiming at studying the digital transformation, servicization and business model mutation of industrial SMEs in the French Auvergne-Rhône-Alpes (AURA) Region.

  2. https://github.com/Pegdwende44/AUDAL

Acknowledgments

P. N. Sawadogo’s Ph.D. is funded by the Auvergne-Rhône-Alpes Region through the AURA-PMI project.

Corresponding author

Correspondence to Pegdwendé N. Sawadogo.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Sawadogo, P.N., Darmont, J., Noûs, C. (2021). Joint Management and Analysis of Textual Documents and Tabular Data Within the AUDAL Data Lake. In: Bellatreche, L., Dumas, M., Karras, P., Matulevičius, R. (eds) Advances in Databases and Information Systems. ADBIS 2021. Lecture Notes in Computer Science, vol 12843. Springer, Cham. https://doi.org/10.1007/978-3-030-82472-3_8

  • DOI: https://doi.org/10.1007/978-3-030-82472-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82471-6

  • Online ISBN: 978-3-030-82472-3

  • eBook Packages: Computer Science, Computer Science (R0)
