Tools for Healthcare Data Lake Infrastructure Benchmarking

Information Systems Frontiers

Abstract

Vast amounts of medical data are generated every day and constitute a crucial asset for improving therapy outcomes and medical treatments and for containing healthcare costs. Data lakes are a valuable solution for managing and analyzing such varied and abundant data, yet to date there is no data lake architecture specifically designed for the healthcare domain. Moreover, benchmarking the infrastructure underlying a data lake is fundamental for optimizing resource allocation and performance, and thus for increasing the potential of this kind of data platform. This work describes a data lake architecture to ingest, store, process, and analyze heterogeneous medical data. We also present a benchmark for infrastructures supporting healthcare data lakes, covering a variety of analysis tasks, from relational analysis to machine learning. The benchmark is tested on a virtualized implementation of our data lake architecture and on two external cloud-based infrastructures. Our results highlight differences between infrastructures and between tasks of different nature, depending on the machine learning techniques, data sizes, and data formats involved.

Data Availability

The MIMIC-III dataset is available with credentialed access on the PhysioNet website: https://physionet.org/content/mimiciii, and the MIMIC-III Waveform Database alone at: https://physionet.org/content/mimic3wdb-matched. The remaining datasets are freely available online. Stroke Prediction Dataset at: https://kaggle.com/datasets/fedesoriano/stroke-prediction-dataset. ICU Patients Mortality Prediction Dataset at: https://kaggle.com/datasets/msafi04/predict-mortality-of-icu-patients-physionet and from PhysioNet: https://physionet.org/content/challenge-2012. Brain MRI Images Dataset at: https://kaggle.com/datasets/navoneel/brain-mri-images-for-brain-tumor-detection. MIT-BIH Arrhythmia Database at: https://physionet.org/physiobank/database/mitdb. MIT-BIH Normal Sinus Rhythm Database at: https://physionet.org/physiobank/database/nsrdb. BIDMC Congestive Heart Failure Database at: https://physionet.org/physiobank/database/chfdb.

Code Availability

Code regarding the tasks included in the benchmark is available at: https://github.com/TommasoD/SEASHELL. The proof-of-concept implementation of the data lake architecture is available at: https://github.com/MancoCarlo/healer-prototype.

Notes

  1. https://hadoop.apache.org

  2. https://nifi.apache.org

  3. https://kafka.apache.org

  4. https://pypi.org/project/hdfs

  5. https://atlas.apache.org

  6. https://ranger.apache.org

  7. https://www.docker.com

  8. https://kubernetes.io

  9. The proof-of-concept implementation of the data lake architecture is available at: https://github.com/MancoCarlo/healer-prototype.

  10. Code from the benchmark tasks is available at: https://github.com/TommasoD/SEASHELL.

  11. https://physionet.org/content/mimiciii

  12. https://kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

  13. https://physionet.org/content/challenge-2012

  14. https://physionet.org/physiobank/database/mitdb

  15. https://physionet.org/physiobank/database/nsrdb

  16. https://physionet.org/physiobank/database/chfdb

  17. https://kaggle.com/datasets/navoneel/brain-mri-images-for-brain-tumor-detection

  18. https://spark.apache.org/sql

  19. https://keras.io/api

  20. https://tensorflow.org

  21. https://www.databricks.com/product/faq/community-edition

  22. https://colab.research.google.com

  23. https://www.netdata.cloud

  24. While a computer has system RAM, most contemporary graphics cards have access to a dedicated set of memory known as Video RAM, or VRAM.

Acknowledgements

We are grateful to Enrico Barbierato and Giuseppe Serazzi for their advice during the definition and realization of this work, and the support in the revision of the paper.

Funding

This work has been partially supported by the Health Big Data Project (CCR-2018-23669122), funded by the Italian Ministry of Economy and Finance and coordinated by the Italian Ministry of Health and the network Alleanza Contro il Cancro.

Author information

Contributions

All authors contributed to the definition and the design of this research. The data lake architecture was mainly created and implemented by Carlo Manco, with contributions from Tommaso Dolci, Fabio Azzalini, Marco Gribaudo and Letizia Tanca. The implementation and testing of the benchmark were mainly conducted by Lorenzo Amata, with contributions from Tommaso Dolci, Fabio Azzalini, Marco Gribaudo and Letizia Tanca. The first draft of the manuscript was written by Tommaso Dolci, Carlo Manco and Lorenzo Amata, and all authors contributed to the revision and improvement of the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Tommaso Dolci.

Ethics declarations

Ethics Approval and Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Conflict of Interest

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dolci, T., Amata, L., Manco, C. et al. Tools for Healthcare Data Lake Infrastructure Benchmarking. Inf Syst Front (2024). https://doi.org/10.1007/s10796-023-10468-5

  • DOI: https://doi.org/10.1007/s10796-023-10468-5
