Abstract
Vast amounts of medical data are generated every day and constitute a crucial asset for improving therapy outcomes and medical treatments and for reducing healthcare costs. Data lakes are a valuable solution for managing and analyzing such varied and abundant data, yet to date there is no data lake architecture specifically designed for the healthcare domain. Moreover, benchmarking the infrastructure underlying a data lake is fundamental for optimizing resource allocation and performance, and thus for increasing the potential of this kind of data platform. This work describes a data lake architecture to ingest, store, process, and analyze heterogeneous medical data. We also present a benchmark for infrastructures supporting healthcare data lakes, covering a variety of analysis tasks, from relational analysis to machine learning. The benchmark is tested on a virtualized implementation of our data lake architecture and on two external cloud-based infrastructures. Our results highlight differences between infrastructures and between tasks of different nature, depending on the machine learning techniques, data sizes, and formats involved.
Data Availability
The MIMIC-III dataset is available with credentialed access on the PhysioNet website: https://physionet.org/content/mimiciii, and the MIMIC-III Waveform Database alone at: https://physionet.org/content/mimic3wdb-matched. The remaining datasets are freely available online. Stroke Prediction Dataset at: https://kaggle.com/datasets/fedesoriano/stroke-prediction-dataset. ICU Patients Mortality Prediction Dataset at: https://kaggle.com/datasets/msafi04/predict-mortality-of-icu-patients-physionet and from PhysioNet at: https://physionet.org/content/challenge-2012. Brain MRI Images Dataset at: https://kaggle.com/datasets/navoneel/brain-mri-images-for-brain-tumor-detection. MIT-BIH Arrhythmia Database at: https://physionet.org/physiobank/database/mitdb. MIT-BIH Normal Sinus Rhythm Database at: https://physionet.org/physiobank/database/nsrdb. BIDMC Congestive Heart Failure Database at: https://physionet.org/physiobank/database/chfdb.
Code Availability
Code regarding the tasks included in the benchmark is available at: https://github.com/TommasoD/SEASHELL. The proof-of-concept implementation of the data lake architecture is available at: https://github.com/MancoCarlo/healer-prototype.
Notes
While a computer has system RAM, most contemporary graphics cards have access to a dedicated set of memory known as Video RAM, or VRAM.
Acknowledgements
We are grateful to Enrico Barbierato and Giuseppe Serazzi for their advice during the definition and realization of this work, and for their support in revising the paper.
Funding
This work has been partially supported by the Health Big Data Project (CCR-2018-23669122), funded by the Italian Ministry of Economy and Finance and coordinated by the Italian Ministry of Health and the network Alleanza Contro il Cancro.
Author information
Contributions
All authors contributed to the definition and design of this research. The data lake architecture was mainly created and implemented by Carlo Manco, with contributions from Tommaso Dolci, Fabio Azzalini, Marco Gribaudo and Letizia Tanca. The benchmark was mainly implemented and tested by Lorenzo Amata, with contributions from Tommaso Dolci, Fabio Azzalini, Marco Gribaudo and Letizia Tanca. The first draft of the manuscript was written by Tommaso Dolci, Carlo Manco and Lorenzo Amata, and all authors contributed to the revision and improvement of the paper. All authors read and approved the final manuscript.
Ethics declarations
Ethics Approval and Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Conflict of Interest
Not applicable.
About this article
Cite this article
Dolci, T., Amata, L., Manco, C. et al. Tools for Healthcare Data Lake Infrastructure Benchmarking. Inf Syst Front (2024). https://doi.org/10.1007/s10796-023-10468-5
DOI: https://doi.org/10.1007/s10796-023-10468-5