
An Overview of Big Data Architectures in Healthcare

  • Hugo Torres
  • Filipe Portela
  • Manuel Filipe Santos
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 747)

Abstract

Big Data has been linked to increased efficiency and effectiveness in many areas. Although many studies have tried to demonstrate the value of Big Data in healthcare and medicine, few practical advances have been made. In this project, the existing Big Data technologies applied in healthcare were analyzed and compared. We analyzed a Big Data solution developed for the INTCare project, a Hadoop-based solution proposed for the Maharaja Yeshwantrao hospital in India, and a solution that uses Apache Spark. These three solutions are based on open source technology. The proprietary solutions analyzed were the IBM PureData Solution for Healthcare Analytics, used at Seattle Children’s Hospital, and Cisco Connected Health Solutions and Services.

Keywords

Big Data, INTCare, HealthCare

1 Introduction

Since the invention of computers, large volumes of data have been generated at a surprising rate [1]. Up to 2003, humanity had created 5 exabytes of data; currently, this amount is created in 2 days [2]. In this context, Big Data began to reveal its true potential for dealing with large volumes of data that come from various sources and are generated at high speed. The health industry generates huge amounts of data, though most of it is stored in non-digital format. Nowadays, the trend is to digitize most of the information [3]. According to Feldman et al. [4], the increase in the volume of data in the health industry comes not only from the creation of new forms of data (three-dimensional images, biometric sensor readings, and others), but also from the conversion of existing data, such as radiology images, DNA sequence data and others, to digital format. Given the noticeable delay in the adoption of Big Data technologies by the health industry, it is necessary to identify the challenges and potential uses of Big Data in this industry and to identify cases of adoption of Big Data technologies in hospitals and healthcare clinics. To ease the adoption of Big Data technologies in the health industry, this study aims to identify, analyze, filter and compare existing solutions.

This paper is divided into six sections: Introduction; Background; Big Data Architectures Used in Healthcare; Benchmarking; Discussion; and Conclusion and Future Work. Sect. 2 presents the concept of Big Data and its potential in the healthcare industry. Sect. 3 describes the Big Data architectures identified in healthcare. Sect. 4 compares the selected solutions. Sect. 5 analyzes the results in the context of the project. Finally, Sect. 6 summarizes the main conclusions and gives a short description of future work.

2 Background

2.1 Big Data

“Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” Manyika et al. [8].

According to Zikopoulos et al. [5], the three characteristics that define Big Data are Volume, Variety and Velocity. Hurwitz et al. [6] state that these three V’s are too simplistic and propose a fourth V, Veracity. Some authors in the literature add a fifth V, Value [7]. Therefore, Big Data can be characterized by: (a) Volume – the large amounts of data generated, which grow exponentially and come from a variety of sources; (b) Variety – according to Zikopoulos et al. [5], society spends a large part of its time on structured data (representing 20% of the total volume of data generated), but the challenge lies in the remaining 80%, which is semi-structured or unstructured; (c) Velocity – the speed at which the data is generated. Today, much of the data that is generated has an “expiration date”, that is, it is only relevant to organizations if it is analyzed almost in real time [5]; (d) Value – the economic value of the data; (e) Veracity – the quality of the data.

According to Feldman et al. [4], the increase in the volume of data in the health industry comes, not only from the creation of new forms of data (three-dimensional images, biometric sensor readings, among others), but also from the digitization of existing data, such as radiology images, DNA sequence data, among others.

The McKinsey Global Institute conducted a study to understand the potential of Big Data in five areas, one of which was the health area in the United States. Despite the importance of this sector in the country’s economy (representing 17% of GDP), it is still possible to notice a delay in the adoption of Big Data in relation to other industries [8].

3 Big Data Architectures Used in Healthcare

3.1 Big Data Architecture for the INTCare Project

INTCare is a research project developed at the Intensive Care Unit (ICU) of the Centro Hospitalar do Porto, which, at an early stage, was designed to develop an intelligent system to predict organ failure and its effects on users (Portela et al. [10]). The INTCare project uses a continuous flow and real-time data collection, from several sensors and monitoring devices, which generates a volume of data from 50 to 500 Terabytes [9]. According to Portela et al. [10], in 2009 the excessive amount of medical records on paper or manually entered into the database became apparent. After a set of studies aimed at identifying the gaps in the ICU information system, it was possible to develop a new solution based on intelligent systems capable of performing automatic tasks such as data acquisition and processing [10]. Nowadays, INTCare is a Pervasive Intelligent Decision Support System (PIDSS) that acts automatically and in real-time, in order to provide new information to decision-making entities in the ICU, i.e. physicians and nurses [10, 11, 12]. The Data Management subsystem of this architecture relies on Apache Kafka and Apache Storm for streaming data processing. For operational data processing it relies on Apache Phoenix and Apache HBase. The processing of analytical data is ensured by Apache Hive. Security, administration and operations are ensured by the following Hadoop subprojects: Apache Sqoop; Apache Knox; Apache Ranger; Apache Flume; and Apache Oozie.
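The record-by-record processing that such a streaming subsystem performs can be illustrated with a minimal sketch. It uses only plain Python generators as stand-ins for a message topic and a stream-processing step; it does not use the Apache Kafka or Apache Storm APIs, and the sensor names, values and window size are hypothetical.

```python
from collections import deque


def sensor_stream(readings):
    """Stand-in for a message topic: yields (sensor, value) records in order."""
    for record in readings:
        yield record


def rolling_mean(stream, window=3):
    """Stand-in for a stream-processing step.

    For each incoming record, emits the mean of the last `window`
    values seen for that sensor.
    """
    history = {}
    for sensor, value in stream:
        values = history.setdefault(sensor, deque(maxlen=window))
        values.append(value)
        yield sensor, sum(values) / len(values)


# Hypothetical bedside readings (heart rate and oxygen saturation).
readings = [("hr", 60), ("hr", 80), ("spo2", 98), ("hr", 100), ("hr", 120)]
for sensor, mean in rolling_mean(sensor_stream(readings)):
    print(sensor, mean)
```

Each record is handled as soon as it arrives, so a result is available without waiting for the rest of the data.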

3.2 Big Data Architecture with Apache Spark

Liu et al. [13] proposed an architecture for a Big Data processing tool, designed for the health area, that includes Apache Spark.

The “Data Storage” layer of this architecture consists of the Hadoop Distributed File System (HDFS) and Apache HBase. The “Data Processing” layer contains Apache Spark and Spark Streaming. Apache Hive and Spark SQL are part of the “Access to data” layer. Finally, the “Analytics and Business Intelligence” layer contains the MLlib, GraphX and SparkR tools.

3.3 Big Data Architecture Designed for the Maharaja Yeshwantrao Hospital

Ojha and Mathur [14] proposed a Big Data architecture to address the needs of the Maharaja Yeshwantrao hospital located in Indore, India, which the authors consider the largest public hospital in central India. Ojha and Mathur [14] state that the hospital generates large volumes of data because of the number of citizens who attend it daily. By implementing Big Data technologies, the authors intend to improve the quality of patients’ lives, especially those in need: long waiting times have a negative impact on the poorest patients because they force citizens to lose working days and, consequently, a portion of their salary.

Using Big Data and data analysis tools, it will be possible to store the data the Maharaja Yeshwantrao Hospital generates. Health professionals will therefore be able to find new knowledge, hidden patterns and trends, which may result in improved treatments, fewer readmissions and lower expenses, among other benefits [14].

The “Data Storage” layer is composed of HDFS and Apache HBase. The “Data Processing” layer contains Hadoop MapReduce. Apache Hive, Apache Pig and Apache Avro are part of the “Data Access” layer. The “Management” layer consists of Apache Zookeeper and Apache Chukwa.

3.4 IBM PureData Solution for Healthcare Analytics

This architecture consists of the IBM PureData Solution for Healthcare Analytics, which is being used at Seattle Children’s Hospital to improve diagnostic and patient care capabilities [15].

IBM PureData Solution for Healthcare Analytics is a solution developed by IBM that integrates various technologies to meet the Big Data needs of a healthcare organization. This solution has the following components: IBM Cognos Business Intelligence - a business intelligence suite; IBM PureData System - a highly scalable system that relies on servers, databases, storage, and others; IBM Healthcare Provider Data Model - a set of data models and business solution models; IBM InfoSphere Information Server for DataWarehouse - a data integration platform that supports the capture, integration and transformation of large volumes of structured or unstructured data [16].

3.5 Cisco Connected Health Solutions and Services

The infrastructure developed by Cisco Systems, Inc., integrates multiple services into a single solution that can meet “all” the needs of a healthcare organization.

On the official Cisco Systems, Inc. website, we can see the various applications and the various services they offer. The services are divided into 6 categories: Personalized service to the user; Remote assistance and collaboration; Simplify clinical workflows; Increase efficiency in the workplace; Connect the research and development department with the production; Enable security and compliance. This solution was presented to the scientific community by Nambiar et al. [17].

4 Benchmarking

In this chapter we compare the three selected architectures and, more specifically, their components from the “Data Processing” layer. The architectures were selected based on their license type: only open source solutions are compared.

As can be seen from Table 1, the “Data Storage” layer of all three solutions consists of HDFS and Apache HBase.
Table 1. Comparative table of the selected architectures

| Layer | Big Data Architecture with Apache Spark | Big Data Architecture designed for the Maharaja Yeshwantrao Hospital | Big Data Architecture for the INTCare Project |
| --- | --- | --- | --- |
| Data Storage | HDFS; Apache HBase | HDFS; Apache HBase | HDFS; Apache HBase |
| Data Processing | Apache Spark; Spark Streaming | Hadoop MapReduce | Apache Kafka; Apache Storm; Apache Phoenix |
| Management | No information | Apache Zookeeper; Apache Chukwa | Apache Flume; Apache Oozie |
| Data Access | Apache Hive | Apache Hive; Apache Pig; Apache Avro | Apache Hive; Apache Sqoop |
| Analytical and Business Intelligence | Spark SQL; MLlib; GraphX; SparkR | No information | Knowledge Management subsystem of the INTCare project |
| Security | No information | No information | Apache Knox; Apache Ranger |
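As an illustration, the contents of Table 1 can be encoded as a small data structure and queried per layer. The sketch below is only a transcription of the table; the short keys `spark`, `maharaja` and `intcare` are labels chosen here, and an empty set stands for “No information”.

```python
# Table 1 transcribed as: layer -> architecture -> set of components.
# The architecture keys are labels chosen for this sketch.
# An empty set means the source reports "No information" for that cell.
TABLE_1 = {
    "Data Storage": {
        "spark": {"HDFS", "Apache HBase"},
        "maharaja": {"HDFS", "Apache HBase"},
        "intcare": {"HDFS", "Apache HBase"},
    },
    "Data Processing": {
        "spark": {"Apache Spark", "Spark Streaming"},
        "maharaja": {"Hadoop MapReduce"},
        "intcare": {"Apache Kafka", "Apache Storm", "Apache Phoenix"},
    },
    "Management": {
        "spark": set(),
        "maharaja": {"Apache Zookeeper", "Apache Chukwa"},
        "intcare": {"Apache Flume", "Apache Oozie"},
    },
    "Data Access": {
        "spark": {"Apache Hive"},
        "maharaja": {"Apache Hive", "Apache Pig", "Apache Avro"},
        "intcare": {"Apache Hive", "Apache Sqoop"},
    },
    "Analytical and Business Intelligence": {
        "spark": {"Spark SQL", "MLlib", "GraphX", "SparkR"},
        "maharaja": set(),
        "intcare": {"Knowledge Management subsystem of the INTCare project"},
    },
    "Security": {
        "spark": set(),
        "maharaja": set(),
        "intcare": {"Apache Knox", "Apache Ranger"},
    },
}


def shared_components(layer):
    """Components listed by every architecture for the given layer."""
    return set.intersection(*TABLE_1[layer].values())


def architectures_covering(layer):
    """Architectures that list at least one component for the layer."""
    return sorted(name for name, tools in TABLE_1[layer].items() if tools)


print(shared_components("Data Access"))
print(architectures_covering("Security"))
```

Queries of this kind make the layer-by-layer reading of the table explicit, for example which tools are common to all three architectures and which layers only one architecture covers.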

In the “Data Processing” layer, the Big Data Architecture with Apache Spark consists of Apache Spark and Spark Streaming, the Big Data Architecture designed for the Maharaja Yeshwantrao Hospital has Hadoop MapReduce, and the Big Data Architecture for the INTCare Project consists of Apache Kafka, Apache Storm and Apache Phoenix.

In the “Management” layer, the Big Data Architecture with Apache Spark does not present any tool, the Big Data Architecture designed for the Maharaja Yeshwantrao Hospital has Apache Zookeeper and Apache Chukwa, and the Big Data Architecture for the INTCare Project has Apache Oozie and Apache Flume.

In the “Data Access” layer, Apache Hive is present in all architectures. The Big Data Architecture designed for the Maharaja Yeshwantrao Hospital also includes Apache Pig and Apache Avro, and the Big Data Architecture for the INTCare Project also includes Apache Sqoop.

For the “Analytical and Business Intelligence” layer, the Big Data Architecture with Apache Spark presents the Spark SQL, MLlib, GraphX and SparkR tools, the Big Data Architecture for the INTCare Project has the knowledge management subsystem developed for the INTCare project, and the Big Data Architecture designed for the Maharaja Yeshwantrao Hospital does not present any tool.

Finally, in the “Security” layer only the Big Data Architecture for the INTCare Project presents tools, which are Apache Knox and Apache Ranger.

4.1 Comparison Between Hadoop MapReduce and Apache Spark

In this subchapter, the differences between the two data processing frameworks, Hadoop MapReduce and Apache Spark, will be presented. Afterwards, two experiments will be presented comparing the performance of the two frameworks in several scenarios.

Table 2 shows the main differences between MapReduce and Apache Spark.
Table 2. Main differences between Hadoop MapReduce and Apache Spark [18].

| Hadoop MapReduce | Apache Spark |
| --- | --- |
| Stores data on disk | Stores the data in memory. The data is first stored in memory and then processed |
| Computing based on disk memory, partial use of RAM (Random Access Memory) | Computing based on RAM memory, partial use of disk memory |
| Fault tolerance is achieved through replication | Fault tolerance is achieved through RDDs |
| Difficult to process and analyze data in real time | Can be used to modify data in real time |
| Inefficient for applications that need to constantly reuse the same dataset | Stores the dataset in RAM for efficient reuse |
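The last row of Table 2, dataset reuse, can be illustrated with a toy sketch. It is plain Python, not the Hadoop or Spark API: the hypothetical `CountingSource` class stands in for a dataset on disk and simply counts how often it is read, once per pass versus once in total.

```python
class CountingSource:
    """Stand-in for a dataset on disk that counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def iterate_without_cache(source, iterations):
    """Re-read the dataset on every pass, as a chain of disk-based jobs would."""
    total = 0
    for _ in range(iterations):
        total += sum(source.load())
    return total


def iterate_with_cache(source, iterations):
    """Read the dataset once and keep it in memory for the later passes."""
    cached = source.load()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_based = CountingSource([1, 2, 3])
in_memory = CountingSource([1, 2, 3])
iterate_without_cache(disk_based, iterations=5)
iterate_with_cache(in_memory, iterations=5)
print(disk_based.reads, in_memory.reads)
```

Both functions compute the same result; only the number of reads differs, which is why iterative workloads benefit most from in-memory reuse.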

Shi et al. [19] conducted an experiment to compare the performance between the two frameworks. The experiment consisted of the execution of several workloads (WordCount, Sort, K-means) that simulated the real-world use of these frameworks.

Apache Spark performed better in the execution of WordCount. For a 1 GB input, Apache Spark was 34 s faster; for 40 GB, it was 110 s faster; and for 200 GB, it was 398 s faster at executing the task.
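For readers unfamiliar with the WordCount workload, it follows the map, shuffle and reduce pattern. The sketch below shows that pattern on a single machine in plain Python; it is not the benchmark code used by Shi et al. [19], and the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as the shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key to obtain the final counts."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


print(word_count(["big data", "Big Data in healthcare"]))
```

In a distributed framework, the map and reduce phases run in parallel on many nodes, and the shuffle step moves the grouped pairs between them.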

When executing Sort with a 1 GB input, Apache Spark completed the task 3 s faster than Hadoop MapReduce. The same did not hold for inputs of 100 GB and 500 GB, where MapReduce was faster by 1.5 min and 20 min, respectively.

For both the first and subsequent iterations of K-means, Apache Spark presents shorter execution times. The difference is more pronounced in the subsequent iterations because of Apache Spark’s caching mechanism.

Gu and Li [20] conducted an experiment to compare the performance of Hadoop MapReduce and Apache Spark in performing iterative tasks. PageRank was the algorithm chosen for the experiment.

Runtimes varied depending on the size of the dataset. For small datasets (between 1 and 11 MB), Apache Spark completed the tasks 25–40 times faster. For datasets between 40 MB and 89 MB, Apache Spark was about 10–15 times faster than MapReduce. For datasets between 200 and 600 MB, Apache Spark was 3 to 5 times faster than MapReduce. When the dataset size exceeded 1 GB, MapReduce performed better than Apache Spark and, in some cases, Apache Spark failed during execution while MapReduce completed the task.
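PageRank is a typical iterative workload because every pass reads the whole link graph again. A minimal single-machine sketch of the power-iteration form is given below. The damping factor of 0.85, the iteration count and the three-page graph are assumptions made for illustration; they are not the settings or data used by Gu and Li [20].

```python
def pagerank(links, damping=0.85, iterations=20):
    """Power-iteration PageRank.

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; the ranks sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each pass with the teleportation share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))
```

Each iteration needs the full graph and the previous ranks, so a framework that keeps both in memory avoids re-reading them from disk on every pass, as long as they fit in memory.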

4.2 Comparison Between Apache Spark and Apache Storm

The experiment conducted by Lu et al. [21] consisted of the execution of 7 workloads (Identity, Sample, Projection, Grep, WordCount, DistinctCount, Statistics) that simulate various real-world usage scenarios of these frameworks. The experiment, which compares Apache Spark Streaming to Apache Storm, shows that Apache Spark Streaming achieves better throughput (average number of records processed per second). The same is not true for latency (average interval from the arrival of each record until the end of its processing), where Apache Storm presents better values, except in the execution of the WordCount workload. As for the ability to handle data, Apache Storm presents worse results than Apache Spark Streaming.
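The two metrics defined above can be computed from per-record timestamps, as in the following sketch. Representing each record as an `(arrival, done)` pair in seconds is an assumption made here; the timestamps are invented for illustration and are not measurements from Lu et al. [21].

```python
def throughput(records):
    """Average number of records processed per second.

    `records` is a list of (arrival, done) timestamps in seconds.
    The observed time span runs from the first arrival to the last completion.
    """
    start = min(arrival for arrival, _ in records)
    end = max(done for _, done in records)
    if end <= start:
        raise ValueError("records must span a positive time interval")
    return len(records) / (end - start)


def mean_latency(records):
    """Average interval from the arrival of a record to the end of its processing."""
    return sum(done - arrival for arrival, done in records) / len(records)


# Hypothetical timestamps (seconds) for three records.
records = [(0.0, 0.5), (1.0, 1.5), (2.0, 4.0)]
print(throughput(records), mean_latency(records))
```

The two metrics can move in opposite directions: grouping records into batches tends to raise throughput, but each record then waits longer before its result is available.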

5 Discussion

Based on this study, it is possible to conclude that there is not much scientific documentation about the implementation of Big Data technologies in hospitals/health clinics. It is also possible to conclude that the approval of the scientific community can help to overcome some of the challenges that are presented to the adoption of Big Data technologies in the health area.

Given the results obtained in the analyzed experiments, it is possible to conclude that:
  • The Big Data Architecture for the INTCare Project is best suited to handle streaming data. This solution combines Apache Kafka and Apache Storm to handle data from bedside monitors (vital signs, ventilation and others) [9];

  • The Maharaja Yeshwantrao Architecture is best suited to handle large volumes of data. Although Hadoop MapReduce performs poorly against Apache Spark in most of the tests presented in subchapter 4.1, it has been able to handle large volumes of data;

  • The Big Data Architecture with Apache Spark is a hybrid solution as it has proven capable of handling streaming and batch data. However, the performance of Apache Spark is very dependent on the configuration of the cluster.

Although it was not possible to make a direct performance comparison of all the solutions chosen for benchmarking, it is possible to conclude that the most appropriate architecture for a healthcare organization is the Big Data Architecture for the INTCare Project. This solution presents in detail all the components and how they interact with the system in which they are inserted and, more importantly, it is the only solution that presents components in the “Security” layer (as can be seen in Table 1). Since security is one of the challenges to implementing Big Data in healthcare, it is considered necessary to integrate tools that ensure data and system security in general.

6 Conclusion and Future Work

This project made it possible to understand the state of implementation of Big Data technologies in healthcare, their potential and the main challenges. The search for applications used in, or designed for, hospitals and health clinics proved to be the most challenging task of this project, due to the scarcity of literature on the implementation of Big Data solutions in healthcare. The review of experiments carried out on applications similar to those chosen for comparison made it possible to evaluate the performance of the applications in several scenarios and, therefore, to identify the strengths and weaknesses of the chosen solutions. The survey of Big Data technologies used in healthcare revealed the variety of solutions to be explored and showed that there is no ideal solution that can satisfy all needs. Some areas remain to be explored in the future, including the study of similar Big Data solutions that have not yet been presented to the scientific community and the execution of practical tests on the presented tools with real data.

Notes

Acknowledgements

This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013.

References

  1. Yaqoob, I., et al.: Big data: from beginning to future. Int. J. Inf. Manag. 36(6), 1231–1247 (2016)
  2. Sagiroglu, S., Sinanc, D.: Big data: a review. In: 2013 International Conference on Collaboration Technologies and Systems, pp. 42–47 (2013)
  3. Raghupathi, W., Raghupathi, V.: Big data analytics in healthcare: promise and potential. Heal. Inf. Sci. Syst. 2, 3 (2014)
  4. Feldman, B., Martin, E.M., Skotnes, T.: Big data in healthcare - hype and hope. Dr. Bonnie 360 degree (bus. Dev. Digit. Heal. 2013(1), 122–125 (2012)
  5. Zikopoulos, P., Eaton, C., DeRoos, D., Deutsch, T., Lapis, G.: Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill, New York (2012)
  6. Hurwitz, J., Nugent, A., Halper, D.F., Kaufman, M.: Big Data for Dummies. John Wiley & Sons Inc., Hoboken (2013)
  7. Taurion, C.: Big Data (2013)
  8. Manyika, J., et al.: Big data: the next frontier for innovation, competition, and productivity. McKinsey Glob. Inst., p. 156, June 2011
  9. Gonçalves, A., Portela, F., Santos, M.F.: Towards of a real-time big data architecture to intensive care. In: Procedia Computer Science - ICTH 2017 - International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare, pp. 585–590. Elsevier (2017). ISSN 1877-0509
  10. Portela, F., Santos, M.F., Machado, J., Abelha, A., Silva, Á., Rua, F.: Pervasive and intelligent decision support in intensive medicine – the complete picture (2014)
  11. Guarda, T., Augusto, M.F., Barrionuevo, O., Pinto, F.M.: Internet of Things in pervasive healthcare systems. In: Next-Generation Mobile and Pervasive Healthcare Solutions, pp. 22–31 (2018)
  12. Guarda, T., Orozco, W., Augusto, M.F., Morillo, G., Navarrete, S.A., Pinto, F.M.: Penetration testing on virtual environments. In: Proceedings of the 4th International Conference on Information and Network Security, ICINS 2016, pp. 9–12 (2016)
  13. Liu, W., Li, Q., Cai, Y., Li, Y., Li, X.: A prototype of healthcare big data processing system based on spark, no. Bmei, pp. 516–520 (2015)
  14. Ojha, M., Mathur, K.: Proposed application of big data analytics in healthcare at Maharaja Yeshwantrao hospital. In: 2016 3rd MEC International Conference on Big Data and Smart City, ICBDSC 2016, pp. 40–46 (2016)
  15. Krishnan, S.M.: Application of analytics to big data in healthcare. In: Proceedings of the 32nd Southern Biomedical Engineering Conference, SBEC 2016, pp. 156–157 (2016)
  16. IBM: IBM PureData Solution for Healthcare Analytics (2013)
  17. Nambiar, R., Sethi, A., Bhardwaj, R., Vargheeseh, R.: A look at challenges and opportunities of big data analytics in healthcare, pp. 17–22 (2013)
  18. Verma, A., Mansuri, A.H., Jain, N.: Big data management processing with Hadoop MapReduce and spark technology: a comparison. In: 2016 Symposium Colossal Data Analysis Networking, CDAN 2016 (2016)
  19. Shi, J., et al.: Clash of the titans: MapReduce vs. spark for large scale data analytics. Proc. VLDB Endow. 3, 2110–2121 (2015)
  20. Gu, L., Li, H.: Memory or time: performance evaluation for iterative operation on hadoop and spark. In: Proceedings of the 2013 IEEE International Conference on High Performance Computing and Communications, HPCC 2013 and 2013 IEEE International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727 (2014)
  21. Lu, R., Wu, G., Xie, B., Hu, J.: Stream bench: towards benchmarking modern distributed stream computing frameworks. In: Proceedings of the 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78 (2014)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Hugo Torres (1)
  • Filipe Portela (1)
  • Manuel Filipe Santos (1)
  1. Algoritmi Research Center, University of Minho, Guimarães, Portugal
