Abstract
The growing number of patients with chronic diseases and the centralization of medical resources have a significant economic impact in the form of hospital visits, hospital readmissions, and other healthcare costs. This paper proposes a scalable, real-time system for disease prediction from medical data streams. The system integrates Twitter, Apache Kafka, Apache Spark and Apache Cassandra: Twitter users tweet health-related attributes, and Kafka receives the desired tweet attributes and ingests them into Spark Streaming. There, a machine learning algorithm is applied to predict health status, and a response message is sent back through Kafka. The heart disease dataset, obtained from the UCI repository, was used for the experiments. To enhance prediction accuracy, the Relief algorithm is used for feature selection. We compared six machine learning algorithms implemented in Spark MLlib, namely Random Forest (RF), Naive Bayes, Support Vector Machine, Multilayer Perceptron, Decision Tree and Logistic Regression, using the full feature set as well as the selected features. The highest classification accuracy, 92.05%, was obtained using RF with the selected features. The scalability of RF was measured with both Spark MLlib and the WEKA framework, for the training and application stages. The results show significantly better performance for Spark in terms of scalability and computing time.
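The Relief feature-selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary labels and numeric features, uses absolute (not squared) feature differences, and visits every instance once instead of drawing a random sample. The function names `relief_weights` and `select_top_k` and the toy data are hypothetical and are not taken from the UCI heart disease dataset.

```python
from typing import List, Sequence


def relief_weights(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Score each feature with one basic Relief pass over all instances."""
    n, d = len(X), len(X[0])

    # Value range of each feature, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j: int, a: Sequence[float], b: Sequence[float]) -> float:
        return abs(a[j] - b[j]) / spans[j]

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbour to compare with
        hit = min(hits, key=lambda k: distance(X[i], X[k]))
        miss = min(misses, key=lambda k: distance(X[i], X[k]))
        for j in range(d):
            # Reward features that differ on the nearest miss,
            # penalise features that differ on the nearest hit.
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n
    return weights


def select_top_k(weights: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-weighted features, in index order."""
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return sorted(ranked[:k])


# Made-up toy data: feature 0 separates the two classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2], [1.0, 0.6]]
y = [0, 0, 0, 1, 1, 1]

weights = relief_weights(X, y)
selected = select_top_k(weights, 1)
print(weights, selected)
```

On this toy data the class-separating feature receives the larger weight, so it is the one retained. In a pipeline like the one described, the retained feature indices would then be used to restrict the input passed to the classifier.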
Data Availability
The dataset analysed during the current study is available at https://archive.ics.uci.edu/ml/datasets/heart+disease
Ethics declarations
Ethics approval and consent to participate
This article does not contain any studies with human participants or animals performed by any of the authors.
Conflict of Interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Ed-daoudy, A., Maalmi, K. & El Ouaazizi, A. A scalable and real-time system for disease prediction using big data processing. Multimed Tools Appl 82, 30405–30434 (2023). https://doi.org/10.1007/s11042-023-14562-3
DOI: https://doi.org/10.1007/s11042-023-14562-3