A scalable and real-time system for disease prediction using big data processing

Multimedia Tools and Applications

Abstract

The growing number of patients with chronic diseases and the centralization of medical resources have a significant economic impact in the form of hospital visits, hospital readmissions, and other healthcare costs. This paper proposes a scalable, real-time system for disease prediction from medical data streams. The system integrates Twitter, Apache Kafka, Apache Spark and Apache Cassandra. Twitter users tweet health-related attributes; Kafka receives the desired tweet attributes and ingests them into Spark Streaming, where a machine learning algorithm predicts the health status and sends a response message back through Kafka. The heart disease dataset from the UCI repository was used for the experiments. To enhance prediction accuracy, the Relief algorithm is used for feature selection. We compared six relevant machine learning algorithms implemented in Spark MLlib, namely Random Forest (RF), Naive Bayes, Support Vector Machine, Multilayer Perceptron, Decision Tree and Logistic Regression, with the full feature set as well as with the selected features. The highest classification accuracy, 92.05%, was obtained using RF with the selected features. The scalability of RF in Spark MLlib and in the WEKA framework was measured for both the training and application stages. The results show significantly better performance for Spark in terms of scalability and computing time.
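To illustrate the feature selection step named in the abstract, the following is a minimal sketch of Relief weighting for a two-class, numeric dataset. It is not the authors' implementation: the function name `relief_weights`, the toy data, and the choice to use every row once as the reference instance (rather than a random sample of rows) are assumptions made for illustration. Features whose values differ more between nearest neighbours of different classes than between nearest neighbours of the same class receive higher weights.

```python
from typing import List, Sequence


def relief_weights(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Relief feature weights for a two-class, numeric dataset.

    Simplification (assumption): each row serves once as the reference
    instance instead of drawing a random sample of rows.
    """
    n, d = len(X), len(X[0])
    # Per-feature value range, used to scale differences into [0, 1].
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]

    def diff(j: int, a: Sequence[float], b: Sequence[float]) -> float:
        return abs(a[j] - b[j]) / span[j]

    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(j, a, b) for j in range(d))

    w = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        h = min(hits, key=lambda k: dist(X[i], X[k]))    # nearest same-class row
        m = min(misses, key=lambda k: dist(X[i], X[k]))  # nearest other-class row
        for j in range(d):
            # Reward separation from the miss, penalise separation from the hit.
            w[j] += (diff(j, X[i], X[m]) - diff(j, X[i], X[h])) / n
    return w


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2], [1.0, 0.6]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
print(weights)
```

In a selection step, features would then be ranked by weight and the lowest-ranked ones dropped before training a classifier; the threshold or number of features to keep is a tuning choice that this sketch does not fix.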

Data Availability

The dataset analysed during the current study is available in https://archive.ics.uci.edu/ml/datasets/heart+disease

Author information

Corresponding author

Correspondence to Abderrahmane Ed-daoudy.

Ethics declarations

Ethics approval and consent to participate

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ed-daoudy, A., Maalmi, K. & El Ouaazizi, A. A scalable and real-time system for disease prediction using big data processing. Multimed Tools Appl 82, 30405–30434 (2023). https://doi.org/10.1007/s11042-023-14562-3

  • DOI: https://doi.org/10.1007/s11042-023-14562-3