Health Services Data: Big Data Analytics for Deriving Predictive Healthcare Insights

  • Ankit Agrawal
  • Alok Choudhary
Reference work entry
Part of the Health Services Research book series (HEALTHSR)


This chapter describes the application of big data analytics in healthcare, particularly to electronic healthcare records, to build predictive models of healthcare outcomes and to discover interesting insights. A typical workflow for such predictive analytics involves data collection, data transformation, predictive modeling, evaluation, and deployment, with each step tailored to the end goals of the project. To illustrate each of these steps, we use the example of recent advances in predictive analytics on lung cancer data from the Surveillance, Epidemiology, and End Results (SEER) program. These include the construction of accurate predictive models for lung cancer survival, the development of a lung cancer outcome calculator that deploys those models, and association rule mining on the same data for bottom-up discovery of interesting insights. The lung cancer outcome calculator illustrated here is available at



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, USA
