Processing Neurology Clinical Data for Knowledge Discovery: Scalable Data Flows Using Distributed Computing

  • Satya S. Sahoo (email author)
  • Annan Wei
  • Curtis Tatsuoka
  • Kaushik Ghosh
  • Samden D. Lhatoo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9605)


The rapidly increasing capabilities of neurotechnologies are generating massive volumes of complex multi-modal data. This neurological big data can be leveraged to provide new insights into complex neurological disorders using data mining and knowledge discovery techniques. For example, electrophysiological signal data consisting of electroencephalogram (EEG) and electrocardiogram (ECG) recordings can be analyzed for brain connectivity research, physiological associations to neural activity, and the diagnosis and care of patients with epilepsy. However, existing approaches to storing and modeling electrophysiological signal data have several limitations, which make it difficult to use signal data directly in data analysis, signal visualization tools, and knowledge discovery applications. Therefore, the use of neurological big data for secondary analysis and the potential development of personalized treatment strategies require scalable data processing platforms. In this chapter, we describe the development of a high-performance data flow system called Signal Data Cloud (SDC) to pre-process large-scale electrophysiological signal data using open source Apache Pig. The features of this neurological big data processing system are: (a) efficient partitioning of signal data into fixed-size segments for easier storage in a high-performance distributed file system, (b) integration and semantic annotation of clinical metadata using an epilepsy domain ontology, and (c) transformation of raw signal data into an appropriate format for use in signal analysis platforms. We also discuss the challenges facing the biomedical informatics community in the context of Big Data, especially the increasing need to ensure data quality and scientific reproducibility.
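
Feature (a) above, partitioning signal data into fixed-size segments, can be illustrated with a minimal Python sketch. This is not the SDC implementation, which the chapter describes as built on Apache Pig. The function name `partition_signal`, the channel labels, and the dictionary layout of each segment are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def partition_signal(
    channels: Dict[str, List[float]],
    sampling_rate_hz: int,
    segment_seconds: int,
) -> List[Dict[str, object]]:
    """Split every channel into consecutive segments of a fixed duration.

    Hypothetical helper: the segment layout here is an assumption,
    not the format used by the Signal Data Cloud.
    """
    if sampling_rate_hz <= 0 or segment_seconds <= 0:
        raise ValueError("sampling rate and segment length must be positive")

    samples_per_segment = sampling_rate_hz * segment_seconds
    # Use the longest channel so that no trailing samples are dropped.
    n_samples = max((len(values) for values in channels.values()), default=0)

    segments: List[Dict[str, object]] = []
    for index, start in enumerate(range(0, n_samples, samples_per_segment)):
        end = start + samples_per_segment
        segments.append(
            {
                "segment_index": index,
                "start_second": start / sampling_rate_hz,
                # The final segment may be shorter than the others.
                "channels": {
                    name: values[start:end] for name, values in channels.items()
                },
            }
        )
    return segments


if __name__ == "__main__":
    # Toy recording: 10 samples per channel at 2 Hz, cut into 2-second segments.
    recording = {
        "EEG_C3": [float(i) for i in range(10)],
        "ECG": [float(i) for i in range(10)],
    }
    for segment in partition_signal(recording, sampling_rate_hz=2, segment_seconds=2):
        print(segment["segment_index"], segment["start_second"], segment["channels"])
```

Each segment carries its index and start time, so segments can be stored and processed independently on a distributed file system and later reassembled in order.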


Electrophysiological signal data, Epileptic seizure networks, Neurology, Clinical research, Apache Pig, Distributed computing



This work is supported in part by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) Big Data to Knowledge (BD2K) grant (1U01EB020955) and the National Institute of Neurological Disorders and Stroke (NINDS) Center for SUDEP Research grant (1U01NS090407-01).



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Satya S. Sahoo (1, 2), email author
  • Annan Wei (2)
  • Curtis Tatsuoka (3)
  • Kaushik Ghosh (4)
  • Samden D. Lhatoo (3)
  1. Division of Medical Informatics, Department of Epidemiology and Biostatistics, School of Medicine, Case Western Reserve University, Cleveland, USA
  2. Department of Electrical Engineering and Computer Science, School of Engineering, Case Western Reserve University, Cleveland, USA
  3. Department of Neurology, Epilepsy Center, University Hospitals Case Medical Center, Cleveland, USA
  4. Department of Mathematical Sciences, University of Nevada, Las Vegas, USA
