Abstract
The capabilities of neurotechnologies are increasing rapidly, generating massive volumes of complex multi-modal data. This neurological big data can be leveraged to provide new insights into complex neurological disorders using data mining and knowledge discovery techniques. For example, electrophysiological signal data consisting of electroencephalogram (EEG) and electrocardiogram (ECG) recordings can be analyzed for brain connectivity research, physiological associations with neural activity, and the diagnosis and care of patients with epilepsy. However, existing approaches to storing and modeling electrophysiological signal data have several limitations, which make it difficult to use signal data directly in data analysis, signal visualization tools, and knowledge discovery applications. Therefore, the use of neurological big data for secondary analysis and the potential development of personalized treatment strategies require scalable data processing platforms. In this chapter, we describe the development of a high-performance data flow system called Signal Data Cloud (SDC) to pre-process large-scale electrophysiological signal data using the open source Apache Pig. The features of this neurological big data processing system are: (a) efficient partitioning of signal data into fixed-size segments for easier storage in a high-performance distributed file system, (b) integration and semantic annotation of clinical metadata using an epilepsy domain ontology, and (c) transformation of raw signal data into an appropriate format for use in signal analysis platforms. We also discuss the various challenges faced by the biomedical informatics community in the context of Big Data, especially the increasing need to ensure data quality and scientific reproducibility.
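Feature (a), partitioning signal data into fixed-size segments, can be illustrated with a minimal Python sketch. This is not the SDC implementation, which the abstract says is built on Apache Pig. The function name `partition_signal`, its parameters, and the dictionary layout of each segment are hypothetical, chosen only to show the idea for a single channel held in memory.

```python
from typing import Dict, List, Sequence


def partition_signal(
    samples: Sequence[float],
    sampling_rate_hz: float,
    segment_seconds: float,
) -> List[Dict[str, object]]:
    """Split one channel of signal samples into consecutive fixed-size segments.

    Hypothetical helper for illustration only; the final segment may be
    shorter than the others when the recording length is not an exact
    multiple of the segment size.
    """
    samples_per_segment = int(sampling_rate_hz * segment_seconds)
    if samples_per_segment <= 0:
        raise ValueError("segment must contain at least one sample")

    segments: List[Dict[str, object]] = []
    for index, start in enumerate(range(0, len(samples), samples_per_segment)):
        segments.append(
            {
                "index": index,
                # Offset of the segment from the start of the recording.
                "start_seconds": start / sampling_rate_hz,
                "samples": list(samples[start:start + samples_per_segment]),
            }
        )
    return segments


if __name__ == "__main__":
    # Ten samples at 2 Hz, cut into 2-second segments (4 samples each).
    for segment in partition_signal(list(range(10)), 2.0, 2.0):
        print(segment)
```

Keeping the start offset alongside each segment means a segment can be stored and retrieved independently, which is one plausible reason fixed-size segments suit a distributed file system.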
Keywords
- Electrophysiological signal data
- Epileptic seizure networks
- Neurology
- Clinical research
- Apache Pig
- Distributed computing
Acknowledgements
This work is supported in part by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) Big Data to Knowledge (BD2K) grant (1U01EB020955) and the National Institute of Neurological Disorders and Stroke (NINDS) Center for SUDEP Research grant (1U01NS090407-01).
Copyright information
© 2016 Springer International Publishing AG
About this chapter
Cite this chapter
Sahoo, S.S., Wei, A., Tatsuoka, C., Ghosh, K., Lhatoo, S.D. (2016). Processing Neurology Clinical Data for Knowledge Discovery: Scalable Data Flows Using Distributed Computing. In: Holzinger, A. (eds) Machine Learning for Health Informatics. Lecture Notes in Computer Science, vol 9605. Springer, Cham. https://doi.org/10.1007/978-3-319-50478-0_15
DOI: https://doi.org/10.1007/978-3-319-50478-0_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50477-3
Online ISBN: 978-3-319-50478-0
eBook Packages: Computer Science, Computer Science (R0)