Abstract
Extracting information from quantitative data requires a range of analysis methods, which can be drawn from diverse fields such as computer science, information theory and statistics. This chapter discusses methods for analysing datasets generated in asthma studies for personalized medicine. Personalized medicine is the future of medicine, aiming to provide tailor-made medical decisions, practices and products to individual patients. Medical decisions and treatments are tailored to each patient based on the patient's various profiles, such as genomic, proteomic, lipidomic and metabolomic content. High-throughput instruments are used to generate large-scale datasets. To succeed in personalized medicine, analysis methods, both those dedicated to specific data types and those shared across data types, need to be well developed. In this chapter, we first discuss the need to use data from the molecular level to the pathway level. We then introduce analysis methods for the typical analysis steps: batch effect detection and removal, statistical analysis, feature selection and classification, and unsupervised pattern recognition.
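As a rough illustration of the first two steps listed above (batch effect detection and removal, followed by a statistical test), the following minimal sketch uses only the Python standard library. The data are synthetic, the helper names `welch_t`, `center_by_batch` and `split_by_batch` are hypothetical, and per-batch mean centering is a deliberately crude stand-in chosen for brevity; it is not presented as the method used in the chapter.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples (a crude batch correction)."""
    batch_means = {
        b: statistics.mean([v for v, vb in zip(values, batches) if vb == b])
        for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]


def split_by_batch(values, batches):
    """Return the values of batch 'A' and batch 'B' as two lists."""
    in_a = [v for v, b in zip(values, batches) if b == "A"]
    in_b = [v for v, b in zip(values, batches) if b == "B"]
    return in_a, in_b


# Synthetic intensities for a single feature (assumption: two batches,
# with batch "B" carrying an artificial +2.0 offset).
rng = random.Random(0)
batches = ["A"] * 20 + ["B"] * 20
values = [rng.gauss(10.0, 1.0) + (2.0 if b == "B" else 0.0) for b in batches]

# Detection: a large |t| between batches suggests a batch effect.
t_before = welch_t(*split_by_batch(values, batches))

# Removal: after centering, the batch means coincide, so t is close to zero.
corrected = center_by_batch(values, batches)
t_after = welch_t(*split_by_batch(corrected, batches))

print(f"Welch t between batches before correction: {t_before:.2f}")
print(f"Welch t between batches after correction:  {t_after:.2f}")
```

Mean centering is only safe when the batches are balanced with respect to the biological groups of interest; if batch and group are confounded, it removes the signal along with the artefact.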
Abbreviations
- ANOVA: Analysis of variance
- CS: Corticosteroids
- CV: Cross-validation
- DNA: Deoxyribonucleic acid
- DRAMI: Drift, Retention time, Accurate Mass, Intensity
- DWD: Distance weighted discrimination
- GR: Glucocorticoid receptor
- GWAS: Genome-wide association study
- LASSO: Least absolute shrinkage and selection operator
- LC-IMS/MSE: Ion mobility supported liquid chromatography and mass spectrometry instrument
- LOOCV: Leave-one-out cross-validation
- LPS: Lipopolysaccharide
- MAPK: Mitogen-activated protein kinase
- MKP-1: MAPK phosphatase-1
- MS: Mass spectrometry
- PCA: Principal component analysis
- PLGS: ProteinLynx Global Server
- ROC: Receiver operating characteristic
- SNP: Single-nucleotide polymorphism
- SVA: Surrogate variable analysis
- SVD: Singular value decomposition
- SVM: Support vector machine
- TAK1: TGFβ-activated kinase-1
- TDA: Topological data analysis
- UBIOPRED: Unbiased BIOmarkers in PREDiction of respiratory disease outcomes
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Yang, X., Guo, Y. (2018). Data Science for Asthma Study. In: Wang, X., Chen, Z. (eds) Genomic Approach to Asthma. Translational Bioinformatics, vol 12. Springer, Singapore. https://doi.org/10.1007/978-981-10-8764-6_13
DOI: https://doi.org/10.1007/978-981-10-8764-6_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8763-9
Online ISBN: 978-981-10-8764-6
eBook Packages: Biomedical and Life Sciences (R0)