Introduction to Genomic and Proteomic Data Analysis

Abstract

Genomics can be broadly defined as the systematic study of genes, their functions, and their interactions. Analogously, proteomics is the study of proteins, protein complexes, their localization, their interactions, and their posttranslational modifications. Some years ago, genomics and proteomics studies focused on one gene or one protein at a time. With the advent of high-throughput technologies in biology and biotechnology, this has changed dramatically. We are currently witnessing a paradigm shift from traditionally hypothesis-driven research to data-driven research. The activity and interaction of thousands of genes and proteins can now be measured simultaneously. Technologies for genome- and proteome-wide investigations have led to new insights into the mechanisms of living systems. There is a broad consensus that these technologies will revolutionize the study of complex human diseases such as Alzheimer's disease, HIV, and particularly cancer. With its ability to describe the clinical and histopathological phenotypes of cancer at the molecular level, gene expression profiling based on microarrays holds the promise of patient-tailored therapy. Recent advances in high-throughput mass spectrometry allow the profiling of proteomic patterns in biofluids such as blood and urine, and complement the genomic portrait of diseases.

Copyright information

© 2007 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Berrar, D., Granzow, M., Dubitzky, W. (2007). Introduction to Genomic and Proteomic Data Analysis. In: Dubitzky, W., Granzow, M., Berrar, D. (eds) Fundamentals of Data Mining in Genomics and Proteomics. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-47509-7_1
