Microarray Data Analysis

  • Jane Fridlyand
Chapter
Part of the Selected Works in Probability and Statistics book series (SWPS)

Abstract

I met Terry when I was a beginning graduate student in the Department of Statistics at UC Berkeley. The first year is the time when bright-eyed and idealistic graduate students start thinking about what they want to do for the next 30 years of their lives, or at least until they are handed their PhD diploma and a job offer from Wall Street. I was in awe of Terry, but gathered my courage to approach him with that crucial question: “Would you work with me?” Now that I write this, I find that it sounds rather like a marriage proposal. And indeed, it becomes one: a covenant of commitment between a student and the mentor, with all the ups and downs, for better or for worse, lasting a lifetime. I wanted to work with Terry because he inspired me, as a scientist and as a person, and his interests in biological and medical applications were close to my heart. I also had reason to hope for a positive response – I was told in confidence by several people that in his 20 years of working with students Terry had never turned anyone down. So, here I was asking “Would you work with me?”. His reply was immediate and crushing: “Why?”. I did not know what to say – with all the mental rehearsals I had done, I was not prepared for this comeback. I must have blushed, mumbled something and run away. I guess there is always a first one to be turned down, unfortunately it just happened to be me!

Keywords

Linear Discriminant Analysis · Microarray Data Analysis · Schizophrenia Research · Mental Rehearsal · Cancer Microarray Dataset
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

My despair did not last long. The next day I found in my mailbox a thick stack of papers on statistical genetics and schizophrenia research, with a note asking me to read them and meet Terry the next day at a specific time to discuss them. And this is how our work together began. Although I eventually moved away from schizophrenia research, our working relationship had been established, and Terry has remained a very important part of my life ever since (15 years and counting!).

Microarrays and high-dimensional data

Starting in the late nineties, applied statistics in biomedical research was transformed from the traditional paradigm of many samples and few variables to a situation that statisticians outside the machine learning community had rarely considered before – one of few samples and an enormous number of variables, also known as the “small n, large p” problem. Unlike the past, when existing theory and methods foreshadowed (or even dictated) the data types that would occur in practice, this time the technology came first, along with excitement and great promise. cDNA microarrays and high-density oligonucleotide chips allowed measurement of many thousands, and eventually millions, of gene products simultaneously. High-density SNP (Single Nucleotide Polymorphism) arrays enabled high-throughput genetic profiling of living organisms. Taken together, and occurring in parallel with the ongoing human and other genome projects, these breakthroughs in technology generated exciting possibilities in biomedical research: human disease prognosis and classification, new drug targets, mammalian models, basic research, and, finally, the ability to conduct discovery experiments on a scale previously unimaginable. As the new technologies were quickly adopted by researchers and clinicians, questions encompassing a broad spectrum of statistical issues arose, including:
  • “How reliable and reproducible are the measurements?” (quality assessment and control)

  • “Can I really find a needle in a haystack?” (experimental design, estimation, testing)

  • “What can one do with so many variables at a time?” (modeling, prediction techniques)

  • “How can I minimize the false leads?” (testing)

Re-formulating and addressing such questions falls within the purview of statisticians, who are able to draw on their knowledge of experimental design, prediction techniques, modeling, estimation and testing, and to adapt and expand existing concepts to work with these new and unprecedentedly large datasets.

When I think of Terry’s approach to statistics and mentorship, a few quotes from Albert Einstein come to mind: “In theory, practice and theory are the same. In reality, they are not,” and “Everything should be made as simple as possible, but not simpler.” These points could not have been more appropriate or timely than when excited statisticians, physicists, and computer scientists started working on high-dimensional biomedical problems.

It is difficult to overemphasize Terry’s contributions to the field of high-dimensional data analysis in biomedical research. He stepped in at the very start and, with vigor, rigor and great enthusiasm, began to transform the analytical methods used in the field. Generically, we can consider two levels of microarray data analysis: low-level analysis, which is concerned with preprocessing the raw data into meaningful and analyzable measures; and high-level analysis, which is the statistical analysis of the resulting data matrix. Most methodological researchers tend to specialize in one or the other of these. Terry has made major, fundamental, and very widely used contributions across the analysis spectrum.

A PubMed search for “TP Speed” on May 8, 2011 reveals that Terry has co-authored in excess of 150 peer-reviewed publications, a large number of which focus on the analysis of high-dimensional biological data. Here, I provide a historical commentary on only a few of the most groundbreaking of those.

Your results will only be as good as the information you put in (more commonly known as “garbage in, garbage out”)

Perhaps the most widely cited of Terry’s microarray contributions have been from his work on low-level preprocessing of the measurements. Early on, it was recognized that there are many sources of systematic variation in both cDNA and oligonucleotide microarrays. Although understanding the underlying physical reasons for the observed variation is useful, it is not always feasible. Terry recognized that simple empirical normalization approaches could be competitive with more complex biophysical models. He also proposed a number of what are now among the most commonly used quality control visualization tools for verifying that appropriate preprocessing has been done (e.g., MA-plots, chip pseudo-image plots). Finally, for a formal evaluation of preprocessing methods, relevant biological calibration experiments had to be designed and conducted.
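The MA-plot construction, and the intensity-dependent normalization built on top of it, are simple enough to sketch directly. In the sketch below the function names are my own, and a running-median smoother stands in for the lowess fit actually used for cDNA arrays; the point is only the idea of removing the intensity-dependent bias in the log-ratios.

```python
import numpy as np

def ma_values(red, green):
    """Convert two-channel spot intensities into MA-plot coordinates:
    M = log2 ratio (differential expression), A = mean log2 intensity."""
    red, green = np.asarray(red, float), np.asarray(green, float)
    M = np.log2(red) - np.log2(green)
    A = 0.5 * (np.log2(red) + np.log2(green))
    return M, A

def normalize_ma(M, A, window=101):
    """Intensity-dependent normalization: subtract a smooth trend of M on A.
    A running median (an illustrative stand-in for lowess) estimates the
    A-dependent bias in M, which is then removed."""
    order = np.argsort(A)
    Ms = M[order]
    trend = np.empty_like(M)
    half = window // 2
    for i in range(len(Ms)):
        lo, hi = max(0, i - half), min(len(Ms), i + half + 1)
        trend[order[i]] = np.median(Ms[lo:hi])
    return M - trend
```

After normalization, the log-ratios of the (mostly non-differentially-expressed) genes are centered around zero at every intensity level, which is exactly what the MA-plot lets one check visually.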

Yang et al. [10] and Irizarry et al. [4] represent some of the papers describing revolutionary microarray normalization (preprocessing) techniques for cDNA arrays (lowess) and short oligonucleotide chips (RMA), respectively. RMA, or some subsequent variant of it, is the most frequently used and cited preprocessing technique for short oligonucleotide chips. Rabbee and Speed [7] describe a multi-chip, multi-SNP approach to genotype calling for Affymetrix SNP chips, providing the first such alternative to the standard (at the time) genotype calling procedures, which processed all the features associated with one chip and one SNP at a time.

Microarray data analysis ≠ clustering

In the very early days of microarray data analysis, probably due to the high dimensionality of the data, virtually all analyses included a cluster analysis – regardless of the scientific question under study (for which clustering may or may not be appropriate).

For the problem of identifying genes that are differentially expressed in two conditions, a more natural, statistically based approach would be to use the mean difference or standardized mean difference of the expression levels, separately for each gene. However, these statistics are problematic. A large mean difference may be due to large variability or an outlier. But taking account of the variability by using the standardized difference is also problematic because when the number of replicates is small the estimate of variance is less reliable, and in particular may be artificially small. In this case, a small average difference can be highly statistically significant, yet biologically meaningless.

Lönnstedt and Speed [6] address this issue using an empirical Bayes approach that avoids these problems. They use a Bayes log posterior odds for differential versus equal expression to select differentially expressed genes. Tai and Speed [9] extend the model to allow for analysis of time-course microarray data.
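The variance-shrinkage idea behind these statistics can be illustrated with a minimal moderated t-statistic. The prior variance (the median gene variance) and the prior degrees of freedom d0 below are ad hoc choices for illustration, not the actual empirical Bayes estimators of Lönnstedt and Speed:

```python
import numpy as np

def moderated_t(x, y, d0=4.0):
    """Per-gene two-sample statistics with variance shrinkage.
    x, y: (genes, replicates) arrays for the two conditions.
    The ordinary t uses each gene's own pooled variance; the moderated t
    shrinks that variance toward a common prior s0^2 estimated from all
    genes, taming genes whose variance is artificially small."""
    nx, ny = x.shape[1], y.shape[1]
    diff = x.mean(1) - y.mean(1)
    d = nx + ny - 2
    s2 = ((nx - 1) * x.var(1, ddof=1) + (ny - 1) * y.var(1, ddof=1)) / d
    t_ord = diff / np.sqrt(s2 * (1 / nx + 1 / ny))
    s0_sq = np.median(s2)                       # empirical prior from all genes
    s2_mod = (d0 * s0_sq + d * s2) / (d0 + d)   # shrink toward the prior
    t_mod = diff / np.sqrt(s2_mod * (1 / nx + 1 / ny))
    return t_ord, t_mod
```

A gene with a tiny difference but a near-zero variance estimate gets an enormous ordinary t yet a modest moderated one, which is precisely the pathology described above.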

Do complex datasets require complex methods?

A new laboratory technology without an established methodology for analyzing the resulting data may be attractive to aspiring quantitative analysts eager to apply sophisticated new analytical methods “brewing” in their labs but lacking an exciting application. This situation violates a firm rule that Terry had for his students: it is the real-life problems that motivate methodological research, not the reverse. Thus, when human cancer microarray datasets were first publicly released for re-analysis by other groups, Terry questioned whether the complex, state-of-the-art prediction methods being published with the aim of addressing biomedical research problems (e.g., prediction of a patient’s tumor subtype or treatment outcome) truly outperformed simpler approaches that place tight restrictions on the parameter space, such as linear discriminant analysis with a diagonal covariance matrix. Another question that came up was how to measure the relative performance of multiple candidate predictors in the absence of true, independent datasets, particularly when many parameters are estimated.

These two issues are discussed in-depth in Dudoit et al. [2]. Somewhat surprisingly, the main conclusion of this work (later supported theoretically by Levina and Bickel [5]) was that for small n, large p problems the simplest methods, with the most restrictive assumptions, perform as well as or better than the latest machine learning approaches. Moreover, unbiased assessment of performance can be challenging and must be done through a careful and valid cross-validation – an important caveat ignored by several groups in early publications. In view of these results, it is not surprising that in high-dimensional genomics, the rate of independently validated predictions remains low. Nevertheless, much of the progress that has been made is due to Terry’s work on formulating and disseminating the appropriate message.
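A diagonal-covariance linear discriminant rule of the kind compared in Dudoit et al. [2] is only a few lines of code – which is itself part of the message. The sketch below is illustrative (equal class priors assumed), not the paper’s exact implementation:

```python
import numpy as np

class DiagonalLDA:
    """Linear discriminant analysis with a diagonal pooled covariance:
    each feature keeps its own variance, all covariances are ignored.
    With n << p, refusing to estimate the p*p covariance matrix is
    exactly the restriction that makes the rule work."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.vstack([X[y == c].mean(0) for c in self.classes_])
        # pooled within-class variance, one value per feature
        resid = np.vstack([X[y == c] - X[y == c].mean(0)
                           for c in self.classes_])
        self.var_ = resid.var(0, ddof=len(self.classes_))
        return self

    def predict(self, X):
        # assign each sample to the class with the smallest
        # variance-weighted squared distance (equal priors assumed)
        d = np.stack([((X - m) ** 2 / self.var_).sum(1)
                      for m in self.means_])
        return self.classes_[np.argmin(d, 0)]
```

Any honest comparison of such a rule against more flexible competitors must of course estimate error via cross-validation in which all tuning and gene selection are redone inside each fold – the caveat stressed in the text.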

If you torture data enough, it will confess

Testing many thousands of genes for association with the phenotype of interest invariably presents a multiplicity issue: from the statistical point of view, testing must be performed at an exceedingly stringent alpha level to control the overall number of false positive findings. At the time of this writing, this idea seems obvious; however, even 5 years ago it was not – a change of mindset was required, as the great majority of papers reported the significance of individual tests without regard to the number of comparisons performed. The initial discussion of permutation-based adjusted (rather than nominal) p-values took place in Callow et al. [1]; an extensive review of approaches to multiple testing was presented in Ge et al. [3].
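A single-step maxT permutation adjustment of the Westfall–Young type, one of the approaches reviewed in Ge et al. [3], can be sketched as follows; the function and its defaults are my own illustrative choices:

```python
import numpy as np

def maxT_adjusted_pvalues(X, labels, n_perm=200, seed=0):
    """Single-step maxT permutation adjustment (family-wise error control).
    X: (genes, samples) expression matrix; labels: 0/1 group indicator.
    For each permuted labelling, record the maximum |t| over all genes;
    a gene's adjusted p-value is the fraction of permutations whose
    maximum beats that gene's observed |t|."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    def tstats(lab):
        a, b = X[:, lab == 0], X[:, lab == 1]
        se = np.sqrt(a.var(1, ddof=1) / a.shape[1]
                     + b.var(1, ddof=1) / b.shape[1])
        return (a.mean(1) - b.mean(1)) / se

    t_obs = np.abs(tstats(labels))
    max_null = np.array([np.abs(tstats(rng.permutation(labels))).max()
                         for _ in range(n_perm)])
    # add-one smoothing so no adjusted p-value is exactly zero
    return (1 + (max_null[:, None] >= t_obs[None, :]).sum(0)) / (1 + n_perm)
```

Because every gene is compared against the permutation distribution of the maximum statistic, only genes that stand out above the most extreme chance result survive – the stringency the paragraph above calls for.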

And finally...

On a very personal note, I would like to conclude with the story of a paper that Terry and I never wrote, but the work on which embodies, in my mind, many of the wonderful qualities that Terry possesses: inspiration, mentorship, and a willingness to always give one more chance. In 1998, I hit a creativity wall, a not uncommon occurrence in the life of a PhD student. Terry had many PhD students, but each of us was important to him as an individual. Terry brought me to Australia with him, and there I was able to stumble upon a topic that excited and reinvigorated me – the search for interactions in high-dimensional SNP studies. We were able to use binary tree partitioning techniques to discover epistatic genes without prominent independent (marginal) effects, while at the same time illuminating the underlying interaction structure. The results of applying our approach are described in Symons et al. [8]; however, the methodological paper was never written. Nevertheless, this is our joint work for which I am most grateful to Terry, and it ultimately motivated me to start and finish my PhD dissertation. A lesson in this to all the mentors out there: do not give up on your students, and eventually you will be thanked in print!

References

  1. M. J. Callow, S. Dudoit, E. J. Gong, T. P. Speed, and E. M. Rubin. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res., 10(12):2022–2029, 2000.
  2. S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97(457):77–87, 2002.
  3. Y. Ge, S. Dudoit, and T. P. Speed. Resampling-based multiple testing for microarray data analysis. TEST, 12(1):1–44, 2003.
  4. R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–264, 2003.
  5. E. Levina and P. J. Bickel. Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.
  6. I. Lönnstedt and T. P. Speed. Replicated microarray data. Stat. Sinica, 12:31–46, 2001.
  7. N. Rabbee and T. P. Speed. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics, 22(1):7–12, 2006.
  8. R. C. A. Symons, M. J. Daly, J. Fridlyand, T. P. Speed, W. D. Cook, S. Gerondakis, A. W. Harris, and S. J. Foote. Multiple genetic loci modify susceptibility to plasmacytoma-related morbidity in Eμ–v–abl transgenic mice. Proc. Natl. Acad. Sci. USA, 99(17):11299–11304, 2002.
  9. Y. C. Tai and T. P. Speed. A multivariate empirical Bayes statistic for replicated microarray time course data. Ann. Stat., 34(5):2387–2412, 2006.
  10. Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed. Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30(4):e15, 2002.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Biostatistics, Genentech, South San Francisco, USA