American Journal of Pharmacogenomics, Volume 4, Issue 2, pp 129–139

At What Scale Should Microarray Data Be Analyzed?

  • Shuguang Huang
  • Adeline A. Yeo
  • Lawrence Gelbert
  • Xi Lin
  • Laura Nisenbaum
  • Kerry G. Bemis
Original Research Article


Introduction: Hybridization intensities derived from microarray experiments, for example Affymetrix MAS5 signals, are usually transformed in some way before statistical models are fitted. The transformation is typically motivated by the need to satisfy model assumptions such as normality and homogeneity of variance. Two analysis strategies are commonly applied to microarray data, depending on the analysis need: correlation analysis, in which all gene intensities on the array are considered simultaneously, and gene-by-gene ANOVA, in which each gene is analyzed individually.

Aim: We investigate the distributional properties of the Affymetrix GeneChip® signal data under the two scenarios, focusing on the impact of analyzing the data at an inappropriate scale.

Methods: Box-Cox-type transformations are first investigated for the strategy of pooling genes, with the commonly used log-transformation applied for comparison. For the scenario in which genes are analyzed one at a time, model assumptions such as normality are explored. The impact of using a wrong scale is illustrated with the log-transformation and the quartic-root transformation.
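As a rough illustration of the pooled-gene strategy, the sketch below implements the standard Box-Cox power transform (Box and Cox, 1964) and checks how strongly the per-gene standard deviation tracks the per-gene mean at three scales: raw (lambda = 1), a rescaled quartic root (lambda = 0.25), and log (lambda = 0). The simulated data, the multiplicative log-normal noise model, the gene count, and all function names are assumptions made for this example; they are not taken from the study.

```python
import math
import random

def box_cox(x, lam):
    """Box-Cox power transform of a positive value; lam = 0 gives the natural log."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def mean_sd(values):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def mean_sd_correlation(genes, lam):
    """Correlation between per-gene mean and per-gene SD after transforming."""
    stats = [mean_sd([box_cox(v, lam) for v in reps]) for reps in genes]
    return pearson([m for m, _ in stats], [s for _, s in stats])

# Hypothetical data: 200 genes, 6 replicates each, multiplicative noise,
# so the SD on the raw scale grows with the expression level by construction.
rng = random.Random(0)
genes = []
for _ in range(200):
    level = math.exp(rng.uniform(math.log(50), math.log(5000)))
    genes.append([level * math.exp(rng.gauss(0, 0.2)) for _ in range(6)])

for lam in (1.0, 0.25, 0.0):
    corr = mean_sd_correlation(genes, lam)
    print(f"lambda={lam}: corr(mean, SD) = {corr:.2f}")
```

Under this assumed noise model the log scale is the variance-stabilizing one by design, so the example shows only what "removing the dependence" looks like; real GeneChip signals may call for a different lambda.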

Results: When all the genes on the array are considered together, the dependence of the variation on the expression level can be satisfactorily removed by a Box-Cox transformation. When genes are analyzed individually, the distributional properties of the intensities are shown to be gene dependent. Derivation and simulation show that some power is lost when a wrong scale is used but, owing to the robustness of the t-test, the loss is acceptable when the fold-change is not very large.
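The power statement can be explored with a small Monte Carlo sketch. It assumes, purely for illustration, that a gene's intensities are log-normal (so the log scale is the "right" one), draws five replicates per group, and applies a pooled two-sample t-test at the log, quartic-root, and raw scales. The sample size, noise level, fold-changes, and function names are hypothetical choices, not the design used in the study.

```python
import math
import random

N_PER_GROUP = 5      # replicates per group (assumed for this example)
T_CRIT_DF8 = 2.306   # two-sided 5% critical value of Student's t with 8 df

def pooled_t(a, b):
    """Two-sample t statistic with a pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    sp2 = (ssa + ssb) / (na + nb - 2)
    return (mb - ma) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))

# Scales at which the same simulated data are analyzed.
SCALES = {
    "log": math.log,
    "quartic_root": lambda v: v ** 0.25,
    "raw": lambda v: v,
}

def simulate_power(fold_change, sigma=0.5, n_sim=2000, seed=1):
    """Estimated rejection rate of the t-test at each scale for one fold-change."""
    rng = random.Random(seed)
    hits = {name: 0 for name in SCALES}
    for _ in range(n_sim):
        # Log-normal intensities: normal on the log scale by assumption.
        control = [math.exp(rng.gauss(0.0, sigma)) for _ in range(N_PER_GROUP)]
        treated = [fold_change * math.exp(rng.gauss(0.0, sigma))
                   for _ in range(N_PER_GROUP)]
        for name, transform in SCALES.items():
            t = pooled_t([transform(v) for v in control],
                         [transform(v) for v in treated])
            if abs(t) > T_CRIT_DF8:
                hits[name] += 1
    return {name: count / n_sim for name, count in hits.items()}

for fc in (1.5, 2.0, 4.0):
    print(f"fold-change {fc}: {simulate_power(fc)}")
```

Because all three scales are evaluated on the same simulated samples, differences in the estimated rejection rates reflect the choice of scale rather than sampling noise between runs. Setting `fold_change` to 1 gives a quick check of the type I error rate at each scale.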


Keywords: Microarray Data; Normality Assumption; Hybridization Intensity; Wrong Scale; Nest Random Effect



We would like to express our thanks to Brian Eastwood and Phillip Iversen for various helpful consultations and discussions. We would like to thank Ray Carroll for giving very creative suggestions on several parts. We would also like to thank Faming Zhang and Jude Onyia for valuable comments and suggestions. The authors have provided no information on sources of funding or on conflicts of interest directly relevant to the content of this study.



Copyright information

© Adis Data Information BV 2004

Authors and Affiliations

  • Shuguang Huang (1), corresponding author
  • Adeline A. Yeo (1)
  • Lawrence Gelbert (2)
  • Xi Lin (2)
  • Laura Nisenbaum (3)
  • Kerry G. Bemis (1)

  1. Genomic Informatics, Eli Lilly & Company, Lilly Corporate Center, Indianapolis, USA
  2. Functional Genomic, Eli Lilly & Company, Indianapolis, USA
  3. Neuroscience Research, Eli Lilly & Company, Indianapolis, USA
