Missing Value Imputation Framework for Microarray Significant Gene Selection and Class Prediction

  • Muhammad Shoaib B. Sehgal
  • Iqbal Gondal
  • Laurence Dooley
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3916)


Microarray data is used in a large number of applications, ranging from diagnosis to drug discovery. Such data, however, often contains multiple missing gene expression values, which are generally ignored, degrading the reliability of inferred results. This paper presents a robust imputation framework that estimates missing values more accurately, leading in turn to better gene selection and class prediction. To test this premise, several missing value estimation techniques are analysed: Collateral Missing Value Estimation (CMVE), Bayesian Principal Component Analysis (BPCA), Least Square Impute (LSImpute), k-Nearest Neighbour (KNN) and ZeroImpute. A combination of univariate and multiple gene selection methods, namely Between Group to Within Group Sum of Squares and Weighted Partial Least Squares, is then applied before class prediction with the Ridge Partial Least Square method. Overall, CMVE consistently provided more accurate missing value estimates than the other algorithms examined, by exploiting local and global as well as positive and negative correlations between genes. All empirical results were corroborated by the two-sided Wilcoxon rank sum statistical significance test.
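To make the pipeline in the abstract concrete, the following is a minimal sketch of its two simplest ingredients: the ZeroImpute and KNN baselines, and a between-group to within-group sum of squares (BSS/WSS) score per gene. It is not the authors' implementation. The function names, the choice of root mean squared difference as the KNN distance, the normalised error measure and the synthetic data are all assumptions made for illustration; CMVE, BPCA and LSImpute are not reproduced here.

```python
import numpy as np


def zero_impute(X):
    """ZeroImpute baseline: replace every missing entry (NaN) with zero."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), 0.0, X)


def knn_impute(X, k=3):
    """Illustrative KNN imputation over genes (rows).

    Each missing entry is filled with the mean of the same sample (column)
    in the k nearest genes that have that sample observed. The distance is
    the root mean squared difference over samples observed in both genes
    (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_genes = X.shape[0]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        dists = np.full(n_genes, np.inf)
        for j in range(n_genes):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for col in np.where(np.isnan(X[i]))[0]:
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, col])][:k]
            # Fall back to zero when no gene can donate a value.
            filled[i, col] = np.mean(X[donors, col]) if donors else 0.0
    return filled


def bss_wss(X, y):
    """Between-group to within-group sum of squares ratio for each gene.

    X has genes in rows and samples in columns; y holds one class label
    per sample. Genes with zero within-group variation are not handled.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall = X.mean(axis=1, keepdims=True)
    bss = np.zeros(X.shape[0])
    wss = np.zeros(X.shape[0])
    for label in np.unique(y):
        group = X[:, y == label]
        centre = group.mean(axis=1, keepdims=True)
        bss += group.shape[1] * (centre - overall).ravel() ** 2
        wss += ((group - centre) ** 2).sum(axis=1)
    return bss / wss


if __name__ == "__main__":
    # Synthetic stand-in for an expression matrix (20 genes x 8 samples).
    rng = np.random.default_rng(0)
    truth = rng.normal(size=(20, 8))
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    truth[:3, labels == 1] += 2.0          # make three genes class-dependent
    corrupted = truth.copy()
    corrupted[rng.random(truth.shape) < 0.05] = np.nan   # about 5% missing

    for name, impute in [("ZeroImpute", zero_impute), ("KNN", knn_impute)]:
        estimate = impute(corrupted)
        # Normalised RMS error against the known ground truth (assumed metric).
        error = np.linalg.norm(estimate - truth) / np.linalg.norm(truth)
        top = np.argsort(bss_wss(estimate, labels))[::-1][:3]
        print(name, "error:", error, "top genes by BSS/WSS:", top)
```

Running the script compares how the two baselines affect both the estimation error and the genes ranked highest by BSS/WSS, which is the kind of downstream comparison the abstract describes. A significance check in the spirit of the paper could be added with a two-sided rank sum test over repeated runs.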


Keywords: Partial Least Squares; Gene Selection; Generalized Regression Neural Network; Class Prediction; Gene Selection Method
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Muhammad Shoaib B. Sehgal (1)
  • Iqbal Gondal (1)
  • Laurence Dooley (1)

  1. Faculty of IT, Monash University, Churchill, Australia
