Heuristic Non Parametric Collateral Missing Value Imputation: A Step Towards Robust Post-genomic Knowledge Discovery

  • Muhammad Shoaib B. Sehgal
  • Iqbal Gondal
  • Laurence S. Dooley
  • Ross Coppel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5265)


Microarrays can measure the expression patterns of thousands of genes in a genome, producing profiles that greatly speed up the analysis of biological processes for diagnosis, prognosis and tailored drug discovery. Microarray data, however, commonly contain missing values, which can lead to erroneous downstream analysis. Various algorithms have been proposed to impute these missing values, including Collateral Missing Value Estimation (CMVE), Bayesian Principal Component Analysis (BPCA), Least Square Impute (LSImpute), Local Least Square Impute (LLSImpute) and K-Nearest Neighbour (KNN). Most of these algorithms exploit either the global or the local correlation structure of the data, but not both, which normally leads to larger estimation errors. This paper presents an enhanced Heuristic Non Parametric Collateral Missing Value Imputation (HCMVI) algorithm, which uses CMVE as its core estimator and a heuristic non-parametric strategy to compute the optimal number of estimator genes, so that both local and global correlations are exploited optimally.
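To make the imputation setting concrete, the following is a minimal sketch of the simplest baseline named above, KNN imputation, together with a normalized root mean square error score. It is not HCMVI or CMVE. The function names `knn_impute` and `nrmse`, the inverse-distance weighting, the restriction to fully observed neighbour genes, and the choice to normalize by the RMS of the true values are all assumptions made for illustration.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in a genes x samples matrix from the k most similar genes.

    Illustrative assumption: only fully observed genes act as neighbours.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = np.where(~np.isnan(X).any(axis=1))[0]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        missing = np.isnan(X[i])
        if missing.all() or complete.size == 0:
            continue  # nothing to compare against; leave the row as-is
        # Distance to each complete gene over the samples observed in gene i
        diff = X[complete][:, ~missing] - X[i, ~missing]
        dists = np.sqrt(np.mean(diff ** 2, axis=1))
        order = np.argsort(dists)[:k]
        # Inverse-distance weights (small constant avoids division by zero)
        weights = 1.0 / (dists[order] + 1e-12)
        neighbours = X[complete[order]][:, missing]
        filled[i, missing] = weights @ neighbours / weights.sum()
    return filled

def nrmse(truth, estimate, mask):
    """RMS error over the masked entries, divided by the RMS of the true values."""
    err = truth[mask] - estimate[mask]
    return np.sqrt(np.mean(err ** 2)) / np.sqrt(np.mean(truth[mask] ** 2))

# Toy data: three similar genes and one unrelated gene, with one value hidden
truth = np.array([[1.0, 2.0, 3.0, 4.0],
                  [1.1, 2.1, 3.1, 4.1],
                  [0.9, 1.9, 2.9, 3.9],
                  [10.0, 20.0, 30.0, 40.0]])
observed = truth.copy()
observed[0, 2] = np.nan
mask = np.isnan(observed)

filled = knn_impute(observed, k=2)
error = nrmse(truth, filled, mask)
```

Here `k` plays the role of the number of estimator genes. In this sketch it is a fixed value supplied by the caller, whereas the abstract describes HCMVI as computing that number heuristically.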


Keywords (added by machine, not by the authors): Gene Selection; Imputation Method; Cancer Data; Normalize Root Mean Square Error; Breast Cancer Data



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Muhammad Shoaib B. Sehgal (1)
  • Iqbal Gondal (2)
  • Laurence S. Dooley (3)
  • Ross Coppel (4, 5)

  1. ARC Centre of Excellence in Bioinformatics at IMB, University of Queensland, St Lucia, Australia
  2. Faculty of Information Technology, Monash University, Churchill, Australia
  3. Department of Communications and Systems, The Open University, Milton Keynes, United Kingdom
  4. Department of Microbiology, Australia
  5. Victorian Bioinformatics Consortium, Clayton, Australia
