Missing value estimation for microarray data through cluster analysis

Pati, Soumen Kumar; Das, Asit Kumar

doi:10.1007/s10115-017-1025-5

Missing value estimation for microarray data through cluster analysis

Regular Paper
Published: 13 February 2017

Volume 52, pages 709–750, (2017)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Soumen Kumar Pati¹ &
Asit Kumar Das²

564 Accesses
23 Citations
Explore all metrics

Abstract

Microarray datasets with missing values need to impute accurately before analyzing diseases. The proposed method first discretizes the samples and temporarily assigns a value in missing position of a gene by the mean value of all samples in the same class. The frequencies of each gene value in both types of samples for all genes are calculated separately and if the maximum frequency occurs for same expression value in both types, then the whole gene is entered into a subset; otherwise, each portion of the gene of respective sample type (i.e., normal or disease) is entered into two separate subsets. Thus, for each gene expression value, maximum three different clusters of genes are formed. Each gene subset is further partitioned into a stable number of clusters using proposed splitting and merging clustering algorithm that overcomes the weakness of Euclidian distance metric used in high-dimensional space. Finally, similarity between a gene with missing values and centroids of the clusters are measured and the missing values are estimated by corresponding expression values of a centroid having maximum similarity. The method is compared with various statistical, cluster-based and regression-based methods with respect to statistical and biological metrics using microarray datasets to measure its effectiveness.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Effectiveness of Different Partition Based Clustering Algorithms for Estimation of Missing Values in Microarray Gene Expression Data

Missing Value Imputation Using Correlation Coefficient

Clustering column-mean quantile median: a new methodology for imputing missing data

Article Open access 14 December 2022

Nourhan Yehia, Manal Abdel Wahed & Mai Said Mabrouk

References

Alizadeh AA (2000) Distinct types of diffuse large B-cell Lymphoma identified by gene expression profiling. Nature 403:503–511
Article Google Scholar
Bezdek JC, Pal NR (1998) Some new indexes of cluster validity. IEEE Trans Syst Man Cybern 28(3):301–315
Article Google Scholar
Bra’s LP, Menezes JC (2007) Improving cluster-based missing value estimation of DNA microarray data. Biomol Eng Elsevier 24:273–282
Article Google Scholar
Brevern AG, Hazout S, Malpertuy A (2004) Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC Bioinform. doi:10.1186/1471-2105-5-114
Butte AJ, Ye J (2001) Determining significant fold differences in gene expression analysis. Pac Symp Biocomput 6:6–17
Google Scholar
Cai Z, Heydari M, Lin G (2006) Iterated local least squares microarray missing value imputation. Bioinform Comput Biol 4:935–957
Article Google Scholar
Causton HC, Quackenbush J, Brazma A (2004) Microarray gene expression data analysis: a Beginner’s guide, vol 21. Blackwell, Oxford, pp 973–974
Cheng KO, Law NF, Siu WC (2012) Iterative bicluster-based least square framework for estimation of missing values in micro array gene expression data. Pattern Recognit 45(4):1281–1289
Article Google Scholar
Das AK, Sil J (2010) Cluster validation method for stable cluster formation. Can J Artif Intell Mach Learn Pattern Recognit 1(3):26–41
Google Scholar
Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227
Article Google Scholar
de Brevern AG, Hazout S, Malpertuy A (2004) Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC Bioinform. doi:10.1186/1471-2105-5-114
DeRisi J (1996) Use of a cDNA microarray to analyze gene expression patterns in human cancer. Nat Genet 14(4):457–460
Article Google Scholar
Fu L, Medico E (2007) FLAME: a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. doi:10.1186/1471-2105-8-3
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2–3):107–145
Article MATH Google Scholar
Hand DJ, Heard NA (2005) Finding groups in gene expression data. J Biomed Biotechnol 2:215–225
Article Google Scholar
He C, Li HH, Zhao C et al (2015) Triple imputation for microarray missing value estimation. IEEE international conference on bioinformatics and biomedicine (BIBM), pp 208–213
Huynen M, Snel B, Lathe W et al (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 10:1204–1210
Article Google Scholar
Ji R, Liu D, Zhou Z (2011) A bicluster-based missing value imputation method for gene expression data. J Comput Inf Syst 7(13):4810–4818
Google Scholar
Kaur A, Singh SS, Kaur SS (2010) Fuzzy clustering based missing value estimation of gene expression data. Computer engineering technology RIMT, pp 122–126
Kent Ridge Bio-medical Dataset. http://datam.i2r.a-star.edu.sg/datasets/krbd
Kim KY, Kim BJ, Yi GS (2004) Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinform. doi:10.1186/1471-2105-5-160
Kim H, Golub GH, Park H (2005) Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 21(2):187–198
Article Google Scholar
Koopmans R, Schaeffer M (2015) Relational diversity and neighborhood cohesion unpacking variety balance and in-group size. Soc Sci Res Elsevier 53:162–176
Article Google Scholar
Luengo J, García S, Herrera F (2011) On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl Inf Syst 32:77–108
Article Google Scholar
Luo J, Yang T, Wang Y (2005) Missing value estimation for microarray data based on fuzzy C-means clustering. In: Proceedings of the 8th international conference on high-performance computing in Asia-Pacific region (HPCASIA’05), pp 611–616
Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654
Article Google Scholar
Meng F, Cai C, Yan H (2014) A bicluster-based Bayesian principal component analysis method for microarray missing value estimation. IEEE J Biomed Health Inform 18(3):863–871
Article Google Scholar
Oba S, Sato MA, Takemasa I et al (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19(16):2088–2096
Article Google Scholar
Oh S, Kang DD, Brock GN et al (2011) Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics 27(1):78–86
Article Google Scholar
Pan L, Li J (2010) K-nearest neighbor based missing data estimation algorithm in wireless sensor networks. Wirel Sens Netw Sci Res 2:115–122
Google Scholar
Paul A, Sil J (2011) Estimating missing value in microarray gene expression data using fuzzy similarity measure. IEEE international conference on fuzzy systems- Taiwan, pp 27–30
Paul A, Sil J (2011) Missing value estimation in microarray data using Co regulation and similarity of genes. World congress on information and communication technologies, pp 705–710
P’erez MJ, Romero-Campero FJ (2006) A new computational modeling tool for systems biology. Trans Comput Syst Biol 6:176–197
MathSciNet Google Scholar
Pourhashem MM, Kelarestaghi M, Pedram MM (2010) Missing value estimation in microarray data using fuzzy clustering and semantic similarity. Global J Comput Sci Technol 10(12):18–22
Google Scholar
Qu Y, Xu S (2004) Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics 20:1905–1913
Article Google Scholar
Rahman MG, Islam MZ, Bossomaier T, Gao J (2012) Cairad: a co-appearance based analysis for incorrect records and attribute-values detection. IEEE international joint conference on neural networks (IJCNN), pp 1–10. doi:10.1109/IJCNN.2012.6252669
Rahman MG, Islam MZ (2016) Missing value imputation using a fuzzy clustering-based EM approach. Knowl Inf Syst 46:389–422
Article Google Scholar
Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7(2):147–177
Article Google Scholar
Shi F, Zhang D, Chen J et al (2013) Missing value estimation for microarray data by Bayesian principal component analysis and iterative local least squares. Math Probl Eng. doi:10.1155/2013/162938
Suresh RM, Dinakaran K, Valarmathie P (2009) Model based modified k-means clustering for microarray data. ICIME 53:271–273
Google Scholar
Troyanskaya O, Cantor M, Sherlock G et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525
Article Google Scholar
Tusher VG (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci 98:5116–5121
Article MATH Google Scholar
Velarde CC, Escudero R, Zaliz RR (2008) Boolean networks: a study on microarray data discretization. ESTYLF08, Cuencas Mineras (Mieres-Langreo), pp 17–19
Wang H, Wang S (2010) Mining incomplete survey data through classification. Knowl Inf Syst 24(2):221–233
Article Google Scholar
Zahid N, Limouri M, Essaid A (1999) A new cluster-validity for fuzzy clustering. Pattern Recogn 32:1089–1097
Article Google Scholar
Zhang S, Zhang J, Zhu X, Qin Y, Zhang C (2008) Missing value imputation based on data clustering. Trans Comput Sci 1:128–138
Google Scholar
Zhang X, Song X, Wang H et al (2008) Sequential local least squares imputation estimating missing value of microarray data. Comput Biol Med 38:1112–1120
Article Google Scholar
Zhang S (2011) Shell-neighbor method and its application in missing data imputation. Appl Intell 35(1):123–133
Article Google Scholar
Zhang S, Jin Z, Zhu X (2011) Missing data imputation by utilizing information within incomplete instances. Syst Softw 84(3):452–459
Article Google Scholar
Zhao O, Fränti P (2014) WB-index: a sum-of-squares based index for cluster validity. Data Knowl Eng Elsevier 92:77–89
Article Google Scholar

Download references

Acknowledgements

The authors would like to thank anonymous reviewers for their valuable comments.

Author information

Authors and Affiliations

Department of Information Technology, St. Thomas’ College of Engineering and Technology, Kolkata, India
Soumen Kumar Pati
Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah, India
Asit Kumar Das

Authors

Soumen Kumar Pati
View author publications
You can also search for this author in PubMed Google Scholar
Asit Kumar Das
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Soumen Kumar Pati.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest in this paper.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Pati, S.K., Das, A.K. Missing value estimation for microarray data through cluster analysis. Knowl Inf Syst 52, 709–750 (2017). https://doi.org/10.1007/s10115-017-1025-5

Download citation

Received: 01 April 2016
Revised: 05 January 2017
Accepted: 28 January 2017
Published: 13 February 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s10115-017-1025-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Missing value estimation for microarray data through cluster analysis

Abstract

Access this article

Similar content being viewed by others

Effectiveness of Different Partition Based Clustering Algorithms for Estimation of Missing Values in Microarray Gene Expression Data

Missing Value Imputation Using Correlation Coefficient

Clustering column-mean quantile median: a new methodology for imputing missing data

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Missing value estimation for microarray data through cluster analysis

Abstract

Access this article

Similar content being viewed by others

Effectiveness of Different Partition Based Clustering Algorithms for Estimation of Missing Values in Microarray Gene Expression Data

Missing Value Imputation Using Correlation Coefficient

Clustering column-mean quantile median: a new methodology for imputing missing data

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation