Improvements in the Large p, Small n Classification Issue

  • Original Research

Abstract

Classifying gene expression data is known to hold keys to solving fundamental problems in cancer studies. The task is difficult, however, because gene expression data analysis suffers from the large p, small n problem. In this paper, we propose improvements for the large p, small n classification problem in the study of human cancer. First, a new sample-size-enhancing method based on a generative adversarial network is proposed to improve classification algorithms. Second, we suggest a classification approach that combines an over-sampling technique with features extracted by a deep convolutional neural network. Numerical test results on fifty very-high-dimensional, low-sample-size gene expression datasets from the Kent Ridge Biomedical and ArrayExpress repositories show that the proposed models are more accurate than state-of-the-art classification models. In addition, we explore the performance of support vector machines, k-nearest neighbors, and random forests, all of which improve when our approaches are applied.
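The second approach described in the abstract can be illustrated with a short sketch. The following Python example is not the authors' implementation; it is a minimal pipeline in the same spirit: a small 1-D convolutional network is trained on the expression profiles and reused as a feature extractor, the extracted features are over-sampled with SMOTE, and a support vector machine is trained on the balanced feature set. The data shapes, network architecture, and hyper-parameters are assumptions made for the example.

```python
# Minimal sketch, assuming a binary "large p, small n" task with synthetic data.
# Not the authors' code: architecture, shapes and hyper-parameters are illustrative.
import numpy as np
import tensorflow as tf
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Toy stand-in for a gene expression dataset: 100 samples, 2000 genes, imbalanced labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000)).astype("float32")
y = (rng.random(100) < 0.25).astype("int32")

# Small 1-D CNN; the dense layer below serves as the feature extractor.
inputs = tf.keras.Input(shape=(2000, 1))
h = tf.keras.layers.Conv1D(16, kernel_size=9, activation="relu")(inputs)
h = tf.keras.layers.MaxPooling1D(pool_size=4)(h)
h = tf.keras.layers.Flatten()(h)
features = tf.keras.layers.Dense(64, activation="relu")(h)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(features)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X[..., None], y, epochs=5, batch_size=8, verbose=0)

# Reuse the trained dense-layer activations as a low-dimensional representation.
extractor = tf.keras.Model(inputs, features)
Z = extractor.predict(X[..., None], verbose=0)

# Over-sample the minority class in feature space, then train the SVM classifier.
Z_balanced, y_balanced = SMOTE(random_state=0).fit_resample(Z, y)
svm = SVC(kernel="rbf", C=1.0).fit(Z_balanced, y_balanced)
print("training accuracy:", accuracy_score(y, svm.predict(Z)))
```

In practice the feature extractor would be fit on training folds only and the classifier evaluated with cross-validation on held-out samples, rather than on the data used for training as in this toy example.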



Author information


Corresponding author

Correspondence to Phuoc-Hai Huynh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Future Data and Security Engineering 2019” guest edited by Tran Khanh Dang.


About this article


Cite this article

Huynh, P.H., Nguyen, V.H. & Do, T.N. Improvements in the Large p, Small n Classification Issue. SN Comput. Sci. 1, 207 (2020). https://doi.org/10.1007/s42979-020-00210-2

