A Review of Microarray Datasets: Where to Find Them and Specific Characteristics

Microarray Bioinformatics

Abstract

The advent of DNA microarray datasets has stimulated a new line of research in both bioinformatics and machine learning. This type of data is used to collect information from tissue and cell samples regarding gene expression differences that could be useful for diagnosing disease or for distinguishing specific types of tumor. Classifying microarray data is a difficult challenge for machine learning researchers because of the high number of features and the small sample sizes. This chapter reviews the microarray databases most frequently used in the literature. We also alert the interested reader to problematic data characteristics in this domain, such as data imbalance, data complexity, and the so-called dataset shift.
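As a rough illustration of two of the characteristics named above, the following minimal sketch computes the features-per-sample ratio and an imbalance ratio (majority class size divided by minority class size) for a labeled expression matrix. The function name `describe_dataset`, the toy data, and the choice of ratio definitions are assumptions made for this example, not something taken from the chapter.

```python
from collections import Counter


def describe_dataset(X, y):
    """Summarize dimensionality and class balance of a labeled dataset.

    X is a list of samples (each a list of expression values) and y the
    matching list of class labels. Both names are hypothetical.
    """
    n_samples = len(X)
    n_features = len(X[0])
    counts = Counter(y)
    majority = max(counts.values())
    minority = min(counts.values())
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        # Values far above 1 indicate the "many genes, few samples" setting.
        "features_per_sample": n_features / n_samples,
        # 1.0 means perfectly balanced classes; larger means more imbalance.
        "imbalance_ratio": majority / minority,
    }


# Hypothetical toy data: 6 samples, 12 genes, 4 tumor vs 2 normal samples.
X = [[float(i + j) for j in range(12)] for i in range(6)]
y = ["tumor", "tumor", "tumor", "tumor", "normal", "normal"]

summary = describe_dataset(X, y)
print(summary)
```

Real microarray datasets typically have thousands of features and only tens or hundreds of samples, so the features-per-sample ratio is usually far larger than in this toy case.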

Notes

  1. A type I error is the incorrect rejection of a true null hypothesis that usually has the effect of concluding that a given relationship exists when in fact it does not. That is, a type I error is a false positive.
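The definition in this note can be made concrete with a small simulation. The sketch below is an illustrative assumption, not a procedure from the chapter: it repeatedly compares two groups drawn from the same normal distribution, so every null hypothesis is true, and counts how often a two-sided z-test with known unit variance rejects at the conventional 1.96 threshold. The function name `false_positive_rate` and all parameter values are hypothetical.

```python
import math
import random


def false_positive_rate(n_tests=2000, n_per_group=5, z_crit=1.96, seed=0):
    """Estimate the type I error rate when every null hypothesis is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Both groups come from the same N(0, 1) distribution.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        # The variance of the difference of two means is 1/n + 1/n.
        z = diff / math.sqrt(2.0 / n_per_group)
        if abs(z) > z_crit:
            rejections += 1  # a false positive: a true null was rejected
    return rejections / n_tests


rate = false_positive_rate()
print(f"Fraction of true nulls rejected: {rate:.3f}")
```

The estimated fraction should sit near the nominal 5% level. With thousands of genes tested at once, even this rate would yield many false positives in absolute terms.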

Author information

Corresponding author

Correspondence to Verónica Bolón-Canedo.

Copyright information

© 2019 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Alonso-Betanzos, A., Bolón-Canedo, V., Morán-Fernández, L., Sánchez-Maroño, N. (2019). A Review of Microarray Datasets: Where to Find Them and Specific Characteristics. In: Bolón-Canedo, V., Alonso-Betanzos, A. (eds) Microarray Bioinformatics. Methods in Molecular Biology, vol 1986. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9442-7_4

  • DOI: https://doi.org/10.1007/978-1-4939-9442-7_4

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-4939-9441-0

  • Online ISBN: 978-1-4939-9442-7

  • eBook Packages: Springer Protocols
