A Review of Microarray Datasets: Where to Find Them and Specific Characteristics

Alonso-Betanzos, Amparo; Bolón-Canedo, Verónica; Morán-Fernández, Laura; Sánchez-Maroño, Noelia

doi:10.1007/978-1-4939-9442-7_4

Part of the book series: Methods in Molecular Biology (MIMB, volume 1986)

Abstract

The advent of DNA microarray datasets has stimulated a new line of research in both bioinformatics and machine learning. This type of data is used to collect information from tissue and cell samples about gene expression differences that could be useful for diagnosing disease or for distinguishing specific types of tumor. Microarray data classification is a difficult challenge for machine learning researchers because of the high number of features and the small sample sizes. This chapter reviews the microarray databases most frequently used in the literature. We also alert the interested reader to problematic data characteristics in this domain, such as class imbalance, data complexity, and the so-called dataset shift.
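Two of the characteristics named in the abstract, the features-to-samples ratio and class imbalance, can be made concrete with a small calculation. The following is only an illustrative sketch: the function name `dataset_profile`, the toy matrix, and the class labels are hypothetical and do not come from the chapter, and the imbalance ratio is computed here as majority class size over minority class size, which is one common convention among several.

```python
from collections import Counter


def dataset_profile(X, y):
    """Summarize a labeled expression matrix.

    X is a non-empty list of samples, each a list of gene expression values.
    y is the list of class labels, one per sample.
    """
    n_samples = len(X)
    n_features = len(X[0])
    counts = Counter(y)
    majority = max(counts.values())
    minority = min(counts.values())
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        # Microarray data typically has far more genes than samples.
        "features_per_sample": n_features / n_samples,
        # Assumed convention: majority class size / minority class size.
        "imbalance_ratio": majority / minority,
    }


# Hypothetical toy data: 6 samples with 1000 genes each, 4 vs. 2 per class.
X = [[0.0] * 1000 for _ in range(6)]
y = ["tumor", "tumor", "tumor", "tumor", "normal", "normal"]
profile = dataset_profile(X, y)
print(profile)
```

A features-per-sample value far above 1 and an imbalance ratio above 1 are the kinds of signals that would prompt the closer inspection the abstract recommends.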

Notes

1. A type I error is the incorrect rejection of a true null hypothesis, which usually leads to the conclusion that a given relationship exists when in fact it does not. That is, a type I error is a false positive.
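In standard hypothesis-testing notation, the probability of a type I error is the significance level of the test. The symbols $H_0$ (the null hypothesis) and $\alpha$ are conventional notation assumed here and are not taken from the chapter:

```latex
\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})
```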

Author information

Authors and Affiliations

CITIC, Universidade da Coruña, A Coruña, Spain
Amparo Alonso-Betanzos, Verónica Bolón-Canedo, Laura Morán-Fernández & Noelia Sánchez-Maroño

Corresponding author

Correspondence to Verónica Bolón-Canedo.

Editor information

Editors and Affiliations

CITIC, Universidade da Coruña, A Coruña, Spain
Verónica Bolón-Canedo
CITIC, Universidade da Coruña, A Coruña, Spain
Amparo Alonso-Betanzos

Cite this protocol

Alonso-Betanzos, A., Bolón-Canedo, V., Morán-Fernández, L., Sánchez-Maroño, N. (2019). A Review of Microarray Datasets: Where to Find Them and Specific Characteristics. In: Bolón-Canedo, V., Alonso-Betanzos, A. (eds) Microarray Bioinformatics. Methods in Molecular Biology, vol 1986. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9442-7_4

DOI: https://doi.org/10.1007/978-1-4939-9442-7_4
Published: 22 May 2019
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9441-0
Online ISBN: 978-1-4939-9442-7
eBook Packages: Springer Protocols
