Abstract
We discuss high-dimensional data analysis in the framework of pattern recognition and machine learning, including single-component analysis and cluster analysis. Several boosting methods for tackling imbalanced sample sizes are investigated.
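The boosting methods the abstract refers to build on the generic AdaBoost reweighting scheme: each round fits a weak learner to weighted data, then increases the weights of misclassified examples so later learners focus on them. The sketch below illustrates that scheme only; the decision-stump weak learner, the one-dimensional toy data, and all function names are illustrative assumptions, not the chapter's own implementation.

```python
import math

def stump_predict(x, threshold, polarity):
    """Classify a scalar feature with a one-split decision stump (+1 / -1)."""
    return polarity if x >= threshold else -polarity

def best_stump(X, y, w):
    """Exhaustively pick the stump minimising the weighted 0-1 error."""
    best = (None, None, float("inf"))
    for threshold in sorted(set(X)):
        for polarity in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, threshold, polarity) != yi)
            if err < best[2]:
                best = (threshold, polarity, err)
    return best

def adaboost(X, y, rounds=10):
    """Return a list of (alpha, threshold, polarity) weak hypotheses."""
    n = len(X)
    w = [1.0 / n] * n                      # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        threshold, polarity, err = best_stump(X, y, w)
        err = max(err, 1e-10)              # guard against division by zero
        if err >= 0.5:
            break                          # weak learner no better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Re-weight: misclassified examples gain weight, correct ones lose it
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]           # renormalise to a distribution
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

Variants for imbalanced data (e.g. cost-sensitive or SMOTE-augmented boosting) typically modify only the initial weights or the reweighting step of this same loop.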
Copyright information
© 2019 The Author(s), under exclusive licence to Springer Japan KK
Cite this chapter
Komori, O., Eguchi, S. (2019). Machine Learning Methods for Imbalanced Data. In: Statistical Methods for Imbalanced Data in Ecological and Biological Studies. SpringerBriefs in Statistics. Springer, Tokyo. https://doi.org/10.1007/978-4-431-55570-4_5
DOI: https://doi.org/10.1007/978-4-431-55570-4_5
Publisher Name: Springer, Tokyo
Print ISBN: 978-4-431-55569-8
Online ISBN: 978-4-431-55570-4
eBook Packages: Mathematics and Statistics (R0)