Abstract
We discuss high-dimensional data analysis in the framework of pattern recognition and machine learning, including single-component analysis and cluster analysis. Several boosting methods for tackling imbalanced sample sizes are investigated.
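The boosting methods the abstract refers to build on the generic AdaBoost reweighting scheme: each round fits a weak learner to weighted data, then increases the weights of misclassified examples so later learners focus on them. The sketch below illustrates that scheme only; the decision-stump weak learner, the one-dimensional toy data, and all function names are illustrative assumptions, not the chapter's own implementation.

```python
import math

def stump_predict(x, threshold, polarity):
    """Classify a scalar feature with a one-split decision stump (+1 / -1)."""
    return polarity if x >= threshold else -polarity

def best_stump(X, y, w):
    """Exhaustively pick the stump minimising the weighted 0-1 error."""
    best = (None, None, float("inf"))
    for threshold in sorted(set(X)):
        for polarity in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, threshold, polarity) != yi)
            if err < best[2]:
                best = (threshold, polarity, err)
    return best

def adaboost(X, y, rounds=10):
    """Return a list of (alpha, threshold, polarity) weak hypotheses."""
    n = len(X)
    w = [1.0 / n] * n                      # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        threshold, polarity, err = best_stump(X, y, w)
        err = max(err, 1e-10)              # guard against division by zero
        if err >= 0.5:
            break                          # weak learner no better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Re-weight: misclassified examples gain weight, correct ones lose it
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]           # renormalise to a distribution
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

Variants for imbalanced data (e.g. cost-sensitive or SMOTE-augmented boosting) typically modify only the initial weights or the reweighting step of this same loop.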
Copyright information
© 2019 The Author(s), under exclusive licence to Springer Japan KK
Cite this chapter
Komori, O., Eguchi, S. (2019). Machine Learning Methods for Imbalanced Data. In: Statistical Methods for Imbalanced Data in Ecological and Biological Studies. SpringerBriefs in Statistics. Springer, Tokyo. https://doi.org/10.1007/978-4-431-55570-4_5
DOI: https://doi.org/10.1007/978-4-431-55570-4_5
Publisher Name: Springer, Tokyo
Print ISBN: 978-4-431-55569-8
Online ISBN: 978-4-431-55570-4
eBook Packages: Mathematics and Statistics (R0)