An approach for classification of highly imbalanced data using weighting and undersampling

Anand, Ashish; Pugalenthi, Ganesan; Fogel, Gary B.; Suganthan, P. N.

doi:10.1007/s00726-010-0595-2

Original Article
Published: 22 April 2010

Volume 39, pages 1385–1391, (2010)

Amino Acids

Ashish Anand¹,
Ganesan Pugalenthi¹,
Gary B. Fogel² &
…
P. N. Suganthan¹

3491 Accesses
113 Citations

Abstract

Real-world datasets are commonly imbalanced. Several approaches, such as weighting, sub-sampling, and data modeling, exist for handling such data. Learning in the presence of data imbalance presents a great challenge to machine learning. Techniques such as support-vector machines perform very well on balanced data but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated on several real, imbalanced biological datasets, in which the ratio of negative to positive samples varies from ~9:1 to ~100:1. A useful classifier has both high sensitivity and high specificity. Our results demonstrate that the proposed selection technique improves sensitivity compared with a weighted support-vector machine and with results available in the literature for the same datasets.
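The abstract does not describe the proposed selection technique itself, so the sketch below does not attempt to reproduce it. As a rough illustration of the surrounding ideas only, it shows plain random undersampling of the majority (negative) class, class weights inversely proportional to class frequency, and the sensitivity and specificity measures the abstract refers to. All function names, the label convention (1 = positive, 0 = negative), and the weighting formula are assumptions made for this example, not details taken from the paper.

```python
import random
from collections import Counter


def random_undersample(labels, ratio=1.0, seed=0):
    """Baseline random undersampling (NOT the paper's selection technique).

    Keeps every positive (label 1) and a random subset of negatives
    (label 0) of size about ratio * number_of_positives.
    Returns the sorted indices of the retained samples.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    return sorted(pos + rng.sample(neg, n_keep))


def class_weights(labels):
    """Assumed weighting scheme: weight_c = n / (k * n_c),
    where n is the sample count, k the number of classes,
    and n_c the number of samples in class c."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec
```

For a toy label vector with 10 positives and 90 negatives (a 9:1 ratio), a classifier that always predicts the negative class is 90% accurate yet has zero sensitivity, which is why the two measures are reported separately. Weights of this kind could be passed to a weighted support-vector machine implementation, but which library and parameters the authors used is not stated in the abstract.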

Acknowledgments

The authors acknowledge financial support offered by the Agency for Science, Technology, and Research, Singapore (A*Star) under grant #052 101 0020.

Author information

Authors and Affiliations

School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
Ashish Anand, Ganesan Pugalenthi & P. N. Suganthan
Natural Selection, Inc, 9330 Scranton Road, Suite 150, San Diego, CA, 92121, USA
Gary B. Fogel

Corresponding author

Correspondence to P. N. Suganthan.

About this article

Cite this article

Anand, A., Pugalenthi, G., Fogel, G.B. et al. An approach for classification of highly imbalanced data using weighting and undersampling. Amino Acids 39, 1385–1391 (2010). https://doi.org/10.1007/s00726-010-0595-2

Received: 09 December 2009
Accepted: 07 April 2010
Published: 22 April 2010
Issue Date: November 2010
DOI: https://doi.org/10.1007/s00726-010-0595-2
