An Evaluation Measure for Learning from Imbalanced Data Based on Asymmetric Beta Distribution

Thai-Nghe, Nguyen; Gantner, Zeno; Schmidt-Thieme, Lars

doi:10.1007/978-3-642-28894-4_15

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Hand (Mach Learn 77:103–123, 2009) has shown that the AUC has a serious deficiency: it implicitly uses different misclassification cost distributions for different classifiers. Using the AUC is therefore comparable to evaluating different classifiers with different metrics. To overcome this incoherence, Hand proposed the H measure, which replaces the implicit cost weight distributions in the AUC with a symmetric Beta distribution. When learning from imbalanced data, however, misclassifying a minority class example is much more serious than misclassifying a majority class example. To take these different misclassification costs into account, we propose using an asymmetric Beta distribution (B42) instead of a symmetric one. Experimental results on 36 imbalanced datasets using SVMs and logistic regression show that the asymmetric B42 could be a good choice for evaluation in imbalanced data settings, since it puts more weight on the minority class.
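
The idea of swapping the cost weight distribution can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that B42 denotes Beta(4, 2), that the symmetric baseline is Beta(2, 2), and that the cost weight c is the relative cost of missing a minority (label 1) example. The names `beta_pdf` and `h_like_measure` are hypothetical, and the integral over c is approximated with a midpoint rule.

```python
from math import gamma


def beta_pdf(c, a, b):
    """Density of Beta(a, b) at cost weight c in (0, 1)."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * c ** (a - 1) * (1 - c) ** (b - 1)


def h_like_measure(scores, labels, a, b, grid=2000):
    """H-style measure: 1 - expected minimum loss / loss of the best trivial classifier.

    labels: 1 = minority class, 0 = majority class; a higher score means "more likely 1".
    Assumed convention (illustrative only): c weights a missed minority example,
    1 - c weights a false alarm on a majority example.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pi1, pi0 = n_pos / n, n_neg / n

    # Candidate thresholds; an example is predicted 1 when its score exceeds t.
    # -inf predicts everything as 1, the largest score predicts everything as 0.
    thresholds = [float("-inf")] + sorted(set(scores))
    rates = []
    for t in thresholds:
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t)
        rates.append((fn / n_pos, fp / n_neg))

    loss = loss_max = 0.0
    for i in range(grid):
        c = (i + 0.5) / grid                # midpoint of the i-th cell
        w = beta_pdf(c, a, b) / grid        # weight of this cell under Beta(a, b)
        best = min(c * pi1 * fnr + (1 - c) * pi0 * fpr for fnr, fpr in rates)
        trivial = min(c * pi1, (1 - c) * pi0)
        loss += w * best
        loss_max += w * trivial
    return 1.0 - loss / loss_max


# Toy imbalanced data (made up for illustration): 8 majority, 2 minority examples.
toy_scores = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.7, 0.55, 0.9]
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

h_symmetric = h_like_measure(toy_scores, toy_labels, 2, 2)   # Beta(2, 2)
h_asymmetric = h_like_measure(toy_scores, toy_labels, 4, 2)  # Beta(4, 2), assumed B42
print("symmetric  Beta(2, 2):", h_symmetric)
print("asymmetric Beta(4, 2):", h_asymmetric)
```

Under these assumptions Beta(4, 2) places most of its probability mass on c > 0.5, so thresholds that miss minority examples are penalised more heavily than under Beta(2, 2). A perfectly separating scorer yields 1 and a constant scorer yields 0 for either weight distribution.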

References

Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159.
Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1), 1–6.
DeGroot, M. H., & Schervish, M. J. (2002). Probability and Statistics (3rd ed.). Boston: Addison-Wesley.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1–14.
Hand, D. J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123.
Hanley, J., & McNeil, B. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Jamain, A., & Hand, D. J. (2009). Where are the large and difficult datasets? Advances in Data Analysis and Classification, 3(1), 25–38.
Thai-Nghe, N., Gantner, Z., & Schmidt-Thieme, L. (2010). Cost-sensitive learning methods for imbalanced data. In Proceedings of IEEE IJCNN10, IEEE CS, Barcelona (pp. 1–8).
Yang, Q., & Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology and Decision Making, 5(4), 597–604.

Acknowledgements

The first author was funded by the “Teaching and Research Innovation Grant” Project of Can Tho University, Vietnam.

Author information

Authors and Affiliations

University of Hildesheim, Hildesheim, Germany
Nguyen Thai-Nghe, Zeno Gantner & Lars Schmidt-Thieme

Corresponding author

Correspondence to Nguyen Thai-Nghe.

Editor information

Editors and Affiliations

Department of Statistics, Università degli Studi di Firenze, Viale G.B. Morgagni 59, Firenze, 50134, Italy
Antonio Giusti
Fakultät für Informatik und Mathematik, Universität Passau, Innstr. 33, Passau, 94030, Germany
Gunter Ritter
Department of Statistics, University of Rome "La Sapienza", Piazzale Aldo Moro 5, Rome, 00185, Italy
Maurizio Vichi

About this paper

Cite this paper

Thai-Nghe, N., Gantner, Z., Schmidt-Thieme, L. (2013). An Evaluation Measure for Learning from Imbalanced Data Based on Asymmetric Beta Distribution. In: Giusti, A., Ritter, G., Vichi, M. (eds) Classification and Data Mining. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28894-4_15

DOI: https://doi.org/10.1007/978-3-642-28894-4_15
Published: 06 September 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28893-7
Online ISBN: 978-3-642-28894-4
eBook Packages: Mathematics and Statistics (R0)