Abstract
Hand (Mach Learn 77:103–123, 2009) has shown that the AUC has a serious deficiency: it implicitly uses different misclassification cost distributions for different classifiers. Using the AUC is therefore comparable to evaluating different classifiers with different metrics. To overcome this incoherence, Hand proposed the H measure, which replaces the implicit cost weight distributions in the AUC with a symmetric Beta distribution. When learning from imbalanced data, however, misclassifying a minority class example is much more serious than misclassifying a majority class example. To take these different misclassification costs into account, we propose using an asymmetric Beta distribution (B42) instead of a symmetric one. Experimental results on 36 imbalanced datasets using SVMs and logistic regression show that the asymmetric B42 could be a good choice for evaluation in imbalanced data environments, since it puts more weight on the minority class.
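The contrast between a symmetric and an asymmetric weight distribution can be illustrated with a small numerical sketch. It assumes that "B42" denotes a Beta distribution with parameters (4, 2), which the abstract does not state explicitly, and it uses Beta(2, 2) as the symmetric reference, the default in Hand (2009). The helper names `beta_pdf`, `integrate`, and `summarize` are hypothetical and exist only for this illustration. The sketch does not compute the H measure itself; it only shows how the asymmetric choice moves probability mass toward one end of the cost range. Which end corresponds to the minority class depends on the cost convention of the chapter, which the abstract does not give.

```python
import math


def beta_pdf(c, a, b):
    """Density of Beta(a, b) at a cost value c in [0, 1]."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return c ** (a - 1) * (1 - c) ** (b - 1) / norm


def integrate(f, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


def summarize(a, b):
    """Return (mean, probability mass above 0.5) of Beta(a, b)."""
    mean = integrate(lambda c: c * beta_pdf(c, a, b), 0.0, 1.0)
    upper = integrate(lambda c: beta_pdf(c, a, b), 0.5, 1.0)
    return mean, upper


# Symmetric weight distribution: Beta(2, 2).
sym_mean, sym_upper = summarize(2, 2)

# Asymmetric weight distribution: Beta(4, 2) is an ASSUMED reading of "B42".
asym_mean, asym_upper = summarize(4, 2)

print(f"Beta(2, 2): mean={sym_mean:.4f}, mass above 0.5={sym_upper:.4f}")
print(f"Beta(4, 2): mean={asym_mean:.4f}, mass above 0.5={asym_upper:.4f}")
```

Under these assumptions, the symmetric distribution splits its mass evenly around 0.5, whereas the asymmetric one has mean 2/3 and places about 81% of its mass above 0.5, so cost values in the upper half of the range dominate the evaluation.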
References
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159.
Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1), 1–6.
DeGroot, M. H., & Schervish, M. J. (2002). Probability and Statistics (3rd ed.). Boston: Addison-Wesley.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1–14.
Hand, D. J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123.
Hanley, J., & McNeil, B. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Jamain, A., & Hand, D. J. (2009). Where are the large and difficult datasets? Advances in Data Analysis and Classification, 3(1), 25–38.
Thai-Nghe, N., Gantner, Z., & Schmidt-Thieme, L. (2010). Cost-sensitive learning methods for imbalanced data. In Proceedings of IEEE IJCNN10, IEEE CS, Barcelona (pp. 1–8).
Yang, Q., & Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology and Decision Making, 5(4), 597–604.
Acknowledgements
The first author was funded by the “Teaching and Research Innovation Grant” Project of Can Tho University-Vietnam.
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Thai-Nghe, N., Gantner, Z., Schmidt-Thieme, L. (2013). An Evaluation Measure for Learning from Imbalanced Data Based on Asymmetric Beta Distribution. In: Giusti, A., Ritter, G., Vichi, M. (eds) Classification and Data Mining. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28894-4_15
DOI: https://doi.org/10.1007/978-3-642-28894-4_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28893-7
Online ISBN: 978-3-642-28894-4
eBook Packages: Mathematics and Statistics (R0)