Instance selection improves geometric mean accuracy: a study on imbalanced data classification

  • Ludmila I. Kuncheva
  • Álvar Arnaiz-González
  • José-Francisco Díez-Pastor
  • Iain A. D. Gunn
Regular Paper

Abstract

A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances, which discourages systematic instance selection. We also show that balancing the class frequencies is inferior to a direct maximisation of GM. To verify our theoretical findings, we carried out an experimental study of 12 instance selection methods for imbalanced data, using 66 standard benchmark data sets. The results reveal room for new instance selection methods designed specifically for imbalanced data.
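The GM criterion referred to above is the geometric mean of the true positive rate (TPR) and the true negative rate (TNR), GM = sqrt(TPR × TNR). A minimal sketch of the measure follows; the function name and the 0/1 label convention are illustrative choices, not taken from the paper:

```python
import numpy as np

def geometric_mean_score(y_true, y_pred):
    """Geometric mean of the true positive and true negative rates:
    GM = sqrt(TPR * TNR).

    Assumes a two-class problem with labels 1 (positive, typically the
    minority class) and 0 (negative, typically the majority class), and
    that both classes appear in y_true.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # sensitivity
    tnr = np.mean(y_pred[y_true == 0] == 0)  # specificity
    return np.sqrt(tpr * tnr)

# A trivial "always predict the majority class" classifier scores GM = 0,
# even though its plain accuracy is high on imbalanced data.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(geometric_mean_score(y_true, [0] * 10))          # 0.0
print(geometric_mean_score(y_true, [0] * 8 + [1, 1]))  # 1.0
```

Because either rate being zero drives GM to zero, the measure rewards balanced performance on both classes, which is why class-balancing heuristics target it.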

Keywords

Imbalanced data · Geometric mean (GM) · Instance/prototype selection · Nearest neighbour · Ensemble methods · Theoretical perspective

Acknowledgements

This work was done under project RPG-2015-188 funded by the Leverhulme Trust, UK; project TIN2015-67534-P funded by the Ministerio de Economía y Competitividad of the Spanish Government; and project BU085P17 funded by the Junta de Castilla y León. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. School of Computer Science, Bangor University, Bangor, UK
  2. Escuela Politécnica Superior, Universidad de Burgos, Burgos, Spain
  3. Department of Computer Science, Middlesex University, London, UK
