Instance selection improves geometric mean accuracy: a study on imbalanced data classification

Kuncheva, Ludmila I.; Arnaiz-González, Álvar; Díez-Pastor, José-Francisco; Gunn, Iain A. D.

doi:10.1007/s13748-019-00172-4

Instance selection improves geometric mean accuracy: a study on imbalanced data classification

Regular Paper
Published: 06 February 2019

Volume 8, pages 215–228, (2019)
Cite this article

Progress in Artificial Intelligence Aims and scope Submit manuscript

Ludmila I. Kuncheva¹,
Álvar Arnaiz-González²,
José-Francisco Díez-Pastor ORCID: orcid.org/0000-0001-5013-7505² &
…
Iain A. D. Gunn³

854 Accesses
45 Citations
Explore all metrics

Abstract

A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, the classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved upon by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances, which discourages systematic instance selection. We also show that balancing the distribution frequencies is inferior to a direct maximisation of GM. To verify our theoretical findings, we carried out an experimental study of 12 instance selection methods for imbalanced data, using 66 standard benchmark data sets. The results reveal possible room for new instance selection methods for imbalanced data.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Article Open access 15 November 2019

A survey of transfer learning

Article Open access 28 May 2016

Notes

We will use the terms “example”, “instance”, “object” and “prototype” interchangeably, meaning a data point in the feature space of interest, e.g. \(\mathbf {x}\in \mathbb {R}^n\).
We find it curious that no such methods, on this category, have yet been developed to maximise GM.
Available at http://sci2s.ugr.es/keel/imbalanced.php.
We noticed that, while the original OSS is defined by Kubat in [30] as CNN followed by TL, later on, Batista [5] defined it in reverse order and also independently proposed an equivalent to Kubat’s OSS. This misunderstanding has spread in subsequent works. However, we have maintained the original name OSS for CNN+TL, as used [30], and we use TL+CNN for Batista et al.’s method [5].
The random selection was performed by using the SpreadSubsample instance supervised filter.
Available in the KEEL GitHub repository: https://github.com/SCI2SUGR/KEEL.
Available in Google code: https://code.google.com/archive/p/imbalanced-data-sampling/.

References

Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, 20-24 September, 2004. Proceedings, pp. 39–50. Springer Berlin Heidelberg, Berlin, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30115-8_7
Alcala-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.: KEEL data-mining software tool: data set repository and integration of algorithms and experimental analysis framework. J. Mult. Valued Logic Soft Comput. 17(2–3), 255–287 (2011)
Google Scholar
Barandela, R., Sánchez, J., García, V., Rangel, E.: Strategies for learning in class imbalance problems. Pattern Recognit. 36(3), 849–851 (2003)
Article Google Scholar
Barandela, R., Valdovinos, R., Sánchez, J.: New applications of ensembles of classifiers. Pattern Anal. Appl. 6(3), 245–256 (2003)
Article MathSciNet Google Scholar
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. Newsl. 6(1), 20–29 (2004). https://doi.org/10.1145/1007730.1007735
Article Google Scholar
Batuwita, R., Palade, V.: FSVM-CIL: fuzzy support vector machines for class imbalance learning. IEEE Trans. Fuzzy Syst. 18(3), 558–571 (2010). https://doi.org/10.1109/TFUZZ.2010.2042721
Article Google Scholar
Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)
Article Google Scholar
Cieslak, D.A., Chawla, N.V., Striegel, A.: Combating imbalance in network intrusion datasets. In: GrC, pp. 732–737 (2006)
Cleofas-Sánchez, L., Sánchez, J.S., García, V.: Gene selection and disease prediction from gene expression data using a two-stage hetero-associative memory. Prog. Artif. Intell. (2018). https://doi.org/10.1007/s13748-018-0148-6
Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)
Article MATH Google Scholar
Dal Pozzolo, A., Caelen, O., Le Borgne, Y.A., Waterschoot, S., Bontempi, G.: Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst. Appl. 41(10), 4915–4928 (2014)
Article Google Scholar
Dasarathy, B.V.: Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, California (1990)
Google Scholar
Dietterich, T.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10(7), 1895–1923 (1998)
Article Google Scholar
Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C., Kuncheva, L.I.: Random balance: ensembles of variable priors classifiers for imbalanced data. Knowl. Based Syst. 85, 96–111 (2015)
Article Google Scholar
Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C.I., Kuncheva, L.I.: Diversity techniques improve the performance of the best imbalance learning ensembles. Inf. Sci. 325, 98–117 (2015)
Article MathSciNet Google Scholar
Drown, D.J., Khoshgoftaar, T.M., Seliya, N.: Evolutionary sampling and software quality modeling of high-assurance systems. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 39(5), 1097–1107 (2009)
Article Google Scholar
Eskildsen, S.F., Coupé, P., Fonov, V., Collins, D.L.: Detecting alzheimer’s disease by morphological MRI using hippocampal grading and cortical thickness. In: Proceedings of the MICCAI Workshop Challenge on Computer-Aided Diagnosis of Dementia Based on Structural MRI Data, pp. 38–47 (2014)
Fernández, A., García, S., Galar, M., Prati, R., Krawczyk, B., Herrera, F.: Learning from imbalanced data sets. Springer International PU (2018). https://books.google.es/books?id=8Fp0DwAAQBAJ. Accessed 4 Feb 2019
Fix, E., Hodges, J.L.: Discriminatory analysis: non parametric discrimination: small sample performance. Technical report project 21-49-004 (11), USAF School of Aviation Medicine, Randolph Field, Texas (1952)
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997). https://doi.org/10.1006/jcss.1997.1504
Article MathSciNet MATH Google Scholar
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)
Article Google Scholar
Galar, M., Fernández, A., Barrenechea, E., Herrera, F.: Eusboost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognit. 46(12), 3460–3471 (2013)
Article Google Scholar
Garcia, S., Derrac, J., Cano, J., Herrera, F.: Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435 (2012). https://doi.org/10.1109/TPAMI.2011.142
Article Google Scholar
García-Pedrajas, N., Pérez-Rodríguez, J., García-Pedrajas, M.D., Ortiz-Boyer, D., Fyfe, C.: Class imbalance methods for translation initiation site recognition in DNA sequences. Knowl. Based Syst. 25(1), 22–34 (2012)
Article Google Scholar
García, S., Herrera, F.: Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evolut. Comput. 17(3), 275–306 (2009). https://doi.org/10.1162/evco.2009.17.3.275
Article MathSciNet Google Scholar
Hart, P.: The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 14(3), 515–516 (1968)
Article Google Scholar
Jain, A.K., Zongker, D.: Feature selection: evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Mach. Intell. 19(2), 153–158 (1997). https://doi.org/10.1109/34.574797
Article Google Scholar
Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Prog. Artif. Intell. 5(4), 221–232 (2016). https://doi.org/10.1007/s13748-016-0094-0
Article Google Scholar
Krawczyk, B., Galar, M., Jeleń, Ł., Herrera, F.: Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl. Soft Comput. 38, 714–726 (2016)
Article Google Scholar
Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186. Morgan Kaufmann (1997)
Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, New York (2014). https://books.google.co.uk/books?id=RtRLBAAAQBAJ. Accessed 4 Feb 2019
Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds.) Artificial Intelligence in Medicine: 8th Conference on Artificial Intelligence in Medicine in Europe, AIME 2001 Cascais, Portugal, 1–4 July, 2001, Proceedings, pp. 63–66. Springer Berlin Heidelberg, Berlin, Heidelberg (2001). https://doi.org/10.1007/3-540-48229-6_9
López, V., Triguero, I., Carmona, C.J., García, S., Herrera, F.: Addressing imbalanced classification with instance generation techniques: IPADE-ID. Neurocomputing 126, 15–28 (2014)
Article Google Scholar
Phua, C., Alahakoon, D., Lee, V.: Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explor. Newsl. 6(1), 50–59 (2004)
Article Google Scholar
Pudil, P., Novovičová, J., Kittler, J.: Floating search methods in feature selection. Pattern Recognit. Lett. 15(11), 1119–1125 (1994). https://doi.org/10.1016/0167-8655(94)90127-9
Article Google Scholar
Saeys, Y., Inza, I., naga, P.L.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
Article Google Scholar
Sáez, J.A., Krawczyk, B., Woźniak, M.: Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets. Pattern Recognit. (2016). https://doi.org/10.1016/j.patcog.2016.03.012
Sanz, J.A., Bernardo, D., Herrera, F., Bustince, H., Hagras, H.: A compact evolutionary interval-valued fuzzy rule-based classification system for the modeling and prediction of real-world financial applications with imbalanced data. IEEE Trans. Fuzzy Syst. 23(4), 973–990 (2015)
Article Google Scholar
Seiffert, C., Khoshgoftaar, T., Van Hulse, J., Napolitano, A.: RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 40(1), 185–197 (2010)
Article Google Scholar
Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J.: Improving software-quality predictions with data sampling and boosting. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 39(6), 1283–1294 (2009)
Article Google Scholar
Sun, Z., Song, Q., Zhu, X.: Using coding-based ensemble learning to improve software defect prediction. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(6), 1806–1817 (2012)
Article Google Scholar
Tao, T.: An Introduction to Measure Theory. Graduate Studies in Mathematics. American Mathematical Society, Providence (2013). https://books.google.es/books?id=SPGJjwEACAAJ. Accessed 4 Feb 2019
Tesfahun, A., Bhaskari, D.L.: Intrusion detection using random forests classifier with smote and feature reduction. In: 2013 International Conference on Cloud & Ubiquitous Computing & Emerging Technologies (CUBE), pp. 127–132. IEEE (2013)
Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6, 769–772 (1976)
MathSciNet MATH Google Scholar
Triguero, I., Derrac, J., García, S., Herrera, F.: A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(1), 86–100 (2012)
Article Google Scholar
Visa, S., Ralescu, A.: Issues in mining imbalanced data sets-a review paper. In: Proceedings of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference, vol. 2005, pp. 67–73 (2005)
Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. SMC–2(3), 408–421 (1972). https://doi.org/10.1109/TSMC.1972.4309137
Article MathSciNet MATH Google Scholar
Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms. Mach. Learn. 38(3), 257–286 (2000). https://doi.org/10.1023/A:1007626913721
Article MATH Google Scholar
Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2011)
Google Scholar
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
Article Google Scholar
Yang, P., Xu, L., Zhou, B.B., Zhang, Z., Zomaya, A.Y.: A particle swarm based hybrid system for imbalanced medical data sampling. BMC Genom. 10(3), 1–14 (2009). https://doi.org/10.1186/1471-2164-10-S3-S34
Article Google Scholar
Zhang, J., Mani, I.: kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of The Twentieth International Conference on Machine Learning (ICML-2003), Workshop on Learning from Imbalanced Data Sets (2003)
Zheng, B., Myint, S.W., Thenkabail, P.S., Aggarwal, R.M.: A support vector machine to identify irrigated crop types using time-series landsat NDVI data. Int. J. Appl. Earth Obs. Geoinf. 34, 103–112 (2015)
Article Google Scholar

Download references

Acknowledgements

This work was done under project RPG-2015-188 funded by the Leverhulme Trust, UK; the project TIN2015-67534-P funded by the Ministerio de Economía y Competitividad of the Spanish Government; and the BU085P17 funded by the Junta de Castilla y León. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Author information

Authors and Affiliations

School of Computer Science, Bangor University, Dean Street, Bangor, Gwynedd, Wales, LL57 2NJ, UK
Ludmila I. Kuncheva
Escuela Politécnica Superior, Universidad de Burgos, Avda. de Cantabria s/n, 09006, Burgos, Spain
Álvar Arnaiz-González & José-Francisco Díez-Pastor
Department of Computer Science, Middlesex University, London, NW4 4BT, UK
Iain A. D. Gunn

Authors

Ludmila I. Kuncheva
View author publications
You can also search for this author in PubMed Google Scholar
Álvar Arnaiz-González
View author publications
You can also search for this author in PubMed Google Scholar
José-Francisco Díez-Pastor
View author publications
You can also search for this author in PubMed Google Scholar
Iain A. D. Gunn
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to José-Francisco Díez-Pastor.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kuncheva, L.I., Arnaiz-González, Á., Díez-Pastor, JF. et al. Instance selection improves geometric mean accuracy: a study on imbalanced data classification. Prog Artif Intell 8, 215–228 (2019). https://doi.org/10.1007/s13748-019-00172-4

Download citation

Received: 21 November 2018
Accepted: 19 January 2019
Published: 06 February 2019
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s13748-019-00172-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Instance selection improves geometric mean accuracy: a study on imbalanced data classification

Abstract

Access this article

Similar content being viewed by others

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

A survey of transfer learning

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Instance selection improves geometric mean accuracy: a study on imbalanced data classification

Abstract

Access this article

Similar content being viewed by others

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

A survey of transfer learning

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation