1 Introduction

Among imbalanced data classification methods, one of the most promising directions is the use of models based on classifier ensembles. In ensemble learning, great emphasis is placed, on the one hand, on good prediction quality and, on the other hand, on appropriate diversification of the base classifiers. Additionally, for tasks with unequal class proportions, where the cost of misclassifying an object from one class (usually the minority class) is higher than that of misclassifying an object from the other class, the selection of a proper classification quality metric is a significant problem. This metric should serve as the optimization criterion when learning the model. The canonical classification model assumes knowledge of the mentioned misclassification costs, which should be provided by the user in the form of a loss function. A conceptually simple criterion would then be the expected value of that function, i.e., the overall risk [20].

Unfortunately, such information is lacking in real-world problems, and determining the cost of an erroneous decision is very difficult. One may suggest that a misclassification cost should be inversely proportional to the frequency of objects in a given class. However, it is easy to show that this is not necessarily true for many practical problems; especially in the case of highly imbalanced data sets, it would lead to ignoring errors made on the majority class. Hence, approaches are constantly being developed to acquire the misclassification costs mentioned above, such as the utility-based learning of Branco et al. [9]. Most of the work on classifying imbalanced data analyzes simple metrics such as Recall and Precision (or Specificity). However, aggregate metrics such as Gmean, AUC, or F-measure are adopted out of a desire to express the quality of a method by a single value. Their undoubted disadvantage is that they assume a fixed relationship between the simple criteria, e.g., in the case of Gmean, the geometric mean of Precision and Recall. It is also worth noting that these criteria ignore the imbalance ratio and the misclassification costs of objects from different classes. Brzezinski et al. [10] showed that such metrics are biased towards the majority class and suggested using parametric metrics such as Fβ, where β should be proportional to the imbalance ratio. Nevertheless, such a recommendation still does not guarantee that the classifier evaluated best by this criterion will properly optimize the cost of misclassification, because it assumes that the misclassification cost of the minority class is proportional to the imbalance ratio, which is not always true.
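For reference, the aggregated metrics mentioned above can be written explicitly. Gmean is given here in the form used in this text, i.e., the geometric mean of Precision and Recall (other works use Recall and Specificity instead), and Fβ is the standard weighted combination of the two simple criteria:

$$ Gmean=\sqrt{Precision \cdot Recall}, \qquad F_{\beta}=\frac{(1+\beta^{2}) \cdot Precision \cdot Recall}{\beta^{2} \cdot Precision + Recall} $$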

It should also be noted that the values of these aggregated criteria are ambiguous. Based on their values, we de facto do not know how a given model behaves, i.e., what values the simple criteria take, since a given value of the aggregated criterion can be achieved for many pairs of Precision and Recall values.

Considering the above, it seems interesting to consider the problem of classifier learning for imbalanced data as a multi-criteria optimization task. As a result of such algorithms, we should obtain not a single solution but a set of solutions (Pareto front), from which the selection of a single solution can either be automated or left to the user to decide.

In this paper, we intend to answer whether it is possible to propose classifier learning algorithms that use multi-criteria optimization to train a set of Pareto-optimal svm classifiers and obtain an ensemble as good as the state-of-the-art methods. The classical svm classifier has difficulties classifying imbalanced data because determining the optimal hyperplane is not as simple as for more balanced data. The higher the imbalance ratio, the weaker the classification ability. Therefore, data sampling mechanisms (undersampling or oversampling) or changes at the classifier architecture level are used [12].

The main contributions of this paper can be summarized as follows:

  • Proposing the semoos ensemble method using multi-criteria optimization in the model learning phase.

  • Designing a strategy for building the ensemble consisting of Pareto optimal svm classifiers.

  • Developing a new bootstrapping variant for sample subspace selection from imbalanced data.

  • Analysing the impact of hyperparameter settings on the semoos behavior.

  • Estimating the computational complexity of the semoos method.

  • Experimental evaluation of the proposed approach based on diverse benchmark datasets and a detailed comparison with the state-of-the-art methods.

The paper is organized as follows. The following section discusses the related works. Section 3 provides the details of the semoos algorithm and its run-time complexity. Then we discuss the experimental setup, report the results along with our analysis, and finally present the concluding remarks and possible future research directions.

2 Related works

This section presents the main works related to our research. We briefly introduce the main topics related to multi-objective optimization, imbalanced data analysis, and classifier ensemble learning.

2.1 Multi-objective optimization

Multi-criteria (multi-objective) optimization algorithms optimize more than one objective function. Hence, unlike single-objective optimization methods, they return many solutions instead of a single one. The returned solutions may be divided into dominated and non-dominated ones [21, 60]. A non-dominated (Pareto-optimal) solution cannot be improved with respect to any criterion without compromising the value of the others. Most multi-objective optimization algorithms are global methods [17] that return approximate solutions. They may be grouped into: (i) Weighted Objectives Methods, (ii) Hierarchical Optimization Methods, (iii) Trade-Off Methods, (iv) Methods of Distance Functions and Min-Max Methods, (v) Goal Programming Methods, (vi) Genetic Methods.

Among the latter methods, nsga, nsga-ii, npga, and ffga [3] should be mentioned, as well as the multi-objective evolutionary algorithm based on decomposition (moea/d), which decomposes an optimization problem into a number of scalar optimization subproblems and optimizes them simultaneously [79]. Nguyen et al. [54] modified the moea/d algorithm and proposed two decomposition methods based on multiple reference points, static and dynamic. The developed method gave satisfactory results in feature selection for classification. Currently, multi-objective evolutionary algorithms (moeas) are eagerly developed and bring significant benefits, especially in large-scale multi-objective optimization problems. In this case, the decision space is grouped into several subspaces or reduced, or novel search strategies are used [69].

Pareto solutions make it possible to reconcile several criteria but simultaneously make it difficult to choose the single best solution, which is subjective and should depend on the user’s preference. The more critical a criterion is for a user, the more likely the solution with a better score for that criterion will be selected. The Pareto MTL method was developed to facilitate selecting a Pareto front solution [45]. Thanks to it, solutions in the objective space are well-distributed, which ensures a compromise between various, often opposing, criteria. Another approach that provides well-distributed solutions on the Pareto front is the cosmos method [61] developed for deep learning.

There are many propositions on how to solve multi-objective optimization problems. One of them is the surrogate-assisted Particle Swarm Optimization with the Pareto active learning algorithm [48], which has a relatively low computational cost, fast convergence, and good diversity. Convergence is good when the solutions are close to the Pareto front, and good diversity in the objective space means that the solutions are evenly distributed rather than concentrated at one point. Ma et al. [49] proposed the lsmoea/d method, which uses reference vectors in the control variable analysis. The experiments confirmed high quality for large test problems with 2–10 objectives and 200–1000 variables.

Multi-criteria optimization is a rapidly growing domain, primarily due to the increasing complexity of modeled processes. Increasingly, such optimization is used in real problems such as crude oil price forecasting [35], efficient use of energy in agriculture [36], renewable energy in the building sector [46], and many more. There are also some applications of multi-criteria optimization in classification, e.g., Jin et al. [77] solved the problem of multi-criteria optimization of the structure and parameters of a neural network using evolutionary methods (nsga-ii). The mo-selm method [75] has been tested for classification and regression. It relies on the joint optimization of parameters and structure of the Extreme Learning Machine network to cope with the overfitting problem. gemonn [76] employed a gradient-guided evolutionary approach combining the advantages of gradient descent and evolutionary algorithms to train deep neural networks. The optimization determines the weights of the network, while the multi-criteria formulation simultaneously targets the network sparsity and the training loss.

Many researchers combine svm classifiers with optimization algorithms that solve the feature selection problem. The article [78] used nsga-iii for feature subset selection and cnn-svm (Convolutional Neural Network - Support Vector Machine) for software defect prediction with an imbalance problem. Mierswa [51] showed the possibility of using multi-criteria optimization techniques in svm learning, pointing out that this approach makes it possible to turn away from aggregated optimization criteria formed as a combination of opposing criteria. Additionally, Pareto-optimal solutions allow complexity analysis of the solution, so the user can easily see which solutions are overfitted. The article [62] combines an svm classifier for detecting malicious traffic with a Genetic Algorithm for hyperparameter optimization.

2.2 Imbalanced data classification

In imbalanced data classification, the disproportion among the different classes is not the sole source of learning difficulties. One may easily come up with an example where, despite a high imbalance ratio, the instance distributions of the different classes are well-separated and the task remains easy. Napierała and Stefanowski observed that the minority class samples often form scattered clusters of an unknown structure [53]. An additional complication arises from the possibility that there may be an insufficient number of minority class samples for a classifier learning algorithm to achieve an adequate level of generalization, resulting in overfitting [16].

One may divide imbalanced data classification algorithms into three groups [47].

Data preprocessing methods

concentrate on decreasing the number of examples from the majority class (undersampling) or generating new minority class samples (oversampling). These mechanisms aim to balance the number of objects from the considered classes. Oversampling randomly replicates existing samples or generates new samples in a guided manner. smote is the most recognized algorithm [14]; it generates a new minority sample on the segment between a randomly selected minority object and one of its nearest minority neighbors. Unfortunately, methods such as smote may change the characteristics of the minority class and, as a result, overfit the classifier. Therefore, several modifications of smote have been proposed that identify the samples to be used more intelligently, such as Borderline-smote [29], which generates new minority class samples close to the decision border. Safe-Level smote [11] and ln-smote [50] reduce the probability of generating synthetic minority instances in areas dominated by the majority class. Koziarski et al. proposed rbo [39] and ccr [38], which enforce the relocation of majority-class instances from the areas where minority-class instances are present. The alternative preprocessing approach is undersampling. Such methods remove instances from the majority class either at random or, based on neighborhood analysis, from areas where their removal does not disrupt the classifier’s quality.
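To make the oversampling step concrete, the following sketch uses the imbalanced-learn implementations of smote and Borderline-smote on a synthetic dataset; it only illustrates the interpolation idea discussed above and is not part of the proposed method.

```python
# Minimal illustration of guided oversampling with imbalanced-learn
# (not part of the authors' method; the dataset is synthetic).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE creates a synthetic sample on the segment between a minority object
# and one of its nearest minority neighbours.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_res))

# Borderline-SMOTE restricts generation to minority objects near the border.
X_res, y_res = BorderlineSMOTE(random_state=0).fit_resample(X, y)
print("after Borderline-SMOTE:", Counter(y_res))
```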

Inbuilt mechanisms

modify existing classification algorithms for imbalanced tasks, ensuring balanced accuracy for both classes. One approach is one-class classification [34], which aims to learn the decision area associated with a single class. Initially, an approach based on building models for the majority class was proposed, due to the sufficiently large number of objects representing it, while the minority class was treated as so-called outliers. A different approach was proposed in [41], where a one-class classifier was trained on the minority class. In turn, cost-sensitive classification considers an asymmetrical loss function that assigns a higher misclassification cost to the minority class [30, 40, 47, 80]. Unfortunately, such methods may cause a reverse bias towards the minority class.

Hybrid methods

combine the advantages of data preprocessing methods with different classification methods. One of the most popular approaches is the hybridization of under- and oversampling with ensemble classifiers [27]. This approach allows the data to be processed independently for each base model. It is also worth noting methods based on ensemble classification [74], such as smoteBoost [15] and AdaBoost.NC [71].

2.3 Classifier ensembles

The purpose of a classifier ensemble is to make a joint decision by a pool of base classifiers to obtain greater predictive power [42]. The main factors affecting the predictive quality of the ensemble are the individual predictive quality of the base classifiers and their diversity [52]. These factors are in opposition: in the case of an ideal classifier that makes a correct prediction in every case, increasing ensemble diversification must lead to the addition of classifiers of worse quality. Therefore, the ensemble forming process may be treated as a multi-objective optimization problem.

Chandra and Yao proposed divace (DIVerse and Accurate Ensemble Learning Algorithm) [13], which employs multi-objective optimization in the ensemble learning task to find a trade-off between diversity and accuracy. Abbass developed the Memetic Pareto Artificial Neural Network (mpann), which optimizes similar criteria [1]. In [73], the authors optimize the weights of the models in the ensemble and select the solution from the Pareto front using the promethee method. Other works focus on ensembles of decision trees [37], recursive networks [63], or fuzzy rules [33], which prove the feasibility of this approach. However, the number of works devoted to employing multi-objective optimization in classifier ensemble design is relatively small, especially compared to the common problem of designing an ensemble using a single criterion. Answers to key questions require further research, i.e., which models are best suited to the multi-objective task and whether it is possible to develop algorithms that effectively combine different decision models. It is also necessary to develop methods for forming the base classifiers and creating combination rules that connect them in the multi-objective optimization task.

Fletcher et al. [26] proposed a non-specific ensemble classification algorithm that uses multi-objective optimization to set the required trade-off. Ribeiro and Reynoso-Meza [59] developed a two-stage ensemble learning framework where, firstly, a set of diverse classifiers is generated and, secondly, the pool of classifiers is pruned. The same authors [4] also analyze different multi-objective optimization approaches for ensemble learning and propose a taxonomy of multi-objective ensemble learning. Oliveira et al. [56] employed multi-objective optimization to select the base classifiers’ valuable features and then choose the best ensemble line-up. Onan et al. [57] developed an ensemble method that employs a static classifier selection involving majority voting error and forward search, as well as a multi-objective differential evolution algorithm. Liang et al. [44] described an ensemble learning model based on multimodal multi-objective optimization. Asadi and Roshan [5] formulated an interesting proposition that focuses on the bagging procedure, considering the number of bags and simultaneously the diversity of the trained classifiers. Gu et al. [28] focused on classifier ensemble generation, proposing a solution that is a trade-off between ensemble accuracy and diversity. They also showed that the proposed solution can outperform single-objective methods.

2.4 Multi-objective optimization for imbalanced data classification

Only a few works on imbalanced data classification employ multi-objective optimization methods. It is worth mentioning the work of Bhowan et al. [6], who proposed to build a classifier ensemble based on Pareto-optimal classifiers. In turn, Soda [64] suggested training the classification system named Reliability-based Balancing using multi-objective optimization methods that maximize two criteria related to the frequency of correct decisions and G-mean. They used a data preprocessing technique based simultaneously on feature selection and prototype selection obtained from multi-objective optimization. Li et al. [43] proposed the data preprocessing method Adaptive Multi-objective Swarm Crossover Optimization, which uses both over- and under-sampling at the same time. This approach selects the best proportion between majority and minority samples by multi-objective optimization.

Several ensemble techniques have also been employed for imbalanced data classification tasks. Bhowan et al. [7] proposed a two-step approach to evolving ensembles using genetic programming for imbalanced data classification. The Pareto-optimal classifiers form an initial pool, and then ensemble pruning methods based on genetic programming are employed. Fernandez et al. [25] employed a decision tree ensemble and multi-objective optimization to find the best combination of feature and instance selection for the multi-class imbalanced task. Felicioni et al. [22] developed an algorithm that took fourth place in the ACM RecSys Challenge 2020, organized by Twitter. The challenge aimed to predict the probability of user engagement based on past interactions on the Twitter platform. The authors employ feature extraction and gradient boosting for decision trees and neural networks and build the ensemble using multi-objective optimization.

3 Proposition of the method

Let us propose the SVM Ensemble with Multi-Objective Optimization Selection (semoos) method dedicated to training an svm classifier ensemble for imbalanced data. Its pseudocode is presented in Algorithm 1 and its diagram in Fig. 1. The main idea is to find a pool of svms that gives a diversified ensemble. To achieve that, we look for the setting of the two parameters C and γ of the svm Radial Basis Function (rbf) kernel. Additionally, feature selection for each base classifier is performed to ensure the high diversity of the ensemble. Multi-objective optimization is used to select the best svm parameter setting, including feature selection [67]. The method depends on several parameters, but we distinguish two main additional versions of semoos: semoos b and semoos bp. semoos b employs an original bootstrapping method to increase the diversity of the ensemble (Bootstrapping is true). semoos bp additionally employs pruning to remove similar models from the ensemble (Pruning is true).

Fig. 1 Diagram of proposed methods: semoos, semoos b (when bootstrapping is used), semoos bp (when bootstrapping and pruning are applied)

Bootstrapping generates data subspaces. Optimization based on these data results in differentiated svm parameters from a given range and determines which features are selected. The best non-dominated solutions are used to create models, which are then added to the ensemble of classifiers. This iteration of center-based Bootstrapping is repeated several times to ensure stability and higher quality, as well as to avoid overfitting.

Algorithm 1 semoos, semoos b, semoos bp

3.1 Algorithm

Let us quickly analyze Algorithm 1, which starts with the learning set \(\mathcal{LS}\) as input. Suppose the center-based Bootstrapping mechanism is enabled (Bootstrapping = True). In that case, it is repeated n times, and it divides each fold of the dataset into subsets S. A root r is selected randomly from the set of examples only for the first iteration (r1), and this point becomes the center \(\overline{x}\). The distribution list Di is built from the distances from the root r to every sample in \(\mathcal{LS}\), normalized by their sum D. Then, Di is used as the probability distribution in sampling with replacement, which creates the subset Si composed of examples from \(\mathcal{LS}\). The following root ri+1 is the point furthest from the center of mass. If xc is the center of the first k roots and d is the dimension of x, then the center after adding the (k + 1)-th root is given by the following formula:

$$ center(x_{c},k,x)=\frac{1}{k+1} \begin{bmatrix} k x_{c}^{(1)}+x^{(1)}\\ k x_{c}^{(2)}+x^{(2)}\\ \vdots\\ k x_{c}^{(d)}+x^{(d)} \end{bmatrix} $$
(1)

For example, in two-dimensional space, after the second iteration there are two roots, and their center lies at the midpoint of the line segment that connects them. After the third iteration, three roots form a triangle, and the new center is the centroid of this figure. This process continues until the iterator reaches the value of the n-repetition parameter.
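To make the procedure concrete, the following sketch shows one possible implementation of the center-based bootstrapping described above. The function and variable names are ours, the subset size equal to the learning set size is an assumption, and the code is an illustration of Eq. 1 rather than the original implementation.

```python
# A minimal sketch of center-based bootstrapping (Eq. 1); names are ours.
import numpy as np

def update_center(x_c, k, x):
    """Incremental center of mass after adding the (k+1)-th root x (Eq. 1)."""
    return (k * x_c + x) / (k + 1)

def center_based_bootstrap(X, n_repetitions, rng=None):
    rng = np.random.default_rng(rng)
    subsets = []
    # the first root is drawn at random and also initialises the center
    root = X[rng.integers(len(X))].copy()
    center = root.copy()
    for k in range(n_repetitions):
        # distances from the current root weight the sampling with replacement
        dist = np.linalg.norm(X - root, axis=1)
        proba = dist / dist.sum()
        # subset size |LS| is an assumption; the text does not specify it
        idx = rng.choice(len(X), size=len(X), replace=True, p=proba)
        subsets.append(idx)
        # next root: the point furthest from the current center of mass
        root = X[np.argmax(np.linalg.norm(X - center, axis=1))]
        center = update_center(center, k + 1, root)
    return subsets
```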

In each iteration, the subset Si is used as the input to the multi-objective optimization performed by the nsga-ii algorithm [18]. Equation 2 presents the two fitness functions, F1 and F2, that the optimization algorithm uses to maximize Precision and Recall:

$$ \begin{cases} \text{maximize } F_{1}(C, \gamma, \hat{x}) = Precision \\ \text{maximize } F_{2}(C, \gamma, \hat{x}) = Recall \end{cases} $$
(2)

The metrics are calculated during the validation process inside the optimization with the base estimator svm and different values of its hyperparameters C and γ, which have the following lower and upper limits: C ∈ [1e6, 1e9] and γ ∈ [1e-7, 1e-4]. \(\hat{x}\) is a binary vector containing the selected features. These three parameters (\(C, \gamma, \hat{x}\)) form an initial population. The optimization is repeated until the maximum number of evaluations (m) is reached. It returns the Pareto-optimal set containing the results from the Pareto front, i.e., C, γ, and the selected features.
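As an illustration, a simplified optimization step can be sketched with Pymoo's nsga-ii implementation. Only the two real-valued hyperparameters are encoded here; the binary feature-selection vector and the genetic operators described later are omitted, so this is a reduced sketch rather than the full semoos optimization.

```python
# Simplified sketch of the optimization step with pymoo's NSGA-II: only C and
# gamma are optimized (the binary feature mask is omitted), and since pymoo
# minimizes, both objectives are negated. Bounds follow the ranges above.
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

class SVMPrecisionRecall(ElementwiseProblem):
    def __init__(self, X, y):
        # decision variables: log10(C) in [6, 9] and log10(gamma) in [-7, -4]
        super().__init__(n_var=2, n_obj=2,
                         xl=np.array([6.0, -7.0]), xu=np.array([9.0, -4.0]))
        self.X, self.y = X, y

    def _evaluate(self, x, out, *args, **kwargs):
        clf = SVC(C=10.0 ** x[0], gamma=10.0 ** x[1], kernel="rbf")
        scores = cross_validate(clf, self.X, self.y, cv=2,
                                scoring=("precision", "recall"))
        # negate because pymoo minimizes while we maximize Precision and Recall
        out["F"] = [-scores["test_precision"].mean(),
                    -scores["test_recall"].mean()]

# Usage with hypothetical training data: res.X holds the Pareto-optimal
# (log C, log gamma) pairs and res.F their negated Precision/Recall values.
# res = minimize(SVMPrecisionRecall(X_train, y_train), NSGA2(pop_size=100),
#                ("n_eval", 1000), verbose=False)
```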

Then, for each result, an estimator is trained and added to the ensemble. The semoos bp version with Pruning does not add all models to the ensemble. It finds the solutions with unique values of the fitness functions (Precision and Recall) and trains estimators only on the corresponding \((C, \gamma, \hat{x})\) triples. Finally, the algorithm returns the ensemble of classifiers. The prediction is based on Average Support Vectors.
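The combination rule is only named above, so the following helper reflects one plausible reading of "Average Support Vectors": soft voting in which each svm's per-class support (its predicted probabilities) is averaged and the class with the highest mean support is returned. This is our interpretation, not necessarily the exact rule used in semoos.

```python
# One plausible reading of "Average Support Vectors": average each member's
# per-class support (predicted probabilities, so SVC(probability=True) is
# assumed) and pick the class with the highest mean support.
import numpy as np

def ensemble_predict(ensemble, feature_masks, X):
    # each classifier was trained on its own selected feature subset
    supports = [clf.predict_proba(X[:, mask])
                for clf, mask in zip(ensemble, feature_masks)]
    mean_support = np.mean(supports, axis=0)          # average class supports
    return ensemble[0].classes_[np.argmax(mean_support, axis=1)]
```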

The basic version of semoos is much simpler. It starts from the multi-objective optimization described above, but the input is the whole \(\mathcal{LS}\). Subsequently, solutions from the Pareto set PS, i.e., \((C, \gamma, \hat{x})\), are used to train the models, and all models are added to the ensemble.

3.2 Computational complexity analysis

The computational complexity depends on a few aspects of the proposed method. Firstly, the time complexity of the base svm classifier is \(O(N^{3})\) [2], where N denotes the size of the dataset, i.e., the number of examples in the learning set. Let us assume that M is the number of objectives and K is the population size. The computational complexity of nsga-ii is \(O(MK^{2})\) [18]; in our case M = 2, so it reduces to \(O(K^{2})\). The last part of the method is Bootstrapping, whose complexity is O(N). The complexity of a single step of semoos is therefore \(O(N^{3}) + O(MK^{2}) + O(N) = O(N^{3} + MK^{2})\), and the dominant term determines the overall computational complexity of all versions of the proposed method.

4 Experimental evaluation

The experiments described in this section are used to test the proposed methods and answer the research questions posed below.

  • RQ1: What is the impact of the semoos parameters (especially Bootstrapping and Pruning) on its quality?

  • RQ2: How do the variants of the semoos method affect classification quality?

  • RQ3: Can semoos methods outperform state-of-the-art algorithms?

  • RQ4: What is the diversity of the semoos ensemble compared to the reference methods?

4.1 Setup

Experiments were prepared using the Python programming language and a few libraries: Pymoo [8], scikit-learn [58], Numpy [55], Matplotlib [32], and Pandas [72]. The implementation of the experiments is available in the GitHub repository.

As the proposed method aims to train an ensemble of svms, the Random Subspace reference method is also based on svm and randomly performs feature selection for its base classifiers. By choosing the right features for each base classifier, diversity is assured. The other reference algorithms do not use the ensemble classifier paradigm: a simple svm classifier without feature selection and two svm models based on feature selection. A description of the semoos algorithm’s variants and the reference methods used in the experiments is presented below.

  • semoos - svm Ensemble with Multi Objective Optimization Selection

  • semoos b - svm Ensemble with Multi Objective Optimization Selection with Bootstrapping

  • semoos bp - svm Ensemble with Multi Objective Optimization Selection with Bootstrapping and Pruning

  • rs - Random Subspace svm Ensemble [31]

  • svm - Support Vector Machines [65]

  • fs - Feature Selection svm

  • fsirsvm - Feature Selection Imbalance Ratio svm

Three versions of the proposed method (semoos, semoos b, semoos bp) are compared with the following benchmark solutions: svm, the rs ensemble, and two classifiers with feature selection (fs and fsirsvm). fs is a classical approach to feature selection based on the Chi-square statistic [68] and the K-best function, which chooses the K best features. fsirsvm is almost the same as fs, but it has an additional parameter, ir – the Imbalance Ratio of each fold, which is applied to the svm as the class weight parameter. fs and fsirsvm select 75% of the features from each dataset. rs creates random subspaces and has 100 models in the ensemble. The Support Vector Machine (svm) classifier with default parameters was used as the base classifier for all methods except the semoos variants: the regularization parameter C = 1, the rbf kernel, the kernel coefficient γ set to scale, probability estimates set to True, and the remaining parameters left at the scikit-learn defaults.
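For illustration, the two feature-selection baselines can be assembled in scikit-learn roughly as follows. The pipelines below are our reconstruction from the description above, not the authors' exact code; note that the chi-square score requires non-negative feature values.

```python
# Sketch of the fs and fsirsvm baselines reconstructed from the text.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

def make_fs(n_features):
    k = int(0.75 * n_features)                       # keep 75% of the features
    return make_pipeline(SelectKBest(chi2, k=k),
                         SVC(kernel="rbf", probability=True))

def make_fsirsvm(n_features, y_train, minority_label=1):
    k = int(0.75 * n_features)
    # the imbalance ratio of the training fold is used as the minority weight
    ir = np.sum(y_train != minority_label) / np.sum(y_train == minority_label)
    return make_pipeline(SelectKBest(chi2, k=k),
                         SVC(kernel="rbf", probability=True,
                             class_weight={minority_label: ir}))
```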

As mentioned in Section 3, all semoos variants use svm and optimize its parameters C and γ; the remaining svm parameters are left at their defaults. Our methods use the nsga-ii (Non-dominated Sorting Genetic Algorithm) optimization algorithm [18] with a population size of 100 and a mixed representation, because C and γ are real values, while \(\hat{x}\) is a binary vector of selected features. Based on pre-experiments, the following genetic operators were used: Random Sampling, Polynomial Mutation (for the real representation) and Bitflip Mutation (for the binary representation), Simulated Binary Crossover (for the real representation), and Two Point Crossover (for the binary representation). The eta parameter of Simulated Binary Crossover and Polynomial Mutation was set to 5. A constraint on the representation \(\hat{x}\) limits the selection to 75% of the features, i.e., 75% of the entries in the vector are set to 1. The parameter optimization process runs for 1000 evaluations. The size of the ensemble depends on the variant of our methods. In semoos, a parameter sets the ensemble size, which is ten models. In semoos b, this value is multiplied by the number of Bootstrapping iterations (five), so the final ensemble consists of 50 classifiers. Due to the Pruning in semoos bp, we cannot specify its size in advance.

As the experimental protocol, Repeated Stratified K-Fold cross-validation (5 repeats x 2 splits) was chosen [66]. Such cross-validation was also applied inside the optimization to avoid overfitting. Results were saved as csv files for each dataset, fold, and metric. Then, we performed Wilcoxon rank-sum statistical tests to see whether one method was statistically significantly better than another in pairwise comparisons. The following metrics were used to measure the quality of the methods: bac – Balanced Accuracy, Gmean, Recall, and Precision.
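A minimal sketch of this protocol, assuming per-fold Balanced Accuracy as the compared score, could look as follows; the helper names are ours.

```python
# Sketch of the evaluation protocol: 5x2 repeated stratified cross-validation
# and a pairwise Wilcoxon rank-sum test on the per-fold scores.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import balanced_accuracy_score
from scipy.stats import ranksums

def evaluate(clf, X, y, random_state=0):
    rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=5,
                                   random_state=random_state)
    scores = []
    for train_idx, test_idx in rskf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        scores.append(balanced_accuracy_score(y[test_idx],
                                              clf.predict(X[test_idx])))
    return np.array(scores)                 # 10 per-fold BAC values

# scores_a, scores_b = evaluate(method_a, X, y), evaluate(method_b, X, y)
# stat, p_value = ranksums(scores_a, scores_b)   # p < 0.05: significant difference
```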

All imbalanced datasets used in the experiments are listed in Table 1, where ID is the dataset identifier, Dataset is the name of the dataset, IR is the Imbalance Ratio, Ex. is the number of examples, and Attr. is the number of attributes. They were loaded from the keel dataset repository [70], where the datasets are divided according to ir. The first 19 datasets, separated by the line, have ir lower than 9; the rest have ir higher than 9. We keep this division in our results to check the effectiveness of our methods on data with low and high imbalance. Most of the datasets are originally multi-class problems, but they have already been prepared as binary classification problems. For example, the glass-0-1-2-3_vs_4-5-6 dataset combines classes 0, 1, 2, 3 of the original dataset as the negative (majority) class and classes 4, 5, 6 as the positive (minority) class. The extended description with class names is given by Fernandez et al. [23, 24].

Table 1 Description of datasets

4.2 Experiments

We carried out four groups of experiments to answer the research questions:

  • Selection of the best hyperparameters of the semoos algorithm (Experiment 1).

  • Comparison of the main variants of the proposed method, i.e., investigation of the effect of bootstrapping and ensemble pruning on the quality of semoos (Experiment 2).

  • Comparison of the variants of the proposed method with selected reference algorithms (Experiment 3).

  • Evaluation of classifier ensemble diversity of the proposed variants of semoos (Experiment 4).

4.2.1 Experiment 1: Setting hyperparameters

Our methods have a few parameters, so we decided to conduct pre-experiments to choose the values of these parameters that yield the best quality. We tested semoos b on four datasets (ecoli-0-3-4-7_vs_5-6, glass5, vehicle3, yeast-2_vs_4) and averaged the results presented in the figures.

Firstly, we checked the eta parameter of crossover and mutation for each metric over the values [2, 5, 10, 20]. Deb et al. [19] point out that small eta values for crossover provide a diverse search among solutions. Figure 2 shows exemplary results for one dataset, yeast-2_vs_4, where eta_m and eta_c are the eta parameters for mutation and crossover, respectively. Figures for the remaining datasets are available in the GitHub repository. The best result is marked as a black square with bold, white font. After analyzing all datasets, the results shown in Fig. 2 indicate that the parameters achieving the highest bac, Gmean, and Recall are eta_c = 5 and eta_m = 5. Precision is not the highest for these values, but its value of 0.916 is not much worse.

Fig. 2 First pre-experiment: setting the eta parameter for mutation and crossover (the best result – the black square)

Next, the number of Bootstrapping iterations and the percentage of selected features were tested; exemplary results for the ecoli-0-3-4-7_vs_5-6 dataset are shown in Fig. 3. The situation is similar to the previous one: bac, Gmean, and Recall reach their highest values for bootstrap = 5 and features = 75%.

Fig. 3 Second pre-experiment: setting the number of Bootstrapping iterations and the percentage of selected features (the best result – the black square)

4.2.2 Experiment 2: Comparison of three variants of semoos

Based on the parameters selected in the previous experiment, tests are conducted comparing the proposed semoos, semoos b, and semoos bp methods. The results from the folds and all datasets are averaged, and Wilcoxon rank-sum statistical tests are performed. In the Wilcoxon test figures, rows correspond to the metrics (bac, Gmean, Recall, Precision), and columns are labeled with the methods. Green means that the method wins, yellow that it ties, and red that it loses to the method indicated on the bar of the chart. The black dashed line indicates the statistical significance of the method as winning. Figure 4 shows the test for datasets with an imbalance ratio of less than 9. The results do not indicate any of the methods as statistically significantly better than the others. However, it can be noticed that semoos and semoos bp win against semoos b, especially on the first three metrics, i.e., bac, Gmean, and Recall. Figure 5 covers the 59 datasets with IR > 9. Similar conclusions can be drawn: semoos is better than semoos b, and the other methods are at a similar level.

Fig. 4 Wilcoxon rank-sum test of proposed methods for datasets with IR < 9 (green – win, yellow – tie, red – loss)

Fig. 5 Wilcoxon rank-sum test of proposed methods for datasets with IR > 9 (green – win, yellow – tie, red – loss)

4.2.3 Experiment 3: Comparison with reference methods

A vital element of evaluating a method is comparing it with state-of-the-art methods to check whether the proposed method is statistically significantly better than the others. This experiment compares the three proposed variants of the semoos method with the reference methods rs, svm, fs, and fsirsvm. The Appendix contains a link to tables with the exact values of the metrics averaged over the folds, together with the standard deviation for each dataset. The best result for a given dataset is marked in bold. Each table contains a different metric. The results of the proposed methods are compared with the reference ones. Sometimes the quality is slightly higher or lower, but it is difficult to pinpoint a winning method.

Therefore, Wilcoxon statistical tests show on how many datasets each proposed method wins. As in the previous section, the figures are grouped by imbalance ratio above or below 9. Each semoos variant is compared to all reference methods for the evaluated metrics. All semoos variants behave similarly in Fig. 6, and they are statistically significantly better than fsirsvm for the bac, Gmean, and Recall metrics. They also achieve many wins against the rs method; however, only semoos shows a statistically significant win for the Recall metric.

Fig. 6 Wilcoxon rank-sum test for datasets with IR < 9 (green – win, yellow – tie, red – loss)

The results in Fig. 7 are slightly different for data with high imbalance, for which there are many more datasets. All variants of the semoos method win with statistical significance against the rs method for the bac, Gmean, and Recall metrics. There are also fewer ties between the methods for these three metrics. An important goal for us was to correctly identify the minority class, which is reflected by Recall.

Fig. 7 Wilcoxon rank-sum test for datasets with IR > 9 (green – win, yellow – tie, red – loss)

An analysis of the Pareto front in the objective function space is presented in Fig. 8. Triangles represent the reference methods. Our methods appear in two forms: the method name with the PF suffix (Pareto Front) denotes the Pareto front solutions, while the method name alone denotes the quality of the final constructed ensemble. These scatter plots show how the Pareto front solutions are located relative to the final ensembles and the reference methods. Figure 8a shows an exemplary broad Pareto front, which provides more diverse models. The reference methods lie behind the Pareto front; compared to the ensembles, they obtain a lower Precision value. Figure 8b shows PF solutions that are more concentrated than in Fig. 8a. The semoos variants obtain similar values for both metrics and win over the reference methods. In these examples, the methods score high on both Precision and Recall, but this is not a rule for all datasets.

Fig. 8 Scatter plots with Pareto front and reference methods

4.2.4 Experiment 4: Evaluating semoos ensemble diversity

The last experiment focuses on the evaluation of the diversity of the models in the ensemble. We conducted tests with four diversity metrics proposed in [42] for the three semoos variants and the rs reference ensemble. Figure 9 shows the Q-statistic metric, while the results for all other metrics are available in the GitHub repository.

The Q-statistic is a pairwise diversity measure. It assesses the outputs of two classifiers and quantifies their similarity. Q ranges from − 1 to 1, where Q = 0 means that the classifiers are statistically independent, Q < 0 that the classifiers make mistakes on different objects, and Q > 0 that the classifiers correctly recognize the same objects.
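The following helper sketches the standard pairwise Q-statistic computation (Yule's Q over the joint correct/incorrect counts); it is a minimal illustration rather than the implementation used in the experiments.

```python
# Pairwise Q-statistic for two classifiers, following the standard definition.
import numpy as np

def q_statistic(y_true, pred_i, pred_j):
    correct_i = (pred_i == y_true)
    correct_j = (pred_j == y_true)
    n11 = np.sum(correct_i & correct_j)      # both correct
    n00 = np.sum(~correct_i & ~correct_j)    # both wrong
    n10 = np.sum(correct_i & ~correct_j)     # only the first correct
    n01 = np.sum(~correct_i & correct_j)     # only the second correct
    den = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / den if den else 0.0

# Ensemble diversity is usually reported as the average Q over all pairs;
# values near 0 indicate statistically independent (diverse) members.
```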

From Fig. 9a, it may be concluded that for data with a small imbalance, the diversity of the methods hardly differs. However, for a greater imbalance (Fig. 9b), these differences are significant, and the result of the semoos method is closest to the value 0, which means that the models produced by this method are the most diverse; compared to rs, the difference is 0.3.

4.3 Lessons learned

Let us try to answer the asked research questions considering the obtained results of the experiments.

  • RQ1: What is the impact of the semoos parameters (especially Bootstrapping and Pruning) on its quality?

    During the pre-experiments, four parameters were examined: eta_c for crossover, eta_m for mutation, the number of Bootstrapping iterations, and the number of features. The effect of the eta parameter for both genetic operators depends on the selected metric. The largest difference between the best and worst results is 0.3, but it is usually only a few hundredths. The combination of different values of the eta_c and eta_m parameters does not give unequivocal results. The analysis of these cases showed that good results were obtained for eta_c = 5 and eta_m = 5. It is easier to draw conclusions for the other two parameters. The greater the number of selected features, the higher the value of most metrics. The results are worst for 1 and 10 Bootstrapping iterations. Therefore, we selected 75% of the features and five Bootstrapping iterations for further experiments to obtain the highest classification quality. Bootstrapping and Pruning, as parameters of the semoos method, improve the quality of the tested metrics.

  • RQ2: How do variants of the semoos method affect classification quality?

    None of the proposed methods shows a statistically significant difference compared to the others. Therefore, we included all three variants in further experiments. However, based on the presented statistical test results, semoos and semoos bp are better than semoos b across all datasets.

  • RQ3: Can semoos methods outperform state-of-the-art algorithms?

    Each variant of the semoos method has been compared with four state-of-the-art methods, and the statistical test results for each semoos variant are similar if we consider the number of wins. However, the results vary depending on the level of dataset imbalance. For datasets with an imbalance ratio below 9, the proposed methods outperform the fsirsvm method with statistical significance and achieve a high number of wins against the rs method. For imbalance ratios above 9, statistical significance is obtained against the rs algorithm.

  • RQ4: What is the diversity of the semoos ensemble compared to the reference methods?

    The last element of the research shows that all proposed methods are more diversified than the Random Subspace (rs) method. When analyzing the figure for the greater imbalance ratio, which covers a larger number of datasets, it is noticeable that the semoos method performs best.

5 Conclusion

The main goal of this work was to use multi-objective optimization (nsga-ii) with two independent fitness functions, Precision and Recall, to classify imbalanced data. Three ensemble classifiers using svm as the base model were proposed: semoos, semoos b, and semoos bp. semoos is the basic version that adds all svm models obtained from the optimization. semoos b adds Bootstrapping, i.e., resampling of the dataset with replacement is performed to obtain more samples from one dataset. The last version is semoos bp, which includes both Bootstrapping and the Pruning needed to remove redundant models from the ensemble.

The experiments were performed according to the experimental protocol on 78 imbalanced datasets. First, the semoos hyperparameters were set to select the parameter values that produce the best results for this kind of data. Then, the three versions of the semoos method were compared with each other, but none of them showed statistically significant superiority; therefore, all versions were selected for further research. The main experiment compared the proposed methods with various state-of-the-art methods: a single svm classifier, an ensemble classifier (rs), and two methods using feature selection (fs and fsirsvm). Statistical tests showed that the semoos variants outperform the rs and fsirsvm methods with statistical significance. The last element of the work was to compare the diversity of the models in the ensemble methods. The presented results show that the proposed methods are more diversified than the SOTA solutions.

Our methods could also be tested on multi-class problems, because data binarization is only a simplification of this problem. Due to the feature selection in the proposed methods and potentially good results for datasets with more features, it would be worthwhile to check more such datasets. If too few real datasets are available, it is worth considering artificially generated data, which also ensures controlled conditions for testing the methods. The comparison could also be made with other reference methods that have more elements in common with the semoos, semoos b, or semoos bp methods. A final direction for future work concerns the optimization itself: using other multi-objective optimization methods, e.g., moea/d, other metrics in the fitness functions, or increasing their number.