1 Introduction

Naïve Bayes (NB) is a classification technique that has played a prominent role in the literature. Hand and Yu (2001), Hastie et al. (2001) and Mehra and Gupta (2013) highlight its tractability, simplicity and efficiency. The implicit hypothesis of attributes independent conditionally on the class eases its implementation significantly, because the sample likelihood to be maximized can then be expressed as a product of univariate marginals. Moreover, this classifier is less prone to overfitting, since it estimates fewer parameters than other current classification techniques (Domingos and Pazzani 1997; Hand and Yu 2001). As a consequence, NB has been applied in many real contexts, for example genetics (Chandra and Gupta 2011; Minnier et al. 2015), medicine (see Wei et al. 2011; Rosen et al. 2010; Parthiban et al. 2011; Wolfson et al. 2015), risk (Minnier et al. 2015), reliability (Turhan and Bener 2009; Menzies et al. 2007) and document analysis (Bermejo et al. 2011; Guan et al. 2014), and a number of variants have been proposed in the literature (see Jiang et al. 2016; Boullé 2007; Wu et al. 2015; Yager 2006).

Although classifiers are built so that an overall performance measure is optimized, misclassification rates may differ across classes, and they may not be in accordance with the misclassification costs, since the classes of least interest may be much better classified than the critical ones. This is of particular concern in some real contexts, such as early detection of diseases (since fewer observations of the diseased population are often available), risk management and credit card fraud detection; see Carrizosa et al. (2008), He and Yunqian (2013), Prati et al. (2015) and Sun et al. (2009) for more details and applications. Consider, as an example, the well-referenced Breast Cancer Wisconsin (Diagnostic) dataset from the UCI repository (Lichman 2013). It is a slightly unbalanced dataset composed of 30 continuous variables and two classes: Benign (\(63\%\) of the total samples) and Malignant (\(37\%\)). It is worth remarking that, for this dataset, correctly classifying the Malignant class (the critical one) is more important than correctly classifying the Benign class. If the classic NB is run with both misclassification costs set equal, the estimated performance rate for the control group is about 0.96, higher than the rate for the sick group (0.89). One can easily modify the misclassification cost structure, but in this way only indirect control over the misclassification rates is obtained.

In this paper we propose a novel way of controlling misclassification rates which does not call for misclassification costs, which may be hard to choose and are not usually given (Sun et al. 2007, 2009). In particular, a new version of the NB is obtained by modeling performance constraints in which the Recall (the proportion of instances of a given class correctly classified) for the classes of interest is forced to be lower-bounded by certain thresholds. In this way, the user can assign different importance to the different classes according to her preferences. For example, in the previously considered Breast Cancer dataset, it may be desirable to increase the Recall for the Malignant class, which was equal to 0.89. As will be shown in Sect. 3, this rate can be increased up to 0.91. Another example where performance constraints are useful is when fair classification is required as a social criterion, and the sensitive groups should be protected to avoid discrimination by race or other sensitive attributes (Romei and Ruggieri 2014). Acceptable values for the Recall of groups at risk could be fixed via the method proposed in this work. A direct application of our proposal is handling highly unbalanced datasets, with two or more classes, where the inclusion of performance constraints allows us to improve the results for the worst-classified classes while controlling the Recall of the remaining classes.

The problem of cost imbalance has been addressed in the literature from two different perspectives: Data-Level techniques and Algorithm-Level approaches, see Leevy et al. (2018). Whereas the former include data sampling methods and feature selection, the latter encompass cost-sensitive and hybrid/ensemble methods which adapt the base classifier to overcome the imbalance. In particular, our approach can be seen as a cost-sensitive method. Cost-sensitive approaches have already been considered in the literature for well-known classifiers. For example, Datta and Das (2015), Carrizosa et al. (2008) and Lee et al. (2017) focus on the support vector machine (SVM) classifier. In Datta and Das (2015) a decision-boundary shift is combined with unequal misclassification penalties. In Carrizosa et al. (2008) a biobjective problem, which simultaneously minimizes the misclassification rates, is solved. In Lee et al. (2017), the authors propose a new weight adjustment factor applied to a weighted SVM. In the context of decision trees, Freitas et al. (2007) and Ling et al. (2004) introduce tree-building strategies which choose the splitting criterion by minimizing the misclassification costs, whereas Bradford et al. (1998) prunes subtrees according to the cost information. Cost-sensitive versions of neural networks for unbalanced data classification have also been studied in the literature (Cao et al. 2013; Zhou and Liu 2006). Other approaches can be found, for example, in Peng et al. (2014), where a new version of the so-called data gravitation-based classification model is proposed.

However, there is a lack of methodologies allowing the user to control the different performance measures of interest at the same time. The application of mathematical optimization tools, the approach that we undertake in this paper, seems to be a promising (Carrizosa and Romero Morales 2013) and not fully explored option: one overall criterion is optimized, while constraints are introduced in the model to demand admissible values for the efficiency measures under consideration. Recently, this approach has been considered both in classification (Benítez-Peña et al. 2019; Blanquero et al. 2021) and in regression (Blanquero et al. 2021). In this paper, this technique is explored to improve the NB performance in the classes of most interest to the user. It will be seen that, unlike the traditional NB, which is a two-step classifier (estimation first, classification next), the novel approach integrates both stages. In particular, maximum likelihood estimation is formulated as an optimization problem in which thresholds on classification rates are imposed. In other words, maximum likelihood estimates are replaced here by constrained maximum likelihood estimates, where the constraints control the Recall values of the classes of interest.

This paper is organized as follows. In Sect. 2 the NB is briefly reviewed and the proposed version of constrained NB (CNB from now on) is described. Section 3 illustrates the usefulness of our novel approach. Eight real databases with different sampling properties are thoroughly analyzed, and a detailed discussion concerning the Recall values of the proposed approach compared with the classic NB is given. Some conclusions and further related research are considered in Sect. 4.

2 The constrained Naïve Bayes

In our approach, the estimation is performed by solving a constrained maximum likelihood estimation problem, the constraints being related to thresholds on the Recall values of the different classes. The aim of this section is to describe the associated optimization problem. As a result, a computationally tractable classifier that allows the user to control its performance is obtained.

2.1 Preliminaries on NB classification

Consider a random vector \(({\mathbf {X}},Y)\), where \({\mathbf {X}}=\left( X_1,\ldots ,X_p\right) \) contains p features and Y identifies the class label. Assume that we have a single-label (one class label per observation) classification problem with K classes. Then, for each class \(k\in \{1,\ldots ,K\}\), let \(\pi _k\) denote the prior probability of the class, \(\pi _k=P(Y=k)\), and assume that \(X_j|(Y=k)\) has a probability density function \(f_{\theta _{jk}}(x)\), where \(\theta _{jk}\in \Theta _{jk}\). For \(k=1,\ldots ,K,\) define \(\varvec{\theta }_k=(\theta _{1k},\ldots ,\theta _{pk}).\)

Let \({\mathbf {x}}=(x_1,\ldots ,x_p)\) be a new observation, to be assigned to one of the K classes. Under the 0–1 loss function, Bayesian Decision Theory establishes that \({\mathbf {x}}\) is classified in the most probable class according to the conditional distribution. The estimation of the associated parameters may be cumbersome if the number of features p is large. However, the Bayes theorem, together with the assumption of independence (conditioned to the class), eases the estimation process. As is well known, the latter assumption implies that the joint density function can be expressed as

$$\begin{aligned} f(x_1, \ldots ,x_p, k)&= P(Y=k)f(x_1,\ldots ,x_p \mid k)\\&= \pi _k f_{\varvec{\theta }_{k}}({\mathbf {x}})\\&= \pi _k \prod _{j=1}^{p}f_{\theta _{jk}}(x_j), \end{aligned}$$

and thus the estimation process reduces to estimating the parameters of each marginal distribution. The NB classifier then assigns \({\mathbf {x}}\) to a class k satisfying

$$\begin{aligned} \pi _k \prod _{j=1}^{p}f_{\theta _{jk}}(x_j) \ge \pi _i \prod _{j=1}^{p}f_{\theta _{ji}}(x_j) \quad \forall i =1,\ldots ,K. \end{aligned}$$
(1)

Given a training sample of size \(N_1\), \(({\mathbf {x}}_1,k_1),\ldots ,({\mathbf {x}}_{N_1},k_{N_1})\), the parameter \(\varvec{\theta }= (\varvec{\theta }_1,\ldots ,\varvec{\theta }_K)\) is estimated in NB via maximum likelihood (Hogg et al. 2005), i.e. computed as the solution of the optimization problem:

$$\begin{aligned} \max _{\varvec{\theta }} \sum _{n=1}^{N_1} \log f_{\varvec{\theta }_{k_n}}({\mathbf {x}}_{n}) \end{aligned}$$
(2)

Therefore, the classic NB can be seen as a two-step classifier, where the model parameter is first estimated as \(\hat{\varvec{\theta }}\) from a training sample, and then (1) is applied under \(\varvec{\theta }=\hat{\varvec{\theta }}\).
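For concreteness, the following is a minimal R sketch of this two-step scheme with Gaussian marginals. The helper names nb_fit and nb_predict are ours, not part of any package, and X is assumed to be a numeric matrix with one row per observation.

```r
# Step 1: per-class maximum likelihood estimation of Gaussian marginals.
# Step 2: classification via rule (1), done in log space for numerical stability.
nb_fit <- function(X, y) {
  lapply(sort(unique(y)), function(k) {
    Xk <- X[y == k, , drop = FALSE]
    list(class = k,
         prior = nrow(Xk) / nrow(X),      # pi_k
         mu    = colMeans(Xk),            # ML means, one per feature
         sigma = apply(Xk, 2, sd))        # (sample) sds, one per feature
  })
}

nb_predict <- function(model, x) {
  scores <- sapply(model, function(m)
    log(m$prior) + sum(dnorm(x, m$mu, m$sigma, log = TRUE)))
  sapply(model, function(m) m$class)[which.max(scores)]
}
```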

2.2 A novel formulation with performance constraints

In order to calibrate the performance of a classifier, many measures have been defined in the literature, see Sokolova and Lapalme (2009). In particular, the so-called Recall\(_{k}\), for \(k=1,\ldots ,K,\) is defined as the sample fraction of individuals in class k which are correctly classified.

Given a validation sample of size \(N_2\), \(({\mathbf {x}}^{(k)}_1,k),\ldots ,({\mathbf {x}}^{(k)}_{N_{2,k}},k)\) for \(k=1,\ldots ,K\), where \(N_2=\sum _{k}N_{2,k}\) and \(N_{2,k}\) is the size of class k in the validation sample, the Recall for class k can be expressed as a function of \(\hat{\varvec{\theta }}\),

$$\begin{aligned} Recall _{k}({\hat{\varvec{\theta }}})=\frac{1}{N_{2,k}} \sum _{n=1}^{N_{2,k}} C_k({\hat{\varvec{\theta }}},\ {\mathbf {x}}^{(k)}_{n}),\ k=1,\ldots ,K, \end{aligned}$$
(3)

where

$$\begin{aligned} C_k({\hat{\varvec{\theta }}},\ {\mathbf {x}}^{(k)}_{n})= \left\{ \begin{array}{l@{\quad }l} 1 &{} \text {if individual } {\mathbf {x}}^{(k)}_{n} \text { is classified in class } k, \\ 0 &{} \text {otherwise.} \\ \end{array} \right. \end{aligned}$$
(4)
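In code, (3)–(4) amount to the following, reusing the hypothetical nb_predict helper from the sketch above:

```r
# Recall_k: fraction of validation observations of class k that rule (1)
# assigns back to class k, as in (3)-(4).
recall_k <- function(model, Xval, yval, k) {
  idx <- which(yval == k)                 # the N_{2,k} class-k rows
  mean(sapply(idx, function(n) nb_predict(model, Xval[n, ]) == k))
}
```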

Unlike the classic NB, based on a two-step approach, the CNB proposed in this paper integrates the performance of the classifier [according to expression (3)] within the estimation step. In particular, the aim is to estimate \(\varvec{\theta }\) as the solution of an optimization problem whose objective function is given by a training sample of size \(N_1\) as in (2), while, to prevent overfitting, constraints on (3) are imposed on an independent sample (validation set) of size \(N_2=\sum _{k=1}^{K}N_{2,k}\),

$$\begin{aligned} \begin{aligned} \underset{\varvec{\theta }}{\max }&\sum _{n=1}^{N_1} \log f_{\varvec{\theta }_{k_n}}({\mathbf {x}}_{n})\\ \text{ s.t. }\qquad&\dfrac{1}{N_{2,k}} \sum _{n=1}^{N_{2,k}} C_k(\varvec{\theta },\ {\mathbf {x}}^{(k)}_{n}) \ge \alpha _{k},\quad k=1,\ldots ,K. \end{aligned} \end{aligned}$$
(CNB)

In the previous CNB optimization problem, \(\alpha _k\in (0,1)\) is a threshold, a lower-bound value close to 1, for \(k=1,\ldots ,K\), fixed by the user according to her requirements on the classification of the different classes. From the optimization point of view, we assume that the function \(f_{\varvec{\theta }_{k_n}}\) is smooth with respect to the parameter \(\varvec{\theta }_{k_n}\). The constraints, however, are not smooth, since each \(C_k\) is a step function, and therefore gradient methods cannot be applied to solve Problem (CNB). This makes the resolution of (CNB) slow, especially for large datasets. However, a proxy version of (CNB) can be written in a more tractable way if the constraints are reformulated in terms of smooth functions as

$$\begin{aligned} {\widetilde{C}}_k(\varvec{\theta }, {\mathbf {x}}^{(k)}; \lambda )=\prod _{i=1, i\ne k}^{K} F(y_{ki}(\varvec{\theta }, {\mathbf {x}}^{(k)});\lambda ), \end{aligned}$$
(5)

where \(F(y;\lambda )=\frac{1}{1+e^{-\lambda y}}\) is the sigmoid function and

$$\begin{aligned} y_{ki}(\varvec{\theta }, {\mathbf {x}})=\pi _k \prod _{j=1}^{p}f_{\theta _{jk}}(x_j) - \pi _i \prod _{j=1}^{p}f_{\theta _{ji}}(x_j). \end{aligned}$$
(6)

On the one hand, from the definition of the sigmoid function it can be seen that \(\lim _{\lambda \rightarrow \infty } {\widetilde{C}}_k(\varvec{\theta }, {\mathbf {x}}^{(k)}; \lambda ) = C_k(\varvec{\theta }, {\mathbf {x}}^{(k)})\), since for large values of \(\lambda \), \(F(y_{ki}(\varvec{\theta }, {\mathbf {x}}^{(k)});\lambda )\) only takes values close to 0 or 1, depending on the sign of \(y_{ki}(\varvec{\theta }, {\mathbf {x}}^{(k)})\). Thus, \(\lambda \) is a hyperparameter chosen large enough so that C and \({\widetilde{C}}\) are as close as possible. On the other hand, the reason for using the product in the definition of \({\widetilde{C}}\) is the following: if any class i has an associated density much greater than that of class k, then \(y_{ki}\) takes a large negative value, which makes \(F(y_{ki}(\varvec{\theta }, {\mathbf {x}}^{(k)});\lambda )\) close to 0, and therefore \({\widetilde{C}}_k(\varvec{\theta },\ {\mathbf {x}}^{(k)};\lambda )\) is also close to 0. From the previous discussion, a differentiable version of the CNB problem is obtained as

$$\begin{aligned} \begin{aligned} \underset{\varvec{\theta }}{\max }\qquad&\sum _{n=1}^{N_1} \log f_{\varvec{\theta }_{k_n}} ({\mathbf {x}}_{n})\\ \text {s.t.}\qquad&\dfrac{1}{N_{2,k}} \sum _{n=1}^{N_{2,k}} {\tilde{C}}_k(\varvec{\theta },\ {\mathbf {x}}^{(k)}_{n}) \ge \alpha _{k},\quad k=1,\ldots ,K. \end{aligned} \end{aligned}$$
(SCNB)

The smooth formulation (SCNB) can be solved using efficient solvers for nonlinear constrained programming [see, e.g. Birgin and Martínez (2008)]. From now on, we refer to (SCNB) as our optimization problem.
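To fix ideas, here is a minimal R sketch, not the authors' code, of the surrogate (5)–(6) together with one possible call to nloptr::auglag (one of the solvers named in Sect. 3.2.1). The helpers log_dens and smoothed_recall are assumptions on our part, \(\varvec{\theta }\) is packed into a plain numeric vector, and the sign convention of the hin argument should be checked against the installed nloptr version.

```r
# log_dens(theta, x, k) is an assumed helper returning
# log(pi_k) + sum_j log f_{theta_jk}(x_j) for a packed parameter vector theta.
library(nloptr)

sigmoid <- function(y, lambda) 1 / (1 + exp(-lambda * y))      # F(y; lambda)

C_tilde <- function(theta, x, k, K, lambda, log_dens) {
  prod(sapply(setdiff(1:K, k), function(i) {
    y_ki <- exp(log_dens(theta, x, k)) - exp(log_dens(theta, x, i))  # (6)
    sigmoid(y_ki, lambda)                                            # F(y_ki; lambda)
  }))
}

# smoothed_recall(theta, k): average of C_tilde over the class-k validation
# points, i.e. the left-hand side of the constraints in (SCNB).
fit_scnb <- function(theta0, neg_loglik, smoothed_recall, alpha) {
  res <- auglag(
    x0  = theta0,
    fn  = neg_loglik,                       # auglag minimizes, so we negate (2)
    hin = function(theta)                   # one component per class; check the
      sapply(seq_along(alpha), function(k)  # hin sign convention of your nloptr
        smoothed_recall(theta, k) - alpha[k]),
    localsolver = "LBFGS")
  res$par
}
```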

Some important remarks need to be made at this point. The first one regards the feasibility of (SCNB). In a real application, the threshold values \(\alpha _1,\ldots ,\alpha _K\) have to be fixed. As a first option, they could be fixed by the user according to her demand, but then (SCNB) might be infeasible. For that reason, we propose a procedure for determining the thresholds in such a way that (SCNB) is always feasible. Consider a dataset with K different classes, let \(\varvec{\theta ^*}\) be the model parameter associated with (2) and let \(k_0\) be the critical class, i.e. the class where the method performs the worst. Suppose that the aim is to improve the Recall of class \(k_0\), say

$$\begin{aligned} \alpha _{k_0}=\frac{1}{N_{2,k_0}} \sum _{n=1}^{N_{2,k_0}} {\tilde{C}}_{k_0}(\varvec{\theta }^*,\ {\mathbf {x}}^{(k_0)}_{n})+\Delta , \end{aligned}$$

with \(\Delta >0\). Then, in order to determine the maximum threshold \(\tau \) attainable for the other classes \(k \ne k_0, k\in \{1,\ldots ,K\}\), the following optimization problem can be solved:

$$\begin{aligned} \underset{\varvec{\theta }, \tau }{\max }&\tau \\ \text{ s.t. } \qquad&\frac{1}{N_{2,k_0}} \sum _{n=1}^{N_{2,k_0}} {\tilde{C}}_{k_0}(\varvec{\theta },\ {\mathbf {x}}^{(k_0)}_{n}) \ge \frac{1}{N_{2,k_0}} \sum _{n=1}^{N_{2,k_0}} {\tilde{C}}_{k_0}(\varvec{\theta }^*,\ {\mathbf {x}}^{(k_0)}_{n})+ \Delta \\&\frac{1}{N_{2,k}} \sum _{n=1}^{N_{2,k}} {\tilde{C}}_{k}(\varvec{\theta },\ {\mathbf {x}}^{(k)}_{n}) \ge \tau , \,\,\,\,\, \forall k \ne k_0. \end{aligned}$$

In this way we search for estimates \(\varvec{\theta }\) such that the Recall of the relevant class \(k_0\) is improved by at least \(\Delta \) with respect to the Recall under the traditional Bayes estimate, while the minimum Recall over the remaining classes is maximized, as sketched below.
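A sketch of this threshold-finding problem under the same assumed helpers; \(\tau \) is appended as the last coordinate of the decision vector.

```r
# max-tau problem: reuse the hypothetical smoothed_recall helper; the same
# hin sign caveat as in the previous sketch applies.
fit_tau <- function(theta0, smoothed_recall, k0, base_k0, Delta, K) {
  x0 <- c(theta0, 0.5)                     # last coordinate is tau
  res <- nloptr::auglag(
    x0 = x0,
    fn = function(x) -x[length(x)],        # maximize tau
    hin = function(x) {
      theta <- x[-length(x)]; tau <- x[length(x)]
      c(smoothed_recall(theta, k0) - (base_k0 + Delta),  # critical class k0
        sapply(setdiff(1:K, k0),                         # remaining classes
               function(k) smoothed_recall(theta, k) - tau))
    },
    localsolver = "COBYLA")
  list(theta = res$par[-length(res$par)], tau = res$par[length(res$par)])
}
```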

Secondly, it should be highlighted that the parameters \(\alpha _1,\ldots , \alpha _K\) involved in the model have a clear interpretation (the desired Recall for each class), while allowing the user full control over all of them. The third comment concerns the size of the dataset in terms of the number of predictor variables. Problem (SCNB) can be addressed when the number of features p is large. However, to alleviate the computational cost and thus improve the running times, we propose, as part of the procedure, a pre-processing step that selects relevant predictors for large datasets. This step is explained in more detail in Sect. 3.2. Finally, the fourth remark concerns the solutions of (SCNB), which are no longer maximum likelihood estimates, but constrained maximum likelihood estimates: the problem yields the solution with the highest sample likelihood fulfilling the performance constraints on the independent sample. To the best of our knowledge, this is a novel approach that has never been considered in NB models.

3 Numerical results

In this section, eight datasets from the UCI Machine Learning Repository and the KEEL open-source repository (Alcalá-Fdez et al. 2009, 2011), diverse in number of classes, size and imbalance ratio, shall be analyzed. The description of the datasets can be found in Sect. 3.1, and the numerical experiments and the obtained results are presented in Sects. 3.2 and 3.3, respectively.

3.1 Datasets

The datasets breast cancer, SPECTF, page-blocks, abalone, yeast, Satimage, RCV1 and letter will be considered. From all the available versions of these datasets, we have chosen those described in Table 1. The columns report the dataset name, the number of instances, the number of features and, finally, the class split of the eight considered datasets (page-blocks, abalone, yeast, Satimage and RCV1 can be considered unbalanced datasets).

Table 1 Datasets description

3.2 Design of experiments

3.2.1 Probability distributions setting and resolution of the optimization problem

As commented in Sect. 2.1, a probability model needs to be selected for the features conditioned to the class. For continuous features, in this paper we assume the normal distribution. For discrete features, we consider the categorical distribution, and the Poisson distribution for non-negative integers. From the optimization point of view, (SCNB) will be solved using solvers for smooth optimization. In particular, the auglag and mma functions from the R package nloptr are used to obtain all numerical results.

3.2.2 Estimation of the performance rates

The performance of the proposed classifier is estimated using stratified Monte-Carlo cross-validation with 25 runs (Xu and Liang 2001). The dataset is split into three sets: the so-called training, validation and testing sets. One-third of the dataset is used as the testing set, and the remaining two-thirds are used for the training and validation sets. Specifically, the training set is formed by two-thirds of those two-thirds of the dataset, whereas the remaining one-third is used as the validation set. As explained in Sect. 2, the objective function is optimized on the training set while the constraints are evaluated on the validation set. Once the SCNB problem is solved, Recall values are estimated on the testing set. It must be highlighted that at each run the training sample is built in a stratified way, so that the proportion of samples per class is similar to the proportions shown in Table 1. Finally, regarding the hyperparameter \(\lambda \), after an extensive simulation study over a wide grid of values, \(\lambda = 2^3\) is set in the experiments, since it provides a good match between C and \({\tilde{C}}\) as defined in (4) and (5).
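A sketch of one Monte-Carlo run of this stratified split (the rounding details are our assumption):

```r
# Split indices into test (1/3) and, within the remaining 2/3, training (2/3)
# and validation (1/3), preserving class proportions class by class.
stratified_split <- function(y) {
  idx <- seq_along(y); parts <- list(train = c(), valid = c(), test = c())
  for (k in unique(y)) {
    ik <- sample(idx[y == k])                       # shuffle within class k
    n  <- length(ik); n_test <- round(n / 3); n_val <- round((n - n_test) / 3)
    parts$test  <- c(parts$test,  ik[seq_len(n_test)])
    parts$valid <- c(parts$valid, ik[n_test + seq_len(n_val)])
    parts$train <- c(parts$train, ik[-seq_len(n_test + n_val)])
  }
  parts
}
```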

3.2.3 Pre-processing for large datasets

As commented at the end of Sect. 2.2, Problem (SCNB) turns out to be computationally costly for large datasets such as the considered RCV1 dataset. As is common in the literature [see Leevy et al. (2018) and references therein], we suggest pre-processing such datasets so that irrelevant variables are removed in a first step prior to the resolution of (SCNB). That is, at each fold of the stratified 25-run Monte-Carlo cross-validation previously commented, the importance of the predictor variables is measured on the training set, and the predictor variables with low importance are not considered when solving Problem (SCNB). Specifically, in this work the importance of the predictor variables composing RCV1 was measured using the R function information.gain from the FSelector package. In this case, most of the variables have an associated importance close to 0 and hence only 392 variables are kept when solving (SCNB) for the RCV1 dataset.
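A sketch of this pre-processing step on one training fold; information.gain and cutoff.k are functions from the FSelector package, and the column name Class is an assumption about the data frame layout:

```r
# Keep the 392 predictors of RCV1 with the highest information gain,
# measured on the training fold only (Class is the assumed label column).
library(FSelector)

select_features <- function(train_df, n_keep = 392) {
  w <- information.gain(Class ~ ., data = train_df)   # importance per predictor
  cutoff.k(w, n_keep)                                 # names of the n_keep best
}
```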

3.2.4 The choice of thresholds

In order to select the threshold values \(\alpha _k\) in Problem (SCNB), the classic NB classifier (2) was first run. Table 2 shows the Recall estimates for each class. For the letter dataset, the average Recall values of the classic NB appear in the first row of Table 4.

Table 2 Average Recall of classic NB (25 Monte-Carlo cross-validation)

Throughout this work we consider the classes where the classic NB performs the worst as the classes of interest (or at risk), and thus the aim is to improve the rates for such classes. From the results in Table 2 and the first row of Table 4, the set of thresholds to be tested in the numerical experiments is given in Table 3 and the second row of Table 4. Specifically, the target rates for the classes with the worst associated Recall are selected by increasing, in steps of two percentage points, the results obtained by the classic classifier, whereas admissible values for the remaining classes are also fixed.

Additionally, to highlight the versatility of our proposal, for three of the datasets (page-blocks, yeast and letter) we aim to improve the Recall of more than one class at the same time. For instance, for the yeast dataset we improve the Recall of the classes CYT and NUC, the two classes in the dataset with the lowest Recall values. We first solve Problem (SCNB) with thresholds 0.060 for CYT and 0.340 for NUC, and then solve it again imposing 0.080 for CYT and 0.360 for NUC.

Table 3 Tested thresholds
Table 4 Average Recall of classic NB (25 Monte-Carlo cross-validation) and tested thresholds for letter dataset

3.3 Results

The estimated rates are reported in Tables 5, 6, 7, 8, 9, 10, 11 and 12. The first row shows the results for the classic NB, when no thresholds are imposed. The first column shows the imposed thresholds for the Recall of each class; the column and thresholds in bold correspond to the classes at risk (where the classic NB presents the poorest performance). For example, in Table 6 it is required that the Recall of the Normal class be at least 0.900, while the threshold for the Abnormal class varies from 0.660 to 0.700. The remaining columns, except for the last one, provide the average Recall values measured on the test set. Finally, the last column contains the value of the micro-averaged \(F_1\) (Yang and Liu 1999), an aggregate performance measure of the classifier. From the \(F_1\) values, the sign test was used to assess whether the two approaches are statistically significantly different. The significance codes are as follows: '**', '*' and '.' mean that the p-value is smaller than 0.01, 0.05 and 0.1, respectively.
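The sign test here reduces to a binomial test on the signs of the paired \(F_1\) differences; a sketch, where f1_scnb and f1_nb are assumed to be the vectors of 25 paired micro-averaged \(F_1\) scores:

```r
# Sign test on paired F1 scores from the 25 Monte-Carlo runs.
sign_test <- function(f1_scnb, f1_nb) {
  d <- (f1_scnb - f1_nb)[f1_scnb != f1_nb]   # drop ties, keep signed differences
  binom.test(sum(d > 0), length(d))          # p = 0.5, two-sided by default
}
```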

Table 5 Average Recall values of SCNB (25 Monte-Carlo cross-validation) for breast cancer
Table 6 Average Recall values of SCNB (25 Monte-Carlo cross-validation) for SPECTF

As expected, the results under the constrained NB version differ from those provided by the classic NB. For example, for the page-blocks dataset, the Recall values under the classic NB are 0.915, 0.673, 0.644, 0.942 and 0.400 for the text, horiz. line, graphic, vert. line and picture classes, respectively (Table 7). As commented before, we are interested in increasing the Recall of the worst-classified classes. According to Table 7, if the minima 0.710, 0.680 and 0.440 are imposed for the horiz. line, graphic and picture classes, the final rates change from 0.673 to 0.697, from 0.644 to 0.694 and from 0.400 to 0.457, respectively. Two facts concerning these results deserve attention. First, better rates for the horiz. line, graphic and picture classes have been obtained, but at the expense of slightly decreasing the rates of the remaining classes. Second, even though a rate of 0.710 was imposed for the horiz. line class, a slightly smaller value (0.697) was finally obtained. This is not surprising, since the constraints are imposed on one sample and evaluated on an independent set.

Table 7 Average Recall values of SCNB (25 Monte-Carlo cross-validation) for page-blocks
Table 8 Average Recall values of SCNB (25 Monte-Carlo cross-validation) for abalone
Table 9 Average Recall values of SCNB (25 Monte-Carlo cross-validation) for yeast
Table 10 Average Recall values of SCNB (25 Monte-Carlo cross-validation) for Satimage
Table 11 Average Recall values of SCNB (25 Monte-Carlo cross-validation) for RCV1 using 392 variables of the total
Table 12 Average Recall values of SCNB (25 Monte-Carlo cross-validation) for letter

From the results shown in Tables 5, 6, 7, 8, 9, 10, 11, and 12, it can be concluded that the proposed approach allows the user to control the Recall values in such a way that the classes of interest, where the classic method performs the worst in this case, can be improved. Additionally, our approach reaches comparable or even better overall results than the classic NB [see micro \(F_1\) scores throughout Tables 5, 6, 7, 8, 9, 10, 11, and 12]. Note that among the possible non-dominated solutions shown for each dataset, the user could choose according to her interest and to what she is willing to lose in the less critical classes.

Fig. 1 Scalability: the X-axis represents the number of instances (ranging from 500 to 20,000), whereas each line corresponds to a number of features (ranging from 10 to 1000)

Finally, to illustrate the computational cost of the optimization algorithm depending on the number of instances and features, we simulated data following Witten et al. (2014) with the number of instances in \(\{500, 1000, 3000, 5000, 10{,}000, 15{,}000, 20{,}000\}\) and \(p\in \{10, 50, 100, 300, 500, 700, 900, 1000\}\). Figures 1 and 2 report the logarithm of the user times (in seconds) when the SCNB is run on an Intel(R) Core(TM) i7-7500U CPU at 2.70 GHz with 8.0 GB of RAM, with the number of evaluations for the auglag algorithm fixed at 100. The X-axis of Fig. 1 shows the number of instances, whereas each line corresponds to a number of variables (p); Fig. 2 reverses these roles. Overall, the running time grows linearly with respect to the number of instances, but less smoothly as p increases.

Fig. 2 Scalability: the X-axis represents the number of features (ranging from 10 to 1000), whereas each line corresponds to a number of instances (ranging from 500 to 20,000)

4 Conclusions and extensions

In this paper a new version of the NB classifier is proposed with the aim of controlling misclassification rates in the different classes, avoiding the use of precise misclassification costs, which may be hard to choose. To achieve this goal, performance constraints are included in the optimization problem that estimates the model parameters. The approach results in a novel method (SCNB) not previously reported in the literature, to the best of our knowledge. Unlike the classic NB, which is based on a two-step approach, the SCNB integrates the performance rates into the parameter estimation step. In fact, this novel approach allows the user to impose thresholds that ensure the desired values of the efficiency measures (in this case, the Recall values). The proposed methodology has been tested on eight real datasets with different sampling properties. The numerical results show that not only can the classification rates of interest be controlled and improved, but also overall results similar to, or even better than, those of the classic NB are obtained. Such control is of great interest in medical, credit scoring or social contexts where some classes are more critical than others.

A possible extension of this work is to consider nonparametric estimation of the density function for continuous attributes via kernel density estimation. Also, one anonymous referee suggested measuring the efficiency of the approach via statistical tests in the same spirit as Demšar (2006). Work on these issues is underway.