1 Introduction

Machine learning algorithms have been successfully applied to many complex problems in recent years, including intelligent analysis in medical diagnosis (Lavrač 1999; Kononenko 2001). Different medical fields have especially benefited from machine learning methods, e.g., oncology diagnosis (Michalski et al. 1986), the diagnosis of breast cancer recurrence (Štrumbelj et al. 2010), lung cancer diagnosis (Zięba et al. 2014), cDNA microarray data analysis (Pearson et al. 2003), toxicology analysis (Blinova et al. 2003), and supporting diabetes treatment (Tomczak and Gonczarek 2013). Among others, there are two crucial issues in extracting diagnostic models from medical data. First, there is a need for learning comprehensible models that provide interpretable knowledge to human doctors (Lavrač 1999). Second, medical data are recognized to be imbalanced (Mac Namee et al. 2002), which means that the number of examples from one class (e.g., healthy patients) is significantly higher than the number of examples from the other class (e.g., ill patients). Within the imbalanced data phenomenon two subproblems can be distinguished (Japkowicz 2001; He and Garcia 2009): (i) the between-class imbalance problem—the data set exhibits an unequal distribution between its classes, and (ii) the within-class imbalance problem—an unequal distribution of examples among subconcepts within a class. Here we discuss the imbalanced data phenomenon in the medical domain only; however, the problem is widely encountered in other applications such as credit scoring (Brown and Mues 2012) or fraud detection in telecommunications (Fawcett and Provost 1997).

In order to obtain comprehensible models, classification rules or classification trees are typically used. There are different approaches to learning classification rules from data. One of the most traditional rule induction approaches is based on a separate-and-conquer strategy (see Fürnkranz 1999 for more details), e.g., AQ (Michalski et al. 1986), RIPPER (Cohen 1995) and OneR (Holte 1993). Another traditional approach to rule extraction uses search in the version (hypothesis) space, which was applied in the Candidate Elimination Algorithm (CEA) (Mitchell 1997) and in the JSM method (Blinova et al. 2003).

A different approach to learning rules takes advantage of association rules. The core of this approach is to mine a special subset of association rules whose right-hand sides are restricted to the class label (Liu et al. 1998). This line of research has been extended by applying fuzzy set theory (Chen and Chen 2008) or by generalizing to the multi-class problem and imbalanced data (Cerf et al. 2013).

The problem of rule induction can also be cast in the framework of rough sets (Pawlak et al. 1995). The general idea is to utilize rough set theory in order to obtain certain and approximate decision rules on the basis of approximations of decision classes, e.g., LEM1 and LEM2 (Stefanowski 1998) and their extension VC-DomLEM (Błaszczyński et al. 2011), or a dedicated method for imbalanced data (Stefanowski and Wilk 2006).

The accuracy of learned classification rules can be increased by analyzing their statistical properties. Research in this direction has led to many interesting methods, e.g., finding rules minimizing the difference between the rule margin and variance (Rückert and Kramer 2008), obtaining data-dependent generalization bounds (Vorontsov and Ivahnenko 2011), or applying the Minimum Description Length paradigm (Vreeken et al. 2011).

In the machine learning community, it has been shown that classification rules (trees) provide rather mediocre predictive performance in comparison to strong classifiers such as Support Vector Machines (SVMs), Neural Networks or deep learning models (Kotsiantis 2007). One possible way of improving their accuracy while maintaining their comprehensibility is the application of ensemble learning techniques. Early approaches combined classification rules with various decision making procedures, e.g., majority voting (Kononenko 1992) and bagging (Breiman 1996). A different technique aimed at a probabilistic combination of decision trees by tree averaging (Buntine 1992). More recently, Bayesian Model Averaging (BMA) of classification rules was applied (Domingos 1997, 2000). The main idea of this approach was to combine several sets of classification rules induced with well-known rule induction algorithms (e.g., rules extracted from a C4.5 decision tree). Nevertheless, it was noticed that BMA of classification rules suffered from overfitting (Domingos 2000). This result was later explained by the observation that BMA cannot be applied as a model combination (Minka 2000), and thus a different approach should be used. We propose a new form of probabilistic combination for classification rules.

Most standard learning methods assume balanced datasets and/or equal misclassification costs. However, in the case of imbalanced datasets they fail to learn regularities within the data, which results in biased predictions across classes. Hitherto, a number of attempts to deal with the imbalanced data problem have been proposed (He and Garcia 2009; Japkowicz and Stephen 2002), e.g., sampling methods for imbalanced data (Kubat and Matwin 1997) and cost-sensitive solutions (Elkan 2001) such as cost-sensitive Naïve Bayes (Gama 2000) and cost-sensitive SVMs (Masnadi-Shirazi and Vasconcelos 2010). Recently, a number of ensemble learning methods designed for imbalanced datasets have been proposed (Wang and Japkowicz 2010; Galar et al. 2012; Zięba et al. 2014).

In this paper, we present a probabilistic combination of classification rules for dealing with both stated issues. The problem of model interpretability is solved by applying classification rules. In order to increase the predictive performance we use probabilistic reasoning for combining if-then rules. Next, we reduce the imbalanced data phenomenon by modifying the Bayesian estimator for categorical features, also known as the m-estimate (Cestnik 1990; Džeroski et al. 1993; Fürnkranz and Flach 2003; Lavrač 1999; Zadrozny and Elkan 2001), with different misclassification costs.

Another issue we consider in this paper is a method for aggregating the data for further reasoning. Typically, observations are stored together with their numbers of occurrences. However, such an approach, implemented naively, may be very cumbersome and require a lot of computational resources. In order to increase the efficiency of the data aggregation process, we propose a modification of graph-based data aggregation (Tomczak and Gonczarek 2013).

The contribution of the paper is as follows:

  • Conjunctive features as latent variables representing hidden relationships among features are presented.

  • Soft rules, a probabilistic version of the classification rules, are proposed.

  • The manner of the combination of the soft rules is outlined.

  • The modification of the m-estimate for imbalanced data problem is introduced.

  • The modification of graph-based data aggregation (later referred to as graph-based memorization) for reducing memory complexity of storing observations is outlined.

  • The proposed approach is applied to the medical diagnosis.

The paper is organized as follows. In Sect. 2.1 a combination of classification rules is outlined. In Sect. 2.2 a new approach to the probabilistic combination of classification rules is described, including the introduction of conjunctive features and soft rules. Next, Sect. 2.3 explains the estimation of probabilities, including the proposed m-estimate for imbalanced data, the object probability, and the prior probability of conjunctive features. In Sect. 2.4 the graph-based memorization process is described and the way it is used in the estimation process is outlined. In Sect. 3 we study our model’s performance in a simulation study (Sect. 3.1) and on benchmark datasets (Sect. 3.2) and medical datasets (Sect. 3.3), and discuss the results (Sect. 3.4). Finally, conclusions are drawn in Sect. 4.

2 Methodology

2.1 Combination of crisp rules

Let \({\mathbf {x}} \in {\mathcal {X}}\) be an object described by D attributes, where each \(x_{d}\), \(d=1, \ldots , D\), can take only one of \(K_d\) possible values. We will write \(x_{d}^{k}\) if \(x_{d} = k\). We denote the total number of all possible values of attributes by K, i.e., \(\sum _{d} K_{d} = K\). Let \(y \in \{ - 1, 1 \}\) be the class label of \({\mathbf {x}}\). We refer to the examples with class label \(y=-1\) as negative and to those with label \(y=1\) as positive. Additionally, we assume that the minority class (less frequent in the training data) is labeled with \(y=1\).

A classification rule r can be defined in two equivalent manners (Mitchell 1997). The first represents the classification rule as a binary-valued function \(r: {\mathcal {X}} \rightarrow \{-1,1\}\). Alternatively, the classification rule can be defined as an if-then rule in which the antecedent, \(a_{r}\), is a conjunction of features’ values, and the consequent, \(c_{r}\), is a specific value of the class label. For example, for \({\mathbf {x}} \in \{1,2,3\}\times \{1,2\}\), an exemplary if-then classification rule is \(\text {IF } x_{1}^{1} \wedge x_{2}^{1} \text { THEN } y = 1\), where \(a_{r} =\)\(x_{1}^{1} {\wedge } x_{2}^{1}\)” and \(c_{r} = 1\). The key advantage of applying classification rules to a prediction problem is that they are easily interpretable, i.e., they form a comprehensible model, and the final decision can be straightforwardly explained. Further, we assume a finite set of if-then rules or the space of all possible if-then rules for a given feature space \({\mathcal {X}}\). In both cases we denote this set as \({\mathcal {R}}\).

In theory, the set of classification rules should cover the whole feature space and the rules should not contradict one another, i.e., for the same features’ values two (or more) different rules must return the same class label. However, in practice this condition may be violated, e.g., when several sets of rules are combined together (Kononenko 1992). Therefore, it is convenient to treat the set of classification rules as an ensemble classifier (sometimes called a model combination) (Dembczyński et al. 2008). Before introducing the combination of rules let us re-define the classification rule, which we refer to as a crisp rule.Footnote 1 The crisp rule is a function \(f:{\mathcal {X}}\times \{-1,1\} \times {\mathcal {R}} \rightarrow \{0, 1\}\) in the following form:

$$\begin{aligned} f({\mathbf {x}},y,r) = \left\{ \begin{array}{rl} 1, &{}\quad \text {if } {\mathbf {x}} \text { is covered by } a_{r} \text { and } y = c_{r} \\ 0, &{}\quad \text {otherwise} \\ \end{array} \right. \end{aligned}$$
(1)

where \(r \in {\mathcal {R}}\) is a rule in the set of all possible if-then rules. For further simplicity we will denote \(f({\mathbf {x}},y,r) \mathop {=}\limits ^{\varDelta } f_{r}({\mathbf {x}},y)\).

In fact, the crisp rule is a standard boolean-valued function, but it also takes the class label as an argument. This definition is useful in the multi-class setting, i.e., when there are more than two possible values of the class label. Moreover, the crisp rule represents a transformation of the if-then rule into a form that will be useful in the combined classifier.
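To make the definition concrete, a minimal Python sketch of a crisp rule follows; the dictionary encoding of objects (attribute index d mapped to its value k) and the helper name are our own illustrative choices, not part of the original formulation.

```python
# A minimal sketch of a crisp rule (Eq. 1); objects are encoded as dicts {d: k}.

def make_crisp_rule(antecedent, consequent):
    """antecedent: dict {d: k} of fixed attribute values; consequent: class label in {-1, 1}."""
    def f(x, y):
        covered = all(x[d] == k for d, k in antecedent.items())
        return 1 if covered and y == consequent else 0
    return f

# IF x_1 = 1 AND x_2 = 1 THEN y = 1, for x in {1,2,3} x {1,2}
rule = make_crisp_rule({1: 1, 2: 1}, consequent=1)
print(rule({1: 1, 2: 1}, 1))   # 1: the antecedent covers x and the label matches
print(rule({1: 1, 2: 2}, 1))   # 0: the antecedent does not cover x
```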

An ensemble classifier is a weighted combination of base models. If we consider the crisp rules as the base models, then we can combine all crisp rules in \({\mathcal {R}}\) which yields the following ensemble classifier:

$$\begin{aligned} g({\mathbf {x}},y) = \sum _{r\in {\mathcal {R}}} w_{r} f_{r}({\mathbf {x}},y), \end{aligned}$$
(2)

where \(w_{r} \in {\mathbb {R}}_{+}\) is a tunable parameter which denotes a weight of the \(r{\text {th}}\) rule. The final decision is the class label with the highest value of the combination \(g({\mathbf {x}},y)\):

$$\begin{aligned} y^{*} = \arg \max _{y} g({\mathbf {x}},y). \end{aligned}$$
(3)

The combination of crisp rules retains the interpretability of classification rules, where the weights can be seen as the confidence levels of the rules. However, there are three problems associated with learning a combination of crisp rules. First, the crucial issue is how to determine the set of rules \({\mathcal {R}}\). The simplest approach is to apply a rule induction algorithm, or several such algorithms, in order to obtain \({\mathcal {R}}\). However, it has been shown that such a technique may give unsatisfactory results (Domingos 1997). On the other hand, summing over all possible crisp rules for all possible class labels is practically intractable. The second issue concerns learning the tunable parameters. There are different ensemble learning methods, e.g., bagging (Breiman 1996) and boosting (Freund and Schapire 1997). Nonetheless, since the base learners correspond to fixed, i.e., non-learnable, crisp rules, it is more appropriate to apply other learning techniques such as stacking (Wolpert 1992) or combinations based on statistical properties of the rules (Kononenko 1992). However, these learning procedures do not result in satisfactory predictive performance because in some cases they can decrease classification accuracy (Kononenko 1992) or even lead to overfitting (Domingos 2000). The third problem is that crisp rules assign only one class label to all objects they cover. This may be problematic in decision support systems in which a human expert would like to know the class label together with its certainty level. These three issues are addressed in our probabilistic approach to the classification rules combination, which we refer to as the soft rules combination.
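For concreteness, a hedged sketch of the weighted combination (2) and the decision rule (3) is given below; it reuses the make_crisp_rule helper from the previous sketch and is only an illustration of the formulas, not the learning procedure itself.

```python
# Sketch of the weighted combination of crisp rules (Eqs. 2-3);
# `rules` is a list of (weight, crisp_rule) pairs.

def combine_crisp(rules, x, labels=(-1, 1)):
    g = {y: sum(w * f(x, y) for w, f in rules) for y in labels}
    return max(g, key=g.get)  # Eq. (3): the label with the highest combined score

rules = [(1.0, make_crisp_rule({1: 1}, 1)),
         (0.5, make_crisp_rule({2: 2}, -1))]
print(combine_crisp(rules, {1: 1, 2: 2}))  # 1, since the first rule carries more weight
```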

2.2 Combination of soft rules

Let us consider the object \({\mathbf {x}}\) and the class label y as random variables. Additionally, we introduce a new random variable, which we refer to as a conjunctive feature, \(\varphi \), that corresponds to a conjunction of one or more features with fixed values. Moreover, we assume that each of the features can take only one value within the conjunctive feature. The set of all possible conjunctive features is denoted by \({\mathcal {F}}\). For example, for \({\mathbf {x}} \in \{1,2,3\}\times \{1,2\}\), a conjunctive feature can be \(\varphi = x_{1}^{1} \wedge x_{2}^{1}\), but \((x_{1}^{1} \vee x_{1}^{2}) \wedge x_{2}^{1}\) is not a conjunctive feature according to our definition. The conjunctive feature \(\varphi \) defines a set, namely:

$$\begin{aligned} {\mathcal {X}}_{\varphi } = \{ {\mathbf {x}} \in {\mathcal {X}} : {\mathbf {x}} \text { is covered by }\varphi \}. \end{aligned}$$
(4)

For example, for \({\mathbf {x}} \in \{1,2,3\}\times \{1,2\}\) and the conjunctive feature \(\varphi = x_{1}^{1}\) one gets \({\mathcal {X}}_{\varphi } = \{ (x_{1}^{1}, x_{2}^{1}), (x_{1}^{1}, x_{2}^{2}) \}\).
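A brief sketch of how the coverage set \({\mathcal {X}}_{\varphi }\) from (4) can be enumerated for a small categorical domain is shown below; the dictionary encoding and the helper name are illustrative assumptions.

```python
# Enumerating the coverage set X_phi (Eq. 4) of a conjunctive feature phi.
from itertools import product

def coverage_set(phi, domains):
    """phi: dict {d: k} of fixed values; domains: dict {d: list of admissible values}."""
    dims = sorted(domains)
    objects = product(*(domains[d] for d in dims))
    return [x for x in objects
            if all(x[dims.index(d)] == k for d, k in phi.items())]

domains = {1: [1, 2, 3], 2: [1, 2]}
print(coverage_set({1: 1}, domains))  # [(1, 1), (1, 2)], so |X_phi| = 2
```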

We assume that an object is described by the features and the class label, which are observable random variables, while conjunctive features are treated as latent variables. The goal of our model is to use conjunctive features as a common structure that relates the attributes and the class label, i.e., the conjunctive features consolidate two separate but related concepts. Since a conjunctive feature is a latent variable shared by the attributes and the class value, it generates both the object \({\mathbf {x}}\) and its label y. As a consequence, for given \(\varphi \) the random variables \({\mathbf {x}}\) and y become stochastically independent [see the model represented as a probabilistic graphical model (Cooper and Herskovits 1992) in Fig. 1], i.e., \(p(y,{\mathbf {x}}|\varphi ) = p(y|\varphi )\ p({\mathbf {x}}|\varphi )\). The independence assumption makes it easier to estimate the probabilities \(p(y|\varphi )\) and \(p({\mathbf {x}}|\varphi )\) instead of the joint probability \(p(y,{\mathbf {x}}|\varphi )\). Moreover, this is an unorthodox approach compared to classification rules, in which rules are deterministically associated with the class label. Here, we make a soft assumption about the conjunctive feature’s label. The proposed model is a specific kind of shared model (Damianou et al. 2012) because a single hidden variable shares the information about both observable variables.

The idea of a shared model is to introduce latent variables which capture the common structure between two or more concepts. In the literature, there are different kinds of shared models. In one approach a single shared latent structure is proposed to capture the mutual information of the observable variables (Shon et al. 2005). A different shared model introduces additional hidden variables which are specific to one of the concepts (Ek et al. 2008). Recently, a fully Bayesian treatment of the shared model was proposed in which the latent representation is marginalized out (Damianou et al. 2012). In our case, we aim at finding a common hidden representation which allows reasoning about both the attributes and the class label. Therefore, the idea of our approach is very similar to the one presented in Shon et al. (2005), where a single common latent structure is used.

Fig. 1

Probabilistic graphical model of the considered approach with latent conjunctive features. We assume that the conjunctive feature \(\varphi \) generates both the object \({\mathbf {x}}\) and its label y. The latent variable is represented by a white node and the observable variables by gray nodes

So far, we have introduced a new random variable, the conjunctive feature, which in fact corresponds to the antecedent of an if-then rule. However, since we take advantage of the probabilistic approach, the classification rules should be reformulated. We define a soft rule as a function which returns the probability of the class label y and the features \({\mathbf {x}}\) conditioned on the conjunctive feature \(\varphi \), \(f: {\mathcal {X}} \times \{-1,1\} \times {\mathcal {F}} \rightarrow [0,1]\), that is:

$$\begin{aligned} f({\mathbf {x}},y,\varphi ) = \left\{ \begin{array}{ll} p(y|\varphi )\ p({\mathbf {x}}|\varphi ), &{}\quad \text {if } {\mathbf {x}} \text { is covered by } \varphi \\ 0, &{}\quad \text {otherwise} \\ \end{array} \right. \end{aligned}$$
(5)

where \(p(y | \varphi )\) is the probability of label y for all objects generated by \(\varphi \), and \(p({\mathbf {x}} | \varphi )\) is the probability of the object \({\mathbf {x}}\) given \(\varphi \). In fact, in (5) we should write the joint distribution of \({\mathbf {x}}\) and y given \(\varphi \), \(p(y,{\mathbf {x}}|\varphi )\), instead of \(p(y|\varphi )\ p({\mathbf {x}}|\varphi )\), but we have already applied the assumption about the conditional independence of \({\mathbf {x}}\) and y given \(\varphi \). For further simplicity we will write \(f({\mathbf {x}},y,\varphi ) \mathop {=}\limits ^{\varDelta } f_{\varphi }({\mathbf {x}},y)\).

In order to make decisions we need to obtain the predictive distribution \(p(y|{\mathbf {x}})\), which requires application of the sum rule, i.e., summation of the distribution \(p(y, \varphi | {\mathbf {x}})\) over all conjunctive features \(\varphi \). However, we observe that there is no need to sum over all possible conjunctive features because only the ones covering \({\mathbf {x}}\) matter, and for all others we get \(p({\mathbf {x}}|\varphi ) = 0\). Therefore, the predictive distribution takes the following form:

$$\begin{aligned} p(y | {\mathbf {x}})&= \sum _{\mathcal {F}} p(y, \varphi |{\mathbf {x}}) \nonumber \\&= \sum _{\varphi : {\mathbf {x}} \in {\mathcal {X}}_{\varphi } } \frac{1}{p({\mathbf {x}})}\ p(y,{\mathbf {x}}|\varphi )\ p(\varphi ) \nonumber \\&\propto \sum _{\varphi : {\mathbf {x}} \in {\mathcal {X}}_{\varphi } } p(y|\varphi )\ p({\mathbf {x}}|\varphi )\ p(\varphi ), \end{aligned}$$
(6)

where \(p(\varphi )\) is the prior probability of a conjunctive feature. Notice that \(p({\mathbf {x}})\) is the same for all conjunctive features and hence can be omitted in further prediction. The final decision is the most probable class label. An application of exemplary conjunctive features is presented in Fig. 2.

Fig. 2

An exemplary application of soft rules. For \({\mathbf {x}} \in \{1,2,3\}\times \{1,2\}\), new object \((x_{1}^{1}, x_{2}^{2})\) (black cross) is covered by three conjunctive features: \(\varphi _1 = x_{1}^{1}\) (light gray rectangle), \(\varphi _2 = x_{2}^{2}\) (gray rectangle), and \(\varphi _3 = x_{1}^{1} \wedge x_{2}^{2}\) (dark gray rectangle). Other conjunctive features, e.g., \(\varphi _4 = x_{2}^{1}\) or \(\varphi _5 = x_{1}^{2}\), do not cover the object and thus are irrelevant in the prediction

Taking a closer look at the predictive distribution, one can notice that it forms a combination of soft rules (CSR) given by (5), with weights \(w_{\varphi }\) equal to \(p(\varphi )\) and the set of possible rules determined by the conjunctive features which cover \({\mathbf {x}}\), \({\mathcal {R}} = \{ \varphi : {\mathbf {x}} \in {\mathcal {X}}_{\varphi } \}\). Below, we indicate which parts of Eq. (6) correspond to elements of the ensemble classifier:

$$\begin{aligned} g({\mathbf {x}}, y) = \sum _{ \varphi : {\mathbf {x}} \in {\mathcal {X}}_{\varphi } } \underbrace{p(\varphi )}_{w_{\varphi }}\ \underbrace{p(y|\varphi )\ p({\mathbf {x}}|\varphi )}_{f_{\varphi }({\mathbf {x}},y)}. \end{aligned}$$
(7)

The final decision is the class label with the highest value of the combination, i.e., \(y^{*} = \arg \max _{y} g({\mathbf {x}},y)\), which takes exactly the same form as for the crisp rules (3), but with the interpretation of unnormalized probabilities.
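A compact sketch of the resulting prediction procedure is shown below; it assumes the three probability estimates are provided as callables (their concrete forms are derived in Sect. 2.3), and all names are illustrative rather than part of the method.

```python
# Sketch of the combination of soft rules (Eq. 7): sum over all conjunctive
# features covering x, i.e., all non-empty subsets of x's attribute-value pairs.
from itertools import combinations

def covering_conjunctive_features(x):
    items = list(x.items())
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            yield dict(subset)

def csr_predict(x, p_y_given_phi, p_x_given_phi, p_phi, labels=(-1, 1)):
    g = {y: 0.0 for y in labels}
    for phi in covering_conjunctive_features(x):
        for y in labels:
            # Eq. (7): w_phi = p(phi), f_phi(x, y) = p(y | phi) * p(x | phi)
            g[y] += p_phi(phi) * p_y_given_phi(y, phi) * p_x_given_phi(x, phi)
    return max(g, key=g.get)
```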

Note that the soft rule remains interpretable because it is a soft version of the crisp rule, which in turn is an if-then rule. Hence, the soft rule can be represented as an if-then rule with a soft consequent, i.e., the consequent contains the information about the class label together with its probability p. For example, for \({\mathbf {x}} \in \{1,2,3\}\times \{1,2\}\), the soft rule for \(\varphi = x_{1}^{1}\) and the considered object \({\mathbf {x}}\) is as follows:

$$\begin{aligned} \text {IF } x_{1}^{1} \text { THEN }&y=1 \text { with } p = p(y=1|x_{1}^{1})\ p({\mathbf {x}}|x_{1}^{1}) \\&\text {OR} \\&y=-1 \text { with } p = p(y=-1|x_{1}^{1})\ p({\mathbf {x}}|x_{1}^{1}) \end{aligned}$$

The application of the soft rules and their combination has the following advantages:

  1. 1.

    The set of soft rules used in the summation is determined automatically for given \({\mathbf {x}}\).

  2. 2.

    The weights of the soft rules in the combination are related to the prior for the conjunctive features. Hence, there is no need to propose an additional procedure for their determination.

  3. 3.

    The application of probabilistic reasoning allows utilizing all information provided by the conjunctive features covering a new object, in contrast to the combination of crisp rules, which uses only a subset of all possible rules. A similar argument was also used in previous studies about classification rules (Viswanathan and Webb 1998).

  4. 4.

    The soft rule can be represented as a crisp rule with a different consequent, which returns probabilities of the class labels instead of a crisp assignment. In other words, the soft rule retains the interpretability of the crisp rule and additionally assigns probabilities to the class labels.

The disadvantage associated with the application of conjunctive features is that their total number grows exponentially with the number of features D. Assuming for a moment that the numbers of values of all features are equal, we can give an exact relationship between the number of conjunctive features and the number of features:

Lemma 1

Assuming \(K_{d}=\kappa \) for \(d=1, \ldots , D\), the number of all conjunctive features is equal to \((\kappa +1)^{D}-1\).

The justification of this relationship is straightforward. Let us consider an object \({\mathbf {x}}\) which has D distinct features with \(\kappa \) possible values each. Then, we need to count all possible combinations of these distinct features except the empty conjunctive feature (i.e., \(\varphi = \emptyset \)), which results in:

$$\begin{aligned} \sum _{d=0}^{D} \left( {\begin{array}{c}D\\ d\end{array}}\right) \kappa ^{d} - 1 = (\kappa +1)^{D} - 1. \end{aligned}$$

As we can see, the application of the soft rules combination may be problematic when one deals with high-dimensional problems. Let us specifically consider the case of classifying a new object. The object has D features and each attribute takes exactly one value (in terms of Lemma 1, \(\kappa \) equals 1). Therefore, in order to classify the new object one needs to calculate a sum of \(2^{D} - 1\) soft rules. For N objects the overall time complexity can be estimated by \(O(N\ 2^{D})\). Hence, the application of (6) can be performed in exact form for up to approximately 20 features. For higher-dimensional problems the time needed to perform the classification may be too long or the computational demands may exceed the available resources. In order to overcome these limitations, a feature selection method can be applied or approximate inference should be utilized. A different approach aims at faster computation via better coding schemes. In Sect. 2.4 we show how to efficiently store the sufficient statistics needed to calculate the probabilities \(p(y|\varphi )\) efficiently.
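The counts above are easy to verify numerically; the following sketch checks Lemma 1 and the \(2^{D}-1\) figure for a single object (illustration only).

```python
# Numeric check of Lemma 1: sum_{d=0..D} C(D, d) * kappa^d - 1 = (kappa + 1)^D - 1.
from math import comb

def count_conjunctive_features(D, kappa):
    return sum(comb(D, d) * kappa ** d for d in range(0, D + 1)) - 1

for D, kappa in [(3, 2), (5, 3), (8, 1)]:
    assert count_conjunctive_features(D, kappa) == (kappa + 1) ** D - 1

print(count_conjunctive_features(8, 1))  # 255 = 2^8 - 1 soft rules cover a single object with D = 8
```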

2.3 Probabilities calculation

In the following, we present the plug-in estimators of the probabilities used in the classification rule (6) for given training data \({\mathcal {D}}=\{({\mathbf {x}}_{1},y_{1}),\ldots , ({\mathbf {x}}_{N},y_{N})\}\). First, we propose the modification of the m-estimate for imbalanced data which is used in estimating \(p(y|\varphi )\). Second, we define the probability of the object given the conjunctive feature, \(p({\mathbf {x}}|\varphi )\). Next, we give a function for evaluating the complexity of a conjunctive feature, which is later applied in formulating the prior probability of conjunctive features \(p(\varphi )\).

2.3.1 Modified m-estimate for class label

It has been shown that estimating probabilities with relative frequencies is troublesome and results in unreliable estimators (Cestnik 1990; Mitchell 1997). In Cestnik (1990) it has been proposed to take advantage of conjugate distributions, that is, the Categorical and Dirichlet distributions, which gives the Bayesian estimator, also called the m-estimate. In the considered case of binary classification we deal with the Bernoulli distribution and its conjugate prior, the Beta distribution. There are different possible choices of priors (Jaynes 1968), e.g., the Jeffreys prior (Jeffreys 1946); however, we aim at reducing the imbalance of the data through the prior. For this purpose the Beta distribution suits well because it allows us to put more probability mass on the probability associated with the minority class, which is not the case for any non-informative prior (e.g., the Jeffreys prior). The application of the m-estimate to \(p(y|\varphi )\) yields the following estimate:

$$\begin{aligned} p(y|\varphi ) \approx \frac{ N_{y,\varphi } + m\ \pi _{y} }{ \sum _{y} N_{y,\varphi } + m }, \end{aligned}$$
(8)

where \(N_{y,\varphi }\) is the number of occurrences of objects with the class label y covered by \(\varphi \), m is the non-negative tunable parameter of the estimator, and \(\pi _{y}\) is the initial probability of an object in the class y.

In the context of the between-class imbalance problem, the estimation of the probability \(p(y|\varphi )\) should not be dominated by one class, i.e., a proper estimator should eliminate the influence of the majority class on the minority class. In the m-estimate we can modify \(\pi _{y}\) and m, which have the following interpretations. The former determines the initial probability of objects covered by \(\varphi \) in the class y, and the latter is the number of objects that should be observed initially, i.e., before observing any data.

In order to eliminate the imbalanced data phenomenon we propose to weight each observation using the following proportion:

$$\begin{aligned} \tilde{\pi }_{y} = \frac{ N }{ 2N_{y} }, \end{aligned}$$
(9)

where \(N_{y}\) is the number of observations in the class y. The weight of an observation in the minority class is greater than 1, while in the majority class it is smaller than 1. Such a proportion was used in learning SVMs in order to reduce the imbalanced data phenomenon (Cawley 2006; Daemen and De Moor 2009). This proposition can be justified in two ways:

  1. 1.

    It has been noted in (Cawley 2006) that the weighting (9) is asymptotically equivalent to re-sampling the data so that there is an equal number of positive and negative examples.

  2. 2.

    If we sum over all the observations weighted with (9) we get

    $$\begin{aligned} \sum _{n} \tilde{\pi }_{y_{n}}&= N_{1} \frac{ N }{ 2N_{1} } + N_{-1} \frac{ N }{ 2N_{-1} } \nonumber \\&= N. \end{aligned}$$
    (10)

In other words, such weighting preserves the number of training examples.

The initial probability can be obtained by normalizing the proportions:

$$\begin{aligned} \pi _{y}&= \frac{ \tilde{\pi }_{y} }{ \sum _{y'} \tilde{\pi }_{y'} } \nonumber \\&= \frac{N_{-y}}{N}, \end{aligned}$$
(11)

where \(N_{-y}\) is the number of occurrences of objects with the class opposite to y, i.e., \(\pi _{1} = \frac{N_{-1}}{N}\) and \(\pi _{-1} = \frac{N_{1}}{N}\). Hence, the application of (11) can counteract the imbalanced data phenomenon by increasing the initial probability of objects in the minority class.

Typically, the tunable parameter m is determined experimentally (Džeroski et al. 1993; Zadrozny and Elkan 2001). Later in the paper the m-estimate with the prior (11) is referred to as the imbalanced m-estimate (or im-estimate for short).
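A minimal sketch of the im-estimate, combining (8) with the prior (11), is given below; the function signature and the toy numbers are our own illustration.

```python
# Sketch of the imbalanced m-estimate (Eq. 8 with pi_y = N_{-y} / N from Eq. 11).

def im_estimate(n_y_phi, n_noty_phi, n_y, n_noty, m=1.0):
    """n_y_phi / n_noty_phi: counts of covered objects in class y and in the opposite class;
    n_y / n_noty: global class counts in the training data."""
    pi_y = n_noty / (n_y + n_noty)                              # Eq. (11)
    return (n_y_phi + m * pi_y) / (n_y_phi + n_noty_phi + m)    # Eq. (8)

# A conjunctive feature covering 3 minority and 12 majority objects, in data
# with 20 minority and 80 majority examples overall:
print(round(im_estimate(3, 12, 20, 80, m=2.0), 3))  # ~0.271, above the raw frequency 3/15 = 0.2
```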

2.3.2 Object probability

In the proposed model we assume that the conjunctive feature can generate both the class label and the object. In the simplest approach, the probability of the object given the conjunctive feature would be 1 if the object is covered by the conjunctive feature and 0 otherwise. However, such a fashion of assigning probabilities does not distinguish conjunctive features and their possible generative capabilities. Instead, we prefer to assume that the object is sampled from a uniform distribution over the domain determined by \(\varphi \):

$$\begin{aligned} p({\mathbf {x}}|\varphi ) = \left\{ \begin{array}{ll} \frac{1}{|{\mathcal {X}}_{\varphi }|}, &{}\quad \text {if } {\mathbf {x}} \in {\mathcal {X}}_{\varphi }\\ 0, &{}\quad \text {otherwise} \end{array} \right. \end{aligned}$$
(12)

where \(|{\mathcal {X}}_{\varphi }|\) is the cardinality of the set determined by the conjunctive feature \(\varphi \). Such an approach is very similar to the strong sampling assumption (Tenenbaum and Griffiths 2001).

The object probability can be seen as a realization of a semantic principle of simplicity (a semantic Occam’s razor): the larger the domain of the conjunctive feature, the more it is penalized. In other words, the larger the domain determined by \(\varphi \), the lower the probability of the object. The word semantic comes from the interpretation of the probability, i.e., we consider the meaning of covering the object by the conjunctive feature. For example, for \({\mathbf {x}} \in \{1,2,3\}\times \{1,2\}\) and the conjunctive feature \(\varphi = x_{1}^{1}\) one gets \({\mathcal {X}}_{\varphi } = \{ (x_{1}^{1}, x_{2}^{1}), (x_{1}^{1}, x_{2}^{2}) \}\), and thus for \({\mathbf {x}} = (x_{1}^{1}, x_{2}^{1})\) the object probability equals \(p({\mathbf {x}}|\varphi ) = \frac{1}{2}\).
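The object probability (12) thus reduces to counting the attribute values left unconstrained by \(\varphi \); a hedged sketch:

```python
# Sketch of the object probability (Eq. 12): |X_phi| is the product of the
# domain sizes of the attributes that phi leaves unconstrained.

def object_probability(x, phi, domain_sizes):
    """x, phi: dicts {d: k}; domain_sizes: dict {d: K_d}."""
    if any(x.get(d) != k for d, k in phi.items()):
        return 0.0  # x is not covered by phi
    card = 1
    for d in domain_sizes:
        if d not in phi:
            card *= domain_sizes[d]
    return 1.0 / card

print(object_probability({1: 1, 2: 1}, {1: 1}, {1: 3, 2: 2}))  # 0.5, as in the example above
```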

These philosophical considerations can be cast into a more formal justification by observing that the probability \(p({\mathbf {x}}|\varphi )\) defined as in (12) is monotone. In the counting-inference literature, a rule descriptive measure \(d:{\mathcal {R}}\rightarrow {\mathbb {R}}\) is said to be monotone in rule r if, for any two rules \(r_{1}\) and \(r_{2}\) such that the antecedent of \(r_{1}\) has more formulae than the antecedent of \(r_{2}\), one obtains \(d(r_1) \ge d(r_2)\) (Brzezinska et al. 2007; Ceglar and Roddick 2006).Footnote 2 In the considered case this indeed holds, because the more formulae the conjunctive feature contains, the higher the value of the probability (12).

Notice that the proposed way of calculating the object probability also mitigates the within-class imbalanced data problem. Usually, the unequal distribution of examples within a class leads to biased estimates. Here, our assumption about the dependencies in the graphical model makes the object independent of the class for a given conjunctive feature. Therefore, we calculate the object probability (12) using the cardinality of the set of objects \({\mathcal {X}}_{\varphi }\) determined by the conjunctive feature, independently of the class label.

2.3.3 Conjunctive feature prior

The probability of the conjunctive features represents prior beliefs. One possible proposition of prior beliefs is the following: conjunctive features that contain fewer features are more probable. In other words, the prior over conjunctive features can be seen as a realization of a syntactic principle of simplicity (a syntactic Occam’s razor), which states that shorter (simpler) conjunctive features are a priori more probable than longer ones. We say syntactic because we consider the structure of the conjunctive feature. Hence, we propose the following function for measuring the conjunctive feature’s complexity:

$$\begin{aligned} h(\varphi ) = \exp (-a D_{\varphi }), \end{aligned}$$
(13)

where a is a free parameter and \(D_{\varphi }\) denotes the number of features in \(\varphi \). Further in the paper we use \(a = \frac{1}{D}\), which turned out to work well in practice.

We get the prior over conjunctive features by normalizing the complexity function which results in the Gibbs distribution:

$$\begin{aligned} p(\varphi ) = \frac{ h(\varphi ) }{ \sum _{\varphi '} h(\varphi ') }. \end{aligned}$$
(14)

In contrast to the object probability, the conjunctive feature prior probability \(p(\varphi )\) defined as in (14) is anti-monotone. Extending the conjunctive feature by adding an appropriate formula (i.e., so that it remains a conjunctive feature according to our definition) decreases the value of (13).
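A short sketch of the complexity-based weight (13), with \(a = 1/D\) as used in the paper; the function name is illustrative.

```python
# Sketch of the conjunctive-feature complexity weight (Eq. 13), with a = 1 / D.
import math

def feature_weight(phi, D):
    return math.exp(-len(phi) / D)   # len(phi) = D_phi, the number of fixed attributes

print(feature_weight({1: 1}, D=8))               # ~0.88: shorter features weigh more
print(feature_weight({1: 1, 2: 2, 3: 1}, D=8))   # ~0.69: longer features are penalized
```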

2.3.4 Probability calculation: summary

In Table 1 we give a summary of our considerations on how the probabilities used in the combination of soft rules are calculated, and we indicate the final forms used in the experiments. In the conjunctive feature prior we can omit calculating the denominator, since it is constant for all conjunctive features and does not influence the final prediction. Hence, we use \(w_{\varphi } = h(\varphi )\) instead of the probability (14) in the combination of the soft rules (7).

Table 1 The summary of probabilities calculation used in the experiments

2.4 Graph-based memorization

As we have pointed out earlier, for an equal number of features’ values there are \((\kappa +1)^{D}-1\) conjunctive features in total (see Lemma 1). In order to calculate the probability estimators for \(p(y|\varphi )\), we would need to store a number of sufficient statistics (parameters) proportional to the number of all possible conjunctive features. In practice, such an approach can be troublesome or even impossible to keep in a computer’s memory because of the exponential complexity. However, to speed up calculations and limit the number of parameters, we can take advantage of a data aggregation method with a graph-based representation, which we call graph-based memorization.

Let us define the graph \({\mathcal {G}}_{y}=({\mathcal {V}}_{y},{\mathcal {E}}_{y})\) for the given class label y in the following manner. The set of vertices \({\mathcal {V}}_{y}\) consists of nodes that represent the considered features with all their values, i.e., a node is a pair \(v = (d,i)\), where d stands for the feature’s index and i denotes the value of the feature. Nodes corresponding to one feature form a layer. We add a terminal node \(v_{T} = (D+1,1)\) to the set of vertices, and it forms a separate layer. Moreover, the terminal node is added to every example as an additional feature which always takes the value 1. The set of edges \({\mathcal {E}}_{y}\) consists of all connections between any two nodes (including the terminal node) from different layers, e.g., the edge in the class y connecting the \(i{\text {th}}\) value in the \(s{\text {th}}\) layer, \(u = (s,i)\), and the \(j{\text {th}}\) value in the \(t{\text {th}}\) layer, \(v=(t,j)\), is denoted by \(e_{u,v}^{y}\). Additionally, all nodes are connected with the terminal node. The weight of an edge represents the number of co-occurrences of the two nodes in the training data, e.g., the weight of the edge \(e_{u,v}^{y}\) is denoted by \(w_{u,v}^{y}\). An exemplary graph is presented in Fig. 3a. All complete subgraphs which consist of at most one node from each layer and the terminal node constitute conjunctive features in the class y, \({\mathcal {G}}_{y,\varphi }=({\mathcal {V}}_{y},{\mathcal {E}}_{y,\varphi })\). Note that the set of vertices is the same as in the graph \({\mathcal {G}}_{y}\), but the set of edges consists of only those edges which connect nodes included in the conjunctive feature \(\varphi \). Additionally, we assume that the terminal node is included in each observation. An exemplary conjunctive feature represented as a graph is presented in Fig. 3b.

Fig. 3

a Exemplary graph for \({\mathbf {x}} \in \{1,2\}^{2}\) in the class y. The gray vertex denotes the terminal node. The light gray rectangle represents the first layer and the darker one the second layer. b Exemplary conjunctive feature \(\varphi = x_{1}^{1} \wedge x_{2}^{2}\) represented as a graph \({\mathcal {G}}_{y,\varphi }\) for \({\mathbf {x}} \in \{1,2\}^{2}\)

For given data \({\mathcal {D}}\) we propose the following weight-updating procedure. If a pair of nodes \(u=(s,i)\) and \(v=(t,j)\) co-occurs in the \(n{\text {th}}\) observation, \(({\mathbf {x}}_{n}, y_{n})\), then

$$\begin{aligned} w_{u,v}^{y_{n}} {:=} w_{u,v}^{y_{n}} + 1. \end{aligned}$$
(15)

We need to perform the update for all co-occurring pairs of nodes in the observation \({\mathbf {x}}_{n}\) and repeat the procedure for all examples in \({\mathcal {D}}\). Remember that the terminal node is included in every observation, thus we always update \(w_{u,v_{T}}^{y_{n}}\) for all features. Initially, all weights are set to zero. It is worth noting that the updating procedure is independent of the order of incoming observations, which means that the updating process is performed in an incremental manner and can be applied to a data stream. The procedure of graph-based memorization is presented in Algorithm 1.

Graph-based memorization allows us to aggregate data in graphs \({\mathcal {G}}_{y}\) for classes \(y \in \{-1, 1\}\), and thus we can approximate the count of objects with the class label y which are covered by \(\varphi \) as follows:

$$\begin{aligned} N_{y,\varphi } \le \min _{ (u, v) : e_{u,v}^{y} \in {\mathcal {E}}_{y,\varphi } } w_{u,v}^{y}. \end{aligned}$$
(16)

The im-estimator can be determined using graph-based memorization by inserting (16) into (8).
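The following hedged sketch illustrates the whole mechanism: one co-occurrence graph per class, the weight update (15) and the count bound (16). The class name and the edge representation as a dictionary of node pairs are our own illustrative choices, not the authors' implementation.

```python
# Sketch of graph-based memorization (Algorithm 1): one symmetric co-occurrence
# table per class, with a terminal node appended to every observation.
from collections import defaultdict

TERMINAL = ("T", 1)  # the extra terminal node added to each observation

class GraphMemo:
    def __init__(self):
        self.w = {1: defaultdict(int), -1: defaultdict(int)}  # one graph per class

    def update(self, x, y):
        """x: dict {d: k}; y in {-1, 1}. Increments all pairwise co-occurrence weights (Eq. 15)."""
        nodes = list(x.items()) + [TERMINAL]
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                self.w[y][frozenset((u, v))] += 1

    def count(self, phi, y):
        """Upper bound on N_{y, phi} (Eq. 16): the minimum edge weight in the
        subgraph induced by phi and the terminal node."""
        nodes = list(phi.items()) + [TERMINAL]
        return min(self.w[y][frozenset((u, v))]
                   for i, u in enumerate(nodes) for v in nodes[i + 1:])

# The three observations of the toy example in Sect. 2.4.1:
memo = GraphMemo()
memo.update({1: 2, 2: 1}, -1)
memo.update({1: 2, 2: 2}, -1)
memo.update({1: 1, 2: 1}, 1)
print(memo.count({1: 1, 2: 1}, 1), memo.count({1: 1, 2: 1}, -1))  # 1 0
```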

Algorithm 1 The graph-based memorization procedure

The application of graph-based memorization indeed allows decreasing the number of stored sufficient statistics, according to the following lemma:

Lemma 2

Assuming \(K_{d}=\kappa \) for \(d=1, \ldots , D\), \(\kappa >2\) and \(D>3\) or \(\kappa = 2\) and \(D>4\), the number of sufficient statistics stored by the graph \({\mathcal {G}}\) is equal to \((D\kappa +1)^{2}\).

Proof

Let us use the adjacency matrix to represent the considered graph. Because there are \(K+1\) nodes, i.e., K nodes corresponding to features with their values plus the terminal node, we need fewer than \((K+1)^{2}\) weights in one class, because we do not allow edges within one layer, i.e., among nodes representing values of the same feature. Moreover, we notice that the weights are symmetric, i.e., for any two nodes u and v, \(w_{u,v}^{y} = w_{v,u}^{y}\), because we count co-occurrences of two nodes. Hence, we need fewer than \((K+1)^{2}/2\) weights in one class. Finally, assuming equal feature domains (so that \(K = D\kappa \)) and a binary classification problem, i.e., two graphs, we obtain the number of sufficient statistics equal to \((D\kappa +1)^{2}\). \(\square \)
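As a rough illustration of the saving (a sketch under the lemma's assumptions, counting both classes), one can compare the number of raw conjunctive-feature counts with the size of the graph-based representation:

```python
# Stored statistics: full per-class table of conjunctive-feature counts
# (Lemma 1, two classes) vs. the graph-based representation of Lemma 2.
for D, kappa in [(10, 2), (15, 3), (20, 4)]:
    full = 2 * ((kappa + 1) ** D - 1)
    graph = (D * kappa + 1) ** 2
    print(D, kappa, full, graph)
# e.g., for D = 10 and kappa = 2: 118096 raw counts vs. 441 graph weights
```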

2.4.1 Example

Let us consider a toy example for graph-based memorization. The object is described by two variables which can take two values, i.e., \({\mathbf {x}} \in \{1, 2\} \times \{1, 2\}\), and \(y\in \{-1,1\}\). We assume there are three examples: i) \((x_{1}^{2}, x_{2}^{1})\) and \(y=-1\), ii) \((x_{1}^{2}, x_{2}^{2})\) and \(y=-1\), iii) \((x_{1}^{1}, x_{2}^{1})\) and \(y=1\).

Following the graph-based memorization procedure, we begin with the first example and update each edge encountered in the example (line 9 in Algorithm 1). We have to remember that the terminal node is added to every example, which is why one needs to iterate up to \(D+1\) instead of D (line 6 in Algorithm 1). The resulting graphs after including the first two examples and after including the last one are presented in Fig. 4a, b, respectively.

For the graphs in Fig. 4b we are able to calculate the probability of y using (8) for the counts \(N_{y,\varphi }\) determined by (16), a given conjunctive feature \(\varphi \), the initial probabilities, and a fixed value of m. Let us assume that \(m=1\) and \(\pi _{1}=0.5\). Then, for instance, for \(\varphi = x_{1}^{1} \wedge x_{2}^{1}\) we have \(N_{1,\varphi }=\min \{1,1\}=1\) and \(N_{-1,\varphi }=\min \{0,1\}=0\), and consequently \(p(y=1|\varphi ) = \frac{1+0.5}{1+1} = 0.75\) and \(p(y=-1|\varphi ) = \frac{0+0.5}{1+1} = 0.25\).
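Assuming the GraphMemo sketch from Sect. 2.4 (which was fed exactly these three observations), the calculation can be reproduced directly; the fixed prior \(\pi _{1}=0.5\) is used here, as in the example, rather than the imbalanced prior.

```python
# Reproducing the toy calculation with the GraphMemo sketch above (m = 1, pi_1 = 0.5):
m, pi_1 = 1.0, 0.5
n_pos = memo.count({1: 1, 2: 1}, 1)    # N_{1, phi} = 1
n_neg = memo.count({1: 1, 2: 1}, -1)   # N_{-1, phi} = 0
print((n_pos + m * pi_1) / (n_pos + n_neg + m))        # 0.75, as in the text
print((n_neg + m * (1 - pi_1)) / (n_pos + n_neg + m))  # 0.25
```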

Fig. 4

An exemplary performance of the graph-based memorization for \({\mathbf {x}} \in \{1, 2\} \times \{1, 2\}\), \(y\in \{-1,1\}\) and three observations. Numbers above edges represent values of weights. More details can be found in text

2.5 Multi-class case

In our considerations we have assumed only two possible class labels. However, the presented approach can be straightforwardly generalized to the multi-class case. First of all, let us notice that in the presented equations for calculating the combination of soft rules, i.e., Eqs. (5) and (7), as well as in the equation for the final prediction (3), there is no restriction on the number of classes. Similarly, in calculating the probability of the class label [see Eq. (6)] the number of classes is not constrained. We have used the assumption of two classes only in calculating the weighting of observations in Eq. (9). However, it is easy to generalize it to any number of classes by replacing 2 in the denominator with the number of classes. Finally, graph-based memorization is also independent of the number of class labels because we build a graph for each class separately. Therefore, the whole process of storing sufficient statistics in the graph-based representation can be performed in the multi-class case.

3 Experiments

Data We carry out one simulation study on synthetic data and two experiments: Experiment 1 on synthetic benchmark datasetsFootnote 3 (see Table 2), Experiment 2 on medical datasetsFootnote 4 (see Table 3) including one real-life medical dataset provided by the Institute of Oncology, Ljubljana (Štrumbelj et al. 2010):

  • breast cancer: the goal is the prediction of a recurrence of a breast cancer,

  • breast cancer Wisconsin: the goal is the classification of a breast cancer as benign or malignant,

  • diabetes: the goal is to classify the patient as tested positive for diabetes or not,

  • hepatitis: the goal is to predict whether a patient suffering from hepatitis will survive or die,

  • indian liver: the goal is to classify the patient as healthy or with a liver issue,

  • liver disorders: the goal is to classify the patient as healthy or with a liver disorder,

  • postoperative patient: the goal is to decide whether the patient should be sent to the hospital or home,

  • oncology: the goal is to predict whether the patient will have a recurrence of a breast cancer or not.

Table 2 The number of examples, the number of features and the imbalance ratio for benchmark datasets
Table 3 The number of examples, the number of features and the imbalance ratio for medical datasets used in the experiments

The datasets are summarized in Tables 2 and 3 in which the number of features and the number of examples for each dataset are given. Additionally, we provide the imbalance ratio defined as the number of negative class examples divided by the number of positive class examples (Galar et al. 2012).

Evaluation methodology. The proposed method, Combination of Soft Rules (CSR), was evaluated on the synthetic data and was further compared in two experiments with the following non-rule-based methods:

  • AdaBoost (AB) (Freund and Schapire 1997),

  • Bagging (Bag) (Breiman 1996),

  • SMOTEBagging (SBag): modified Bagging in which base learners are trained using the SMOTE sampling technique (Chawla et al. 2002),

  • SMOTEBoost (SBoost): modified AdaBoost in which base learners are trained using the SMOTE sampling technique (Chawla et al. 2002),

  • Naïve Bayes classifier (NB),

  • Cost-sensitive SVM (CSVM) with linear kernel (Cortes and Vapnik 1995),

  • Neural Network (NN),

and rule-based methods:

  • C4.5 tree learnerFootnote 5 (Quinlan 1993),

  • RIPPER classification rules learner (Cohen 1995),

  • OneR classification rules learner (Holte 1993),

  • CFAR classification rules learner based on fuzzy association rules (Chen and Chen 2008),

  • SGERD fuzzy classification rules learner based on steady-state genetic algorithm (Mansoori et al. 2008),

  • ART classification rules learner based on association rule tree (Berzal et al. 2004).

In order to evaluate the methods we applied the following assessment metrics (sketched in code after the list):Footnote 6

  • Gmean (Geometric mean) which is defined as follows (Kubat and Matwin 1997; Kubat et al. 1997; He and Garcia 2009; Wang and Japkowicz 2010):

    $$\begin{aligned} \textit{Gmean} = \sqrt{ \frac{\textit{TP}}{\textit{TP}+\textit{FN}}\ \frac{\textit{TN}}{\textit{TN}+\textit{FP}} }, \end{aligned}$$
    (17)
  • AUC (Area Under the ROC Curve) which is expressed in the following form (He and Garcia 2009):

    $$\begin{aligned} \textit{AUC} = \frac{1 + \frac{\textit{TP}}{\textit{TP}+\textit{FN}} - \frac{\textit{FP}}{\textit{TN}+\textit{FP}} }{2}, \end{aligned}$$
    (18)
  • Precision specifies how many examples from the minority class were correctly classified, compared to the majority objects incorrectly labeled as minority (Fawcett 2006):

    $$\begin{aligned} {\textit{Precision}} = \frac{\textit{TP}}{\textit{TP}+\textit{FP}}, \end{aligned}$$
    (19)
  • Recall denotes the fraction of correctly classified minority class objects among all examples belonging to the minority class (Fawcett 2006):

    $$\begin{aligned} {\textit{Recall}} = \frac{\textit{TP}}{\textit{TP}+\textit{FN}}. \end{aligned}$$
    (20)
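A small helper computing the four metrics from the confusion-matrix entries is sketched below; it is a plain restatement of formulas (17)–(20) and involves no additional assumptions.

```python
# Sketch: Gmean, AUC, Precision and Recall (Eqs. 17-20) from confusion-matrix counts.
import math

def assessment_metrics(tp, fn, fp, tn):
    recall = tp / (tp + fn)                      # Eq. (20)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)                   # Eq. (19)
    gmean = math.sqrt(recall * specificity)      # Eq. (17)
    auc = (1 + recall - fp / (tn + fp)) / 2      # Eq. (18)
    return {"Gmean": gmean, "AUC": auc, "Precision": precision, "Recall": recall}

print(assessment_metrics(tp=30, fn=10, fp=20, tn=140))
```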

It is advocated to use Gmean for imbalanced datasets because this metric punishes low classification accuracy on the minority class (Kubat and Matwin 1997; Kubat et al. 1997; He and Garcia 2009; Wang and Japkowicz 2010). Compared to AUC, Gmean enforces high predictive accuracy on both the majority and the minority class. For further comparison of the methods we calculated average ranks over the benchmark and medical datasets according to Gmean and AUC, which is a simple way of evaluating classification algorithms (Demšar 2006; Brazdil et al. 2003).

In the experiments Precision and Recall were also examined because these two measures give a thorough insight into the classifier’s predictive performance exclusively for the minority class. For a better understanding of the obtained results, we graphically present the Pareto frontier (Brzezinska et al. 2007; Vamplew et al. 2011) with respect to Precision and Recall for the considered methods.

In order to verify our claims about the time complexity of the proposed approach, we measured the average execution time over five folds, expressed in milliseconds. We examined the dependence of the execution time on the number of attributes and on the number of examples separately.

The presented approach handles categorical variables only; therefore, we applied a discretizer which utilizes the entropy minimization heuristic (Fayyad and Irani 1993). Additionally, we performed feature selection using correlation-based feature selection with exhaustive search (Hall 1999) on selected datasets. In Experiment 2, the feature selection on the hepatitis dataset resulted in ten features, and for the oncology dataset five features were selected.

The experiments were conducted in the KEEL softwareFootnote 7 (Alcalá et al. 2010). In order to obtain the results we applied fivefold cross validation. For each dataset the value of m in CSR was determined using a validation set.

3.1 Simulation study: synthetic data

In order to get a better insight into the proposed approach, in the simulation study we want to state and verify two issues: (i) the behavior of CSR for different distributions and numbers of given examples, (ii) the time complexity of CSR with a varying number of attributes or examples. These questions are verified with the following simulation set-ups:

  1. (i)

    It is assumed that the considered phenomenon is described by 8 binary attributes, \({\mathbf {x}} \in \{0,1\}^{8}\). The attributes are generated independently from the Bernoulli distribution with \(p_{d}=0.5\), for \(d=1,\ldots , 8\). Further, we consider four possible rule-based descriptions of the phenomenon (in brackets we give the imbalance ratio over all possible configurations of \({\mathbf {x}}\)), namely, a conjunction \(x_{1}^{1} \Rightarrow y=1\) (1:1), a more specific conjunction \(x_{1}^{1} \wedge x_{8}^{1} \Rightarrow y=1\) (3:1), a disjunction \(x_{1}^{1} \vee x_{8}^{1} \Rightarrow y=1\) (3:1), and a more specific disjunction \((x_{1}^{1} \wedge x_{5}^{1}) \vee x_{8}^{1} \Rightarrow y=1\) (5:3).Footnote 8 Additionally, during data generation we inject noise into the class label, i.e., we switch the class label with probability \(\varepsilon \in \{0, 0.01, 0.05, 0.1\}\). Eventually, we evaluate CSR on all possible configurations without noise, while learning is performed with 50 and 200 examples. In learning, the im-estimate was used with \(m=1\).

  2. (ii)

    In order to check the time complexity we chose two datasets from the benchmark datasets, namely, vowel0 (13 attributes, 990 examples) and page-blocks0 (10 attributes, 5000 examples). The first of the mentioned datasets was used to evaluate CSR in the case of varying number of attributes, while the second one was applied to assess the time complexity for varying number of examples.

All simulations were repeated 5 times.

The averaged results for the first issue are presented in Table 4 and the averaged results for the second issue are given in Fig. 5.

Table 4 Results of the simulation study in terms of Gmean and AUC for CSR

We can draw the following conclusions from the simulation study:

  1. (i)

    In general, we conclude that CSR can be used to learn different descriptions of the phenomenon, but only for a sufficient number of observations, see Table 4. A somewhat unexpected conclusion of the simulation experiment is that CSR can handle both conjunctions and disjunctions. A main disadvantage of CSR is its sensitivity to the value of the parameter m. In the simulation runs we used a fixed \(m=1\), which is an inappropriate value for imbalanced data. Therefore, model selection of this parameter is needed.

  2. (ii)

    As we presumed (see Lemma 1), CSR can handle up to 10 attributes in a reasonable time, but for more than 10 attributes the computation time starts increasing drastically, see Fig. 5a. On the other hand, the time complexity with a growing number of examples is linear, see Fig. 5b. This is an important fact which shows that CSR scales well to large datasets with a moderate number of attributes.

Fig. 5

Average execution time: a for varying number of attributes (vowel0 dataset), b for varying number of examples (page-blocks0 dataset)

3.2 Experiment 1: Benchmark data

In the first experiment we consider 37 benchmark datasets. These are generated from the benchmark datasets available in the UCI ML Repository (Frank and Asuncion 2010) and are built into the KEEL software. The classification problem is demanding because the datasets are highly imbalanced (see Table 2).

3.3 Experiment 2: Medical data

In the second experiment we first evaluate our method on benchmark medical datasets (Frank and Asuncion 2010). It has been noticed that medical problems associated with illnesses or post-operative prediction are typical examples of the imbalanced data phenomenon (Mac Namee et al. 2002). In the considered medical datasets the issue of imbalanced data is clearly observed, i.e., the imbalance ratio equals 2.6 on average, varying from 1.52 to 5.43 (see Table 3).

Next, we took a closer look at the 949-case dataset provided by the Institute of Oncology Ljubljana (Štrumbelj et al. 2010). Each patient is described by 15 features:

  • menopausal status;

  • tumor stage;

  • tumor grade;

  • histological type of the tumor;

  • level of progesterone receptors in tumor (in fmol per mg of protein);

  • invasiveness of the tumor

  • number of involved lymph nodes

  • medical history

  • lymphatic or vascular invasion;

  • level of estrogen receptors in tumor (in fmol per mg of protein);

  • diameter of the largest removed lymph node;

  • ratio between involved and total lymph nodes removed;

  • patient age group;

  • application of a therapy (cTherapy);

  • application of a therapy (hTherapy).

All the features were discretized by oncologists, based on how they use the features in everyday medical practice, see (Štrumbelj et al. 2010). The goal in the considered problem is the prognosis of a breast cancer recurrence within 10 years after surgery.

The oncology dataset, similarly to the benchmark medical datasets, suffers from the imbalanced data phenomenon. The imbalance ratio for this dataset equals 2.92, a value which calls for the application of techniques preventing overfitting to the majority class.

Table 5 Detailed test results for CSR versus considered methods using ranks of Gmean and AUC for benchmark and medical datasets

For the oncology dataset we used an additional assessment metric, that is, the classification accuracy (CA) which is defined as follows:

$$\begin{aligned} {\textit{CA}} = \frac{\textit{TP} + \textit{TN}}{\textit{TP} + \textit{FN} + \textit{FP} + \textit{TN}}. \end{aligned}$$
(21)

The CA was applied because the authors of Štrumbelj et al. (2010) obtained results for 100 randomly chosen cases which were analyzed by two human doctors (O1 and O2 in Fig. 6). The oncologists were asked to predict the class value for these cases and then the CA value was calculated (Štrumbelj et al. 2010). The obtained quantities do not lead to the conclusion that the classifiers have significantly higher accuracy. However, they give an insight into the usefulness of applying machine learning methods in the medical domain.

Fig. 6

Summary performance comparison of all algorithms and the human oncologists, O1 and O2. The performance is expressed by the classification accuracy (21). Note that both the human oncologists and the classifiers performed the classification on the same test set consisting of 100 examples

Fig. 7

Pareto frontier for: a benchmark datasets, b medical datasets. Methods denoted by a triangular mark constitute the Pareto frontier. Rule-based methods are denoted by diamonds. Notice that in both cases CSR (denoted by a star) is Pareto-optimal

Fig. 8

Execution time comparison between CSR and non-rule-based methods with respect to the number of attributes

Fig. 9

Execution time comparison between CSR and rule-based methods with respect to the number of attributes

Fig. 10

Execution time comparison between CSR and non-rule-based methods with respect to the number of examples

Fig. 11

Execution time comparison between CSR and rule-based methods with respect to the number of examples

3.4 Results of the experiments and discussion

The results of Experiment 1 are presented in Table 5 (ranks of Gmean and AUC) and Fig. 7a (Pareto frontier with respect to Precision and Recall). The results of Experiment 2 are presented in Table 5 (ranks of Gmean and AUC) and Fig. 7b (Pareto frontier with respect to Precision and Recall). Additionally, in Fig. 6 we provide a comparison of the machine learning methods and human doctors in terms of CA. More detailed results of the experiments are given in the Electronic Supplementary Material; Experiment 1: Table A1 (Gmean), Table A3 (AUC), Table A5 (Precision and Recall); Experiment 2: Table A2 (Gmean), Table A4 (AUC), Table A6 (Precision and Recall). The results of the time complexity analysis are gathered in Table A7 (see Electronic Supplementary Material) and more detailed results are presented in Figs. 8, 9, 10, and 11.

The results obtained within the carried out experiments show that, according to the Gmean and AUC metrics, the proposed approach performs comparably with the best ensemble-based classifiers, that is, SMOTEBagging (SBAG) and SMOTEBoosting (SBoost), and slightly better than the best non-ensemble predictor, i.e., the Naïve Bayes classifier (NB); see Tables A1, A2, A3, A4 in the Electronic Supplementary Material and the ranks in Table 5. Moreover, it outperforms all rule-based methods; for instance, the best rule-based methods, C45 and RIPPER, achieved results worse by several ranks.
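For reference, a minimal sketch of Gmean computed from confusion-matrix counts, assuming the paper follows the standard definition of Gmean as the geometric mean of sensitivity and specificity; the function name is an illustrative assumption:

```python
import math

def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity (recall on the positive class)
    and specificity (recall on the negative class)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)
```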

It is interesting that the combination of soft rules performed very well on datasets with a small number of examples, e.g., ecoli-0-1-3-7vs2-6, glass-016vs5, new-thyroid2, hepatitis, postoperative patient. This effect can be explained in two ways. First, in general, combining models increases robustness to overfitting. Second, it is possible that the application of the im-estimate helps counteract the small size of the dataset. On the other hand, it can be noticed that CSR achieved slightly worse results on datasets with a larger number of attributes, e.g., page-blocks0, vowel0. A plausible explanation of this effect is the assumption that the data are generated by one conjunctive feature, which may not hold for such datasets. Perhaps another kind of shared model, e.g., one including some form of disjunction of conjunctive features, would better represent the hidden structure of the data.
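For context, the im-estimate builds on the classical m-estimate of rule probabilities, which smooths empirical frequencies towards the class prior and thereby stabilises estimates on small samples; a minimal sketch of the classical m-estimate is given below (the exact im-estimate formula is the one defined earlier in the paper and is not reproduced here; the function name and the default value of m are illustrative assumptions):

```python
def m_estimate(n_y, n, prior_y, m=2.0):
    """Classical m-estimate of P(y | rule).

    n_y     -- number of covered examples belonging to class y
    n       -- total number of examples covered by the rule
    prior_y -- prior probability of class y
    m       -- smoothing strength; larger m pulls the estimate
               more strongly towards the prior
    """
    return (n_y + m * prior_y) / (n + m)

# With few covered examples the estimate stays close to the prior,
# e.g. m_estimate(1, 2, 0.25) == 0.375 instead of the raw 0.5.
```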

In terms of Precision and Recall it turned out that CSR is Pareto-optimal in the case of both the benchmark and the medical datasets (see Fig. 7a, b, where methods forming the Pareto frontier are denoted by triangular marks and our method is represented by a star). On the medical datasets our method achieved balanced values of Precision and Recall. However, on the benchmark datasets CSR has a very high value of Recall but a lower value of Precision, which means that it is able to detect most of the positive objects but is prone to labeling negatives as positives. In the medical domain this is the less dangerous behaviour from the patient's point of view: it is less harmful to classify a healthy patient as ill (favouring Recall) than to do the opposite (favouring Precision). Nonetheless, comparing our approach to the other rule-based methods, it turns out that only C45 and RIPPER are comparable (on the benchmark datasets they are close to the Pareto frontier, and on the medical datasets C45 is Pareto-optimal and RIPPER is close to the Pareto frontier). Additionally, OneR is Pareto-optimal on the medical datasets.
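To make the notion of Pareto-optimality used here concrete, a minimal sketch that identifies the non-dominated methods in (Precision, Recall) space is shown below; the method names and values in the usage example are hypothetical and are not taken from Tables A5/A6:

```python
def pareto_frontier(scores):
    """Return the methods that are not dominated in (precision, recall).

    scores: dict mapping method name -> (precision, recall).
    A method is dominated if some other method is at least as good on
    both metrics and strictly better on at least one of them.
    """
    frontier = {}
    for name, (prec, rec) in scores.items():
        dominated = any(
            p2 >= prec and r2 >= rec and (p2 > prec or r2 > rec)
            for other, (p2, r2) in scores.items() if other != name
        )
        if not dominated:
            frontier[name] = (prec, rec)
    return frontier

# Hypothetical usage:
# pareto_frontier({"CSR": (0.55, 0.90), "C45": (0.70, 0.60), "NB": (0.50, 0.85)})
# -> {"CSR": (0.55, 0.90), "C45": (0.70, 0.60)}
```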

Quite surprising results were obtained by the Naïve Bayes classifier, which achieved the second rank in terms of Gmean and the first rank in terms of AUC on the medical datasets (see Table 5). Nevertheless, it is a well-known fact that this Bayesian classifier with the feature independence assumption behaves extremely well in the medical domain (Kononenko 2001). The general properties of the Naïve Bayes classifier have been thoroughly studied in (Domingos and Pazzani 1997), where a theoretical justification for its good performance has been given. However, we notice that our method performs better than the NB classifier and obtains less variable results (see the standard deviations in Tables A1, A2, A3, A4 in the Electronic Supplementary Material). This is an important result because both NB and our method try to approximate the optimal Bayes classifier.

In the paper, we have claimed that the main disadvantage of our approach is the exponential growth of the number of soft rules with an increasing number of attributes (see Lemma 1). In the experiment we verified this claim empirically by calculating the average execution time over five cross-validation folds. In Figs. 8 and 9 the times with respect to the number of attributes are presented for non-rule-based and rule-based methods, respectively. In Figs. 10 and 11 the times with respect to the growing number of examples are given for non-rule-based and rule-based methods, respectively. We can notice that the execution time of CSR grows as the number of attributes increases, as expected. The same effect is observed for the number of examples. However, we believe that this result may partly be a consequence of the testing procedure, which is more time-consuming for any combination of models (see Fig. 10a–d for the ensemble classifiers). Comparing CSR to the non-rule-based methods, we notice that our approach performs similarly to, or even takes less time than, the other combinations of models, i.e., AB, BAG, SBAG, and SBoost. This is especially evident for the ensemble classifiers that apply the additional SMOTE procedure (see Fig. 10c, d). However, CSR requires noticeably more time than any of the rule-based methods. The only exception is CFAR, which takes longer than our approach with respect to the number of attributes (see Fig. 9c) and is comparable with respect to the number of examples (see Fig. 11c).

Last but not least, we would like to relate the results obtained by the classifiers on the oncology dataset to those obtained by the oncologists. The human experts were presented with only 100 cases, so their relatively mediocre performance should not be over-interpreted. Nonetheless, it can be stated that the predictions of the machine learning methods are at least comparable with those of expert oncologists (see Fig. 6).

3.5 Exemplary knowledge for oncology data

At the end of the experimental section we would like to demonstrate exemplary knowledge expressed in terms of soft rules. Let us consider the oncology data and one randomly chosen patient, described as follows:

  1. Menopausal status false.

  2. Tumor stage less than 20 mm.

  3. Tumor grade medium.

  4. Histological type of the tumor ductal.

  5. Level of progesterone receptors in tumor (in fmol per mg of protein) more than 10.

  6. Invasiveness of the tumor no.

  7. Number of involved lymph nodes 0.

  8. Application of a therapy (cTherapy) false.

  9. Application of a therapy (hTherapy) false.

  10. Medical history 1st generation breast, ovarian or prostate cancer.

  11. Lymphatic or vascular invasion false.

  12. Level of estrogen receptors in tumor (in fmol per mg of protein) more than 30.

  13. Diameter of the largest removed lymph node less than 15 mm.

  14. Ratio between involved and total lymph nodes removed 0.

  15. Patient age group under 40.

Each item corresponds to one element of the patient's medical history.

Further, let us assume that we have performed the graph-based memorization based on the training data. Then we are able to generate soft rules for the given patient's description, for example:

  1. IF application of a therapy (cTherapy) false AND lymphatic or vascular invasion false
     THEN \(y=1 \text { with } p = 0.030\) or \(y=-1 \text { with } p = 0.226\).

  2. IF lymphatic or vascular invasion false
     THEN \(y=1 \text { with } p = 0.107\) or \(y=-1 \text { with } p = 0.353\).

  3. IF menopausal status false AND tumor grade medium
     THEN \(y=1 \text { with } p = 0.036\) or \(y=-1 \text { with } p = 0.028\).

  4. IF application of a therapy (hTherapy) false
     THEN \(y=1 \text { with } p = 0.125\) or \(y=-1 \text { with } p = 0.316\).

We can notice that the second and the fourth rule would contribute significantly to the final prediction (see (7)). Nonetheless, it is instructive to examine how the class probabilities vary for a given antecedent. Including information about both cTherapy and lymphatic or vascular invasion drastically decreases the probability of \(y=1\) in comparison to including information about lymphatic or vascular invasion only (see rules 1 and 2). Therefore, one can easily generate the most important rules, i.e., the rules which contribute the most to the final decision, and present them in the form of an interpretable report to the physician or the patient. Moreover, in comparison to crisp rules, the soft rules enrich the report with additional information about the probability of the class label in the consequent of the rule.
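Purely as an illustration of how such a report could be assembled, the sketch below stores the four example rules as (antecedent, class-probability) pairs and prints the top-ranked ones; the ranking by largest class probability is an assumption made for this sketch only, and the actual contribution of each rule to the prediction is governed by the combination formula (7) defined earlier in the paper:

```python
# Illustrative sketch only: not the combination rule (7) from the paper.
soft_rules = [
    ("application of a therapy (cTherapy) false AND lymphatic or vascular invasion false",
     {+1: 0.030, -1: 0.226}),
    ("lymphatic or vascular invasion false",
     {+1: 0.107, -1: 0.353}),
    ("menopausal status false AND tumor grade medium",
     {+1: 0.036, -1: 0.028}),
    ("application of a therapy (hTherapy) false",
     {+1: 0.125, -1: 0.316}),
]

def report(rules, top_k=2):
    """Rank rules by their largest class probability (an assumed
    criterion) and print the top_k as a human-readable report."""
    ranked = sorted(rules, key=lambda rule: max(rule[1].values()), reverse=True)
    for antecedent, probs in ranked[:top_k]:
        label = max(probs, key=probs.get)
        print(f"IF {antecedent} THEN y = {label:+d} with p = {probs[label]:.3f}")

report(soft_rules)
# Prints the second and the fourth rule, which carry the largest probabilities.
```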

4 Conclusion

We have proposed the combination of soft rules with an application to the medical domain. The approach relies on probabilistic decision making with latent relationships among features represented by conjunctive features, and on a new manner of estimating probabilities in the case of the imbalanced data problem, namely the modified m-estimator (im-estimator). Moreover, we have presented graph-based memorization, a technique for aggregating data in the form of a graph which enables efficient memorization of observations. We would also like to emphasize that the combination of soft rules is a comprehensible model and can be useful in supporting medical diagnosis.

In ongoing research we are developing a Bayesian approach to the presented problem. Moreover, we focus on establishing a sampling method based on the Markov chain Monte Carlo technique for approximate inference in high-dimensional spaces. In this paper, we have assumed that the data are generated from one conjunctive feature, which only approximates the true concept representation. Therefore, we leave the investigation of inference with a set of conjunctive features as future research.