Learning monotone nonlinear models using the Choquet integral
DOI: 10.1007/s10994-012-5318-3
- Cite this article as:
- Fallah Tehrani, A., Cheng, W., Dembczyński, K. et al. Mach Learn (2012) 89: 183. doi:10.1007/s10994-012-5318-3
Abstract
The learning of predictive models that guarantee monotonicity in the input variables has received increasing attention in machine learning in recent years. By trend, the difficulty of ensuring monotonicity increases with the flexibility or, say, nonlinearity of a model. In this paper, we advocate the so-called Choquet integral as a tool for learning monotone nonlinear models. While being widely used as a flexible aggregation operator in different fields, such as multiple criteria decision making, the Choquet integral is much less known in machine learning so far. Apart from combining monotonicity and flexibility in a mathematically sound and elegant manner, the Choquet integral has additional features making it attractive from a machine learning point of view. Notably, it offers measures for quantifying the importance of individual predictor variables and the interaction between groups of variables. Analyzing the Choquet integral from a classification perspective, we provide upper and lower bounds on its VC-dimension. Moreover, as a methodological contribution, we propose a generalization of logistic regression. The basic idea of our approach, referred to as choquistic regression, is to replace the linear function of predictor variables, which is commonly used in logistic regression to model the log odds of the positive class, by the Choquet integral. First experimental results are quite promising and suggest that the combination of monotonicity and flexibility offered by the Choquet integral facilitates strong performance in practical applications.
Keywords
Choquet integral, Monotone learning, Nonlinear models, Choquistic regression, Classification, VC dimension
1 Introduction
A proper specification of the type of dependency between a set of predictor (input) variables X _{1},…,X _{ m } and the target (output) variable Y is an important prerequisite for successful model induction. The specification of a corresponding hypothesis space imposes an inductive bias that, amongst others, allows for the incorporation of background knowledge in the learning process. An important type of background knowledge is monotonicity: Everything else being equal, the increase (decrease) of a certain input variable X _{ i } can only produce an increase in the output variable Y (e.g., a real number in regression, a class in ordered classification, or the probability of the positive class in binary classification). Adherence to this kind of background knowledge may not only be beneficial for model induction, but is often even considered as a hard constraint. For example, no medical doctor will accept a model in which the probability of cancer is not monotonically increasing in tobacco consumption.
Perhaps the sole disadvantage of a linear model is its inflexibility and, coming along with this, the supposed absence of any interaction between the variables: The effect of an increase of X _{ i } is always the same, namely ∂Y/∂X _{ i }=α _{ i }, regardless of the values of all other attributes. In many real applications, this assumption is not tenable. Instead, more complex, nonlinear models are needed to properly capture the dependencies between the inputs X _{ i } and the output Y.
In this paper, we advocate the use of the (discrete) Choquet integral as a tool that is interesting in this regard. As will be argued in more detail later on, the Choquet integral combines the aforementioned properties in a quite convenient and mathematically elegant way: By its very nature as an integral, it is a monotone operator, while at the same time allowing for interactions between attributes. Moreover, it offers natural measures for quantifying the importance of individual features and the interaction within groups of features, which provide important insights into the model and thereby support interpretability.
The rest of this paper, parts of which have already been presented in Fallah Tehrani et al. (2011), Hüllermeier and Fallah Tehrani (2012b), is organized as follows. In the next section, we give a brief overview of related work. In Sect. 3, we recall the basic definition of the Choquet integral and some related notions. In Sect. 4, we analyze the flexibility of binary classifiers based on the Choquet integral in terms of the notion of VC dimension. In Sect. 5, we propose a generalization of logistic regression for binary classification, in which the Choquet integral is used to model the log odds of the positive class. In Sect. 6, we elaborate on complexity issues and propose a method for finding a suitable level of (non-)additivity for the Choquet integral in a concrete learning task. Experimental results are presented in Sect. 7, prior to concluding the paper with a few remarks in Sect. 8.
2 Related work
As already mentioned, the problem of monotone classification has received increasing attention in the machine learning community in recent years,^{1} despite having been introduced in the literature much earlier (Ben-David et al. 1989). Meanwhile, several machine learning algorithms have been modified so as to guarantee monotonicity in attributes, including nearest neighbor classification (Duivesteijn and Feelders 2008), neural networks (Sill 1998), decision tree learning (Ben-David 1995; Potharst and Feelders 2002), rule induction (Dembczyński et al. 2009), as well as methods based on isotonic separation (Chandrasekaran et al. 2005) and piecewise linear models (Dembczyński et al. 2006).
Instead of modifying learning algorithms so as to guarantee monotone models, another idea is to modify the training data. To this end, data pre-processing methods such as re-labeling techniques have been developed. Such methods seek to repair inconsistencies in the training data, so that (standard) classifiers learned on that data will tend to be monotone (although, in general, they still do not guarantee this property) (Feelders 2010; Kotłowski et al. 2008).
Although the Choquet integral has been widely applied as an aggregation operator in multiple criteria decision making (Grabisch et al. 2000; Grabisch 1995a; Torra 2011), it has been used much less in the field of machine learning so far. There are, however, a few notable exceptions. First, the problem of extracting a Choquet integral (or, more precisely, the non-additive measure on which it is defined) in a data-driven way has been addressed in the literature (Beliakov 2008). Essentially, this is a parameter identification problem, which is commonly formalized as a constraint optimization problem, for example using the sum of squared errors as an objective function (Torra and Narukawa 2007; Grabisch 2003). To this end, Mori and Murofushi (1989) proposed an approach based on the use of quadratic forms, while an alternative heuristic, gradient-based method called HLMS (Heuristic Least Mean Squares) was introduced in Grabisch (1995b). In Angilella et al. (2009), Beliakov and James (2011), the Choquet integral is used in the context of ordinal classification. Besides, the Choquet integral has been used as an aggregation operator in the context of ensemble learning, i.e., for combining the predictions of different classifiers (Grabisch and Nicolas 1994).
3 The discrete Choquet integral
In this section, we give a brief introduction to the (discrete) Choquet integral, which, to the best of our knowledge, is not widely known in the field of machine learning so far. Since the Choquet integral can be seen as a generalization of the standard (Lebesgue) integral to the case of non-additive measures, we start with a reminder of this type of measure.
3.1 Non-additive measures
Let C={c _{1},…,c _{ m }} be a finite set and μ:2^{ C }→[0,1] a measure on this set. For each A⊆C, we interpret μ(A) as the weight or, say, the importance of the set of elements A. As an illustration, one may think of C as a set of criteria (binary features) relevant for a job, like “speaking French” and “programming Java”, and of μ(A) as the evaluation of a candidate satisfying criteria A (and not satisfying C∖A). The term “criterion” is indeed often used in the decision making literature, where it suggests a monotone “the higher the better” influence. In the context of machine learning, to which we shall turn later on, criteria are playing the role of features (input attributes).
A standard assumption on a measure μ(⋅), which is, for example, at the core of probability theory, is additivity: μ(A∪B)=μ(A)+μ(B) for all A,B⊆C such that A∩B=∅. Unfortunately, additive measures cannot model any kind of interaction between elements: Extending a set of elements A by a set of elements B always increases the weight μ(A) by the weight μ(B), regardless of A and B.
Suppose, for example, that the elements of two sets A and B are complementary in a certain sense. For instance, A={French,Spanish} and B={Java} could be seen as complementary, since both language skills and programming skills are important for the job. Formally, this can be expressed in terms of a positive interaction: μ(A∪B)>μ(A)+μ(B). In the extreme case, when language skills and programming skills are indeed essential, μ(A∪B) can be high although μ(A)=μ(B)=0 (suggesting that a candidate lacking either language or programming skills is completely unacceptable). Likewise, elements can interact in a negative way: If two sets A and B are partly redundant or competitive, then μ(A∪B)<μ(A)+μ(B). For example, A={C,C#} and B={Java} might be seen as redundant, since one programming language does in principle suffice.
A measure μ is said to be k-order additive, or simply k-additive, if k is the smallest integer such that m(A)=0 for all A⊆C with |A|>k. This property is interesting for several reasons. First, as can be seen from (4), it means that a measure μ can formally be specified by significantly fewer than 2^{ m } values, which are needed in the general case. Second, k-additivity is also interesting from a semantic point of view: As will become clear in the following, this property simply means that there are no interaction effects between subsets A,B⊆C whose cardinality exceeds k.
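To make the notions of interaction and k-additivity concrete, the following minimal sketch computes the Möbius transform of a small measure and reads off its order of additivity. It assumes the standard definition m(A) = Σ_{B⊆A} (−1)^{|A|−|B|} μ(B), which is not reproduced in this excerpt; the function names and the numerical values of μ are purely illustrative.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset, smallest first."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def mobius_transform(mu, criteria):
    """Return m(A) = sum over B subset of A of (-1)^(|A|-|B|) * mu(B)."""
    return {
        a: sum((-1) ** (len(a) - len(b)) * mu[b] for b in subsets(a))
        for a in subsets(criteria)
    }


def additivity_order(m, tol=1e-12):
    """Smallest k such that m(A) = 0 for every A with |A| > k."""
    return max((len(a) for a, v in m.items() if abs(v) > tol), default=0)


# Hypothetical measure with a positive interaction:
# mu({French, Java}) exceeds mu({French}) + mu({Java}).
criteria = ["French", "Java"]
mu = {
    frozenset(): 0.0,
    frozenset({"French"}): 0.2,
    frozenset({"Java"}): 0.3,
    frozenset({"French", "Java"}): 1.0,
}

m = mobius_transform(mu, criteria)
k = additivity_order(m)
```

In this toy example the Möbius mass of the pair is positive, which reflects the positive interaction, and the measure is therefore 2-additive rather than additive.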
3.2 Importance of criteria and interaction
Measuring the importance of a criterion c _{ i } becomes obviously more involved when μ is non-additive. Besides, one may then also be interested in a measure of interaction between the criteria, either pairwise or even of a higher order. In the literature, measures of that kind have been proposed, both for the importance of single as well as the interaction between several criteria.
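As an illustration, the sketch below computes the Shapley value of each criterion and the pairwise interaction index for a small measure. It assumes the standard formulas from the literature (the Shapley value as a weighted average of marginal contributions μ(A∪{c_i})−μ(A), and the interaction index of Murofushi and Soneda for pairs); the helper names and the measure are hypothetical.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def shapley_value(mu, criteria, c):
    """Average marginal contribution of criterion c under measure mu."""
    n = len(criteria)
    rest = [d for d in criteria if d != c]
    total = 0.0
    for a in subsets(rest):
        weight = factorial(n - len(a) - 1) * factorial(len(a)) / factorial(n)
        total += weight * (mu[a | {c}] - mu[a])
    return total


def interaction_index(mu, criteria, c, d):
    """Pairwise interaction index between criteria c and d."""
    n = len(criteria)
    rest = [e for e in criteria if e not in (c, d)]
    total = 0.0
    for a in subsets(rest):
        weight = factorial(n - len(a) - 2) * factorial(len(a)) / factorial(n - 1)
        total += weight * (mu[a | {c, d}] - mu[a | {c}] - mu[a | {d}] + mu[a])
    return total


# Hypothetical measure on two criteria (values chosen for illustration only).
criteria = ["French", "Java"]
mu = {
    frozenset(): 0.0,
    frozenset({"French"}): 0.2,
    frozenset({"Java"}): 0.3,
    frozenset({"French", "Java"}): 1.0,
}

phi = {c: shapley_value(mu, criteria, c) for c in criteria}
pair = interaction_index(mu, criteria, "French", "Java")
```

For a normalized measure the Shapley values sum to μ(C)=1, so they can be read as shares of the total importance; a positive pairwise index indicates complementary criteria, a negative one redundancy.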
3.3 The Choquet integral
So far, the criteria c _{ i } were simply considered as binary features, which are either present or absent. Mathematically, μ(A) can thus also be seen as an integral of the indicator function of A, namely the function f _{ A } given by f _{ A }(c)=1 if c∈A and =0 otherwise. Now, suppose that f: C→ℝ_{+} is any non-negative function that assigns a value to each criterion c _{ i }; for example, f(c _{ i }) might be the degree to which a candidate satisfies criterion c _{ i }. An important question, then, is how to aggregate the evaluations of individual criteria, i.e., the values f(c _{ i }), into an overall evaluation, in which the criteria are properly weighted according to the measure μ. Mathematically, this overall evaluation can be considered as an integral \(\mathcal{C}_{\mu}(f)\) of the function f with respect to the measure μ.
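A minimal sketch of this aggregation is given below. It assumes the usual formula for the discrete Choquet integral, in which the criteria are sorted by increasing value and each increment f(c_(i))−f(c_(i−1)) is weighted by the measure of the set of criteria whose value is at least f(c_(i)); the function name and the example measure are illustrative.

```python
def choquet_integral(f, mu):
    """Discrete Choquet integral of f with respect to mu.

    f  : dict mapping each criterion to a non-negative value
    mu : dict mapping frozensets of criteria to their measure
    """
    order = sorted(f, key=f.get)  # criteria by increasing value
    total, previous = 0.0, 0.0
    for i, c in enumerate(order):
        upper = frozenset(order[i:])  # criteria with value >= f[c]
        total += (f[c] - previous) * mu[upper]
        previous = f[c]
    return total


# Hypothetical measure on two criteria (values chosen for illustration only).
mu = {
    frozenset(): 0.0,
    frozenset({"French"}): 0.2,
    frozenset({"Java"}): 0.3,
    frozenset({"French", "Java"}): 1.0,
}

# A candidate who satisfies the criteria to different degrees.
score = choquet_integral({"French": 0.8, "Java": 0.4}, mu)

# For an indicator function the integral reduces to the measure itself.
indicator = choquet_integral({"French": 0.0, "Java": 1.0}, mu)
```

Here the common level 0.4 is weighted by the measure of the full set, and only the surplus of 0.4 on "French" is weighted by μ({French}), which is how the integral accounts for interaction.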
4 The VC dimension of the Choquet integral
Advocating the Choquet integral as a novel tool for machine learning immediately begs an interesting theoretical question, namely the question regarding the capacity of the corresponding model class. In fact, since the Choquet integral in its general form (not restricted to k-additive measures) has a rather large number of parameters, one may expect it to be quite flexible and, therefore, to have a high capacity. On the other hand, the parameters cannot be chosen freely. Instead, they are highly constrained due to the properties of the underlying fuzzy measure.
In any case, knowledge about the VC dimension of the Choquet integral (or, more specifically, a binary classifier based on the Choquet integral as an underlying aggregation function) is not only of theoretical but also of practical relevance. In particular, it may help to find the right level of flexibility for the data at hand. As mentioned earlier, because of its highly nonlinear nature, one may expect that the Choquet integral in its most general form comes with a danger of overfitting the data. On the other hand, a restriction to k-additive measures may provide a reasonable means for regularization. Both conjectures will be confirmed in this section.
Theorem 1
For the model class \(\mathcal{H}\) as defined above, \(\mathit{VC}(\mathcal{H}) = \varOmega(2^{m}/\sqrt{m})\). That is, the VC dimension of \(\mathcal{H}\) grows asymptotically at least as fast as \(2^{m}/\sqrt{m}\).
Proof
In order to prove this claim, we construct a sufficiently large data set \(\mathcal {D}\) and show that, despite its size, it can be shattered by \(\mathcal{H}\). In this construction, we restrict ourselves to binary attribute values, which means that x _{ i }∈{0,1} for all 1≤i≤m. Consequently, each instance x=(x _{1},…,x _{ m })∈{0,1}^{ m } can be identified with a subset of indices S _{ x }⊆X={1,2,…,m}, namely its indicator set S _{ x }={i∣x _{ i }=1}.
Now, we define the data set \(\mathcal {D}\) in terms of the collection of all instances x=(x _{1},…,x _{ m })∈{0,1}^{ m } whose indicator set S _{ x } is a q-subset of X. Recall that, from a decision making perspective, each attribute can be interpreted as a criterion. Thus, each instance in our data set satisfies exactly q of the m criteria, and there is not a single “dominance” relation in the sense that the set of criteria satisfied by one instance is a superset of those satisfied by another instance. Intuitively, the instances in \(\mathcal {D}\) are therefore maximally incomparable. This is precisely the property we are now going to exploit in order to show that \(\mathcal {D}\) can be shattered by \(\mathcal {H}\).
Noting that the special case where \(\mathcal {P} = \emptyset\) is handled correctly by the Möbius transform m such that m(C)=1 and m(T)=0 for all T⊊C (and any threshold β>0), we can conclude that the data set \(\mathcal {D}\) can be shattered by \(\mathcal {H}\). Consequently, the VC dimension of \(\mathcal {H}\) is at least the size of \(\mathcal {D}\), whence (10) is a lower bound of \(\mathit{VC}(\mathcal {H})\).
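The combinatorial part of this construction can be checked numerically. The sketch below builds the data set of all binary instances with exactly q ones, verifies that no instance dominates another, and compares its size with \(2^{m}/\sqrt{m}\). The choice q=⌊m/2⌋ is an assumption made here because it maximizes the binomial coefficient; the function names are illustrative.

```python
from itertools import combinations
from math import comb, sqrt


def antichain_dataset(m, q):
    """All binary vectors of length m with exactly q ones."""
    return [
        tuple(1 if i in chosen else 0 for i in range(m))
        for chosen in map(set, combinations(range(m), q))
    ]


def dominates(x, y):
    """True if x differs from y and is at least as large in every attribute."""
    return x != y and all(a >= b for a, b in zip(x, y))


m = 6
q = m // 2  # assumption: the middle layer gives the largest data set
data = antichain_dataset(m, q)

# No instance satisfies a superset of the criteria of another instance.
has_dominance = any(dominates(x, y) for x in data for y in data)

# Size of the data set relative to the claimed growth rate 2^m / sqrt(m).
ratio = comb(m, q) / (2 ** m / sqrt(m))
```

By Stirling's approximation, the ratio tends to \(\sqrt{2/\pi}\approx 0.8\) as m grows, which is consistent with a lower bound of order \(2^{m}/\sqrt{m}\).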
Remark 1
Remark 2
5 Choquistic regression
Logistic regression is a well-established statistical method for (probabilistic) classification (Hosmer and Lemeshow 2000). Its popularity is due to a number of appealing properties, including monotonicity and comprehensibility: Since the model is essentially linear in the input attributes, the strength of influence of each predictor is directly reflected by the corresponding regression coefficient. Moreover, the influence of each attribute is monotone in the sense that an increase of the value of the attribute can either only increase or only decrease the probability of the positive class (depending on whether the associated regression coefficient is positive or negative).
Needless to say, the linearity of the above model is a strong restriction from a learning point of view, and the possibility of interactions between predictor variables has of course also been noticed in the statistical literature (Jaccard 2001). A standard way to handle such interaction effects is to add interaction terms to the linear function of predictor variables, like in (2). As explained earlier, however, the aforementioned advantages of logistic regression will then be lost.
In the following, we therefore propose an extension of logistic regression that allows for modeling nonlinear relationships between input and output variables while preserving the advantages of comprehensibility and monotonicity. As mentioned earlier, the monotonicity constraint is important if the direction of influence of an input attribute is known beforehand and needs to be reflected by the model, an assumption that we shall make in the following. As an aside, we note that one may also envision the case where an attribute is known to have a monotone influence, although the direction of influence is unknown. The learning problem then becomes slightly more difficult, since the learner has to figure out whether the influence is positive (increasing) or negative (decreasing). We shall not consider this problem any further, however, and instead assume the direction of influence to be given as prior knowledge.
5.1 The Choquistic model
5.2 Normalization
5.3 Logistic regression as a special case
5.4 Parameter estimation
A solution to the above problem can be produced by standard solvers. Concretely, we used the fmincon function implemented in the optimization toolbox of Matlab. This method is based on a sequential quadratic programming approach.
Recall that, once the model has been identified, the importance of each attribute and the degree of interaction between groups of attributes can be derived from the Möbius transform m; these are given, respectively, by the Shapley value and the interaction indexes as introduced in Sect. 3.2.
6 Complexity reduction
Obviously, choquistic regression can be interpreted as fitting a (constrained) linear function in the feature space spanned by the set of features f _{ T } defined by (13), with one feature for each subset of criteria T⊆{1,2,…,m}. Since the dimensionality of this feature space is 2^{ m }−1, the method is clearly critical from a complexity point of view. It was already mentioned that an L _{1}-regularization in (22) may shrink some coefficients to 0 and, therefore, some of the features f _{ T } may disappear. Although this may help to simplify the choquistic model, that is, the result produced by the learning algorithm, it does not simplify the optimization problem itself.
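The feature-space view can be sketched as follows. Since (13) is not reproduced in this excerpt, the code assumes the standard Möbius representation of the Choquet integral, in which f_T(x) is the minimum of x_i over i∈T and the integral is the sum of m(T)·f_T(x) over all nonempty T; the function names and the Möbius masses are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(m):
    """Yield all nonempty subsets of {0, ..., m-1} as index tuples."""
    for size in range(1, m + 1):
        yield from combinations(range(m), size)


def feature_map(x):
    """Map an instance x to the vector (min_{i in T} x_i) over nonempty T."""
    return [min(x[i] for i in t) for t in nonempty_subsets(len(x))]


def choquet_from_mobius(x, mobius):
    """Linear function of the features f_T with Moebius masses as weights."""
    return sum(w * min(x[i] for i in t) for t, w in mobius.items())


# Three attributes give 2^3 - 1 = 7 features.
features = feature_map([0.8, 0.4, 0.6])

# Hypothetical Moebius masses for two attributes (they sum to 1).
mobius = {(0,): 0.2, (1,): 0.3, (0, 1): 0.5}
value = choquet_from_mobius([0.8, 0.4], mobius)
```

The exponential growth of the feature vector with m is what makes the unrestricted model costly to fit.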
Thus, one may wonder whether some of the features (13) could not even be eliminated prior to solving the actual optimization problem. Specifically interesting in this regard is a possible restriction of the choquistic model to k-additive measures, for a suitable value of k<m. Since this means that significantly fewer parameters need to be identified (namely \(\sum_{i=1}^{k}\binom{m}{i}\) Möbius coefficients, one for each nonempty subset of at most k criteria, instead of 2^{ m }−1), the computational complexity might be reduced drastically. Besides, a restriction to k-additive measures may also have advantages from a learning point of view, as it reduces the capacity of the underlying model class (cf. Sect. 4) and thus may prevent over-fitting of the data in cases where the full flexibility of the Choquet integral is actually not needed. Of course, the key problem to be addressed concerns the question of how to choose k in the most favorable way.
6.1 Exploiting equivalence of features for dimensionality reduction
In the following, we shall elaborate on the following question: Is it possible to find an upper bound on the required level of complexity of the model, namely the level of additivity k, prior to fitting the Choquet integral to the data? Or, more specifically, can we determine the value k in such a way that fitting a k-additive measure is definitely enough, in the sense that each labeling of the training data produced by the full Choquet integral (k=m) can also be produced by a Choquet integral based on a k-additive measure?
Now, imagine that a subset of features \(\mathcal{F} = \{ f_{T_{1}}, \ldots, f_{T_{r}} \}\) assumes the same value, not only for a single instance, but for all instances in the training data. Then, this set can be said to form an equivalence class. Thus, one of the features could in principle be selected as a representative, absorbing all the weights of the others; more specifically, the weight of this feature would be set to m(T _{1})+m(T _{2})+⋯+m(T _{ r }), while the weights of the other features in \(\mathcal{F}\) would be set to 0.
Note, however, that this “transfer of Möbius mass” will in general not be feasible, as it may cause a violation of the monotonicity constraint on the fuzzy measure μ. As a side remark, we also note that, from a learning point of view, the equivalence of features may obviously cause problems with regard to the identifiability of coefficients; due to the monotonicity constraints just mentioned, however, this is not necessarily the case.
Proposition 1
We like to emphasize that k ^{∗} is only an upper bound on the complexity needed to fit the training data. Thus, it is not necessarily the optimal k from the point of view of model induction (which might be figured out by the regularizer in (22)). In particular, note that the computation of k ^{∗} does not refer to the output values y ^{(i)}. Instead, it should be considered as a measure of the complexity of the training instances. As such, it is obviously connected to the notion of VC dimension.
Since the exact reproducibility (25) may appear overly stringent or, stated differently, a small loss may actually be acceptable, we finally propose a relaxation somewhat in line with the idea of probably approximately correct (PAC) learning (Valiant 1984). First, noting that the Choquet integral may change by at most ϵ when combining features f _{ A } and f _{ B } such that |f _{ A }−f _{ B }|<ϵ, one may think of relaxing the definition of equivalence as follows: f _{ A } and f _{ B } are ϵ-equivalent (on a given training instance x) if |f _{ A }(x)−f _{ B }(x)|<ϵ. Second, we relax the condition of coverage. Denoting by v(A,B)∈[0,1] the fraction of training examples on which f _{ A } and f _{ B } are ϵ-equivalent, we say that f _{ A } ϵ-δ-covers f _{ B } if v(A,B)≥1−δ. Roughly speaking, for a small ϵ and δ close to 0, this means that, with only a few exceptions, the values of f _{ A } and f _{ B } are almost the same on the training data (we used ϵ=δ=0.1 in our experiments below). In order to find a proper upper bound k ^{∗}, the principle (24) can be used as before, just replacing coverage with ϵ-δ-coverage.
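The relaxed coverage test can be sketched in a few lines. The code assumes, as is standard for the Möbius representation, that f_T(x) is the minimum of x_i over i∈T; the function names and the four training instances are hypothetical, and the defaults mirror the setting ϵ=δ=0.1 mentioned above.

```python
def f(subset, x):
    """Feature f_T(x), assumed here to be the minimum of x_i over i in T."""
    return min(x[i] for i in subset)


def agreement_fraction(a, b, instances, eps):
    """v(A, B): fraction of instances on which f_A and f_B are eps-equivalent."""
    hits = sum(abs(f(a, x) - f(b, x)) < eps for x in instances)
    return hits / len(instances)


def eps_delta_covers(a, b, instances, eps=0.1, delta=0.1):
    """True if f_A eps-delta-covers f_B on the given training instances."""
    return agreement_fraction(a, b, instances, eps) >= 1 - delta


# Hypothetical training instances with three attributes in [0, 1].
instances = [
    (0.2, 0.9, 0.5),
    (0.4, 0.1, 0.8),
    (0.3, 0.6, 0.35),
    (0.7, 0.2, 0.9),
]

# Attribute 2 never falls below attribute 0, so f_{0} and f_{0,2} coincide.
v_same = agreement_fraction((0,), (0, 2), instances, eps=0.1)
covered = eps_delta_covers((0,), (0, 2), instances)

# Attribute 1 is often smaller than attribute 0, so f_{0,1} is not covered.
v_diff = agreement_fraction((0,), (0, 1), instances, eps=0.1)
not_covered = eps_delta_covers((0,), (0, 1), instances)
```

Note that the test uses only the inputs, never the labels, in line with the remark that k ^{∗} measures the complexity of the training instances.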
7 Experiments
In this section, we present the results of an experimental study that was conducted in order to validate the practical performance of our choquistic regression (CR) method. The goal of this study is twofold. First, we would like to show that CR is competitive in terms of predictive accuracy. To this end, we compare it with several alternative methods on a number of (monotone) benchmark data sets. Second, we would like to corroborate our claim that the CR model provides interesting information about attribute importance and interaction. To this end, we discuss some examples showing that the corresponding Shapley and interaction values produced by CR are indeed meaningful and plausible.
7.1 Data sets
Data sets and their properties
data set | #instances | #attributes | source
---|---|---|---
Den Bosch (DBS) | 120 | 8 | Daniels and Kamp (1999)
CPU | 209 | 6 | UCI
Breast Cancer (BCC) | 286 | 7 | UCI
Auto MPG | 392 | 7 | UCI
Employee Selection (ESL) | 488 | 4 | WEKA
Mammographic (MMG) | 961 | 5 | UCI
Employee Rejection/Acceptance (ERA) | 1000 | 4 | WEKA
Lecturers Evaluation (LEV) | 1000 | 4 | WEKA
Car Evaluation (CEV) | 1728 | 6 | UCI
Den Bosch (DBS)
This data set contains 8 attributes describing houses in the city of Den Bosch: area, number of bedrooms, type of house, volume, storeys, type of garden, garage, and price. The output is a binary variable indicating whether the price of the house is low or high (depending on whether or not it exceeds a threshold).
CPU
This is a standard benchmark data set from the UCI repository. It contains eight input attributes, two of which were removed since they are obviously of no predictive value (vendor name, model name). The problem is to predict the (estimated) relative performance of a CPU (binarized by thresholding at the median) based on its machine cycle time in nanoseconds, minimum main memory in kilobytes, maximum main memory in kilobytes, cache memory in kilobytes, minimum channels in units, maximum channels in units.
Breast Cancer (BCC)
This data set was obtained from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia. There are 7 attributes, namely menopause gain, tumor-size, inv-nodes, node-caps, deg-malig, breast cost, irradiat gain. The output is a binary variable, namely no-recurrence-events and recurrence-events.
Auto MPG
This data set was used in the 1983 American Statistical Association Exposition. The problem is to predict the city-cycle fuel consumption in miles per gallon (binarized by thresholding at the median) based on the following attributes of a car: cylinders, displacement, horsepower, weight, acceleration, model year, origin. We removed incomplete instances.
Employee Selection (ESL)
This data set contains profiles of applicants for certain industrial jobs. The values of the four input attributes were determined by expert psychologists based upon psychometric test results and interviews with the candidates. The output is an overall score on an ordinal scale between 1 and 9, corresponding to the degree of suitability of each candidate to this type of job. We binarized the output value by distinguishing between suitable (score 6–9) and unsuitable (score 1–5) candidates.
Mammographic (MMG)
This data set is about breast cancer screening by mammography. The goal is to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes (mass shape, mass margin, density) and the patient’s age.
Employee Rejection/Acceptance (ERA)
This data set originates from an academic decision-making experiment. The input attributes are features of a candidate such as past experience, verbal skills, etc., and the output is the subjective judgment of a decision-maker, measured on an ordinal scale from 1 to 9, to which degree he or she tends to accept the applicant for the job. We binarized the output value by distinguishing between acceptance (score 5–9) and rejection (score 1–4).
Lecturers Evaluation (LEV)
This data set contains examples of anonymous lecturer evaluations, taken at the end of MBA courses. Students were asked to score their lecturers according to four attributes such as oral skills and contribution to their professional/general knowledge. The output was a total evaluation of each lecturer’s performance, measured on an ordinal scale from 0 to 4. We binarized the output value by distinguishing between good (score 3–4) and bad evaluation (score 0–2).
Car Evaluation (CEV)
This data set contains 6 attributes describing a car, namely, buying price, price of the maintenance, number of doors, capacity in terms of persons to carry, the size of luggage boot, estimated safety of the car. The output is the overall evaluation of the car: unacceptable, acceptable, good, very good. We binarized this evaluation into unacceptable versus not unacceptable (acceptable, good or very good).
7.2 Methods
Since choquistic regression (CR) can be seen as an extension of standard logistic regression (LR), it is natural to compare these two methods. Essentially, this comparison should give an idea of the usefulness of an increased flexibility. Conversely, one may also ask about the usefulness of ensuring monotonicity. Therefore, we additionally included two other extensions of LR, which are flexible but not necessarily monotone, namely kernel logistic regression (KLR) with polynomial and Gaussian kernels. The degree of the polynomial kernel was set to 2, so that it models low-level interactions of the features. The Gaussian kernel, on the other hand, is able to capture interactions of higher order. For each data set, the width parameter of the Gaussian kernel was selected from {10^{−4},10^{−3},10^{−2},10^{−1},10^{0}} in the most favorable way. Likewise, the regularization parameter η in choquistic regression was selected from {10^{−3},10^{−2},10^{−1},10^{0},10^{1},10^{2}}.
Finally, we also included two methods that are both monotone and flexible, namely the MORE algorithm for learning rule ensembles under monotonicity constraints (Dembczyński et al. 2009) and the LMT algorithm for logistic model tree induction (Landwehr et al. 2003). Following the idea of forward stagewise additive modeling (Tibshirani et al. 2001), the MORE algorithm treats a single rule as a subsidiary base classifier in the ensemble. The rules are added to the ensemble one by one. Each rule is fitted by concentrating on the examples that are most difficult to classify correctly by rules that have already been generated. The LMT algorithm builds tree-structured models that contain logistic regression functions at the leaves. It is based on a stagewise fitting process to construct the logistic regression models that can select relevant attributes from the data. This process is used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree structure.
7.3 Results
7.3.1 Performance in terms of predictive accuracy
As performance measures, we determined the standard misclassification rate (0/1 loss) as well as the AUC. Estimates of both measures were obtained by randomly splitting the data into two parts, one part for training and one part for testing. This procedure was repeated 100 times, and the results were averaged. In order to analyze the influence of the amount of training data, we varied the proportion between training and test data from 20:80 over 50:50 to 80:20. In these experiments, we used a variant of CR in which the underlying fuzzy measure is restricted to be k-additive, with k determined by means of an internal cross-validation. Compared with other variants (cf. Sect. 7.3.2), this one performed best in terms of accuracy.
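The evaluation protocol can be sketched as follows. This is only an illustration of repeated random splitting with 0/1 loss and AUC, not the code used in the study: the `fit` callback, the decision threshold of 0.5, the seed, and the toy data are assumptions, and splits whose test part contains a single class are skipped for the AUC.

```python
import random


def zero_one_loss(y_true, y_pred):
    """Fraction of misclassified examples."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a positive is scored above a negative (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_holdout(X, y, fit, train_fraction, repeats=100, seed=0):
    """Average 0/1 loss and AUC over random train/test splits.

    `fit(X_train, y_train)` is a hypothetical callback that returns a
    function mapping an instance to a score in [0, 1].
    """
    rng = random.Random(seed)
    n_train = int(round(train_fraction * len(y)))
    losses, aucs = [], []
    for _ in range(repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        score = fit([X[i] for i in train], [y[i] for i in train])
        y_test = [y[i] for i in test]
        s_test = [score(X[i]) for i in test]
        losses.append(zero_one_loss(y_test, [int(s >= 0.5) for s in s_test]))
        if 0 < sum(y_test) < len(y_test):  # AUC needs both classes
            aucs.append(auc(y_test, s_test))
    mean_auc = sum(aucs) / len(aucs) if aucs else float("nan")
    return sum(losses) / len(losses), mean_auc


# Toy data: one attribute, label 1 iff the attribute is at least 0.5.
X = [[i / 19] for i in range(20)]
y = [int(x[0] >= 0.5) for x in X]

# Stand-in learner that ignores the training data and scores by the attribute.
mean_loss, mean_auc = repeated_holdout(X, y, lambda Xt, yt: (lambda x: x[0]), 0.5)
small_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Calling `repeated_holdout` with `train_fraction` set to 0.2, 0.5 and 0.8 corresponds to the three training proportions considered here.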
A possible improvement of CR over its competitors, in terms of predictive accuracy, may be due to two reasons: First, in comparison to standard LR, it is more flexible and has the ability to capture nonlinear dependencies between input attributes. Second, in comparison to non-monotone learners, it takes background knowledge about the dependency between input and output variables into consideration.
Classification performance in terms of the mean and standard deviation of 0/1 loss. From top to bottom: 20 %, 50 %, and 80 % training data. (Average ranks comparing significantly worse with CR at the 90 % confidence level are put in bold font)
data set | CR | LR | KLR-ply | KLR-rbf | MORE | LMT |
---|---|---|---|---|---|---|
DBS | .1713 ± .0424 (2) | .2124 ± .0650 (6) | .1695 ± .0437 (1) | .1883 ± .0536 (4) | .1932 ± .0511 (5) | .1779 ± .0420 (3) |
CPU | .0811 ± .0103 (3) | .0711 ± .0312 (1) | .0996 ± .0231 (6) | .0802 ± .0292 (2) | .0829 ± .0379 (4) | .0850 ± .0256 (5) |
BCC | .2775 ± .0335 (2) | .2893 ± .0240 (6) | .2760 ± .0243 (1) | .2787 ± .0237 (3) | .2827 ± .0255 (4) | .2884 ± .0306 (5) |
MPG | .0709 ± .0193 (1) | .0832 ± .0151 (6) | .0788 ± .0097 (4) | .0772 ± .0107 (2) | .0811 ± .0119 (5) | .0773 ± .0148 (3) |
ESL | .0682 ± .0129 (1) | .0733 ± .0107 (2) | .1488 ± .0278 (6) | .0756 ± .0167 (3) | .0838 ± .0241 (5) | .0771 ± .0148 (4) |
MMG | .1725 ± .0120 (1) | .1729 ± .0122 (2) | .1960 ± .0160 (6) | .1791 ± .0133 (4) | .1764 ± .0137 (3) | .1803 ± .0171 (5) |
ERA | .2889 ± .0273 (1) | .2902 ± .0317 (2) | .3001 ± .0130 (5) | .2934 ± .0112 (3) | .3155 ± .0150 (6) | .2963 ± .0126 (4) |
LEV | .1499 ± .0122 (1) | .1655 ± .0082 (3) | .1627 ± .0119 (2) | .1691 ± .0125 (5) | .1707 ± .0186 (6) | .1672 ± .0140 (4) |
CEV | .0448 ± .0089 (3) | .1410 ± .0079 (6) | .0663 ± .0130 (5) | .0618 ± .0151 (4) | .0339 ± .0076 (1) | .0432 ± .0116 (2) |
avg. rank | 1.67 | 3.78 | 4 | 3.33 | 4.33 | 3.89 |
DBS | .1572 ± .0416 (4) | .1708 ± .0380 (6) | .1333 ± .0333 (1) | .1692 ± .0382 (5) | .1457 ± .0413 (3) | .1473 ± .0406 (2) |
CPU | .0464 ± .0281 (1) | .0626 ± .0247 (4) | .0835 ± .0264 (6) | .0547 ± .0233 (3) | .0489 ± .0226 (2) | .0674 ± .0243 (5) |
BCC | .2687 ± .0282 (4) | .2799 ± .0245 (6) | .2591 ± .0287 (1) | .2599 ± .0301 (2) | .2640 ± .0288 (3) | .2717 ± .0295 (5) |
MPG | .0577 ± .0251 (1) | .0654 ± .0150 (2) | .0728 ± .0159 (4) | .0744 ± .0151 (5) | .0751 ± .0178 (6) | .0672 ± .0164 (3) |
ESL | .0601 ± .0126 (1) | .0704 ± .0113 (4) | .1023 ± .0225 (6) | .0682 ± .0121 (2) | .0695 ± .0139 (3) | .0709 ± .0135 (5) |
MMG | .1667 ± .0144 (1) | .1701 ± .0158 (5) | .1721 ± .0164 (6) | .1693 ± .0130 (4) | .1691 ± .0140 (3) | .1671 ± .0167 (2) |
ERA | .2844 ± .0306 (1) | .2851 ± .0303 (2) | .2926 ± .0151 (4) | .2882 ± .0142 (3) | .3037 ± .0180 (6) | .2956 ± .0148 (5) |
LEV | .1372 ± .0125 (1) | .1651 ± .0133 (6) | .1520 ± .0160 (4) | .1493 ± .0165 (3) | .1486 ± .0157 (2) | .1545 ± .0142 (5) |
CEV | .0376 ± .0059 (4) | .1360 ± .0101 (6) | .0328 ± .0057 (3) | .0463 ± .0086 (5) | .0215 ± .0053 (2) | .0174 ± .0069 (1) |
avg. rank | 2 | 4.56 | 3.89 | 3.56 | 3.33 | 3.67 |
DBS | .1416 ± .0681 (4) | .1616 ± .0743 (6) | .1265 ± .0663 (2) | .1343 ± .0672 (3) | .1242 ± .0609 (1) | .1433 ± .0667 (5) |
CPU | .0212 ± .0301 (1) | .0640 ± .0335 (5) | .0754 ± .0372 (6) | .0405 ± .0284 (3) | .0412 ± .0299 (4) | .0338 ± .0352 (2) |
BCC | .2496 ± .0485 (1) | .2773 ± .0548 (6) | .2569 ± .0506 (2) | .2598 ± .0529 (4) | .2570 ± .0463 (3) | .2707 ± .0554 (5) |
MPG | .0551 ± .0160 (1) | .0611 ± .0263 (2) | .0727 ± .0268 (4) | .0740 ± .0284 (6) | .0737 ± .0269 (5) | .0614 ± .0251 (3) |
ESL | .0542 ± .0218 (1) | .0660 ± .0203 (3) | .0922 ± .0279 (6) | .0657 ± .0229 (2) | .0661 ± .0219 (4) | .0691 ± .0228 (5) |
MMG | .1584 ± .0251 (1) | .1657 ± .0232 (4) | .1741 ± .0246 (6) | .1696 ± .0271 (5) | .1645 ± .0235 (3) | .1595 ± .0283 (2) |
ERA | .2813 ± .0280 (1) | .2843 ± .0302 (2) | .2918 ± .0290 (5) | .2905 ± .0312 (3) | .2988 ± .0276 (6) | .2910 ± .0290 (4) |
LEV | .1314 ± .0176 (1) | .1627 ± .0249 (6) | .1472 ± .0231 (3) | .1496 ± .0233 (5) | .1397 ± .0214 (2) | .1474 ± .0232 (4) |
CEV | .0273 ± .0089 (4) | .1328 ± .0173 (6) | .0286 ± .0075 (5) | .0239 ± .0066 (3) | .0190 ± .0070 (2) | .0089 ± .0047 (1) |
avg. rank | 1.67 | 4.44 | 4.33 | 3.78 | 3.33 | 3.44 |
Performance in terms of the average AUC ± standard deviation. From top to bottom: 20 %, 50 %, and 80 % training data. (Average ranks comparing significantly worse with CR at the 90 % confidence level are put in bold font)
data set | CR | LR | KLR-ply | KLR-rbf | MORE | LMT |
---|---|---|---|---|---|---|
DBS | .9290 ± .0322 (2) | .8866 ± .0511 (5) | .9359 ± .0218 (1) | .9053 ± .0433 (4) | .8731 ± .0481 (6) | .9151 ± .0228 (3) |
CPU | .9822 ± .0121 (2) | .9806 ± .0124 (4) | .9716 ± .0072 (6) | .9843 ± .0116 (1) | .9749 ± .0235 (5) | .9816 ± .0113 (3) |
BCC | .6400 ± .0641 (6) | .6970 ± .0411 (3) | .6509 ± .0568 (5) | .7124 ± .0290 (2) | .6639 ± .0567 (4) | .7310 ± .0675 (1) |
MPG | .9788 ± .0160 (1) | .9675 ± .0068 (5) | .9704 ± .0075 (4) | .9741 ± .0055 (3) | .9501 ± .0263 (6) | .9753 ± .0092 (2) |
ESL | .9670 ± .0074 (4) | .9721 ± .0060 (1) | .9638 ± .0106 (5) | .9705 ± .0099 (2) | .9466 ± .0484 (6) | .9696 ± .0086 (3) |
MMG | .8867 ± .0123 (4) | .8962 ± .0080 (1) | .8552 ± .0203 (6) | .8938 ± .0121 (2) | .8754 ± .0274 (5) | .8890 ± .0259 (3) |
ERA | .7669 ± .0334 (1) | .7602 ± .0331 (4) | .7555 ± .0139 (5) | .7662 ± .0098 (2) | .7198 ± .0329 (6) | .7619 ± .0160 (3) |
LEV | .8971 ± .0098 (1) | .8905 ± .0081 (2) | .8870 ± .0094 (3) | .8860 ± .0128 (4) | .8137 ± .0621 (6) | .8797 ± .0182 (5) |
CEV | .9825 ± .0080 (3) | .9332 ± .0033 (6) | .9818 ± .0058 (5) | .9821 ± .0076 (4) | .9888 ± .0063 (2) | .9902 ± .0042 (1) |
avg. rank | 2.67 | 3.44 | 4.44 | 2.67 | 5.11 | 2.67 |
DBS | .9341 ± .0228 (2) | .9191 ± .0293 (4) | .9492 ± .0198 (1) | .9174 ± .0316 (6) | .9179 ± .0403 (5) | .9259 ± .0289 (3) |
CPU | .9920 ± .0073 (2) | .9914 ± .0056 (3) | .9771 ± .0109 (6) | .9925 ± .0056 (1) | .9873 ± .0149 (5) | .9883 ± .0077 (4) |
BCC | .6912 ± .0469 (6) | .7184 ± .0367 (3) | .7001 ± .0396 (4) | .7294 ± .0344 (2) | .6980 ± .0586 (5) | .7387 ± .0656 (1) |
MPG | .9818 ± .0075 (1) | .9803 ± .0084 (3) | .9776 ± .0083 (4) | .9752 ± .0068 (5) | .9563 ± .0313 (6) | .9814 ± .0074 (2) |
ESL | .9720 ± .0084 (4) | .9764 ± .0062 (1) | .9726 ± .0080 (3) | .9754 ± .0070 (2) | .9557 ± .0301 (6) | .9707 ± .0120 (5) |
MMG | .9003 ± .0132 (1) | .8972 ± .0125 (4) | .8962 ± .0140 (5) | .8995 ± .0091 (2) | .8839 ± .0305 (6) | .8976 ± .0153 (3) |
ERA | .7705 ± .0310 (4) | .7633 ± .0241 (5) | .7740 ± .0148 (2) | .7745 ± .0141 (1) | .7215 ± .0381 (6) | .7719 ± .0144 (3) |
LEV | .9098 ± .0103 (1) | .8935 ± .0113 (4) | .8999 ± .0120 (3) | .9012 ± .0128 (2) | .8185 ± .0580 (6) | .8920 ± .0164 (5) |
CEV | .9912 ± .0024 (4) | .9362 ± .0071 (6) | .9950 ± .0019 (2) | .9907 ± .0031 (5) | .9921 ± .0042 (3) | .9977 ± .0017 (1) |
avg. rank | 2.78 | 3.67 | 3.33 | 2.89 | 5.33 | 3 |
DBS | .9427 ± .0443 (3) | .9224 ± .0514 (6) | .9608 ± .0347 (1) | .9495 ± .0459 (2) | .9409 ± .0539 (4) | .9343 ± .0479 (5) |
CPU | .9971 ± .0063 (2) | .9907 ± .0085 (5) | .9827 ± .0167 (6) | .9984 ± .0052 (1) | .9909 ± .0167 (4) | .9959 ± .0078 (3) |
BCC | .7349 ± .0692 (1) | .7253 ± .0715 (4) | .7071 ± .0720 (5) | .7335 ± .0690 (3) | .7042 ± .0853 (6) | .7342 ± .0791 (2) |
MPG | .9855 ± .0108 (1) | .9843 ± .0138 (2) | .9797 ± .0121 (4) | .9771 ± .0142 (5) | .9551 ± .0372 (6) | .9841 ± .0106 (3) |
ESL | .9766 ± .0150 (2) | .9722 ± .0167 (4) | .9746 ± .0141 (3) | .9782 ± .0126 (1) | .9507 ± .0508 (6) | .9713 ± .0176 (5) |
MMG | .9135 ± .0233 (1) | .9048 ± .0237 (3) | .9011 ± .0199 (4) | .8991 ± .0255 (5) | .8889 ± .0363 (6) | .9063 ± .0215 (2) |
ERA | .7670 ± .0290 (4) | .7630 ± .0281 (5) | .7731 ± .0293 (3) | .7759 ± .0315 (1) | .7228 ± .0475 (6) | .7735 ± .0296 (2) |
LEV | .9122 ± .0202 (1) | .8928 ± .0234 (5) | .9048 ± .0201 (2) | .9031 ± .0172 (3) | .8078 ± .0661 (6) | .8996 ± .0222 (4) |
CEV | .9959 ± .0027 (3) | .9352 ± .0095 (6) | .9942 ± .0018 (4) | .9970 ± .0013 (2) | .9936 ± .0046 (5) | .9993 ± .0017 (1) |
avg. rank | 2 | 4.44 | 3.56 | 2.56 | 5.44 | 3 |
Win statistics (number of data sets on which the first method was better than the second one) for 20 %, 50 %, and 80 % training data for 0/1 loss case
Needless to say, statistical significance is difficult to achieve due to the limited number of data sets. In terms of pairwise comparison, for example, a standard sign test will not report a significant difference (at the 10 % significance level) unless one of the methods wins on at least 7 of the 9 data sets. For the 0/1 loss, this is indeed accomplished by CR in all cases except two (comparison with KLR-ply and MORE on 50 % training data); see Table 4. For AUC, CR is superior, too, but significance is reached less often; see Table 5.
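The threshold of 7 wins can be checked with a few lines of code. The sketch below assumes a one-sided sign test in which ties are ignored, so that the number of wins under the null hypothesis follows a Binomial(9, 1/2) distribution; this reading is an assumption, and the function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, n):
    """One-sided p-value P(X >= wins) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


n_data_sets = 9
for wins in range(5, n_data_sets + 1):
    # Print the tail probability for each possible number of wins.
    print(wins, round(sign_test_p_value(wins, n_data_sets), 4))
```

Under these assumptions, 7 wins out of 9 give a p-value of 46/512 (below 0.10), whereas 6 wins give 130/512 (above 0.10).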
We also applied the two-step procedure recommended by Demšar (2006), consisting of a Friedman test and (provided this one rejects the null hypothesis of overall equal performance of all methods) the subsequent use of a Nemenyi test in order to compare methods in a pairwise manner; both tests are based on average ranks. For both 0/1 loss and AUC, the Friedman test finds significant differences among the six classifiers (at the 10 % significance level) for all three proportions of training data. The critical distance of ranks in the Nemenyi test is 2.28 for both measures. In Tables 2 and 3, the average ranks for which this difference is exceeded are highlighted in bold font.
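The critical distance of 2.28 can be reproduced from the usual Nemenyi formula CD = q_α · sqrt(k(k+1)/(6N)) with k = 6 methods and N = 9 data sets. In the sketch below, the critical value q = 2.589 for α = 0.10 and six methods is quoted from the table in Demšar (2006) and should be treated as an assumption; the average ranks are those reported above for 0/1 loss with 20 % training data, and the function name `nemenyi_cd` is hypothetical.

```python
from math import sqrt

# Assumed critical value for alpha = 0.10 and k = 6 methods (Demsar 2006).
Q_ALPHA_010_K6 = 2.589


def nemenyi_cd(q_alpha, k, n):
    """Critical distance between average ranks for k methods on n data sets."""
    return q_alpha * sqrt(k * (k + 1) / (6.0 * n))


cd = nemenyi_cd(Q_ALPHA_010_K6, k=6, n=9)

# Average ranks for 0/1 loss with 20 % training data, copied from the table.
avg_rank = {"CR": 1.67, "LR": 3.78, "KLR-ply": 4.0,
            "KLR-rbf": 3.33, "MORE": 4.33, "LMT": 3.89}

# Methods whose average rank exceeds that of CR by more than the distance.
worse_than_cr = [m for m, r in avg_rank.items() if r - avg_rank["CR"] > cd]
print(round(cd, 2), worse_than_cr)
```

With these inputs, the computed distance rounds to 2.28, and KLR-ply and MORE are the two methods whose rank difference to CR exceeds it in this setting.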
7.3.2 Variants of choquistic regression
Performance in terms of the average 0/1 loss and AUC ± standard deviation for CR using the full fuzzy measure compared with using a k-additive measure with cross-validated k. From top to bottom: 20 %, 50 %, and 80 % training data
data set | 0/1 loss full | 0/1 loss k-additive | AUC full | AUC k-additive |
---|---|---|---|---|
DBS | .2329 ± .0518 | .1713 ± .0424 | .8981 ± .0135 | .9290 ± .0322 |
CPU | .1341 ± .0802 | .0811 ± .0103 | .9505 ± .0377 | .9822 ± .0121 |
BCC | .3342 ± .0252 | .2775 ± .0335 | .6112 ± .0678 | .6400 ± .0641 |
MPG | .0709 ± .0193 | .0709 ± .0193 | .9788 ± .0182 | .9788 ± .0182 |
ESL | .0730 ± .0168 | .0682 ± .0129 | .9667 ± .0085 | .9670 ± .0074 |
MMG | .1776 ± .0101 | .1725 ± .0120 | .8899 ± .0145 | .8867 ± .0123 |
ERA | .2981 ± .0158 | .2889 ± .0273 | .7579 ± .0103 | .7669 ± .0334 |
LEV | .1526 ± .0146 | .1499 ± .0122 | .8984 ± .0103 | .8971 ± .0098 |
CEV | .0448 ± .0089 | .0488 ± .0089 | .9825 ± .0080 | .9825 ± .0080 |
DBS | .2261 ± .0685 | .1572 ± .0416 | .8995 ± .0486 | .9341 ± .0228 |
CPU | .0702 ± .0912 | .0464 ± .0281 | .9834 ± .0256 | .9920 ± .0073 |
BCC | .3122 ± .0324 | .2687 ± .0282 | .6596 ± .0309 | .6912 ± .0469 |
MPG | .0577 ± .0251 | .0577 ± .0251 | .9818 ± .0075 | .9818 ± .0075 |
ESL | .0711 ± .0133 | .0601 ± .0126 | .9695 ± .0102 | .9720 ± .0084 |
MMG | .1671 ± .0139 | .1667 ± .0144 | .8940 ± .0110 | .9003 ± .0132 |
ERA | .2930 ± .0162 | .2844 ± .0306 | .7641 ± .0146 | .7705 ± .0310 |
LEV | .1421 ± .0142 | .1372 ± .0125 | .9088 ± .0132 | .9098 ± .0103 |
CEV | .0376 ± .0059 | .0376 ± .0059 | .9912 ± .0024 | .9912 ± .0024 |
DBS | .2192 ± .0466 | .1416 ± .0681 | .9052 ± .0210 | .9427 ± .0443 |
CPU | .0241 ± .0413 | .0212 ± .0301 | .9866 ± .0187 | .9971 ± .0063 |
BCC | .2853 ± .0592 | .2496 ± .0485 | .6945 ± .0455 | .7349 ± .0692 |
MPG | .0551 ± .0160 | .0551 ± .0160 | .9855 ± .0108 | .9855 ± .0108 |
ESL | .0658 ± .0221 | .0542 ± .0218 | .9755 ± .0160 | .9766 ± .0150 |
MMG | .1628 ± .0187 | .1584 ± .0251 | .8966 ± .0162 | .9135 ± .0233 |
ERA | .2899 ± .0191 | .2813 ± .0280 | .7687 ± .0261 | .7670 ± .0290 |
LEV | .1370 ± .0162 | .1314 ± .0176 | .9140 ± .0124 | .9122 ± .0202 |
CEV | .0273 ± .0089 | .0273 ± .0089 | .9959 ± .0027 | .9959 ± .0027 |
Performance in terms of the average 0/1 loss and AUC ± standard deviation for CR using complexity reduction (ϵ=δ=0.1). From top to bottom: 20 %, 50 %, and 80 % training data
data set | k | 0/1 loss | AUC |
---|---|---|---|
DBS | 4 | .2286 ± .0549 | .9235 ± .0489 |
CPU | 4 | .0998 ± .0347 | .9664 ± .0227 |
BCC | 3 | .2888 ± .0578 | .6193 ± .0406 |
MPG | 4 | .0719 ± .0108 | .9787 ± .0067 |
ESL | 3 | .0737 ± .0103 | .9663 ± .0049 |
MMG | 3 | .1761 ± .0107 | .8857 ± .0174 |
ERA | 4 | .2981 ± .0158 | .7579 ± .0103 |
LEV | 4 | .1526 ± .0146 | .8984 ± .0103 |
CEV | 6 | .0448 ± .0089 | .9825 ± .0080 |
DBS | 4 | .1944 ± .0631 | .9338 ± .0368 |
CPU | 4 | .0361 ± .0432 | .9902 ± .0139 |
BCC | 3 | .2838 ± .0448 | .6232 ± .0374 |
MPG | 4 | .0570 ± .0080 | .9812 ± .0044 |
ESL | 3 | .0727 ± .0148 | .9740 ± .0077 |
MMG | 3 | .1667 ± .0130 | .8976 ± .0087 |
ERA | 4 | .2930 ± .0162 | .7641 ± .0146 |
LEV | 4 | .1421 ± .0142 | .9088 ± .0132 |
CEV | 6 | .0376 ± .0059 | .9912 ± .0024 |
DBS | 4 | .1939 ± .0615 | .9381 ± .0471 |
CPU | 4 | .0244 ± .0531 | .9962 ± .0090 |
BCC | 3 | .2755 ± .0404 | .7142 ± .0507 |
MPG | 4 | .0597 ± .0126 | .9832 ± .0057 |
ESL | 3 | .0603 ± .0236 | .9769 ± .0146 |
MMG | 3 | .1620 ± .0250 | .9001 ± .0202 |
ERA | 4 | .2899 ± .0191 | .7687 ± .0261 |
LEV | 4 | .1370 ± .0162 | .9140 ± .0124 |
CEV | 6 | .0273 ± .0089 | .9959 ± .0027 |
7.3.3 Model interpretation
Average values of the scaling parameter γ in the choquistic regression model
DBS | CPU | BCC | MPG | ESL | MMG | ERA | LEV | CEV |
---|---|---|---|---|---|---|---|---|
36.69 | 691.81 | 15.30 | 23.87 | 45.12 | 19.05 | 8.07 | 15.13 | 69.23 |
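To give a feeling for what such values of γ mean, the following sketch evaluates a choquistic model on a toy example. It assumes the model form P(y = 1 | x) = 1 / (1 + exp(−γ (C_μ(x) − β))), i.e., log odds equal to the scaled difference between the Choquet integral and a threshold β; the threshold name β, the two-criterion fuzzy measure, and the input values are illustrative assumptions and not taken from the experiments.

```python
from math import exp


def choquet(x, mu):
    """Discrete Choquet integral of x (dict: criterion -> value in [0, 1])
    with respect to a fuzzy measure mu (dict: frozenset -> value)."""
    order = sorted(x, key=x.get)      # criteria by increasing value
    total, prev = 0.0, 0.0
    for i, c in enumerate(order):
        upper = frozenset(order[i:])  # criteria whose value is >= x[c]
        total += (x[c] - prev) * mu[upper]
        prev = x[c]
    return total


def choquistic_prob(x, mu, gamma, beta):
    """Probability of the positive class under the assumed model form."""
    return 1.0 / (1.0 + exp(-gamma * (choquet(x, mu) - beta)))


# Toy monotone fuzzy measure on two criteria (assumption).
mu = {frozenset(): 0.0, frozenset({"a"}): 0.3,
      frozenset({"b"}): 0.5, frozenset({"a", "b"}): 1.0}
x = {"a": 0.8, "b": 0.4}

utility = choquet(x, mu)              # 0.4 * 1.0 + (0.8 - 0.4) * 0.3
p_small_gamma = choquistic_prob(x, mu, gamma=8.07, beta=0.5)
p_large_gamma = choquistic_prob(x, mu, gamma=691.81, beta=0.5)
```

With the same small margin between the Choquet integral and the threshold, the small value of γ yields a probability only slightly above 1/2, whereas the large value yields a probability very close to 1. Under the assumed model form, a large γ thus makes the model behave almost like a threshold classifier.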
As one of its key features, the Choquet integral offers interesting information about the importance of individual attributes as well as the interaction between them; this aspect was highlighted in Sect. 3.2. In fact, in many practical applications, this type of information is at least as important as the prediction accuracy of the model. A detailed analysis of this type of information is difficult and beyond the scope of this paper. Instead, we just give a few examples showing the plausibility of the results.
For the CPU data, the following Shapley values are obtained: machine cycle time in nanoseconds ≈0.07, minimum main memory in kilobytes ≈0.24, maximum main memory in kilobytes ≈0.30, cache memory in kilobytes ≈0.20, minimum channels in units ≈0.10, maximum channels in units ≈0.09. Thus, the most important properties are those concerning the memory (main and cache). The influence of the other properties (channels, cycle time) is not as strong, although they are not completely unimportant either.
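For illustration, the Shapley value of a criterion can be computed from a fuzzy measure as the weighted average of its marginal contributions over all subsets of the remaining criteria. The sketch below uses this standard definition; the three-criterion measure is a toy example (an assumption, not the measure learned on the CPU data), and the function name `shapley_values` is hypothetical. The last lines check that the values reported above for the CPU data add up to 1.

```python
from itertools import combinations
from math import factorial


def shapley_values(criteria, mu):
    """Shapley value of each criterion for a fuzzy measure mu given as a
    dict mapping frozensets of criteria to measure values."""
    m = len(criteria)
    phi = {}
    for c in criteria:
        others = [d for d in criteria if d != c]
        total = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                a = frozenset(subset)
                # Marginal contribution of c when added to subset a.
                total += weight * (mu[a | {c}] - mu[a])
        phi[c] = total
    return phi


# Toy monotone, normalized fuzzy measure on three criteria (assumption).
mu = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.2, frozenset({"x2"}): 0.3, frozenset({"x3"}): 0.1,
    frozenset({"x1", "x2"}): 0.7, frozenset({"x1", "x3"}): 0.4,
    frozenset({"x2", "x3"}): 0.5,
    frozenset({"x1", "x2", "x3"}): 1.0,
}
phi = shapley_values(["x1", "x2", "x3"], mu)

# Shapley values reported in the text for the six CPU attributes.
reported_cpu = [0.07, 0.24, 0.30, 0.20, 0.10, 0.09]
reported_sum = sum(reported_cpu)
```

For a normalized measure, the Shapley values are nonnegative and sum to 1, which is why they can be read as relative importance weights of the attributes.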
8 Concluding remarks
In this paper, we have advocated the use of the discrete Choquet integral as an aggregation operator in machine learning, especially in the context of learning monotone models. Apart from combining monotonicity and flexibility in a mathematically sound and elegant manner, the Choquet integral offers measures for quantifying the importance of individual predictor variables and the interaction between groups of variables, thereby providing important information about the relationship between independent and dependent variables.
We have analyzed several properties of the Choquet integral that appear to be interesting from a machine learning point of view, notably its capacity in terms of the VC-dimension. Moreover, we have addressed the issue of complexity reduction or, more specifically, the restriction of the Choquet integral to k-additive measures. In this regard, we have proposed a method for finding a suitable value of k.
As a concrete machine learning application of the Choquet integral, we have proposed a generalization of logistic regression, in which the Choquet integral is used for modeling the log odds of the positive class. First experimental studies have shown that this method, called choquistic regression, compares quite favorably with other methods. We would like to emphasize again, however, that an improvement in prediction accuracy should not be seen as the only goal of monotone learning. Instead, the adherence to monotonicity constraints is often an important prerequisite for the acceptance of a model by domain experts.
Compared to standard logistic regression, the benefits of choquistic regression come at the expense of an increased computational complexity of the underlying learning algorithm, which solves a maximum likelihood estimation problem. This is mainly caused by the large number of parameters of the fuzzy measure on which the Choquet integral is based, and the complicated dependency between these parameters. In Hüllermeier and Fallah Tehrani (2012a), first steps aiming at a reduction of this complexity are made. Nevertheless, speeding up choquistic regression and making it scalable toward data sets with many attributes is an important topic of ongoing and future work.
Needless to say, the Choquet integral can be combined with machine learning methods other than logistic regression. Moreover, its use is not restricted to (binary) classification. In fact, we are quite convinced of its high potential in machine learning in general, and we are looking forward to exploring this potential in greater detail.
For example, a workshop on “Learning Monotone Models from Data” was organized at ECMLPKDD 2009 in Bled, Slovenia.
Acknowledgements
This work was supported by the German Research Foundation (DFG). Ali Fallah Tehrani, Weiwei Cheng, and Eyke Hüllermeier were supported by the German Research Foundation. Krzysztof Dembczyński was supported by the German Research Foundation and the Polish Ministry of Science and Higher Education.