1 Introduction

A proper specification of the type of dependency between a set of predictor (input) variables \(X_1, \ldots, X_m\) and the target (output) variable \(Y\) is an important prerequisite for successful model induction. The specification of a corresponding hypothesis space imposes an inductive bias that, amongst other things, allows for the incorporation of background knowledge in the learning process. An important type of background knowledge is monotonicity: Everything else being equal, an increase (decrease) of a certain input variable \(X_i\) can only produce an increase (decrease) in the output variable \(Y\) (e.g., a real number in regression, a class in ordered classification, or the probability of the positive class in binary classification). Adherence to this kind of background knowledge may not only be beneficial for model induction, but is often even considered as a hard constraint. For example, no medical doctor will accept a model in which the probability of cancer is not monotonically increasing in tobacco consumption.

The simplest type of dependency is a linear relationship:

$$ Y = \sum_{i=1}^m \alpha_i X_i + \epsilon, $$
(1)

where \(\alpha_1, \ldots, \alpha_m\) are real coefficients and \(\epsilon\) is an error term. Monotonicity can be guaranteed quite easily for (1), since monotonicity in \(X_i\) is equivalent to the constraint \(\alpha_i \geq 0\). Another important advantage of (1) is its comprehensibility. In particular, the direction and strength of influence of each predictor \(X_i\) are directly reflected by the corresponding coefficient \(\alpha_i\).

Perhaps the sole disadvantage of a linear model is its inflexibility and, coming along with this, the supposed absence of any interaction between the variables: The effect of an increase of \(X_i\) is always the same, namely \(\partial Y / \partial X_i = \alpha_i\), regardless of the values of all other attributes. In many real applications, this assumption is not tenable. Instead, more complex, nonlinear models are needed to properly capture the dependencies between the inputs \(X_i\) and the output \(Y\).

An increased flexibility, however, typically comes at the price of a loss in terms of the two previous criteria: comprehensibility is hampered, and monotonicity is more difficult to assure. In fact, as soon as an interaction between attributes is allowed, the influence of an increase in X i may depend on all other variables, too. As a simple example, consider the extension of (1) by the addition of interaction terms, a model which is often used in statistics:

$$ Y = \sum_{i=1}^m \alpha_i X_i + \sum_{1 \leq i < j \leq m} \alpha_{ij} X_i X_j + \epsilon. $$
(2)

For this model, \(\partial Y / \partial X_i\) is given by \(\alpha_i + \sum_{j \neq i} \alpha_{ij} X_j\) and hence depends on the values of all other attributes, which means that, depending on the context as specified by these values, the monotonicity condition may change from one case to another. Consequently, it is difficult to find simple global constraints on the coefficients that assure monotonicity. For example, assuming that all attributes are non-negative, it is clear that \(\alpha_i \geq 0\) and \(\alpha_{ij} \geq 0\) for all \(1 \leq i < j \leq m\) will imply monotonicity. While being sufficient, however, these constraints are not necessary, and may therefore impose restrictions on the model space that are more far-reaching than desired; besides, negative interactions cannot be modeled in this way. Quite similar problems occur for commonly used nonlinear methods in machine learning, such as neural networks and kernel machines.

In this paper, we advocate the use of the (discrete) Choquet integral as a tool that is interesting in this regard. As will be argued in more detail later on, the Choquet integral combines the aforementioned properties in a quite convenient and mathematically elegant way: By its very nature as an integral, it is a monotone operator, while at the same time allowing for interactions between attributes. Moreover, it comes with natural measures for quantifying the importance of individual features and the interaction within groups of features, which provide important insights into the model and thereby support interpretability.

The rest of this paper, parts of which have already been presented in Fallah Tehrani et al. (2011), Hüllermeier and Fallah Tehrani (2012b), is organized as follows. In the next section, we give a brief overview of related work. In Sect. 3, we recall the basic definition of the Choquet integral and some related notions. In Sect. 4, we analyze the flexibility of binary classifiers based on the Choquet integral in terms of the notion of VC dimension. In Sect. 5, we propose a generalization of logistic regression for binary classification, in which the Choquet integral is used to model the log odds of the positive class. In Sect. 6, we elaborate on complexity issues and propose a method for finding a suitable level of (non-)additivity for the Choquet integral in a concrete learning task. Experimental results are presented in Sect. 7, prior to concluding the paper with a few remarks in Sect. 8.

2 Related work

As already mentioned, the problem of monotone classification has received increasing attention in the machine learning community in recent years, despite having been introduced in the literature much earlier (Ben-David et al. 1989). Meanwhile, several machine learning algorithms have been modified so as to guarantee monotonicity in attributes, including nearest neighbor classification (Duivesteijn and Feelders 2008), neural networks (Sill 1998), decision tree learning (Ben-David 1995; Potharst and Feelders 2002), rule induction (Dembczyński et al. 2009), as well as methods based on isotonic separation (Chandrasekaran et al. 2005) and piecewise linear models (Dembczyński et al. 2006).

Instead of modifying learning algorithms so as to guarantee monotone models, another idea is to modify the training data. To this end, data pre-processing methods such as re-labeling techniques have been developed. Such methods seek to repair inconsistencies in the training data, so that (standard) classifiers learned on that data will tend to be monotone (although, in general, they still do not guarantee this property) (Feelders 2010; Kotłowski et al. 2008).

Although the Choquet integral has been widely applied as an aggregation operator in multiple criteria decision making (Grabisch et al. 2000; Grabisch 1995a; Torra 2011), it has been used much less in the field of machine learning so far. There are, however, a few notable exceptions. First, the problem of extracting a Choquet integral (or, more precisely, the non-additive measure on which it is defined) in a data-driven way has been addressed in the literature (Beliakov 2008). Essentially, this is a parameter identification problem, which is commonly formalized as a constraint optimization problem, for example using the sum of squared errors as an objective function (Torra and Narukawa 2007; Grabisch 2003). To this end, Mori and Murofushi (1989) proposed an approach based on the use of quadratic forms, while an alternative heuristic, gradient-based method called HLMS (Heuristic Least Mean Squares) was introduced in Grabisch (1995b). In Angilella et al. (2009), Beliakov and James (2011), the Choquet integral is used in the context of ordinal classification. Besides, the Choquet integral has been used as an aggregation operator in the context of ensemble learning, i.e., for combining the predictions of different classifiers (Grabisch and Nicolas 1994).

3 The discrete Choquet integral

In this section, we give a brief introduction to the (discrete) Choquet integral, which, to the best of our knowledge, is not yet widely known in the field of machine learning. Since the Choquet integral can be seen as a generalization of the standard (Lebesgue) integral to the case of non-additive measures, we start with a reminder of this type of measure.

3.1 Non-additive measures

Let \(C = \{c_1, \ldots, c_m\}\) be a finite set and \(\mu: 2^C \rightarrow [0,1]\) a measure on this set. For each \(A \subseteq C\), we interpret \(\mu(A)\) as the weight or, say, the importance of the set of elements \(A\). As an illustration, one may think of \(C\) as a set of criteria (binary features) relevant for a job, like "speaking French" and "programming Java", and of \(\mu(A)\) as the evaluation of a candidate satisfying the criteria in \(A\) (and not satisfying those in \(C \setminus A\)). The term "criterion" is indeed often used in the decision making literature, where it suggests a monotone "the higher the better" influence. In the context of machine learning, to which we shall turn later on, criteria play the role of features (input attributes).

A standard assumption on a measure \(\mu(\cdot)\), which is, for example, at the core of probability theory, is additivity: \(\mu(A \cup B) = \mu(A) + \mu(B)\) for all \(A, B \subseteq C\) such that \(A \cap B = \emptyset\). Unfortunately, additive measures cannot model any kind of interaction between elements: Extending a set of elements \(A\) by a set of elements \(B\) always increases the weight \(\mu(A)\) by the weight \(\mu(B)\), regardless of \(A\) and \(B\).

Suppose, for example, that the elements of two sets \(A\) and \(B\) are complementary in a certain sense. For instance, \(A = \{\text{French}, \text{Spanish}\}\) and \(B = \{\text{Java}\}\) could be seen as complementary, since both language skills and programming skills are important for the job. Formally, this can be expressed in terms of a positive interaction: \(\mu(A \cup B) > \mu(A) + \mu(B)\). In the extreme case, when language skills and programming skills are indeed essential, \(\mu(A \cup B)\) can be high although \(\mu(A) = \mu(B) = 0\) (suggesting that a candidate lacking either language or programming skills is completely unacceptable). Likewise, elements can interact in a negative way: If two sets \(A\) and \(B\) are partly redundant or competitive, then \(\mu(A \cup B) < \mu(A) + \mu(B)\). For example, \(A = \{\text{C}, \text{C\#}\}\) and \(B = \{\text{Java}\}\) might be seen as redundant, since one programming language does in principle suffice.

The above considerations motivate the use of non-additive measures, also called capacities or fuzzy measures, which are simply normalized and monotone (Sugeno 1974):

$$ \mu(\emptyset) = 0, \qquad \mu(C) = 1, \qquad \mu(A) \leq \mu(B) \ \text{ for all } A \subseteq B \subseteq C . $$
(3)

A useful representation of non-additive measures, that we shall explore later on for learning Choquet integrals, is in terms of the Möbius transform:

$$ \mu(B) = \sum _{A\subseteq{B}} \boldsymbol{m}_{\mu}(A) $$
(4)

for all \(B \subseteq C\), where the Möbius transform \(\boldsymbol{m}_{\mu}\) of the measure \(\mu\) is defined as follows:

$$ \boldsymbol{m}_{\mu}(A) = \sum _{B\subseteq{A}}(-1)^{|A|-|B|}\mu{(B)} . $$
(5)

The value m μ (A) can be interpreted as the weight that is exclusively allocated to A, instead of being indirectly connected with A through the interaction with other subsets.

A measure \(\mu\) is said to be \(k\)-order additive, or simply \(k\)-additive, if \(k\) is the smallest integer such that \(\boldsymbol{m}_{\mu}(A) = 0\) for all \(A \subseteq C\) with \(|A| > k\). This property is interesting for several reasons. First, as can be seen from (4), it means that a measure \(\mu\) can formally be specified by significantly fewer than the \(2^m\) values needed in the general case. Second, \(k\)-additivity is also interesting from a semantic point of view: As will become clear in the following, this property simply means that there are no interaction effects between subsets \(A, B \subseteq C\) whose cardinality exceeds \(k\).
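As a concrete illustration of the transformation pair (4)-(5), the following short Python sketch computes the Möbius transform of a fuzzy measure given as a dictionary over subsets, reconstructs the measure from it, and determines its order of additivity. The numerical values are hypothetical and serve only as an example.

```python
from itertools import chain, combinations

def all_subsets(s):
    """All subsets of s (as frozensets), including the empty set."""
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

def moebius_transform(mu, C):
    """Eq. (5): m_mu(A) = sum_{B subseteq A} (-1)^{|A|-|B|} mu(B)."""
    return {A: sum((-1) ** (len(A) - len(B)) * mu[B] for B in all_subsets(A))
            for A in all_subsets(C)}

def measure_from_moebius(m, C):
    """Eq. (4): mu(B) = sum_{A subseteq B} m(A)."""
    return {B: sum(m[A] for A in all_subsets(B)) for B in all_subsets(C)}

def additivity_order(m, tol=1e-12):
    """Smallest k such that m(A) = 0 for all |A| > k, i.e., the order of additivity."""
    return max((len(A) for A, v in m.items() if abs(v) > tol), default=0)

# A hypothetical fuzzy measure on C = {c1, c2, c3} (normalized and monotone)
C = frozenset({"c1", "c2", "c3"})
mu = {A: 0.0 for A in all_subsets(C)}
mu.update({frozenset({"c1"}): 0.2, frozenset({"c2"}): 0.3, frozenset({"c3"}): 0.1,
           frozenset({"c1", "c2"}): 0.9, frozenset({"c1", "c3"}): 0.3,
           frozenset({"c2", "c3"}): 0.4, C: 1.0})

m = moebius_transform(mu, C)
mu_back = measure_from_moebius(m, C)
assert all(abs(mu_back[A] - mu[A]) < 1e-9 for A in all_subsets(C))
print(additivity_order(m))  # -> 2: this particular measure happens to be 2-additive
```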

3.2 Importance of criteria and interaction

An additive (i.e., k-additive with k=1) measure μ can be written as follows:

$$\mu(A) = \sum_{c_i \in A} w_i , $$

with w i =μ({c i }) the weight of c i . Due to (3), these weights are non-negative and such that \(\sum_{i=1}^{m} w_{i} = 1\). In this case, there is obviously no interaction between the criteria c i , i.e., the influence of a c i on the value of μ is independent of the presence or absence of any other c j . Besides, the weight w i is a natural quantification of the importance of c i .

Measuring the importance of a criterion \(c_i\) obviously becomes more involved when \(\mu\) is non-additive. Besides, one may then also be interested in a measure of interaction between the criteria, either pairwise or even of a higher order. In the literature, measures of this kind have been proposed, both for the importance of single criteria and for the interaction between several criteria.

Suppose we are given a fuzzy measure \(\mu\) on \(C\). In order to quantify the weight of a single criterion \(c_i\), it is natural to look at the increase in importance due to adding \(c_i\) to a subset \(A \subseteq C \setminus \{c_i\}\), which comes down to comparing \(\mu(A \cup \{c_i\})\) and \(\mu(A)\). While the difference between these two values is always equal to \(w_i\) in the additive case, it may depend on the subset \(A\) in the non-additive case. The Shapley value, also called the importance index of \(c_i\), therefore averages the difference \(\mu(A \cup \{c_i\}) - \mu(A)\) over all \(A \subseteq C \setminus \{c_i\}\):

$$ \varphi(c_i) = \sum_{A \subseteq C \setminus\{ c_i\}} \frac{1}{m \binom{m-1}{\lvert A \rvert}} \bigl( \mu\bigl(A\cup\{c_i\}\bigr)-\mu(A) \bigr) . $$
(6)

The Shapley value of μ is the vector φ(μ)=(φ(c 1),…,φ(c m )). One can show that 0≤φ(c i )≤1 and \(\sum_{i=1}^{m} \varphi(c_{i})=1\). Thus, φ(c i ) is a measure of the relative importance of c i . Obviously, φ(c i )=μ({c i }) if μ is additive.
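As an illustration, the following sketch evaluates Eq. (6) directly (the measure values are hypothetical); for two criteria it simply averages the marginal contribution over the two possible contexts.

```python
from itertools import combinations
from math import comb

def shapley_value(mu, criteria):
    """Importance index (Eq. (6)): averaged marginal contribution of each criterion."""
    m = len(criteria)
    phi = {}
    for c in criteria:
        rest = [d for d in criteria if d != c]
        total = 0.0
        for r in range(len(rest) + 1):
            for A in combinations(rest, r):
                A = frozenset(A)
                total += (mu[A | {c}] - mu[A]) / (m * comb(m - 1, len(A)))
        phi[c] = total
    return phi

# Hypothetical two-criteria measure with a positive interaction
mu = {frozenset(): 0.0, frozenset({"c1"}): 0.2,
      frozenset({"c2"}): 0.5, frozenset({"c1", "c2"}): 1.0}
print(shapley_value(mu, ["c1", "c2"]))  # -> c1 ≈ 0.35, c2 ≈ 0.65 (summing to 1)
```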

The interaction index between criteria c i and c j , as proposed by Murofushi and Soneda (1993), is defined as follows:

$$I_{i,j} = \sum_{A \subseteq C \setminus\{c_i,c_j\}} \frac{ ( \mu(A\cup\{c_i,c_j\})-\mu(A\cup\{c_i\}) -\mu(A\cup\{ c_j\} )+\mu(A) )}{(m-1) \binom{m-2}{\lvert A \rvert} } . $$

This index ranges between −1 and 1 and indicates a positive (negative) interaction between criteria \(c_i\) and \(c_j\) if \(I_{i,j} > 0\) (\(I_{i,j} < 0\)). The interaction index can also be expressed in terms of the Möbius transform:

$$ I_{i,j} = \sum_{T \subseteq C \setminus\{c_i,c_j\}} \frac{1}{|T|+1} \, \boldsymbol{m}_{\mu}\bigl(T \cup\{c_i,c_j\}\bigr) . $$

Furthermore, as proposed by Grabisch (1997), the definition of interaction can be extended to more than two criteria, i.e., to subsets \(T \subseteq C\):

$$ I(T) = \sum_{S \subseteq C \setminus T} \frac{1}{|S|+1} \, \boldsymbol{m}_{\mu}(S \cup T) . $$
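A direct implementation of the pairwise index is equally short; the sketch below (again with hypothetical measure values) recovers the positive interaction built into the toy measure used above.

```python
from itertools import combinations
from math import comb

def interaction_index(mu, criteria, ci, cj):
    """Pairwise interaction index of Murofushi and Soneda (1993)."""
    m = len(criteria)
    rest = [d for d in criteria if d not in (ci, cj)]
    total = 0.0
    for r in range(len(rest) + 1):
        for A in combinations(rest, r):
            A = frozenset(A)
            diff = mu[A | {ci, cj}] - mu[A | {ci}] - mu[A | {cj}] + mu[A]
            total += diff / ((m - 1) * comb(m - 2, len(A)))
    return total

mu = {frozenset(): 0.0, frozenset({"c1"}): 0.2,
      frozenset({"c2"}): 0.5, frozenset({"c1", "c2"}): 1.0}
print(interaction_index(mu, ["c1", "c2"], "c1", "c2"))  # ≈ 0.3 > 0: positive interaction
```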

3.3 The Choquet integral

So far, the criteria \(c_i\) were simply considered as binary features, which are either present or absent. Mathematically, \(\mu(A)\) can thus also be seen as an integral of the indicator function of \(A\), namely the function \(f_A\) given by \(f_A(c) = 1\) if \(c \in A\) and \(= 0\) otherwise. Now, suppose that \(f:\, C \rightarrow \mathbb{R}_+\) is any non-negative function that assigns a value to each criterion \(c_i\); for example, \(f(c_i)\) might be the degree to which a candidate satisfies criterion \(c_i\). An important question, then, is how to aggregate the evaluations of individual criteria, i.e., the values \(f(c_i)\), into an overall evaluation, in which the criteria are properly weighted according to the measure \(\mu\). Mathematically, this overall evaluation can be considered as an integral \(\mathcal{C}_{\mu}(f)\) of the function \(f\) with respect to the measure \(\mu\).

Indeed, if μ is an additive measure, the standard integral just corresponds to the weighted mean

$$ \mathcal{C}_\mu(f) = \sum_{i=1}^m w_i \cdot f(c_i) = \sum_{i=1}^m \mu\bigl(\{ c_i\}\bigr) \cdot f(c_i) , $$
(7)

which is a natural aggregation operator in this case. A non-trivial question, however, is how to generalize (7) in the case where μ is non-additive.

This question, namely how to define the integral of a function with respect to a non-additive measure (not necessarily restricted to the discrete case), is answered in a satisfactory way by the Choquet integral, which has first been proposed for additive measures by Vitali (1925) and later on for non-additive measures by Choquet (1954). The point of departure of the Choquet integral is an alternative representation of the “area” under the function f, which, in the additive case, is a natural interpretation of the integral. Roughly speaking, this representation decomposes the area in a “horizontal” instead of a “vertical” manner, thereby making it amenable to a straightforward extension to the non-additive case. More specifically, note that the weighted mean can be expressed as follows:

$$ \sum_{i=1}^m w_i \cdot f(c_i) = \sum_{i=1}^m \bigl( f(c_{(i)}) - f(c_{(i-1)}) \bigr) \cdot\mu(A_{(i)}) , $$

where \((\cdot)\) is a permutation of \(\{1,\ldots,m\}\) such that \(0 \leq f(c_{(1)}) \leq f(c_{(2)}) \leq\cdots\leq f(c_{(m)})\) (and \(f(c_{(0)}) = 0\) by definition), and \(A_{(i)} = \{c_{(i)}, \ldots, c_{(m)}\}\); see Fig. 1 for an illustration.

Fig. 1

Vertical (left) versus horizontal (right) integration. In the first case, the height of a single bar, \(f(c_i)\), is multiplied with its "width" (the weight \(\mu(\{c_i\})\)), and these products are added. In the second case, the height of each horizontal section, \(f(c_{(i)}) - f(c_{(i-1)})\), is multiplied with the corresponding "width" \(\mu(A_{(i)})\)

Now, the key difference between the left and right-hand side of the above expression is that, whereas the measure μ is only evaluated on single elements c i on the left, it is evaluated on subsets of elements on the right. Thus, the right-hand side suggests an immediate extension to the case of non-additive measures, namely the Choquet integral, which, in the discrete case, is formally defined as follows:

$$\mathcal{C}_\mu(f) = \sum _{i=1}^m \bigl( f(c_{(i)})-f(c_{(i-1)}) \bigr) \cdot\mu(A_{(i)}). $$

In terms of the Möbius transform of μ, the Choquet integral can also be expressed as follows:

$$ \mathcal{C}_\mu(f) = \sum_{T \subseteq C} \boldsymbol{m}_{\mu}(T) \times\min_{c_i \in T} f(c_i) = \sum_{i=1}^m f(c_{(i)}) \sum_{T \in \mathcal{T}_{(i)}} \boldsymbol{m}_{\mu}(T) , $$
(8)

where \(\mathcal{T}_{(i)} = \{ S \cup\{c_{(i)}\} \mid S \subset\{c_{(i+1)}, \ldots, c_{(m)}\} \}\).
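Both representations are easy to implement. The following sketch (with a hypothetical two-criteria measure) computes the integral once via the sorted "horizontal" form and once via the Möbius form of Eq. (8); the two computations agree, as they must.

```python
import numpy as np

def choquet(x, mu, criteria):
    """'Horizontal' form: sum_i (f(c_(i)) - f(c_(i-1))) * mu(A_(i))."""
    order = np.argsort(x)                                  # indices sorted by increasing value
    result, prev = 0.0, 0.0
    for pos in range(len(order)):
        A = frozenset(criteria[j] for j in order[pos:])    # A_(i) = {c_(i), ..., c_(m)}
        result += (x[order[pos]] - prev) * mu[A]
        prev = x[order[pos]]
    return result

def choquet_moebius(x, moebius, criteria):
    """Möbius form (Eq. (8)): sum over non-empty T of m(T) * min_{c_i in T} x_i."""
    idx = {c: k for k, c in enumerate(criteria)}
    return sum(v * min(x[idx[c]] for c in T) for T, v in moebius.items() if T)

criteria = ["c1", "c2"]
mu = {frozenset(): 0.0, frozenset({"c1"}): 0.2,
      frozenset({"c2"}): 0.5, frozenset({"c1", "c2"}): 1.0}
moebius = {frozenset({"c1"}): 0.2, frozenset({"c2"}): 0.5, frozenset({"c1", "c2"}): 0.3}
x = np.array([0.8, 0.4])
print(choquet(x, mu, criteria), choquet_moebius(x, moebius, criteria))  # both ≈ 0.48
```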

4 The VC dimension of the Choquet integral

Advocating the Choquet integral as a novel tool for machine learning immediately raises an interesting theoretical question, namely the question regarding the capacity of the corresponding model class. In fact, since the Choquet integral in its general form (not restricted to \(k\)-additive measures) has a rather large number of parameters, one may expect it to be quite flexible and, therefore, to have a high capacity. On the other hand, the parameters cannot be chosen freely. Instead, they are highly constrained due to the properties of the underlying fuzzy measure.

In any case, knowledge about the VC dimension of the Choquet integral (or, more specifically, of a binary classifier based on the Choquet integral as an underlying aggregation function) is not only of theoretical but also of practical relevance. In particular, it may help to find the right level of flexibility for the data at hand. As mentioned earlier, because of its highly nonlinear nature, one may expect the Choquet integral in its most general form to come with a danger of overfitting the data. On the other hand, a restriction to \(k\)-additive measures may provide a reasonable means for regularization. Both conjectures will be confirmed in this section.

In what follows, we are going to analyze the capacity of the Choquet integral in terms of the VC dimension (Vapnik 1998). To this end, we consider a setting in which the Choquet integral is used to classify instances represented in the form of m-dimensional vectors \(\boldsymbol{x} = (x_{1},x_{2}, \ldots, x_{m}) \in\mathbb{R}^{m}_{+}\), where x i =f(c i ) can be thought of as the evaluation of the criterion c i . More specifically, we consider the model class \(\mathcal{H}\) consisting of all threshold classifiers of the form

$$ \boldsymbol{x} = (x_1,x_2, \ldots, x_m) \mapsto\mathbb{I} \bigl( \mathcal{C}_{\mu}(\boldsymbol{x}) > \beta\bigr) , $$
(9)

where \(\mathbb{I}\) maps truth degrees {false,true} to {0,1} as usual, μ is a fuzzy measure, \(\mathcal{C}_{\mu }(\boldsymbol{x})\) is the Choquet integral of the (normalized) attribute values x 1,x 2,…,x m , and β∈[0,1] is a threshold value (as will be seen below, (9) corresponds to the “decision making” part of the choquistic regression model to be introduced in the next section; since this part is responsible for the classification decision, results on the VC dimension of \(\mathcal{H}\) directly apply to choquistic regression, too). Note that the class \(\mathcal{H}\) is parametrized by μ and β.

Theorem 1

For the model class \(\mathcal{H}\) as defined above, \(\mathit{VC}(\mathcal{H}) = \varOmega(2^{m}/\sqrt{m})\). That is, the VC dimension of \(\mathcal{H}\) grows asymptotically at least as fast as \(2^{m}/\sqrt{m}\).

Proof

In order to prove this claim, we construct a sufficiently large data set \(\mathcal {D}\) and show that, despite its size, it can be shattered by \(\mathcal{H}\). In this construction, we restrict ourselves to binary attribute values, which means that \(x_i \in\{0,1\}\) for all \(1 \leq i \leq m\). Consequently, each instance \(\boldsymbol{x} = (x_1, \ldots, x_m) \in\{0,1\}^m\) can be identified with a subset of indices \(S_{\boldsymbol{x}} \subseteq X = \{1,2,\ldots,m\}\), namely its indicator set \(S_{\boldsymbol{x}} = \{ i \mid x_i = 1 \}\).

In combinatorics, an antichain of \(X = \{1,2,\ldots,m\}\) is a family of subsets \(\mathcal {A} \subset 2^{X}\) such that, for all \(A, B \in \mathcal {A}\), neither \(A \subseteq B\) nor \(B \subseteq A\). An interesting question related to the notion of an antichain concerns its potential size, that is, the number of subsets in \(\mathcal {A}\). This number is obviously restricted due to the above non-inclusion constraint on pairs of subsets. An answer to this question is given by a well-known result of Sperner (1928), who showed that this number is

$$ \left ( \begin{array}{@{}c@{}} m \\ \lfloor m/2 \rfloor \end{array} \right ) . $$
(10)

Moreover, Sperner has shown that the corresponding antichain \(\mathcal {A}\) is given by the family of all \(q\)-subsets of \(X\) with \(q = \lfloor m/2 \rfloor\), that is, all subsets \(A \subseteq X\) such that \(|A| = q\).

Now, we define the data set \(\mathcal {D}\) in terms of the collection of all instances x=(x 1,…,x m )∈{0,1}m whose indicator set S x is a q-subset of X. Recall that, from a decision making perspective, each attribute can be interpreted as a criterion. Thus, each instance in our data set satisfies exactly q of the m criteria, and there is not a single “dominance” relation in the sense that the set of criteria satisfied by one instance is a superset of those satisfied by another instance. Intuitively, the instances in \(\mathcal {D}\) are therefore maximally incomparable. This is precisely the property we are now going to exploit in order to show that \(\mathcal {D}\) can be shattered by \(\mathcal {H}\).

Recall that a set of instances \(\mathcal {D}\) can be shattered by a model class \(\mathcal {H}\) if, for each subset \(\mathcal {P} \subseteq \mathcal {D}\), there is a model \(H \in \mathcal {H}\) such that H(x)=1 for all \(\boldsymbol{x} \in \mathcal {P}\) and H(x)=0 for all \(\boldsymbol{x} \in \mathcal {D} \setminus \mathcal {P}\). Now, take any such subset \(\mathcal {P}\) from our data set \(\mathcal {D}\) as constructed above, and recall that the Choquet integral in (9) can be written as

$$ \mathcal{C}_\mu(\boldsymbol{x})= \sum _{T \subseteq C} \boldsymbol{m}(T) \times f_T(\boldsymbol{x}) , $$
(11)

where \(f_T(\boldsymbol{x}) = 1\) if \(T \subseteq S_{\boldsymbol{x}}\) and \(f_T(\boldsymbol{x}) = 0\) otherwise. We define the values \(\boldsymbol{m}(T)\), \(T \subseteq C\), of the Möbius transform as follows:

$$\boldsymbol{m}(T) = \left \{ \begin{array}{@{}l@{\quad}l@{}} |\mathcal {P}|^{-1} & \text{if } T = S_{\boldsymbol{x}} \text{ for some } \boldsymbol{x} \in \mathcal {P}, \\[3pt] 0 & \text{otherwise.} \end{array} \right . $$

Obviously, this definition of the Möbius transform is feasible and yields a proper fuzzy measure \(\mu\): The sum of masses is equal to 1, and since all masses are non-negative, monotonicity is guaranteed right away. Moreover, from the construction of \(\boldsymbol{m}\) and the fact that, for each pair \(\boldsymbol{x} \neq\boldsymbol{x}' \in \mathcal {D}\), neither \(S_{\boldsymbol{x}} \subseteq S_{\boldsymbol{x}'}\) nor \(S_{\boldsymbol{x}'} \subseteq S_{\boldsymbol{x}}\), the Choquet integral is obviously given as follows:

$$\mathcal{C}_{\mu}(\boldsymbol{x}) = \left \{ \begin{array}{@{}l@{\quad}l@{}} |\mathcal {P}|^{-1} & \text{if } \boldsymbol{x} \in \mathcal {P}, \\ 0 & \text{otherwise.} \end{array} \right . $$

Thus with \(\beta= 1/(2|\mathcal {P}|)\), the classifier (9) behaves exactly as required, that is, it classifies all \(\boldsymbol{x} \in \mathcal {P}\) as positive and all \(\boldsymbol{x} \not\in \mathcal {P}\) as negative.

Noting that the special case where \(\mathcal {P} = \emptyset\) is handled correctly by the Möbius transform \(\boldsymbol{m}\) such that \(\boldsymbol{m}(C) = 1\) and \(\boldsymbol{m}(T) = 0\) for all \(T \subsetneq C\) (and any threshold \(\beta> 0\)), we can conclude that the data set \(\mathcal {D}\) can be shattered by \(\mathcal {H}\). Consequently, the VC dimension of \(\mathcal {H}\) is at least the size of \(\mathcal {D}\), whence (10) is a lower bound of \(\mathit{VC}(\mathcal {H})\).

For the asymptotic analysis, we make use of Stirling's approximation of large factorials (and hence of binomial coefficients). For the sequence \((b_1, b_2, \ldots)\) of the so-called central binomial coefficients \(b_n\), it is known that

$$ b_n = \left ( \begin{array}{@{}c@{}} 2n \\ n \end{array} \right ) = \frac{(2n)!}{(n!)^2} \geq\frac{1}{2} \frac{4^n}{\sqrt{\pi\cdot n}} . $$
(12)

Thus, the fact that \(\mathit{VC}(\mathcal {H})\) grows asymptotically at least as fast as \(2^{m}/\sqrt{m}\) immediately follows by setting n=m/2 and ignoring constant terms. □
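The construction used in this proof can also be checked numerically for small \(m\). The following sketch (an illustration only, for \(m = 4\)) enumerates the \(\lfloor m/2 \rfloor\)-subsets, draws a random labeling, builds the corresponding Möbius masses, and verifies that the threshold classifier reproduces the labeling exactly.

```python
from itertools import combinations
import random

def verify_shattering(m=4):
    q = m // 2
    data = [frozenset(S) for S in combinations(range(m), q)]   # indicator sets of all instances
    P = set(random.sample(data, random.randint(1, len(data)))) # an arbitrary non-empty labeling
    moebius = {S: 1.0 / len(P) for S in P}                     # masses as in the proof

    def choquet(S_x):
        # Möbius representation (Eq. (11)): f_T(x) = 1 iff T is a subset of S_x
        return sum(v for T, v in moebius.items() if T <= S_x)

    beta = 1.0 / (2 * len(P))
    return all((choquet(S_x) > beta) == (S_x in P) for S_x in data)

print(verify_shattering(4))  # -> True for every random labeling
```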

Remark 1

Recall the expression (8) of the Choquet integral in terms of its Möbius transform. This expression shows that the Choquet integral corresponds to a linear function, albeit a constrained one, in the feature space spanned by the set of features \(\{ f_T \mid \emptyset\neq T \subseteq\{1,2,\ldots,m\} \}\) (already used in (11)), where each feature is a min-term

$$ f_T = f_T(\boldsymbol{x}) = f_T(x_1, \ldots, x_m) = \min_{i \in{T}} x_i . $$
(13)

The dimensionality of this feature space is \(2^m - 1\). Thus, it follows immediately that \(\mathit{VC}(\mathcal {H}) \leq 2^{m}\) (the class of linear hyperplanes in \(\mathbb{R}^n\) has VC dimension \(n+1\)). Together with the lower bound \(2^{m}/\sqrt{m}\), which is not much smaller (despite the restriction to binary attribute vectors), we thus have a relatively tight approximation of \(\mathit{VC}(\mathcal {H})\).

Remark 2

Interestingly, the proof of Theorem 1 does not exploit the full non-additivity of the Choquet integral. In fact, the measure we constructed there is \(\lfloor m/2 \rfloor\)-additive, since \(\boldsymbol{m}(T) = 0\) for all \(T \subseteq C\) with \(|T| > \lfloor m/2 \rfloor\). Consequently, the estimation of the VC dimension still applies to the restricted case of \(k\)-additive measures, provided \(k \geq\lfloor m/2 \rfloor\). For smaller \(k\), it is not difficult to adapt the proof so as to show that

$$ \mathit{VC}(\mathcal {H}) \geq \left ( \begin{array}{@{}c@{}} m \\ k \end{array} \right ) . $$
(14)

5 Choquistic regression

Consider the standard setting of binary classification, where the goal is to predict the value of an output (response) variable \(y \in \mathcal {Y}=\{0,1\}\) for a given instance

$$\boldsymbol{x} = (x_1, \ldots, x_m) \in\mathcal{X} = \mathcal{X}_1 \times\mathcal{X}_2 \times\cdots\times \mathcal{X}_m $$

represented in terms of a feature vector. More specifically, the goal is to learn a classifier \(\mathcal{L}:\, \mathcal{X} \rightarrow \mathcal{Y}\) from a given set of (i.i.d.) training data

$$\mathcal{D} = \bigl\{ \bigl(\boldsymbol{x}^{(i)} , y^{(i)}\bigr) \bigr\}_{i=1}^n \subset(\mathcal{X} \times \mathcal{Y})^n $$

so as to minimize the risk

$$R(\mathcal{L}) = \int_{\mathcal{X} \times\mathcal{Y}} \ell\bigl (\mathcal{L}(\boldsymbol{x}), y\bigr) \, d \mathbf {P}_{XY}(\boldsymbol{x},y) , $$

where \(\ell(\cdot)\) is a loss function (e.g., the simple 0/1 loss given by \(\ell(\hat{y}, y) = 0\) if \(\hat{y} = y\) and \(=1\) if \(\hat{y} \neq y\)).

Logistic regression is a well-established statistical method for (probabilistic) classification (Hosmer and Lemeshow 2000). Its popularity is due to a number of appealing properties, including monotonicity and comprehensibility: Since the model is essentially linear in the input attributes, the strength of influence of each predictor is directly reflected by the corresponding regression coefficient. Moreover, the influence of each attribute is monotone in the sense that an increase of the value of the attribute can either only increase or only decrease the probability of the positive class (depending on whether the associated regression coefficient is positive or negative).

Formally, the probability of the positive class (and hence of the negative class) is modeled as a generalized linear function of the input attributes, namely in terms of the logarithm of the probability ratio:

$$ \log\biggl( \frac{\mathbf {P}(y=1\,\vert\,\boldsymbol {x})}{\mathbf {P}(y=0\,\vert\,\boldsymbol{x})} \biggr) = w_0 + \boldsymbol{w}^\top\boldsymbol{x} , $$
(15)

where w=(w 1,w 2,…,w m )∈ℝm is a vector of regression coefficients and w 0∈ℝ a constant bias (the intercept). A positive regression coefficient w i >0 means that an increase of the predictor variable x i will increase the probability of a positive response, while a negative coefficient implies a decrease of this probability. Besides, the larger the absolute value |w i | of the regression coefficient, the stronger the influence of x i .

Since P(y = 0∣x) = 1−P(y = 1∣x), a simple calculation yields the posterior probability

$$ \pi_l \stackrel {\text {df}}{=}\mathbf {P}(y=1\,\vert\,\boldsymbol {x}) = \bigl( 1+\exp\bigl(- w_0 - \boldsymbol{w}^\top\boldsymbol{x}\bigr) \bigr)^{-1} . $$
(16)

The logistic function z↦(1+exp(−z))−1, which has a sigmoidal shape, is a specific type of link function.

Needless to say, the linearity of the above model is a strong restriction from a learning point of view, and the possibility of interactions between predictor variables has of course also been noticed in the statistical literature (Jaccard 2001). A standard way to handle such interaction effects is to add interaction terms to the linear function of predictor variables, like in (2). As explained earlier, however, the aforementioned advantages of logistic regression will then be lost.

In the following, we therefore propose an extension of logistic regression that allows for modeling nonlinear relationships between input and output variables while preserving the advantages of comprehensibility and monotonicity. As mentioned earlier, the monotonicity constraint is important if the direction of influence of an input attribute is known beforehand and needs to be reflected by the model, an assumption that we shall make in the following. As an aside, we note that one may also envision the case where an attribute is known to have a monotone influence, although the direction of influence is unknown. The learning problem then becomes slightly more difficult, since the learner has to figure out whether the influence is positive (increasing) or negative (decreasing). We shall not consider this problem any further, however, and instead assume the direction of influence to be given as prior knowledge.

5.1 The Choquistic model

In order to model nonlinear dependencies between predictor variables and the response, and to take interactions between predictors into account, we propose to extend the logistic regression model by replacing the linear function \(\boldsymbol{x} \mapsto w_0 + \boldsymbol{w}^\top\boldsymbol{x}\) in (15) with the Choquet integral. More specifically, we propose the following model

$$ \pi_c \stackrel {\text {df}}{=}\mathbf {P}(y=1\,\vert\,\boldsymbol{x}) = \bigl( 1+\exp\bigl(-\gamma\, \bigl(\mathcal{C}_{\mu }(f_{\boldsymbol{x}})- \beta\bigr)\bigr) \bigr)^{-1} , $$
(17)

where \(\mathcal{C}_{\mu}(f_{\boldsymbol{x}})\) is the Choquet integral (with respect to the measure μ) of the function

$$ f_{\boldsymbol{x}}:\, \{c_1,\ldots, c_m\} \rightarrow[0,1] $$
(18)

that maps each attribute c i to a normalized value x i =f x (c i )∈[0,1]; β,γ∈ℝ are constants.

Recalling the idea of “evaluating” an instance x in terms of a set of criteria, the model (17) can be seen as a two-step procedure: The first step consists of an assessment of x in terms of a (latent) utility degree

$$u = U(\boldsymbol{x}) = \mathcal{C}_{\mu}(f_{\boldsymbol{x}}) \in [0,1] . $$

Then, in a second step, a discrete choice (yes/no decision) is made on the basis of this utility. Roughly speaking, this is done through a “probabilistic thresholding” at the utility threshold β. If U(x)>β, then the decision tends to be positive, whereas if U(x)<β, it tends to be negative. The precision of this decision is determined by the parameter γ (see Fig. 2): For large γ, the decision function converges toward the step function \(u \mapsto\mathbb{I}(u > \beta)\), jumping from 0 to 1 at β. For small γ, this function is smooth, and there is a certain probability to violate the threshold rule \(u \mapsto\mathbb{I}(u > \beta)\). This might be due to the fact that, despite being important for decision making, some properties of the instances to be classified are not captured by the utility function. In that case, the utility U(x), estimated on the basis of the given attributes, is not a perfect predictor for the decision eventually made. Thus, the parameter γ can also be seen as an indicator of the quality of the classification model.
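Given a Möbius representation of \(\mu\), the two-step model (17) is a thin layer on top of the Choquet integral. The following sketch (with hypothetical parameter values and the toy Möbius masses from Sect. 3) turns a normalized instance into \(\mathbf{P}(y=1\,\vert\,\boldsymbol{x})\).

```python
import numpy as np

def choquistic_probability(x, moebius, criteria, gamma, beta):
    """P(y = 1 | x) according to Eq. (17); x is assumed to be normalized to [0, 1]."""
    idx = {c: k for k, c in enumerate(criteria)}
    utility = sum(v * min(x[idx[c]] for c in T) for T, v in moebius.items() if T)
    return 1.0 / (1.0 + np.exp(-gamma * (utility - beta)))

# Hypothetical two-criteria Möbius masses; the utility of x below is 0.48
moebius = {frozenset({"c1"}): 0.2, frozenset({"c2"}): 0.5, frozenset({"c1", "c2"}): 0.3}
x = np.array([0.8, 0.4])
print(choquistic_probability(x, moebius, ["c1", "c2"], gamma=10.0, beta=0.7))  # ≈ 0.1: below threshold
```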

Fig. 2

Probability of a positive decision, P(y=1 | x), as a function of the estimated degree of utility, u=U(x), for a threshold β=0.7 and different values of γ

5.2 Normalization

The normalization (18) is meant to turn each predictor variable into a criterion, i.e., a “the higher the better” attribute, and to assure commensurability between the criteria (Modave and Grabisch 1998). A simple transformation is given by the mapping

$$ z_i = \frac{x_i - m_i}{M_i - m_i} , $$
(19)

where \(m_i\) and \(M_i\) are lower and upper bounds for \(x_i\), which are either known or estimated from the data; if the influence of \(x_i\) is actually negative (i.e., \(w_i < 0\)), then the mapping \(z_i = (M_i - x_i)/(M_i - m_i)\) is used instead.

The transformation (19) is problematic in the presence of outliers, in which case the distribution of its image can become extremely skewed. As an alternative, which is less sensitive in this regard and, moreover, produces a more uniform distribution of normalized values, we therefore propose the mapping

$$ z_i = F(x_i) , $$
(20)

where \(F\) is the cumulative distribution function \(x \mapsto \mathbf{P}(X_i \leq x)\). Of course, since this function is in general not known, it has to be replaced by an estimate \(\hat{F}\); to this end, we simply adopt the empirical distribution of the training data (i.e., \(\hat {F}(x)\) is the relative frequency of instances \(\boldsymbol{x} = (x_1, \ldots, x_m)\) in the training data for which \(x_i \leq x\)).
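Both normalizations are straightforward to implement. The sketch below (illustrative only, assuming the direction of influence of each attribute is given) applies the min-max transformation (19), with the flip for negatively influencing attributes, and the empirical-distribution transformation (20).

```python
import numpy as np

def minmax_normalize(X, increasing):
    """Eq. (19); attributes with a negative direction of influence are flipped."""
    m, M = X.min(axis=0), X.max(axis=0)
    Z = (X - m) / (M - m)
    return np.where(increasing, Z, 1.0 - Z)

def ecdf_normalize(X):
    """Eq. (20) with the empirical distribution: fraction of training values <= x."""
    n = X.shape[0]
    Z = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        col = np.sort(X[:, j])
        Z[:, j] = np.searchsorted(col, X[:, j], side="right") / n
    return Z

X = np.array([[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
print(minmax_normalize(X, increasing=np.array([True, False])))
print(ecdf_normalize(X))
```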

5.3 Logistic regression as a special case

In order to verify that our model (17) is a proper generalization of standard logistic regression, recall that the Choquet integral reduces to a weighted mean (7) in the special case of an additive measure \(\mu\). Moreover, recall the transformation (19) and consider any linear function \(\boldsymbol{x} \mapsto g(\boldsymbol{x}) = w_0 + \boldsymbol{w}^\top\boldsymbol{x}\) with \(\boldsymbol{w} = (w_1, \ldots, w_m)\). This function can also be written in the form

$$ g(\boldsymbol{x}) = w_0 + \sum_{i=1}^m w_i x_i = w_0 + \sum_{i=1}^m w_i p_i + \sum_{i=1}^m u_i z_i = \gamma\Biggl( \sum_{i=1}^m u_i' z_i - \beta\Biggr) , $$

where \(p_i = m_i\) if \(w_i \geq0\) and \(p_i = M_i\) if \(w_i < 0\), \(u_i = |w_i| (M_i - m_i)\), \(\gamma= \sum_{i=1}^{m} u_{i}\), \(u_{i}' = u_{i}/\gamma\), \(w_{0}' = w_{0} + \sum_{i=1}^{m} w_{i} p_{i}\), and \(\beta= - w_{0}'/\gamma\). By definition, the \(u_{i}'\) are non-negative and sum up to 1, which means that \(\sum_{i=1}^{m} u_{i}' z_{i}\) is a weighted mean of the \(z_{i}\) that can be represented by a Choquet integral.

5.4 Parameter estimation

The model (17) has several degrees of freedom: The fuzzy measure μ (Möbius transform m=m μ ) determines the (latent) utility function, while the utility threshold β and the scaling parameter γ determine the discrete choice model. The goal of learning is to identify these degrees of freedom on the basis of the training data \(\mathcal{D}\). Like in the case of standard logistic regression, it is possible to harness the maximum likelihood (ML) principle for this purpose. The log-likelihood of the parameters can be written as

$$ l(\boldsymbol{m}, \gamma, \beta) = \log\prod_{i=1}^n \bigl(\pi_c^{(i)}\bigr)^{y^{(i)}} \bigl(1-\pi_c^{(i)}\bigr)^{1-y^{(i)}} = \sum_{i=1}^n y^{(i)} \log\pi_c^{(i)} + \bigl(1-y^{(i)}\bigr) \log\bigl(1-\pi_c^{(i)}\bigr) , $$
(21)

where \(\pi_c^{(i)} = \mathbf{P}(y=1\,\vert\,\boldsymbol{x}^{(i)})\) is given by (17).

One easily verifies that (21) is convex with respect to m,γ, and β. In principle, maximization of the log-likelihood can be accomplished by means of standard gradient-based optimization methods. However, since we have to assure that μ is a proper fuzzy measure and, hence, that m guarantees the corresponding monotonicity and boundary conditions, we actually need to solve a constrained optimization problem:

$$ \begin{array}{ll} \displaystyle\max_{\boldsymbol{m}, \gamma, \beta} & \displaystyle\sum_{i=1}^n y^{(i)} \log\pi_c^{(i)} + \bigl(1-y^{(i)}\bigr) \log\bigl(1-\pi_c^{(i)}\bigr) \; - \; \eta \Vert\boldsymbol{m}\Vert_1 \\[12pt] \text{s.t.} & \gamma> 0, \quad 0 \leq\beta\leq1, \quad \displaystyle\sum_{T \subseteq C} \boldsymbol{m}(T) = 1, \\[12pt] & \displaystyle\sum_{B \subseteq A} \boldsymbol{m}\bigl(B \cup\{c_i\}\bigr) \geq0 \quad\text{for all } c_i \in C \text{ and } A \subseteq C \setminus\{c_i\}. \end{array} $$
(22)

The last part of the objective function (22) is a standard L 1-regularizer on the Möbius transform, which is added as a means to prevent over-fitting; moreover, since many weights are typically set to 0 under L 1-regularization, it also serves as a feature selection mechanism (Lee et al. 2006).

A solution to the above problem can be produced by standard solvers. Concretely, we used the fmincon function implemented in the optimization toolbox of Matlab. This method is based on a sequential quadratic programming approach.
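For readers who prefer Python over Matlab, essentially the same constrained maximum-likelihood problem can be set up with scipy's SLSQP solver. The following is only a rough sketch for a small number of attributes; the function name and the direct handling of the non-smooth \(L_1\) term are our own simplifications, not the implementation used in our experiments.

```python
import numpy as np
from itertools import chain, combinations
from scipy.optimize import minimize

def fit_choquistic(X, y, eta=0.01):
    """ML sketch of choquistic regression: parameters are the Möbius masses of all
    non-empty subsets of criteria, followed by gamma and beta. X in [0,1], y in {0,1}."""
    n, m = X.shape
    subs = [frozenset(s) for s in chain.from_iterable(
        combinations(range(m), r) for r in range(1, m + 1))]
    S = len(subs)
    mins = np.stack([X[:, sorted(T)].min(axis=1) for T in subs], axis=1)  # f_T(x) = min_{i in T} x_i

    def neg_log_lik(theta):
        mass, gamma, beta = theta[:S], theta[S], theta[S + 1]
        z = gamma * (mins @ mass - beta)              # gamma * (Choquet utility - beta)
        ll = np.sum(y * z - np.logaddexp(0.0, z))     # log-likelihood (21)
        return -ll + eta * np.sum(np.abs(mass))       # plus L1 penalty on the Möbius masses

    # normalization mu(C) = 1 and monotonicity mu(A + {i}) >= mu(A) in Möbius form
    cons = [{"type": "eq", "fun": lambda t: np.sum(t[:S]) - 1.0}]
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for r in range(len(others) + 1):
            for A in combinations(others, r):
                A = set(A)
                idx = [k for k, T in enumerate(subs) if i in T and (T - {i}) <= A]
                cons.append({"type": "ineq", "fun": lambda t, idx=idx: np.sum(t[idx])})

    theta0 = np.r_[np.ones(S) / S, 1.0, 0.5]                    # uniform masses, gamma=1, beta=0.5
    bounds = [(None, None)] * S + [(1e-6, None), (0.0, 1.0)]
    return minimize(neg_log_lik, theta0, method="SLSQP", bounds=bounds, constraints=cons)
```

Note that the number of monotonicity constraints grows like \(m \cdot 2^{m-1}\), so this direct formulation is only practical for a moderate number of attributes.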

Recall that, once the model has been identified, the importance of each attribute and the degree of interaction between groups of attributes can be derived from the Möbius transform m; these are given, respectively, by the Shapley value and the interaction indexes as introduced in Sect. 3.2.

6 Complexity reduction

Obviously, choquistic regression can be interpreted as fitting a (constrained) linear function in the feature space spanned by the set of features \(f_T\) defined by (13), with one feature for each subset of criteria \(T \subseteq\{1,2,\ldots,m\}\). Since the dimensionality of this feature space is \(2^m - 1\), the method is clearly critical from a complexity point of view. It was already mentioned that an \(L_1\)-regularization in (22) may shrink some coefficients to 0 and, therefore, some of the features \(f_T\) may disappear. Although this may help to simplify the choquistic model, that is, the result produced by the learning algorithm, it does not simplify the optimization problem itself.

Thus, one may wonder whether some of the features (13) could not even be eliminated prior to solving the actual optimization problem. Specifically interesting in this regard is a possible restriction of the choquistic model to \(k\)-additive measures, for a suitable value of \(k < m\). Since this means that significantly fewer parameters (namely \(2^k - 1\)) need to be identified, the computational complexity might be reduced drastically. Besides, a restriction to \(k\)-additive measures may also have advantages from a learning point of view, as it reduces the capacity of the underlying model class (cf. Sect. 4) and may thus prevent over-fitting the data in cases where the full flexibility of the Choquet integral is actually not needed. Of course, the key problem to be addressed concerns the question of how to choose \(k\) in the most favorable way.

6.1 Exploiting equivalence of features for dimensionality reduction

In the following, we shall elaborate on the following question: Is it possible to find an upper bound on the required level of complexity of the model, namely the level of additivity k, prior to fitting the Choquet integral to the data? Or, more specifically, can we determine the value k in such a way that fitting a k-additive measure is definitely enough, in the sense that each labeling of the training data produced by the full Choquet integral (k=m) can also be produced by a Choquet integral based on a k-additive measure?

In this regard, it is noticeable that, for a given instance \(\boldsymbol{x} = (x_1, \ldots, x_m)\), many of the min-terms (13) will assume the same value (in fact, there are \(2^m - 1\) such terms but only \(m\) possible values). Consequently, in the expression

$$ \mathcal{C}_\mu(\boldsymbol{x})= \sum_{T \subseteq C} \boldsymbol{m}(T) \times f_T (\boldsymbol{x}) $$
(23)

of the Choquet integral, many coefficients \(\boldsymbol{m}(T)\) can be grouped and, in principle, be replaced by a single one. The groups thus defined solely depend on the order of the values \(x_1, \ldots, x_m\) of the original attributes. The number of terms in (23) will thus reduce from \(2^m - 1\) to at most \(m\). However, since the order may change from instance to instance, different groupings may be obtained for different instances.

Now, imagine that a subset of features \(\mathcal{F} = \{ f_{T_{1}}, \ldots, f_{T_{r}} \}\) assumes the same value, not only for a single instance, but for all instances in the training data. Then, this set can be said to form an equivalence class. Thus, one of the features could in principle be selected as a representative, absorbing all the weights of the others; more specifically, the weight of this feature would be set to m(T 1)+m(T 2)+⋯+m(T r ), while the weights of the other features in \(\mathcal{F}\) would be set to 0.

Note, however, that this “transfer of Möbius mass” will in general not be feasible, as it may cause a violation of the monotonicity constraint on the fuzzy measure μ. As a side remark, we also note that, from a learning point of view, the equivalence of features may obviously cause problems with regard to the identifiability of coefficients; due to the monotonicity constraints just mentioned, however, this is not necessarily the case.

More generally, for two features \(f_A\) and \(f_B\) (\(A, B \subseteq C\)), denote by \(v(A,B) \in[0,1]\) the fraction of training examples on which they assume the same value. We say that \(f_A\) covers \(f_B\) (and, vice versa, \(f_B\) covers \(f_A\)) if \(v(A,B) = 1\). Moreover, for a feature \(f_A\), we denote by \(C(f_A) \subseteq 2^C\) the set of features it covers. A straightforward way to find a sufficiently large \(k\) then consists of finding the smallest \(k\) such that

$$ \bigcup_{T \subseteq C, \, |T| \leq k} C(f_T) =2^C . $$
(24)

From the above construction, it follows that working with the corresponding k-additive measure, for k thus defined, is theoretically sound and guarantees that there is no loss in terms of expressivity of the model on the training data. We summarize this finding in terms of the following proposition.

Proposition 1

Consider a set of training instances \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(n)}\) and let \(k\) be the smallest value in \(\{1, \ldots, m\}\) satisfying (24). Moreover, let \(\mu\) be any measure on the set of criteria \(\{c_1, \ldots, c_m\}\), and \(\mathcal{C}_{\mu}\) the Choquet integral with respect to this measure. Then, there exists a \(k\)-additive measure \(\mu^*\) such that

$$ \mathcal{C}_{\mu^*}\bigl(\boldsymbol{x}^{(i)}\bigr) = \mathcal{C}_{\mu}\bigl(\boldsymbol{x}^{(i)}\bigr) $$
(25)

for all i∈{1,…,n}.

We like to emphasize that k is only an upper bound on the complexity needed to fit the training data. Thus, it is not necessarily the optimal k from the point of view of model induction (which might be figured out by the regularizer in (22)). In particular, note that the computation of k does not refer to the output values y (i). Instead, it should be considered as a measure of the complexity of the training instances. As such, it is obviously connected to the notion of VC dimension.

Since the exact reproducibility (25) may appear overly stringent or, stated differently, since a small loss may actually be acceptable, we finally propose a relaxation somewhat in line with the idea of probably approximately correct (PAC) learning (Valiant 1984). First, noting that the Choquet integral may change by at most \(\epsilon\) when combining features \(f_A\) and \(f_B\) such that \(|f_A - f_B| < \epsilon\), one may think of relaxing the definition of equivalence as follows: \(f_A\) and \(f_B\) are \(\epsilon\)-equivalent (on a given training instance \(\boldsymbol{x}\)) if \(|f_A(\boldsymbol{x}) - f_B(\boldsymbol{x})| < \epsilon\). Second, we relax the condition of coverage. Denoting by \(v(A,B) \in[0,1]\) the fraction of training examples on which \(f_A\) and \(f_B\) are \(\epsilon\)-equivalent, we say that \(f_A\) \(\epsilon\)-\(\delta\)-covers \(f_B\) if \(v(A,B) \geq 1-\delta\). Roughly speaking, for a small \(\epsilon\) and \(\delta\) close to 0, this means that, with only a few exceptions, the values of \(f_A\) and \(f_B\) are almost the same on the training data (we used \(\epsilon= \delta= 0.1\) in our experiments below). In order to find a proper upper bound \(k^*\), the principle (24) can be used as before, just replacing coverage with \(\epsilon\)-\(\delta\)-coverage.
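A literal implementation of this procedure only needs the values of the min-terms on the training data. The sketch below (the function name is ours, and the random data is purely illustrative) returns the smallest \(k\) such that the features of order at most \(k\) \(\epsilon\)-\(\delta\)-cover all features, i.e., the upper bound \(k^*\) used in the experiments.

```python
import numpy as np
from itertools import chain, combinations

def upper_bound_k(X, eps=0.1, delta=0.1):
    """Smallest k such that every min-term f_B is eps-delta-covered by some f_A with |A| <= k."""
    n, m = X.shape
    all_T = [frozenset(s) for s in chain.from_iterable(
        combinations(range(m), r) for r in range(1, m + 1))]
    vals = {T: X[:, sorted(T)].min(axis=1) for T in all_T}     # f_T(x) = min_{i in T} x_i

    for k in range(1, m + 1):
        small = [A for A in all_T if len(A) <= k]
        def covered(B):
            # v(A, B): fraction of instances on which f_A and f_B are eps-equivalent
            return any(np.mean(np.abs(vals[A] - vals[B]) < eps) >= 1 - delta for A in small)
        if all(covered(B) for B in all_T):                     # condition (24), relaxed
            return k
    return m

X = np.random.rand(100, 4)                                     # hypothetical normalized training data
print(upper_bound_k(X, eps=0.1, delta=0.1))
```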

7 Experiments

In this section, we present the results of an experimental study that was conducted in order to validate the practical performance of our choquistic regression (CR) method. The goal of this study is twofold. First, we would like to show that CR is competitive in terms of predictive accuracy. To this end, we compare it with several alternative methods on a number of (monotone) benchmark data sets. Second, we would like to corroborate our claim that the CR model provides interesting information about attribute importance and interaction. To this end, we discuss some examples showing that the corresponding Shapley and interaction values produced by CR are indeed meaningful and plausible.

7.1 Data sets

Although the topic is receiving increasing interest in the machine learning community, benchmark data for monotone classification is by far not as abundant as for conventional classification. In total, we managed to collect 9 data sets from different sources, notably the UCI repository and the WEKA machine learning framework (Hall et al. 2009), for which monotonicity in the input variables is a reasonable assumption; see Table 1 for a summary. All the data sets can be downloaded from our website. Many of them have also been used in previous studies on monotone learning. Some of them have a numerical or ordered categorical output, however, which was hence binarized. Moreover, all input attributes were normalized using (20).

Table 1 Data sets and their properties

Den Bosch (DBS)

This data set contains 8 attributes describing houses in the city of Den Bosch: area, number of bedrooms, type of house, volume, storeys, type of garden, garage, and price. The output is a binary variable indicating whether the price of the house is low or high (depending on whether or not it exceeds a threshold).

CPU

This is a standard benchmark data set from the UCI repository. It contains eight input attributes, two of which were removed since they are obviously of no predictive value (vendor name, model name). The problem is to predict the (estimated) relative performance of a CPU (binarized by thresholding at the median) based on its machine cycle time in nanoseconds, minimum main memory in kilobytes, maximum main memory in kilobytes, cache memory in kilobytes, minimum channels in units, maximum channels in units.

Breast Cancer (BCC)

This data set was obtained from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia. There are 7 attributes, namely menopause, tumor-size, inv-nodes, node-caps, deg-malig, breast, and irradiat. The output is a binary variable with values no-recurrence-events and recurrence-events.

Auto MPG

This data set was used in the 1983 American Statistical Association Exposition. The problem is to predict the city-cycle fuel consumption in miles per gallon (binarized by thresholding at the median) based on the following attributes of a car: cylinders, displacement, horsepower, weight, acceleration, model year, origin. We removed incomplete instances.

Employee Selection (ESL)

This data set contains profiles of applicants for certain industrial jobs. The values of the four input attributes were determined by expert psychologists based upon psychometric test results and interviews with the candidates. The output is an overall score on an ordinal scale between 1 and 9, corresponding to the degree of suitability of each candidate to this type of job. We binarized the output value by distinguishing between suitable (score 6–9) and unsuitable (score 1–5) candidates.

Mammographic (MMG)

This data set is about breast cancer screening by mammography. The goal is to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes (mass shape, mass margin, density) and the patient’s age.

Employee Rejection/Acceptance (ERA)

This data set originates from an academic decision-making experiment. The input attributes are features of a candidate such as past experience, verbal skills, etc., and the output is the subjective judgment of a decision-maker, measured on an ordinal scale from 1 to 9, to which degree he or she tends to accept the applicant for the job. We binarized the output value by distinguishing between acceptance (score 5–9) and rejection (score 1–4).

Lecturers Evaluation (LEV)

This data set contains examples of anonymous lecturer evaluations, taken at the end of MBA courses. Students were asked to score their lecturers according to four attributes such as oral skills and contribution to their professional/general knowledge. The output was a total evaluation of each lecturer’s performance, measured on an ordinal scale from 0 to 4. We binarized the output value by distinguishing between good (score 3–4) and bad evaluation (score 0–2).

Car Evaluation (CEV)

This data set contains 6 attributes describing a car, namely, buying price, price of the maintenance, number of doors, capacity in terms of persons to carry, the size of luggage boot, estimated safety of the car. The output is the overall evaluation of the car: unacceptable, acceptable, good, very good. We binarized this evaluation into unacceptable versus not unacceptable (acceptable, good or very good).

7.2 Methods

Since choquistic regression (CR) can be seen as an extension of standard logistic regression (LR), it is natural to compare these two methods. Essentially, this comparison should give an idea of the usefulness of an increased flexibility. On the other hand, one may also ask about the usefulness of assuring monotonicity. Therefore, we additionally included two other extensions of LR, which are flexible but not necessarily monotone, namely kernel logistic regression (KLR) with polynomial and Gaussian kernels. The degree of the polynomial kernel was set to 2, so that it models low-level interactions of the features. The Gaussian kernel, on the other hand, is able to capture interactions of higher order. For each data set, the width parameter of the Gaussian kernel was selected from \(\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}\) in the most favorable way. Likewise, the regularization parameter \(\eta\) in choquistic regression was selected from \(\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}\).

Finally, we also included two methods that are both monotone and flexible, namely the MORE algorithm for learning rule ensembles under monotonicity constraints (Dembczyński et al. 2009) and the LMT algorithm for logistic model tree induction (Landwehr et al. 2003). Following the idea of forward stagewise additive modeling (Tibshirani et al. 2001), the MORE algorithm treats a single rule as a subsidiary base classifier in the ensemble. The rules are added to the ensemble one by one. Each rule is fitted by concentrating on the examples that are most difficult to classify correctly by rules that have already been generated. The LMT algorithm builds tree-structured models that contain logistic regression functions at the leaves. It is based on a stagewise fitting process to construct the logistic regression models that can select relevant attributes from the data. This process is used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree structure.

7.3 Results

7.3.1 Performance in terms of predictive accuracy

As performance measures, we determined the standard misclassification rate (0/1 loss) as well as the AUC. Estimates of both measures were obtained by randomly splitting the data into two parts, one part for training and one part for testing. This procedure was repeated 100 times, and the results were averaged. In order to analyze the influence of the amount of training data, we varied the proportion between training and test data from 20:80 over 50:50 to 80:20. In these experiments, we used a variant of CR in which the underlying fuzzy measure is restricted to be k-additive, with k determined by means of an internal cross-validation. Compared with other variants (cf. Sect. 7.3.2), this one performed best in terms of accuracy.

A possible improvement of CR over its competitors, in terms of predictive accuracy, may be due to two reasons: First, in comparison to standard LR, it is more flexible and has the ability to capture nonlinear dependencies between input attributes. Second, in comparison to non-monotone learners, it takes background knowledge about the dependency between input and output variables into consideration.

An overview of the results of the experiments is given in Tables 2 and 3. Moreover, a summary in terms of pairwise win statistics is provided in Tables 4 and 5. As can be seen, CR compares quite favorably with the other approaches, especially with the non-monotone KLR methods, both in terms of 0/1 loss and AUC. It also outperforms LR, at least for sufficiently extensive training data; if the amount of training data is small, however, LR is even better, probably because CR will then tend to overfit the data. This is indeed a general trend that can be observed both for performance in terms of average ranks and the number of wins in pairwise comparison with another method: The more training data is available, the better CR becomes, arguably because its flexibility is then becoming more and more advantageous.

Table 2 Classification performance in terms of the mean and standard deviation of 0/1 loss. From top to bottom: 20 %, 50 %, and 80 % training data. (Average ranks comparing significantly worse with CR at the 90 % confidence level are put in bold font)
Table 3 Performance in terms of the average AUC ± standard deviation. From top to bottom: 20 %, 50 %, and 80 % training data. (Average ranks comparing significantly worse with CR at the 90 % confidence level are put in bold font)
Table 4 Win statistics (number of data sets on which the first method was better than the second one) for 20 %, 50 %, and 80 % training data for 0/1 loss case
Table 5 Win statistics (number of data sets on which the first method was better than the second one) for 20 %, 50 %, and 80 % training data for AUC case

Needless to say, statistical significance is difficult to achieve due to the limited number of data sets. In terms of pairwise comparison, for example, a standard sign test will not report a significant difference (at the 10 % significance level) unless one of the methods wins at least 7 of the 9 data sets. For the 0/1 loss, this is indeed accomplished by CR in all cases except two (comparison with KLR-poly and MORE on 50 % training data); see Table 4. For AUC, CR is superior, too, but significance is reached less often; see Table 5.

We also applied the two-step procedure recommended by Demsar (2006), consisting of a Friedman test and (provided this one rejects the null-hypothesis of overall equal performance of all methods) the subsequent use of a Nemenyi test in order to compare methods in a pairwise manner; both tests are based on average ranks. For both 0/1 loss and AUC, the Friedman test finds significant differences among the six classifiers (at the 10 % significance level) when all three different proportions of data are used for training. The critical distance of ranks in the Nemenyi test is 2.28 for both measures. In Tables 2 and 3, the average ranks for which this difference is exceeded are highlighted in bold font.

7.3.2 Variants of choquistic regression

In the above experiments, we used CR with a fuzzy measure of optimal order, namely a k-additive measure with k determined through internal cross-validation. In addition, we also learned with standard CR, i.e., CR using the full fuzzy measure with k=m (the number of attributes). As can be seen in Table 6, adapting k does obviously pay off and leads to improved performance most of the time. For the “full” CR, which is the most flexible variant, there is obviously a risk to overfit the training data and hence generalize worse.

Table 6 Performance in terms of the average 0/1 loss and AUC ± standard deviation for CR using the full fuzzy measure compared with using a k-additive measure with cross-validated k. From top to bottom: 20 %, 50 %, and 80 % training data

Moreover, we also combined CR with the complexity reduction method proposed in Sect. 6. In addition to the average performance, the results in Table 7 also show the typical values of k as determined by this method (namely the most frequently chosen one). As can be seen, the method is indeed effective in the sense that the order of the fuzzy measure is often significantly reduced without compromising performance. On the other hand, in terms of performance, this method is still not competitive with using an optimal (cross-validated) k. This is not surprising, since the k determined by our complexity reduction method is only an upper bound (and learned in an unsupervised instead of a supervised manner).

Table 7 Performance in terms of the average 0/1 loss and AUC ± standard deviation for CR using complexity reduction (ϵ=δ=0.1). From top to bottom: 20 %, 50 %, and 80 % training data

7.3.3 Model interpretation

As mentioned earlier, one may expect a close connection between the scaling parameter γ in the choquistic model and the prediction accuracy of the model. More specifically, the better the model performs on a particular data set, the higher γ is expected to be. It is worth mentioning that our experimental results are in perfect agreement with this expectation. Indeed, comparing the ranking of the nine data sets in terms of accuracy and in terms of the average values of γ (shown in Table 8), we obtain a (Kendall tau) correlation of more than 0.8 throughout.

Table 8 Average values of the scaling parameter γ in the choquistic regression model

As one of its key features, the Choquet integral offers interesting information about the importance of individual attributes as well as the interaction between them; this aspect was highlighted in Sect. 3.2. In fact, in many practical applications, this type of information is at least as important as the prediction accuracy of the model. A detailed analysis of this type of information is difficult and beyond the scope of this paper. Instead, we just give a few examples showing the plausibility of the results.

Regarding the Shapley index that measures the importance of individual attributes, the (average) values on the Auto MPG data are as follows: cylinders ≈0.13, displacement ≈0, horsepower ≈0.25, weight ≈0.46, acceleration ≈0.03, model year ≈0.13, origin ≈0. In terms of attribute importance, this conveys the following picture:

$$\text{weight} > \text{horsepower} > \text{cylinders | model year} > \text{acceleration} > \text{displacement | origin} $$

Recalling the meaning of the data set, these weights should reflect the influence on the fuel consumption, and seen from this point of view, they appear to be fully plausible.

For the CPU data, the following Shapley values are obtained: machine cycle time in nanoseconds ≈0.07, minimum main memory in kilobytes ≈0.24, maximum main memory in kilobytes ≈0.30, cache memory in kilobytes ≈0.20, minimum channels in units ≈0.10, maximum channels in units ≈0.09. Thus, the most important properties are those concerning the memory (main and cache). The influence of the other properties (channels, cycle time) is not as strong, although they are not completely unimportant either.

Apart from the importance of individual attributes, it is interesting to look at the interaction between different attributes. As an example, Fig. 3 provides a visualization of the pairwise interaction between attributes for the car evaluation data, for which CR performs significantly better than LR. Recall that, in this data set, the evaluation of a car (output attribute) depends on a number of criteria, namely (a) buying price, (b) price of the maintenance, (c) number of doors, (d) capacity in terms of persons to carry, (e) size of luggage boot, (f) safety of the car. These criteria form a natural hierarchy, according to which the data was produced (Bohanec and Rajkovic 1990): (a) and (b) form a subgroup price, whereas the other properties are of a technical nature and can be further decomposed into comfort (c–e) and safety (f). Interestingly, the interaction in our model nicely agrees with this hierarchy or, stated differently, allows for recovering this hierarchy from the pairwise interactions between attributes: Interaction within each subgroup tends to be smaller (as can be seen from the darker colors) than interaction between criteria from different subgroups, suggesting a kind of redundancy in the former and complementarity in the latter case.

Fig. 3

Visualization of the interaction index for the car evaluation data (numerical values are shown in terms of level of gray, values on the diagonal are set to 0). Groups of related criteria are indicated by the black lines

8 Concluding remarks

In this paper, we have advocated the use of the discrete Choquet integral as an aggregation operator in machine learning, especially in the context of learning monotone models. Apart from combining monotonicity and flexibility in a mathematically sound and elegant manner, the Choquet integral offers measures for quantifying the importance of individual predictor variables and the interaction between groups of variables, thereby providing important information about the relationship between independent and dependent variables.

We have analyzed several properties of the Choquet integral that appear to be interesting from a machine learning point of view, notably its capacity in terms of the VC-dimension. Moreover, we have addressed the issue of complexity reduction or, more specifically, the restriction of the Choquet integral to k-additive measures. In this regard, we have proposed a method for finding a suitable value of k.

As a concrete machine learning application of the Choquet integral, we have proposed a generalization of logistic regression, in which the Choquet integral is used for modeling the log odds of the positive class. First experimental studies have shown that this method, called choquistic regression, compares quite favorably with other methods. We would like to mention again, however, that an improvement in prediction accuracy should not be seen as the only goal of monotone learning. Instead, the adherence to monotonicity constraints is often an important prerequisite for the acceptance of a model by domain experts.

Compared to standard logistic regression, the benefits of choquistic regression are coming at the expense of an increased computational complexity of the underlying learning algorithm, which solves a maximum likelihood estimation problem. This is mainly caused by the large number of parameters of the fuzzy measure on which the Choquet integral is based, and the complicated dependency between these parameters. In Hüllermeier and Fallah Tehrani (2012a), first steps aiming at a reduction of this complexity are made. Nevertheless, speeding up choquistic regression and making it scalable toward data sets with many attributes is an important topic of ongoing and future work.

Needless to say, the Choquet integral can be combined with machine learning methods other than logistic regression. Moreover, its use is not restricted to (binary) classification. In fact, we are quite convinced of its high potential in machine learning in general, and we are looking forward to exploring this potential in greater detail.