1 Introduction

At the heart of challenges like machine intelligence, robotics, Big Data, geospatial intelligence, and humanitarian demining, to name a few, is the dilemma of data and information fusion, specifically aggregation. A key element of fusion is the underlying mathematics to convert multiple, potentially heterogeneous inputs into fewer outputs (typically one), with the general hope that the aggregation either summarizes the sources well or enhances a system’s performance by exploiting interactions across the different sources. A well-known scenario is where “the whole can be worth more than the sum of its parts”. Aggregation methods with roots in fuzzy measure theory have been shown to be powerful (Xu and Gou 2017; Das et al. 2017; Xu and Wang 2016; Tang et al. 2017). Herein, we consider the fuzzy integral (FI), specifically the Choquet integral (CI) with respect to a fuzzy measure (Sugeno 1974), for data and information aggregation.

The CI is an aggregation operator generator as it is parameterized by the fuzzy measure (FM) (Sugeno 1974), aka a monotone and normal capacity. The FM is defined over the power set of N unique inputs and assigns a “worth” to each subset. It is in this way that the CI provides a granular approach to aggregation—each subset represents a unique granule of the universe (Yao 2005; Pawlak 1998), and the CI aggregates contributions over a certain set of these granules. This granulation leads to an extremely flexible aggregation operator capable of producing numerous useful aggregation operators, such as the minimum, maximum, median, mean, soft maximum (minimum), and other linear order statistics (Tahani and Keller 1990), as well as a wealth of other more unique and tailored aggregation operators. The point is, the CI is a powerful non-linear function that is often used for fusion.

It is important to note that the CI is not trivial in any respect. The capacity has \(2^N-2\) free parameters, for N inputs, that must be either specified or learned. This exponential number can (and often does) impact an application rather quickly. For example, 10 inputs already gives rise to 1022 values that must be specified or learned (and 5110 monotonicity constraints at that). The reader can refer to Sugeno (1974), Anderson et al. (2014a) and Grabisch et al. (1995, 2000) for reviews of analytical methods for specifying a capacity relative to just knowledge about the worth of the individual sources (called densities, \(g(\{x_i\})\) for \(i \in \{1,2,\ldots ,N\}\)). Examples of FMs include the Sugeno \(\lambda\)-FM (Sugeno 1974), S-decomposable FMs (Anderson et al. 2014a; Klir and Yuan 1995), belief (and plausibility) and possibility (and necessity) FMs (Klir and Yuan 1995), and k-additive approaches, which model interactions only up to tuples of size k (Grabisch 1997).
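To make the density-based specification concrete, the following sketch builds the Sugeno \(\lambda\)-FM from a set of densities. The function names, the bitmask indexing of subsets, and the root-bracketing strategy are our own illustrative choices (assuming \(N \ge 2\)), not taken from the cited works.

```python
import numpy as np
from scipy.optimize import brentq

def sugeno_lambda(densities, tol=1e-10):
    """Solve prod_i(1 + lam*g_i) = 1 + lam for the Sugeno lambda parameter (N >= 2)."""
    d = np.asarray(densities, dtype=float)
    if abs(d.sum() - 1.0) < tol:                 # additive case: lambda = 0
        return 0.0
    f = lambda lam: np.prod(1.0 + lam * d) - (1.0 + lam)
    if d.sum() > 1.0:                            # unique root in (-1, 0)
        return brentq(f, -1.0 + tol, -tol)
    hi = 1.0                                     # unique root in (0, inf): grow the bracket
    while f(hi) < 0.0:
        hi *= 2.0
    return brentq(f, tol, hi)

def sugeno_measure(densities):
    """Return the full Sugeno lambda-FM as {subset bitmask: g(A)}."""
    d = list(densities)
    lam = sugeno_lambda(d)
    g = {0: 0.0}
    for mask in range(1, 2 ** len(d)):
        i = mask.bit_length() - 1                # peel off the highest-index source in A
        rest = mask & ~(1 << i)
        g[mask] = g[rest] + d[i] + lam * g[rest] * d[i]
    return g
```

For example, sugeno_measure([0.3, 0.2, 0.1]) yields a \(\lambda > 0\) (superadditive behavior) because the densities sum to less than one, whereas densities summing to more than one yield \(-1< \lambda < 0\).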

To date, numerous methods have been proposed to learn the CI. While similar in underlying objective, these methodologies can and often do vary greatly with respect to factors like application domain, mathematics and how the CI is being used (e.g., signal and image processing, regression, decision-level fusion, etc.). In Grabisch et al. (1995, 2000), Grabisch proposed quadratic programming (QP) to acquire the full capacity based on the idea of minimizing the sum of squared error (SSE). Keller and Osborn used gradient descent (Keller and Osborn 1996) and then penalty and reward (Keller and Osborn 1995) to learn the densities in combination with the Sugeno \(\lambda\)-FM. In Beliakov (2009), Beliakov used linear programming and, in Anderson et al. (2010), a genetic algorithm was used to learn a higher-order (type-1) fuzzy set-valued capacity for the Sugeno integral (SI). Alternatively, Wagner and Anderson (2012) and Havens et al. (2013, 2015) proposed different ways to automatically acquire, and subsequently aggregate, full capacities of specificity and agreement based on the idea of crowd sourcing, when the worth of the individuals is not known but is instead extracted from the data. The reader can refer to Grabisch et al. (2008) for a detailed review of other existing work prior to 2008.

Fig. 1 Summary of the process of learning the fuzzy measure for aggregation via the Choquet integral. The novelty of this work lies in the regularization functions described in Sects. 4 and 5

An underrepresented and unsolved challenge is learning the CI with respect to more than one criterion—a process summarized in Fig. 1. Herein, we focus on minimization of the SSE criterion, but in conjunction with model complexity. Mendez-Vazquez and Gader (2007) are the first, to our knowledge, to study the inclusion of an information-theoretic index of capacity complexity in learning the CI. Specifically, Mendez-Vazquez and Gader explored the task of sparsity promotion in learning the CI and provided examples in decision-level fusion. Their work has two parts: a Gibbs sampler and the exploration of the \(\ell _p\)-norm of a lexicographically encoded capacity vector as the complexity term. The goal of their regularization term was to explicitly reduce the number of nonzero parameters in the capacity so as to eliminate non-informative or “useless” information sources. In Anderson et al. (2014b), the idea of learning the CI based on the use of a QP solver and a lexicographically encoded capacity vector was also explored. The novelty in that work is the study of different properties of the regularization term in an attempt to unearth what it promotes in different scenarios (with respect to both the capacity and the resultant aggregation operator induced by the CI). In the theme of information-theoretic measures of capacities, but not regularization with respect to such indices, Kojadinovic et al. (2005) and Yager (2000, 2002) both explored the concept of the entropy of an FM. Furthermore, Labreuche (2008) explored the identification of an FM with an \(\ell _1\) entropy; he proposed a linearized entropy of an FM, allowing for the identification of an FM using linear programming.

For the most part, it appears that the vast majority of these works are largely unaware of each other. Therefore, in Sect. 2.2 we bridge this gap by analyzing and comparing different properties of these indices. In general, there does not appear to be a clear “winner”. That is, different indices exist and are useful for various applications and goals (contexts). Therefore, it is important to understand each index and ultimately what context it supports.

In addition to our review of existing regularization indices, we put forth new indices based on the Shapley values, as discussed in Sects. 4 and 5. These new indices comprise this work’s novelty and are intended to promote “simpler models” that have either lower diversity in the Shapley values or fewer inputs (fewer non-zero Shapley terms). In fields such as statistics and machine learning, such a strategy can help address challenges like preventing overfitting (aka improving the generalizability of a learned model). It is also the case that we are often concerned with problems having too many inputs, as more inputs are typically associated with higher cost—e.g., greater financial cost of different sensors, more computational or memory resources, time, or even physical cost in a health setting where an input is something like the result of a bone marrow biopsy.

This article is structured as follows: Sect. 2 is a review of important concepts in FM and FI theory. Section 2.1 discusses the Shapley and interaction indices, and Sect. 2.2 reviews and compares existing indices for measuring different notions of complexity of a capacity. In Sect. 2.3, the \(\ell _0\)-norm, \(\ell _1\)-norm and Gini–Simpson index of the Shapley values are proposed. Section 3 describes optimization of the CI relative to the SSE criterion based on the QP. Section 4 proposes Gini–Simpson index-based regularization of the Shapley values, Sect. 5 discusses \(\ell _0\)-norm based regularization, and Sect. 6 contains experiments illustrating different scenarios encountered in practice. Table 1 lists the notation used in this article.

Table 1 Notation

2 Fuzzy measure and integral

The aggregation of data/information using the FI has a rich history. Much of the theory and several applications can be found in Anderson et al. (2014a), Grabisch et al. (1995, 2000), Tahani and Keller (1990), Yang et al. (2008), Cho and Kim (1995), Melin et al. (2011) and Wu et al. (2013). There are a number of (high-level) ways to describe the FI, e.g., motivated by Calculus, signal processing, pattern recognition, fuzzy set theory, etc. Herein, we set the stage by considering a finite set of N sources of information, \(X=\{x_1,\ldots ,x_N\}\), and a function h that maps X into some domain (initially [0, 1]) that represents the partial support of a hypothesis from the standpoint of each source of information. Depending on the problem domain, X can be a set of experts, sensors, features, pattern recognition algorithms, etc. The hypothesis is often thought of as an alternative in a decision process or a class label in pattern recognition. Both Choquet and Sugeno integrals take partial support for the hypothesis from the standpoint of each source of information and fuse it with the (perhaps subjective) worth (or reliability) of each subset of X in a non-linear fashion. This worth is encoded in an FM (aka capacity). Initially, the function h (integrand) and FM (\(g:2^X \rightarrow [0,1]\)) took real number values in [0, 1]. Certainly, the output range for the support function and capacity can be (and have been) defined more generally, e.g., \(\mathfrak {R}_0^+\), but it is convenient to think of them on [0, 1] for confidence fusion. We now review the capacity and FI.

Definition 1

(Fuzzy Measure Sugeno 1974) The FM is a set function, \(g:2^X \rightarrow \mathfrak {R}_0^+\), such that

P1. (Boundary condition) \(g(\phi ) = 0\) (often \(g(X)=1\));

P2. (Monotonicity) If \(A, B \subseteq X\) and \(A \subseteq B\), then \(g(A) \le g(B)\).

Note, if X is an infinite set, a third condition guaranteeing continuity is required, but this is a moot point for finite X. As already noted, the FM has \(2^N\) values; actually, \(2^{N}-2\) “free parameters” as \(g(\phi )=0\) and \(g(X)=1\). Before a definition can be given for the FI, notation must be established for the training data used to learn the capacity.

Definition 2

(Training Data) Let a training data set, T, be

$$T=\{(O_{j},\alpha _{j})|j=1,\ldots ,m\}$$

where \({\mathbf {O}} = \{O_{1},\ldots ,O_{j},\ldots ,O_{m}\}\) is a set of “objects” and \(\alpha _{j}\) are their corresponding labels (specifically, \(\mathfrak {R}\)-valued numbers). For example, \(O_{j}\) could be the strengths in some hypothesis from N different experts, signal inputs at time j, algorithm outputs for input j, kernel inputs or kernel classifier outputs for feature vector j, etc. Subsequently, \(\alpha _{j}\) could be the corresponding function output, class label, membership degree, etc. Next, we provide a definition for the FI, namely the CI with respect to T. To that end, let \(h_j\) be the jth integrand, i.e., \(h_j(x_i)\) is the input for the ith source with respect to object j.

Definition 3

The discrete CI, for finite X and object \(O_{j}\) is

$$\begin{aligned} \int {h_j \circ g} = C_{g}(h_{j}) = \sum _{i=1}^{N}\left[ h_j(x_{\pi _{j}(i)}) - h_j(x_{\pi _{j}(i+1)}) \right] g\left( A_{\pi _{j}(i)} \right) , \end{aligned}$$
(1)

for \(A_{\pi _{j}(i)} = \{x_{\pi _{j}(1)},\ldots ,x_{\pi _{j}(i)}\}\) and permutation \(\pi _{j}\) such that \(h_j(x_{\pi _{j}(1)}) \ge \cdots \ge h_j(x_{\pi _{j}(N)})\), where \(h_j(x_{\pi _{j}(N+1)})=0\) (Sugeno 1974).
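As a minimal computational sketch of Eq. (1), the following (hypothetical) helper evaluates the discrete CI for one object. It assumes the capacity is stored as a dictionary keyed by subset bitmasks, with the boundary values of Definition 1 already in place.

```python
import numpy as np

def choquet_integral(h, g):
    """Discrete Choquet integral (Eq. 1) of the inputs h w.r.t. capacity g.

    h: length-N array of integrand values h_j(x_i); g: dict mapping subset
    bitmasks over {0,...,N-1} to measure values, with g[0] = 0.
    """
    h = np.asarray(h, dtype=float)
    order = np.argsort(-h)                     # permutation pi_j (descending sort)
    h_sorted = np.append(h[order], 0.0)        # h_j(x_{pi_j(N+1)}) = 0
    total, mask = 0.0, 0
    for i, src in enumerate(order):
        mask |= 1 << int(src)                  # A_{pi_j(i)} = {x_{pi_j(1)},...,x_{pi_j(i)}}
        total += (h_sorted[i] - h_sorted[i + 1]) * g[mask]
    return total
```

Combined with the Sugeno \(\lambda\)-FM sketch above, for example, choquet_integral([0.6, 0.2, 0.9], sugeno_measure([0.3, 0.2, 0.1])) aggregates the three inputs nonlinearly.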

2.1 Shapley and interaction indices

The CI is parameterized by the capacity. Specifically, the capacity encodes all of the rich tuple-wise interactions between the different sources and the CI utilizes this information to aggregate the inputs (the integrand, h). It is important to note that the CI operates on a weaker (and richer) premise than a great number of other aggregation operators that assume additivity (a stronger property than monotonicity). However, the capacity has a large number of values. It is not trivial to understand a capacity. For example, a commonly encountered question is what is the so-called worth of a single individual source? Information theoretic indices aid us in summarizing information such as this in the capacity. The point is, most of our questions are not about a single capacity value; we are interested in a complex question whose answer is dispersed across the capacity. For example, the Shapley index has been proposed to summarize the so-called worth of an individual source and the interaction index summarizes interactions between different sources.

Definition 4

(Shapley Index) The Shapley values of g are

$$\begin{aligned} \Phi _{g}(i)&= \sum _{ K \subseteq X \backslash \{i\} }\zeta _{X,1}( K )\left( g(K \cup \{i\}) - g(K) \right) , \end{aligned}$$
(2a)
$$\begin{aligned} \zeta _{X,1}(K)&= \frac{(|X| - |K| - 1)!|K|!}{|X|!}, \end{aligned}$$
(2b)

where \(K\subseteq X \backslash \{i\}\) denotes all subsets of X that do not include source i. The Shapley value of g is the vector \(\Phi _{g}=(\Phi _{g}(1),\ldots ,\Phi _{g}(N))^t\) and \(\sum _{i=1}^{N}\Phi _{g}(i)=1\). The Shapley index can be interpreted as the average amount of contribution of source i across all coalitions. Equation (2a) is a weighted sum of the (non-negative, by monotonicity) differences between consecutive steps (layers) in the capacity.
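A brute-force evaluation of Eqs. (2a)–(2b) is straightforward for small N. The sketch below is our own illustration, reusing the bitmask capacity representation from the earlier sketches; it enumerates every \(K \subseteq X \backslash \{i\}\) and therefore scales as \(O(N 2^N)\).

```python
from math import factorial

def shapley_values(g, n):
    """Shapley value Phi_g(i) of each source (Eqs. 2a-2b); g is bitmask-indexed."""
    phi = [0.0] * n
    fact_n = factorial(n)
    for i in range(n):
        for K in range(2 ** n):
            if K & (1 << i):
                continue                       # K must not contain source i
            k = bin(K).count('1')
            zeta = factorial(n - k - 1) * factorial(k) / fact_n
            phi[i] += zeta * (g[K | (1 << i)] - g[K])
    return phi
```

The returned values sum to 1, so a capacity’s Shapley vector can be read as a distribution of worth over the N sources.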

Remark 1

It is important to note the following property. When \(g(A)=0, \forall A \subset X\), the CI is the minimum operator. The Shapley values are \(\Phi _{g}(1)=\Phi _{g}(2)=\cdots =\Phi _{g}(N)\). This is easily verified: for such a capacity, the only nonzero term in Eq. (2a) for source \(x_i\) is the difference between g(X) and \(g(X \backslash \{x_{i}\})\). Thus, each Shapley value reduces to the same quantity, \(\zeta _{X,1}(K)\), for \(K \in 2^X\) with \(|K|=|X|-1\).

Definition 5

(Interaction Index) The interaction index (Murofushi and Soneda 1993) between i and j is

$$\begin{aligned} {\mathbf {I}}_{g}(i,j) = \sum _{ K \subseteq X \backslash \{i,j\} }\zeta _{X,2}( K ) \left( g(K \cup \{i,j\}) - g(K \cup \{i\}) - g(K \cup \{j\}) + g(K) \right) , \end{aligned}$$
(3a)
$$\begin{aligned} \zeta _{X,2}(K)&= \frac{(|X| - |K| - 2)!|K|!}{(|X|-1)!}, \end{aligned}$$
(3b)

where \({\mathbf {I}}_{g}(i,j) \in [-1,1], \forall i,j\in \{1,2,\ldots ,N\}\). A value of 1 (respectively, \(-1\)) represents maximum complementarity (respectively, redundancy) between i and j. The reader can refer to Grabisch and Roubens (2000) for further details about the interaction index, its connections to game theory and its interpretations. Grabisch later extended the interaction index to the general case of any coalition (Grabisch and Roubens 2000).
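For completeness, the pairwise interaction index of Eqs. (3a)–(3b) can be computed in the same brute-force manner (again assuming the bitmask capacity representation used in the earlier sketches):

```python
from math import factorial

def interaction_index(g, n, i, j):
    """Murofushi-Soneda interaction index I_g(i, j) (Eqs. 3a-3b)."""
    total = 0.0
    for K in range(2 ** n):
        if K & ((1 << i) | (1 << j)):
            continue                           # K must exclude both i and j
        k = bin(K).count('1')
        zeta = factorial(n - k - 2) * factorial(k) / factorial(n - 1)
        total += zeta * (g[K | (1 << i) | (1 << j)]
                         - g[K | (1 << i)] - g[K | (1 << j)] + g[K])
    return total
```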

Definition 6

(Interaction Index for coalition A) The interaction index for any coalition \(A \subseteq X\) is

$$\begin{aligned} {\mathbf {I}}_{g}(A) = \sum _{ K \subseteq X \backslash A }\zeta _{X,3}( K, A )\sum _{C \subseteq A}{(-1)^{|A \backslash C|}g(C \cup K)}, \end{aligned}$$
(4a)
$$\begin{aligned} \zeta _{X,3}(K, A)&= \frac{(|X| - |K| - |A|)!|K|!}{(|X|-|A|+1)!}. \end{aligned}$$
(4b)

Equation (4a) is a generalization of both the Shapley index and Murofushi and Soneda’s interaction index as \(\Phi _{g}(i)\) corresponds with \({\mathbf {I}}_{g}(\{i\})\) and \({\mathbf {I}}_{g}(i,j)\) with \({\mathbf {I}}_{g}(\{i,j\})\).

While the Shapley and interaction indices are extremely useful, they do not, in their current explicit form, inform us about capacity complexity. In the next subsection, we review additional information theoretic capacity indices and we discuss their interpretations.

2.2 Existing indices for capacity complexity

Excluding indices that are subsumed by others, the bottom line is that various indices exist for different reasons. First, some indices are computationally simpler, while others are mathematically simpler in terms of our ability to manipulate and use them for tasks like optimization. Second, and arguably most important, complexity can and often does mean different things to different people/applications. As we discuss in this article, there is no clear winning index. Different indices are important for different applications, and knowledge of their existence and associated benefits is what is ultimately important. In this section we review existing information-theoretic indices for complexity.

Definition 7

(\(\ell _1\)-Norm of a Lexicographically Encoded Capacity Vector) Let \({\mathbf {u}}\in \mathbb {R}^{2^N-1}\) be \({\mathbf {u}}= \left( g_1, g_2, \ldots , g_{12}, g_{13}, \ldots , g_{12 \ldots N} \right) ^t\). Note that we define this ordering such that it is also sorted by cardinality. A relatively simple index of the complexity of g is

$$\begin{aligned} \upsilon_{\ell 1}(g) = \sum _{j=1}^{2^N-1}{ \left| {\mathbf {u}}_{j} \right| } = \sum _{j=1}^{2^N-1}{ {\mathbf {u}}_{j} }. \end{aligned}$$
(5)

As stated in Anderson et al. (2014b) and Mendez-Vazquez and Gader (2007), the intent of \(\upsilon_{\ell 1}(g)\) was to help reduce the number of nonzero parameters in the capacity to eliminate non-informative or useless information sources. However, this index is not as sophisticated as desired. The index is minimized when all \({\mathbf {u}}_{j}\) are equal to zero, i.e., the FM \(g(A)=0, \forall A \subset X\), which is a minimum operator for the CI (Tahani and Keller 1990). We also note that this is an FM of “ignorance”, as we assert that the answer resides in X, yet we have assigned \(g(A)=0\) to every proper subset of X. The index is maximized for the FM \(g(A)=1, \forall A \subseteq X\), which is a maximum operator for the CI (Tahani and Keller 1990). There are really two problems with this index. First, it does not take advantage of any capacity summary mechanism like the Shapley index, interaction index or k-additivity. Second, it is well-known that the \(\ell _1\)-norm is (geometrically) inferior to the \(\ell _0\)-norm when it comes to promoting sparsity. However, the \(\ell _1\)-norm gives rise to convex problems that we can more easily solve, while the \(\ell _0\)-norm does not.

Definition 8

(Marichal’s Index) Marichal’s Shannon-based entropy of g (Marichal 1998, 2000),

$$\begin{aligned} v_{M}(g)&= (-1)\sum _{j=1}^{N}\left( \sum _{ K \subseteq X \backslash \{j\} }\zeta _{X,1}\left( K \right) \left( g(K \cup \{j\}) - g(K) \right) \right. \nonumber \\&\quad \times \left. \ln ( g(K \cup \{j\}) - g(K) ) \right) , \end{aligned}$$
(6a)

is motivated in terms of the following (Kojadinovic 2006). Consider the set of all maximal chains of the Hasse diagram (\(2^N,\subseteq\)). A maximal chain in (\(2^N,\subseteq\)) is a sequence

$$\phi , \{x_{\pi (1)}\}, \{x_{\pi (1)},x_{\pi (2)}\}, \ldots , \{x_{\pi (1)},\ldots ,x_{\pi (N)}\},$$

where \(\pi\) is a permutation of N. On each chain, we can define a “probability distribution”,

$$\begin{aligned}p^g_{\pi }(i)&:=g(\{x_{\pi (i)},\ldots ,x_{\pi (N)}\})-g(\{x_{\pi (i+1)},\ldots ,x_{\pi (N)}\}),\\ i&\,=1,\ldots ,N, \pi \in \Pi _N. \end{aligned}$$

It is not entirely clear to us why this is called a probability distribution. For example, it is confusing why this is the case for a belief measure, a possibility measure, etc. We assume it is interpreted as such because the values are non-negative (by monotonicity) and sum to 1. Furthermore, Kojadinovic (2006) states that “...the intuitive notion of uniformity of a capacity g on N can then be defined as an average of the uniformity values of the probability distributions” (distributions provided according to \(p^g_{\pi }(i)\)) (Kojadinovic 2006). Regardless, this account of entropy is the average of the uniformity values of the underlying probability distributions. In general, such an index can be of help with respect to the maximum entropy principle. Furthermore, maximization of index \(v_{M}(g)\) is non-linear and not quadratic (Labreuche 2008). As stated in Labreuche (2008), a quadratic problem under linear constraints can be obtained by considering a special case of Renyi entropy.

It is trivial to prove that minimum entropy for Eq. (6a) occurs if and only if \(g(K \cup \{j\}) - g(K)\) yields values in \(\{0,1\}\). Note, Eq. (6a) is defined for

$$g(K \cup \{j\}) - g(K) =0$$

by adopting the convention \(0 \ln 0 = 0\). While many properties of this definition of entropy are discussed in Kojadinovic (2006), a few important properties were not discussed. First, there is not a single unique “solution” (minimum). That is, an FM of all 0s (minimum operator) and an FM of all 1s (maximum operator) both satisfy this criterion. Other FMs satisfy it as well, e.g., the N different order statistics where a single input becomes the output and all other inputs are discarded (one input has a Shapley value of 1 and all other inputs have a Shapley value of 0). There is also the case of the ordered weighted average (OWA) (Yager 1988). An OWA is a special case of the CI when the FM is defined such that sets of equal cardinality have equal measure. The OWA weights are simply the differences between the measure values at consecutive cardinalities, i.e., \(w_i = g\left( A_i \right) - g\left( A_i {\setminus } \{i\} \right)\). For N inputs, we have N such OWAs that yield the mentioned minimum: capacities with values of 1 for all sets \(A \subseteq X\) with \(|A| \ge k\) and 0 otherwise, for \(k=1,\ldots ,N\). Note, two of these N cases are the maximum and minimum aggregation operators. On the other hand, maximum entropy occurs in the case of a “uniform distribution” (all \(\frac{1}{N}\) values). This occurs only for the capacity \(g(A)=|A|/|X|\), which is a CI-based average operator. This uniqueness of the maximization case was one of the motivating reasons for the proposal of Marichal’s index (the maximum entropy principle).
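The following short sketch evaluates Eq. (6a) directly, adopting the \(0\ln 0 = 0\) convention discussed above; it reuses the bitmask capacity representation of the earlier sketches and is our own illustration, not code from the cited works.

```python
from math import factorial, log

def marichal_entropy(g, n):
    """Marichal's entropy of a capacity g (Eq. 6a), with the 0*ln(0) = 0 convention."""
    total, fact_n = 0.0, factorial(n)
    for j in range(n):
        for K in range(2 ** n):
            if K & (1 << j):
                continue                       # K must not contain source j
            k = bin(K).count('1')
            zeta = factorial(n - k - 1) * factorial(k) / fact_n
            diff = g[K | (1 << j)] - g[K]
            if diff > 0.0:                     # skip the 0*ln(0) terms
                total -= zeta * diff * log(diff)
    return total
```

For the additive capacity \(g(A)=|A|/N\), every difference equals \(\frac{1}{N}\) and the entropy attains its maximum value \(\ln N\).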

Definition 9

(Shannon’s Entropy of the Shapley Values) In Anderson et al. (2014b), a related but different formulation of Shannon’s entropy was explored in terms of the Shapley values,

$$\begin{aligned} \upsilon_{S}(g) = (-1)\sum _{j=1}^{N}\Phi _{g}\left( j \right) \ln \left( \Phi _{g}\left( j \right) \right) . \end{aligned}$$
(7)

Note, the Shapley index values sum to 1, i.e.,

$$\begin{aligned} \sum _{j=1}^{N}\Phi _{g}(j)=1. \end{aligned}$$

Furthermore, Eq. (7) is not defined for \(\Phi _{g}(j)=0\); it is extended by adopting the convention \(\Phi _{g}(j)\ln \left( \Phi _{g}(j) \right) =0\). When only one source is needed, i.e., one source is consistently superior to all others, a single Shapley value is 1 and the others are 0, i.e., Eq. (7) equals 0. There are N such unique cases. There is no case in which the Shapley values are all 0s or all 1s (by definition, they sum to 1). On the other hand, the more uniformly distributed the Shapley values become, the more inputs are required (each is important relative to solving the task at hand). In the extreme case, when all Shapley values are \(\frac{1}{N}\), all sources are needed and we obtain maximum entropy. This occurs when g causes the CI to reduce to an OWA, and there is an infinite set of such capacities/OWAs (for a real-valued FM).

In summary, Eq. (7) has fewer, and more easily rationalized, solutions than Eq. (6a) when minimizing the entropy of a capacity, while Eq. (6a) has fewer solutions when maximizing entropy. However, while there are more maximizing solutions in the case of Eq. (7), they can easily be rationalized (all such capacities treat the inputs as equally important in terms of the CI). These two definitions of entropy are similar but not equivalent.

Definition 10

Kojadinovic’s variance of g is (Kojadinovic 2006)

$$\begin{aligned} v_{K}(g)&= \frac{1}{N}\sum _{j=1}^{N}\left( \sum _{ K \subseteq X \backslash \{j\} }\zeta _{X,1}( K )\right. \nonumber \\&\quad \left. ( g(K \cup \{j\}) - g(K) - \frac{1}{N} )^{2} \right) . \end{aligned}$$
(8a)

It is trivial to verify that this index equals 0 if and only if the differences \(g(K \cup \{j\}) - g(K)\) all equal \(\frac{1}{N}\). This is unique in that it only occurs for \(g(A)=|A|/|X|\) (i.e., a CI that reduces to the average operator). As Kojadinovic discusses, Eq. (8a) is a simpler way (versus Marichal’s index, which involves logarithms) to measure the uniformity of a distribution; it equals 0 precisely in the “uniform distribution” case. Kojadinovic’s goal, in the theme of Marichal’s notion of entropy, is that of maximum entropy, i.e., the “least specific” capacity compatible with the initial preferences of a decision maker. Kojadinovic’s variance is maximized when the differences \(g(K \cup \{j\}) - g(K)\) take values in \(\{0,1\}\). This occurs in the case of a minimum operator, maximum operator, or the other \((N-2)\) OWAs discussed in the case of Marichal’s entropy. Thus, Kojadinovic’s variance and Marichal’s entropy are tightly coupled, while Eq. (7) is once again different in its design and set of relevant solutions.

Definition 11

Labreuche’s linearized entropy of g is (Labreuche 2008)

$$\begin{aligned} v_{L}(g) =&(-1)\sum _{j=1}^{N} \Biggl ( \sum _{ K \subseteq X \backslash \{j\} }\zeta _{X,1} \left( K \right) \nonumber \\&\left| g(K \cup \{j\}) - g(K) - \frac{1}{N} \right| \Biggr ). \end{aligned}$$
(9a)

The primary goal of this index is to linearize Kojadinovic’s index to assist in optimization (apply linear programming). Labreuche’s goal was to also satisfy, with respect to the different probability distributions, symmetry (the value is invariant to input permutation), maximality and minimality (attained, respectively, at the probability distribution of all \(\frac{1}{N}\) values and at the distributions of all zeros with a single value of one). Kojadinovic’s index does not satisfy the last two properties. Furthermore, index \(v_{L}(g)\) has a single minimum, the capacity \(g(A)=\frac{|A|}{|X|}\), which results in the CI becoming the mean operator. Labreuche’s index attains its maximum whenever each probability distribution is degenerate (a single value of one and zeros elsewhere). In terms of the capacity, this equates to the minimum operator, the maximum operator, and the other \((N-2)\) OWAs discussed for Kojadinovic’s index.

Definition 12

The k-additive based index is (Tehrani et al. 2012; Tehrani and Hüllermeier 2013; Tehrani 2013)

$$\begin{aligned} v_{T}(g) = \sum _{A \subseteq X}{ f(|A|)|\mathcal {M}(A)| }, \end{aligned}$$
(10)

where f is a strictly increasing function defined on the cardinality of subsets of X and \(\mathcal {M}\) is the Mobius transform of g (Grabisch et al. 2000; Grabisch and Roubens 2000). The Mobius transform of g is used here to highlight and exploit k-additivity, i.e., \(\mathcal {M}(B)=0, \forall B \subseteq X\) with \(|B| > k\). This is a different approach as k-additivity allows for what could be considered a “compact” representation of g (under a set of restrictions) to combat the otherwise combinatorial explosion of g: \(\sum _{i=1}^k \binom{N}{i}\) terms versus \(2^N\). In summary, \(v_{T}(g)\) favors the restriction that capacities have a low level of nonadditivity.

It is well-known that the sum of the Mobius terms for the capacity is equal to one (Beliakov et al. 2008). However, \(v_{T}(g)\) considers the sum of the absolute values of the Mobius terms. It is trivial to prove that \(v_{T}(g)\) has a single maximum for the case of a capacity of all ones, \(g(A)=1, \forall A \subseteq X\), i.e., the maximum operator. Although these values sum to one, they can be any value in \([-1,1]\). This index does not have a unique minimum. For example, a capacity of all zeros (except g(X)) has an index value of 1, the mean operator, \(g(A)=\frac{|A|}{|X|}\), has an index of 1, and a capacity where a single input has a Shapley value of 1 has an index of 1. In general, the higher the k-additivity, the greater the \(v_{T}(g)\).

While the above indices are all useful in their respective contexts, none truly address the desire to favor fewer numbers of inputs. In the next subsection we explore a few new indices to achieve this goal based on utilization of the Shapley values.

2.3 New indices of complexity based on the Shapley values

Definition 13

(\(\ell _1\)-Norm of Shapley Values) Let \({\mathbf {\Phi }}_{g} = \left( \Phi _{g}(1), \; \Phi _{g}(2), \; \ldots , \; \Phi _{g}(N) \right) ^t\), a vector of size \(N\times 1\), be the vector of Shapley values. The so-called \(\ell _1\)-norm of \({\mathbf {\Phi }}_{g}\) is

$$\begin{aligned} \left\| {\mathbf {\Phi }}_{g} \right\| _1 = \sum _{i=1}^{N}{ \left| \Phi _{g}(i) \right| } = \sum _{i=1}^{N}{ \Phi _{g}(i) } = 1. \end{aligned}$$
(11)

It is important to note that the constraint that the Shapley values sum to 1 renders the \(\ell _1\) index useless for regularization (as it yields a constant). Next, we explore the \(\ell _0\).

Definition 14

(\(\ell _0\)-Norm of Shapley Values) Let \({\mathbf {\Phi }}_{g} = \left( \Phi _{g}(1), \; \Phi _{g}(2), \; \ldots , \; \Phi _{g}(N) \right) ^t\), a vector of size \(N\times 1\), be the vector of Shapley values. The so-called \(\ell _0\)-norm of \({\mathbf {\Phi }}_{g}\) is

$$\begin{aligned} \left\| {\mathbf {\Phi }}_{g} \right\| _0 = \left| \left\{ i : \Phi _{g}(i) \ne 0 \right\} \right| . \end{aligned}$$
(12)

Technically, the \(\ell _0\)-norm is not really a norm. It is a cardinality function that counts the number of non-zero terms. The \(\ell _0\)-norm has been used extensively in areas like compressive sensing, where the goal is typically to find the sparsest solution of an under-determined linear system. If we define Eq. (12) on the lexicographically encoded capacity vector, versus the Shapley values vector, then we would be back in the same predicament of striving for a capacity of all 0s (except for \(g(X)=1\)), viz., the minimum operator for the CI. It is clear that Eq. (12) has its minimum for the case of one Shapley value equal to 1 (thus all other Shapley values are equal to 0). Its next smallest value is for the case of two Shapley values greater than zero and all other Shapley values equal to zero (and so forth). Thus sparsity, in the sense of the fewest number of non-zero values, is preserved via the \(\ell _0\)-norm. Specifically, Eq. (12) has N minima, e.g., \({\mathbf {\Phi }}_{g}=\left( 1,0,\ldots ,0\right) ^t\), \({\mathbf {\Phi }}_{g}=\left( 0,1,0,\ldots ,0\right) ^t\), ..., \({\mathbf {\Phi }}_{g}=\left( 0,0,\ldots ,0,1\right) ^t\). For two non-zero values, there are \(\binom{N}{2}\) such solutions (\(\binom{N}{k}\) in general for k non-zero inputs).

As an index, the \(\ell _0\) with respect to the Shapley values is fantastic at promoting a smaller number of non-zero parameters (inputs). Problem solved, correct? Not entirely. One (big) challenge is that the \(\ell _0\) results in a non-convex optimization problem that is NP-hard. Before we consider \(\ell _0\) approximation, we investigate an alternative index, based on the Gini–Simpson coefficient, that is theoretically inferior (the tradeoff) but simpler to solve for.

The Gini coefficient (aka Gini index or Gini ratio) is a summary statistic of the Lorenz curve and was developed by the Italian statistician Corrado Gini in 1912. It is important to note that numerous mathematical formulations exist, from Gini’s original formula to Brown’s Gini-style index measuring inequalities in health (Gini 1936; Brown 1994). A full review of the Gini index and its various discrete and continuous formulations is beyond the scope of this article (for a recent review, see Farris 2010). The Gini index is used extensively in areas like biological systems (for measuring species similarity, Leinster and Cobbold 2012), knowledge discovery in databases (often referred to as an “impurity function”), the social sciences and economics. For example, it is often used as a measure of income inequality within a population. At one extreme, the Gini index equates to perfect “equality” (everyone has the same income) and, at the other extreme, to perfect “inequality” (one person has all the wealth and everyone else has zero income). Herein, we use a mathematically simple, yet pleasing, instantiation of the Gini index, given at Eq. (13), often referred to as the Gini–Simpson index (or, in ecology, as the probability of interspecific encounter) with respect to a probability distribution (the Shapley values satisfy this criterion). However, the Gini–Simpson function belongs to a larger family of functions parameterized by a variable q (the sensitivity parameter) and Z (a matrix of similarity values between elements in the distribution) (Leinster and Cobbold 2012). Based on q and Z, we get different diversity measures, e.g., Shannon’s entropy, Rao’s quadratic entropy, the “species richness” index, etc.

Definition 15

(Gini–Simpson Index of Shapley Values) The Gini–Simpson index of the Shapley values is

$$\begin{aligned} \upsilon_{G}(g) = \sum _{i=1}^{N}\sum _{j=1,j \ne i}^{N}\Phi _{g}(i)\Phi _{g}(j) = 1 - \sum _{i=1}^{N}{{\Phi _{g}(i)}^{2}}. \end{aligned}$$
(13)

Note, \(\upsilon_{G}(g)=0\) if and only if there is a single Shapley value equal to 1 (therefore all other values are equal to 0). There are N such possible unique solutions to this criterion. If \(\Phi _{g}(i)=1\) and \(\Phi _{g}(j)=0, \forall j \ne i\), then all g subsets that contain input i have value 1 and all others have value 0. Also, the maximum of \(\upsilon_{G}(g)\) occurs only when all Shapley values are equal. When these values are all equal, all inputs are “equally important” and increasing the cardinality of any set of inputs increases the FM by a constant value, regardless of which input was used to increase cardinality. This essentially reduces the CI to an OWA. It is obvious that Eq. (13) is nothing more than one minus the squared \(\ell _2\)-norm of the Shapley values. Next, we provide simple numeric examples (Table 2) to (empirically) demonstrate some similarities and differences between the \(\ell _0\) and the Gini–Simpson.
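The two proposed indices are trivial to compute from a Shapley vector. The snippet below (with purely illustrative Shapley vectors, not the rows of Table 2) shows how the \(\ell _0\)-norm and the Gini–Simpson index can rank the same vectors differently, mirroring the \(g_c\) versus \(g_d\) discussion that follows.

```python
import numpy as np

def gini_simpson(phi):
    """Eq. (13): one minus the squared l2-norm of the Shapley values."""
    phi = np.asarray(phi, dtype=float)
    return 1.0 - float(np.sum(phi ** 2))

def l0_norm(phi, tol=1e-9):
    """Eq. (12): number of (numerically) non-zero Shapley values."""
    return int(np.sum(np.abs(np.asarray(phi)) > tol))

for phi in ([1.0, 0.0, 0.0],    # single dominant input: both indices are minimal
            [0.0, 0.5, 0.5],    # one useless input, two equally important inputs
            [0.8, 0.1, 0.1],    # no useless input, but one clearly dominant input
            [1/3, 1/3, 1/3]):   # uniform worth: both indices are maximal
    print(l0_norm(phi), round(gini_simpson(phi), 3))
```

Here the \(\ell _0\)-norm prefers the second vector (one zero entry) to the third, while the Gini–Simpson index prefers the third, which is closer to having a single dominant input.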

Table 2 Numeric examples, for \(N=3\), illustrating \(\ell _0\) and Gini–Simpson differences

Again, the \(\ell _0\) wants the fewest number of non-zero parameters, whereas the Gini–Simpson index is a measure of diversity in the Shapley values, or more specifically one minus the squared \(\ell _2\)-norm, that ultimately aims to promote, in the extreme case, a single dominant input (one Shapley value of 1 and all other values equal to 0). Both have their lowest value for a single input (case \(g_a\)) and their maximum value for the case of a uniform distribution (case \(g_g\)). While their trends are often similar, e.g., both prefer \(g_a\) to \(g_b\) and \(g_b\) to \(g_d\) and \(g_e\), they do not always agree. For example, consider \(g_c\) and \(g_d\). The \(\ell _0\)-norm prefers \(g_c\) to \(g_d\) as the former has one zero term and the latter has no zero terms. However, the Gini–Simpson index prefers \(g_d\) to \(g_c\). In \(g_c\), the Shapley values indicate that one input is not important while the other two inputs are equally important. In \(g_d\), the Shapley values indicate that one input is important and the other two inputs are equally, and only modestly, important. According to the Gini–Simpson index, \(g_d\) is closer to a single input than \(g_c\). This behavior is further emphasized by \(g_f\).

In addition, due to the relationship between the Gini–Simpson index and the \(\ell _2\)-norm of the Shapley values, the underlying geometric interpretation and sparseness of solutions for the family of \(\ell _p\)-norms are well known and well documented (e.g., see Tibshirani et al. 2005). In the following sections, we outline a way to perform regularization based on the \(\ell _0\)-norm and the Gini–Simpson index of the Shapley values. However, we first review QP-based solutions to capacity learning and \(\ell _p\)-norm regularization.

3 Sum of squared error and quadratic programming

Definition 16

(Sum of Squared Error of CI and T) Let the SSE between T and the CI be (Grabisch et al. 2000; Anderson et al. 2014b)

$$\begin{aligned} E_{1} = \sum _{j=1}^{m}( C_{g}(h_{j}) - \alpha _{j} )^2. \end{aligned}$$
(14)

Equation (14) can be expanded as follows,

$$\begin{aligned} E_{1} = \sum _{j=1}^{m}( {\mathbf {A}}_{O_{j}}^{t} {\mathbf {u}}- \alpha _{j} )^2, \end{aligned}$$

where \({\mathbf {u}}\) is the lexicographically encoded capacity vector and \({\mathbf {A}}^{t}_{O_{j}} = \left( \ldots , h_j(x_{\pi _{j}(1)}) - h_j(x_{\pi _{j}(2)}), \ldots , 0, \ldots , h_j(x_{\pi _{j}(N)}) \right) ^{t}\) is of size \(1\times (2^N-1)\). Note, the function differences, \(h_j(x_{\pi _{j}(i)}) - h_j(x_{\pi _{j}(i+1)})\), correspond to their respective g locations in \({\mathbf {u}}\). Folding Eq. (14) out further, we find

$$\begin{aligned} E_{1}&= \sum _{j=1}^{m}( {\mathbf {u}}^{t}{\mathbf {A}}_{O_{j}}{\mathbf {A}}_{O_{j}}^{t}{\mathbf {u}}- 2\alpha _{j}{\mathbf {A}}_{O_{j}}^{t}{\mathbf {u}}+ \alpha _{j}^{2} ) \nonumber \\&= {\mathbf {u}}^{t}\mathbf D{\mathbf {u}}+ \mathbf f^{t}{\mathbf {u}}+ \sum _{j=1}^{m}\alpha _{j}^{2}, \end{aligned}$$
(15)

where \(\mathbf D= \sum _{j=1}^{m} {\mathbf {A}}_{O_{j}}{\mathbf {A}}_{O_{j}}^{t}\) and \(\mathbf f= \sum _{j=1}^{m} ( - 2\alpha _{j}{\mathbf {A}}_{O_{j}} )\). In total, the capacity has \(N(2^{N-1}-1)\) monotonicity constraints. These constraints can be represented in a compact linear algebra (aka matrix) form using the minimum number of constraints needed to enforce monotonicity. Let \(\mathbf C{\mathbf {u}}+ \mathbf b\le \mathbf 0\), where \(\mathbf C^{t} = \left( \Psi _{1}^{t}, \Psi _{2}^{t}, \ldots , \Psi _{N+1}^{t}, \ldots , \Psi _{N(2^{N-1}-1)}^{t} \right) ^{t}\), and \(\Psi _{1}\) is a vector representation of constraint 1, \(g_{1} - g_{12} \le 0\). Specifically, \(\Psi _{1}^{t}{\mathbf {u}}\) recovers \({\mathbf {u}}_{1} - {\mathbf {u}}_{N+1}\). Thus, C is simply a matrix of \(\{0,1,-1\}\) values,

$$\begin{aligned} \mathbf C= \left[ \begin{array}{cccccccc} 1 & 0 & \cdots & -1 & 0 & \cdots & \cdots & 0 \\ 1 & 0 & \cdots & 0 & -1 & \cdots & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 & 0 & \cdots & 1 & -1 \\ \end{array} \right] , \end{aligned}$$
(16)

which is of size \((N(2^{N-1}-1))\times (2^N-1)\). Also, \(\mathbf b=\mathbf {0}\), a vector of all zeroes. Note, in some works, \({\mathbf {u}}\) is of size \((2^{N}-2)\), as \(g(\phi )=0\) and \(g(X)=1\) are explicitly encoded. In such a case, \(\mathbf b\) is a vector of 0s and the last N entries are of value −1. Herein, we use the \((2^{N}-1)\) format as it simplifies the subsequent Shapley index mathematics. Given T, the search for FM g reduces to a QP of the form

$$\begin{aligned} \min _{\mathbf {u}}\frac{1}{2}{\mathbf {u}}^{t}\hat{\mathbf{ D}}{\mathbf {u}}+ \mathbf f^{t}{\mathbf {u}}, \end{aligned}$$
(17)

subject to \(\mathbf C{\mathbf {u}}+ \mathbf b\ge \mathbf 0\) and \((\mathbf {0},1)^t \le {\mathbf {u}}\le \mathbf {1}\). The difference between Eqs. (17) and (15) is that \(\hat{\mathbf {D}} = 2\mathbf D\), and the inequality associated with Eq. (16) need only be multiplied by \(-1\).
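To illustrate the full pipeline of Eqs. (14)–(17), the sketch below assembles \(\mathbf D\), \(\mathbf f\) and the monotonicity constraints and hands them to a generic solver. All names (lex_index, build_qp, solve_fm_qp, learn_capacity) are hypothetical, and scipy's SLSQP routine is used as a stand-in for a dedicated QP solver; a production implementation would exploit sparsity and a true QP package.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def lex_index(n):
    """Map each nonempty subset of {0,...,n-1} to its position in u
    (sorted by cardinality, then lexicographically)."""
    subsets = []
    for k in range(1, n + 1):
        subsets += [frozenset(c) for c in itertools.combinations(range(n), k)]
    return {s: i for i, s in enumerate(subsets)}

def build_qp(H, alpha, n):
    """Assemble D, f of Eq. (15) and the minimal monotonicity constraint rows G."""
    idx, dim = lex_index(n), 2 ** n - 1
    D, f = np.zeros((dim, dim)), np.zeros(dim)
    for h, a in zip(H, alpha):
        order = np.argsort(-h)
        hs = np.append(h[order], 0.0)
        A = np.zeros(dim)
        for i in range(n):
            A[idx[frozenset(map(int, order[:i + 1]))]] = hs[i] - hs[i + 1]
        D += np.outer(A, A)
        f += -2.0 * a * A
    G = []                                   # rows encode g(B) <= g(B U {x})
    for B, b_pos in idx.items():
        if len(B) == n:
            continue
        for x in set(range(n)) - B:
            r = np.zeros(dim)
            r[idx[B | {x}]], r[b_pos] = 1.0, -1.0
            G.append(r)
    return D, f, np.array(G), idx

def solve_fm_qp(Q, f, G, n):
    """Minimize u^t Q u + f^t u s.t. G u >= 0, 0 <= u <= 1 and g(X) = 1."""
    idx, dim = lex_index(n), 2 ** n - 1
    u0 = np.array([len(s) / n for s in sorted(idx, key=idx.get)])   # start at g(A) = |A|/N
    bounds = [(0.0, 1.0)] * (dim - 1) + [(1.0, 1.0)]
    cons = {'type': 'ineq', 'fun': lambda u: G @ u, 'jac': lambda u: G}
    res = minimize(lambda u: u @ Q @ u + f @ u, u0,
                   jac=lambda u: (Q + Q.T) @ u + f,
                   bounds=bounds, constraints=[cons], method='SLSQP')
    return res.x

def learn_capacity(H, alpha, n):
    """Unregularized SSE learning of the capacity, Eq. (17)."""
    D, f, G, idx = build_qp(np.asarray(H, float), np.asarray(alpha, float), n)
    return solve_fm_qp(D, f, G, n), idx
```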

In Mendez-Vazquez and Gader (2007), it was pointed out that the QP approach for learning the CI is not without flaw due to the exponential size of the input. While scalability is definitely of concern, many techniques have been, and continue to be, proposed for solving QPs with respect to fairly large and sparse matrices (Cevher et al. 2014). This attention and progress is coming primarily as a response to machine learning, statistics and signal processing. A somewhat large and sparse matrix is not a “game stopper”. We do agree that there is mathematically a point where the task at hand becomes extremely difficult to solve and may eventually become intractable. However, most FI applications utilize a relatively small number of inputs, i.e., on the order of 3–5, versus 50 or 100. The fact that the QP is difficult (and may become intractable) to solve with respect to a sparse matrix for a large number of inputs is no reason to dismiss it. This challenge is akin to the current Big Data revolution, where previously intractable problems are being solved on a daily basis.

In general, the challenge of QP-based learning of the CI relative to a regularization term for tasks like decision-level fusion is the optimization of

$$\begin{aligned} E_{2} = \sum _{j=1}^{m}( {\mathbf {u}}^{t}{\mathbf {A}}_{O_{j}}{\mathbf {A}}_{O_{j}}^{t}{\mathbf {u}}- 2\alpha _{j}{\mathbf {A}}_{O_{j}}^{t}{\mathbf {u}}+ \alpha _{j}^{2} ) + \lambda v_{*}(g), \end{aligned}$$
(18)

where \(v_{*}(g)\) is one of our indices. In order for Eq. (18) to be suitable for the QP, \(v_{*}\) must be linear or quadratic.

Note, in certain problems one can simply fold the \(\ell _1\) regularization term into the linear term of the quadratic objective. Since \({\mathbf {u}}\ge \mathbf {0}\), we can rewrite \(\Vert {\mathbf {u}}\Vert _1 = \mathbf {1}^t {\mathbf {u}}\), where \(\mathbf {1}\) is the vector of all ones. Adding the regularization term to the QP, we get

$$\begin{aligned} \min _{\mathbf {u}}\frac{1}{2}{\mathbf {u}}^{t}\hat{\mathbf {D}}{\mathbf {u}}+ \mathbf f^{t}{\mathbf {u}}+ \lambda \mathbf {1}^t{\mathbf {u}}= \min _{\mathbf {u}}\frac{1}{2}{\mathbf {u}}^{t}\hat{\mathbf {D}}{\mathbf {u}}+ (\mathbf f+\lambda \mathbf {1})^t{\mathbf {u}}. \end{aligned}$$
(19)

4 Gini–Simpson index-based regularization of the Shapley values

We begin by considering a vectorial encoding of the Shapley index. The Shapley value of input 1 is

$$\begin{aligned} \Phi _{g}(1)&= \sum _{ K \subseteq X \backslash \{i=1\} }{\mathbf {\Gamma }}_{X}( K )( g(K \cup \{i=1\}) - g(K) ),\end{aligned}$$
(20a)
$$\begin{aligned}&= \eta _{1}g(\{x_{1}\}) + \eta _{2}[ \left( g(\{x_1,x_2\}) - g(\{x_{2}\}) \right) \nonumber \\&\quad + \left( g(\{x_1,x_3\}) - g(\{x_{3}\}) \right) + \cdots ] + \cdots , \end{aligned}$$
(20b)
$$\begin{aligned}&= \eta _{1}g(\{x_{1}\}) - \left[ \eta _{2}g(\{x_{2}\}) + \cdots + \eta _{2}g(\{x_{N}\}) \right] \nonumber \\&\quad + \left[ \eta _{2}g(\{x_1,x_2\}) + \cdots + \eta _{2}g(\{x_1,x_N\}) \right] + \cdots , \end{aligned}$$
(20c)

where \(\eta _{i}={\mathbf {\Gamma }}_{X}(K)\), and \(K \in 2^X\), s.t. \(|K|=i-1\) (Shapley normalization constants). What Eq. (20a) tells us is the following. The Shapley index can be represented in linear algebra/vectorial form,

$$\begin{aligned} {\mathbf {\Gamma }}_i = \left( {\mathbf {\Gamma }}_{i,1} , {\mathbf {\Gamma }}_{i,2} , \ldots , {\mathbf {\Gamma }}_{i,2^N-1} \right) ^t, \end{aligned}$$
(21)

where \({\mathbf {\Gamma }}_i\) is the same size as g and the \({\mathbf {\Gamma }}_i\) terms are the coefficients of Eq. (20a) arranged such that multiplication with the lexicographic FM vector yields a particular Shapley value. For example, for \(N=3\),

$$\begin{aligned} \Gamma _1 = \left( \frac{1}{3}, -\frac{1}{6}, -\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, -\frac{1}{3}, \frac{1}{3} \right) ^t. \end{aligned}$$

Thus, we can formulate a compact expression of an individual Shapley value as such,

$$\begin{aligned} \Phi _{g}(i) = {\mathbf {\Gamma }}_i^t {\mathbf {u}}, \end{aligned}$$
(22)

where \(\Phi _{g}(i) \in [0,1]\). Therefore, the Gini–Simpson index in linear algebra form becomes

$$\begin{aligned} \upsilon_{G}(g) = 1 - \sum _{k=1}^{N}{\left( {\mathbf {\Gamma }}_k^t {\mathbf {u}}\right) ^{2}}. \end{aligned}$$
(23)

Expanding Eq. (23) exposes an attractive property:

$$\begin{aligned} \upsilon_{G}(g) =&1 - \sum _{k=1}^{N}{\left( {\mathbf {\Gamma }}_k^t {\mathbf {u}}\right) ^{2}},\nonumber \\ =&1 - \sum _{k=1}^{N}{\left( {\mathbf {u}}^t {\mathbf {\Gamma }}_k {\mathbf {\Gamma }}_k^t {\mathbf {u}}\right) }, \nonumber \\ =&1 - {\mathbf {u}}^t {\mathbf {Z}}{\mathbf {u}}, \end{aligned}$$
(24)

where \({\mathbf {Z}}={\mathbf {\Gamma }}_1 {\mathbf {\Gamma }}_1^t + \cdots + {\mathbf {\Gamma }}_N {\mathbf {\Gamma }}_N^t\). First, \({\mathbf {\Gamma }}_k {\mathbf {\Gamma }}_k^t\) is positive semi-definite (PSD). Hence, \({\mathbf {Z}}\) is also PSD, as it is simply the addition of PSD matrices and addition preserves the PSD property. We propose a Gini–Simpson index-based regularization of \(E_{1}\) at Eq. (14) as follows.
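A direct way to materialize the \({\mathbf {\Gamma }}_i\) vectors of Eq. (21) and the matrix \({\mathbf {Z}}\) of Eq. (24) is sketched below. It assumes the same lexicographic subset-to-index map (lex_index) used in the earlier QP sketch, and the function name shapley_matrix is ours.

```python
import itertools
import numpy as np
from math import factorial

def shapley_matrix(n, idx):
    """Rows are the Gamma_i of Eq. (21), so that Phi_g(i) = Gamma[i] @ u."""
    Gamma = np.zeros((n, 2 ** n - 1))
    fact_n = factorial(n)
    for i in range(n):
        others = [x for x in range(n) if x != i]
        for k in range(0, n):                                # |K| = 0,...,N-1
            for K in itertools.combinations(others, k):
                zeta = factorial(n - k - 1) * factorial(k) / fact_n
                Gamma[i, idx[frozenset(K) | {i}]] += zeta    # +zeta on g(K U {i})
                if K:                                        # g(empty set) = 0 is not in u
                    Gamma[i, idx[frozenset(K)]] -= zeta      # -zeta on g(K)
    return Gamma

# Z of Eq. (24) is then a sum of outer products, hence PSD:
# Gamma = shapley_matrix(N, lex_index(N)); Z = Gamma.T @ Gamma
```

For \(N=3\), the first row of this matrix reproduces the \(\Gamma _1\) example given above.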

Definition 17

(SSE with Gini–Simpson Index Regularization) The Gini–Simpson index regularization is

$$\begin{aligned} E_{3} = {\mathbf {u}}^{t}\mathbf D{\mathbf {u}}+ \mathbf f^{t}{\mathbf {u}}+ \sum _{j=1}^{m}\alpha _{j}^{2} + \lambda - \lambda \left( {\mathbf {u}}^t {\mathbf {Z}}{\mathbf {u}}\right) , \end{aligned}$$
(25)

where the regularization term can be simply folded into the quadratic term in the SSE yielding

$$\begin{aligned} \min _{{\mathbf {u}}} {\mathbf {u}}^{t}\left( \mathbf D- \lambda {\mathbf {Z}}\right) {\mathbf {u}}+ \mathbf f^{t}{\mathbf {u}}, \end{aligned}$$
(26)

subject to \(\mathbf C{\mathbf {u}}\ge \mathbf 0\) and \((\mathbf {0},1)^t \le {\mathbf {u}}\le \mathbf {1}\).

This is of the form of Tikhonov regularization, where \(-\lambda {\mathbf {Z}}\) is the Tikhonov matrix (Tikhonov 1943). As one can clearly see, the Gini–Simpson index does not result in a linear term and the constant is not part of the QP formulation. All that makes it into the quadratic term is \({\mathbf {u}}^t {\mathbf {Z}}{\mathbf {u}}\), our scaled (squared) \(\ell _2\)-norm.
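Putting Eq. (26) into the earlier QP sketch only changes the quadratic term. The fragment below reuses the hypothetical build_qp, shapley_matrix and solve_fm_qp helpers from the earlier sketches; note that \(\mathbf D- \lambda {\mathbf {Z}}\) can be indefinite, so the SLSQP stand-in only guarantees a local minimum.

```python
import numpy as np

def learn_capacity_gini(H, alpha, n, lam):
    """SSE + Gini-Simpson regularization, Eq. (26); reuses the earlier sketches."""
    D, f, G, idx = build_qp(np.asarray(H, float), np.asarray(alpha, float), n)
    Gamma = shapley_matrix(n, idx)
    Z = Gamma.T @ Gamma                      # Eq. (24): Z = sum_k Gamma_k Gamma_k^t
    return solve_fm_qp(D - lam * Z, f, G, n)
```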

5 \(\ell _0\)-norm based regularization of the Shapley values

The main difficulty behind the \(\ell _0\)-norm of the Shapley values is how to carry out its optimization. Our QP task with an \(\ell _0\)-norm is a non-convex problem, which makes it difficult to understand theoretically and solve computationally (an NP-hard problem). There are numerous articles focused on approximation techniques for the \(\ell _0\)-norm. Herein, we take the approach of enhancing sparsity through reweighted \(\ell _1\) minimization. Candes et al. (2008) proposed a simple and computationally attractive recursively reweighted formulation of \(\ell _1\)-norm minimization designed to more democratically penalize nonzero coefficients. Their approach finds a local minimum of a concave penalty function that approximates the \(\ell _0\)-norm. Specifically, the weighted \(\ell _1\) minimization task can be viewed as a relaxation of a weighted \(\ell _0\) minimization task.

Definition 18

(SSE with Weighted \(\ell _1\) -Norm) The SSE and weighted \(\ell _1\)-norm of the Shapley index based regularization is

$$\begin{aligned} E_{4} = {\mathbf {u}}^{t}\mathbf D{\mathbf {u}}+ \mathbf f^{t}{\mathbf {u}}+ \sum _{j=1}^{m}\alpha _{j}^{2} - \left( \lambda _1{\mathbf {\Gamma }}_1 + \cdots + \lambda _N{\mathbf {\Gamma }}_N \right) ^t {\mathbf {u}}. \end{aligned}$$
(27)

Thus, our goal is

$$\begin{aligned} \min _{{\mathbf {u}}} {\mathbf {u}}^{t}\mathbf D{\mathbf {u}}+ \left( \mathbf f- \left( \lambda _1{\mathbf {\Gamma }}_1 + \cdots + \lambda _N{\mathbf {\Gamma }}_N \right) \right) ^{t}{\mathbf {u}}, \end{aligned}$$
(28)

subject to \(\mathbf C{\mathbf {u}}+ \mathbf b\ge \mathbf 0\) and \((\mathbf {0},1)^t \le {\mathbf {u}}\le \mathbf {1}\), where \(\mathbf {0}\) is a vector of all zeros of length \(2^{N}-2\), and \(\mathbf {1}\) is a vector of all ones of length \(2^{N}-1\).

Algorithm 1 is exactly the method of Candes et al., just with the Shapley values as the parameters. For further mathematical analysis of Candes’s approximation, see Candes et al. (2008). Herein, our goal is not to advance this approximation technique. Instead, we simply apply it for learning the CI. As better approximations become available, the reader can employ those strategies. In Algorithm 1, \(\epsilon > 0\) is used to provide stability and to ensure that a zero-valued component in \(1 - {\mathbf {\Gamma }}_k^t {\mathbf {u}}(t-1)\) does not strictly prohibit a nonzero estimate at the next step (as done in Candes et al. 2008). Intuitively, the update step takes the previous \(\lambda _k(t-1)\) terms and divides them by one minus their respective Shapley values. Thus, the “more important” (the larger) the Shapley value, the smaller the divisor (a number in [0, 1]) and therefore the larger the \(\lambda _k(t)\). Different stopping criteria exist for Algorithm 1. For example, the user can provide a maximum allowable SSE, compare the difference between the weights from iteration to iteration relative to a user-specified threshold, or provide a maximum number of allowable iterations.

Algorithm 1 Iteratively reweighted \(\ell _1\)-norm regularization of the Shapley values for learning the CI
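A sketch of the reweighting loop is given below. Since only the intuition of Algorithm 1 is described above, the weight initialization, the choice of \(\epsilon\), and the stopping rule here are illustrative assumptions (in the spirit of Candes et al. 2008), and the helpers build_qp, shapley_matrix and solve_fm_qp come from the earlier sketches.

```python
import numpy as np

def reweighted_l1_capacity(H, alpha, n, lam0=1.0, eps=1e-3, max_iter=20, tol=1e-6):
    """Iteratively reweighted l1 regularization of the Shapley values (cf. Algorithm 1)."""
    D, f, G, idx = build_qp(np.asarray(H, float), np.asarray(alpha, float), n)
    Gamma = shapley_matrix(n, idx)           # Phi_g(k) = Gamma[k] @ u
    lam = np.full(n, lam0)
    u_prev = None
    for _ in range(max_iter):
        f_reg = f - Gamma.T @ lam            # fold -(sum_k lam_k Gamma_k)^t u into f
        u = solve_fm_qp(D, f_reg, G, n)      # solve Eq. (28) with the current weights
        phi = Gamma @ u                      # Shapley values of the new capacity
        lam = lam / (1.0 - phi + eps)        # larger Shapley value -> larger weight
        if u_prev is not None and np.max(np.abs(u - u_prev)) < tol:
            break
        u_prev = u
    return u
```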

6 Experiments

In this section, we explore our methods with both synthetic and real-world data sets. These experiments demonstrate the capability of the CI with respect to the learned capacity in a supervised learning task, i.e., classification or regression. The goal of the synthetic experiments (Experiments 1 through 3) is to investigate the general behavior of the proposed theories under controlled settings in which we know the “answer” (the generating capacity). The goal of the real-world experiment (Experiment 4) is to investigate classification performance on widely-available benchmark data sets.

In Sect. 2.2 we reviewed and mathematically compared different indices; however, we do not include all indices in these experiments as they do not operate on the same basis. Each index more-or-less interprets complexity differently and, thus, each has its own place (application) and rationale for existence, both in terms of capacity theory and also in terms of how the CI is applied. In the experiments that follow, we restrict analysis to the study of the Gini–Simpson index and the \(\ell _0\)-norm of the Shapley values and we compare them to the most related indices for decision-level fusion—specifically, the \(\ell _1\) and \(\ell _2\)-norm of a lexicographically encoded capacity vector and the Mobius-based index.

In our synthetic experiments (Experiments 1 through 3) we elected not to report a single summarized number or statistic, e.g., classification accuracy. Instead, we show the behavior of our technique across different possible choices of the regularization parameter \(\lambda\). While somewhat overwhelming at first, we believe it is important to give the reader a better (more detailed) feel for how the methods behave in general. However, it is worth briefly noting some \(\lambda\) selection strategies used in practice. For example, we can pick a “winner” by trying a range of values of \(\lambda\) in the context of cross validation (i.e., a grid search). Such an experiment emphasizes learning less complex models with respect to the idea of avoiding overfitting (one use of an information theoretic index). We employ the same strategy in our real-world benchmark data set experiment for kernel classification. If the reader desires, they can refer to one of many works in statistics or machine learning for further assistance in automatically determining or experimentally selecting \(\lambda\) (Candes et al. 2008).

6.1 Experiment 1: important, relevant, and irrelevant inputs

For this experiment, we consider the case of learning an aggregation function for three inputs \((N=3)\)—in essence, a regression task. While this experiment is easily generalized to more than three inputs, the advantage of \(N=3\) is that we can clearly visualize the results, since it becomes difficult to view the results for more inputs as the number of elements in the capacity grows exponentially with respect to N.

To generate the synthetic data, we define the densities of the FM as \(g(x_1)=0.8\), \(g(x_2)=0.2\), and \(g(x_3)=0.01\), such that the inputs can be considered important, relevant, and irrelevant, respectively. The capacity beyond the densities is determined as \(g(A)=\max _{x_{i} \in A}{g(x_{i})}\), for \(A \in 2^X \backslash \{ x_1, x_2, \ldots , x_N, X \}\), making the FM an S-Decomposable FM, specifically a possibility measure. This synthetic capacity was used in conjunction with the Choquet integral to produce training labels from 500 uniform (pseudo)randomly generated N-tuples, each in the range [0, 1].
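Under the assumptions above, the training set can be generated in a few lines. The snippet reuses the choquet_integral sketch from Definition 3 (bitmask-indexed capacity) and an arbitrary random seed, both of which are our own illustrative choices.

```python
import numpy as np

def possibility_measure(densities):
    """Bitmask-indexed S-decomposable (possibility) FM: densities on singletons,
    max of densities on other proper subsets, and g(X) = 1."""
    d = np.asarray(densities, dtype=float)
    n = len(d)
    g = {0: 0.0, 2 ** n - 1: 1.0}
    for mask in range(1, 2 ** n - 1):
        g[mask] = max(d[i] for i in range(n) if mask & (1 << i))
    return g

rng = np.random.default_rng(0)
g_true = possibility_measure([0.8, 0.2, 0.01])
H = rng.uniform(0.0, 1.0, size=(500, 3))                     # 500 random 3-tuples
alpha = np.array([choquet_integral(h, g_true) for h in H])   # CI-generated labels
```

For Experiment 2, zero-mean Gaussian noise with \(\sigma = 0.2\) would simply be added to alpha.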

We expect the FM learned from this synthetic training data to behave as follows: the third input should be ignored, and the second input should be driven to zero worth (in the Shapley sense) before the first input is driven toward zero worth. Figure 2 shows the results of this experiment. Views (a, b) show the FM values learned for values of \(\lambda\) between 0 and 50—the left side of each bar (the black) corresponds to the learned FM values at \(\lambda = 0\) and the right side of each bar (the bright yellow) corresponds to the FM values at \(\lambda = 50\). Views (c, d) show the value of the Gini–Simpson index for the learned FM and the resulting SSE versus each value of \(\lambda\). The scale for the solid blue line—the Gini–Simpson index—is shown on the left of each plot and the scale for the dashed red line—the SSE—is shown on the right of each plot.

Fig. 2 Experiment 1 results. a, b Learned FM values in lexicographical order for \(\lambda = 0\) to 50. Bin 1 is \({\mathbf {u}}(1)=g(x_1)\), bin \(N+1\) is \({\mathbf {u}}(N+1)=g(\{x_1,x_2\})\), etc. Height of bar indicates FM value; color indicates \(\lambda\) value. c, d Plots showing performance of each regularization method in terms of SSE and Gini–Simpson index of the learned FM at each regularization parameter \(\lambda\)

Figure 2 tells the following story. The black color bars in views (a, b) show that both methods recover the desired possibility measure (the one that minimizes just the SSE criterion) when no regularization (\(\lambda =0\)) is used—after all, the methods are equivalent, i.e., no regularization, when \(\lambda =0\). View (a) shows that the Gini–Simpson index regularization pushes the contribution of input 3 to zero very quickly—at \(\lambda \approx 5\)—and the contribution of input 2 is reduced to zero at \(\lambda \approx 35\). The contribution of \({\mathbf {u}}_6 = g(\{x_2,x_3\})\) is also pushed to zero as \(\lambda\) increases. On the contrary, the contribution of input 1 and the FM values in the lattice that include input 1—i.e., \({\mathbf {u}}_4\) and \({\mathbf {u}}_5\)—are gradually increased with increasing \(\lambda\). Figure 2c shows that as \(\lambda\) is increased the Gini–Simpson index decreases, which is echoed in the FM values shown in view (a). As the model becomes simpler with increasing \(\lambda\), the SSE increases (albeit slightly). At \(\lambda \approx 35\), the Gini–Simpson index goes to zero, indicating the model is as simple as it can get. Increasing \(\lambda > 35\) has no effect on the model because it is already as simple as possible, with only one input (#1) being considered in the solution. The SSE of this minimum Gini–Simpson index model is about 4.

Figure 2b, d show visualizations of the same experiment for \(\ell _1\) regularization. As view (b) shows, this regularization starts decreasing all of the FM values as \(\lambda\) is increased. The contribution of input 3, \({\mathbf {u}}_3=g(x_3)\), is quickly pushed to zero, at \(\lambda \approx 2\), while the values \({\mathbf {u}}_2=g(x_2)\) and \({\mathbf {u}}_6=g(\{x_2,x_3\})\) go to zero at \(\lambda \approx 10\). Last, \({\mathbf {u}}_1=g(x_1)\), \({\mathbf {u}}_4=g(\{x_1,x_2\})\), and \({\mathbf {u}}_5=g(\{x_1,x_3\})\) go to zero at \(\lambda \approx 32\). At \(\lambda \gtrsim 32\), the \(\ell _1\) regularization learns, as expected, the FM of ignorance. Figure 2d shows that despite a lower complexity model, in terms of \(\ell _1\)-norm, the Gini–Simpson index increases as \(\lambda\) is increased; SSE also increases, as expected. The FM of ignorance learned at \(\lambda \gtrsim 32\) has an SSE of about 45. Compare this to the SSE of 4 achieved with the lowest complexity model with Gini–Simpson regularization.

In summary, this initial experiment shows that the Gini–Simpson index regularization and \(\ell _1\)-regularization of a lexicographically encoded capacity vector do as advertised.

6.2 Experiment 2: random AWGN noise

In Experiment 2, we use our setup from Experiment 1; however, (pseudo)random additive white Gaussian noise (AWGN, \(\sigma = 0.2\)) is added to the labels. Figure 3 shows the results of Experiment 2. As views (c, d) show, neither procedure perfectly fits the data now due to the noise in the training labels. The Gini–Simpson procedure, shown in views (a, c), can find a solution close to our noise-free goal at small values of \(\lambda\). If regularization is increased, \(\lambda \gtrsim 45\), we eventually identify a single input, which interestingly still fits the data well (only a small increase in SSE). Again, the \(\ell _1\) procedure is only able to achieve low SSE at low \(\lambda\) values. As \(\lambda\) is increased, the SSE is significantly increased (beyond that achieved by the Gini–Simpson).

Fig. 3 Experiment 2 results. a, b Learned FM values in lexicographical order. Bin 1 is \({\mathbf {u}}(1)=g(x_1)\), bin \(N+1\) is \({\mathbf {u}}(N+1)=g(\{x_1,x_2\})\), etc. Height of the bar indicates FM value; color indicates \(\lambda\) value. c, d Plots showing performance of each regularization method in terms of SSE and Gini–Simpson index values of the learned FM at each regularization parameter \(\lambda\)

6.3 Experiment 3: iteratively reweighted \(\ell _1\)-norm

In Experiment 3, we use the setup from Experiment 1 to demonstrate the iteratively reweighted \(\ell _1\) minimization procedure. The result (Fig. 4a, b) for the possibility FM with densities \(g(x_1)=0.8\), \(g(x_2)=0.2\), \(g(x_3)=0.01\) is as expected. After a few iterations we see the Shapley value increasing for input \(x_1\) and decreasing for inputs \(x_2\) and \(x_3\). This is the same trend and final answer that we saw in Experiment 1 with respect to the Gini–Simpson index, and it clearly differs from the final solution of \(\ell _1\) regularization applied to the lexicographically encoded capacity vector (which eventually drives all values to zero, yielding the CI minimum operator).
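For reference, the generic iteratively reweighted \(\ell _1\) scheme (in the spirit of Candès, Wakin, and Boyd; the exact weights used in Algorithm 1 may differ, and the sketch below assumes the weights act on the lexicographically encoded, nonnegative capacity values) alternates between a weighted-\(\ell _1\) QP and a weight update,

$$\begin{aligned} {\mathbf {u}}^{(t+1)}&= \mathop {\mathrm {arg\,min}}\limits _{{\mathbf {u}}}\; E({\mathbf {u}}) + \lambda \sum _{i=1}^{2^N-2} w_i^{(t)}\, u_i, \qquad w_i^{(t+1)} = \frac{1}{u_i^{(t+1)} + \epsilon }, \end{aligned}$$

with the minimization taken subject to the usual boundary and monotonicity constraints, where \(E({\mathbf {u}})\) is the SSE term and \(\epsilon > 0\) is a small constant. Since the encoded values are nonnegative, the weighted \(\ell _1\) penalty reduces to a weighted sum. Each pass penalizes small values increasingly heavily while leaving large values nearly unpenalized, which is how the reweighted scheme approximates the \(\ell _0\) (counting) objective instead of shrinking all values toward zero.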

Fig. 4 Experiment 3 results. Learned FM values in lexicographical order (Experiment 1 setup). Bin 1 is \({\mathbf {u}}(1)=g(x_1)\), bin \(N+1\) is \({\mathbf {u}}(N+1)=g(\{x_1,x_2\})\), etc. Height of the bar indicates FM value; color indicates iteration number. Also shown are the Shapley values of the learned FM at each iteration

6.4 Experiment 4: multiple kernel learning

In this final experiment we consider a problem from pattern recognition. Kernel methods for classification are well studied: data are implicitly mapped from a lower-dimensional space to a higher-dimensional space, the reproducing kernel Hilbert space (RKHS), to improve classification accuracy. The fundamental challenge is that we do not know in advance which transform (kernel) best suits the task at hand; we only have an existence theorem. Multiple kernel learning (MKL) is a way to learn the fusion of multiple known Mercer kernels (the building blocks) to identify a superior kernel. In Pinar et al. (2015, 2016) and Hu et al. (2013, 2014), a genetic algorithm (GA)-based \(\ell _p\)-norm convex combination of kernels for feature-level fusion, called GAMKLp, was proposed. In Pinar et al. (2015), nonlinear fusion of kernels was also explored; specifically, kernel classifiers were fused at the decision level via the fuzzy integral, a procedure called decision-level fuzzy integral MKL (DeFIMKL). In this experiment, we explore the use of QP learning and regularization for CI-based MKL in the context of support vector machine (SVM) classification on well-known community benchmark data sets. In Pinar et al. (2015), the benefits of DeFIMKL and GAMKLp were demonstrated relative to other state-of-the-art MKL algorithms, e.g., MKL group lasso (MKLGL). Herein, the goal is not to reestablish DeFIMKL but to explore the proposed indices and their relative performance. Note that in the other experiments we knew the answer, i.e., the “generating capacity”. In the MKL setting, although SVMs are supervised learners and our data have labels, the true capacity is unknown. Here, as is often the case in machine learning, success is instead measured by the ability to improve classification performance. Fusing classifiers via DeFIMKL yields a classifier, and this experiment demonstrates that regularization helps learn an improved classifier that resists overfitting and generalizes better.

Each learner, i.e., input to fusion, is a kernel classifier trained on a separate kernel. The kth (\(1 \le k \le N\)) SVM classifier decision value is

$$\begin{aligned} \eta _k(\mathbf x) = \sum _{i=1}^D \alpha _{ik}y_i \kappa _k(\mathbf x_i,\mathbf x)-b_k, \end{aligned}$$

which is the distance of \(\mathbf x\) (an object from T) from the hyperplane defined by the data labels, y, the kth kernel, \(\kappa _k(\mathbf x_i,\mathbf x)\), and the learned SVM model parameters, \(\alpha _{ik}\) and \(b_k\). For the two-class SVM decision problem, the class label is typically computed as \(\text {sgn}\{\eta _k(\mathbf x)\}\). One could use \(\text {sgn}\{\eta _k(\mathbf x)\}\) as the training input to the capacity learning; however, this discards information about which kernel produces the largest class separation, essentially the difference between \(\eta _k(\mathbf x)\) for the classes labeled \(y=+1\) and \(y=-1\). In Pinar et al. (2015), \(\eta _k(\mathbf x)\) is remapped onto the interval \([-1,+1]\) by the sigmoid-like function \(\frac{\eta _k(\mathbf x)}{\sqrt{1+\eta _k^2(\mathbf x)}}\), creating the inputs for learning. For training, we use the labeled data to cast learning as an SSE problem, and the CI is learned using QP with regularization (see Pinar et al. 2015 for a full mathematical description).
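The remapping step is simple; the sketch below (our illustration, with made-up decision values) shows the transform applied to each kernel classifier's output before capacity learning.

```python
# Illustrative sketch of the remapping described above: raw SVM decision values
# eta_k(x) are squashed onto [-1, +1] before serving as inputs to CI learning.
import numpy as np

def remap(eta):
    """Map SVM decision values onto [-1, +1] while preserving relative separation."""
    eta = np.asarray(eta, dtype=float)
    return eta / np.sqrt(1.0 + eta ** 2)

# Decision values of N = 5 kernel classifiers for a single training sample (made up).
eta_x = np.array([2.3, -0.4, 7.9, 0.1, -1.6])
f_x = remap(eta_x)   # inputs to the CI for this sample; the training label is y in {-1, +1}
```

The QP then minimizes the SSE between the CI of these remapped values and the \(\pm 1\) training labels, plus the chosen regularizer, subject to the capacity's boundary and monotonicity constraints.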

The well-known LIBSVM library was used to implement the classifiers (Chang and Lin 2011). The UCI machine learning benchmark data sets used are sonar, dermatology, wine, ecoli, and glass. Each experiment consists of 100 randomly sampled trials in which 80% of the data is used for training and the remaining 20% is held out for testing. Each index was applied to the same random samples to guarantee a fair comparison. Note that in some cases multiple classes are merged so that the classification decision is binary. Five radial basis function (RBF) kernels are used in each algorithm, with the respective RBF widths \(\sigma\) linearly spaced on the interval defined in Table 3; the same RBF parameters are used for each algorithm. For the \(\ell _1\), \(\ell _2\), Gini–Simpson, and k-additive indices, a dense grid search over \(\lambda\) was used and the “winner” was picked according to the highest classification accuracy on the test data. For the iteratively reweighted \(\ell _1\) approximation, we used Algorithm 1. Table 4 summarizes the resulting DeFIMKL classifier performance under each form of regularization.
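A rough sketch of one trial of this protocol is shown below. It is our illustration, using scikit-learn's LIBSVM-backed SVC rather than LIBSVM directly; the RBF width interval is data-set dependent per Table 3, so the bounds shown are placeholders.

```python
# Sketch of one trial: 80/20 split, five RBF-kernel SVMs with linearly spaced widths.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def run_trial(X, y, sigma_lo=0.5, sigma_hi=5.0, n_kernels=5, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    eta_tr, eta_te = [], []
    for sigma in np.linspace(sigma_lo, sigma_hi, n_kernels):
        gamma = 1.0 / (2.0 * sigma ** 2)               # RBF width sigma -> sklearn gamma
        K_tr = rbf_kernel(X_tr, X_tr, gamma=gamma)
        K_te = rbf_kernel(X_te, X_tr, gamma=gamma)
        clf = SVC(kernel="precomputed").fit(K_tr, y_tr)
        eta_tr.append(clf.decision_function(K_tr))     # eta_k on the training split
        eta_te.append(clf.decision_function(K_te))     # eta_k on the test split
    # The remapped training decisions and labels feed the regularized QP that learns
    # the capacity; the learned CI then fuses the per-kernel test decisions.
    return np.array(eta_tr), np.array(eta_te), y_tr, y_te
```

Repeating run_trial over 100 random splits and, for each regularizer, over a grid of \(\lambda\) values, then selecting by test accuracy as described above, reproduces the protocol at a high level.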

Table 3 RBF kernel parameter ranges and data set properties
Table 4 Classifier performances—means and standard deviations

Table 4 tells the following story. First, in each instance regularization helps. In many instances, e.g., ecoli, glass, and wine, the regularization results are extremely close. However, in other cases, e.g., sonar and dermatology, the results vary more (in both means and standard deviations). Note that we ran the k-additive index with different levels of forced k-additivity; this was done to explore the impact of assuming and working with subsets of the capacity. Across the k-additive experiments, classification accuracy generally increases as k is increased, though performance remains stable for ecoli and decreases slightly for glass. In our other experiments we were able to analyze specific conditions and properties relating to the fusion process. While this experiment is encouraging, i.e., classification performance improves, we are unfortunately unable to attach a deeper explanation to the results. On these data sets, some form of regularization yields superior performance in all cases relative to the k-additive algorithms. We cannot take the next step and tell the reader why a Gini–Simpson or k-additive index is better suited, given our limited knowledge about the underlying classification task.

7 Conclusion and future work

In this article, we explored a new data-driven way to learn the CI in the context of decision-level fusion by jointly minimizing the SSE criterion and model complexity. We brought together and analyzed a number of existing indices, put forth new indices based on the Shapley values, and explored their role in regularization-based learning of the CI. Our first proposed index promotes sparsity (specifically, fewer non-zero parameters); however, it is difficult to optimize (NP-hard). Our second index trades off modeling accuracy against solution simplicity. The proposed indices and regularization approaches were compared theoretically, and we showed that there is no single “winning index”: the indices strive for different goals and are therefore valid in different contexts. Experiments on synthetic and real-world data sets demonstrate the benefits of the proposed indices and the CI learning technique.

In future work, we will seek more efficient and scalable ways to solve the problem investigated here as the number of inputs (N) grows, since the number of capacity terms, and the associated monotonicity constraints, increases at an exponential rate. We will also explore whether other information-theoretic measures offer additional benefit for learning lower-complexity, useful CIs.