1 Introduction

There is no single, universal, perfect optimization criterion that can be used to train a machine learning model. Even for linear classifiers, one can choose among multiple objective functions, error measures to minimize, and regularization methods to include [15]. Most existing classification methods aim at minimizing the empirical risk (through some simple point-based error measured with a loss function) with added regularization. We propose to approach this problem in a more information-theoretic way by investigating the applicability of entropy measures as classification objective functions. We focus on Renyi's quadratic entropy and the related Cauchy–Schwarz Divergence.

One of the information theoretic concepts which has been found very effective in machine learning is the entropy measure. In particular, the principle of maximum entropy modeling led to the construction of the MaxEnt model and its structural generalization, Conditional Random Fields, which are considered state of the art in many applications. In this paper, we propose to use Renyi's quadratic cross entropy as the measure of divergence between two density estimations in order to find the best linear classifier. This is a conceptually different approach from typical entropy models, as it works in the input space instead of the decision distribution. As a result, we obtain a model closely related to the Fisher Discriminant (or, more generally, Linear Discriminant Analysis), which deepens the understanding of this classical approach. Together with a powerful extreme data transformation, we obtain a robust, nonlinear model competitive with state-of-the-art models not based on information theory, such as Support Vector Machines (SVM [4]), Extreme Learning Machines (ELM [11]) or Least Squares Support Vector Machines (LS-SVM [23]). We also show that under some simplifying assumptions, ELM and LS-SVM can be seen through the perspective of information theory, as their solutions are (up to some constants) identical to the ones obtained by the proposed method.

Fig. 1

Visualization of the whole EEM classification process. From the left: linearly non-separable data in \(\mathcal {X}\); data mapped to the \(\mathcal {H}\) space, where we find covariance estimators; densities of the projected Gaussians, on which the decision is based; the decision boundary in the input space \(\mathcal {X}\)

To sketch the general idea of the proposed method, we briefly summarize it here; Figure 1 is provided for better intuition. We begin with data in the input space \(\mathcal {X}\), transform it randomly into a Hilbert space \(\mathcal {H}\) where we model each class as a Gaussian, then perform optimization leading to a projection onto \(\mathbb {R}\) through \(\varvec{\beta }\), and finally perform density-based classification, leading to a nonlinear decision boundary in \(\mathcal {X}\). It is worth noting that as a result we obtain a model that:

  • has a trivial implementation (under 20 lines of code in Python),

  • learns rapidly,

  • is well suited for unbalanced problems,

  • constructs a nonlinear hypothesis,

  • scales very well (better than SVM, LS-SVM and ELM),

  • has a few hyperparameters which are easy to optimize.

The paper is structured as follows: first, we recall some preliminary information regarding ELMs and Support Vector Machines, including Least Squares Support Vector Machines. Next, we introduce the extreme entropy machine (EEM) together with its kernelized counterpart, the extreme entropy kernel machine (EEKM). We show some connections with existing models and several different perspectives from which to view the proposed model. In particular, we show how the learning capabilities of EEM (and EEKM) resemble those of ELM (and LS-SVM, respectively). During evaluation on over 20 binary datasets (of various sizes and characteristics), we analyze generalization capabilities and learning speed. We show that the proposed method can be a valuable, robust alternative to existing methods. In particular, we show that it achieves stability analogous to that of ELM with respect to the hidden layer size. We conclude with future development plans and open problems.

2 Preliminaries

Let us begin by recalling some basic information regarding extreme learning machines [12] and Support Vector Machines [4], which are later used as competing models for the proposed solution. We focus here on the optimization problems being solved in order to underline some basic differences between these methods and EEMs.

2.1 Extreme learning machines

Extreme Learning Machines are relatively young models, introduced by Huang et al. [11], based on the idea that a single hidden layer feedforward neural network (SLFN) can be trained without an iterative process by performing linear regression on the data mapped through a random, nonlinear projection (random hidden neurons). More precisely, the basic ELM architecture consists of d input neurons, one per input space dimension, fully connected to h hidden neurons by a set of weights \(\mathbf{w} _j\) (selected randomly from some arbitrary distribution) and a set of biases \(b_j\) (also selected randomly). Given some generalized nonlinear activation function \(\mathrm {G}\), one can express the hidden neurons' activation matrix \({\mathbf{H}}\) for the whole training set \(\{(\mathbf{x} _i,t_i)\}_{i=1}^N\), where \(\mathbf{x} _i \in \mathbb {R}^d\) and \(t_i \in \{-1,+1\}\), and formulate the following optimization problem.

2.1.1 Optimization problem: extreme learning machine

$$\begin{aligned} \underset{\varvec{\beta }}{\text {minimize}}&\quad \left\| {\mathbf{H}}\varvec{\beta }- {\mathbf{t}}\right\| ^2 \\ \text {where}&\quad {\mathbf{H}}_{ij} = {\mathrm{G}}({\mathbf{x}} _i,{\mathbf{w}} _j,b_j) ,\quad i = 1, \ldots ,N, \quad j = 1, \ldots , h. \end{aligned}$$

If we denote the weights between the hidden layer and the output neuron by \(\varvec{\beta }\), it is easy to show [12] that putting

$$\begin{aligned} \varvec{\beta }= {\mathbf{H}}^\dagger \mathbf{t}, \end{aligned}$$

gives the best solution in terms of mean squared error of the regression:

$$\begin{aligned} \left\| {\mathbf{H}}\varvec{\beta }- \mathbf{t}\right\| ^2 =\left\| {\mathbf{H}}({\mathbf{H}}^\dagger \mathbf{t}) - \mathbf{t}\right\| ^2 = \min _{\varvec{\alpha } \in \mathbb {R}^{h}} \left\| {\mathbf{H}}\varvec{\alpha } - \mathbf{t}\right\| ^2, \end{aligned}$$

where \({\mathbf{H}}^\dagger\) denotes the Moore–Penrose pseudoinverse of \({\mathbf{H}}\).

Classification of a new point \(\mathbf{x}\) can now be performed analogously, according to

$$\begin{aligned} cl(\mathbf{x} ) = {\mathrm{sign}}\left( \left[ \begin{array}{lll} \mathrm {G}(\mathbf{x} ,{\mathbf{w}} _1,b_1)&\ldots&\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} _h,b_h) \end{array} \right] \varvec{\beta }\right) . \end{aligned}$$

Since ELM is based on ordinary least squares optimization, it can be adapted to unbalanced datasets by performing weighted ordinary least squares. In such a scenario, given a vector \(\mathbf{s}\) such that \(\mathbf{s}_i\) is the square root of the inverse of \({\mathbf{x}} _i\)'s class size, and with \(\mathbf{s} \cdot \mathbf{X}\) denoting element-wise multiplication of \(\mathbf{s}\) and \(\mathbf{X}\), we put

$$\begin{aligned} \varvec{\beta }= (\mathbf{s} \cdot {\mathbf{H}})^\dagger \mathbf{s} \cdot \mathbf{t}. \end{aligned}$$
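For illustration, a minimal numpy sketch of this (weighted) pseudoinverse training; the function names, the sigmoid activation and the uniform random weights are our own choices, not the authors' implementation:

import numpy as np

def elm_train(X, t, h, s=None, seed=0):
    """Minimal (weighted) ELM sketch: random sigmoid hidden layer + pseudoinverse readout."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.uniform(0, 1, size=(d, h))       # random input weights w_j
    b = rng.uniform(0, 1, size=h)            # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden activations G(x_i, w_j, b_j)
    if s is None:                            # s_i = sqrt(1 / class size of x_i) for balancing
        s = np.ones(len(t))
    beta = np.linalg.pinv(H * s[:, None]) @ (s * t)   # beta = (s . H)^dagger (s . t)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.sign(H @ beta)

Passing s = np.ones(len(t)) recovers the unweighted solution from the previous equation.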

2.2 Support vector machines and least squares support vector machines

One of the most well-known classifiers of the last decade is Vapnik’s support vector machine [4], based on the principle of creating a linear classifier that maximizes the separating margin between elements of two classes.

2.2.1 Optimization problem: support vector machine

$$\begin{aligned} \underset{\varvec{\beta },b,\xi }{\text {minimize}}&\quad \frac{1}{2} \left\| \varvec{\beta }\right\| ^2 + C \sum _{i=1}^N \xi _i \\ {\rm subject\, to}&\quad t_i(\langle \varvec{\beta }, {\mathbf{x}} _i\rangle + b) \ge 1 - \xi _i \\&\quad \xi _i \ge 0 ,\quad i = 1, \ldots ,N. \end{aligned}$$

Here, C denotes the tradeoff between fitting the data (minimization of the empirical risk) and the size of the separating margin (minimization of model complexity). One needs to tune this hyperparameter to get the best generalization results. SVM can be further kernelized (delinearized) using any kernel \(\mathrm {K}\) (valid in the Mercer sense), which leads to the following optimization problem.

2.2.2 Optimization problem: kernel support vector machine

$$\begin{aligned} \underset{\varvec{\beta }}{\text {maximize}}&\quad \sum _{i=1}^N\varvec{\beta }_i - \frac{1}{2} \sum _{i,j=1}^N \varvec{\beta }_i \varvec{\beta }_j t_i t_j \mathrm {K}({\mathbf{x}} _i,{\mathbf{x}} _j) \\ {\rm subject\, to}&\quad \sum _{i=1}^N \varvec{\beta }_i t_i = 0 \\&\quad 0 \le \varvec{\beta }_i \le C ,\quad i = 1, \ldots ,N. \end{aligned}$$

This is a quadratic optimization problem with linear constraints, which can be efficiently solved using quadratic programming techniques. Due to the use of the hinge loss on \(\xi _i\), SVM attains very sparse solutions in terms of nonzero \(\varvec{\beta }_i\). As a result, the classifier does not have to remember the whole training set, but only the set of so-called support vectors (\(\mathrm{SV} = \{ {\mathbf{x}} _i : \varvec{\beta }_i > 0 \}\)). A point \({\mathbf{x}}\) is classified according to

$$\begin{aligned} cl({\mathbf{x}} ) = {\mathrm{sign}} \left( \sum _{{\mathbf{x}} _i \in \mathrm{SV}} t_i \varvec{\beta }_i \mathrm {K}({\mathbf{x}} _i, {\mathbf{x}} ) + b \right) . \end{aligned}$$

It appears that if we change the loss function to the quadratic one, we can greatly reduce the complexity of the resulting optimization problem, leading to the so-called Least Squares Support Vector Machines (LS-SVM).

2.2.3 Optimization problem: least squares support vector machine

$$\begin{aligned} \underset{\varvec{\beta },b,\xi }{\text {minimize}}&\quad \frac{1}{2} \left\| \varvec{\beta }\right\| ^2 + C \sum _{i=1}^N \xi ^2_i \\ {\rm subject\, to}&\quad t_i(\langle \varvec{\beta }, {\mathbf{x}} _i\rangle + b) = 1 - \xi _i ,\quad i = 1, \ldots ,N, \end{aligned}$$

and the decision is made according to

$$\begin{aligned} cl({\mathbf{x}} ) = {\mathrm{sign}}( \langle \varvec{\beta }, {\mathbf{x}} \rangle + b ). \end{aligned}$$

As Suykens et al. showed [23], this can be further generalized to arbitrary kernel-induced spaces, where we classify according to:

$$\begin{aligned} cl({\mathbf{x}} ) = {\mathrm{sign}} \left( \sum _{i=1}^N t_i \varvec{\beta }_i \mathrm {K}({\mathbf{x}} _i, {\mathbf{x}} ) + b \right) , \end{aligned}$$

where \(\varvec{\beta }_i\) are Lagrange multipliers associated with particular training examples \({\mathbf{x}} _i\) and b is a threshold, found by solving the linear system

$$\begin{aligned} \left[ \begin{array}{cc} 0 & {1}\!\mathrm{l}^\mathrm{T} \\ {1}\!\mathrm{l}& \mathrm {K}(\mathbf{X},\mathbf{X}) + \mathbf{I}/C \end{array} \right] \left[ \begin{array}{c} b\\ \varvec{\beta }\end{array} \right] = \left[ \begin{array}{c} 0 \\ \mathbf{t}\end{array} \right] \end{aligned}$$

where \({1}\!\mathrm{l}\) is a vector of ones and \(\mathbf{I}\) is an identity matrix of appropriate dimensions. Thus, a training procedure becomes

$$\begin{aligned} \left[ \begin{array}{c} b\\ \varvec{\beta }\end{array} \right] = \left[ \begin{array}{cc} 0 & {1}\!\mathrm{l}^\mathrm{T} \\ {1}\!\mathrm{l}& \mathrm {K}(\mathbf{X},\mathbf{X}) + \mathbf{I}/C \end{array} \right] ^{-1} \left[ \begin{array}{c} 0 \\ \mathbf{t}\end{array} \right] . \end{aligned}$$
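A small numpy sketch of this training step; solving the linear system directly is numerically preferable to forming the explicit inverse shown above (function and variable names are ours):

import numpy as np

def lssvm_train(K, t, C):
    """LS-SVM training sketch: solve the (N+1)x(N+1) linear system for (b, beta)."""
    N = K.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                   # top row:     [0, 1^T]
    A[1:, 0] = 1.0                   # left column: [1, K + I/C]
    A[1:, 1:] = K + np.eye(N) / C
    rhs = np.concatenate(([0.0], t.astype(float)))
    sol = np.linalg.solve(A, rhs)    # one dense solve, O(N^3)
    return sol[0], sol[1:]           # threshold b and the multipliers beta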

Similarly to the classical SVM, this formulation is highly unbalanced (its results are skewed towards the bigger class). To overcome this issue, one can introduce a weighted version [24], using a diagonal matrix of weights \(\mathbf{Q}\), such that \(\mathrm {diag}(\mathbf{Q})_{i}\) is inversely proportional to the size of \({\mathbf{x}} _i\)'s class, and

$$\begin{aligned} \left[ \begin{array}{c} b\\ \varvec{\beta }\end{array} \right] = \left[ \begin{array}{cc} 0 & {1}\!\mathrm{l}^\mathrm{T} \\ {1}\!\mathrm{l}& \mathrm {K}(\mathbf{X},\mathbf{X}) + \mathbf{Q} /C \end{array} \right] ^{-1} \left[ \begin{array}{c} 0 \\ \mathbf{t}\end{array} \right] . \end{aligned}$$

Unfortunately, due to the introduction of the square loss, the sparseness of the Support Vector Machine solution is completely lost. The resulting training has a closed-form solution, but requires the computation of the whole Gram matrix, and the resulting machine has to remember (Footnote 1) the whole training set to classify a new point.

3 Extreme entropy machines

Let us first recall the formulation of the linear classification problem in high-dimensional feature spaces, i.e., when the number of samples N is equal to (or less than) the dimension of the data space, \(\mathrm {dim}(\mathcal {H})\). In particular, we formulate the problem in the limiting case (Footnote 2) when \(\mathrm {dim}(\mathcal {H})=\infty\):

Problem 1

We are given a finite number of (often linearly independent) points \({\mathbf{h}}^\pm\,\) in an infinite-dimensional Hilbert space \(\mathcal {H}\). Points \({\mathbf{h}}^+ \in {\mathbf{H}}^+\) constitute the positive class, while points \({\mathbf{h}}^- \in {\mathbf{H}}^-\) constitute the negative class.

We search for \(\varvec{\beta }\in \mathcal {H}\) such that the sets \(\varvec{\beta }^\mathrm{T}{\mathbf{H}}^+\) and \(\varvec{\beta }^\mathrm{T}{\mathbf{H}}^-\) are optimally separated.

Observe that in itself (without additional regularization) the problem is not well posed: owing to the linear independence of the data, for arbitrary \(m_+ \ne m_-\) in \(\mathbb {R}\) we can easily construct \(\varvec{\beta }\in \mathcal {H}\) such that

$$\begin{aligned} \varvec{\beta }^\mathrm{T}{\mathbf{H}}^+=\{m_+\} \quad \text { and }\quad \varvec{\beta }^\mathrm{T}{\mathbf{H}}^-=\{m_-\}. \end{aligned}$$
(1)

However, this leads to a model case of overfitting, which typically yields suboptimal results on the testing set (different from the original training samples).

To make the problem well posed, we typically need to:

  1. add/allow some error in the data,

  2. specify some objective function including a term penalizing the model's complexity.

Popular choices of the objective function include a per-point classification loss (like the square loss in neural networks or the hinge loss in SVM) with an added regularization term, often expressed as the square of the norm of our operator \(\varvec{\beta }\) (as in SVM or in the weight decay regularization of neural networks). In general, one can divide objective function derivations into the following categories:

  • regression based (like in neural networks or ELM),

  • probabilistic (like in the case of Naive Bayes),

  • geometric (like in SVM),

  • information theoretic (entropy models).

We focus on the last group of approaches, and investigate the applicability of the Cauchy–Schwarz Divergence [13], which for two densities f and g is given by

$$\begin{aligned} {D}_\mathrm{CS}(f,g)&=\ln \left( \int f^2 \right) +\ln \left( \int g^2 \right) -2\ln \left( \int fg \right) \\&=-2\ln \left( \int \tfrac{f}{\Vert f\Vert _2} \tfrac{g}{\Vert g\Vert _2} \right) . \end{aligned}$$

The Cauchy–Schwarz Divergence is connected to Renyi's quadratic cross entropy (\(H_2^\times\) [21]) and Renyi's quadratic entropy (\(H_2\)), defined for densities f, g as

$$\begin{aligned} H_2^\times (f,g)&=-\ln \left( \int fg \right) \\ H_2(f)&=H_2^\times (f,f)=-\ln \left( \int f^2 \right) , \end{aligned}$$

consequently

$$\begin{aligned} {D}_\mathrm{CS}(f,g)=2H_2^\times (f,g) -H_2(f)-H_2(g) \end{aligned}$$

and, as shown in [7], it is well suited as a discrimination measure which allows the construction of multi-threshold linear classifiers. In general, an increase in the value of the Cauchy–Schwarz Divergence results in better discrimination of the sets (densities). Unfortunately, there are a few problems with such an approach:

  • real datasets are discrete, so we do not know the actual densities f and g,

  • statistical density estimators require rather large sample sizes and are very computationally expensive.

There are basically two approaches which help us recover the underlying densities from the samples. The first is to perform some kind of density estimation, like the well-known Kernel Density Estimation (KDE) technique, which is based on the observation that any continuous distribution can be approximated arbitrarily closely by a convex combination of Gaussians [26]. The other approach is to assume some density model (a family of distributions) and fit its parameters to maximize the probability of generating the data. In statistics, this is known as the maximum likelihood estimation (MLE) approach. MLE has the advantage that in general it produces much simpler density descriptions than KDE, whose description size scales linearly with the sample size.

A common choice of density model is the Gaussian distribution, due to its nice theoretical and practical (computational) properties. As mentioned earlier, a convex combination of Gaussians can approximate a given continuous distribution f with arbitrary precision. To fit a Gaussian mixture model (GMM) to a given dataset, one needs an algorithm such as Expectation Maximization [8] or the conceptually similar Cross-entropy clustering [25]. However, for simplicity and strong regularization, we propose to model f as one big Gaussian \(\mathcal {N}({{\mathbf{m}}},{\mathbf{\Sigma }})\). One of the biggest advantages of such an approach is closed-form MLE parameter estimation: we simply put \({\mathbf{m}}\) equal to the empirical mean of the data and \({\mathbf{\Sigma }}\) equal to some covariance estimator. Second, in this way we introduce an error into the data which plays an important regularizing role and leads to a better posed optimization problem.

Let us recall that the Shannon’s differential entropy (expressed in nits) of the continuous distribution f is

$$\begin{aligned} H(f) = - \int f \ln f. \end{aligned}$$

We will show that the choice of normal distributions is not arbitrary but is supported by entropy maximization. The following result is known [5], but we include the whole reasoning for completeness.

Lemma 1

The normal distribution \(\mathcal {N}({\mathbf{m}},{\mathbf{\Sigma }})\) has the maximum Shannon differential entropy among all real-valued distributions with mean \({{\mathbf{m}}}\in \mathbb {R}^h\) and covariance \({{\mathbf{\Sigma }}}\in \mathbb {R}^{h \times h}\).

Proof

Let f and g be arbitrary distributions with covariance \({\mathbf{\Sigma }}\) and mean \({\mathbf{m}}\). For simplicity, we assume that \({\mathbf{m}}=0\); an analogous proof holds for an arbitrary mean. Then

$$\begin{aligned} \int f {\mathbf{h}}_i {\mathbf{h}}_j d {\mathbf{h}}_i d {\mathbf{h}}_j = \int g {\mathbf{h}}_i {\mathbf{h}}_j d {\mathbf{h}}_i d {\mathbf{h}}_j = {\mathbf{\Sigma }}_{ij}, \end{aligned}$$

so for any quadratic form \(\mathbf{A}\)

$$\begin{aligned} \int \mathbf{A}f = \int \mathbf{A}g. \end{aligned}$$

Notice that

$$\begin{aligned} \ln \mathcal {N}(0,{\mathbf{\Sigma }})[{\mathbf{h}}]&= \ln \left( \frac{1}{\sqrt{(2\pi )^h \det ({\mathbf{\Sigma }})}} \exp ( -\tfrac{1}{2}{\mathbf{h}}^\mathrm{T} {\mathbf{\Sigma }}^{-1} {\mathbf{h}}) \right) \\&= -\frac{1}{2} \ln ( (2\pi )^h \det ({\mathbf{\Sigma }}) ) -\tfrac{1}{2}{\mathbf{h}}^\mathrm{T} {\mathbf{\Sigma }}^{-1} {\mathbf{h}}\end{aligned}$$

is a quadratic form plus constant. Thus,

$$\begin{aligned} \int f \ln \mathcal {N}(0,{\mathbf{\Sigma }}) = \int \mathcal {N}(0,{\mathbf{\Sigma }}) \ln \mathcal {N}(0,{\mathbf{\Sigma }}), \end{aligned}$$

which together with non-negativity of Kullback–Leibler Divergence gives

$$\begin{aligned} 0&\le {D}_\mathrm{KL}(f\;||\;\mathcal {N}(0,{\mathbf{\Sigma }})) \\&= \int f \ln ( \tfrac{f}{\mathcal {N}(0,{\mathbf{\Sigma }})} ) \\&= \int f \ln f - \int f \ln \mathcal {N}(0,{\mathbf{\Sigma }}) \\&= -{H}(f) - \int f \ln \mathcal {N}(0,{\mathbf{\Sigma }}) \\&= -{H}(f) - \int \mathcal {N}(0,{\mathbf{\Sigma }}) \ln \mathcal {N}(0,{\mathbf{\Sigma }}) \\&= -{H}(f) + {H}( \mathcal {N}(0,{\mathbf{\Sigma }}) ), \end{aligned}$$

Consequently, \({H}( \mathcal {N}(0,{\mathbf{\Sigma }}) ) \ge {H}(f).\) \(\square\)

A nontrivial question appears of how to find/estimate the desired Gaussian, as the covariance can be singular. In this case, to regularize the covariance, we apply the well-known Ledoit–Wolf approach [16]

$$\begin{aligned} {\mathbf{\Sigma }}^{\pm\,}=\mathrm {cov_{\dagger }}( {\mathbf{H}}^{\pm\,} ) = (1-\varepsilon ^\pm\,)\mathrm {cov}({\mathbf{H}}^\pm\,) + \varepsilon ^\pm\,\mathrm {tr}(\mathrm {cov}({\mathbf{H}}^\pm\,))h^{-1} \mathbf{I}, \end{aligned}$$

where \(\mathrm {cov}(\cdot )\) is the empirical covariance and \(\varepsilon ^\pm\,\) is a shrinkage coefficient given in closed form by Ledoit and Wolf [16]. We would like to underline that this estimator is parameterless and optimal in the sense that it minimizes the expected quadratic loss between the true covariance matrix and the estimator under general asymptotics (meaning that both the dataset size and its dimension grow to infinity, while their ratio converges to a constant).
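In practice, this estimator is readily available; a small sketch assuming the scikit-learn implementation (sklearn.covariance.ledoit_wolf), which returns both the shrunk covariance and the coefficient \(\varepsilon\):

import numpy as np
from sklearn.covariance import ledoit_wolf

def cov_dagger(H):
    """Ledoit-Wolf regularized covariance cov_dagger(H); rows of H are samples."""
    sigma, epsilon = ledoit_wolf(H)   # shrunk covariance and the closed-form epsilon
    return sigma

# example: H = np.random.randn(30, 100); cov_dagger(H) stays well conditioned even when N < h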

Thus, our optimization problem can be stated as follows:

Problem 2

(Optimization problem) Suppose that we are given two datasets \({\mathbf{H}}^\pm\,\) in a Hilbert space \(\mathcal {H}\) which come from the Gaussian distributions \(\mathcal {N}({\mathbf{m}}^\pm\,,{\mathbf{\Sigma }}^\pm\,)\). Find \(\varvec{\beta }\in \mathcal {H}\) such that the datasets

$$\begin{aligned} \varvec{\beta }^\mathrm{T} {\mathbf{H}}^+\quad \text{ and } \quad \varvec{\beta }^\mathrm{T} {\mathbf{H}}^- \end{aligned}$$

are optimally discriminated in terms of Cauchy–Schwarz Divergence.

Because \({\mathbf{H}}^\pm\,\) has density \(\mathcal {N}({\mathbf{m}}^\pm\,,{\mathbf{\Sigma }}^\pm\,)\), we know that \(\varvec{\beta }^\mathrm{T}{\mathbf{H}}^\pm\,\) has the density \(\mathcal {N}(\varvec{\beta }^\mathrm{T}{\mathbf{m}}^\pm\,,\varvec{\beta }^\mathrm{T}{\mathbf{\Sigma }}^\pm\,\varvec{\beta }).\) We put

$$\begin{aligned} m_\pm\,=\varvec{\beta }^\mathrm{T}{\mathbf{m}}^\pm, \, S_\pm\,=\varvec{\beta }^\mathrm{T}{\mathbf{\Sigma }}^\pm\,\varvec{\beta }. \end{aligned}$$
(2)

Since, as one can easily compute [6],

$$\begin{aligned}&\int \frac{{\mathcal {N}}(m_+,S_+)}{\Vert {\mathcal {N}}(m_+,S_+)\Vert _2} \cdot \frac{{\mathcal {N}}(m_-,S_-)}{\Vert {\mathcal {N}}(m_-,S_-)\Vert _2}\\&\quad =\frac{{\mathcal {N}}(m_+-m_-,S_++S_-)[0]}{({\mathcal {N}}(0,2S_+)[0]\,{\mathcal {N}}(0,2S_-)[0])^{1/2}}\\&\quad =\frac{(4 S_+ S_-)^{1/4}}{(S_++S_-)^{1/2}}\exp \left( -\frac{(m_+-m_-)^2}{2(S_++S_-)} \right) , \end{aligned}$$

we obtain that

$$\begin{aligned}&{D}_\mathrm{CS}({\mathcal {N}}(m_+,S_+),{\mathcal {N}}(m_-,S_-)) \nonumber \\&\quad =-2\ln \left( \int \frac{{\mathcal {N}}(m_+,S_+)}{\Vert {\mathcal {N}}(m_+,S_+)\Vert _2} \cdot \frac{{\mathcal {N}}(m_-,S_-)}{\Vert {\mathcal {N}}(m_-,S_-)\Vert _2} \right) \nonumber \\&\quad =\ln \frac{\tfrac{1}{2}(S_++S_-)}{\sqrt{S_+S_-}}+\frac{(m_+-m_-)^2}{S_++S_-}. \end{aligned}$$
(3)

Observe that in the above equation the first term is the logarithm of the quotient of the arithmetic and geometric means of \(S_+\) and \(S_-\), and therefore in typical cases it is bounded and close to zero. However, in some singular cases, when \(S_+\) and \(S_-\) are of different orders of magnitude, this term can blow up, and thus, more precisely, we are maximizing a lower bound of \({D}_\mathrm{CS}\). As we will see in further sections, this simplification leads to a closed-form solution, which in particular means that the learning process is extremely simple. In general, one could maximize the whole function using gradient methods, but this would require introducing optimization procedure hyperparameters (such as learning rate, momentum, stopping condition, etc.), which we are trying to avoid.
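As a small sanity check of Eq. (3) and of the lower-bound argument, one can compare the closed form against a direct quadrature of the definition of \({D}_\mathrm{CS}\); a scipy sketch with function names of our own choosing:

import numpy as np
from scipy.integrate import quad

def gauss(x, m, s):
    """Density of N(m, s) at x, with s denoting the variance."""
    return np.exp(-(x - m) ** 2 / (2 * s)) / np.sqrt(2 * np.pi * s)

def dcs_closed_form(m_p, s_p, m_n, s_n):
    """Closed-form D_CS of two projected Gaussians, Eq. (3)."""
    return (np.log(0.5 * (s_p + s_n) / np.sqrt(s_p * s_n))
            + (m_p - m_n) ** 2 / (s_p + s_n))

def dcs_from_definition(m_p, s_p, m_n, s_n):
    """D_CS = ln int f^2 + ln int g^2 - 2 ln int fg, via numerical quadrature."""
    ff = quad(lambda x: gauss(x, m_p, s_p) ** 2, -np.inf, np.inf)[0]
    gg = quad(lambda x: gauss(x, m_n, s_n) ** 2, -np.inf, np.inf)[0]
    fg = quad(lambda x: gauss(x, m_p, s_p) * gauss(x, m_n, s_n), -np.inf, np.inf)[0]
    return np.log(ff) + np.log(gg) - 2 * np.log(fg)

# dcs_closed_form(1.0, 0.5, -1.0, 2.0) and dcs_from_definition(1.0, 0.5, -1.0, 2.0) agree,
# while (m_p - m_n)**2 / (s_p + s_n) is the lower-bound term that EEM maximizes.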

Consequently, crucial information is given by the last term. To confirm this claim, we perform experiments on the over 20 datasets used in the further evaluation (more details are located in the Evaluation section). We compute the Spearman's rank correlation coefficient between \({D}_\mathrm{CS}(\mathcal {N}(m_+,S_+),\mathcal {N}(m_-,S_-))\) and \(\frac{(m_+-m_-)^2}{S_++S_-}\) for a hundred random projections into \(\mathcal {H}\) and a hundred random linear operators \(\varvec{\beta }\). As one can see in Table 1, on small datasets (first part of the table) the correlation is generally high, with some exceptions (like sonar, splice, liver disorders and diabetes). However, for bigger datasets (consisting of thousands of examples), this correlation is nearly perfect (up to the randomization process it is nearly 1.0 for all cases), which is a very strong empirical confirmation of our claim that maximization of \(\frac{(m_+-m_-)^2}{S_++S_-}\) is generally equivalent to maximization of \({D}_\mathrm{CS}(\mathcal {N}(m_+,S_+),\mathcal {N}(m_-,S_-))\), since the correlation coefficient captures whether \({D}_\mathrm{CS}\) and its lower bound have similar monotonicity. The number of dimensions of the Hilbert space does not affect the result in a significant manner for large datasets, while for small ones the results vary significantly when it is changed. This might be a consequence of less reliable covariance estimation for small datasets, especially with a higher number of dimensions.

Table 1 Spearman’s rank correlation coefficient between optimized term and whole \({D}_\mathrm{CS}\) for all datasets used in evaluation

This means that, after the above reductions and the application of (2), our final problem can be stated as follows:

3.1 Optimization problem: extreme entropy machine

$$\begin{aligned} \underset{\varvec{\beta }}{\text {minimize}}&\quad \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^+ \varvec{\beta }+ \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^- \varvec{\beta }&\\ {\rm subject\, to}&\quad \varvec{\beta }^\mathrm{T}( {\mathbf{m}}^+ - {\mathbf{m}}^- ) = 2\\ \text {where}&\quad {\mathbf{\Sigma }}^\pm\,= \mathrm {cov_{\dagger }}({\mathbf{H}}^\pm\,) \\&\quad {\mathbf{m}}^\pm\,= \frac{1}{|{\mathbf{H}}^\pm\,|} \sum _{{\mathbf{h}}^\pm\,\in {\mathbf{H}}^\pm\,} {\mathbf{h}}^\pm\,\\&\quad {\mathbf{H}}^\pm\,= \varphi (\mathbf{X}^\pm\,) \end{aligned}$$

Before we continue to the closed-form solution, we outline two methods of actually transforming our data \(\mathbf{X}^\pm\,\subset \mathcal {X}\) to the high-dimensional \({\mathbf{H}}^\pm\,\subset \mathcal {H}\), given by \(\varphi : \mathcal {X}\rightarrow \mathcal {H}\).

We investigate two approaches which lead to the Extreme Entropy Machine and Extreme Entropy Kernel Machine, respectively.

  • For Extreme Entropy Machine (EEM), we use the random projection technique, exactly the same as the one used in the ELM. In other words, given some generalized activation function \(\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} ,b) : \mathcal {X}\times \mathcal {X}\times \mathbb {R} \rightarrow \mathbb {R}\) and a constant h denoting number of hidden neurons:

    $$\begin{aligned} \varphi : \mathcal {X}\ni {\mathbf{x}} \rightarrow [\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} _1,b_1),\dots ,\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} _h,b_h)]^\mathrm{T} \in \mathbb {R}^h \end{aligned}$$

    where \(w_i\) are random vectors and \(b_i\) are random biases.

  • For the Extreme Entropy Kernel Machine (EEKM), we use the randomized kernel approximation technique [9], which spans our Hilbert space on a randomly selected subset of the training vectors. In other words, given a valid kernel \(\mathrm {K}(\cdot ,\cdot ) : \mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}_+\) and the size h of the kernel space basis:

    $$\begin{aligned} \varphi _\mathrm {K}: \mathcal {X}\ni {\mathbf{x}} \rightarrow (\mathrm {K}({\mathbf{x}} ,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2})^\mathrm{T} \in \mathbb {R}^h \end{aligned}$$

    where \(\mathbf{X}^{[h]}\) is an h-element random subset of \(\mathbf{X}\). It is easy to verify that such a low-rank approximation truly behaves like a kernel, in the sense that for \(\varphi _\mathrm {K}({\mathbf{x}} _i), \varphi _\mathrm {K}({\mathbf{x}} _j) \in \mathbb {R}^{h}\)

    $$\begin{aligned}&\varphi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\varphi _\mathrm {K}({\mathbf{x}} _j) \\&\quad = ((\mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2})^\mathrm{T})^\mathrm{T} \\&\qquad \times ( \mathrm {K}({\mathbf{x}} _j,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} )^\mathrm{T} \\&\quad = \mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} \\&\qquad \times ( \mathrm {K}({\mathbf{x}} _j,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} )^\mathrm{T} \\&\quad = \mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2}\\&\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} \mathrm {K}^\mathrm{T}({\mathbf{x}} _j,\mathbf{X}^{[h]}) \\&\quad = \mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1} \mathrm {K}(\mathbf{X}^{[h]},{\mathbf{x}} _j). \end{aligned}$$

    Given true kernel projection \(\phi _\mathrm {K}\) such that \(\mathrm {K}({\mathbf{x}} _i,{\mathbf{x}} _j)=\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j)\), we have

    $$\begin{aligned}&\mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1} \mathrm {K}(\mathbf{X}^{[h]},{\mathbf{x}} _j) \\&\quad =\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}) \\&\qquad \times (\phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}))^{-1}\\&\qquad \times \phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j) \\&\quad =\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}) \phi _\mathrm {K}(\mathbf{X}^{[h]})^{-1}\\&\qquad \times (\phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T})^{-1} \phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j) \\&\quad = \phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j)\\&\quad = \mathrm {K}({\mathbf{x}} _i,{\mathbf{x}} _j). \end{aligned}$$

    Thus, for the whole samples’ set, we have

    $$\begin{aligned} \varphi _\mathrm {K}(\mathbf{X})^\mathrm{T} \varphi _\mathrm {K}(\mathbf{X}) = \mathrm {K}(\mathbf{X},\mathbf{X}), \end{aligned}$$

    which is a complete Gram matrix.

So the only difference between the Extreme Entropy Machine and the Extreme Entropy Kernel Machine is that in the latter we use \({\mathbf{H}}^\pm\,=\varphi _\mathrm {K}(\mathbf{X}^\pm\,)\), where \(\mathrm {K}\) is a selected kernel, instead of \({\mathbf{H}}^\pm\,=\varphi (\mathbf{X}^\pm\,)\). Figure 2 visualizes these two approaches as neural networks; in particular, EEM is a simple SLFN, while EEKM leads to a network with two hidden layers.
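For concreteness, a small sketch of the \(\varphi _\mathrm {K}\) map, assuming a Gaussian kernel and scipy's matrix square root (the helper names are ours):

import numpy as np
from scipy.linalg import sqrtm
from sklearn.metrics.pairwise import rbf_kernel

def eekm_feature_map(X, X_h, gamma):
    """phi_K(x) = (K(x, X^[h]) K(X^[h], X^[h])^{-1/2})^T, computed for all rows of X."""
    K_hh = rbf_kernel(X_h, X_h, gamma=gamma)         # h x h sampled kernel matrix
    K_hh_inv_sqrt = np.linalg.inv(sqrtm(K_hh)).real  # possible for a (strictly) positive definite kernel matrix
    return rbf_kernel(X, X_h, gamma=gamma) @ K_hh_inv_sqrt   # N x h image in H

# usage sketch: rng = np.random.default_rng(0); X_h = X[rng.choice(len(X), h, replace=False)]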

Fig. 2

Extreme entropy machine (top) and extreme entropy kernel machine (bottom) as neural networks. In both cases, all weights are either randomly selected (dashed) or the result of closed-form optimization (solid)

Remark 1

The Extreme Entropy Machine optimization problem is closely related to the SVM optimization, but instead of maximizing the margin between the closest points, we maximize the mean margin.

Proof

Let us recall that in SVM we try to maximize the margin \(\frac{2}{\Vert \varvec{\beta }\Vert }\) under the constraints that negative samples are projected onto values at most \(-1\) (\(\varvec{\beta }^\mathrm{T} {\mathbf{h}}^- + b\le -1\)) and positive samples onto values at least 1 (\(\varvec{\beta }^\mathrm{T} {\mathbf{h}}^+ + b\ge 1\)). In other words, we are minimizing the operator norm \(\Vert \varvec{\beta }\Vert\), which is equivalent to minimizing the square of this norm \(\Vert \varvec{\beta }\Vert ^2\), under the constraint

$$\begin{aligned} \min _{{\mathbf{h}}^+ \in {\mathbf{H}}^+} \{ \varvec{\beta }^\mathrm{T} {\mathbf{h}}^+ \} - \max _{{\mathbf{h}}^- \in {\mathbf{H}}^-} \{ \varvec{\beta }^\mathrm{T} {\mathbf{h}}^- \} = 1 - (-1) = 2. \end{aligned}$$

On the other hand, EEM tries to minimize

$$\begin{aligned} \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^+ \varvec{\beta }+ \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^- \varvec{\beta }&= \varvec{\beta }^\mathrm{T} ( {\mathbf{\Sigma }}^+ + {\mathbf{\Sigma }}^- ) \varvec{\beta }\\&= \Vert \varvec{\beta }\Vert _{({\mathbf{\Sigma }}^+ + {\mathbf{\Sigma }}^-)^{-1}}^2 \end{aligned}$$

under the constraint

$$\begin{aligned} \tfrac{1}{|{\mathbf{H}}^+|}\sum _{{\mathbf{h}}^+ \in {\mathbf{H}}^+} \varvec{\beta }^\mathrm{T} {\mathbf{h}}^+ - \tfrac{1}{|{\mathbf{H}}^-|} \sum _{{\mathbf{h}}^- \in {\mathbf{H}}^-} \varvec{\beta }^\mathrm{T} {\mathbf{h}}^- = 2. \end{aligned}$$

So what is happening here is that we are maximizing the mean margin between the classes, measured in the Mahalanobis norm [17] generated by the inverse of the sum of the classes' covariances. \(\square\)

A similar observation regarding the connection between large margin classification and entropy optimization has been made in the case of the Multithreshold Linear Entropy Classifier [7]. One should also notice the important relations to other methods studying so-called margin distributions [10], such as the Large margin Distribution Machine [29]. Contrary to the approach of Zhang et al., we minimize the sum of the variances instead of the difference between the data variance and the cross-class variance. As a result, the proposed model is much easier to optimize (as shown below).

We are going to show by applying the standard method of Lagrange multipliers that the above problem has a closed-form solution (similar to the Fisher Discriminant). We put

$$\begin{aligned} L(\varvec{\beta },\lambda )=\varvec{\beta }^\mathrm{T}({\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-) \varvec{\beta }-\lambda (\varvec{\beta }^\mathrm{T}({\mathbf{m}}^+-{\mathbf{m}}^-)-2). \end{aligned}$$

Then,

$$\begin{aligned} \nabla _{\varvec{\beta }} L=2({\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-) \varvec{\beta }-\lambda ({\mathbf{m}}^+-{\mathbf{m}}^-)\quad \text { and }\quad \frac{\partial }{\partial \lambda }L=\varvec{\beta }^\mathrm{T}({\mathbf{m}}^+-{\mathbf{m}}^-)-2, \end{aligned}$$

which means that we need to solve, with respect to \(\varvec{\beta }\), the system

$$\begin{aligned} {\left\{ \begin{array}{l} 2({\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-) \varvec{\beta }-\lambda ({\mathbf{m}}^+-{\mathbf{m}}^-)=0, \\ \varvec{\beta }^\mathrm{T}({\mathbf{m}}^+-{\mathbf{m}}^-)=2. \end{array}\right. } \end{aligned}$$

Therefore, \(\varvec{\beta }=\frac{\lambda }{2}({\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-)^{-1}({\mathbf{m}}^+-{\mathbf{m}}^-)\), which yields

$$\begin{aligned} \tfrac{\lambda }{2}({\mathbf{m}}^+-{\mathbf{m}}^-)^\mathrm{T}({\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-)^{-1}({\mathbf{m}}^+-{\mathbf{m}}^-)=2, \end{aligned}$$

and consequently (Footnote 3), if \({\mathbf{m}}^+-{\mathbf{m}}^-\ne 0\), then \(\lambda =4/\Vert {\mathbf{m}}^+-{\mathbf{m}}^-\Vert ^2_{{\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-}\) and

$$\begin{aligned} \varvec{\beta }&=\tfrac{2}{\Vert {\mathbf{m}}^+-{\mathbf{m}}^-\Vert ^2_{{\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-}}({\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-)^{-1}({\mathbf{m}}^+-{\mathbf{m}}^-) \nonumber \\&=\frac{2({\mathbf{\Sigma }}^+ + {\mathbf{\Sigma }}^-)^{-1}({\mathbf{m}}^+-{\mathbf{m}}^-)}{\Vert {\mathbf{m}}^+-{\mathbf{m}}^-\Vert ^2_{{\mathbf{\Sigma }}^+ + {\mathbf{\Sigma }}^-}}. \end{aligned}$$
(4)

The final decision on the class of a point \({\mathbf{h}}\) is therefore given by comparing the values

$$\begin{aligned} \mathcal {N}(\varvec{\beta }^\mathrm{T}{\mathbf{m}}^+,\varvec{\beta }^\mathrm{T}{\mathbf{\Sigma }}^+\varvec{\beta })[\varvec{\beta }^\mathrm{T}{\mathbf{h}}] \quad\text {and}\quad \mathcal {N}(\varvec{\beta }^\mathrm{T}{\mathbf{m}}^-,\varvec{\beta }^\mathrm{T}{\mathbf{\Sigma }}^-\varvec{\beta })[\varvec{\beta }^\mathrm{T}{\mathbf{h}}]. \end{aligned}$$

We distinguish two cases based on the number of the resulting classifier's thresholds (points r such that \(\mathcal {N}(\varvec{\beta }^\mathrm{T}{\mathbf{m}}^+,\varvec{\beta }^\mathrm{T}{\mathbf{\Sigma }}^+\varvec{\beta })[r] = \mathcal {N}(\varvec{\beta }^\mathrm{T}{\mathbf{m}}^-,\varvec{\beta }^\mathrm{T}{\mathbf{\Sigma }}^-\varvec{\beta })[r]\)):

  1. \(S_- = S_+\): there is one threshold

    $$\begin{aligned} r_0=m_- + 1, \end{aligned}$$

    which results in a traditional (one-threshold) linear classifier,

  2. \(S_- \ne S_+\): there are two thresholds

    $$\begin{aligned} r_\pm\,= m_- + \tfrac{2S_- \pm\,\sqrt{S_-S_+(\ln (S_-/S_+)(S_--S_+)+4)}}{S_--S_+}, \end{aligned}$$

    which makes the resulting classifier a member of the two-threshold linear classifiers family [1].

Obviously, in the degenerate case when \({\mathbf{m}}^-={\mathbf{m}}^+\), there is no solution, as the constraint \(\varvec{\beta }^\mathrm{T}({\mathbf{m}}^+-{\mathbf{m}}^-)=2\) cannot be fulfilled by any \(\varvec{\beta }\). In such a case, EEM returns a trivial classifier constantly equal to one of the classes (we put \(\varvec{\beta }=0\)).

From the neural network perspective, we simply construct a custom activation function \(\mathrm {F}(\cdot )\) in the output neuron, depending on which of the two described cases applies (a short sketch follows the list):

  1. \(\mathrm {F}(x) = \left\{ \begin{array} {ll}+1,&\quad \text {if }\,\, x \ge r_0\\ -1, &\quad \text {if }\,\, x < r_0 \end{array} \right. = {\mathrm{sign}}(x-r_0),\)

  2. \(\mathrm {F}(x) = \left\{ \begin{array}{ll} +1,&\quad \text {if }\,\, x \in [r_-,r_+]\\ -1,&\quad \text {if }\,\, x \notin [r_-,r_+] \end{array} \right. = -{\mathrm{sign}}(x-r_-) {\mathrm{sign}}(x-r_+),\) if \(r_-<r_+\), and \(\mathrm {F}(x) = \left\{ \begin{array}{ll} -1, &\quad \text {if }\,\, x \in [r_+,r_-]\\ +1,&\quad \text {if }\,\, x \notin [r_+,r_-] \end{array} \right. = {\mathrm{sign}}(x-r_-) {\mathrm{sign}}(x-r_+),\) otherwise.
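The sketch below (names are ours) implements the decision equivalently by comparing the two projected normal densities directly, with \(m_\pm\,=\varvec{\beta }^\mathrm{T}{\mathbf{m}}^\pm\,\) and \(S_\pm\,=\varvec{\beta }^\mathrm{T}{\mathbf{\Sigma }}^\pm\,\varvec{\beta }\) passed as scalars, rather than precomputing \(r_0\) or \(r_\pm\,\):

import numpy as np

def eem_decision(h_proj, m_p, S_p, m_n, S_n):
    """Classify projected values h_proj = beta^T h by comparing N(m_p, S_p) and N(m_n, S_n)."""
    def log_density(x, m, s):
        return -0.5 * np.log(2 * np.pi * s) - (x - m) ** 2 / (2 * s)
    return np.where(log_density(h_proj, m_p, S_p) >= log_density(h_proj, m_n, S_n), 1, -1)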

4 Theory: density estimation in the kernel case

To illustrate our reasoning, we consider a typical basic problem concerning the density estimation.

Problem 3

Assume that we are given a finite dataset \({\mathbf{H}}\) in a Hilbert space \(\mathcal {H}\) generated by the unknown density f, and the goal consists in estimating f.

Since the problem is in itself infinite dimensional, the data would typically be linearly independent [20]. Moreover, one usually cannot obtain a reliable density estimation; the most we can hope for is that after transformation by a linear functional into \(\mathbb {R}\), the resulting density will be well estimated.

To simplify the problem, assume therefore that we want to find the desired density in the class of normal densities, or equivalently, that we are interested only in estimating the mean and covariance of f.

The generalization of the above problem is given by the following problem:

Problem 4

Assume that we are given finite datasets \({\mathbf{H}}^\pm\,\) in a Hilbert space \(\mathcal {H}\), generated by the unknown densities \(f^\pm\,\), and the goal consists in estimating these unknown densities.

In general, \(\text {dim}(\mathcal {H}) \gg N\), which means that the data are very sparse in terms of the Hilbert space. As a result, classical kernel density estimation (KDE) is not a reliable source of information [18]. In the absence of better tools, we can, however, use KDE with a very large kernel width to capture at least the general shape of the whole density.

Remark 2

Assume that we are given finite datasets \({\mathbf{H}}^\pm\,\) with means \({\mathbf{m}}^\pm\,\) and covariances \({\mathbf{\Sigma }}^\pm\,\) in a Hilbert space \(\mathcal {H}\). If we conduct kernel density estimation using a Gaussian kernel, then, in the limiting case, each class estimate becomes a normal distribution. Strictly speaking,

$$\begin{aligned} \lim _{\sigma \rightarrow \infty } \Vert [\![ {\mathbf{H}}^\pm\,]\!]_\sigma - \mathcal {N}( {\mathbf{m}}^\pm\,, \sigma ^2 {\mathbf{\Sigma }}^\pm\,) \Vert _2 = 0, \end{aligned}$$

where

$$\begin{aligned}{}[\![ A ]\!]_\sigma = \tfrac{1}{|A|} \sum _{a \in A}\mathcal {N}(a, \sigma ^2 \cdot \mathrm {cov}(A)). \end{aligned}$$

The proof of this remark is given by Czarnecki and Tabor [7]; it means that if we perform Gaussian kernel density estimation of our data with a large kernel width (which is reasonable for a small amount of data in a high-dimensional space), then for a large enough \(\hat{\sigma }\) EEM is a nearly optimal linear classifier in terms of the estimated densities

$$\begin{aligned} \hat{f}^\pm\,= \mathcal {N}( {\mathbf{m}}^\pm\,, {\hat{\sigma }}^2 {\mathbf{\Sigma }}^\pm\,) \approx [\![ {\mathbf{H}}^\pm\,]\!]_{\hat{\sigma }}. \end{aligned}$$

Let us now investigate the probabilistic interpretation of EEM. Under the assumption that \({\mathbf{H}}^\pm\,\sim \mathcal {N}({\mathbf{m}}^\pm\,, {\mathbf{\Sigma }}^\pm\,)\), we have the conditional probabilities

$$\begin{aligned} p({\mathbf{h}}|\pm\,) = \mathcal {N}({\mathbf{m}}^\pm\,,{\mathbf{\Sigma }}^\pm\,)[{\mathbf{h}}], \end{aligned}$$

so from Bayes rule we conclude that

$$\begin{aligned} p(\pm\,|{\mathbf{h}})&= \frac{p({\mathbf{h}}|\pm\,)p(\pm\,)}{p({\mathbf{h}})} \\&\propto \mathcal {N}({\mathbf{m}}^\pm\,,{\mathbf{\Sigma }}^\pm\,)[{\mathbf{h}}] p(\pm\,), \end{aligned}$$

where \(p(\pm\,)\) is the prior class distribution. In our case, due to the balanced nature of the model (meaning that despite class imbalance we maximize a balanced quality measure such as Balanced Accuracy), we have \(p(\pm\,)=1/2\).

But

$$\begin{aligned} p({\mathbf{h}}) = \sum _{t\in \{+,-\}} p({\mathbf{h}}|t)p(t) = \tfrac{1}{2}\sum _{t\in \{+,-\}} p({\mathbf{h}}|t) , \end{aligned}$$

so

$$\begin{aligned} p(\pm\,|{\mathbf{h}}) = \frac{ \mathcal {N}({\mathbf{m}}^\pm\,,{\mathbf{\Sigma }}^\pm\,)[{\mathbf{h}}] }{ \sum _{t\in \{+,-\}} \mathcal {N}({\mathbf{m}}^t,{\mathbf{\Sigma }}^t)[{\mathbf{h}}] }. \end{aligned}$$

Furthermore, it is easy to show that under the normality assumption, the resulting classifier is optimal in the Bayesian sense.

Remark 3

If the data in the feature space come from normal distributions \(\mathcal {N}({\mathbf{m}}^\pm\,,{\mathbf{\Sigma }}^\pm\,)\), then the \(\varvec{\beta }\) given by EEM minimizes the probability of misclassification. More precisely, if we draw \({\mathbf{h}}^+\) with probability 1/2 from \(\mathcal {N}({\mathbf{m}}^+,{\mathbf{\Sigma }}^+)\) and \({\mathbf{h}}^-\) with probability 1/2 from \(\mathcal {N}({\mathbf{m}}^-,{\mathbf{\Sigma }}^-)\), then for any \(\varvec{\alpha } \in \mathbb {R}^h\)

$$\begin{aligned} p( \mp | \varvec{\beta }^\mathrm{T} {\mathbf{h}}^\pm\,) \le p( \mp | \varvec{\alpha }^\mathrm{T} {\mathbf{h}}^\pm\,). \end{aligned}$$

5 Theory: learning capabilities

First, we show that under some simplifying assumptions, the proposed method behaves like the Extreme Learning Machine (or the Weighted Extreme Learning Machine [30]).

Before proceeding further, we would like to remark that there are two popular notations for projecting data onto hyperplanes. The first, used in the ELM model, assumes that \({\mathbf{H}}\) is a row-wise sample matrix and \(\varvec{\beta }\) is a column vector, which results in the projection equation \({\mathbf{H}}\varvec{\beta }\). The second, used in SVM and in our paper, assumes that both \({\mathbf{H}}\) and \(\varvec{\beta }\) are column oriented, which results in the projection \(\varvec{\beta }^\mathrm{T}{\mathbf{H}}\). In the following theorem, we show a duality between the \(\varvec{\beta }\) found by ELM and by EEM. To do so, we will need to change the notation during the proof, which will be indicated.

Theorem 1

Let us assume that we are given an arbitrary, balanced binary dataset which can be learned perfectly by an ELM with N hidden neurons. If the image of the dataset points through the random neurons, \({\mathbf{H}}=\varphi (\mathbf{X})\), is centered (the points' images have zero mean) and the classes have homogeneous covariances (we assume that there exist real \(a_+\) and \(a_-\) such that \(\mathrm {cov}({\mathbf{H}}) = a_+\mathrm {cov}({\mathbf{H}}^+) = a_-\mathrm {cov}({\mathbf{H}}^-)\)), then an EEM with the same hidden layer will also learn this dataset perfectly (with 0 error).

Proof

In the first part of the proof, we use the ELM notation. Projected data are centered, so \(\mathrm {cov}({\mathbf{H}}) = {\mathbf{H}}^\mathrm{T}{\mathbf{H}}\). ELM is able to learn this dataset perfectly, consequently \({\mathbf{H}}\) is invertible, thus also \({\mathbf{H}}^\mathrm{T}{\mathbf{H}}\) is invertible, as a result \(\mathrm {cov_{\dagger }}({\mathbf{H}})=\mathrm {cov}({\mathbf{H}})={\mathbf{H}}^\mathrm{T}{\mathbf{H}}\). We will now show that \(\exists _{a \in \mathbb {R}_+} \varvec{\beta }_{\mathrm{ELM}} = a \cdot \varvec{\beta }_{\mathrm{EEM}}.\) First, let us recall that \(\varvec{\beta }_{\mathrm{ELM}} = {\mathbf{H}}^\dagger \mathbf{t}= {\mathbf{H}}^{-1}\mathbf{t}\) and \(\varvec{\beta }_{\mathrm{EEM}}=\frac{2({\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-)^{-1}({\mathbf{m}}^+-{\mathbf{m}}^-)}{\Vert {\mathbf{m}}^+-{\mathbf{m}}^- \Vert _{{\mathbf{\Sigma }}^-+{\mathbf{\Sigma }}^+}^2}\) where \({\mathbf{\Sigma }}^\pm\,= \mathrm {cov_{\dagger }}({\mathbf{H}}^\pm\,)\). Due to the assumption of geometric homogeneity \(\varvec{\beta }_{\mathrm{EEM}}=\frac{2}{\Vert {\mathbf{m}}^+-{\mathbf{m}}^-\Vert _{{\mathbf{\Sigma }}}^2}(\frac{a_++a_-}{a_+a_-}{\mathbf{\Sigma }})^{-1}({\mathbf{m}}^+-{\mathbf{m}}^-)\), where \({\mathbf{\Sigma }}= \mathrm {cov_{\dagger }}({\mathbf{H}})\). Therefore,

$$\begin{aligned} \varvec{\beta }_{\mathrm{ELM}}&= {\mathbf{H}}^{-1}\mathbf{t}\\&= ({\mathbf{H}}^\mathrm{T}{\mathbf{H}})^{-1}{\mathbf{H}}^\mathrm{T}\mathbf{t}\\&= \mathrm {cov_{\dagger }}^{-1}({\mathbf{H}}){\mathbf{H}}^\mathrm{T}\mathbf{t}. \end{aligned}$$

From now, we change the notation back to the one used in this paper and obtain

$$\begin{aligned} \varvec{\beta }_{\mathrm{ELM}}&= {\mathbf{\Sigma }}^{-1} \left( \sum _{{\mathbf{h}}^+ \in {\mathbf{H}}^+} (+1 \cdot {\mathbf{h}}^+) + \sum _{{\mathbf{h}}^- \in {\mathbf{H}}^-} (-1 \cdot {\mathbf{h}}^-) \right) \\&= {\mathbf{\Sigma }}^{-1} \left( \sum _{{\mathbf{h}}^+ \in {\mathbf{H}}^+} {\mathbf{h}}^+ - \sum _{{\mathbf{h}}^- \in {\mathbf{H}}^-} {\mathbf{h}}^- \right) \\&= {\mathbf{\Sigma }}^{-1}\frac{N}{2}({\mathbf{m}}^+ - {\mathbf{m}}^-)\\&= \frac{N}{2} \frac{\Vert {\mathbf{m}}^+-{\mathbf{m}}^-\Vert _{{\mathbf{\Sigma }}}^2}{2}\frac{a_++a_-}{a_+a_-} \varvec{\beta }_{\mathrm{EEM}}\\&= a \cdot \varvec{\beta }_{\mathrm{EEM}}, \end{aligned}$$

for \(a = \frac{N}{2} \frac{\Vert {\mathbf{m}}^+-{\mathbf{m}}^-\Vert _{{\mathbf{\Sigma }}}^2}{2}\frac{a_++a_-}{a_+a_-} \in \mathbb {R}_+\). Again from homogeneity, we obtain just one equilibrium point, located at \(\varvec{\beta }_{\mathrm{EEM}}^\mathrm{T}({\mathbf{m}}^+-{\mathbf{m}}^-)/2\), which results in exactly the same classifier as the one given by ELM. This completes the proof. \(\square\)

A similar result holds for EEKM and the Least Squares Support Vector Machine.

Theorem 2

Let us assume that we are given an arbitrary, balanced binary dataset which can be learned perfectly by LS-SVM. If the images of the dataset points through the kernel-induced projection \(\varphi _\mathrm {K}\) have homogeneous class covariances (we assume that there exist real \(a_+\) and \(a_-\) such that \(\mathrm {cov}(\varphi _\mathrm {K}(\mathbf{X})) = a_+\mathrm {cov}(\varphi _\mathrm {K}(\mathbf{X}^+)) = a_-\mathrm {cov}(\varphi _\mathrm {K}(\mathbf{X}^-))\)), then EEKM with the same kernel and N hidden neurons will also learn this dataset perfectly (with 0 error).

Proof

This is a direct consequence of the fact that with N hidden neurons and homogeneous covariances of the class projections, EEKM degenerates to the kernelized Fisher Discriminant, which, as Gestel et al. showed [28], is equivalent to the solution of the Least Squares SVM. \(\square\)

Both theorems can be extended to non-balanced datasets if we consider Weighted ELM and Balanced LS-SVM. The proposed method has a balanced nature, so it internally assumes that the class priors are equal to 1/2. In the proofs, we show that when this assumption is true, ELM and LS-SVM lead (under some assumptions) to the same solution. If one includes the same assumption in these two methods (through Weighted ELM and Balanced LS-SVM), then they will again solve the same problem regardless of the true class priors. We omit the exact proofs as they are analogous to the above.

6 Practical considerations

In the previous sections, we investigated the limiting case \(\mathrm {dim}(\mathcal {H}) =\infty\). However, in practice, we choose h random nonlinear projections which embed the data in a high-dimensional space (with dimension at least h, but we can still consider it an image in a higher dimensional space, analogously to how the Gaussian kernel actually maps to an infinite-dimensional space while projecting through just N functions). As we show in the evaluation, it is sufficient to use an h which is much smaller than N, so the resulting computational complexities, cubic in h, are acceptable.

We can formulate the whole EEM training as a very simple algorithm (see Algorithms 1, 2).

[Algorithm 1; Algorithm 2]
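In place of the omitted pseudocode, a compact sketch of the EEM training routine under the setup described above (random sigmoid hidden layer, Ledoit-Wolf covariances, closed-form \(\varvec{\beta }\) from Eq. (4)); all helper names are our own:

import numpy as np
from sklearn.covariance import ledoit_wolf

def eem_train(X_pos, X_neg, h, seed=0):
    """EEM training sketch: random projection + closed-form beta (Eq. (4))."""
    rng = np.random.default_rng(seed)
    d = X_pos.shape[1]
    W, b = rng.uniform(0, 1, (d, h)), rng.uniform(0, 1, h)   # random hidden layer
    phi = lambda X: 1.0 / (1.0 + np.exp(-(X @ W + b)))       # projection into H
    H_p, H_n = phi(X_pos), phi(X_neg)
    m_p, m_n = H_p.mean(axis=0), H_n.mean(axis=0)
    S_p, S_n = ledoit_wolf(H_p)[0], ledoit_wolf(H_n)[0]      # cov_dagger estimators
    diff = m_p - m_n
    Sinv_diff = np.linalg.solve(S_p + S_n, diff)
    beta = 2.0 * Sinv_diff / (diff @ Sinv_diff)              # Eq. (4)
    # projected class parameters used by the decision rule F
    params = (beta @ m_p, beta @ S_p @ beta, beta @ m_n, beta @ S_n @ beta)
    return W, b, beta, params

Classification of a new point then amounts to computing \(\varphi ({\mathbf{x}} )\) with the stored W, b, projecting through \(\varvec{\beta }\), and applying the decision rule sketched in the previous section.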

The resulting model consists of three elements:

  • feature projection function \(\varphi\),

  • linear operator \(\varvec{\beta }\),

  • classification rule \(\mathrm {F}\).

As described before, \(\mathrm {F}\) can be further compressed to just one or two thresholds (\(r_0\) or \(r_\pm\,\)) using the equations from the previous sections. Either way, the complexity of the resulting model is linear in the number of hidden units, and classification of a new point takes \(\mathcal {O}(dh)\) time.

During EEM training, the most expensive parts of the algorithm are the computation of the covariance estimators and the inversion of the sum of covariances. Even the computation of the empirical covariance takes \(\mathcal {O}(Nh^2)\) time, so the total training complexity, equal to \(\mathcal {O}(h^3 + Nh^2) = \mathcal {O}(Nh^2)\), is acceptable. It is worth noting that training of the ELM also takes exactly \(\mathcal {O}(Nh^2)\) time, as it requires the computation of \({\mathbf{H}}^\mathrm{T}{\mathbf{H}}\) for \({\mathbf{H}}\in \mathbb {R}^{N \times h}\). Training of EEKM requires the additional computation of the square root of the sampled kernel matrix inverse \(\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2}\), but as \(\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]}) \in \mathbb {R}^{h \times h}\) can be computed in \(\mathcal {O}(dh^2)\) and both inverting and square rooting can be done in \(\mathcal {O}(h^3)\), we obtain exactly the same asymptotic computational complexity as for EEM. The square rooting and inverting procedures are always possible, since the assumption that \(\mathrm {K}\) is a valid kernel in the Mercer sense yields that \(\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})\) is strictly positive definite and thus invertible. A further comparison of EEM, ELM and SVM is summarized in Table 2.

Table 2 Comparison of considered classifiers

The next aspect we would like to discuss is cost-sensitive learning. EEMs are balanced models in the sense that they try to maximize balanced quality measures (like Balanced Accuracy or GMean). However, in practical applications it might be the case that we are actually more interested in the positive class than in the negative one (as in medical applications). The proposed model gives direct probability estimates of \(p(\varvec{\beta }^\mathrm{T}{\mathbf{h}}|t)\), which we can easily convert to a cost-sensitive classifier by introducing the prior probabilities of each class. Directly from Bayes' Theorem, given \(p(+)\) and \(p(-)\), we can label a new sample \({\mathbf{h}}\) according to

$$\begin{aligned} p(t|\varvec{\beta }^\mathrm{T}{\mathbf{h}}) \propto p(t)p(\varvec{\beta }^\mathrm{T}{\mathbf{h}}|t), \end{aligned}$$

so if we are given costs \(C_+, C_- \in \mathbb {R}_+\), we can use them as weights on the priors

$$\begin{aligned} cl({\mathbf{x}} ) = \mathop {\mathrm{arg max}}\limits _{t\in \{-,+\}} \tfrac{C_t}{C_-+C_+} p(\varvec{\beta }^\mathrm{T}{\mathbf{h}}|t). \end{aligned}$$
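A minimal sketch of this cost-sensitive rule on the projected values, assuming the projected class parameters (means and variances) produced by the training sketch above:

import numpy as np

def cost_sensitive_decision(h_proj, m_p, S_p, m_n, S_n, C_pos, C_neg):
    """argmax over classes t of C_t / (C_pos + C_neg) * N(m_t, S_t)[beta^T h]."""
    def density(x, m, s):
        return np.exp(-(x - m) ** 2 / (2 * s)) / np.sqrt(2 * np.pi * s)
    score_p = C_pos / (C_pos + C_neg) * density(h_proj, m_p, S_p)
    score_n = C_neg / (C_pos + C_neg) * density(h_proj, m_n, S_n)
    return np.where(score_p >= score_n, 1, -1)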

Let us now investigate the possible efficiency bottleneck. In EEKM, the classification of a new point \({\mathbf{x}}\) is based on (with \(\mathrm {K}^{[h]}\) denoting \(\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2}\))

$$\begin{aligned} cl({\mathbf{x}} )&= \mathrm {F}( \varvec{\beta }^\mathrm{T} \varphi _\mathrm {K}({\mathbf{x}} ) ) \\&= \mathrm {F}( \varvec{\beta }^\mathrm{T} (\mathrm {K}({\mathbf{x}} ,\mathbf{X}^{[h]})\mathrm {K}^{[h]})^\mathrm{T} ) \\&= \mathrm {F}( \varvec{\beta }^\mathrm{T} (\mathrm {K}^{[h]})^\mathrm{T} \mathrm {K}({\mathbf{x}} ,\mathbf{X}^{[h]})^\mathrm{T} )\\&= \mathrm {F}( (\mathrm {K}^{[h]}\varvec{\beta })^\mathrm{T} \mathrm {K}(\mathbf{X}^{[h]}, {\mathbf{x}} ) ). \end{aligned}$$

One can convert EEKM to the SLFN by putting:

$$\begin{aligned} \hat{\varphi }_\mathrm {K}({\mathbf{x}} )&= \mathrm {K}(\mathbf{X}^{[h]}, {\mathbf{x}} )\\ \hat{\varvec{\beta }}&= \mathrm {K}^{[h]}\varvec{\beta }, \end{aligned}$$

so the classification rule becomes

$$\begin{aligned} cl({\mathbf{x}} ) = \mathrm {F}( {\hat{\varvec{\beta} }} ^\mathrm{T} {\hat{\varphi} }_\mathrm {K}({\mathbf{x}} ) ). \end{aligned}$$

In this way, the complexity of classifying a new point is exactly the same as in the case of EEM and ELM (or any other SLFN).

7 Evaluation

For evaluation purposes, we implemented five methods, namely: Weighted Extreme Learning Machine (WELM [30]), Extreme Entropy Machine (EEM), Extreme Entropy Kernel Machine (EEKM), Least Squares Support Vector Machine (LS-SVM [23]) and Support Vector Machine (SVM [4]). All experiments were performed on a machine with Intel Xeon 2.8 GHz processors and enough RAM to fit any required computations.

All methods, with the exception of SVM, were implemented in Python using the numpy [27] and scipy [14] libraries included in anaconda (Footnote 4), for a fair comparison. For SVM, we used the highly efficient libSVM [3] library with bindings available in scikit-learn [19]. The random projection-based methods (WELM, EEM) were tested using the following three generalized activation functions \(\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} ,b)\):

  • sigmoid (sig): \(\tfrac{1}{1+\exp (-\langle {\mathbf{w}} ,{\mathbf{x}} \rangle + b)}\),

  • normalized sigmoid (nsig): \(\tfrac{1}{1+\exp (-\langle {\mathbf{w}} ,{\mathbf{x}} \rangle /d + b)}\),

  • radial basis function (rbf): \(\exp (-b \Vert {\mathbf{w}} - {\mathbf{x}} \Vert ^2 )\).

Random parameters (weights and biases) were selected from uniform distributions on [0, 1]. Training of WELM was performed using the Moore–Penrose pseudoinverse, and training of EEM using the Ledoit–Wolf covariance estimator, as both are parameterless, closed-form estimators of the required objects. For the kernel methods (EEKM, LS-SVM, SVM), we used the Gaussian kernel (rbf) \(\mathrm {K}_\gamma ({\mathbf{x}} _i,{\mathbf{x}} _j)=\exp (-\gamma \Vert {\mathbf{x}} _i-{\mathbf{x}} _j \Vert ^2 )\). In all methods requiring class balancing schemes (WELM, LS-SVM, SVM), we used balance weights \(w_i\) equal to the ratio of the bigger class size to the current class size (so \(\sum _{i=1}^N w_i t_i = 0\), which is equivalent to \(w_i=({\max _{t\in \{-,+\}} N_t})/{N_{t_i}}\)).

Hyperparameters of each model were fitted; the grid search performed included: hidden layer size \(h=50,100,250,500,1000\) (WELM, EEM, EEKM), Gaussian kernel width \(\gamma =10^{-10},\ldots ,10^0\) (EEKM, LS-SVM, SVM), and SVM regularization parameter \(C=10^{-1},\ldots ,10^{10}\) (LS-SVM, SVM).
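Spelled out as a plain configuration sketch (the dictionary keys are our own naming), the grid is:

import numpy as np

param_grid = {
    "h": [50, 100, 250, 500, 1000],          # hidden layer size (WELM, EEM, EEKM)
    "gamma": 10.0 ** np.arange(-10, 1),      # Gaussian kernel width (EEKM, LS-SVM, SVM)
    "C": 10.0 ** np.arange(-1, 11),          # regularization parameter (LS-SVM, SVM)
}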

Datasets’ features were linearly scaled (per feature) to have each feature in the interval [0, 1]. No other data whitening/filtering was performed. All experiments were conducted in repeated tenfold stratified cross-validation.

We use the following evaluation metric:

$$\begin{aligned} \mathrm {GMean}(\text {TP,FP,TN,FN}) = \sqrt{\frac{\text {TP}}{\text {TP}+\text {FN}} \cdot \frac{\text {TN}}{\text {TN}+\text {FP}}}, \end{aligned}$$

where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives. We choose it due to its balanced nature and its use in previous works regarding Weighted Extreme Learning Machines [30]; however, very similar results can be obtained for Balanced Accuracy (which is the arithmetic mean of per-class accuracies).
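A one-line sketch of this metric for a given confusion matrix:

import numpy as np

def gmean(tp, fp, tn, fn):
    """Geometric mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))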

Table 3 Characteristics of used datasets

7.1 Basic UCI datasets

We start our experiments with nine datasets from the UCI repository [2], namely australian, breast-cancer, diabetes, german.numer, heart, ionosphere, liver disorders, sonar and splice, summarized in Table 3. This collection includes rather balanced, low-dimensional problems.

On such data, EEM seems to perform noticeably better than ELM when using the RBF activation function (see Table 4), and rather similarly when using the sigmoid one; in that scenario, for some datasets ELM achieves better results, while for others EEM wins. Results obtained for EEKM are comparable with those obtained by LS-SVM and SVM; in both cases the proposed method achieves better results on about a third of the problems, draws on a third, and loses on a third. This experiment can be seen as a proof of concept of the whole methodology, showing that it can truly be a reasonable alternative to existing models on some problems. It appears that, contrary to ELM, the proposed methods (EEM and EEKM) achieve the best scores across all considered models on some of the datasets regardless of the activation function/kernel used (only Support Vector Machines and their least squares counterpart are competitive in this sense).

Table 4 GMean on all considered datasets

7.2 Highly unbalanced datasets

In the second part, we considered nine highly unbalanced datasets, summarized in the second part of Table 3. The ratio between the bigger and the smaller class varies from 10:1 up to 20:1, which makes these problems genuinely hard. The obtained results (see Table 4) resemble those obtained on the UCI repository: we see better results in about half of the experiments once we fix a particular activation function/kernel (i.e., we compare ELM\(_x\) with EEM\(_x\) and LS-SVM\(_x\) with EEKM\(_x\)).

Table 5 shows that the training time of Extreme Entropy Machines is comparable with that of Extreme Learning Machines (differences on the level of 0.1–0.2 s are not significant for datasets of this size). We obtain a robust method which learns a model for hundreds or thousands of examples in under two seconds. For larger datasets (like abalone7 or sick euthyroid), the proposed methods not only outperform SVM and LS-SVM in terms of training time, but there is also a noticeable difference between their training times and those of ELMs. This suggests that even though ELM and EEM are quite similar and equally fast on small datasets, EEM can scale better to truly big data. Obviously, the reported training times do not reflect the full training cost, which strongly depends on the technique used for hyperparameter selection and the resolution of the grid search (or other parameter tuning technique). In such a full scenario, the training times of SVM-related models are significantly larger due to the need for exact tuning of both C and \(\gamma\) over real domains.

Table 5 Training times (in seconds) on highly unbalanced datasets, measured on a machine with Intel Xeon 2.8 GHz processors

7.3 Extremely unbalanced datasets

The third part of the experiments involves the analysis of extremely unbalanced datasets (with class imbalance up to 100:1) containing tens and hundreds of thousands of examples. The five analyzed datasets span NLP tasks (webpages), medical applications (mammography) and bioinformatics (protein homology). This type of dataset often occurs in real data mining applications, which makes these results more practically relevant than those obtained on small, balanced data. Hyperparameters of each method were carefully fitted as described in the previous section.

The scores obtained on the Isolet dataset (see Table 4) for sigmoid-based random projections are a result of very high values (\({\approx } 200\)) of \(\langle {\mathbf{x}} ,{\mathbf{w}} \rangle\) for all \({\mathbf{x}}\), which yields \(\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} ,b)=1\); the whole dataset is thus reduced to the singleton \(\{ [1,\ldots ,1]^\mathrm{T} \} \subset \mathbb {R}^h \subset \mathcal {H}\), which obviously cannot be separated by any classifier, neither ELM nor EEM.

For other activation functions, we see that EEM achieves slightly worse results than ELM. On the other hand, the scores of EEKM generally outperform those obtained by ELM and are very close to those of well-tuned SVM and LS-SVM. At the same time, EEM and EEKM were trained significantly faster: as Table 5 shows, training was an order of magnitude faster than for SVM-related models and even \(1.5\)–\(2\times\) faster than for ELM. It seems that computing the Ledoit–Wolf covariance estimator together with the associated matrix inversion is simply a faster operation (and scales better) than computing the Moore–Penrose pseudoinverse of \({\mathbf{H}}^\mathrm{T}{\mathbf{H}}\). Obviously, one can switch the ELM training routine to a regularized one where, instead of \(({\mathbf{H}}^\mathrm{T}{\mathbf{H}})^\dagger\), one computes

$$\begin{aligned} ({\mathbf{H}}^\mathrm{T}{\mathbf{H}}+ \mathbf{I}/C)^{-1}, \end{aligned}$$
(5)

but here we analyze the approach without parametrized regularization; the analogue could be used for EEM in the form of

$$\begin{aligned} (\mathrm {cov}(\mathbf{X}^-)+\mathrm {cov}(\mathbf{X}^+) + \mathbf{I}/C)^{-1} \end{aligned}$$
(6)

instead of the Ledoit–Wolf estimator. In other words, in the regularization-parameterless training scenario described in this paper, EEMs seem to scale better than ELMs while obtaining similar classification results. At the same time, EEKM obtains SVM-level results with orders of magnitude faster training. Both ELM and EEM could be transformed into regularization parameter-based learning [see Eqs. (5), (6)], but this is beyond the scope of this work.
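
For concreteness, a minimal sketch contrasting the regularized inverses of Eqs. (5) and (6) with the parameterless Ledoit–Wolf variant used in this paper; the function names are ours, and \(\mathbf{H}^+\), \(\mathbf{H}^-\) denote the hidden-layer representations of the positive and negative class.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def elm_regularized_inverse(H, C):
    # Eq. (5): (H^T H + I/C)^{-1}
    return np.linalg.inv(H.T @ H + np.eye(H.shape[1]) / C)

def eem_regularized_inverse(H_pos, H_neg, C):
    # Eq. (6): (cov(X^-) + cov(X^+) + I/C)^{-1}
    S = np.cov(H_neg, rowvar=False) + np.cov(H_pos, rowvar=False)
    return np.linalg.inv(S + np.eye(H_pos.shape[1]) / C)

def eem_ledoit_wolf_inverse(H_pos, H_neg):
    # Parameterless variant: Ledoit-Wolf shrinkage covariance estimators.
    S = LedoitWolf().fit(H_neg).covariance_ + LedoitWolf().fit(H_pos).covariance_
    return np.linalg.inv(S)
```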

7.4 Entropy-based hyperparameters optimization

Now we proceed to entropy-based evaluation. Given a particular set of linear hypotheses \(\mathcal {M}\) in \(\mathcal {H}\), we want to select the optimal set of hyperparameters \(\theta\) (such as the number of hidden neurons or the regularization parameter) which identifies a particular model \(\varvec{\beta }_\theta \in \mathcal {M} \subset \mathcal {H}\). Instead of using expensive internal cross-validation (or another generalization error estimation technique such as Err\(^{0.632}\)), we select the \(\theta\) which maximizes our entropic measure. In particular, we consider a simplified Cauchy–Schwarz Divergence-based strategy where we select \(\theta\) maximizing

$$\begin{aligned} {D}_\mathrm {CS}(\mathcal {N}(\varvec{\beta }_\theta ^\mathrm{T} {\mathbf{m}}^+, \mathrm {var}(\varvec{\beta }_\theta ^\mathrm{T} {\mathbf{H}}^+)),\mathcal {N}(\varvec{\beta }_\theta ^\mathrm{T} {\mathbf{m}}^-, \mathrm {var}(\varvec{\beta }_\theta ^\mathrm{T} {\mathbf{H}}^-))), \end{aligned}$$

and a kernel density-based entropic strategy [7], selecting \(\theta\) maximizing

$$\begin{aligned} {D}_\mathrm {CS}( [\![ \varvec{\beta }_\theta ^\mathrm{T} {\mathbf{H}}^+ ]\!] , [\![ \varvec{\beta }_\theta ^\mathrm{T} {\mathbf{H}}^- ]\!]), \end{aligned}$$

where \([\![ A ]\!] = [\![ A ]\!]_{\sigma (A)}\) is a Gaussian KDE with the window width given by Silverman's rule [22]

$$\begin{aligned} \sigma (A) = \left( \tfrac{4}{3|A|} \right) ^{1/5}\mathrm {std}(A) \approx \tfrac{1.06}{\root 5 \of {|A|}}\mathrm {std}(A). \end{aligned}$$

This way we can use the whole given set for training and do not need to repeat the process, as \({D}_\mathrm {CS}\) is computed on the training set instead of a hold-out set.
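
A minimal sketch of the Gaussian variant of this selection strategy is given below, assuming the standard Cauchy–Schwarz divergence \(D_\mathrm{CS}(f,g) = -\log \big ( \int f g / \sqrt{\int f^2 \int g^2} \big )\) together with the closed-form Gaussian integrals; the helper names and the candidate format are our illustrative assumptions.

```python
import numpy as np

def dcs_gaussians(m_pos, v_pos, m_neg, v_neg):
    # D_CS between N(m_pos, v_pos) and N(m_neg, v_neg) in 1D, using
    # int N(.;m1,v1) N(.;m2,v2) = N(m1 - m2; 0, v1 + v2) and
    # int N(.;m,v)^2 = 1 / (2 sqrt(pi v)).
    cross = np.exp(-(m_pos - m_neg) ** 2 / (2.0 * (v_pos + v_neg))) \
            / np.sqrt(2.0 * np.pi * (v_pos + v_neg))
    self_p = 1.0 / (2.0 * np.sqrt(np.pi * v_pos))
    self_n = 1.0 / (2.0 * np.sqrt(np.pi * v_neg))
    return -np.log(cross / np.sqrt(self_p * self_n))

def select_theta(candidates):
    # candidates: iterable of (theta, beta, H_pos, H_neg), where beta is the
    # linear operator fitted for hyperparameters theta and H_pos / H_neg are
    # the hidden-layer representations of the two classes.
    best_theta, best_score = None, -np.inf
    for theta, beta, H_pos, H_neg in candidates:
        p, n = H_pos @ beta, H_neg @ beta         # 1D class projections
        score = dcs_gaussians(p.mean(), p.var(), n.mean(), n.var())
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta
```

The KDE-based strategy would replace dcs_gaussians with a divergence computed on Gaussian kernel density estimates of the projections, with the bandwidth given by Silverman's rule above.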

First, one can notice in Table 6 that such an entropic criterion works well for EEM, EEKM and Support Vector Machines. On the other hand, it is not very well suited for ELM models. This confirms the conclusions of Czarnecki and Tabor's work on classification using \({D}_\mathrm {CS}\) [7], where SVMs were claimed to be conceptually similar in terms of the optimization objective, and extends them to the new class of models (EEMs). Second, Table 6 shows that EEM and EEKM can truly select their hyperparameters using a very simple technique requiring no model retraining. Computation of

$$\begin{aligned} {D}_\mathrm {CS}(\mathcal {N}(\varvec{\beta }_\theta ^\mathrm{T} {\mathbf{m}}^+, \mathrm {var}(\varvec{\beta }_\theta ^\mathrm{T} {\mathbf{H}}^+)),\mathcal {N}(\varvec{\beta }_\theta ^\mathrm{T} {\mathbf{m}}^-, \mathrm {var}(\varvec{\beta }_\theta ^\mathrm{T} {\mathbf{H}}^-))) \end{aligned}$$

is linear in the training set size and constant time if performed using precomputed projections of the required objects (which are computed during EEM training anyway). This makes this very fast model even more robust.

Table 6 GMean on UCI datasets with hyperparameter tuning based on selecting a model according to (a) \({D}_\mathrm {CS}(\mathcal {N}(\varvec{\beta }^\mathrm{T} {\mathbf{m}}^+, \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^+ \varvec{\beta }),\mathcal {N}(\varvec{\beta }^\mathrm{T} {\mathbf{m}}^-, \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^- \varvec{\beta }))\) and (b) \({D}_\mathrm {CS}([\![ \varvec{\beta }^\mathrm{T} {\mathbf{h}}^+ ]\!],[\![\varvec{\beta }^\mathrm{T} {\mathbf{h}}^- ]\!])\), where \(\varvec{\beta }\) is the linear operator found by a particular optimization procedure, instead of internal cross-validation

7.5 EEM stability

It was previously reported [12] that ELMs produce very stable results over a wide range of hidden layer sizes. We performed analogous experiments with EEM on the UCI datasets: we trained models for 100 increasing hidden layer sizes (\(h=5,10,\ldots ,500\)) and plotted the resulting GMean scores in Fig. 3.

Fig. 3

Plot of the EEM’s (with RBF activation function) GMean scores from cross-validation experiments for increasing sizes of hidden layer. Error bars denote standard deviation

One can notice that, similarly to ELM, the proposed methods are very stable. Once the machine gets enough neurons (around 100 for the tested datasets), further increasing the feature space dimension has only a minor effect on the generalization capabilities of the model. It is also important that some of these datasets (like sonar) do not even have 500 points, so the Hilbert space has more dimensions than we have points to build our covariance estimates, and even then we do not observe any rapid overfitting.

8 Conclusions

In this paper, we have presented Extreme Entropy Machines, models derived from information theoretic measures and applied to classification problems. The proposed methods are strongly related to the concepts of Extreme Learning Machines (in terms of general workflow, rapid training and randomization) as well as Support Vector Machines (in terms of the margin maximization interpretation and LS-SVM duality).

The main characteristics of EEMs are:

  • information theoretic background based on differential and Renyi’s quadratic entropies,

  • closed-form solution of the optimization problem,

  • generative training, leading to direct probability estimates,

  • small number of hyperparameters,

  • good classification results,

  • rapid training that scales well to hundreds of thousands of examples and beyond,

  • theoretical and practical similarities to the large margin classifiers and Fisher Discriminant.

The performed evaluation showed that, similarly to ELM, the proposed EEM is a very stable model with respect to the size of the hidden layer and achieves classification results comparable to those obtained by SVMs and ELMs. Furthermore, we showed that our method scales better to truly big datasets (consisting of hundreds of thousands of examples) without sacrificing result quality.

During our considerations, we pointed out some open problems and issues worth investigating:

  • Can one construct a closed-form entropy-based classifier with distribution families other than Gaussians? It remains an open problem whether this is possible even for a convex combination of two Gaussians.

  • Is there a theoretical justification of the stability of extreme learning techniques? In particular, can one show whether performing a random projection is equivalent to imposing some prior on the decision function space, as in the case of kernels?

  • Is it possible to further improve the achieved results by performing unsupervised entropy-based optimization in the hidden layer? For Gaussian nodes, one could use some GMM clustering techniques, but is there an efficient way of selecting nodes with different activation functions, such as ReLU?