Extreme entropy machines: robust information theoretic classification
Abstract
Most existing classification methods aim at minimizing empirical risk (some simple point-based error measured with a loss function) with added regularization. We propose to approach the classification problem by applying entropy measures as the model objective function. We focus on quadratic Rényi’s entropy and the connected Cauchy–Schwarz Divergence, which leads to the construction of extreme entropy machines (EEM). The main contribution of this paper is a model based on information-theoretic concepts which, on the one hand, gives a new, entropic perspective on known linear classifiers and, on the other, leads to a very robust method competitive with state-of-the-art non-information-theoretic ones (including Support Vector Machines and Extreme Learning Machines). Evaluation on numerous problems, spanning from small, simple ones from the UCI repository to large (hundreds of thousands of samples), extremely unbalanced (up to 100:1 class ratios) datasets, shows the wide applicability of the EEM in real-life problems. Furthermore, it scales better than all considered competitive methods.
Keywords
Rapid learning, Extreme learning machines, Classification, Random projections, Entropy
1 Introduction
There is no single, universal, perfect optimization criterion that can be used to train a machine learning model. Even for linear classifiers, one can find multiple objective functions, error measures to minimize, and regularization methods to include [15]. Most existing classification methods aim at minimizing empirical risk (some simple point-based error measured with a loss function) with added regularization. We propose to approach this problem in a more information-theoretic way by investigating the applicability of entropy measures as a classification model’s objective function. We focus on quadratic Rényi’s entropy and the connected Cauchy–Schwarz Divergence. The resulting model:

has a trivial implementation (under 20 lines of code in Python),

learns rapidly,

is well suited for unbalanced problems,

constructs a nonlinear hypothesis,

scales very well (better than SVM, LSSVM and ELM),

has a few hyperparameters which are easy to optimize.
2 Preliminaries
Let us begin by recalling some basic information regarding extreme learning machines [12] and Support Vector Machines [4], which are further used as competing models for the proposed solution. We focus here on the optimization problems being solved, to underline some basic differences between these methods and EEMs.
2.1 Extreme learning machines
Extreme Learning Machines are relatively young models, introduced by Huang et al. [11], based on the idea that a single hidden-layer feedforward neural network (SLFN) can be trained without an iterative process by performing linear regression on the data mapped through a random, nonlinear projection (random hidden neurons). More precisely, the basic ELM architecture consists of d input neurons, connected with each input space dimension, which are fully connected with h hidden neurons by a set of weights \(\mathbf{w} _j\) (selected randomly from some arbitrary distribution) and a set of biases \(b_j\) (also randomly selected). Given some generalized nonlinear activation function \(\mathrm {G}\), one can express the hidden neurons’ activation matrix \({\mathbf{H}}\) for the whole training set \(\{(\mathbf{x} _i,t_i)\}_{i=1}^N\), where \(\mathbf{x} _i \in \mathbb {R}^d\) and \(t_i \in \{-1,+1\}\), and formulate the following optimization problem
2.1.1 Optimization problem: extreme learning machine
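The optimization problem itself is not reproduced in this excerpt. As a minimal sketch of the standard ELM fit just described (a random sigmoid hidden layer followed by a Moore–Penrose least-squares solution; the function names and the choice of weight distribution are illustrative, not the paper’s):

```python
import numpy as np

def elm_train(X, t, h, seed=0):
    """Minimal ELM sketch: random sigmoid hidden layer + least-squares output.

    X : (N, d) data matrix, t : (N,) targets in {-1, +1},
    h : number of hidden neurons."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], h))    # random input weights w_j
    b = rng.normal(size=h)                  # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # hidden activation matrix H
    beta = np.linalg.pinv(H) @ t            # Moore-Penrose solution of H beta = t
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.sign(H @ beta)
```

With enough hidden neurons, the least-squares residual on a small training set is zero, so the training points are classified perfectly.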
2.2 Support vector machines and least squares support vector machines
One of the most well-known classifiers of the last decade is Vapnik’s Support Vector Machine [4], based on the principle of creating a linear classifier that maximizes the separating margin between elements of two classes.
2.2.1 Optimization problem: support vector machine
2.2.2 Optimization problem: kernel support vector machine
2.2.3 Optimization problem: least squares support vector machine
3 Extreme entropy machines
Let us first recall the formulation of the linear classification problem in high-dimensional feature spaces, i.e., when the number of samples N is equal to (or less than) the dimension of the data space, \(\mathrm {dim}(\mathcal {H})\). In particular, we formulate the problem in the limiting case^{2} when \(\mathrm {dim}(\mathcal {H})=\infty\):
Problem 1
We are given a finite number of (often linearly independent) points \({\mathbf{h}}^\pm\) in an infinite-dimensional Hilbert space \(\mathcal {H}\). Points \({\mathbf{h}}^+ \in {\mathbf{H}}^+\) constitute the positive class, while \({\mathbf{h}}^- \in {\mathbf{H}}^-\) constitute the negative class.
We search for \(\varvec{\beta }\in \mathcal {H}\) such that the sets \(\varvec{\beta }^\mathrm{T}{\mathbf{H}}^+\) and \(\varvec{\beta }^\mathrm{T}{\mathbf{H}}^-\) are optimally separated.
 1.
add/allow some error in the data,
 2.
specify some objective function including a term penalizing the model’s complexity.

regression-based (like in neural networks or ELM),

probabilistic (like in the case of Naive Bayes),

geometric (like in SVM),

information theoretic (entropy models).

true datasets are discrete, so we do not know the actual densities f and g,

statistical density estimators require rather large sample sizes and are very computationally expensive.
A common choice of density model is the Gaussian distribution, due to its nice theoretical and practical (computational) properties. As mentioned earlier, a convex combination of Gaussians can approximate a given continuous distribution f with arbitrary precision. To fit a Gaussian mixture model (GMM) to a given dataset, one needs an algorithm such as Expectation Maximization [8] or the conceptually similar Cross-entropy clustering [25]. However, for simplicity and strong regularization, we propose to model f as one big Gaussian \(\mathcal {N}({{\mathbf{m}}},{\mathbf{\Sigma }})\). One of the biggest advantages of such an approach is closed-form MLE parameter estimation: we simply put \({\mathbf{m}}\) equal to the empirical mean of the data and \({\mathbf{\Sigma }}\) equal to some data covariance estimator. Second, this way we introduce an error to the data, which has an important regularizing role and leads to a better-posed optimization problem.
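The closed-form fit of a single Gaussian can be sketched as follows. The shrinkage toward a scaled identity mimics the regularized covariance estimation used later in the paper (the Ledoit–Wolf estimator); here the shrinkage coefficient is a fixed illustrative value rather than the Ledoit–Wolf estimate:

```python
import numpy as np

def fit_gaussian(H, shrinkage=0.1):
    """Closed-form fit of a single Gaussian N(m, Sigma) to the rows of H.

    The mean is the empirical mean; the covariance is the empirical
    covariance shrunk toward a scaled identity (a fixed illustrative
    coefficient stands in for the Ledoit-Wolf estimate)."""
    m = H.mean(axis=0)
    S = np.cov(H, rowvar=False)
    mu = np.trace(S) / S.shape[0]  # average eigenvalue of S
    S_reg = (1 - shrinkage) * S + shrinkage * mu * np.eye(S.shape[0])
    return m, S_reg
```

The shrinkage guarantees a symmetric, strictly positive definite covariance even when the sample covariance is singular.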
Lemma 1
The normal distribution \(\mathcal {N}({\mathbf{m}},{\mathbf{\Sigma }})\) has the maximum Shannon’s differential entropy among all real-valued distributions with mean \({{\mathbf{m}}}\in \mathbb {R}^h\) and covariance \({{\mathbf{\Sigma }}}\in \mathbb {R}^{h \times h}\).
Proof
Thus, our optimization problem can be stated as follows:
Problem 2
Spearman’s rank correlation coefficient between optimized term and whole \({D}_\mathrm{CS}\) for all datasets used in evaluation
Dataset  1  10  100  200  500 

australian  0.928  \(-\)0.022  0.295  0.161  0.235 
breastcancer  0.628  0.809  0.812  0.858  0.788 
diabetes  \(-\)0.983  \(-\)0.976  \(-\)0.941  \(-\)0.982  \(-\)0.952 
german.numer  0.916  0.979  0.877  0.873  0.839 
heart  0.964  0.829  0.931  0.91  0.893 
ionosphere  0.999  0.988  0.98  0.978  0.984 
liver disorders  0.232  0.308  0.363  0.33  0.312 
sonar  \(-\)0.31  \(-\)0.542  \(-\)0.41  \(-\)0.407  \(-\)0.381 
splice  \(-\)0.284  \(-\)0.036  \(-\)0.165  \(-\)0.118  \(-\)0.101 
abalone7  1.0  0.999  0.999  0.999  0.998 
arythmia  1.0  1.0  0.999  1.0  1.0 
balance  1.0  0.998  0.998  0.999  0.998 
car evaluation  1.0  0.998  0.998  0.997  0.997 
ecoli  0.964  0.994  0.995  0.998  0.995 
libras move  1.0  0.999  0.999  1.0  1.0 
oil spill  1.0  1.0  1.0  1.0  1.0 
sick euthyroid  1.0  0.999  1.0  1.0  1.0 
solar flare  1.0  1.0  1.0  1.0  1.0 
spectrometer  1.0  1.0  0.999  0.999  0.999 
forest cover  0.988  0.997  0.997  0.992  0.988 
isolet  0.784  1.0  0.997  0.997  0.999 
mammography  1.0  1.0  1.0  1.0  1.0 
protein homology  1.0  1.0  1.0  1.0  1.0 
webpages  1.0  1.0  1.0  0.999  0.999 
This means that, after the above reductions and the application of (2), our final problem can be stated as follows:
3.1 Optimization problem: extreme entropy machine
 For the Extreme Entropy Machine (EEM), we use the random projection technique, exactly the same as the one used in the ELM. In other words, given some generalized activation function \(\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} ,b) : \mathcal {X}\times \mathcal {X}\times \mathbb {R} \rightarrow \mathbb {R}\) and a constant h denoting the number of hidden neurons:$$\begin{aligned} \varphi : \mathcal {X}\ni {\mathbf{x}} \rightarrow [\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} _1,b_1),\dots ,\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} _h,b_h)]^\mathrm{T} \in \mathbb {R}^h \end{aligned}$$where \({\mathbf{w}} _i\) are random vectors and \(b_i\) are random biases.
 For the Extreme Entropy Kernel Machine (EEKM), we use the randomized kernel approximation technique [9], which spans our Hilbert space on a randomly selected subset of the training vectors. In other words, given a valid kernel \(\mathrm {K}(\cdot ,\cdot ) : \mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}_+\) and the size h of the kernel space base:$$\begin{aligned} \varphi _\mathrm {K}: \mathcal {X}\ni {\mathbf{x}} \rightarrow (\mathrm {K}({\mathbf{x}} ,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2})^\mathrm{T} \in \mathbb {R}^h \end{aligned}$$where \(\mathbf{X}^{[h]}\) is an h-element random subset of \(\mathbf{X}\). It is easy to verify that such a low-rank approximation truly behaves as a kernel, in the sense that for \(\varphi _\mathrm {K}({\mathbf{x}} _i), \varphi _\mathrm {K}({\mathbf{x}} _j) \in \mathbb {R}^{h}\)$$\begin{aligned}&\varphi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\varphi _\mathrm {K}({\mathbf{x}} _j) \\&\quad = ((\mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2})^\mathrm{T})^\mathrm{T} ( \mathrm {K}({\mathbf{x}} _j,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} )^\mathrm{T} \\&\quad = \mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} \mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} \mathrm {K}^\mathrm{T}({\mathbf{x}} _j,\mathbf{X}^{[h]}) \\&\quad = \mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1} \mathrm {K}(\mathbf{X}^{[h]},{\mathbf{x}} _j). \end{aligned}$$Given the true kernel projection \(\phi _\mathrm {K}\) such that \(\mathrm {K}({\mathbf{x}} _i,{\mathbf{x}} _j)=\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j)\), we have$$\begin{aligned}&\mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1} \mathrm {K}(\mathbf{X}^{[h]},{\mathbf{x}} _j) \\&\quad =\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}) (\phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}))^{-1} \phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j) \\&\quad =\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}) \phi _\mathrm {K}(\mathbf{X}^{[h]})^{-1} (\phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T})^{-1} \phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j) \\&\quad = \phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j) = \mathrm {K}({\mathbf{x}} _i,{\mathbf{x}} _j). \end{aligned}$$Thus, for the whole sample set, we have$$\begin{aligned} \varphi _\mathrm {K}(\mathbf{X})^\mathrm{T} \varphi _\mathrm {K}(\mathbf{X}) = \mathrm {K}(\mathbf{X},\mathbf{X}), \end{aligned}$$which is the complete Gram matrix.
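A minimal sketch of this low-rank construction follows, with the inverse square root computed via eigendecomposition; the Gaussian kernel helper and the `eps` clamp (guarding against tiny eigenvalues) are illustrative choices, not part of the paper’s formulation:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K(A, B)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystroem_features(X, X_sub, gamma=1.0, eps=1e-10):
    """phi_K(x) = (K(x, X^[h]) K(X^[h], X^[h])^{-1/2})^T.

    X_sub plays the role of the random subset X^[h]; the inverse square
    root of the (symmetric) base kernel matrix is built from its
    eigendecomposition, clamping eigenvalues below eps."""
    K_hh = rbf_kernel(X_sub, X_sub, gamma)
    w, V = np.linalg.eigh(K_hh)
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T
    return rbf_kernel(X, X_sub, gamma) @ inv_sqrt  # (N, h) feature map
```

When the subset is the whole sample, the inner products of the features recover the complete Gram matrix, exactly as derived above.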
Remark 1
Extreme Entropy Machine optimization problem is closely related to the SVM optimization, but instead of maximizing the margin between closest points we are maximizing the mean margin.
Proof
A similar observation, regarding the connection between large margin classification and entropy optimization, was made in the case of the Multithreshold Linear Entropy Classifier [7]. One should also notice important relations to other methods studying so-called margin distributions [10], such as the Large margin Distribution Machine [29]. Contrary to Zhang et al.’s approach, we minimize the summarized variances instead of minimizing the difference between the data variance and the cross-class variance. As a result, the proposed model is much easier to optimize (as shown below).
 1. \(S_- = S_+\): then there is one threshold$$\begin{aligned} r_0=m_- + 1, \end{aligned}$$which results in a traditional (one-threshold) linear classifier,
 2. \(S_- \ne S_+\): then there are two thresholds$$\begin{aligned} r_\pm = m_- + \tfrac{2S_- \pm \sqrt{S_-S_+(\ln (S_-/S_+)(S_--S_+)+4)}}{S_--S_+}, \end{aligned}$$which makes the resulting classifier a member of the two-thresholds linear classifiers family [1].
 1.
\(\mathrm {F}(x) = \left\{ \begin{array} {ll}+1,&\quad {\text {if }}\,\, x \ge r_0\\ -1, &\quad {\text {if }}\,\, x < r_0 \end{array} \right. = {\mathrm{sign}}(x-r_0),\)
 2.
\(\mathrm {F}(x) = \left\{ \begin{array}{ll} +1,&\quad {\text {if }}\,\, x \in [r_-,r_+]\\ -1,&\quad {\text {if }}\,\, x \notin [r_-,r_+] \end{array} \right. = -{\mathrm{sign}}(x-r_-)\, {\mathrm{sign}}(x-r_+),\) if \(r_-<r_+\), and \(\mathrm {F}(x) = \left\{ \begin{array}{ll} -1, &\quad {\text {if }}\,\, x \in [r_+,r_-]\\ +1,&\quad {\text {if }}\,\, x \notin [r_+,r_-] \end{array} \right. = {\mathrm{sign}}(x-r_-)\, {\mathrm{sign}}(x-r_+),\) otherwise.
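The thresholds above come from comparing the two projected Gaussian densities. A minimal sketch, assuming equal class priors, that solves the underlying equal-density quadratic directly rather than using the closed form given above:

```python
import numpy as np

def gaussian_thresholds(m_neg, s_neg, m_pos, s_pos):
    """Decision thresholds between 1-D Gaussians N(m_-, S_-) and N(m_+, S_+)
    with equal priors: the points where the two densities are equal.

    Returns one threshold when variances coincide, two otherwise."""
    if np.isclose(s_neg, s_pos):
        return [(m_neg + m_pos) / 2.0]  # single mid-point threshold
    # equal log-density condition rearranged into a*x^2 + b*x + c = 0
    a = 1.0 / s_pos - 1.0 / s_neg
    b = 2.0 * (m_neg / s_neg - m_pos / s_pos)
    c = m_pos**2 / s_pos - m_neg**2 / s_neg + np.log(s_pos / s_neg)
    disc = np.sqrt(b * b - 4.0 * a * c)
    return sorted([(-b - disc) / (2 * a), (-b + disc) / (2 * a)])
```

For equal variances this degenerates to the familiar mid-point rule; for unequal variances both roots are real, matching the two-threshold case.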
4 Theory: density estimation in the kernel case
To illustrate our reasoning, we consider a typical basic problem concerning density estimation.
Problem 3
Assume that we are given a finite dataset \({\mathbf{H}}\) in a Hilbert space \(\mathcal {H}\) generated by the unknown density f, and the goal consists in estimating f.
Since the problem in itself is infinite dimensional, typically the data will be linearly independent [20]. Moreover, one usually cannot obtain a reliable density estimation; the most we can hope for is that, after transformation by a linear functional into \(\mathbb {R}\), the resulting density will be well estimated.
To simplify the problem, assume therefore that we want to find the desired density in the class of normal densities, or equivalently, that we are interested only in the estimation of the mean and covariance of f.
A generalization of the above is given by the following problem:
Problem 4
Assume that we are given finite datasets \({\mathbf{h}}^\pm\) in a Hilbert space \(\mathcal {H}\), generated by unknown densities \(f^\pm\), and the goal consists in estimating these unknown densities.
In general, \(\text {dim}(\mathcal {H}) \gg N\), which means that we have very sparse data in terms of the Hilbert space. As a result, classical kernel density estimation (KDE) is not a reliable source of information [18]. In the absence of different tools, we can, however, use KDE with a very large kernel width to cover at least the general shape of the whole density.
Remark 2
Remark 3
5 Theory: learning capabilities
First, we show that under some simplifying assumptions, the proposed method behaves as an Extreme Learning Machine (or a Weighted Extreme Learning Machine [30]).
Before proceeding further, we remark that there are two popular notations for projecting data onto hyperplanes. One, used in the ELM model, assumes that \({\mathbf{H}}\) is a row matrix and \(\varvec{\beta }\) is a column vector, which results in the projection equation \({\mathbf{H}}\varvec{\beta }\). The second one, used in SVM and in our paper, assumes that both \({\mathbf{H}}\) and \(\varvec{\beta }\) are column oriented, which results in the projection \(\varvec{\beta }^\mathrm{T}{\mathbf{H}}\). In the following theorem, we show a duality between the \(\varvec{\beta }\) found by ELM and by EEM. To do so, we will need to change the notation during the proof, which will be indicated.
Theorem 1
Let us assume that we are given an arbitrary, balanced binary dataset which can be perfectly learned by an ELM with N hidden neurons. If the image of the dataset points through random neurons, \({\mathbf{H}}=\varphi (\mathbf{X})\), is centered (points’ images have 0 mean) and the classes have homogeneous covariances (we assume that there exist real \(a_+\) and \(a_-\) such that \(\mathrm {cov}({\mathbf{H}}) = a_+\mathrm {cov}({\mathbf{H}}^+) = a_-\mathrm {cov}({\mathbf{H}}^-)\)), then an EEM with the same hidden layer will also learn this dataset perfectly (with 0 error).
Proof
Similar result holds for EEKM and Least Squares Support Vector Machine.
Theorem 2
Let us assume that we are given an arbitrary, balanced binary dataset which can be perfectly learned by LSSVM. If the dataset points’ images through the kernel-induced projection \(\varphi _\mathrm {K}\) have homogeneous classes’ covariances (we assume that there exist real \(a_+\) and \(a_-\) such that \(\mathrm {cov}(\varphi _\mathrm {K}(\mathbf{X})) = a_+\mathrm {cov}(\varphi _\mathrm {K}(\mathbf{X}^+)) = a_-\mathrm {cov}(\varphi _\mathrm {K}(\mathbf{X}^-))\)), then an EEKM with the same kernel and N hidden neurons will also learn this dataset perfectly (with 0 error).
Proof
It is a direct consequence of the fact that with N hidden neurons and homogeneous class projection covariances, EEKM degenerates to the kernelized Fisher Discriminant which, as Gestel et al. showed [28], is equivalent to the solution of the Least Squares SVM. \(\square\)
Both theorems can be extended to non-balanced datasets if we consider a Weighted ELM and a Balanced LSSVM. The proposed method has a balanced nature, so it internally assumes that the class priors are equal to 1/2. In the proofs, we show that when this assumption holds, ELM and LSSVM lead (under some assumptions) to the same solution. If one includes the same assumption in these two methods (through Weighted ELM and Balanced LSSVM), then they will again solve the same problem regardless of the true class priors. We omit the exact proofs as they are analogous to the ones above.
6 Practical considerations
In the previous sections, we investigated the limiting case \(\mathrm {dim}(\mathcal {H}) =\infty\). In practice, however, we choose h random nonlinear projections which embed the data in a high-dimensional space (of dimension at least h; we can still consider it an image in a higher dimensional space, analogously to how the Gaussian kernel maps into an infinite-dimensional space by projecting through just N functions). As we show in the further evaluation, it is sufficient to use an h much smaller than N, so the resulting computational complexities, cubic in h, are acceptable.
We can formulate the whole EEM training as a very simple algorithm (see Algorithms 1, 2).

feature projection function \(\varphi\),

linear operator \(\varvec{\beta }\),

classification rule \(\mathrm {F}\).
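Algorithms 1 and 2 themselves are not reproduced in this excerpt. As a rough illustration of how simple the training is, the following sketch builds all three outputs: the projection \(\varphi\), the operator \(\varvec{\beta }\) and the rule \(\mathrm {F}\). It is a minimal sketch under stated assumptions, not the paper’s exact algorithm: a sigmoid hidden layer, a fixed shrinkage coefficient in place of the Ledoit–Wolf estimate, and the single mid-point threshold variant:

```python
import numpy as np

def eem_train(X_pos, X_neg, h=100, shrinkage=0.1, seed=0):
    """Minimal EEM sketch: random sigmoid projection, per-class Gaussians
    with shrunk covariances, and a Fisher-discriminant-style beta."""
    rng = np.random.default_rng(seed)
    d = X_pos.shape[1]
    W, b = rng.normal(size=(d, h)), rng.normal(size=h)
    phi = lambda X: 1.0 / (1.0 + np.exp(-(X @ W + b)))  # random hidden layer

    def gaussian(H):
        # closed-form Gaussian fit with identity shrinkage (stand-in for
        # the Ledoit-Wolf covariance estimator)
        m, S = H.mean(axis=0), np.cov(H, rowvar=False)
        S = (1 - shrinkage) * S + shrinkage * np.trace(S) / h * np.eye(h)
        return m, S

    m_p, S_p = gaussian(phi(X_pos))
    m_n, S_n = gaussian(phi(X_neg))
    beta = np.linalg.solve(S_p + S_n, m_p - m_n)  # Fisher-style operator
    r0 = beta @ (m_p + m_n) / 2.0                 # single mid-point threshold
    return lambda X: np.sign(phi(X) @ beta - r0)  # classification rule F
```

The full method may instead place the threshold(s) according to the two-threshold rule derived earlier; the mid-point choice here keeps the sketch short.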
Comparison of considered classifiers
ELM  SVM  LSSVM  EE(K)M  

Optimization method  Linear regression  Quadratic programming  Linear system  Fisher discriminant 
Nonlinearity  Random projection  Kernel  Kernel  Random (kernel) projection 
Closedform  Yes  No  Yes  Yes 
Balanced  No\(^{\mathrm{a}}\)  No\(^{\mathrm{a}}\)  No\(^{\mathrm{a}}\)  Yes 
Regression  Yes  No\(^{\mathrm{a}}\)  Yes  No 
Criterion  Mean squared error  Hinge loss  Mean squared error  Entropy optimization 
Learning theory  Huang et al. [11]  Vapnik et al. [4]  Suykens et al. [23]  This paper 
No. of thresholds  1  1  1  1 or 2 
Problem type  Regression  Classification  Regression  Classification 
Model learning  Discriminative  Discriminative  Discriminative  Generative 
Direct probability estimates  No  No  No  Yes 
Training complexity  \(\mathcal {O}(Nh^2)\)  \(\mathcal {O}(N^3)\)  \(\mathcal {O}(N^{2.34})\)  \(\mathcal {O}(Nh^2)\) 
Resulting model complexity  \(h\cdot d\)  \(SV\cdot d\), \(SV\ll N\)  \(N\cdot d+1\)  \(h\cdot d+4\) 
Memory requirements  \(\mathcal {O}(Nd)\)  \(\mathcal {O}(Nd)\)  \(\mathcal {O}(N^2)\)  \(\mathcal {O}(Nd)\) 
Source of regularization  Moore–Penrose pseudoinverse  Margin maximization  Quadratic loss penalty term  Ledoit–Wolf estimator 
Hyperparameters  h, G  C, K  C, K  h, G or h, K 
Number of classes  Any  2  2  2 
7 Evaluation
For the evaluation, we implemented five methods: Weighted Extreme Learning Machine (WELM [30]), Extreme Entropy Machine (EEM), Extreme Entropy Kernel Machine (EEKM), Least Squares Support Vector Machine (LSSVM [23]) and Support Vector Machine (SVM [4]). All experiments were performed on a machine with Intel Xeon 2.8 GHz processors with enough RAM to fit all required computations.

sigmoid (sig): \(\tfrac{1}{1+\exp (-(\langle {\mathbf{w}} ,{\mathbf{x}} \rangle + b))}\),

normalized sigmoid (nsig): \(\tfrac{1}{1+\exp (-(\langle {\mathbf{w}} ,{\mathbf{x}} \rangle /d + b))}\),

radial basis function (rbf): \(\exp (-b \Vert {\mathbf{w}} - {\mathbf{x}} \Vert ^2 )\).
The hyperparameters of each model were fitted via grid search over: hidden layer size \(h=50,100,250,500,1000\) (WELM, EEM, EEKM), Gaussian kernel width \(\gamma =10^{-10},\ldots ,10^{0}\) (EEKM, LSSVM, SVM), and SVM regularization parameter \(C=10^{-1},\ldots ,10^{10}\) (LSSVM, SVM).
Datasets’ features were linearly scaled (per feature) to the interval [0, 1]. No other data whitening/filtering was performed. All experiments were conducted in repeated ten-fold stratified cross-validation.
Characteristics of used datasets
Dataset  d  \(\mathbf{X}^-\)  \(\mathbf{X}^+\) 

australian  14  383  307 
breast cancer  9  444  239 
diabetes  8  268  500 
german numer  24  700  300 
heart  13  150  120 
liver disorders  6  145  200 
sonar  60  111  97 
splice  60  483  517 
abalone7  10  3786  391 
arythmia  261  427  25 
car evaluation  21  1594  134 
ecoli  7  301  35 
libras move  90  336  24 
oil spill  48  896  41 
sick euthyroid  42  2870  293 
solar flare  32  1321  68 
spectrometer  93  486  45 
forest cover  54  571519  9493 
isolet  617  7197  600 
mammography  6  10923  260 
protein homology  74  144455  1296 
webpages  300  33799  981 
7.1 Basic UCI datasets
We start our experiments with nine datasets from the UCI repository [2], namely australian, breast-cancer, diabetes, german.numer, heart, ionosphere, liver disorders, sonar and splice, summarized in Table 3. This collection includes rather balanced, low-dimensional problems.
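Performance throughout the tables is reported as GMean. Assuming the standard definition for binary problems (the geometric mean of the per-class recalls, i.e., sensitivity and specificity), it can be computed as:

```python
import numpy as np

def gmean(y_true, y_pred):
    """Geometric mean of the per-class recalls for labels in {-1, +1}.

    Equals 1 only for perfect classification of both classes, and drops
    to 0 whenever one class is missed entirely, which is why it suits
    unbalanced problems better than plain accuracy."""
    recalls = []
    for cls in (-1, 1):
        mask = (y_true == cls)
        recalls.append(np.mean(y_pred[mask] == cls))
    return float(np.sqrt(recalls[0] * recalls[1]))
```

The tables report the mean and standard deviation of this score over the cross-validation folds.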
GMean on all considered datasets
WELM\(_{\mathrm {sig}}\)  EEM\(_{\mathrm {sig}}\)  WELM\(_{\mathrm {nsig}}\)  EEM\(_{\mathrm {nsig}}\)  WELM\(_{\mathrm {rbf}}\)  EEM\(_{\mathrm {rbf}}\)  LSSVM\(_{\mathrm {rbf}}\)  EEKM\(_{\mathrm {rbf}}\)  SVM\(_{\mathrm {rbf}}\)  

australian  86.3 \(\pm\,4.5\)  87.0 \(\pm\,4.0\)  85.9 \(\pm\,4.4\)  86.5 \(\pm\,3.2\)  85.8 \(\pm\,4.9\)  86.9 \(\pm\,4.4\)  86.9 \(\pm\,4.1\)  86.8 \(\pm\,3.8\)  86.8 \(\pm\,3.7\) 
breastcancer  96.9 \(\pm\,1.7\)  97.3 \(\pm\,1.2\)  97.6 \(\pm\,1.5\)  97.4 \(\pm\,1.2\)  96.6 \(\pm\,1.8\)  97.3 \(\pm\,1.1\)  97.6 \(\pm\,1.3\)  97.8 \(\pm\,1.1\)  96.8 \(\pm\,1.7\) 
diabetes  74.2 \(\pm\,4.6\)  74.5 \(\pm\,4.6\)  74.1 \(\pm\,5.5\)  74.9 \(\pm\,5.0\)  73.2 \(\pm\,5.6\)  74.9 \(\pm\,5.9\)  75.5 \(\pm\,5.6\)  75.7 \(\pm\,5.6\)  74.8 \(\pm\,3.5\) 
german  68.8 \(\pm\,6.9\)  71.3 \(\pm\,4.1\)  70.7 \(\pm\,6.1\)  72.4 \(\pm\,5.4\)  71.1 \(\pm\,6.1\)  72.2 \(\pm\,5.7\)  73.2 \(\pm\,4.5\)  72.9 \(\pm\,5.3\)  73.4 \(\pm\,5.4\) 
heart  78.8 \(\pm\,6.3\)  82.5 \(\pm\,7.4\)  78.1 \(\pm\,7.0\)  83.7 \(\pm\,7.2\)  80.2 \(\pm\,8.9\)  81.9 \(\pm\,6.9\)  83.7 \(\pm\,8.5\)  83.6 \(\pm\,7.5\)  84.6 \(\pm\,7.0\) 
ionosphere  71.5 \(\pm\,9.5\)  77.0 \(\pm\,12.8\)  82.7 \(\pm\,7.8\)  84.6 \(\pm\,9.1\)  85.6 \(\pm\,8.4\)  90.8 \(\pm\,5.2\)  91.2 \(\pm\,5.5\)  93.4 \(\pm\,4.3\)  94.7 \(\pm\,3.9\) 
liver disorders  68.1 \(\pm\,8.0\)  68.6 \(\pm\,8.9\)  66.3 \(\pm\,8.2\)  62.1 \(\pm\,8.1\)  67.2 \(\pm\,5.9\)  71.4 \(\pm\,7.0\)  71.1 \(\pm\,8.3\)  70.2 \(\pm\,6.9\)  72.3 \(\pm\,6.2\) 
sonar  66.7 \(\pm\,10.1\)  70.1 \(\pm\,11.5\)  80.2 \(\pm\,7.4\)  78.3 \(\pm\,11.2\)  83.2 \(\pm\,6.9\)  82.8 \(\pm\,5.2\)  86.5 \(\pm\,5.4\)  87.0 \(\pm\,7.5\)  83.0 \(\pm\,7.1\) 
splice  64.7 \(\pm\,2.8\)  49.4 \(\pm\,5.5\)  81.8 \(\pm\,3.2\)  80.9 \(\pm\,2.7\)  75.5 \(\pm\,3.9\)  82.2 \(\pm\,3.5\)  89.9 \(\pm\,3.0\)  88.0 \(\pm\,4.0\)  88.0 \(\pm\,2.2\) 
abalone7  79.7 \(\pm\,2.3\)  79.8 \(\pm\,3.5\)  80.0 \(\pm\,2.8\)  76.1 \(\pm\,3.7\)  80.1 \(\pm\,3.2\)  79.7 \(\pm\,3.6\)  80.2 \(\pm\,3.4\)  79.9 \(\pm\,3.4\)  79.7 \(\pm\,2.7\) 
arythmia  28.3 \(\pm\,35.4\)  40.3 \(\pm\,20.9\)  64.2 \(\pm\,24.6\)  85.6 \(\pm\,10.3\)  66.9 \(\pm\,25.8\)  79.4 \(\pm\,12.5\)  84.4 \(\pm\,10.0\)  85.2 \(\pm\,10.6\)  80.9 \(\pm\,11.8\) 
car evaluation  99.1 \(\pm\,0.3\)  98.9 \(\pm\,0.4\)  99.0 \(\pm\,0.3\)  97.9 \(\pm\,0.6\)  99.0 \(\pm\,0.3\)  98.5 \(\pm\,0.3\)  99.5 \(\pm\,0.2\)  99.2 \(\pm\,0.3\)  100.0 \(\pm\,0.0\) 
ecoli  86.9 \(\pm\,6.5\)  88.3 \(\pm\,7.1\)  86.9 \(\pm\,6.8\)  88.6 \(\pm\,6.9\)  86.4 \(\pm\,7.0\)  88.8 \(\pm\,7.2\)  89.2 \(\pm\,6.3\)  89.4 \(\pm\,6.9\)  88.5 \(\pm\,6.2\) 
libras move  65.5 \(\pm\,10.7\)  19.3 \(\pm\,8.1\)  82.5 \(\pm\,12.0\)  93.0 \(\pm\,11.8\)  89.6 \(\pm\,11.9\)  93.9 \(\pm\,11.9\)  96.5 \(\pm\,8.6\)  96.6 \(\pm\,8.7\)  91.6 \(\pm\,11.9\) 
oil spill  86.0 \(\pm\,6.9\)  88.8 \(\pm\,6.5\)  83.8 \(\pm\,7.6\)  84.7 \(\pm\,8.7\)  85.8 \(\pm\,9.3\)  88.1 \(\pm\,6.1\)  86.7 \(\pm\,8.4\)  87.2 \(\pm\,4.9\)  85.7 \(\pm\,11.4\) 
sick euthyroid  88.1 \(\pm\,1.7\)  87.9 \(\pm\,2.4\)  88.5 \(\pm\,2.1\)  81.7 \(\pm\,2.7\)  89.1 \(\pm\,1.9\)  88.2 \(\pm\,2.4\)  89.5 \(\pm\,1.7\)  89.3 \(\pm\,1.9\)  90.9 \(\pm\,2.0\) 
solar flare  60.4 \(\pm\,16.8\)  63.7 \(\pm\,12.9\)  61.3 \(\pm\,10.8\)  67.4 \(\pm\,9.0\)  60.3 \(\pm\,14.8\)  68.9 \(\pm\,9.3\)  67.3 \(\pm\,8.8\)  67.3 \(\pm\,9.0\)  70.9 \(\pm\,8.5\) 
spectrometer  82.9 \(\pm\,13.0\)  87.3 \(\pm\,7.8\)  88.0 \(\pm\,10.8\)  90.2 \(\pm\,8.6\)  86.6 \(\pm\,8.2\)  93.0 \(\pm\,14.6\)  94.6 \(\pm\,8.4\)  93.5 \(\pm\,14.7\)  95.4 \(\pm\,5.1\) 
forest cover  90.8 \(\pm\,0.3\)  90.5 \(\pm\,0.3\)  90.7 \(\pm\,0.3\)  85.1 \(\pm\,0.4\)  90.9 \(\pm\,0.3\)  87.1 \(\pm\,0.0\)  –  91.8 \(\pm\,0.3\)  – 
isolet  0.0 \(\pm\,0.0\)  0.0 \(\pm\,0.0\)  96.3 \(\pm\,0.7\)  95.6 \(\pm\,1.1\)  93.0 \(\pm\,0.9\)  91.4 \(\pm\,1.0\)  98.0 \(\pm\,0.7\)  97.4 \(\pm\,0.6\)  97.6 \(\pm\,0.6\) 
mammography  90.4 \(\pm\,2.8\)  89.0 \(\pm\,3.2\)  90.7 \(\pm\,3.3\)  87.2 \(\pm\,3.0\)  89.9 \(\pm\,3.8\)  89.5 \(\pm\,3.1\)  91.0 \(\pm\,3.1\)  89.5 \(\pm\,3.1\)  89.8 \(\pm\,3.8\) 
protein homology  95.3 \(\pm\,0.8\)  94.9 \(\pm\,0.8\)  95.1 \(\pm\,0.9\)  94.2 \(\pm\,1.3\)  95.0 \(\pm\,1.0\)  95.1 \(\pm\,1.1\)  –  95.7 \(\pm\,0.9\)  – 
webpages  72.0 \(\pm\,0.0\)  73.1 \(\pm\,2.0\)  93.0 \(\pm\,1.8\)  93.1 \(\pm\,1.7\)  86.7 \(\pm\,0.0\)  84.4 \(\pm\,1.6\)  –  93.1 \(\pm\,1.7\)  93.1 \(\pm\,1.7\) 
7.2 Highly unbalanced datasets
In the second part, we considered nine highly unbalanced datasets, summarized in the second part of Table 3. The ratio between the bigger and the smaller class varies from 10:1 to even 20:1, which makes them really hard learning problems. The obtained results (see Table 4) resemble those obtained on the UCI repository. We see better results in about half of the experiments if we fix a particular activation function/kernel (so we compare ELM\(_x\) with EEM\(_x\) and LSSVM\(_x\) with EEKM\(_x\)).
Training times on the highly unbalanced datasets (in seconds), using a machine with Intel Xeon 2.8 GHz processors
WELM\(_{\mathrm {sig}}\)  EEM\(_{\mathrm {sig}}\)  WELM\(_{\mathrm {nsig}}\)  EEM\(_{\mathrm {nsig}}\)  WELM\(_{\mathrm {rbf}}\)  EEM\(_{\mathrm {rbf}}\)  LSSVM\(_{\mathrm {rbf}}\)  EEKM\(_{\mathrm {rbf}}\)  SVM\(_{\mathrm {rbf}}\)  

abalone7  1.9  1.2  2.5  1.6  1.8  1.2  20.8  1.9  4.7 
arythmia  0.2  0.7  0.3  0.9  0.3  0.7  0.1  0.3  0.1 
car evaluation  1.3  0.9  1.5  1.0  1.1  0.9  2.0  1.4  0.1 
ecoli  0.2  0.8  0.2  0.8  0.1  0.7  0.0  0.1  0.2 
libras move  0.2  0.9  0.2  0.8  0.1  0.7  0.0  0.1  0.0 
oil spill  0.7  0.8  0.6  0.8  0.6  0.8  0.4  0.9  0.1 
sick euthyroid  1.5  1.1  1.4  1.1  1.5  1.1  9.6  1.7  21.0 
solar flare  0.7  0.8  0.7  0.8  0.8  0.8  1.1  1.3  16.1 
spectrometer  0.2  0.7  0.3  0.7  0.2  0.7  0.1  0.3  0.0 
forest cover  110.7  104.6  144.9  45.6  111.3  38.2  \({>}600\)  107.4  \({>}600\) 
isolet  9.7  4.5  4.9  3.0  3.4  2.1  126.9  3.2  53.5 
mammography  4.0  2.2  6.1  3.0  4.0  2.2  327.3  3.3  9.5 
protein homology  27.6  21.6  86.3  27.9  62.5  22.0  \({>}600\)  30.7  \({>}600\) 
webpages  16.0  6.2  14.5  8.5  7.1  6.4  \({>}600\)  9.0  217.0 
7.3 Extremely unbalanced datasets
The third part of the experiments involves the analysis of extremely unbalanced datasets (with class imbalance up to 100:1) containing tens and hundreds of thousands of examples. The five analyzed datasets span from NLP tasks (webpages) through medical applications (mammography) to bioinformatics (protein homology). This type of dataset often occurs in real data mining, which makes these results much more practical than the ones obtained on small/balanced data. The hyperparameters of each method were carefully fitted as described in the previous section.
The scores obtained on the Isolet dataset (see Table 4) for sigmoid-based random projections are a result of very high values (\({\approx } 200\)) of \(\langle {\mathbf{x}} ,{\mathbf{w}} \rangle\) for all \({\mathbf{x}}\), which results in \(\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} ,b)=1\), so the whole dataset is reduced to the singleton \(\{ [1,\ldots ,1]^\mathrm{T} \} \subset \mathbb {R}^h \subset \mathcal {H}\), which obviously is not separable by any classifier, neither ELM nor EEM.
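This saturation effect is easy to reproduce numerically (a small illustration, assuming the sigmoid form of \(\mathrm {G}\)):

```python
import numpy as np

# When <x, w> is around 200 (as on isolet), the sigmoid saturates to 1
# regardless of any moderate bias b, so every sample gets the same
# hidden-layer image [1, ..., 1]^T.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
activations = sigmoid(200.0 + np.linspace(-5.0, 5.0, 11))
assert np.allclose(activations, 1.0)
```

Normalizing the inner product by the input dimension (the nsig activation) avoids exactly this collapse, which is consistent with the much better nsig scores on Isolet.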
7.4 Entropybased hyperparameters optimization
UCI datasets GMean with parameter tuning based on selecting a model according to (a) \({D}_\mathrm {CS}(\mathcal {N}(\varvec{\beta }^\mathrm{T} {\mathbf{m}}^+, \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^+ \varvec{\beta }),\mathcal {N}(\varvec{\beta }^\mathrm{T} {\mathbf{m}}^-, \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^- \varvec{\beta }))\) and (b) \({D}_\mathrm {CS}([\![ \varvec{\beta }^\mathrm{T} {\mathbf{h}}^+ ]\!],[\![\varvec{\beta }^\mathrm{T} {\mathbf{h}}^- ]\!])\), where \(\varvec{\beta }\) is the linear operator found by a particular optimization procedure, instead of internal cross-validation
WELM\(_{\mathrm {sig}}\)  EEM\(_{\mathrm {sig}}\)  WELM\(_{\mathrm {nsig}}\)  EEM\(_{\mathrm {nsig}}\)  WELM\(_{\mathrm {rbf}}\)  EEM\(_{\mathrm {rbf}}\)  LSSVM\(_{\mathrm {rbf}}\)  EEKM\(_{\mathrm {rbf}}\)  SVM\(_{\mathrm {rbf}}\)  

(a) \({D}_\mathrm {CS}(\mathcal {N}(\varvec{\beta }^\mathrm{T} {\mathbf{m}}^+, \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^+ \varvec{\beta }),\mathcal {N}(\varvec{\beta }^\mathrm{T} {\mathbf{m}}^-, \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^- \varvec{\beta }))\)  
australian  51.2 \(\pm\,7.5\)  86.3 \(\pm\,4.8\)  50.3 \(\pm\,6.4\)  86.5 \(\pm\,3.2\)  50.3 \(\pm\,8.5\)  86.2 \(\pm\,5.3\)  58.5 \(\pm\,7.9\)  85.2 \(\pm\,5.6\)  85.7 \(\pm\,4.7\) 
breastcancer  83.0 \(\pm\,4.3\)  97.0 \(\pm\,1.6\)  72.0 \(\pm\,6.6\)  97.1 \(\pm\,1.9\)  77.3 \(\pm\,5.3\)  97.3 \(\pm\,1.1\)  79.2 \(\pm\,7.7\)  96.9 \(\pm\,1.4\)  97.5 \(\pm\,1.2\) 
diabetes  52.3 \(\pm\,4.7\)  74.4 \(\pm\,4.0\)  51.7 \(\pm\,4.0\)  74.7 \(\pm\,5.2\)  52.1 \(\pm\,3.7\)  73.5 \(\pm\,5.9\)  60.1 \(\pm\,4.2\)  72.2 \(\pm\,5.4\)  73.2 \(\pm\,5.9\) 
german  57.1 \(\pm\,4.0\)  69.3 \(\pm\,5.0\)  51.7 \(\pm\,3.0\)  72.4 \(\pm\,5.4\)  52.8 \(\pm\,6.3\)  70.9 \(\pm\,6.9\)  55.0 \(\pm\,4.3\)  67.8 \(\pm\,5.7\)  60.5 \(\pm\,4.5\) 
heart  68.6 \(\pm\,5.8\)  79.4 \(\pm\,6.9\)  65.6 \(\pm\,5.9\)  82.9 \(\pm\,7.4\)  60.3 \(\pm\,9.4\)  77.4 \(\pm\,7.2\)  66.2 \(\pm\,4.2\)  77.7 \(\pm\,7.0\)  76.5 \(\pm\,6.6\) 
ionosphere  \(62.7\pm\,10.6\)  \(77.0\pm\,12.8\)  68.5 \(\pm\,5.1\)  84.6 \(\pm\,9.1\)  69.5 \(\pm\,9.6\)  90.8 \(\pm\,5.2\)  72.8 \(\pm\,6.1\)  93.4 \(\pm\,4.2\)  94.7 \(\pm\,3.9\) 
liver disorders  53.2 \(\pm\,7.0\)  68.5 \(\pm\,6.7\)  52.2 \(\pm\,11.8\)  62.1 \(\pm\,8.1\)  53.9 \(\pm\,8.0\)  71.4 \(\pm\,7.0\)  62.9 \(\pm\,7.8\)  69.6 \(\pm\,8.2\)  66.9 \(\pm\,8.0\) 
sonar  66.3 \(\pm\,6.1\)  66.1 \(\pm\,15.0\)  80.2 \(\pm\,7.4\)  76.9 \(\pm\,5.2\)  83.2 \(\pm\,6.9\)  82.8 \(\pm\,5.2\)  85.9 \(\pm\,4.9\)  87.7 \(\pm\,6.1\)  86.6 \(\pm\,3.3\) 
splice  51.8 \(\pm\,4.3\)  49.4 \(\pm\,5.5\)  64.9 \(\pm\,3.1\)  80.2 \(\pm\,2.6\)  60.8 \(\pm\,3.5\)  82.2 \(\pm\,3.5\)  89.7 \(\pm\,3.3\)  88.0 \(\pm\,4.0\)  89.5 \(\pm\,2.9\) 
(b) \({D}_\mathrm {CS}([\![ \varvec{\beta }^\mathrm{T} {\mathbf{h}}^+ ]\!],[\![\varvec{\beta }^\mathrm{T} {\mathbf{h}}^- ]\!])\)  
australian  51.2 \(\pm\,7.5\)  86.3 \(\pm\,4.8\)  50.3 \(\pm\,6.4\)  86.5 \(\pm\,3.2\)  50.3 \(\pm\,8.5\)  86.2 \(\pm\,5.3\)  58.5 \(\pm\,7.9\)  85.2 \(\pm\,5.6\)  84.2 \(\pm\,4.1\) 
breast-cancer  83.0 \(\pm\,4.3\)  97.0 \(\pm\,1.6\)  72.0 \(\pm\,6.6\)  97.4 \(\pm\,1.2\)  77.3 \(\pm\,5.3\)  97.3 \(\pm\,1.1\)  79.3 \(\pm\,7.1\)  96.9 \(\pm\,1.4\)  96.3 \(\pm\,2.4\) 
diabetes  52.3 \(\pm\,4.7\)  74.4 \(\pm\,4.0\)  51.7 \(\pm\,4.0\)  74.7 \(\pm\,5.2\)  52.1 \(\pm\,3.7\)  73.5 \(\pm\,5.9\)  60.1 \(\pm\,4.2\)  72.2 \(\pm\,5.4\)  71.9 \(\pm\,5.4\) 
german  57.1 \(\pm\,4.0\)  69.3 \(\pm\,5.0\)  51.7 \(\pm\,3.0\)  71.7 \(\pm\,5.9\)  52.8 \(\pm\,6.3\)  70.9 \(\pm\,6.9\)  54.4 \(\pm\,5.7\)  67.8 \(\pm\,5.7\)  59.5 \(\pm\,4.2\) 
heart  60.0 \(\pm\,9.2\)  79.4 \(\pm\,6.9\)  65.6 \(\pm\,5.9\)  82.9 \(\pm\,7.4\)  \(52.6 \pm\,9.0\)  77.4 \(\pm\,7.2\)  61.9 \(\pm\,5.8\)  77.7 \(\pm\,7.0\)  76.3 \(\pm\,7.7\) 
ionosphere  62.4 \(\pm\,8.1\)  \(77.0\pm\,12.8\)  68.5 \(\pm\,5.1\)  84.6 \(\pm\,9.1\)  \(67.6 \pm\,9.8\)  90.8 \(\pm\,5.2\)  67.0 \(\pm\,10.7\)  93.4 \(\pm\,4.2\)  92.3 \(\pm\,4.6\) 
liver disorders  50.9 \(\pm\,11.5\)  68.5 \(\pm\,6.7\)  50.4 \(\pm\,9.2\)  62.1 \(\pm\,8.1\)  53.9 \(\pm\,8.0\)  71.4 \(\pm\,7.0\)  62.9 \(\pm\,7.8\)  69.6 \(\pm\,8.2\)  66.9 \(\pm\,8.0\) 
sonar  66.3 \(\pm\,6.1\)  66.1 \(\pm\,15.0\)  80.2 \(\pm\,7.4\)  76.9 \(\pm\,5.2\)  62.9 \(\pm\,9.4\)  82.8 \(\pm\,5.2\)  83.6 \(\pm\,4.5\)  87.7 \(\pm\,6.1\)  86.6 \(\pm\,3.3\) 
splice  51.8 \(\pm\,4.3\)  33.1 \(\pm\,6.5\)  64.9 \(\pm\,3.1\)  80.2 \(\pm\,2.6\)  60.8 \(\pm\,3.5\)  82.2 \(\pm\,3.5\)  85.4 \(\pm\,4.1\)  88.0 \(\pm\,4.0\)  89.5 \(\pm\,2.9\) 
7.5 EEM stability
One can notice that, similarly to ELM, the proposed methods are very stable. Once the machine gets enough neurons (around 100 for the tested datasets), further increasing the dimension of the feature space has only a minor effect on the generalization capabilities of the model. It is also worth noting that some of these datasets (like sonar) do not even have 500 points, so the Hilbert space has more dimensions than we have points to build our covariance estimates, and even then we do not observe any rapid overfitting.
8 Conclusions
In this paper, we have presented Extreme Entropy Machines, models derived from information theoretic measures and applied to classification problems. The proposed methods are strongly related to the concepts of Extreme Learning Machines (in terms of the general workflow, rapid training and randomization) as well as Support Vector Machines (in terms of the margin maximization interpretation and the LSSVM duality). The main characteristics of the proposed approach are:

information theoretic background based on differential and Renyi’s quadratic entropies,

closed-form solution of the optimization problem,

generative training, leading to direct probability estimates,

small number of hyperparameters,

good classification results,

rapid training that scales well to hundreds of thousands of examples and beyond,

theoretical and practical similarities to the large margin classifiers and Fisher Discriminant.
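The workflow summarized above — an ELM-style random hidden layer followed by a closed-form, Fisher-discriminant-like linear solution built from class means and covariances (cf. the Mahalanobis form in footnote 3) — can be sketched as follows. This is an illustrative reading of the paper, not its reference implementation; the regularization constant, the midpoint threshold, and the helper names are assumptions:

```python
import numpy as np

def eem_train(X, y, h=100, seed=0):
    """Illustrative EEM-style training: random hidden layer, then a
    closed-form Fisher-like solution beta = (S+ + S-)^(-1) (m+ - m-)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], h))
    b = rng.standard_normal(h)
    H = np.tanh(X @ W + b)                    # random projection, sigmoid-type nodes
    Hp, Hn = H[y == 1], H[y == -1]
    mp, mn = Hp.mean(0), Hn.mean(0)
    # covariances lightly regularized so their sum is invertible (assumed constant)
    Sp = np.cov(Hp, rowvar=False) + 1e-3 * np.eye(h)
    Sn = np.cov(Hn, rowvar=False) + 1e-3 * np.eye(h)
    beta = np.linalg.solve(Sp + Sn, mp - mn)  # Mahalanobis-type separation direction
    thr = beta @ (mp + mn) / 2                # midpoint threshold (simplification)
    return W, b, beta, thr

def eem_predict(model, X):
    W, b, beta, thr = model
    return np.where(np.tanh(X @ W + b) @ beta > thr, 1, -1)

# toy usage: two well-separated Gaussian blobs
rng = np.random.default_rng(1)
Xp = rng.standard_normal((100, 5)) + 2.0
Xn = rng.standard_normal((100, 5)) - 2.0
X = np.vstack([Xp, Xn])
y = np.array([1] * 100 + [-1] * 100)
model = eem_train(X, y, h=50)
acc = (eem_predict(model, X) == y).mean()
```

Note how close this is to the "trivial implementation" promised in the introduction: the only learned quantity is the single linear solve for \(\varvec{\beta }\).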

Can one construct a closed-form entropy-based classifier with distribution families other than Gaussians? It remains an open problem whether this is possible even for a convex combination of two Gaussians.

Is there a theoretical justification of the stability of the extreme learning techniques? In particular, can one show whether performing a random projection is equivalent to some prior on the decision function space, as in the case of kernels?

Is it possible to further improve the achieved results by performing unsupervised entropy-based optimization in the hidden layer? For Gaussian nodes one could use some GMM clustering techniques, but is there an efficient way of selecting nodes with different activation functions, such as ReLU?
Footnotes
 1.
There are some pruning techniques for LSSVM, but we do not investigate them here.
 2.
Which is often obtained by the kernel approach.
 3.
Where \(\Vert {\mathbf{m}}^+-{\mathbf{m}}^-\Vert ^2_{{\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-}=({\mathbf{m}}^+-{\mathbf{m}}^-)^\mathrm{T}({\mathbf{\Sigma }}^++{\mathbf{\Sigma }}^-)^{-1}({\mathbf{m}}^+-{\mathbf{m}}^-)\) denotes the squared Mahalanobis norm of \({\mathbf{m}}^+-{\mathbf{m}}^-\).
References
 1. Anthony M (2003) Learning multivalued multithreshold functions. CDAM Research Report No. LSE-CDAM-2003-03, London School of Economics
 2. Bache K, Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 30 June 2015
 3. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27
 4. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
 5. Cover TM, Thomas JA (2012) Elements of information theory. Wiley, New York
 6. Czarnecki WM, Tabor J (2014) Cluster based RBF kernel for support vector machines. ArXiv e-prints. http://arxiv.org/abs/1408.2869. Accessed 30 June 2015
 7. Czarnecki WM, Tabor J (2014) Multithreshold Entropy Linear Classifier: theory and applications. Expert Syst Appl 42(13):5591–5606
 8. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodological):1–38
 9. Drineas P, Mahoney MW (2005) On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J Mach Learn Res 6:2153–2175
 10. Durrant RJ, Kaban A (2013) Sharp generalization error bounds for randomly-projected classifiers. In: Proceedings of the International Conference on Machine Learning (ICML), pp 693–701
 11. Huang GB, Zhu QY, Siew CK (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, vol 2. IEEE, pp 985–990
 12. Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1):489–501
 13. Jenssen R, Principe JC, Erdogmus D, Eltoft T (2006) The Cauchy–Schwarz divergence and Parzen windowing: connections to graph theory and Mercer kernels. J Frankl Inst 343(6):614–629
 14. Jones E, Oliphant T, Peterson P (2001) SciPy: open source scientific tools for Python. http://www.scipy.org/. Accessed 30 June 2015
 15. Kulkarni SR, Lugosi G, Venkatesh SS (1998) Learning pattern classification: a survey. IEEE Trans Inf Theory 44(6):2178–2206
 16. Ledoit O, Wolf M (2004) A well-conditioned estimator for large-dimensional covariance matrices. J Multivar Anal 88(2):365–411
 17. Mahalanobis PC (1936) On the generalized distance in statistics. Proc Natl Inst Sci (Calcutta) 2:49–55
 18. Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33:1065–1076
 19. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
 20. Poggio T, Girosi F (1989) A theory of networks for approximation and learning. Tech. rep., DTIC document
 21. Principe JC (2000) Information theoretic learning. Springer, Berlin
 22. Silverman BW (1986) Density estimation for statistics and data analysis, vol 26. CRC Press, Boca Raton
 23. Suykens JA, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9(3):293–300
 24. Suykens JA, De Brabanter J, Lukas L, Vandewalle J (2002) Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing 48(1):85–105
 25. Tabor J, Spurek P (2014) Cross-entropy clustering. Pattern Recogn 47(9):3046–3059
 26. Titterington DM, Smith AF, Makov UE et al (1985) Statistical analysis of finite mixture distributions, vol 7. Wiley, New York
 27. Van Der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13(2):22–30
 28. Van Gestel T, Suykens JA, Baesens B, Viaene S, Vanthienen J, Dedene G, De Moor B, Vandewalle J (2004) Benchmarking least squares support vector machine classifiers. Mach Learn 54(1):5–32
 29. Zhang T, Zhou ZH (2014) Large margin distribution machine. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp 313–322
 30. Zong W, Huang GB, Chen Y (2013) Weighted extreme learning machine for imbalance learning. Neurocomputing 101:229–242
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.