Extreme entropy machines: robust information theoretic classification

Most existing classification methods are aimed at minimization of empirical risk (through some simple point-based error measured with loss function) with added regularization. We propose to approach the classification problem by applying entropy measures as a model objective function. We focus on quadratic Renyi’s entropy and connected Cauchy–Schwarz Divergence which leads to the construction of extreme entropy machines (EEM). The main contribution of this paper is proposing a model based on the information theoretic concepts which on the one hand shows new, entropic perspective on known linear classifiers and on the other leads to a construction of very robust method competitive with the state of the art non-information theoretic ones (including Support Vector Machines and Extreme Learning Machines). Evaluation on numerous problems spanning from small, simple ones from UCI repository to the large (hundreds of thousands of samples) extremely unbalanced (up to 100:1 classes’ ratios) datasets shows wide applicability of the EEM in real-life problems. Furthermore, it scales better than all considered competitive methods.


Introduction
There is no single, universal, perfect optimization criterion that can be used to train a machine learning model. Even for linear classifiers one can find multiple objective functions, error measures to minimize, and regularization methods to include [14]. Most of the existing methods are aimed at minimization of empirical risk (through some simple point-based error measured with a loss function) with added regularization. We propose to approach this problem in a more information theoretic way by investigating the applicability of entropy measures as a classification model objective function. We focus on Renyi's quadratic entropy and the connected Cauchy-Schwarz Divergence.
One of the information theoretic concepts which has been found very effective in machine learning is the entropy measure. In particular, the rule of maximum entropy modeling led to the construction of the MaxEnt model and its structural generalization, Conditional Random Fields, which are considered state of the art in many applications. In this paper we propose to use Renyi's quadratic cross entropy as the measure of divergence between two density estimates in order to find the best linear classifier. It is a conceptually different approach than typical entropy models, as it works in the input space instead of the distribution of decisions. As a result we obtain a model closely related to Fisher's Discriminant (or more generally Linear Discriminant Analysis), which deepens the understanding of this classical approach. Together with a powerful extreme data transformation we obtain a robust, nonlinear model competitive with state of the art models not based on information theory, like Support Vector Machines (SVM [4]), Extreme Learning Machines (ELM [10]) or Least Squares Support Vector Machines (LS-SVM [21]). We also show that under some simplifying assumptions ELM and LS-SVM can be seen through the perspective of information theory, as their solutions are (up to some constants) identical to the ones obtained by the proposed method.
The paper is structured as follows: first we recall some preliminary information regarding ELMs and Support Vector Machines, including Least Squares Support Vector Machines. Next we introduce our Extreme Entropy Machine (EEM) together with its kernelized extreme counterpart, the Extreme Entropy Kernel Machine (EEKM). We show some connections with existing models and some different perspectives for looking at the proposed model. In particular, we show how the learning capabilities of EEM (and EEKM) resemble those of ELM (and LS-SVM respectively). During evaluation on over 20 binary datasets (of various sizes and characteristics) we analyze generalization capabilities and learning speed. We show that it can be a valuable, robust alternative for existing methods. In particular, we show that it achieves stability with respect to the hidden layer size analogous to that of ELM. We conclude with future development plans and open problems.

Preliminaries
Let us begin by recalling some basic information regarding Extreme Learning Machines [11] and Support Vector Machines [4], which are further used as competing models for the proposed solution. We focus here on the optimization problems being solved, to underline some basic differences between these methods and EEMs.

Extreme Learning Machines
Extreme Learning Machines are relatively young models introduced by Huang et al. [10] which are based on the idea that single layer feed forward neural networks (SLFN) can be trained without an iterative process by performing linear regression on the data mapped through a random, nonlinear projection (random hidden neurons). More precisely speaking, the basic ELM architecture consists of d input neurons, one for each input space dimension, which are fully connected with h hidden neurons by the set of weights $w_j$ (selected randomly from some arbitrary distribution) and the set of biases $b_j$ (also randomly selected). Given some generalized nonlinear activation function G one can express the hidden neurons activation matrix H for the whole training set $(X, T) = \{(x_i, t_i)\}_{i=1}^{N}$ such that $x_i \in \mathbb{R}^d$ and $t_i \in \{-1, +1\}$ as
$$H_{ij} = G(x_i, w_j, b_j), \quad i = 1, \dots, N, \ j = 1, \dots, h.$$
If we denote the weights between the hidden layer and the output neurons by β, it is easy to show [11] that putting
$$\beta = H^{\dagger} t$$
gives the best solution in terms of mean squared error of the regression:
$$\|H\beta - t\|^2 = \min_{\alpha \in \mathbb{R}^h} \|H\alpha - t\|^2,$$
where $H^{\dagger}$ denotes the Moore-Penrose pseudoinverse of the matrix H.
Final classification of a new point x can now be performed analogously by classifying according to
$$cl(x) = \mathrm{sign}\left([G(x, w_1, b_1), \dots, G(x, w_h, b_h)]\,\beta\right).$$
As it is based on ordinary least squares optimization, it is possible to balance it for unbalanced datasets by performing weighted ordinary least squares. In such a scenario, given a vector B such that $B_i$ is the square root of the inverse of the size of $x_i$'s class, and with B ⋅ X denoting element-wise multiplication between B and X:
$$\beta = (B \cdot H)^{\dagger} (B \cdot t).$$
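The ELM training procedure recalled above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the sigmoid activation, the Gaussian distribution of the random weights, the toy two-cluster data and the function names `elm_train` and `elm_predict` are assumptions made for the example, not choices stated in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, t, h, rng, weighted=False):
    """Random hidden layer followed by a least-squares fit of the output weights."""
    W = rng.normal(size=(X.shape[1], h))   # random input weights w_j (assumed Gaussian)
    b = rng.normal(size=h)                 # random biases b_j
    H = sigmoid(X @ W + b)                 # hidden activation matrix, shape (N, h)
    if weighted:
        # B_i = 1 / sqrt(size of the class of x_i), applied row-wise to H and t
        n_pos, n_neg = np.sum(t == 1), np.sum(t == -1)
        B = np.where(t == 1, 1.0 / np.sqrt(n_pos), 1.0 / np.sqrt(n_neg))
        beta = np.linalg.pinv(B[:, None] * H) @ (B * t)
    else:
        beta = np.linalg.pinv(H) @ t       # beta = H^+ t
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.where(sigmoid(X @ W + b) @ beta >= 0, 1, -1)

# Toy data (assumption): two well separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
t = np.concatenate([-np.ones(50), np.ones(50)])
W, b, beta = elm_train(X, t, h=20, rng=rng)
acc = np.mean(elm_predict(X, W, b, beta) == t)
print("training accuracy:", acc)
```

Passing `weighted=True` switches to the class-balanced variant, which only rescales the rows of H and the targets before the pseudoinverse is taken.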

Support Vector Machines and Least Squares Support Vector Machines
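One of the most well known classifiers of the last decade is Vapnik's Support Vector Machine (SVM [4]), based on the principle of creating a linear classifier that maximizes the separating margin between elements of two classes.
Optimization problem: Support Vector Machine
$$\min_{\beta, b, \xi} \ \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad t_i(\beta^T x_i + b) \geq 1 - \xi_i, \ \ \xi_i \geq 0, \ \ i = 1, \dots, N,$$
which can be further kernelized (delinearized) using any kernel K (valid in Mercer's sense):
Optimization problem: Kernel Support Vector Machine
$$\max_{\beta} \ \sum_{i=1}^{N} \beta_i - \frac{1}{2} \sum_{i,j=1}^{N} \beta_i \beta_j t_i t_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{N} \beta_i t_i = 0, \ \ 0 \leq \beta_i \leq C, \ \ i = 1, \dots, N.$$
The problem is a quadratic optimization with linear constraints, which can be efficiently solved using quadratic programming techniques. Due to the use of the hinge loss function on $\xi_i$, SVM attains very sparse solutions in terms of nonzero $\beta_i$. As a result, the classifier does not have to remember the whole training set, but only the set of so called support vectors ($SV = \{x_i : \beta_i > 0\}$), and classifies a new point according to
$$cl(x) = \mathrm{sign}\Big(\sum_{x_i \in SV} \beta_i t_i K(x_i, x) + b\Big).$$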
It appears that if we change the loss function to the quadratic one we can greatly reduce the complexity of the resulting optimization problem, leading to the so called Least Squares Support Vector Machines (LS-SVM).
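Optimization problem: Least Squares Support Vector Machine
$$\min_{\beta, b, \xi} \ \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{N} \xi_i^2 \quad \text{subject to} \quad t_i(\beta^T x_i + b) = 1 - \xi_i, \ \ i = 1, \dots, N,$$
and the decision is made according to
$$cl(x) = \mathrm{sign}(\beta^T x + b).$$
As Suykens et al. showed [21], this can be further generalized to arbitrary kernel induced spaces, where we classify according to
$$cl(x) = \mathrm{sign}\Big(\sum_{i=1}^{N} \beta_i K(x_i, x) + b\Big),$$
where $\beta_i$ are Lagrange multipliers associated with particular training examples $x_i$ and b is a threshold, found by solving the linear system
$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K(X, X) + I/C \end{bmatrix} \begin{bmatrix} b \\ \beta \end{bmatrix} = \begin{bmatrix} 0 \\ t \end{bmatrix},$$
where $\mathbf{1}$ is a vector of ones and I is an identity matrix of appropriate dimensions. Thus the training procedure becomes
$$\begin{bmatrix} b \\ \beta \end{bmatrix} = \begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K(X, X) + I/C \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ t \end{bmatrix}.$$
Similarly to the classical SVM, this formulation is highly unbalanced (its results are skewed towards bigger classes). To overcome this issue one can introduce a weighted version [20], in which the identity matrix in the above system is replaced by a term built from a diagonal matrix of weights Q, such that $Q_{ii}$ is inversely proportional to the size of $x_i$'s class.
Unfortunately, due to the introduction of the square loss, the sparseness of the Support Vector Machine solution is completely lost. The resulting training has a closed form solution, but requires the computation of the whole Gram matrix, and the resulting machine has to remember the whole training set (footnote 1) in order to classify a new point.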
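To make the closed form LS-SVM training concrete, the sketch below builds the bordered linear system in its regression form (targets t on the right-hand side) and solves it with NumPy. The RBF kernel, the values of C and gamma, the toy data and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def lssvm_train(X, t, C=10.0, gamma=1.0):
    """Solve [[0, 1^T], [1, K + I/C]] [b; beta] = [0; t]."""
    N = X.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, gamma) + np.eye(N) / C
    sol = np.linalg.solve(A, np.concatenate([[0.0], t]))
    return sol[0], sol[1:]                 # threshold b, multipliers beta

def lssvm_predict(X_new, X_train, b, beta, gamma=1.0):
    # Note that the whole training set is needed at prediction time.
    return np.where(rbf_kernel(X_new, X_train, gamma) @ beta + b >= 0, 1, -1)

# Toy data (assumption): two Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(40, 2)), rng.normal(2, 1, size=(40, 2))])
t = np.concatenate([-np.ones(40), np.ones(40)])
b, beta = lssvm_train(X, t)
acc = np.mean(lssvm_predict(X, X, b, beta) == t)
print("training accuracy:", acc, "nonzero multipliers:", np.sum(np.abs(beta) > 1e-12))
```

In this sketch essentially every multiplier is nonzero, which illustrates the loss of sparseness discussed above.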

Extreme Entropy Machines
Let us first recall the formulation of the linear classification problem in highly dimensional feature spaces, i.e. when the number of samples N is equal to (or less than) the dimension of the feature space h. In particular we formulate the problem in the limiting case (footnote 2) when h = ∞:
Problem 1. We are given a finite number of (often linearly independent) points $h^{\pm}_i$ in an infinite dimensional Hilbert space H. Points $h^+ \in H^+$ constitute the positive class, while $h^- \in H^-$ constitute the negative class.
We search for β ∈ H such that the sets $\beta^T H^+$ and $\beta^T H^-$ are optimally separated.
Observe that in itself (without additional regularization) the problem is not well-posed as, by applying the linear independence of the data, for arbitrary $m_+ \neq m_-$ in $\mathbb{R}$ we can easily construct β ∈ H such that
$$\beta^T h^+ = m_+ \ \text{ for all } h^+ \in H^+ \quad \text{and} \quad \beta^T h^- = m_- \ \text{ for all } h^- \in H^-.$$
However, this leads to a model case of overfitting, which typically yields suboptimal results on the testing set (different from the original training samples).
To make the problem well-posed, we typically need to:
1. add/allow some error in the data,
2. specify some objective function including a term penalising the model's complexity.
Popular choices of the objective function include per-point classification loss (like square loss in neural networks or hinge loss in SVM) with a regularization term added, often expressed as the square of the norm of our operator β (like in SVM or in weight decay regularization of neural networks). In general one can divide objective function derivations into the following categories:
• regression based (like in neural networks or ELM),
• probabilistic (like in the case of Naive Bayes),
• geometric (like in SVM),
• information theoretic (entropy models).
We focus on the last group of approaches, and investigate the applicability of the Cauchy-Schwarz divergence [12], which for two densities f and g is given by
$$D_{cs}(f, g) = \ln \int f^2 + \ln \int g^2 - 2 \ln \int f g.$$
The Cauchy-Schwarz divergence is connected to Renyi's quadratic cross entropy ($H_2^{\times}$ [18]) and Renyi's quadratic entropy ($H_2$), defined for densities f, g as
$$H_2^{\times}(f, g) = -\ln \int f g, \qquad H_2(f) = H_2^{\times}(f, f) = -\ln \int f^2,$$
so that $D_{cs}(f, g) = 2 H_2^{\times}(f, g) - H_2(f) - H_2(g)$. As we showed in [6], it is well-suited as a discrimination measure which allows the construction of multi-threshold linear classifiers. In general, an increase of the value of the Cauchy-Schwarz Divergence results in better discrimination of the sets (densities). Unfortunately, there are a few problems with such an approach:
• true datasets are discrete, so we do not have densities f and g,
• statistical density estimators require rather large sample sizes and are very computationally expensive.

Footnote 1: There are some pruning techniques for LS-SVM, but we are not investigating them here.
Footnote 2: Which is often obtained by the kernel approach.
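As a quick numerical illustration of the definition, the sketch below evaluates the Cauchy-Schwarz divergence between one-dimensional densities by plain midpoint integration. The integration range, the grid size and the helper names are assumptions of this example; for two unit-variance Gaussians whose means differ by d the divergence equals d²/2, which gives a convenient sanity check.

```python
import math

def gauss(mean, var):
    """Density of the one-dimensional Normal distribution N(mean, var)."""
    return lambda x: math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def integrate(func, lo=-30.0, hi=30.0, n=20000):
    """Midpoint rule; the range is assumed to cover the mass of the densities."""
    step = (hi - lo) / n
    return step * sum(func(lo + (i + 0.5) * step) for i in range(n))

def cauchy_schwarz_divergence(f, g):
    """D_cs(f, g) = ln int f^2 + ln int g^2 - 2 ln int f g."""
    ff = integrate(lambda x: f(x) ** 2)
    gg = integrate(lambda x: g(x) ** 2)
    fg = integrate(lambda x: f(x) * g(x))
    return math.log(ff) + math.log(gg) - 2.0 * math.log(fg)

f = gauss(0.0, 1.0)
d_same = cauchy_schwarz_divergence(f, f)                  # identical densities
d_near = cauchy_schwarz_divergence(f, gauss(2.0, 1.0))    # analytically 2
d_far = cauchy_schwarz_divergence(f, gauss(4.0, 1.0))     # analytically 8
print(d_same, d_near, d_far)
```

The divergence vanishes for identical densities and grows as the two densities move apart, which is the behaviour exploited by the classifier.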
There are basically two approaches which help us recover underlying densities from the samples. The first one is performing some kind of density estimation, like the well known Kernel Density Estimation (KDE) technique, which is based on the observation that any continuous distribution can be sufficiently approximated by a convex combination of Gaussians. The other approach is to assume some density model (a family of distributions) and fit its parameters in order to maximize the data generation probability. In statistics this is known as the maximum likelihood estimation (MLE) approach. MLE has the advantage that in general it produces much simpler density descriptions than KDE, as the latter's description grows linearly with the sample size.
A common choice of density models are Gaussian distributions, due to their nice theoretical and practical (computational) capabilities. As mentioned earlier, a convex combination of Gaussians can approximate a given continuous distribution f with arbitrary precision. In order to fit a Gaussian Mixture Model (GMM) to a given dataset one needs an algorithm like Expectation Maximization [8] or the conceptually similar Cross-Entropy Clustering [22]. However, for simplicity and strong regularization we propose to model f as one big Gaussian N(m, Σ). One of the biggest advantages of such an approach is closed form MLE parameter estimation, as we simply put m equal to the empirical mean of the data, and Σ equal to some data covariance estimator. Secondly, this way we introduce an error to the data, which has an important regularizing role and leads to a better posed optimization problem.
Let us now recall that the Shannon's differential entropy (expressed in nits) of the continuous distribution f is
$$H(f) = -\int f \ln f.$$
We will now show that the choice of Normal distributions is not arbitrary but supported by the assumptions of entropy maximization. The following result is known, but we include the whole reasoning for completeness.
Remark 1. The Normal distribution N(m, Σ) has maximum Shannon's differential entropy among all real-valued distributions with mean $m \in \mathbb{R}^h$ and covariance $\Sigma \in \mathbb{R}^{h \times h}$.
Proof. Let f be an arbitrary distribution with covariance Σ and mean m, and let g = N(m, Σ).
For simplicity we assume that m = 0, but the analogous proof holds for arbitrary mean. Since $\ln g(h)$ is a quadratic form of h plus a constant, and f and g share the same second moments, $\int f(h) h_i h_j \, dh = \int g(h) h_i h_j \, dh = \Sigma_{ij}$, then
$$\int f \ln g = \int g \ln g,$$
which together with non-negativity of the Kullback-Leibler Divergence gives
$$H(f) = -\int f \ln f \leq -\int f \ln f + D_{KL}(f \,\|\, g) = -\int f \ln g = -\int g \ln g = H(g).$$
There appears a nontrivial question of how to find/estimate the desired Gaussian, as the covariance can be singular. In this case, to regularize the covariance we apply the well-known Ledoit-Wolf approach [15].
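$$\Sigma_{\pm} = \mathrm{cov}_{\dagger}(H^{\pm}) = (1 - \varepsilon_{\pm})\, \mathrm{cov}(H^{\pm}) + \varepsilon_{\pm}\, \mathrm{tr}(\mathrm{cov}(H^{\pm}))\, h^{-1} I,$$
where cov(⋅) is an empirical covariance and $\varepsilon_{\pm}$ is a shrinkage coefficient given by Ledoit and Wolf [15]. Thus, our optimization problem can be stated as follows:

[Fig. 1: neural network diagrams of EEM and EEKM; layer labels: Input layer, Hidden layer, Output layer.]

Problem 2 (Optimization problem). Suppose that we are given two datasets $H^{\pm}$ in a Hilbert space H which come from the Gaussian distributions $N(m_{\pm}, \Sigma_{\pm})$.
Find β ∈ H such that the datasets $\beta^T H^+$ and $\beta^T H^-$ are optimally separated in terms of the Cauchy-Schwarz Divergence.
The projected data $\beta^T H^{\pm}$ comes from the one-dimensional Gaussian $N(\beta^T m_{\pm}, \beta^T \Sigma_{\pm} \beta)$, which, with a slight abuse of notation, we denote by $N(m_{\pm}, S_{\pm})$. Since, as one can easily compute [5],
$$\int N(m_+, S_+)\, N(m_-, S_-) = N(m_+ - m_-, S_+ + S_-)[0] = \frac{1}{\sqrt{2\pi (S_+ + S_-)}} \exp\left(-\frac{(m_+ - m_-)^2}{2 (S_+ + S_-)}\right),$$
we obtain that
$$D_{cs}(N(m_+, S_+), N(m_-, S_-)) = -\ln 2 + \ln \frac{S_+ + S_-}{\sqrt{S_+ S_-}} + \frac{(m_+ - m_-)^2}{S_+ + S_-}.$$
Observe that in the above equation the first term is constant and, together with the second one, forms the logarithm of the quotient of the arithmetic and geometric means of $S_+$ and $S_-$ (and therefore in the typical cases is bounded and close to zero). Consequently, the crucial information is given by the last term. In order to confirm this claim we perform experiments on over 20 datasets used in further evaluation (more details are located in the Evaluation section). We compute the Spearman's rank correlation coefficient between $D_{cs}(N(m_+, S_+), N(m_-, S_-))$ and $\frac{(m_+ - m_-)^2}{S_+ + S_-}$ for a hundred random projections to H and a hundred random linear operators β. As one can see in Table 1, in small datasets (first part of the table) the correlation is generally high, with some exceptions (like sonar, splice, liver-disorders and diabetes). However, for bigger datasets (consisting of thousands of examples) this correlation is nearly perfect (up to the randomization process it is nearly 1.0 in all cases), which is a very strong empirical confirmation of our claim that maximization of the Cauchy-Schwarz Divergence is in practice equivalent to maximization of $\frac{(m_+ - m_-)^2}{S_+ + S_-}$.
This means that, after the above reductions, and application of (2), our final problem can be stated as follows:
Optimization problem: Extreme Entropy Machine
$$\min_{\beta} \ \beta^T \Sigma_+ \beta + \beta^T \Sigma_- \beta \quad \text{subject to} \quad \beta^T (m_+ - m_-) = 2,$$
where $\Sigma_{\pm} = \mathrm{cov}_{\dagger}(H^{\pm})$, $m_{\pm}$ is the mean of $H^{\pm}$ and $H^{\pm} = \varphi(X^{\pm})$.
Before we continue to the closed-form solution we outline two methods of actually transforming our data $X^{\pm} \subset X$ to the highly dimensional $H^{\pm} \subset H$, given by $\varphi : X \to H$.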
We investigate two approaches which lead to the Extreme Entropy Machine and Extreme Entropy Kernel Machine respectively.
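• for the Extreme Entropy Machine (EEM) we use the random projection technique, exactly the same as the one used in the ELM. In other words, given some generalized activation function $G(x, w, b) : X \times X \times \mathbb{R} \to \mathbb{R}$ and a constant h denoting the number of hidden neurons:
$$\varphi(x) = [G(x, w_1, b_1), \dots, G(x, w_h, b_h)]^T,$$
where $w_i$ are random vectors and $b_i$ are random biases.
• for the Extreme Entropy Kernel Machine (EEKM) we use the randomized kernel approximation technique [9], which spans our Hilbert space on a randomly selected subset of training vectors. In other words, given a valid kernel $K(\cdot, \cdot) : X \times X \to \mathbb{R}_+$ and the size h of the kernel space base:
$$\varphi_K(x) = \left(K(x, X^{[h]})\, K(X^{[h]}, X^{[h]})^{-1/2}\right)^T,$$
where $X^{[h]}$ is an h element random subset of X. It is easy to verify that such a low rank approximation truly behaves as a kernel, in the sense that for $x_i, x_j \in X^{[h]}$ and the true kernel projection $\phi_K$ such that $K(x_i, x_j) = \phi_K(x_i)^T \phi_K(x_j)$ we have
$$\varphi_K(x_i)^T \varphi_K(x_j) = K(x_i, X^{[h]})\, K(X^{[h]}, X^{[h]})^{-1}\, K(X^{[h]}, x_j) = K(x_i, x_j).$$
Thus, when the base consists of the whole samples' set (h = N), we have
$$\varphi_K(X)^T \varphi_K(X) = K(X, X),$$
which is a complete Gram matrix.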
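Both feature maps described in the list above can be sketched with NumPy as follows. The sigmoid activation, the RBF kernel and all function names are assumptions made for illustration; the inverse square root of the sampled kernel matrix is computed through an eigendecomposition, which is one of several possible choices.

```python
import numpy as np

def random_projection(X, W, b):
    """EEM map: row i holds [G(x_i, w_1, b_1), ..., G(x_i, w_h, b_h)] for a sigmoid G."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_projection(X, base, gamma=1.0):
    """EEKM map: row i holds K(x_i, base) K(base, base)^(-1/2)."""
    vals, vecs = np.linalg.eigh(rbf_kernel(base, base, gamma))
    vals = np.maximum(vals, 1e-12)              # numerical guard (assumption)
    inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T  # symmetric inverse square root
    return rbf_kernel(X, base, gamma) @ inv_sqrt

rng = np.random.default_rng(0)
X = 2.0 * rng.normal(size=(12, 2))

# EEM: h = 5 random hidden neurons.
H_eem = random_projection(X, rng.normal(size=(2, 5)), rng.normal(size=5))

# EEKM: base of h = 5 randomly selected training points.
base = X[rng.choice(len(X), size=5, replace=False)]
H_eekm = kernel_projection(X, base, gamma=2.0)

# With the whole set as the base, inner products recover the full Gram matrix.
H_full = kernel_projection(X, X, gamma=2.0)
gram_error = np.max(np.abs(H_full @ H_full.T - rbf_kernel(X, X, gamma=2.0)))
print(H_eem.shape, H_eekm.shape, gram_error)
```

Rows of the returned matrices are the images of the samples, so the Gram matrix appears as `H_full @ H_full.T` in this row-oriented convention.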
So the only difference between the Extreme Entropy Machine and the Extreme Entropy Kernel Machine is that in the latter we use $H^{\pm} = \varphi_K(X^{\pm})$, where K is a selected kernel, instead of $H^{\pm} = \varphi(X^{\pm})$. Fig. 1 visualizes these two approaches as neural networks; in particular EEM is a simple SLFN, while EEKM leads to a network with two hidden layers.

[Fig. 2: Linearly non-separable data in X; data mapped to the H space, where we find covariance estimators; density of projected Gaussians on which the decision is based; decision boundary in the input space X.]
Remark 2. The Extreme Entropy Machine optimization problem is closely related to the SVM optimization, but instead of maximizing the margin between the closest points we are maximizing the mean margin.
Proof. Let us recall that in SVM we try to maximize the margin $\frac{2}{\|\beta\|}$ under the constraints that negative samples are projected at values of at most −1 ($\beta^T h^- + b \leq -1$) and positive samples at values of at least 1 ($\beta^T h^+ + b \geq 1$). In other words, we are minimizing the operator norm $\|\beta\|$, which is equivalent to minimizing the square of this norm $\|\beta\|^2$, under the constraint that
$$\min_{h^+ \in H^+} \{\beta^T h^+\} - \max_{h^- \in H^-} \{\beta^T h^-\} = 2.$$
On the other hand, EEM tries to minimize
$$\beta^T \Sigma_+ \beta + \beta^T \Sigma_- \beta = \beta^T (\Sigma_+ + \Sigma_-) \beta$$
under the constraint that
$$\frac{1}{|H^+|} \sum_{h^+ \in H^+} \beta^T h^+ - \frac{1}{|H^-|} \sum_{h^- \in H^-} \beta^T h^- = 2.$$
So what is happening here is that we are trying to maximize the mean margin between classes in the Mahalanobis norm generated by the sum of the classes' covariances. It was previously shown in the Two ellipsoid Support Vector Machines model [7] that such a norm is an approximation of the margin coming from two ellipsoids instead of the single ball used by the traditional SVM.
A similar observation regarding the connection between large margin classification and entropy optimization has been made in the case of the Multithreshold Linear Entropy Classifier [6].
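We are going to show, by applying the standard method of Lagrange multipliers, that the above problem has a closed form solution (similar to Fisher's Discriminant). Let $\Sigma = \Sigma_+ + \Sigma_-$, $\Delta m = m_+ - m_-$ and
$$L(\beta, \lambda) = \beta^T \Sigma \beta - \lambda (\beta^T \Delta m - 2).$$
Then
$$\frac{\partial}{\partial \beta} L(\beta, \lambda) = 2 \Sigma \beta - \lambda \Delta m, \qquad \frac{\partial}{\partial \lambda} L(\beta, \lambda) = -(\beta^T \Delta m - 2),$$
which means that we need to solve, with respect to β, the system
$$2 \Sigma \beta - \lambda \Delta m = 0, \qquad \beta^T \Delta m = 2.$$
Hence $\beta = \frac{\lambda}{2} \Sigma^{-1} \Delta m$ and $\frac{\lambda}{2} \Delta m^T \Sigma^{-1} \Delta m = 2$, and consequently, if $\Delta m \neq 0$, then $\lambda = 4 / \|\Delta m\|^2_{\Sigma}$ and
$$\beta = \frac{2}{\|\Delta m\|^2_{\Sigma}} \Sigma^{-1} \Delta m = \frac{2 (\Sigma_+ + \Sigma_-)^{-1} (m_+ - m_-)}{\|m_+ - m_-\|^2_{\Sigma_+ + \Sigma_-}},$$
where $\|\Delta m\|^2_{\Sigma} = \Delta m^T \Sigma^{-1} \Delta m$ is the squared Mahalanobis norm.
The final decision of the class of the point h is therefore given by the comparison of the values
$$N(\beta^T m_+, \beta^T \Sigma_+ \beta)[\beta^T h] \quad \text{and} \quad N(\beta^T m_-, \beta^T \Sigma_- \beta)[\beta^T h].$$
Let $S_{\pm} = \beta^T \Sigma_{\pm} \beta$. We distinguish two cases based on the number of the resulting classifier's thresholds (points t such that $N(\beta^T m_+, S_+)[t] = N(\beta^T m_-, S_-)[t]$):
1. $S_- = S_+$, then there is one threshold
$$t_0 = \beta^T m_- + 1,$$
2. $S_- \neq S_+$, then there are two thresholds
$$t_{\pm} = \beta^T m_- + \frac{2 S_- \pm \sqrt{S_- S_+ \left(\ln(S_-/S_+)(S_- - S_+) + 4\right)}}{S_- - S_+}.$$
Obviously, in the degenerate case, when $\Delta m = 0 \iff m_- = m_+$, there is no solution, as the constraint $\beta^T (m_+ - m_-) = 2$ is not fulfilled for any β. In such a case EEM returns a trivial classifier constantly equal to any class (we put β = 0).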
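From the neural network perspective we simply construct a custom activation function F(⋅) in the output neuron, depending on one of the two described cases: F(x) = +1 if $N(\beta^T m_+, \beta^T \Sigma_+ \beta)[x] \geq N(\beta^T m_-, \beta^T \Sigma_- \beta)[x]$, and F(x) = −1 otherwise. In the one-threshold case this reduces to checking on which side of the threshold x lies; in the two-threshold case it reduces to checking whether x lies between the two thresholds (the region between them is assigned to the class with the smaller projected variance).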
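The thresholds of the output activation can be computed directly from the projected class statistics. The sketch below assumes that the projected means differ by exactly 2, as enforced by the EEM constraint, and solves the resulting quadratic equation for the crossing points of the two projected Gaussian densities; the function names and the numbers in the usage part are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def eem_thresholds(mu_neg, s_neg, s_pos):
    """Points where N(mu_neg + 2, s_pos) and N(mu_neg, s_neg) have equal density."""
    if math.isclose(s_neg, s_pos):
        return (mu_neg + 1.0,)                       # single threshold: the midpoint
    root = math.sqrt(s_neg * s_pos * (math.log(s_neg / s_pos) * (s_neg - s_pos) + 4.0))
    t1 = mu_neg + (2.0 * s_neg - root) / (s_neg - s_pos)
    t2 = mu_neg + (2.0 * s_neg + root) / (s_neg - s_pos)
    return tuple(sorted((t1, t2)))

def activation(x, mu_neg, s_neg, s_pos):
    """F(x): +1 where the positive projected density dominates, -1 otherwise."""
    pos = normal_pdf(x, mu_neg + 2.0, s_pos)
    neg = normal_pdf(x, mu_neg, s_neg)
    return 1 if pos >= neg else -1

# Example: negative class N(-1, 2), positive class N(1, 0.5) after projection.
thresholds = eem_thresholds(-1.0, 2.0, 0.5)
labels = [activation(x, -1.0, 2.0, 0.5) for x in (-3.0, 1.0, 5.0)]
print(thresholds, labels)
```

In this example the positive class has the smaller variance, so only the points between the two thresholds are labeled +1.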
The whole classification process is visualized in Fig. 2: we begin with data in the input space X, transform it into the Hilbert space H where we model the classes as Gaussians, then perform optimization leading to the projection on $\mathbb{R}$ through β, and perform density based classification leading to a non-linear decision boundary in X.

Theory: density estimation in the kernel case
To illustrate our reasoning, we consider a typical basic problem concerning the density estimation.
Problem 3. Assume that we are given a finite data set H in a Hilbert space H generated by the unknown density f , and we want to obtain estimate of f .
Since the problem in itself is infinite dimensional, typically the data would be linearly independent. Moreover, one usually cannot obtain a reliable density estimation; the most we can hope for is that after transformation by a linear functional into $\mathbb{R}$, the resulting density will be well-estimated.
To simplify the problem assume therefore that we want to find the desired density in the class of normal densities, or equivalently that we are interested only in the estimation of the mean and covariance of f. The generalization of the above problem is given by the following problem:
Problem 4. Assume that we are given finite data sets $H^{\pm}$ in a Hilbert space H generated by the unknown densities $f_{\pm}$, and we want to obtain an estimate of the unknown densities.
In general dim(H) = h ≫ N, which means that we have very sparse data in terms of the Hilbert space. As a result, classical kernel density estimation (KDE) is not a reliable source of information [16]. In the absence of different tools we can however use KDE with a very big kernel width in order to cover at least some general shape of the whole density.
Remark 3. Assume that we are given finite data sets $H^{\pm}$ with means $m_{\pm}$ and covariances $\Sigma_{\pm}$ in a Hilbert space H. If we conduct kernel density estimation using a Gaussian kernel then, in the limiting case of the kernel width σ growing to infinity, each class becomes a Normal distribution (the formal statement of this limit is given in [6]).
The proof of this remark is given by Czarnecki and Tabor [6]. It means that if we perform a Gaussian kernel density estimation of our data with a big kernel width (which is reasonable for a small amount of data in a highly dimensional space), then for big enough σ EEM is a nearly optimal linear classifier in terms of the estimated densities.
Let us now investigate the probabilistic interpretation of EEM. Under the assumption that $H^{\pm} \sim N(m_{\pm}, \Sigma_{\pm})$ we have the conditional probabilities $p(h \mid \pm) = N(m_{\pm}, \Sigma_{\pm})[h]$, so from Bayes rule we conclude that
$$p(\pm \mid h) = \frac{p(h \mid \pm)\, p(\pm)}{p(h)} \propto N(m_{\pm}, \Sigma_{\pm})[h]\, p(\pm),$$
where p(±) is the prior distribution of the classes. In our case, due to the balanced nature of the model (meaning that despite class imbalance we maximize a balanced quality measure such as Averaged Accuracy), we have p(±) = 1/2, so
$$p(\pm \mid h) \propto N(m_{\pm}, \Sigma_{\pm})[h].$$
Furthermore it is easy to show that under the normality assumption, the resulting classifier is optimal in the Bayesian sense.
Remark 4. If data in the feature space comes from Normal distributions $N(m_{\pm}, \Sigma_{\pm})$ then β given by EEM minimizes the probability of misclassification. More strictly speaking, if we draw $h_+$ with probability 1/2 from $N(m_+, \Sigma_+)$ and $h_-$ with probability 1/2 from $N(m_-, \Sigma_-)$, then for any other linear operator α ∈ H the probability of misclassifying the projected point is not smaller than for β.

First we show that under some simplifying assumptions, the proposed method behaves as the Extreme Learning Machine (or the Weighted Extreme Learning Machine [25]).
Before proceeding further we would like to remark that there are two popular notations for projecting data onto hyperplanes.One, used in ELM model, assumes that H is a row matrix and β is a column vector, which results in the projection's equation Hβ.Second one, used in SVM and in our paper, assumes that both H and β are column oriented, which results in the β T H projection.In the following theorem we will show some duality between β found by ELM and by EEM.In order to do so, we will need to change the notation during the proof, which will be indicated.
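Theorem 1. Let us assume that we are given an arbitrary, balanced dataset which can be perfectly learned by ELM with N hidden neurons. If the image of this dataset's points through random neurons, H = ϕ(X), is centered (the points' images have 0 mean) and the classes have homogeneous covariances (that is, there exist $a_{\pm} \in \mathbb{R}_+$ such that $\mathrm{cov}(H) = a_+ \mathrm{cov}(H^+) = a_- \mathrm{cov}(H^-)$), then EEM with the same hidden layer will also learn this dataset perfectly (with 0 error).
Proof. In the first part of the proof we use the ELM notation. Projected data is centered, so $\mathrm{cov}(H) = H^T H$. ELM is able to learn this dataset perfectly, consequently H is invertible, thus also $H^T H$ is invertible and
$$\beta_{ELM} = H^{\dagger} t = H^{-1} t = (H^T H)^{-1} H^T t.$$
EEM, in turn, gives
$$\beta_{EEM} = \frac{2 (\Sigma_+ + \Sigma_-)^{-1} \Delta m}{\|\Delta m\|^2_{\Sigma_+ + \Sigma_-}},$$
where $\Sigma_{\pm} = \mathrm{cov}_{\dagger}(H^{\pm})$ and $\Delta m = m_+ - m_-$. Due to the assumption of geometric homogeneity, $\Sigma_+ + \Sigma_- = \frac{a_+ + a_-}{a_+ a_-} \Sigma$, where $\Sigma = \mathrm{cov}_{\dagger}(H)$, so $\beta_{EEM}$ is a positive multiple of $\Sigma^{-1} \Delta m$. Therefore it is enough to consider $\beta_{ELM} = \mathrm{cov}^{-1}(H) H^T t$.
From now on we change the notation back to the one used in this paper. For a balanced dataset the product of the projected data with the labels equals $\sum_{h^+ \in H^+} h^+ - \sum_{h^- \in H^-} h^- = \frac{N}{2}(m_+ - m_-)$, hence
$$\beta_{ELM} = \frac{N}{2} \Sigma^{-1} \Delta m = a \cdot \beta_{EEM} \quad \text{for some } a \in \mathbb{R}_+.$$
Again from homogeneity we obtain just one equilibrium point, located halfway between the projected class means, at $\beta_{EEM}^T m_- + \beta_{EEM}^T \frac{(m_+ - m_-)}{2}$, which results in the exact same classifier as the one given by ELM. This completes the proof.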
Similar result holds for EEKM and Least Squares Support Vector Machine.
Theorem 2. Let us assume that we are given an arbitrary, balanced dataset which can be perfectly learned by LS-SVM. If the images of the dataset's points through the kernel induced projection $\varphi_K$ have homogeneous class covariances (that is, there exist $a_{\pm} \in \mathbb{R}_+$ such that $\mathrm{cov}(\varphi_K(X)) = a_+ \mathrm{cov}(\varphi_K(X^+)) = a_- \mathrm{cov}(\varphi_K(X^-))$), then EEKM with the same kernel and N hidden neurons will also learn this dataset perfectly (with 0 error).
Proof. It is a direct consequence of the fact that with N hidden neurons and homogeneous covariances of the projected classes, EEKM degenerates to the kernelized Fisher Discriminant which, as Gestel et al. showed [24], is equivalent to the solution of the Least Squares SVM.

Practical considerations
We can formulate the whole EEM training as a very simple algorithm (see Alg. 1).

Algorithm 1 Extreme Entropy
The resulting model consists of three elements:
• feature projection function ϕ,
• linear operator β,
• classification rule F.
As described before, F can be further compressed to just one or two thresholds t ± using equations from previous sections.Either way, complexity of the resulting model is linear in terms of hidden units and classification of the new point takes O(dh) time.
During EEM training, the most expensive part of the algorithm is the computation of the covariance estimators and the inversion of the sum of covariances. Even the computation of the empirical covariance takes $O(N h^2)$ time, so the total complexity of training, equal to $O(h^3 + N h^2) = O(N h^2)$ (for h ≤ N), is acceptable. It is worth noting that training of the ELM also takes exactly $O(N h^2)$ time, as it requires the computation of $H^T H$ for $H \in \mathbb{R}^{N \times h}$. Training of EEKM requires additional computation of the square root of the inverse of the sampled kernel matrix; since both inverting and square rooting can be done in $O(h^3)$, we obtain exactly the same asymptotic computational complexity as that of EEM. Square rooting and inverting are both always possible, as assuming that K is a valid kernel in Mercer's sense yields that $K(X^{[h]}, X^{[h]})$ is strictly positive definite and thus invertible. Further comparison of EEM, ELM and SVM is summarized in Table 2.
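Putting the pieces together, EEM training as described in this section can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the sigmoid random projection, the toy data and all names are illustrative, and the shrinkage coefficient `eps` is fixed by hand, whereas the Ledoit-Wolf approach [15] derives it from the data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shrunk_cov(H, eps):
    """(1 - eps) * cov(H) + eps * tr(cov(H)) / h * I, with a hand-picked eps (assumption)."""
    S = np.cov(H, rowvar=False, bias=True)
    h = S.shape[0]
    return (1.0 - eps) * S + eps * np.trace(S) / h * np.eye(h)

def eem_train(X, t, h, rng, eps=0.1):
    W = rng.normal(size=(X.shape[1], h))            # random hidden layer, O(N d h)
    b = rng.normal(size=h)
    H = sigmoid(X @ W + b)
    H_pos, H_neg = H[t == 1], H[t == -1]
    m_pos, m_neg = H_pos.mean(axis=0), H_neg.mean(axis=0)
    S_pos, S_neg = shrunk_cov(H_pos, eps), shrunk_cov(H_neg, eps)   # O(N h^2)
    dm = m_pos - m_neg
    z = np.linalg.solve(S_pos + S_neg, dm)          # (Sigma_+ + Sigma_-)^{-1} dm, O(h^3)
    beta = 2.0 * z / (dm @ z)                       # closed form solution
    return {"W": W, "b": b, "beta": beta,
            "mu_pos": beta @ m_pos, "mu_neg": beta @ m_neg,
            "var_pos": beta @ S_pos @ beta, "var_neg": beta @ S_neg @ beta}

def eem_predict(X, model):
    p = sigmoid(X @ model["W"] + model["b"]) @ model["beta"]
    def log_pdf(mu, var):
        return -0.5 * np.log(var) - (p - mu) ** 2 / (2.0 * var)
    pos = log_pdf(model["mu_pos"], model["var_pos"])
    neg = log_pdf(model["mu_neg"], model["var_neg"])
    return np.where(pos >= neg, 1, -1)

# Toy data (assumption): two Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(60, 2)), rng.normal(2, 1, size=(60, 2))])
t = np.concatenate([-np.ones(60), np.ones(60)])
model = eem_train(X, t, h=20, rng=rng)
acc = np.mean(eem_predict(X, model) == t)
print("projected mean gap:", model["mu_pos"] - model["mu_neg"], "accuracy:", acc)
```

The projected mean gap equals 2 by construction, and prediction needs only the hidden layer, β and the four projected statistics, which matches the O(dh) classification cost mentioned above.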
The next aspect we would like to discuss is cost sensitive learning. EEMs are balanced models in the sense that they are trying to maximize balanced quality measures (like Averaged Accuracy or GMean). However, in practical applications it might be the case that we are actually more interested in the positive class than the negative one (like in medical applications). The proposed model gives direct probability estimates of $p(\beta^T h \mid t)$, which we can easily convert to a cost sensitive classifier by introducing the prior probabilities of each class. Directly from Bayes Theorem, given p(+) and p(−), we can label our new sample h according to
$$p(t \mid \beta^T h) \propto p(t)\, p(\beta^T h \mid t),$$
so if we are given costs $C_+, C_- \in \mathbb{R}_+$ we can use them as a weighting of the priors:
$$cl(h) = \arg\max_{t \in \{-, +\}} C_t\, p(t)\, p(\beta^T h \mid t).$$
Let us now investigate the possible efficiency bottleneck. In EEKM, the classification of a new point x is based on
$$cl(x) = F(\beta^T \varphi_K(x)) = F\left(K(x, X^{[h]})\, K(X^{[h]}, X^{[h]})^{-1/2} \beta\right).$$
One can convert EEKM to the SLFN by putting
$$\hat{\beta} = K(X^{[h]}, X^{[h]})^{-1/2} \beta, \quad \text{so that} \quad cl(x) = F\left(K(x, X^{[h]})\, \hat{\beta}\right).$$
This way the complexity of a new point's classification is exactly the same as in the case of EEM and ELM (or any other SLFN).
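The cost sensitive rule described above amounts to reweighting the two projected Gaussian densities before comparing them. A minimal sketch, assuming one-dimensional projected statistics are already available (the dictionaries, the numbers and the function names are illustrative assumptions):

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def cost_sensitive_label(x, mu, var, prior, cost):
    """Return the label t in {+1, -1} maximizing cost[t] * prior[t] * N(mu[t], var[t])[x]."""
    score = {t: cost[t] * prior[t] * normal_pdf(x, mu[t], var[t]) for t in (1, -1)}
    return max(score, key=score.get)

# Projected class statistics (assumed): means differ by 2, equal variances.
mu = {1: 1.0, -1: -1.0}
var = {1: 1.0, -1: 1.0}
prior = {1: 0.5, -1: 0.5}

x = -0.1                                               # slightly on the negative side
balanced = cost_sensitive_label(x, mu, var, prior, cost={1: 1.0, -1: 1.0})
favour_pos = cost_sensitive_label(x, mu, var, prior, cost={1: 5.0, -1: 1.0})
print(balanced, favour_pos)
```

With equal costs the point is assigned to the negative class, while a five times larger weight on the positive class moves the decision boundary enough to label it positive.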
Datasets' features were linearly scaled in order to have each feature in the interval [0, 1].No other data whitening/filtering was performed.All experiments were performed in repeated 10-fold stratified cross-validation.
We use GMean (geometric mean of accuracy over positive and negative samples) as an evaluation metric, due to its balanced nature and its usage in previous works regarding Weighted Extreme Learning Machines [25].
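For reference, the metric can be computed as in the short sketch below, which assumes labels encoded as +1 and −1 and at least one sample of each class; the function name is an assumption.

```python
import math

def gmean(y_true, y_pred):
    """Geometric mean of the accuracies on the positive and on the negative class."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    acc_pos = sum(p == 1 for p in pos) / len(pos)
    acc_neg = sum(p == -1 for p in neg) / len(neg)
    return math.sqrt(acc_pos * acc_neg)

y_true = [1, 1, -1, -1, -1, -1]
score = gmean(y_true, [1, -1, -1, -1, -1, 1])        # sqrt(0.5 * 0.75)
trivial = gmean(y_true, [-1, -1, -1, -1, -1, -1])    # always predicts the bigger class
print(score, trivial)
```

A classifier that always predicts the bigger class gets a GMean of zero regardless of the class ratio, which is why the measure suits unbalanced problems.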

Basic UCI datasets
We start our experiments with nine datasets coming from the UCI repository [2], namely australian, breast-cancer, diabetes, german.numer, heart, ionosphere, liver-disorders, sonar and splice, summarized in Table 3. These datasets include rather balanced, low dimensional problems. On such data, EEM seems to perform noticeably better than ELM when using the RBF activation function (see Table 4), and rather similarly when using the sigmoid one; in such a scenario, for some datasets ELM achieves better results while for others EEM wins. Results obtained for EEKM are comparable with those obtained by LS-SVM and SVM; in both cases the proposed method achieves better results on about a third of the problems, draws on a third and loses on a third. These experiments can be seen as a proof of concept of the whole methodology, showing that it can truly be a reasonable alternative for existing models in some problems. It appears that, contrary to ELM, the proposed methods (EEM and EEKM) achieve the best scores across all considered models on some of the datasets regardless of the used activation function/kernel (only Support Vector Machines and their least squares counterpart are competitive in this sense).

Highly unbalanced datasets
In the second part we proceeded to the nine highly unbalanced datasets, summarized in the second part of Table 3. The ratio between the bigger and the smaller class varies from 10:1 to even 20:1, which makes them really hard for unbalanced models. The obtained results (see Table 5) resemble those obtained on the UCI repository. We can see better results in about half of the experiments if we fix a particular activation function/kernel (so we compare $ELM_x$ with $EEM_x$ and $LS\text{-}SVM_x$ with $EEKM_x$). Table 6 shows that the training times of Extreme Entropy Machines are comparable with the ones obtained by Extreme Learning Machines (differences on the level of 0.1−0.2 are not significant for such dataset sizes). We have a robust method which learns a model in below two seconds for hundreds/thousands of examples. For larger datasets (like abalone7 or sick euthyroid) the proposed methods not only outperform SVM and LS-SVM in terms of robustness, but there is also a noticeable difference between their training times and those of ELMs. This suggests that even though ELM and EEM are quite similar and on small datasets are equally fast, EEM can better scale up to truly big datasets. Obviously the obtained training times do not reflect the full training time, as it strongly depends on the technique used for metaparameter selection and the resolution of the grid search (or other parameter tuning technique). In such a full scenario, the training times of SVM related models are significantly bigger due to the requirement of exact tuning of both C and γ in real domains.

Extremely unbalanced datasets
The third part of the experiments consists of extremely unbalanced datasets (with class imbalance up to 100:1) containing tens and hundreds of thousands of examples. The five analyzed datasets span from NLP tasks (webpages) through medical applications (mammography) to bioinformatics (protein homology). This type of dataset often occurs in true data mining, which makes these results much more practical than the ones obtained on small/balanced data. The 0.0 scores on the Isolet dataset (see Table 7) for sigmoid based random projections are a result of very high values (∼ 200) of ⟨x, w⟩ for all x, which results in G(x, w, b) = 1, so the whole dataset is reduced to the singleton $\{[1, \dots, 1]^T\} \subset \mathbb{R}^h \subset H$, which obviously is not separable by any classifier, neither ELM nor EEM.
For other activation functions we see that EEM achieves slightly worse results than ELM. On the other hand, scores of EEKM generally outperform the ones obtained by ELM and are very close to the ones obtained by well tuned SVM and LS-SVM. At the same time, EEM and EEKM were trained significantly faster; as Table 8 shows, training was an order of magnitude faster than for SVM related models and even 1.5−2× faster than for ELM. It seems that the Ledoit-Wolf covariance estimation together with the inversion of these matrices is simply a faster operation (scales better) than the computation of the Moore-Penrose pseudoinverse of $H^T H$. Obviously one can change the ELM training routine to the regularized one, where instead of $(H^T H)^{\dagger}$ one computes $(H^T H + I/C)^{-1}$, but we are analyzing here parameterless approaches, and the analogous modification could be used for EEM in the form of $(\mathrm{cov}(X^-) + \mathrm{cov}(X^+) + I/C)^{-1}$ instead of computing the Ledoit-Wolf estimator. In other words, in the parameterless training scenario, as described in this paper, EEMs seem to scale better than ELMs while still

Entropy based hyperparameter optimization
Now we proceed to entropy based evaluation. Given a particular set of linear hypotheses M in H, we want to select the optimal set of hyperparameters θ (such as the number of hidden neurons or the regularization parameter) which identifies a particular model β_θ ∈ M ⊂ H. Instead of using expensive internal cross-validation (or another generalization error estimation technique like Err_0.632), we select the θ which maximizes our entropic measure. In particular, we consider two strategies, each selecting the θ that maximizes its criterion: a simplified Cauchy-Schwarz Divergence based strategy and a kernel density based entropic strategy [6], in which ⟦A⟧ = ⟦A⟧_σ(A) is a Gaussian KDE using Silverman's rule for the window width σ(A) [19]. This way we can use the whole given set for training and do not need to repeat the process, as D_cs is computed on the training set instead of a hold-out set.
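For reference, the two ingredients named above can be written in their standard textbook forms. These are assumptions about the notation, given here only as a reminder; they are not a reproduction of the paper's own numbered equations.

```latex
% Cauchy-Schwarz Divergence between two densities f and g (standard definition)
D_{cs}(f, g) = \ln \int f^2 + \ln \int g^2 - 2 \ln \int f g

% Silverman's rule of thumb for a one-dimensional Gaussian KDE built on a sample A
\sigma(A) = \left( \frac{4}{3 |A|} \right)^{1/5} \operatorname{std}(A)
          \approx 1.06 \, |A|^{-1/5} \operatorname{std}(A)
```

By the Cauchy-Schwarz inequality D_cs is non-negative and equals zero only when f = g, so larger values correspond to better separated class densities.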
First, one can notice in Table 9 and Table 10 that such an entropic criterion works well for EEM, EEKM and Support Vector Machines. On the other hand, it is not very well suited for ELM models. This confirms the conclusions from our previous work on classification using D_cs [6], where we claimed that SVMs are conceptually similar in terms of optimization objective, and extends them to a new class of models (EEMs). Second, Table 9 shows that EEM and EEKM can truly select their hyperparameters using a very simple technique requiring no model retraining. Computation of (5) is linear in terms of the training set size, and constant time if performed using precomputed projections of the required objects (which are computed during EEM training either way). This makes this very fast model even more robust.
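The selection procedure can be sketched as below. It is a hedged illustration, not the authors' implementation: all function names are hypothetical, `project(theta)` is an assumed callback returning the one-dimensional projections of the positive and negative training points for the model trained with θ, and the Gaussian closed form is our assumption of what a simplified, constant-time criterion may look like once the projected means and variances are known.

```python
import numpy as np

def gauss_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-0.5 * x ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def dcs_gaussians(m1, v1, m2, v2):
    """D_cs between the 1-D Gaussians N(m1, v1) and N(m2, v2).
    Uses the identity: integral of N(m1, v1) * N(m2, v2) = N(m1 - m2; 0, v1 + v2)."""
    ff = gauss_pdf(0.0, 2.0 * v1)
    gg = gauss_pdf(0.0, 2.0 * v2)
    fg = gauss_pdf(m1 - m2, v1 + v2)
    return np.log(ff) + np.log(gg) - 2.0 * np.log(fg)

def silverman(a):
    """Silverman's rule of thumb for the window width of a 1-D sample."""
    return (4.0 / (3.0 * len(a))) ** 0.2 * np.std(a)

def cross_potential(a, sa, b, sb):
    """Integral of the product of two Gaussian KDEs (quadratic in sample sizes)."""
    diffs = a[:, None] - b[None, :]
    return gauss_pdf(diffs, sa ** 2 + sb ** 2).mean()

def dcs_kde(a, b):
    """D_cs between Gaussian KDEs of the samples a and b."""
    sa, sb = silverman(a), silverman(b)
    ff = cross_potential(a, sa, a, sa)
    gg = cross_potential(b, sb, b, sb)
    fg = cross_potential(a, sa, b, sb)
    return np.log(ff) + np.log(gg) - 2.0 * np.log(fg)

def select_theta(candidates, project, criterion=dcs_kde):
    """Return the candidate whose training projections maximize the criterion."""
    return max(candidates, key=lambda theta: criterion(*project(theta)))
```

No hold-out set and no retraining are involved: each candidate θ is scored on the projections of the training data only. Note that the KDE variant is quadratic in the number of points, while the Gaussian closed form needs only two means and two variances.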

EEM stability
It was previously reported [11] that ELMs have very stable results over a wide range of the number of hidden neurons. We performed analogous experiments with EEM on UCI datasets. We trained models for 100 increasing hidden layer sizes (h = 5, 10, ..., 500) and plotted the resulting GMean scores in Fig. 3. One can notice that, similarly to ELM, the proposed methods are very stable. Once the machine gets enough neurons (around 100 in the case of the tested datasets), further increasing the feature space dimension has a minor effect on the generalization capabilities of the model. It is also important that some of these datasets (like sonar) do not even have 500 points, so there are more dimensions in the Hilbert space than we have points to build our covariance estimates, and even so we do not observe any rapid overfitting.
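The grid of hidden layer sizes and the GMean score used in this experiment can be sketched as follows. The sketch assumes the usual definition of GMean for binary problems (geometric mean of the accuracies on the positive and on the negative class) and labels encoded as +1 and −1; the names `gmean` and `hidden_sizes` are illustrative only.

```python
import numpy as np

def gmean(y_true, y_pred):
    """Geometric mean of per-class accuracies, assuming labels in {+1, -1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == 1
    neg = ~pos
    tpr = np.mean(y_pred[pos] == 1)   # accuracy on the positive class
    tnr = np.mean(y_pred[neg] != 1)   # accuracy on the negative class
    return float(np.sqrt(tpr * tnr))

# The 100 hidden layer sizes h = 5, 10, ..., 500 used in the stability plot.
hidden_sizes = list(range(5, 501, 5))
```

A classifier that assigns every point to the majority class gets a GMean of 0, which is why this measure is informative on unbalanced data where plain accuracy is not.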
• rapid training that scales well to hundreds of thousands of examples and beyond,
• theoretical and practical similarities to the large margin classifiers and Fisher Discriminant.
The performed evaluation showed that, similarly to ELM, the proposed EEM is a very stable model in terms of the size of the hidden layer and achieves classification results comparable to the ones obtained by SVMs and ELMs. Furthermore, we showed that our method scales better to truly big datasets (consisting of hundreds of thousands of examples) without sacrificing the quality of the results.
During our considerations we pointed out some open problems and issues which are worth investigating:
• Can one construct a closed-form entropy based classifier with distribution families other than Gaussians?
• Is there a theoretical justification of the stability of the extreme learning techniques?
• Is it possible to further improve the achieved results by performing unsupervised entropy based optimization in the hidden layer?

Figure 1: Extreme Entropy Machine (on the left) and Extreme Entropy Kernel Machine (on the right) as neural networks. In both cases all weights are either randomly selected (dashed) or are the result of closed-form optimization (solid).

Figure 2: Visualization of the whole EEM classification process. From the left: linearly non-separable data in X; data mapped to the H space, where we find covariance estimators; density of projected Gaussians on which the decision is based; decision boundary in the input space X.

Figure 3: Plot of the EEM's (with RBF activation function) GMean scores from cross validation experiments for increasing sizes of the hidden layer.

Table 1: Spearman's rank correlation coefficient between the optimized term and the whole D_cs for all datasets used in evaluation. Each column represents a different dimension of the Hilbert space.

Table 2: Comparison of considered classifiers. SV denotes the number of support vectors. An asterisk denotes features which can be added to a particular model by some minor modifications, but here we compare the base version of each model.

Table 3: Characteristics of used datasets

Table 5: Highly unbalanced datasets

obtaining similar classification results. At the same time, EEKM obtains SVM-level results with training times smaller by orders of magnitude. Both ELM and EEM could be transformed into regularization parameter based learning, but this is beyond the scope of this work.

Table 7: Big highly unbalanced datasets