A new approach to model selection for ordinal target variables

Multi-class predictive models are generally evaluated by averaging binary classification indicators, without a distinction between nominal and ordinal dependent variables. This paper introduces a novel approach to assess the performance of predictive models characterized by an ordinal target variable, and a new index for model evaluation is proposed. The new index satisfies a set of desirable mathematical properties and can be applied to the evaluation of both parametric and non-parametric models. In order to show how our performance indicator works, empirical evidence obtained on toy examples and simulated data is provided. On the basis of the results achieved, we argue that our approach can be a more suitable criterion for model selection than the performance indexes currently suggested in the literature.


Introduction
Evaluation measures are widely used in predictive modelling to compare different algorithms, thus guiding the selection of the best model for the data at hand.
Performance indicators can be used to assess a model in terms of accuracy, discriminatory power and stability of the results. The choice of the indicators used for model selection is a fundamental point, and many approaches have been proposed over the years (see e.g. [1,4,12]).
Restricting attention to binary target variables, several distinct criteria for comparing the performance of classification models are available (see [9,10,14,22]).
Multi-class classification models are generally evaluated by averaging binary classification indicators (see [11,14,23]), and in the literature there is no clear distinction between multi-class nominal and ordinal targets (e.g. [6,7,20]).
While several approaches are available in the literature for the model definition stage with an ordinal target variable (see [2,3,17,24]), adequate tools for model selection are lacking ([5]).
In our opinion, performance indicators should take into account the nature of the target variable, especially when the dependent variable is ordinal. This leads us to propose a new class of measures to select the best model in predictive contexts characterized by a multi-class ordinal target variable, using the misclassification errors coupled with a measure of uncertainty on the prediction.
The paper is structured as follows: Section 2 reviews the metrics most used in the literature; Section 3 presents our methodological proposal and proves some mathematical properties; Section 4 explains how our proposal works in two toy examples; Section 5 reports the empirical evidence obtained on simulated data.
Conclusions and further ideas for research are summarized in Section 6.

Review of the literature for ordinal dependent variable
The most popular performance measures in ordinal predictive classification models are based on the AUC (Area Under the ROC Curve), accuracy (expressed in terms of correct classifications) and the MSE (Mean Square Error) (see [7] and [16], among others). Accuracy (the percentage of correct predictions over the total number of instances) is the most used evaluation metric for binary and multi-class classification problems ([22]), under the assumption that the costs of the different misclassifications are equal.
The AUC for multi-class classification is defined in [11] as a generalization of the AUC (based on its probabilistic definition); it suffers from several weaknesses even in the binary classification setting ([8]) and it is cost-independent, an assumption that can be viewed as a drawback when the target is ordinal.
The mean square error (MSE) measures the difference between predicted and observed values in regression problems using the Euclidean distance. MSE can be applied to ordinal predictive models by converting the classes of the ordinal target variable y into integers and computing the differences between them; however, this coding treats the classes as equally spaced and does not properly take into account the ordinal structure of the response in a predictive model with ordinal classes. Furthermore, it is well known that on imbalanced data, in the presence of under-fitting or over-fitting, the mean square error can produce trivial results (see [14]).
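As a concrete illustration of the integer-coding step, the following sketch computes the MSE of ordinal predictions; the class labels and the helper name are hypothetical, chosen only for this example.

```python
# A minimal sketch of integer-coded MSE for an ordinal target.
def ordinal_mse(y_true, y_pred, order):
    """MSE after mapping the ordered classes to consecutive integers."""
    rank = {c: k for k, c in enumerate(order)}  # class -> integer code
    return sum((rank[t] - rank[p]) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

order = ["low", "medium", "high"]
# Confusing "low" with "high" costs 4x as much as "low" vs "medium",
# but only because the integer coding imposes equal spacing.
print(ordinal_mse(["low", "low"], ["medium", "high"], order))  # 2.5
```

Note that the equal-spacing assumption is exactly the limitation discussed above: the coding respects the ordering, but the magnitudes of the penalties are an artifact of the chosen integers.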

A new index for model performances evaluation and comparison for ordinal target
Let y = {y_1, ..., y_N} be a test set for the ordinal target variable Y, where y_i ∈ {1, ..., M} (with M the number of ordered classes of the target variable), and let X be the N × p data matrix, where N is the number of observations and p the number of covariates.
The output of a predictive model is a matrix P = {p_ij}, with 0 ≤ p_ij ≤ 1, which contains the probability that observation i belongs to class j, as estimated by the model under evaluation.
Standard multi-class classification rules assign observation i to the class ĵ = argmax_l p_{i,l}.
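This assignment rule can be sketched as follows; the probability matrix P below is a made-up example, not taken from the paper's tables.

```python
# Standard multi-class rule: assign observation i to the class with the
# largest estimated probability p_ij (classes coded 1..M).
P = [
    [0.7, 0.2, 0.1],   # -> class 1
    [0.1, 0.3, 0.6],   # -> class 3
    [0.2, 0.5, 0.3],   # -> class 2
]
y_hat = [max(range(len(row)), key=row.__getitem__) + 1 for row in P]
print(y_hat)  # [1, 3, 2]
```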
In order to introduce our proposal, the definitions of classification function and error interval are required.

Definition 3.1 (Classification function). Let the observations {1, ..., N} be grouped by the estimated classes ŷ_i = j. Within each class, sort the observations in non-increasing order with respect to p_{i,j}. The resulting vector of indexes i is a permutation of the original vector, according to the ordering defined above. For a given model, the classification function f_mod : [0, 1] → {1, ..., M} is the piecewise constant function that maps the interval [(i−1)/N, i/N) associated with the i-th observation in this ordering to its estimated class. If no misclassification occurs in [n_{j−1}, n_j), the error interval is defined as the empty set and its length is e_j = 0.
Consider, for example, N = 10 observations and a three-level target variable (M = 3). Suppose that a predictive model returns the predictions reported in Table 1, where for each observation the real class is also reported. The final sequence of observations can be written as in Table 2.
The classification function and the corresponding perfect classification function are depicted in Figure 1 and Figure 2 respectively.
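Under the assumption that the sorting step works as in Definition 3.1, the construction of the ordered sequence and of the error-interval lengths e_j can be sketched as follows; the ten observations below are hypothetical and do not reproduce Table 1.

```python
# Sketch of Definitions 3.1-3.2: within each estimated class, sort the
# observations by non-increasing probability, then measure each class's
# error-interval length e_j = (n_j - i~_j) / N.
N, M = 10, 3
obs = [  # (real class, estimated class, probability of the estimated class)
    (1, 1, 0.9), (1, 1, 0.8), (2, 1, 0.6), (1, 1, 0.5),
    (2, 2, 0.7), (2, 2, 0.6), (3, 2, 0.5),
    (3, 3, 0.9), (3, 3, 0.7), (2, 3, 0.4),
]

sequence = []       # observations reordered as in the paper's Table 2
error_lengths = {}  # e_j for each estimated class j
for j in range(1, M + 1):
    block = sorted((o for o in obs if o[1] == j), key=lambda o: -o[2])
    wrong = [k for k, o in enumerate(block) if o[0] != j]
    # the error interval runs from the first misclassification to the class end
    error_lengths[j] = (len(block) - wrong[0]) / N if wrong else 0.0
    sequence.extend(block)

print(error_lengths)  # {1: 0.2, 2: 0.1, 3: 0.1}
```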
In order to define the three error intervals, as a preliminary step we identify the first misclassified observation in each estimated class.

Definition 3.3 (Index). Let l_j = (n_j − n_{j−1})/N be the length of the domain of class j and w_j = e_j / l_j the normalized length of its error interval. The index is defined as

I = Σ_{j=1}^{M} w_j ∫_{n_{j−1}/N}^{n_j/N} |f_mod(x) − f_exact(x)| dx.

Property 1 (Non-negativity). I ≥ 0, with I = 0 if and only if f_mod = f_exact.

Proof. Since w_j ≥ 0 and |f_mod(x) − f_exact(x)| ≥ 0 by definition, we can conclude that I ≥ 0.

We prove also that I = 0 if and only if f_mod = f_exact:

• I = 0 implies that, for each class j, either w_j = 0 or the integral over the class domain vanishes;
• w_j = 0 ⟺ ĩ_j = n_j, i.e. there are no classification errors, so f_mod = f_exact in class j.

So we can conclude that I = 0 implies f_mod = f_exact. The other implication is trivial.
Proposition 3.4 (Upper bound). I ≤ K, where K is defined as

K = Σ_{j=1}^{M} l_j · max(M − j, j − 1).

Proof. The maximum value is reached when the worst classification is obtained, i.e. when all observations are assigned to the farthest class. If this happens, each error interval is as long as the class domain, so w_j = 1 for all j = 1, ..., M, and each integral in the sum is a rectangle with basis the class domain l_j and height the maximum reachable height, max(M − j, j − 1).

The normalized index is then defined as I_n = I/K, where K is the maximum defined in Proposition 3.4.
In the previous example, K = 1.7 and the corresponding value of the normalized index is I_n = 0.255.

Consider a classification C and a classification C′ obtained from C by turning a correctly classified observation i of class j into a misclassification. All the components in the sum of the index I_n remain unchanged except for the j-th one, thus obtaining I′_n. Looking at each of the two elements in the product, two cases are possible: if the probability associated with the i-th observation is less than or equal to the probability of the first error, the error interval is unchanged and w′_j = w_j; otherwise, the error interval becomes larger, so w′_j > w_j. Moreover, in C′ there is one more misclassification than in C, so the distance between f_mod and f_exact increases.

We can conclude that I′_n ≥ I_n.
We remark that in Property 3 the converse does not hold: if I_n for model 1 is larger than or equal to I_n for model 2, we cannot draw conclusions on the number of misclassified observations in the two classifications.
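The index and its normalized version can be sketched numerically as follows. This is a hedged reconstruction: it assumes the form I = Σ_j w_j ∫ |f_mod(x) − f_exact(x)| dx with w_j = e_j / l_j and K = Σ_j l_j · max(M − j, j − 1), and the ten observations are hypothetical (they do not reproduce Table 1).

```python
# Hedged sketch of the index I and its normalized version I_n = I / K.
N, M = 10, 3
# (real class, estimated class, probability), already sorted as in Def. 3.1
seq = [(1, 1, 0.9), (1, 1, 0.8), (2, 1, 0.6), (1, 1, 0.5),
       (2, 2, 0.7), (2, 2, 0.6), (3, 2, 0.5),
       (3, 3, 0.9), (3, 3, 0.7), (2, 3, 0.4)]

I = K = 0.0
for j in range(1, M + 1):
    block = [o for o in seq if o[1] == j]
    l_j = len(block) / N                       # class-domain length
    wrong = [k for k, (real, est, _) in enumerate(block) if real != est]
    e_j = (len(block) - wrong[0]) / N if wrong else 0.0
    w_j = e_j / l_j if l_j else 0.0            # normalized error length
    # each observation contributes a rectangle of width 1/N to the integral
    I += w_j * sum(abs(real - est) for real, est, _ in block) / N
    K += l_j * max(M - j, j - 1)

I_n = I / K
print(round(K, 2), round(I_n, 4))  # 1.7 0.0686
```

With this (assumed) form, K depends only on the class-domain lengths, so two models with the same predicted-class frequencies share the same normalizing constant.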

Toy examples
In order to show how our index works with respect to the indexes proposed in the literature, two toy examples are reported in this section, with the main aim of discussing its behaviour in terms of model selection compared with the AUC, accuracy and MSE.
Y is a target variable characterized by M = 3 levels, y_i ∈ {1, 2, 3}, and model 1 and model 2 are two competing models under comparison.

First toy example
In the first toy example we take into account the ordinal structure of the target variable Y; Table 3 and Table 4 report the confusion matrices of the two models. For the sake of comparison, for each model the AUC, the accuracy, the MSE and our index are computed, as summarized in Table 5.
We remark that the values in Table 5 allow a direct comparison of the four metrics for the two models.

Second toy example
The second toy example considers the probability assigned to each observation. In practical applications, where we also need to evaluate how much uncertainty is associated with a prediction, the starting point is the probability that the new observation belongs to the estimated class.
From Table 6, both model 1 and model 2 assign an observation of the first class to the second one. The first model assigns a higher probability to the misclassified observation than the second. We can then conclude that model 2 is better than model 1 for the data at hand.

Empirical evaluation on simulated data
In order to show how our proposal works in model selection, this section reports the empirical results achieved on a simulated dataset.
The simulated dataset is composed of three covariates obtained by Monte Carlo simulation and an ordinal target variable with M = 5 classes, as reported in Table 8. The sample size is N = 7500. Five different models are under comparison:
• Ordinal logistic regression (Ord Log),
• Classification tree (Tree),
• Support vector machine (SVM),
• Random forest (RFor),
• k-Nearest Neighbour (kNN).
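A minimal sketch of such a comparison protocol (not the paper's exact simulation design) could look as follows. The latent-variable construction of the ordinal target, the sample size and all model settings are illustrative assumptions, and the ordinal logistic regression is omitted because it requires an extra package (e.g. mord or statsmodels).

```python
# Hedged sketch: three covariates, an ordinal target with M = 5 obtained
# by thresholding a latent score, and four of the five competing models.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 3))
latent = X @ np.array([1.0, -0.8, 0.5]) + rng.normal(scale=0.5, size=1500)
# classes 1..5 from the empirical quintiles of the latent score
y = np.digitize(latent, np.quantile(latent, [0.2, 0.4, 0.6, 0.8])) + 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(random_state=0),
    "RFor": RandomForestClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
}
results = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
           for name, m in models.items()}
print(results)
```

In practice, each evaluation metric (AUC, accuracy, MSE and the proposed index) would be computed on the test predictions and the models ranked per metric, as in Table 10.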
For each model, the AUC, accuracy, MSE and our index are computed. For the sake of clarity, Table 10 shows the resulting ranks of the models, based on the results obtained for the four metrics under comparison.
We can see that the k-nearest neighbour is ranked as the best model according to all the indexes employed for model choice. Furthermore, from Table 9 the k-nearest neighbour outperforms the other models. The support vector machine is considered the second-best model with respect to all performance indicators.
The rest of the models under comparison are ranked differently with respect to the evaluation metrics adopted.

Conclusions
A new performance indicator is proposed to compare predictive classification models characterized by an ordinal target variable.
Our index is based on the definition of a classification function and an error interval, and a normalized version of the index is derived. The empirical evidence at hand shows that our index discriminates among different models better than the classical measures available in the literature.
Our index can be used together with other performance metrics for model selection.
From a computational point of view, a further line of research will consider the implementation of our index in a new R package. In terms of applications, we think that our index could be directly incorporated into the assessment process for predictive analytics.

As a special case, the perfect classification function f_exact : [0, 1] → {1, ..., M} is the piecewise constant function such that each estimated class corresponds to the real class identified by y. Note that f_exact is unique up to permutations of the observations within the same estimated class. The error interval in each class can be derived as the interval between the first misclassified observation and the end of the observations in that estimated class.

Definition 3.2 (Error interval). Suppose that the range corresponding to the estimated class j is [n_{j−1}, n_j), and let ĩ_j ∈ {n_{j−1}, ..., n_j} be the first misclassified observation. The error interval is then defined as [ĩ_j/N, n_j/N) and its length is e_j = (n_j − ĩ_j)/N.

Property 2.

I has the upper bound M − 1; the bound M − 1 is attained only when M = 2 (binary classification).

Proposition 3.6.

The accuracy is a special case of the index introduced in Definition 3.3. Proof. Consider the misclassification proportion p_err = #{misclassified observations}/N, i.e. the complement of the accuracy. Setting M = 2, from Proposition 3.4 we obtain K = 1 and max_x |f_mod(x) − f_exact(x)| = 1, so each error weighs 1/N; if w_1 = w_2 = 1, then I_n = p_err. Property 3 (Monotonicity). Consider a classification C with misclassifications and N observations. Operating a transformation of C into a classification C′ in which a correctly classified observation is changed into a misclassification, the index I_n becomes higher. Proof. In the classification C′, ℓ′ = ℓ + 1 observations are misclassified, where ℓ is the number of observations misclassified in C: the observations misclassified in C plus a new misclassification. Suppose that the new misclassification is the observation i, which is assigned to the class j instead of its real class.
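Proposition 3.6 can be checked numerically under the same assumed form of the index (I = Σ_j w_j ∫ |f_mod(x) − f_exact(x)| dx, I_n = I/K): with M = 2 and the first sorted observation of each class misclassified, so that w_1 = w_2 = 1, the normalized index coincides with the misclassification proportion. The data below are hypothetical.

```python
# Hedged numeric check of Proposition 3.6 for M = 2: when each class's
# first sorted observation is misclassified (w_1 = w_2 = 1), I_n = p_err.
N, M = 10, 2
seq = [(2, 1, 0.9), (1, 1, 0.8), (1, 1, 0.7), (1, 1, 0.6), (1, 1, 0.5),
       (1, 2, 0.9), (2, 2, 0.8), (2, 2, 0.7), (2, 2, 0.6), (2, 2, 0.5)]

I = K = 0.0
for j in (1, 2):
    block = [o for o in seq if o[1] == j]
    l_j = len(block) / N
    wrong = [k for k, (real, est, _) in enumerate(block) if real != est]
    e_j = (len(block) - wrong[0]) / N if wrong else 0.0
    w_j = e_j / l_j if l_j else 0.0
    I += w_j * sum(abs(real - est) for real, est, _ in block) / N
    K += l_j * max(M - j, j - 1)  # = l_j when M = 2, so K = 1

p_err = sum(real != est for real, est, _ in seq) / N
print(I / K, p_err)  # both 0.2
```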

Table 3 and Table 4: the confusion matrices corresponding to model 1 and model 2. It is clear that model 2 makes a better classification than model 1.

Table 2 :
Index construction

Table 9 :
Model comparison

Table 10 :
Results in terms of ranking.