Hostility measure for multi-level study of data complexity

Complexity measures aim to characterize the underlying complexity of supervised data. These measures tackle factors that hinder the performance of Machine Learning (ML) classifiers, such as overlap, density, and linearity. The state of the art has mainly focused on the dataset perspective of complexity, i.e., offering an estimation of the complexity of the whole dataset. Recently, the instance perspective has also been addressed. In this paper, the hostility measure, a complexity measure offering a multi-level (instance, class, and dataset) perspective of data complexity, is proposed. The proposal is built by estimating the novel notion of hostility: the difficulty of correctly classifying a point, a class, or a whole dataset given their corresponding neighborhoods. The proposed measure is estimated at the instance level by applying the k-means algorithm in a recursive and hierarchical way, which makes it possible to analyze how points from different classes are naturally grouped together across partitions. The instance information is aggregated to provide complexity knowledge at the class and the dataset levels. The validity of the proposal is evaluated through a variety of experiments dealing with the three perspectives and the corresponding comparison with the state-of-the-art measures. Throughout the experiments, the hostility measure has shown promising results and has proved to be competitive, stable, and robust.


Introduction
For several years now, Machine Learning (ML) has been in the spotlight, and supervised problems account for an important part of it. For classification problems and, indeed, for all analysis involving data, a first step of data exploration is essential, providing the user with knowledge and understanding of the data. This is extremely useful for the following tasks involving data, and skipping it could lead to wrong decisions and results. However, on a daily basis, this exploratory phase is rarely focused on the underlying complexity of the dataset. As a matter of fact, during the modeling stage, the selection of the best classifier ordinarily follows a trial-and-error approach. Several classifiers are tested and the one offering the best performance is finally selected. No information is drawn about why some classifiers perform better or which characteristics of the data are causing the final results. Different factors can disturb the performance of classifiers [3], for instance, the distribution of classes, the sparsity of data, the type of decision boundary, the overlap among classes, or the noise. The purpose of complexity measures is to identify and quantify this type of data characteristics as a way of understanding the complexity of the data and its impact on the classification [14].
Complexity measures have mainly focused on a global perspective, quantifying the complexity of the whole dataset from different points of view: linearity, overlap, balance of classes, etc. In recent years, a new approach has emerged that builds complexity measures at the instance level and averages them to get the global dataset perspective [33]. In fact, some of the classic measures were originally constructed from the instance level [14]. The instance approach is fruitful since it provides the global complexity estimation by identifying the critical and problematic individual points that are actually causing that complexity. Thus, data complexity is analyzed from two points of view: the instance and the dataset level.
In this paper, the two-level perspective is taken a step further by presenting, to our knowledge for the first time, a multi-level study of data complexity through the concept of hostility defined here and the proposed complexity measure that estimates it, the hostility measure. The notion of hostility refers to the difficulty of correctly classifying a point, a class, or a dataset according to their surroundings. For example, a point fully surrounded by instances of its own class will have a hostility of zero. On the other hand, the more instances from other classes are around it, the more hostility the point will have. This concept is intuitive, it naturally embraces the various perspectives of data (point, class, and dataset), and it offers an interpretable value in terms of the probability of how difficult it is to classify them regarding their neighborhoods.
The estimation of the notion of hostility is carried out by building a complexity measure at the instance level, aggregating it at the class level to get a complexity value for each class, and also aggregating it at the dataset level to obtain a global quantification of the complexity. This is the proposed multi-level complexity measure, which is called the hostility measure. It analyzes the distribution of classes in neighborhoods of increasing size. By doing this, it detects critical points, that is, points that lie in overlapping areas (a really detrimental factor for classifiers [31,33,36]). These can also be borderline points, which are near the decision boundary, or noisy points that are lost among points from other classes. In contrast with most complexity measures, the hostility measure proposed here combines different layers of information, since it is calculated by applying the well-known k-means algorithm in a recursive and hierarchical way. This increases the robustness and adaptability of the method. Also, tracking the results from the different layers of the procedure provides useful information. Some promising results of a preliminary version of the proposal have been presented in [20].
The main contributions of the present paper are: 1. To revisit the main state-of-the-art complexity measures, clarifying their levels of definition. 2. To introduce the concept of hostility. 3. Based on the notion of hostility, to propose a new complexity measure called the hostility measure that addresses a multi-level perspective of the data complexity. 4. To evaluate the performance of the hostility measure and compare it with the state-of-the-art. 5. To present the hostility tracking graph and the overlapping tracking graph, which offer exploratory information about a dataset.
The paper is structured as follows. Section 2 recapitulates the state-of-the-art of complexity measures with special emphasis on the ones considering more than one level of information. The concept of hostility is formally presented in Section 3 and the proposed measure of complexity is described in Section 4. In Section 5, experiments comprising the three perspectives are expounded. Section 6 describes the research opportunities that have emerged along the current research. Finally, Section 7 concludes.

State-of-the-art
Complexity measures gained more attention as a result of the work from Ho and Basu [14]. Ever since, these measures have been further studied and applied for several purposes: imbalanced problems [2,18,30,39], metalearning [21,24], analysis of learning algorithms [4,27], automatic recommendation of classifiers [6], and hyper-parameter optimization [7]. They have also been implemented in different fields like genetics, medicine, and human-computer interaction [3]. A detailed summary with applications of complexity measures as well as a recapitulation of the existing complexity measures can be found in [25].
Regarding complexity measures that address more than one level of information, the work in [33] is seminal providing the instance perspective. The aim is to detect which instances are harder to classify and to calculate the individual contribution of the instances to the global complexity. To this end, a range of complexity measures, called hardness measures, were defined to tackle the instance perspective specifically. They can be later averaged to get the dataset level. In accordance with this new perspective, some of the classical measures have been adapted to the instance level [1].
Following [25], complexity measures are grouped in six main categories: feature-based, linearity, neighborhood, network, dimensionality, and class imbalance. Next, a brief explanation and the main measures of each category are described.
Feature-based measures focus on the overlap of features between classes to assess the discriminant ability of the features:
• F1, the maximum Fisher's discriminant ratio, measures the capacity of every feature to separate the classes in terms of the overlap that the values of each feature present [14].
• F2, the volume of the overlapping region, is calculated as the length of the overlapping zones among classes [14].
• F3, the maximum individual feature efficiency, is defined as the maximum discriminant capacity of the features. This is calculated as the maximum number of points from different classes in overlap, for the set of features, divided by the total number of points [13].
• F4, the collective feature efficiency, is similar to F3 but analyzes the discriminatory power of the features jointly [29].
• F1v, the directional-vector maximum Fisher's discriminant ratio, was created as a version of F1 able to overcome the major drawback of feature-based measures: they assume the discriminative hyperplane is perpendicular to the axis of one input feature [29]. F1v projects data for maximizing class separation and, in that projection, looks for a vector able to separate the classes.
• In [1], the idea of getting the overlap of features within the two classes is extended to the instance level with four hardness measures. F1_HD captures the number of features for which an instance lies in an overlapping area. The distance of each instance to the overlapping region is gauged for each feature and then transformed to obtain higher values for instances placed in the middle of the overlapping region. This is calculated for each feature and F2_HD is defined to be the minimum of them, F3_HD the mean, and F4_HD the maximum.
Linearity measures check the linear separability of a problem:
• L1, the sum of the error distance by linear programming [14]. It evaluates if the data is linearly separable by adding the distances of the incorrectly classified instances to the linear boundary. Note that although L1 detects if a problem is linearly separable, it is not able to distinguish which one is the simpler linear problem. The instance version of L1 is L1_HD [1], which multiplies, for each instance, its distance to the linear frontier by its label $y_i \in \{-1, +1\}$.
• L2, the error rate of a linear classifier [14].
• L3, the non-linearity of a linear classifier [15]. It starts by generating test points by linear interpolation between random pairs of points of the same class. Then, the linear classifier is trained on the original points and tested on the new points. L3 is the test error.
Neighborhood measures are based on the distance among points. They study the distribution of classes, how they intertwine with each other, and the presence of overlapped and borderline points in neighborhoods:
• N1, the fraction of borderline instances obtained from a Minimum Spanning Tree (MST) built from the data [14]. Each vertex of the tree corresponds to one instance and the edges are weighted according to the distance between them. N1 is the percentage of vertices connected to instances of other classes. The instance version of N1, N1_HD, is defined as the number of connections the instance holds with instances from other classes [1]. Both measures are sensitive to noisy instances.
• N2 is the ratio of intra/extra class nearest neighbor distance [14], defined as r/(1+r), where r is the ratio of the sum of the distances between each point and its closest neighbor from the same class (intra) and the sum of the distances between every point and its closest neighbor from another class (extra). Its instance version N2_HD takes the ratio value for each point [1]. N2 is influenced by the data distribution, the shape of the boundary, and noisy points.
• N3 is the error rate of the k-Nearest Neighbour (kNN) classifier with k = 1 and using a leave-one-out procedure [14].
• N4, the non-linearity of the kNN classifier, is similar to L3 but uses the kNN classifier with k = 1 [14].
• T1 is the fraction of hyperspheres covering the data [14]. This measure starts by building a hypersphere centered at each point, whose radius is determined by the distance between the point and its nearest enemy (i.e., the nearest point from another class). Then, all the hyperspheres completely included in bigger hyperspheres are eliminated. Finally, T1 is the ratio between the final number of retained hyperspheres and the number of points.
• LSC, the local-set average cardinality. Following [21], the local-set of an instance (originally defined in [5]) is the set of instances closer to that instance than its nearest enemy. The cardinality of the local-set indicates its proximity to the decision boundary and also the space between classes.
• In [1], four measures related to T1 and the local-set concept are added:
- LSC($x_i$), the local-set cardinality of each point.
- LS radius is the radius of the local-set of each point, showing how close every point is to the other class.
- The usefulness index U of an instance is the number of instances containing it in their local-sets.
- The harmfulness index H of an instance is the number of instances for which it is the nearest enemy.
• The k-Disagreeing Neighbors (kDN) of a point is the percentage of its k nearest neighbors that belong to other classes [33]. It is averaged for the dataset level.
• R-value [28] measures the existing overlap among classes in the dataset. It examines, using kNN, if a point is in an overlapping area. If more than a parameter θ of its k nearest neighbors are from another class, the point is considered to be in overlap. This is averaged to have an overlapping ratio per class and an overall overlapping ratio for the whole dataset.
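As an illustration of how an instance-level neighborhood measure is averaged to the dataset level, the kDN description above can be sketched in a few lines. This is a minimal sketch, not the implementation used in [33] or [1]: the function name `kdn`, the Euclidean distance, and the toy one-dimensional data are assumptions made only for the example.

```python
import math

def kdn(points, labels, k=5):
    """k-Disagreeing Neighbors: for each point, the fraction of its k nearest
    neighbors (Euclidean distance, the point itself excluded) whose label differs."""
    scores = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Distance from point i to every other point, paired with that point's label.
        dists = sorted(
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points) if j != i
        )
        neighbors = dists[:k]
        disagree = sum(1 for _, lab in neighbors if lab != y)
        scores.append(disagree / len(neighbors))
    return scores

# Toy data (hypothetical): two separated groups plus one class-1 point placed
# inside the class-0 group.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (1.4,)]
labels = [0, 0, 0, 1, 1, 1, 1]

instance_kdn = kdn(points, labels, k=2)
dataset_kdn = sum(instance_kdn) / len(instance_kdn)  # dataset level: plain average
print(instance_kdn)  # [0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0]
```

The intruding point receives the maximum value, and its three class-0 neighbors are each penalized as well, which is the kind of sensitivity to isolated points that neighborhood measures show.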
Network measures are gauged based on a graph built from the data preserving the original similarities among instances. Each instance is represented as a node in the graph and is connected with undirected edges to the instances whose distance from it is less than a given threshold. In the final graph, nodes from different classes are not connected. The main measures [11] are the average density of the network (Density), the clustering coefficient (ClsCoef) and the hub score (Hubs).
Dimensionality measures focus on the sparsity of the data and measure the relationship between the number of points and the number of features. The principal measures are: the average number of features per dimension T2 [14], the average number of Principal Component Analysis (PCA) dimensions per points T3 [23], and the ratio of the PCA dimension to the original dimension T4 [23].
Class imbalance measures assess the balance between the class sizes. Two common metrics are the imbalance ratio C2 and the entropy of class proportions C1.
Furthermore, in this work, the category model-inspired measures is proposed to be added to the previous taxonomy. These complexity measures are inspired by the learning mechanism of different classifiers. The hardness measures from [33] framed in this category are listed below. The dataset level value of these measures is just the average of the instance values.
• Disjunct Size (DS). The DS of an instance is the number of instances in its disjunct (i.e., leaf node where it is classified in a Decision Tree (DT)) divided by the number of instances in the largest disjunct. Disjuncts are created with a version of C4.5 algorithm: not pruned and allowing one instance per node.
• The Disjunct Class Percentage (DCP) of an instance is the proportion of instances from its class in the disjunct it belongs to.
• The Tree Depth (TD) is the depth of the leaf node in a DT where the point is classified. It uses a C4.5 decision tree in its pruned version, Tree Depth Pruned (TDP), and unpruned version, Tree Depth Unpruned (TDU). Note that if a point is misclassified in a shallow split of the pruned tree, the resulting complexity information of that point is not trustworthy.
• Based on the philosophy of the Naïve Bayes, the Class Likelihood (CL) estimates the likelihood of an instance belonging to a class, deeming the features independent. For continuous variables, likelihood is gauged with a kernel density estimation.
• Class Likelihood Difference (CLD) offers the difference between the CL of an instance and its maximum likelihood for the rest of classes.
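The two disjunct-based measures can be illustrated once the leaf of each instance is known. The sketch below assumes the leaf assignments are already available (in the source they come from a C4.5 tree, which is not reproduced here); the function name `disjunct_measures` and the toy leaf identifiers are hypothetical.

```python
from collections import Counter

def disjunct_measures(leaf_ids, labels):
    """Per-instance DS and DCP, given the leaf (disjunct) each instance falls in."""
    leaf_sizes = Counter(leaf_ids)
    largest = max(leaf_sizes.values())
    # Number of instances of each class inside each leaf.
    leaf_class_counts = Counter(zip(leaf_ids, labels))
    # DS: size of the instance's disjunct relative to the largest disjunct.
    ds = [leaf_sizes[leaf] / largest for leaf in leaf_ids]
    # DCP: share of the instance's own class inside its disjunct.
    dcp = [leaf_class_counts[(leaf, y)] / leaf_sizes[leaf]
           for leaf, y in zip(leaf_ids, labels)]
    return ds, dcp

# Hypothetical leaf assignments for six instances.
leaf_ids = ["a", "a", "a", "a", "b", "b"]
labels = [0, 0, 0, 1, 1, 1]
ds, dcp = disjunct_measures(leaf_ids, labels)
print(ds)   # [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]
print(dcp)  # [0.75, 0.75, 0.75, 0.25, 1.0, 1.0]
```

These are the raw values, for which higher means easier; any re-scaling to make higher values mean harder instances is a separate step.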

Hostility
In this section, the formal notion of hostility at each one of the considered levels (point, class, and dataset) is presented. For this purpose, the notation used for the formal definitions of the hostility and the preliminary concept of the neighborhood of a point are first addressed.
Let $\mathcal{X} = (X_1, \dots, X_p)^T$ be the input vector of $p$ random variables, $\mathcal{Y}$ the random output variable and assume that $(\mathcal{X}, \mathcal{Y})$ is the corresponding joint distribution. Suppose that $D = \{(x_i, y_i) \mid i = 1, \dots, n\}$ is the dataset containing $n$ independent and identically distributed observations from $(\mathcal{X}, \mathcal{Y})$, where $x_i = (x_{i1}, \dots, x_{ip})^T$ is the $i$th observed value of $\mathcal{X}$ and $y_i$ is the $i$th observed value of $\mathcal{Y}$, $i = 1, \dots, n$. Now, let $X = \{x_i\}_{i=1}^{n}$ be the set of input observations from $D$ and $Y = \{y_i\}_{i=1}^{n}$ the set of the corresponding labels from $D$, where $y_i \in C = \{1, \dots, c\}$, with $C$ being the set of class labels.
In these terms and following [17], the neighborhood of a point $x_i \in X$ is a subset of $X$ containing an open ball with center $x_i$ and radius $r > 0$, $r \in \mathbb{R}$. That is, it contains the set of all points $x_j \in X$ such that $d(x_i, x_j) < r$, where $d(\cdot, \cdot)$ is a distance function. The neighborhood of a point $x_i \in X$ will be denoted as $N(x_i)$.

Definition 1
The hostility of an instance is the difficulty of correctly classifying the instance given its neighborhood. That is, the hostility of an instance $(x_i, y_i) \in D$, denoted as $H(x_i, y_i)$, is the opposite of the probability of identifying its class $y_i$ given the distribution of classes of all the points that belong to its neighborhood:
$$H(x_i, y_i) = 1 - P\big(y_i \mid \{(x_j, y_j) \in D \mid x_j \in N(x_i)\}\big),$$
where $\{(x_j, y_j) \in D \mid x_j \in N(x_i)\}$ are the instances pertaining to the neighborhood of the point $x_i$, that is, to $N(x_i)$.
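To make Definition 1 concrete, the probability of identifying the class of a point can be approximated by the share of that class among the observations inside an open ball around the point. The sketch below is only an illustration of the definition under that assumption; it is not the estimator proposed in the paper (which relies on recursive k-means), and the function name, the fixed radius, and the toy data are hypothetical.

```python
import math

def neighborhood_hostility(points, labels, i, radius):
    """Empirical hostility of instance i: 1 minus the share of its own class among
    the instances lying in the open ball of the given radius centered at it."""
    center, own = points[i], labels[i]
    # The ball always contains the point itself (distance 0 < radius).
    in_ball = [labels[j] for j, q in enumerate(points)
               if math.dist(center, q) < radius]
    same = sum(1 for lab in in_ball if lab == own)
    return 1.0 - same / len(in_ball)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
labels = [0, 0, 1, 1]

h_first = neighborhood_hostility(points, labels, 0, radius=1.5)  # ball holds labels 0, 0, 1
h_last = neighborhood_hostility(points, labels, 3, radius=1.5)   # ball holds only the point
print(round(h_first, 4), h_last)  # 0.3333 0.0
```

A point whose ball contains only its own class gets hostility zero, and the value grows with the share of other classes around it, matching the intuition given in the Introduction.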

Definition 2
The hostility of a class $c$ is the difficulty of adequately identifying all the points of the class $c$, as belonging to class $c$, given their neighborhoods. That is, the hostility of a class $c$ is the opposite of the probability of correctly classifying the complete class $c$ given the points that belong to the neighborhood of the set of the points from class $c$. Let $D_c = \{(x_i, y_i) \in D \mid y_i = c\}$ be the restricted dataset $D$ to class $c$. Then, the hostility of a class $c$, denoted as $H(D_c)$, is:
$$H(D_c) = 1 - P\big(Y_c \mid N(X_c)\big),$$
where $X_c$ and $Y_c$ are the input observations and the labels of $D_c$, and $N(X_c)$ is the neighborhood of the set $X_c$.

Definition 3
The hostility of a dataset $D$ is the difficulty of correctly classifying all the points of $D$ given their neighborhoods. In other words, the hostility of a dataset $D$ is the opposite of the probability of identifying the class of each point of the dataset given the neighborhood of the set $X$ of input observations from $D$. Then, given a dataset $D$, its hostility $H(D)$ is:
$$H(D) = 1 - P\big(Y \mid N(X)\big),$$
where $N(X)$ is the neighborhood of the set $X$.

Proposed method
In this section, the proposed hostility measure to estimate the previously defined hostility concept is described. The hostility measure is able to provide knowledge at three different levels: instance, class, and dataset. It is initially calculated for every single point, offering a hostility estimation value for every instance, $H(x_i, y_i)$. These instance values are used for two further aggregations: first, a hostility value for every class, $H(D_c)$, and second, a global value for the whole dataset, $H(D)$. The dataset value goes hand-in-hand with complexity measures from the state-of-the-art. However, the perspective per class is quite novel and offers prior knowledge about which class is more affected by the distribution of others and, hence, will be harder to classify.
The calculation of the hostility measure starts by applying the k-means clustering algorithm with the Euclidean distance. If this unsupervised algorithm is applied to a supervised dataset and, for each cluster, the probability of every class is extracted, an informative map of the class structure of the data can be obtained. Not only will this map reveal where a class is dominant, but it will also point out the most uncertain areas where classifiers tend to fail. Nevertheless, if the parameter k is not correctly selected, any exploratory analysis derived from the resulting partition would be worthless. To avert this situation, the k-means algorithm is here performed hierarchically and recursively following [37] (see Fig. 1). The k-means is a simple method that makes it easy to analyze the data in the natural groups they form according to their similarities and to select a good representative for each cluster. In addition, applying it in a hierarchical and recursive way guarantees robust partitions capturing the structure of the data and interactions among classes. Also, it makes it possible to efficiently track the evolution of these partitions across the different iterations. In this recursive process, the different clustering iterations will be denoted as "layers" from now on. In the first layer, k-means is applied to the whole set $X$ and, in the next iterations, the data input is the set of centroids gathered from the previous step. Thus, the data input will be denoted as $X'$, with $X' = X$ in the first iteration. Every time the algorithm is performed, a cluster partition $B = \{B_1, \dots, B_k\}$ of $X'$ is obtained. Note that the higher the layer, the smaller the number of clusters and, consequently, the more points are grouped together. The objective is to capture the behavior of classes through recursive partitions and to get successive clusters revealing how data from different classes are grouped.
In every layer $l \in \mathbb{N}$, for any original point $x_i \in X$, the probability of its class $y_i$ in the cluster $B_r \in B$ it pertains to is stored. The probability is denoted as $p_{li}$, with $l$ indicating the number of the layer and $i = 1, \dots, n$ referring to the particular instance $x_i$. This probability is the proportion of the class $y_i$ in the specific cluster $B_r$ based on the original points that belong to it:
$$p_{li} = \frac{|\{x_j \in B_r \mid y_j = y_i\}|}{|\{x_j \in B_r\}|},$$
where $|\cdot|$ represents the cardinality of a set and $x_j$ ranges over the original points grouped in $B_r$. As the procedure is hierarchical, it is straightforward to determine which cluster a point belongs to at any layer. Thus, in every layer $l$, a probability vector $p_l = (p_{l1}, \dots, p_{ln})^T \in \mathbb{R}^n$ is gathered and averaged with the probability vector from previous layers. The resulting probability vector $p = (p_1, \dots, p_n)^T \in \mathbb{R}^n$ summarizes, for every point, the dominance of its class through the variety of clusters where the point has been grouped. This average probability vector is the key for the hostility measure calculation since it reflects, for each point, the presence of its class. Consequently, its opposite $1 - p$ shows the absence of its class, that is, the dominance of the others and how harmful they are. In other words, the estimated hostility value for all points is just the opposite of this average probability vector, that is, $1 - p$, whose $i$th component is the estimate of $H(x_i, y_i)$. These hostility measure values at the instance level estimate the probability of Definition 1.
To estimate the hostility for the class and the dataset levels meeting Definitions 2 and 3, a probability threshold $\delta$ is applied to binarize the instance values. If the hostility measure of a point is equal to or higher than $\delta$, its binary value will be 1. Otherwise, it will be binarized to 0, as the point lies in areas where its class is better represented and is less harmful for the classification task. This binary information is averaged to achieve the hostility estimation per class and for the total dataset:
- The hostility measure of a class $c$ is calculated by averaging the binarized hostility measure of the instances belonging to that specific class:
$$H(D_c) = \frac{1}{n_c} \sum_{(x_i, y_i) \in D_c} I\big(H(x_i, y_i) \geq \delta\big),$$
where $n_c$ is the number of instances in class $c$ and $I(\cdot)$ is the indicator function that takes value 1 when its argument is true and 0 otherwise. This estimation gives an indication of how complex it is to identify each class within the dataset and allows a ranking of the complexity of the different classes.
- The global hostility measure for the dataset is calculated as the average of the binarized hostility measure of all points in the dataset:
$$H(D) = \frac{1}{n} \sum_{(x_i, y_i) \in D} I\big(H(x_i, y_i) \geq \delta\big).$$
Similarly to the hostility measure per class, it is estimated as the proportion of critical points in the whole dataset, that is, points expected to be erroneously identified as belonging to another class.
For both cases, the maximum value is reached when the hostility measure of all points is greater than or equal to $\delta$, that is, when $H(x_i, y_i) \geq \delta$, $\forall (x_i, y_i) \in D_c$ for the class level and $\forall (x_i, y_i) \in D$ for the dataset one. Notice that all the proposed hostility measures are defined in the range [0, 1] to ease their interpretation and comparison. Besides, they supply, at both levels, an estimation of the classification complexity.
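The chain from per-layer class proportions to the three hostility levels can be sketched as follows. This is a minimal sketch under stated assumptions: the cluster assignment of every original point at each layer is taken as given (the k-means step is not shown), the layers are combined with a plain arithmetic mean, which is one possible reading of "averaged with the probability vector from previous layers", and all function names and toy assignments are hypothetical.

```python
from collections import Counter

def layer_probabilities(cluster_of_point, labels):
    """p_l: for each original point, the proportion of its own class among the
    original points that fall in the same cluster at this layer."""
    sizes = Counter(cluster_of_point)
    class_counts = Counter(zip(cluster_of_point, labels))
    return [class_counts[(c, y)] / sizes[c]
            for c, y in zip(cluster_of_point, labels)]

def hostility(layers, labels, delta=0.5):
    """Instance, class and dataset hostility from per-layer cluster assignments."""
    per_layer = [layer_probabilities(assign, labels) for assign in layers]
    n = len(labels)
    # ASSUMPTION: layers are combined with a plain arithmetic mean.
    p = [sum(layer[i] for layer in per_layer) / len(per_layer) for i in range(n)]
    instance = [1.0 - p_i for p_i in p]
    # Binarize with the threshold delta, then average per class and overall.
    critical = [1 if h >= delta else 0 for h in instance]
    per_class = {}
    for c in sorted(set(labels)):
        members = [critical[i] for i in range(n) if labels[i] == c]
        per_class[c] = sum(members) / len(members)
    dataset = sum(critical) / n
    return instance, per_class, dataset

labels = [-1, -1, -1, 1, 1, 1]
layers = [
    [0, 0, 1, 1, 2, 2],  # layer 1: three clusters (hypothetical assignments)
    [0, 0, 0, 0, 1, 1],  # layer 2: two clusters
]
instance, per_class, dataset = hostility(layers, labels)
print(instance)  # [0.125, 0.125, 0.375, 0.625, 0.0, 0.0]
```

In this toy case only the fourth point reaches the threshold of 0.5, so the class +1 gets a hostility of 1/3, the class −1 a hostility of 0, and the dataset a hostility of 1/6.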
The proposed method has three parameters:
- The probability threshold ($\delta$) mentioned above.
- The proportion of grouped points per cluster ($\sigma$). This parameter automatically determines the number of clusters $k$ in every layer. The purpose of $\sigma$ is to set the pace of grouping in the recursive k-means process.
- The minimum number of clusters allowed ($k_{min}$). It cannot be lower than the number of classes. The iterative process stops when the following $k$ is going to be lower than $k_{min}$. The final results come from this last layer.
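The interplay between σ and k_min can be illustrated by listing the number of clusters used at each layer. The sketch assumes that k is the number of input points of the layer divided by σ, rounded down; the rounding rule and the function name `cluster_schedule` are assumptions. This reading is consistent with the value k = 240 reported for layer 2 in the experiments with σ = 5 and 6000 training points.

```python
def cluster_schedule(n_points, sigma, k_min):
    """Number of clusters k used at each layer of the recursive k-means."""
    if sigma <= 1:
        raise ValueError("sigma must be greater than 1 for the layers to shrink")
    schedule = []
    n_inputs = n_points  # layer 1 clusters the original points
    while True:
        k = n_inputs // sigma  # ASSUMPTION: integer (floor) division
        if k < k_min:
            break  # the next k would fall below k_min, so the process stops
        schedule.append(k)
        n_inputs = k  # the next layer clusters the k centroids just obtained
    return schedule

print(cluster_schedule(6000, 5, 2))  # [1200, 240, 48, 9]
print(cluster_schedule(6000, 8, 2))  # [750, 93, 11]
```

Lower σ values yield more layers, and hence more information to combine, at a higher computational cost.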
Algorithm 1 presents the pseudocode to obtain the hostility measure. Note that in all layers except for the first one, the hierarchical structure has to be used to extract the cluster to which every original point belongs.
Since the hostility measure values are obtained in every layer, their evolution across partitions can be tracked. This tracking is used to select the best layer to stop at and, consequently, the $k_{min}$ parameter, by looking for changes in the behavior of the hostility measure. A pattern change in the resulting hostility tracking graph points out where the partition of clusters starts to lose the structure of the data, which is usually when the number of clusters is low. The final selected layer must be the one before the pattern changes, to ensure that the data structure is captured and that the results are stable.
Notice that, even though the hostility measure uses k-means as the base of the method, it is not affected by its main drawbacks. The k-means method depends on the initialization and cannot form non-convex shapes [9]. Nevertheless, the hostility measure overcomes these problems thanks to its initialization with a high value for k to maximize the number of layers and, consequently, the resulting information to combine. When the number of clusters k is high, the chance of having a bad initialization decreases. Also, in the first layer, the method starts with a local perspective and the quantity of points per cluster is small, which avoids the problem of the inability to form non-convex clusters. As for the rest of the layers, thanks to tracking it is possible to detect when the behavior of the partition becomes different and, thus, to keep the previous results.
For the sake of simplicity and clarity, the method has been expounded for the binary case, but its calculation for multi-class problems is straightforward. In fact, in the multi-class case, more information can be extracted: the estimated hostility that every single class receives from the rest of classes and the estimated hostility that a class receives from a specific class or a group of classes.

Experiments
This section is devoted to evaluating the hostility measure through a variety of experiments involving artificial data and benchmark real datasets. The section begins with the description of the datasets and the selection of parameters for the rest of experiments. Later, the performance of the proposed measure is analyzed and compared with the state-of-the-art measures. Since a multi-level approach is presented, these experiments are divided into instance, class, and dataset levels. After this, two more experiments are presented highlighting other abilities of the hostility measure: an experiment showing the explanatory power of the hostility and overlapping tracking graphs derived from the method, and the extension of the proposal to multi-class problems. The section ends with the lessons learned throughout the experiments. Notice that all the results from this section related to hostility come from the hostility measure, which estimates the formal concept of hostility. However, for the sake of simplicity, both terms will be used interchangeably.

Set up
A total of 27 datasets have been considered: 11 are artificial datasets specifically created to assess the behavior of the hostility measure and the remaining 16 are binary real datasets from [8,35].
Fig. 2 Artificial datasets. The class −1 is the blue one represented with · and the class +1 is the orange one represented with +
The 11 artificial datasets (see Fig. 2) are 9 sets of Normal distributions, the moon dataset and the XOR dataset. Each of the simulated Normal datasets is formed by 2 bivariate Normal distributions with different degrees of overlap, density, and a variety of shapes. Datasets 1, 2, 3, and 4 are formed by classes with equal variance and with an increasing symmetric overlap between classes. Datasets 5, 6, 7, 8, and 9 present different dispersion for each pair of classes and various asymmetric types of overlap. In these last datasets, there is a sparse class that is always less overlapped and a denser class that can be fully or partially concentrated inside the sparse class. For all cases, each class has 4500 points.
The artificial data have been mainly generated using the Normal distribution so as to have a theoretical overlap reference value. Given two Normal probability density functions $f(x)$ and $g(x)$, their overlap [38] is defined as:
$$\int \min\{f(x), g(x)\}\, dx.$$
[…] Table 3. Notice that the Wine and Yeast datasets are originally multi-class problems but the two most balanced classes have been chosen. Besides, as a reference of the complexity of each real and artificial dataset, a set of ML algorithms has been considered: SVM with linear kernel, SVM with RBF kernel, Random Forest (RF), MLP, XGBoost, kNN, DT and LR. The respective parameters have been selected through a 5-fold cross validation and a grid search maximizing the balanced accuracy. For the artificial datasets, 6000 points are destined to training and 3000 to testing. The real datasets are split into training (70%) and testing (30%). Finally, the best model for each dataset is the one maximizing the balanced accuracy while avoiding overfitting, that is, the model matching $\max\big(\text{Test Balanced Accuracy} - \max(0, \text{Train Balanced Accuracy} - \text{Test Balanced Accuracy})\big)$. In all cases, complexity measures are only applied to the training set [14]. All results, except the classification error, are computed on the training set and, for all experiments, the datasets are previously standardized. Tables 2 and 3 contain the best model for each dataset and its corresponding test error. Notice that the error is, in all cases, 1 − Balanced Accuracy.
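As a small numerical illustration of a theoretical overlap reference value, the sketch below integrates the pointwise minimum of two Normal densities, assuming the overlap is the usual overlapping coefficient. It is restricted to the univariate case for brevity, whereas the artificial datasets of the paper are bivariate; the function names, the integration range, and the step count are assumptions.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate Normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def overlap(mean_f, std_f, mean_g, std_g, steps=20_000):
    """Integral of min(f, g) over the real line, by the trapezoidal rule."""
    # Ten standard deviations on each side cover virtually all the mass.
    lo = min(mean_f - 10 * std_f, mean_g - 10 * std_g)
    hi = max(mean_f + 10 * std_f, mean_g + 10 * std_g)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        value = min(normal_pdf(x, mean_f, std_f), normal_pdf(x, mean_g, std_g))
        # End points get half weight in the trapezoidal rule.
        total += value if 0 < i < steps else 0.5 * value
    return total * h

print(round(overlap(0.0, 1.0, 1.0, 1.0), 4))  # 0.6171
```

For two unit-variance Normals one standard deviation apart, the result agrees with the closed form 2Φ(−0.5) ≈ 0.6171, and identical densities give an overlap of 1.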
Regarding the parameters of the hostility measure, the selection of k min , σ and δ is required. The parameter δ is a threshold for a probability vector and, as such, is fixed to 0.5 as the default value for class probabilities. For the σ parameter, values between 4 and 8 are recommended by the authors. Smaller σ values are discarded because they are not able to capture the data structure in the first layer. On the other hand, higher values minimize the number of layers and can lose the data structure in intermediate and last layers due to the high number of clusters that they assemble. To maximize the number of layers, lower σ values are preferred in general. For large datasets, higher σ values can be used to save computational cost. Throughout the paper, results for these σ values in {4, 5, 6, 7, 8} will be shown to point out their validity. The parameter k min is, by default, the number of classes but it can be selected using the hostility tracking graph. As an example, the hostility tracking graphs for the datasets 2 and 5 are displayed in Fig. 3. The σ values are equal to 8 and 5, respectively. The size of the dots of the graph represents, for every layer and every class, the proportion of clusters in which the class is the majority class. Figure 3a reveals that both classes have a similar low hostility caused by the opposite class. This behavior is maintained across the three layers as the steady hostility values per class reflect. Note that in the last layer, there is a slight change in the hostility patterns. Therefore, the best layer to stop could be the layer 2 or 3. Moreover, in each layer, the size of the dots for both classes is similar, which means that they are the most representative class in a similar number of clusters. In Fig. 3b, the class −1 clearly has more hostility than the class +1. The trend of hostilities change from the layer 2 (k = 240) and starts to widen. Hence, the best layer to stop is the layer 2 to avoid instability. 
Regarding the dot sizes, the negative class is the most representative in most of the clusters through all layers. Given the low hostility of the class +1, this also reflects that it is less sparse than the class −1. To ease and automate the user's work, the remaining experiments all follow the same criterion: selecting the last layer whose hostilities per class do not vary by more than 25% from the hostility results of the first layer.
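The stopping criterion can be sketched as follows. This is only one possible reading: the variation is taken as the relative difference with respect to the first layer, and the last layer satisfying the bound for every class is returned. The function name and the handling of a zero reference value are assumptions.

```python
def select_stop_layer(hostility_per_layer, tolerance=0.25):
    """Index of the last layer whose per-class hostilities stay within
    `tolerance` (relative change) of the first layer's values.

    hostility_per_layer: list of lists, one inner list per layer (first
    layer at index 0), holding one hostility value per class.
    Assumption: a class with zero hostility in the first layer is
    compared in absolute terms.
    """
    first = hostility_per_layer[0]
    selected = 0
    for index, layer in enumerate(hostility_per_layer):
        within = True
        for reference, value in zip(first, layer):
            scale = reference if reference > 0 else 1.0
            if abs(value - reference) / scale > tolerance:
                within = False
                break
        if within:
            selected = index
    return selected


# Hypothetical per-class hostilities for three layers of a binary problem.
layers = [[0.08, 0.09], [0.085, 0.09], [0.12, 0.05]]
stop = select_stop_layer(layers)  # index 1, i.e. the second layer
```

A stricter variant would stop at the first layer that violates the bound; which of the two readings is intended is not specified here.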
Concerning measures from the state-of-the-art, the parameters have been chosen according to the authors' recommendations: for kDN, k = 5 following [1,33], and R-value is obtained with k = 7 and θ = 3 following [28]. In the case of CL and CLD, the Gaussian kernel density estimation is selected. The C4.5 algorithm from RWeka [16] is used to calculate the hardness measures based on the C4.5 DT. For DS and TDU, parameters are chosen to avoid pruning and with a minimum number of instances per node equal to 1. For DCP and TDP, default parameters are taken. In particular, the complexity measures F1, F1v, F2, F3, F4, N1, N2, N3, N4, T1, LSC, Density, ClsCoef, Hubs, L1, L2, L3, T2, T3, C1, and C2 are obtained from the R package 'ECoL' [10].
Moreover, for all experiments, all complexity measures have been re-scaled to behave consistently with the error, i.e., lower values imply simpler instances and higher values more complex instances.

Instance level
This subsection is dedicated to the instance perspective of complexity measures. It is, in turn, divided into two different experiments. First, a graphical study and comparison of the behavior of several complexity measures is presented. This experiment shows the relation among complexity values at the instance level and the predicted probabilities from the best classifier. The second experiment aims to verify if each complexity measure is actually able to identify the most complex points. The complexity measures considered in this section are all the measures covering the instance level in Table 1.
In this experiment, the probability that the best classifier assigns to each point of belonging to its correct class is obtained following the cross validation scheme in [33]. The complement of these predicted probabilities, i.e., the probability of belonging to the opposite class, serves as a complexity reference for all instances.

Graphical analysis and correlation with classification error
For this experiment, datasets 3 and 6 have been chosen as a representative sample of the artificial datasets: two Normal distributions with similar density and overlap, and two others with a great difference in density and the positive class fully inside the negative one. Figures 4 and 5 contain the complexity measure values for the instances in datasets 3 and 6, respectively. Since some classes are in overlap, a graph per class and per complexity measure is generated. As detailed before, a graph is also presented with the predicted probability that each point has of belonging to the opposite class according to the corresponding best classifier. In the graphs, yellow colors imply more complexity and blue colors less.
As expected, the feature-based measures (F1_HD, F2_HD, F3_HD and F4_HD) fail to determine the difficulty of each point since they appraise overlap perpendicular to the axes. CL and CLD are calculated with a kernel density estimation using the Gaussian distribution, and this assumption is reflected in the results. CL informs about how far points are from the center of the distribution but not about their complexity. Even though CLD detects the hardest instances better, the captured complexity distribution is biased by the Gaussian assumption. The measures based on decision trees (DS, DCP, TDP and TDU) show sharp cuts associated with the hyperplanes generated by the trees instead of gradually degrading complexity values. This behavior does not capture the complexity distribution of the points. Although TDU provides richer information that is less shaped by hyperplanes, it considers that the overlapping region of dataset 6 (see Fig. 5s) is equally complex for both classes even though the class −1 is clearly less present in that area (recall that both classes have the same number of samples). The linearity measure L1_HD performs well for dataset 3 (Fig. 4n) but fails for dataset 6, which is not linearly separable (Fig. 5n). Concerning measures based on the local-set concept, the harmfulness index H is the least informative since it only assigns high complexity values (yellow points) to points absolutely surrounded by points of the other class (i.e., the point is the nearest enemy of all of them). Except for these few points, it assigns similarly low levels of complexity to the rest of the dataset points. LSC, LSradius and the usefulness index U perform better in detecting the most complex areas but, inside those areas, they do not generate a clear complexity degradation (in contrast with the classifier behavior in Figs. 4a and 5a). In addition, they overestimate the complexity and identify as complex some points that are not even in overlap (see Fig. 4a, f and g). The measures kDN, N1_HD and N2_HD reveal the best results among measures from the state-of-the-art. They capture the distribution of the complexity for the two datasets and reflect the same patterns as the classifier. Nevertheless, none of them accomplishes a complexity degradation as smooth as that of the classifier. The only measure that achieves it is the hostility measure proposed here, thanks to its construction. The combination of information from different layers produces richer results. Also, as the number of clusters is smaller in each layer, their size is larger, which makes it possible to study the points and the distribution of classes from a local to a global perspective. For both datasets, the hostility measure is the complexity measure that visually most closely resembles the results from the classifier (see Fig. 4a and b for dataset 3 and Fig. 5a and b for dataset 6). Points receiving more

Fig. 5 Visual analysis of the complexity measures for the dataset 6 at the instance level. For each complexity measure, the left graph is for the class +1 and the right graph is for the class −1. Yellow colors indicate more complex instances

To evaluate analytically whether the complexity values per instance are consistent with the predicted probabilities from the corresponding best classifier, the Spearman correlation is computed, for each dataset, between the complexity values of each measure at the instance level and those predicted probabilities. These correlations ease the comparison of the capacity of the complexity measures to correctly rank the points according to their complexity. Then, to easily compare all results, a boxplot of these correlations is generated for each complexity measure. High and positive Spearman correlations mean that the complexity measure is able to order the points adequately according to their complexity, matching the predicted probabilities from the best model. On the other hand, low and negative values imply that the complexity ranking established by the complexity measure behaves differently from the results of the best model. That is, there is no agreement about which points are more complex. Note that, for this experiment, the hostility measure is computed for σ ∈ {4, 5, 6, 7, 8}.
For the artificial data (see Fig. 6a), the measures revealing higher correlation with the classifier are the hostility measure, some of the measures based on the local-set concept (LSC, LSradius, and U), and also N2_HD and kDN. In the case of the real datasets (see Fig. 6b), the outstanding measures are the hostility measure, LSC, U, CLD, kDN and L1_HD. The only measures keeping their behavior for both types of datasets are the hostility measure, LSC, U and kDN. Taking into account the performance of LSC and U in the visual study, it can be concluded that the hostility measure and kDN are the two complexity measures performing best in estimating the complexity of each point.
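The rank-based comparison used in this experiment can be sketched with the standard library only. The sketch assumes that the complexity reference is one minus the predicted probability of the correct class and that ties receive average ranks, as in the usual Spearman coefficient. All names and the example values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if one input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def reference_complexity(prob_correct_class):
    """Assumed reference: probability of belonging to the opposite class."""
    return [1.0 - p for p in prob_correct_class]


# Hypothetical instance-level complexity values and classifier probabilities.
complexity = [0.1, 0.4, 0.9, 0.3]
reference = reference_complexity([0.95, 0.6, 0.2, 0.7])
rho = spearman(complexity, reference)
```

Here both vectors order the four points identically, so the correlation is 1; a measure ranking the points in the reverse order would yield −1.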

Complexity points detection verification
The purpose of the current experiment is to verify that the instances pointed out as complex by the complexity measures are indeed harder to classify. To that aim, the evolution of the train error when using all points and when filtering a proportion of the most complex ones is analyzed. In particular, two subsets of the train data are considered: the first subset removes the 10% most complex points and the second subset the 50%. Since, in every subset, the samples are simpler according to the complexity measures, the error is expected to decrease.
In this experiment, results are shown for the 10 real datasets with the highest classification error and for the measures highlighted in the former experiment: the hostility measure, kDN, L1_HD, LSC, U and CLD. Table 4 reveals that, in general, all the considered measures detect the most complex points, since they achieve an error reduction when filtering the 10% and the 50% of those points. Another common and expected pattern is that the more complex points are removed, the lower the error. The case of Hill Valley is noteworthy: only the hostility measure and L1_HD have managed to reduce the initial error. When retaining the 90% simplest points, the hostility measure slightly increases the error of the Hill Valley dataset. Note that, due to its construction, when filtering the most complex points with the proposed measure, only noise points or outliers will be extracted at the beginning. However, in an intermediate stage, the points lying in the most uncertain areas will be removed, so that only the simpler ones remain. The behavior of LSC and kDN in the Yeast dataset is also remarkable: they increase the error when training with the second and simpler subset. Despite this, in general, the six considered complexity measures correctly identify which points are harder to classify and they could be used to reduce the size of the train data. Nevertheless, the hostility measure, L1_HD, and CLD stand out by presenting a more stable performance.
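The filtering step of this experiment can be sketched as below. The function name is hypothetical, and ties between equally complex points are broken by their original order, which is an assumption.

```python
def keep_simplest(complexity, fraction_removed):
    """Indices of the points kept after removing the given fraction of
    the most complex points (highest complexity values)."""
    n_points = len(complexity)
    n_remove = int(round(fraction_removed * n_points))
    # Sort indices from the simplest to the most complex point.
    order = sorted(range(n_points), key=lambda i: complexity[i])
    return sorted(order[:n_points - n_remove])


# Hypothetical instance-level complexity values for ten training points.
values = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
subset_90 = keep_simplest(values, 0.10)  # drops the single hardest point
subset_50 = keep_simplest(values, 0.50)  # keeps the five simplest points
```

The classifier would then be retrained on each subset and its error compared with the one obtained on the full training set.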

Class level
This section of experiments is devoted to data complexity from the class level perspective. To assess the capacity of the proposal to estimate the complexity of each class, the results of the hostility measure for each class are compared with the errors of the best classifier. The comparison also includes the R-value since it offers the class perspective.
For the artificial data (Table 5), the hostility measure (with all the σ values considered) and R-value show values similar to the error committed in both classes. Hence, both measures make it possible to anticipate what to expect in the classification task. Similar results are found for the real data (Table 6). The two measures perform well in capturing the proportion of critical points that really harm the classifier. Nevertheless, R-value fails in determining which class is more difficult to classify for the Hill Valley dataset and the proposal fails for the Mammographic one. In this case, the hostility results are less stable for the different values of σ due to the small size of some real datasets (see Table 3). In these cases, the recommended σ among the possible options is the one maximizing the number of layers. For the small datasets, σ should be 4 or 5.
These results are accompanied by the Spearman correlation among the error, the hostility measure and R-value, to evaluate if they are able to rank classes according to the real complexity. Results are presented in Table 7a for the artificial datasets and in Table 7b for the real datasets. Both measures achieve correlations between 0.79 and 1. Hence, the hostility measure is able to correctly identify which class is harder and will need more attention during the classification task. Also, it offers interpretable values.

Table note: For each dataset, the first row shows the results for the class −1 and the second one for the class +1. The numbers accompanying the hostility measure indicate the corresponding σ value

Dataset level
The final experiments deal with the classic dataset perspective to study the performance of the hostility measure and to compare it with measures from the state-of-the-art.
Since complexity measures should be well correlated with the classification error, the Spearman correlations between the complexity measures and the error of the best classifiers are presented in Table 8a for the artificial data and in Table 8b for the real data. The correlations with the theoretical overlap are also shown for the artificial case. All measures have been previously normalized so that a positive correlation with the error is expected. Both tables In the case of the real datasets, they have low and negative correlations.

Table note: To enable the comparison, the error, and the theoretical overlap for the artificial case, are also presented

As a way to demonstrate the clarity and explainability of the hostility measure, Table 9a and b show the dataset hostility value and the classification error for the artificial and real datasets, respectively. In this case, σ = 8 has been chosen for the artificial datasets due to the high number of instances (6000). For the real datasets, to maximize the number of layers, σ = 4 has been selected. Recall that, at the dataset level, the proposed measure is an estimation of the proportion of critical points. The results show that it differs only slightly from the error. That is, the hostility measure offers a good estimation of the critical points of a dataset.
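The aggregation from the instance level to the class and dataset levels can be sketched as follows. The sketch assumes that a point counts as critical when its instance hostility reaches the threshold δ, and that the class and dataset values are the proportions of critical points. The function name and the choice of a non-strict inequality are assumptions.

```python
def class_and_dataset_hostility(instance_hostility, labels, delta=0.5):
    """Proportion of critical points per class and for the whole dataset.

    Assumption: a point is critical when its hostility is >= delta.
    """
    critical = [1 if h >= delta else 0 for h in instance_hostility]
    per_class = {}
    for label in sorted(set(labels)):
        flags = [c for c, y in zip(critical, labels) if y == label]
        per_class[label] = sum(flags) / len(flags)
    dataset = sum(critical) / len(critical)
    return per_class, dataset


# Hypothetical instance hostilities for a small binary dataset.
hostility = [0.0, 0.2, 0.6, 0.1, 0.1, 0.5]
labels = [-1, -1, -1, -1, 1, 1]
per_class, dataset = class_and_dataset_hostility(hostility, labels)
```

Under these assumptions, one of the four points of class −1 and one of the two points of class +1 are critical, so the class values are 0.25 and 0.5 and the dataset value is one third.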

Enrichment of the hostility tracking graph
The hostility tracking graph, besides its utility to select the best layer to stop, offers information about the interaction among classes. The hostility tracking graph of dataset 2 (see Fig. 3a) showed a close and constant hostility behavior centered around 0.08. This means that the global complexity of the dataset is low and that both classes are equally harmed by the other class: about 8% of the points in each class are critical. In the case of dataset 5, Fig. 3b presented a different situation. Until layer 2, the hostility of each class remains fairly steady, but from layer 3 both hostilities start to diverge. This means that the behavior of the data when analyzed in small neighborhoods is not the same as in larger ones. Moreover, the class +1 always shows lower hostility than the class −1. Hence, the class −1 is expected to be harder to classify.
Previous figures were obtained from the binarization of the hostility measure using the threshold δ = 0.5. If it is instead binarized with δ > 0, insights about the overlap of each class are obtained. With this binarization, 1 means that the point is in an overlapping area. Thus, any point that shares a cluster with a point from another class is considered to be in overlap. In the first layer, this offers an estimation of the overlap per class. As the clusters grow across layers, the overlap obviously tends to 1. Despite this, tracking this overlap estimation provides information about the density of the classes and how they interact. Besides, the dot sizes of both graphs inform about the distribution of the classes among the clusters of the different partitions and how this evolves across layers.
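For a single layer, the overlap estimation and the dot sizes described above can be sketched from the cluster assignments alone. The function names are hypothetical, and crediting tied clusters to no class is an assumption.

```python
from collections import defaultdict


def overlap_per_class(cluster_ids, labels):
    """Proportion of points of each class sharing a cluster with at
    least one point from another class (binarization with delta > 0)."""
    classes_in_cluster = defaultdict(set)
    for cluster, label in zip(cluster_ids, labels):
        classes_in_cluster[cluster].add(label)
    in_overlap = defaultdict(int)
    totals = defaultdict(int)
    for cluster, label in zip(cluster_ids, labels):
        totals[label] += 1
        if len(classes_in_cluster[cluster]) > 1:
            in_overlap[label] += 1
    return {label: in_overlap[label] / totals[label] for label in totals}


def majority_share(cluster_ids, labels):
    """Proportion of clusters in which each class is the majority one.

    Assumption: a cluster with tied counts is credited to no class.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1
    wins = {label: 0 for label in set(labels)}
    for per_label in counts.values():
        top = max(per_label.values())
        winners = [lab for lab, c in per_label.items() if c == top]
        if len(winners) == 1:
            wins[winners[0]] += 1
    return {label: wins[label] / len(counts) for label in wins}


# Hypothetical layer with three clusters: one mixed, two pure.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes = [1, 1, -1, -1, -1, 1, 1, 1]
overlap = overlap_per_class(clusters, classes)
shares = majority_share(clusters, classes)
```

Repeating this computation for the partition of every layer would give the points of a tracking graph, with the overlap on the vertical axis and the majority share as the dot size.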
As an example, the overlapping tracking graphs of datasets 2 and 5 are presented in Fig. 7. In the case of dataset 2, the same behavior in both classes across layers is detected. That is, as the layers progress, a similar amount of points from both classes share a cluster with points from the other class. If this information is combined with the one from the hostility tracking graph (Fig. 3a) and the balanced size of the dots (similar distribution of both classes in the partitions), it is concluded that they have the same density and that they overlap in a symmetric way. The overlapping tracking graph of dataset 5 is quite different: the class +1 begins with a really high value of overlap and quickly reaches the maximum (i.e., in every cluster with a point from the class +1 there is at least one point from the class −1), whereas the class −1 starts with a low overlap value and increases uniformly while keeping its distance from the class +1. This pattern highlights an asymmetric overlap. The class +1 is totally in overlap but not the class −1. That is, the class +1 is fully (or almost fully) inside the class −1. This complements the information of the hostility tracking graph (Fig. 3b). The class −1 has more hostility than the class +1 but the class +1 is totally in overlap. The dot sizes reflect that the class +1 represents the majority class in few clusters but, given its low hostility, the class is well covered by these clusters. Notice that, in the first layer, the class +1 is the majority class in a bigger proportion of clusters than in the rest of the layers. This means that when clusters are smaller, it is easier for the class +1 to be the majority one. Therefore, the class +1 is clearly denser and, even though it is totally in overlap, all its points are concentrated. This also explains why the behavior of dataset 5 differed between more local and more global clusters. Since one class is so dense, when the number of clusters becomes low, it is predominantly grouped in the same cluster. Thus, as almost the whole class is grouped together and it is the densest class, it has lower hostility. Consequently, all points from the sparse class that are grouped in the same cluster as the dense one have more hostility.

Fig. 7 Overlapping tracking graph. The negative class is represented in blue and the positive one in orange. For each layer and each class, the size of each dot indicates the proportion of clusters in which each class is the majority one. The overlapping is obtained with the hostility measure with σ = 8 for the dataset 2 and σ = 5 for the dataset 5
In conclusion, these tracking graphs provide rich information about the classes and guide the user's next steps. For example, knowing that one class is entirely inside the other lets the user filter out the non-overlapping points and focus the classification model on the complex areas. Another strong aspect is that these tracking graphs can be built for high dimensional data, offering some exploratory insights for data that are hard to visualize.

Hostility measure for multi-class problems
The hostility measure is expounded for the binary case but its extension to multi-class is straightforward. In this experiment, two artificial datasets composed of 3 classes have been generated and are presented in Fig. 8. The first one is similar to dataset 1 with a new, more dispersed class that overlaps with the two former ones. The second one is similar to dataset 7 with a new dense class that only overlaps with the black one (originally the negative class). For both datasets, the total hostility per class and the hostility that each class presents due to each one of the other classes are contained in Table 10a and b, respectively. Regarding the first multi-class dataset, the hostility reflects that the class 2 is the one with more hostility (0.139). This hostility comes mostly from the class 0 (0.119) while the class 1 hardly disturbs it (0.018). Similarly, the most perturbing class to the class 0 is the class 2 (0.120). Concerning the class 1, its total hostility is 0.045, which is equally provoked by the classes 0 and 2 (0.021). The second multi-class dataset is especially interesting since the hostility results reveal that the classes 1 and 2 do not generate hostility towards each other. The class 0 is the one that has more hostility (0.122), and it is equally caused by the classes 1 and 2. The hostility of the classes 1 and 2 due to the class 0 is fairly low, revealing that these two classes are better differentiated than the class 0.
Thus, the hostility measure is useful for multi-class problems since it shows how hostile every class is for each of the other classes and the total hostility that a class is suffering. It therefore lets the user divide the data space according to its hostility and tackle every area with a different strategy.

Fig. 8 Multi-class data. The class 0 is represented in blue, the class 1 in orange and the class 2 is the black one

Lessons learned
The main lessons learned from the proposal along the experiments are summarized in the current section. These lessons range from the great stability of the method and its good performance in all evaluated datasets to the validity of the recommended parameters.
Some of the alternative complexity measures have performed properly, offering really good results for some of these experiments. However, they have revealed poor performance in other cases. For instance, L1 is weak in estimating the global complexity but stands out in filtering the complex points. kDN performs well in general, but in the graphical analysis it assigned opposite complexity values to neighboring points and it was not so balanced when tackling artificial or real data. The hostility measure is clearly the most stable, always offering satisfactory results for both artificial and real datasets. This is because the combination of layers and the use of k-means to analyze the points in their inherent groups, instead of in fixed structures, provides adaptability and robustness to the method.
It is remarkable that the hostility measure, using the Euclidean distance, has performed well in all the considered types of datasets. The datasets involved in the experiments have contemplated extreme scenarios regarding overlap, density, and decision boundary shape, including linear and non-linear data. Hence, the method is able to detect the interactions among classes independently of the linearity of the data and of the shape of the decision boundary.
The results of the hostility measure have also revealed the validity of the recommended σ values: {4, 5, 6, 7, 8}. Taking into account the construction of the method, the validity of a range of values for the σ parameter reflects that the hostility measure is capturing the data structure. Smaller and higher σ values are discarded since they group few points in the first layer or too many points in the intermediate and final layers, respectively. Note that the weakest part of the hostility measure is the number of parameters. Despite the fact that this should be further revised by the authors, at the moment the selection of parameters has been considerably simplified for the user.

Table 10 The hostility measure values for the multi-class datasets. The tables contain the hostility measure between each pair of classes and the total value for each class. The hostility measure is obtained with σ = 6 for both datasets 1 and 2. C_q, q = 0, 1, 2 refers to each one of the classes

Research opportunities
The research directions that have appeared throughout the construction of this work are expounded in this section:
- Finding the decision boundary. Medium hostility values mean that the point resides in an uncertain area where both classes are equally present. This is normally the decision boundary and it could be detected by identifying those uncertain areas. One way of doing this is by translating the hostility in terms of uncertainty (for example, using Rényi's entropy) since its maximum value reflects the worst case, that is, when all classes are equally present. On the other hand, its minimum value is the best scenario in which there is only one class. This can be used for classification: applying weights to points depending on their uncertainty, dividing the data in different sets and building different classifiers for each one of them, etc.
- Feature selection. The hostility measure can be used to select the subset of features that minimizes the hostility of the dataset. This can also be tackled from the class level to discover which features are more harmful for each class.
- Imbalanced data. In the present work, the parameter δ has been set to 0.5 as the default threshold for probabilities. However, when classes are not balanced, this value might favor the majority class. The idea will be to set the δ parameter equal to the proportion of the minority class. The resulting hostility measure would allow the user to know whether the classification model used is harming the minority class or not. This version of the proposed measure could be compared with complexity measures specifically created for the imbalanced case that offer values per class [2,26,32].
- The choice of metric. In this paper, the Euclidean distance has been selected. It is worth analyzing the effect of the metric on the performance of the proposal and checking if the hostility measure with, for example, a Radial Basis Function kernel achieves better results in the case of non-linear data.
- Methods to estimate the concept of hostility. The k-means algorithm performed hierarchically and recursively has been considered to estimate the hostility. However, alternative estimation methods could also be contemplated, like applying hierarchical clustering in every layer with an automatic selection of the number of clusters and selection of prototypes, or using a dendrogram as the basis for the calculation of the concept of hostility.
- The applicability of the method. As reflected in Section 2, complexity measures have been applied to different fields [3]. Besides the application of the hostility measure in those fields, it could also be used in different domains like, for example, Big Data, to check if the condensation of data is rich enough to substitute the original data [12]. Also, an adapted version of the hostility could be applied to obtain useful information to enrich the modeling phase in target-environment networks that arise in fields like genetics and economics [19].
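Two of the ideas above can be illustrated with a minimal sketch: the Rényi entropy of a vector of class proportions as an uncertainty score, and the minority-class proportion as a candidate δ for imbalanced data. It assumes that the class proportions observed around a point (for instance, within its cluster) are available as a probability vector. The function names, the order α = 2, and the normalization by the logarithm of the number of classes are illustrative choices, not part of the proposal.

```python
import math


def renyi_entropy(proportions, alpha=2.0):
    """Rényi entropy of order alpha (natural logarithm)."""
    probs = [p for p in proportions if p > 0]
    if alpha == 1.0:
        # Shannon entropy is the limit of the Rényi entropy as alpha -> 1.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def normalized_uncertainty(proportions, alpha=2.0):
    """Entropy divided by its maximum, log(number of classes), so that
    1 means all classes equally present and 0 means a single class."""
    n_classes = len(proportions)
    if n_classes < 2:
        return 0.0
    return renyi_entropy(proportions, alpha) / math.log(n_classes)


def minority_proportion(labels):
    """Proportion of the least frequent class, a candidate delta for
    imbalanced data."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return min(counts.values()) / len(labels)


# Hypothetical class proportions around three points of a binary problem.
boundary_point = normalized_uncertainty([0.5, 0.5])   # worst case
mixed_point = normalized_uncertainty([0.9, 0.1])
pure_point = normalized_uncertainty([1.0, 0.0])       # best case
delta = minority_proportion([1, 1, 1, -1])
```

With this normalization, points near the decision boundary obtain scores close to 1, while points surrounded only by their own class obtain 0.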

Conclusions
In the present work, the concept of hostility has been introduced for the instance, the class, and the dataset level.
The hostility is the damage, in terms of probability, that a point, a class or a dataset suffers from the presence of points of other classes in its surroundings when being classified. To estimate it, the hostility measure, a neighborhood measure offering a multi-level data complexity perspective, has been presented. The measure is constructed at the instance level by means of a hierarchical and recursive application of the k-means algorithm. After this procedure, a hostility measure value between 0 and 1 is obtained for every point. These values per point are aggregated to get a hostility measure value per class, indicating how hard it is to identify each class, and for the whole dataset, illustrating the difficulty of separating the classes. As the method is hierarchical and recursive, the neighborhoods analyzed are of increasing size, which allows the method to combine a local and a more global perspective.
To evaluate the proposed complexity measure, several experiments for each one of the perspectives have been carried out. In them, the performance of the hostility measure has been compared with complexity measures from the state-of-the-art. The hostility measure has generally stood out, showing a good and stable performance in all of them. It is easy to understand and to interpret, and it is suitable for binary and multi-class classification problems. Also, the proposal does not make assumptions about the data, nor is it based on a supervised classifier, which ensures that there is no relation between the results from the complexity measure and the posterior classification. In addition, to the best of the authors' knowledge, the hostility measure is the only complexity measure that offers a multi-level analysis of data complexity and combines different layers of information, which increases the robustness of the method. Moreover, the class perspective of complexity is deeply tackled in the present work. Not only is an estimation of the complexity of each class offered, but two exploratory graphs (the hostility and overlapping tracking graphs) are also presented, providing information about the density and the relation between classes.

Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.