Feature ranking for semi-supervised learning

The data used for analysis are becoming increasingly complex along several directions: high dimensionality, number of examples and availability of labels for the examples. This poses a variety of challenges for the existing machine learning methods, related to analyzing datasets with a large number of examples that are described in a high-dimensional space, where not all examples have labels provided. For example, when investigating the toxicity of chemical compounds, there are many compounds available that can be described with information-rich high-dimensional representations, but not all of the compounds have information on their toxicity. To address these challenges, we propose methods for semi-supervised learning (SSL) of feature rankings. The feature rankings are learned in the context of classification and regression, as well as in the context of structured output prediction (multi-label classification, MLC, hierarchical multi-label classification, HMLC and multi-target regression, MTR) tasks. This is the first work that treats the task of feature ranking uniformly across various tasks of semi-supervised structured output prediction. To the best of our knowledge, it is also the first work on SSL of feature rankings for the tasks of HMLC and MTR. More specifically, we propose two approaches—based on predictive clustering tree ensembles and the Relief family of algorithms—and evaluate their performance across 38 benchmark datasets. The extensive evaluation reveals that rankings based on Random Forest ensembles perform the best for classification tasks (incl. MLC and HMLC tasks) and are the fastest for all tasks, while ensembles based on extremely randomized trees work best for the regression tasks. Semi-supervised feature rankings outperform their supervised counterparts across the majority of datasets for all of the different tasks, showing the benefit of using unlabeled in addition to labeled data.


Introduction
In the era of massive and complex data, predictive modeling is undergoing significant changes. Since data are becoming ever more high dimensional, i.e., the target attribute potentially depends on a large number of descriptive attributes, there is a need to better understand the importance or relevance of the descriptive attributes for the target attribute. This is achieved through the task of feature ranking [Guyon and Elisseeff, 2003, Jong et al., 2004, Nilsson et al., 2007, Petković et al., 2019]: the output of a feature ranking algorithm is a list (also called a feature ranking) of the descriptive attributes, ordered by their relevance to the target attribute. The obtained feature ranking can then be used in two contexts: (1) to better understand the relevance of the descriptive variables for the target variable, or (2) as a frequent pre-processing step to reduce the number of descriptive variables. The latter not only decreases the computational complexity of building a predictive model later on, but also yields models that use fewer features and are therefore easier to explain and understand, which is of high importance in a variety of application domains such as medicine [Holzinger et al., 2019, Hoogendoorn et al., 2016, Tjoa and Guan, 2019], life sciences [Grissa et al., 2016, Saeys et al., 2007, Tsagris et al., 2018] and ecological modeling [Bhardwaj and Patra, 2018, Galelli et al., 2014, Zhou et al., 2018].
Another aspect of massiveness is the number of examples in the data. However, for some problems, such as sentiment analysis of text, e.g., tweets [Kralj Novak et al., 2015], or determining the properties of new chemical compounds [DiMasi et al., 2003], e.g., in QSAR (quantitative structure-activity relationship) studies (one of which is among the datasets considered in the experiments), only a limited quantity of data can be labeled, since labeling demands a lot of human effort and time (labeling tweets) or is expensive (performing wet-lab QSAR experiments). Since cases where many examples remain unlabeled are not rare, advances in predictive modeling have made it possible to exploit them. In this work, we focus on semi-supervised learning (SSL) techniques that handle data where some examples are labeled and some are not (as opposed to supervised learning (SL), where all examples are labeled). Another direction of research is weakly supervised learning [Zhou, 2017], where all examples may be labeled but (some) labels may be inaccurate or of lower quality.
The SSL approaches are all based on the assumption that the target values are well reflected in the structure of the data:

Assumption 1 (Clustering Hypothesis). Clusters of data examples (as computed in the descriptive space) resemble well the distribution of target values.
If the clustering hypothesis is satisfied, then an SSL algorithm that can make use of unlabeled data may outperform the classical SL algorithms that simply ignore it. This holds for predictive modeling tasks [Levatić, 2017, Zhu et al., 2009] and, as we show in this work, also for feature ranking tasks.
In addition to the massiveness, the complexity of the data is also increasing. Predictive modeling is no longer limited to standard classification and regression, but also tackles their generalizations. For example, in classification, the target variable may take only one of the possible values for each example in the data. On the other hand, problems such as automatic tagging (e.g., the Emotions dataset (see Sec. 6.2), where the task is to determine the emotions that a given musical piece carries) allow for more than one label per example (e.g., a song can be sad and dramatic at the same time). A further generalization of this problem is hierarchical multi-label classification, where the possible labels are organized into a hierarchy, such as the one in Fig. 1, which shows animal-related labels. If a model labels an example as koala, it should also label it with the generalizations of this label, i.e., Australian and animal.
Similarly, the task of regression can be generalized to multi-target regression, i.e., predicting more than one numeric variable at the same time, e.g., predicting canopy density and height of trees in forests (the Forestry dataset in Sec. 6.2).
The main motivation for the generalized predictive modeling tasks is that considering all the target variables at the same time may exploit the potential interactions among them, which are ignored when each variable is predicted separately. Moreover, building a single model for all targets can dramatically lower the computational costs.
In many cases, the data are at the same time semi-supervised (some labels are missing), high dimensional, and have a structured target, as, for example, in gene function prediction: labeling genes with their functions is expensive (semi-supervision), the genes can be described with a large number of variables (high dimensionality), and the functions are organized into a hierarchy (structured target). Thus, designing feature ranking algorithms that i) can use unlabeled data, and ii) can handle a variety of target types, including structured ones, is a relevant task that we address in this work. To the best of our knowledge, this is the first work that jointly treats the task of feature ranking in the context of semi-supervised learning for structured outputs.
We propose two general feature ranking approaches. In the first approach, a ranking is computed from an ensemble of predictive clustering trees [Kocev et al., 2013, Blockeel, 1998], adapted to structured outputs and SSL [Levatić, 2017], whereas the second approach is based on the distance-based Relief family of algorithms [Kira and Rendell, 1992]. An initial study investigated the performance of the ensemble-based approach in the classification task [Petković et al., 2019]. In this work, we substantially extend our previous study in several directions:
1. Additional datasets for classification are considered.
2. Four additional tasks are considered (multi-label and hierarchical multi-label classification, single- and multi-target regression), and the ensemble-based feature ranking methods are evaluated in these cases.
3. The Relief family of algorithms is extended to SSL, and evaluated for all five tasks (in comparison to the ensemble-based feature ranking methods).
The rest of the paper is organized as follows. In Sec. 2, we give the formal definitions of the different predictive modeling tasks, and introduce the notation. Sec. 3 surveys the related work, whereas Secs. 4 and 5 define the ensemble-based and Relief-based feature importance scores, respectively. Sec. 6 fully describes the experimental setup. We present and discuss the results in Sec. 7, and conclude with Sec. 8.
The implementation of the methods, as well as the results are available at http://source.ijs.si/mpetkovic/ssl-ranking.

Preliminaries
Basic notation. The data D consist of examples (x, y), where x is a vector of values of D descriptive variables (features), and y is the value of the target variable(s). The domain X_i of the feature x_i is either numeric, i.e., X_i ⊆ R, or categorical, i.e., a finite set of categorical values, e.g., X_i = {A, B, AB, 0} if the feature describes blood type. Both numeric and categorical types are considered primitive, unstructured types. The domain Y of the target variable depends on the predictive modeling task at hand. In this paper, we consider five tasks: two with unstructured and three with structured target data types.
Regression (STR). In this case, the target is a single numeric variable. Since we later also consider its generalization (multi-target regression), we refer to this task as single-target regression (STR).
Multi-target regression (MTR). Here, the target variable is a vector with T numeric variables as components, i.e., Y ⊆ R^T. Equivalently, we can define MTR as having T numeric targets, hence the name. In the special case of T = 1, MTR boils down to STR.
Classification. In this case, the target is categorical. Since the algorithms considered in this paper can handle any classification task, we do not distinguish between binary (|Y| = 2) and multi-class classification (|Y| > 2).
Multi-label classification (MLC). The target domain is the power set P(L) of some set L of categorical values, whose elements are typically referred to as labels. Thus, the target values are sets. Typically, the target value y of an example (x, y) is referred to as the set of labels that are relevant for this example. The sets y can be of any cardinality; thus, the labels are not mutually exclusive, as is the case in (standard) classification.
Hierarchical multi-label classification (HMLC). This is a generalization of MLC where the domain is again the power set of some label set L, which, additionally, is now partially ordered via some ordering ≺. An exemplary hierarchy (of animal-related labels) that results from such an ordering is shown in the corresponding Hasse diagram in Fig. 1.

Fig. 1: An exemplary hierarchy of animal-related labels.
If ℓ_1 ≺ ℓ_2, the label ℓ_1 is a predecessor of the label ℓ_2. If, additionally, there is no label ℓ such that ℓ_1 ≺ ℓ ≺ ℓ_2, we say that ℓ_1 is a parent of ℓ_2. If a label does not have any parents, it is called a root. A hierarchy can be either tree-shaped, i.e., every label has at most one parent, or it can be a directed acyclic graph (DAG). Since the label elephant has two parents (African and Asian), the hierarchy in Fig. 1 is not a tree.
Regarding predictive modeling, the ordering results in the hierarchical constraint: if a label is predicted to be relevant for a given example, then also its predecessors must be predicted relevant, e.g., if a given example is labeled koala, it must also be labeled Australian and animal.
In the cases of MLC and HMLC, each set of relevant labels S ⊆ L is conveniently represented by the 0/1 vector s of length |L|, whose j-th component equals one if and only if ℓ_j ∈ S. Thus, we will also use the notation T = |L|.
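For illustration, the following minimal Python sketch encodes a label set as such a 0/1 vector, using the animal hierarchy from Fig. 1; the helper names (hierarchical_closure, encode) are our own and serve only to make the representation and the hierarchical constraint concrete.

parents = {                      # child -> set of parents (a DAG is allowed)
    "animal": set(),
    "Australian": {"animal"}, "African": {"animal"}, "Asian": {"animal"},
    "dingo": {"Australian"}, "koala": {"Australian"},
    "giraffe": {"African"}, "elephant": {"African", "Asian"}, "tiger": {"Asian"},
}
labels = sorted(parents)         # a fixed ordering of L, so T = |L|

def hierarchical_closure(relevant):
    # Add all predecessors of the relevant labels (the hierarchical constraint).
    closed, stack = set(), list(relevant)
    while stack:
        lab = stack.pop()
        if lab not in closed:
            closed.add(lab)
            stack.extend(parents[lab])
    return closed

def encode(relevant):
    # Represent a label set S ⊆ L as a 0/1 vector s of length |L|.
    closed = hierarchical_closure(relevant)
    return [1 if lab in closed else 0 for lab in labels]

print(encode({"koala"}))  # the components for 'Australian' and 'animal' are set to 1 as well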
Semi-supervised learning (SSL). The unknown target values will be denoted by question marks (?). If the target value of an example is known, we say that the example is labeled; otherwise, the example is unlabeled. This applies to all types of targets and is not to be confused with the labels in the tasks of MLC and HMLC.

Related Work
In general, feature ranking methods are divided into three groups [Stańczyk and Jain, 2015]. Filter methods do not need any underlying predictive model to compute the ranking. Embedded methods compute the ranking directly from some predictive model. Wrapper methods are more appropriate for feature selection, and build many predictive models which guide the selection.
Filters are typically the fastest, but can be myopic, i.e., they can neglect possible feature interactions, whereas embedded methods are somewhat slower, but can additionally serve as an explanation of the predictions of the underlying model. The prominence of feature ranking is reflected in the numerous methods solving this task in the context of classification and STR [Guyon and Elisseeff, 2003, Stańczyk and Jain, 2015]; however, the territory of feature ranking for SSL is mainly uncharted, especially when it comes to structured output prediction.
An overview of SSL feature ranking methods for classification and STR is given in [Sheikhpour et al., 2017]. However, the vast majority of the methods described there are either supervised or unsupervised (ignoring the labels completely). An exception is the SSL Laplacian score [Doquire and Verleysen, 2013], applicable to STR problems. This method is a filter and stems from graph theory. It first converts a dataset into a graph, encoded as a weighted incidence matrix whose weights correspond to the distances among the examples in the data. The distances are measured in the descriptive space, but more weight is put on the labeled examples. One of the drawbacks of the original method is that it can only handle numeric features. Our modification that overcomes this is described in Sec. 6.6.
For structured output prediction in SSL, we could not find any competing feature ranking methods. Our ensemble-based scores belong to the group of embedded methods, and crucially depend on ensembles of SSL predictive clustering trees (PCTs) [Levatić, 2017]. We thus describe below SSL PCTs and ensembles thereof.

Predictive clustering trees
PCTs are a generalization of standard decision trees. They can handle various structured output prediction tasks and have recently been adapted to SSL [Levatić, 2017]. This work considers SSL PCTs for classification, STR, MTR [Levatić et al., 2018], MLC, and HMLC.
For each of these tasks, one has to specify the impurity function impu that is used in the search for the best test (Alg. 2), and the prototype function prototype that creates the predictions in the leaf nodes. After these two are specified, a PCT is induced in the standard top-down tree-induction manner.
Starting with the whole dataset D_TRAIN, we find the test (Alg. 1, line 1) that greedily splits the data so that the heuristic score of the test, i.e., the decrease of the impurity impu of the data after applying the test, is maximized. For a given test, the corresponding decrease is computed in line 4 of Alg. 2.
If no useful test is found, the algorithm creates a leaf node and computes its prediction (Alg. 1, line 3). Otherwise, an internal node N with the chosen test is constructed, and the PCT-induction algorithm is recursively called on the subsets in the partition of the data defined by the test. The resulting trees become child nodes of the node N (Alg. 1, line 7).
[Algorithm 1 (PCT induction): if no acceptable test is found, return Leaf(prototype(E)); otherwise, grow a subtree tree_i for each subset E_i ∈ P* and return Node(t*, ∪_i {tree_i}). Algorithm 2 (BestTest(E)): starting from (t*, h*, P*) = (none, 0, ∅), evaluate each candidate test t and the partition it induces, and keep the test with the highest heuristic value (computed in line 4).]

The impurity functions for a given subset E ⊆ D_TRAIN in the considered tasks are defined as weighted averages of the feature impurities impu(E, x_i) and the target impurities impu(E, y_j).
For nominal variables z, the impurity is defined in terms of the Gini index Gini(E, z) = 1 − Σ_v p_E^2(v), where the sum goes over the possible values v of the variable z, and p_E(v) is the relative frequency of the value v in the subset E. In order not to favor any variable a priori, the impurity is defined as the normalized Gini value, i.e., impu(E, z) = Gini(E, z)/Gini(D_TRAIN, z). This applies to nominal features and the target in classification.
For numeric variables z, the impurity is defined in terms of their variance Var(E, z), i.e., impu(E, z) = Var(E, z)/Var(D_TRAIN, z). This applies to numeric features and to targets in the other predictive modeling tasks, since the sets in MLC and HMLC are also represented by 0/1 vectors. Note, however, that computing the Gini index of a binary variable is equivalent to computing its variance if the two values are mapped to 0 and 1. When computing the single-variable impurities, missing values are ignored.
In a fully supervised scenario, the impurity of the data is measured only on the target side. However, the majority of target values may be missing in the semi-supervised case. Therefore, for SSL, the features are also taken into account when calculating the impurity, which is defined as

impu(E) = w · (1/T) Σ_{j=1}^{T} α_j impu(E, y_j) + (1 − w) · (1/D) Σ_{i=1}^{D} β_i impu(E, x_i),    (1)

where the level of supervision is controlled by the user-defined parameter w ∈ [0, 1]. Setting it to 1 means fully supervised tree induction (and, consequently, ignoring unlabeled data). The other extreme, i.e., w = 0, corresponds to fully unsupervised tree induction (also known as clustering). The dimensional weights α_j and β_i are typically all set to 1, except for HMLC, where α_j = 1 for the roots of the hierarchy and α_j = α · mean(parent weights) otherwise, where α ∈ (0, 1) is a user-defined parameter. An MLC problem is considered a HMLC problem where all labels are roots.
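As an illustration of how the normalized impurities and Eq. (1) can be computed, consider the following numpy sketch; it assumes numeric (or 0/1-encoded) variables with missing values marked as NaN, sets all dimensional weights α_j and β_i to 1, and uses function names of our own choosing rather than the actual Clus code. For nominal variables, the gini function would take the place of variance.

import numpy as np

def gini(values):
    # Gini index of a categorical column; missing values (None) are ignored.
    vals = [v for v in values if v is not None]
    if not vals:
        return 0.0
    _, counts = np.unique(vals, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def variance(values):
    # Variance of a numeric (or 0/1) column; NaNs (missing values) are ignored.
    vals = values[~np.isnan(values)]
    return float(np.var(vals)) if vals.size else 0.0

def ssl_impurity(E_x, E_y, norm_x, norm_y, w):
    # Eq. (1) with alpha_j = beta_i = 1: a convex combination (controlled by w) of the
    # average normalized target impurity and the average normalized feature impurity.
    # norm_x / norm_y hold the impurities computed once on the whole D_TRAIN.
    target_terms = [variance(E_y[:, j]) / norm_y[j]
                    for j in range(E_y.shape[1]) if norm_y[j] > 0]
    feature_terms = [variance(E_x[:, i]) / norm_x[i]
                     for i in range(E_x.shape[1]) if norm_x[i] > 0]
    target_part = float(np.mean(target_terms)) if target_terms else 0.0
    feature_part = float(np.mean(feature_terms)) if feature_terms else 0.0
    return w * target_part + (1.0 - w) * feature_part

The heuristic score of a candidate test is then the impurity of the parent subset minus the size-weighted average impurity of the subsets the test induces (Alg. 2, line 4).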
The prototype function returns the majority class in the classification case, and the per-component mean [ȳ_1, . . . , ȳ_T] of the target vectors otherwise. In all cases, the prototypes (predictions) are computed from the training examples in a given leaf. In the cases of MLC and HMLC, the values ȳ_j can additionally be thresholded to obtain the actual label sets, i.e., ŷ = {ℓ_j | ȳ_j ≥ ϑ, 1 ≤ j ≤ T}, where taking ϑ = 0.5 corresponds to computing the majority value of each label.

Ensemble methods
To obtain a better predictive model, more than one tree can be grown for a given dataset, which results in an ensemble of trees. The predictions of an ensemble are the averaged predictions of the trees (or, in general, of arbitrary base models) in the ensemble. However, a necessary condition for an ensemble to outperform its base models is that the base models are diverse [Hansen and Salamon, 1990]. To this end, some randomization must be introduced into the tree-induction process, and three ways to do so have been used [Levatić, 2017].
Bagging. When using this ensemble method, instead of growing the trees using D_TRAIN, a bootstrap replicate of D_TRAIN is independently created for each tree and used for tree induction.
Random Forests (RFs). In addition to the mechanism of Bagging, for each internal node of a given tree, only a random subset (of size D' < D) of all features is considered when searching for the best test, e.g., D' = ⌈√D⌉.
Extremely Randomized PCTs (ETs). As in Random Forests, a subset of features can be considered in every internal node (though this is not a necessity), but, additionally, only one test per feature is randomly chosen and evaluated. In contrast to Random Forests (and Bagging), the authors of the original ETs did not use bootstrapping [Geurts et al., 2006]. However, previous experiments [Petković et al., 2019] showed that it is beneficial to do so when the features are (mostly) binary, since otherwise ETs can offer only one possible split per feature and choosing one at random has no effect.
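The three randomization mechanisms can be summarized with the following schematic sketch; the function names and the uniform threshold sampling for ETs are simplifications on our part, not the exact Clus implementation.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_replicate(n_examples):
    # Bagging: each tree is grown on a bootstrap replicate of D_TRAIN (sampling with repetition).
    return rng.integers(0, n_examples, size=n_examples)

def candidate_features(n_features, method):
    # Random Forests (and optionally ETs) consider only a random feature subset in each node.
    if method == "bagging":
        return np.arange(n_features)                       # all D features are considered
    size = int(np.ceil(np.sqrt(n_features)))               # e.g., D' = ceil(sqrt(D))
    return rng.choice(n_features, size=size, replace=False)

def candidate_thresholds(column, method):
    # ETs evaluate a single, randomly drawn split per feature; Bagging and RFs try all of them.
    values = np.unique(column)
    if method == "extra_trees":
        return [rng.uniform(values.min(), values.max())]
    return [(a + b) / 2 for a, b in zip(values[:-1], values[1:])]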

Ensemble-Based Feature Ranking
All three proposed importance scores can be computed from a single PCT, but to stabilize the scores, they are instead computed from an ensemble: since the trees are grown independently, the variance of each score importance(x_i) decreases linearly with the number of trees.
Once an ensemble (for a given predictive modeling task) is built, we come to the main focus of this work: computing a feature ranking out of it. There are three ways to do so: the Symbolic score [Petković et al., 2019], the Genie3 score [Petković et al., 2019] (its basic version, for standard classification and regression, was proposed in [Huynh-Thu et al., 2010]), and the Random Forest score [Petković et al., 2019] (its basic version was proposed in [Breiman, 2001]):

importance_SYMB(x_i) = (1/|E|) Σ_{T∈E} Σ_{N∈T(x_i)} |E(N)| / |D_TRAIN|,    (2)
importance_Genie3(x_i) = (1/|E|) Σ_{T∈E} Σ_{N∈T(x_i)} h*(N),    (3)
importance_RF(x_i) = (1/|E|) Σ_{T∈E} (e(OOB_T^i) − e(OOB_T)) / e(OOB_T).    (4)

Here, E is an ensemble of trees T, T(x_i) is the set of internal nodes N of a tree T where the feature x_i appears in the test, E(N) ⊆ D_TRAIN is the set of examples that reach the node N, h*(N) is the heuristic value of the chosen test, and e(OOB_T) is the value of the error measure e when using T as a predictive model for the set OOB_T of out-of-bag examples of the tree T, i.e., the examples that were not chosen into its bootstrap replicate and were thus not seen during the induction of T. Similarly, e(OOB_T^i) is the value of the error measure e on OOB_T with randomly permuted values of the feature x_i.
Thus, the Symbolic and Genie3 rankings take into account node statistics: the Symbolic score's award is proportional to the number of examples that reach the node, while Genie3 is more sophisticated and also takes into account the heuristic value of the test (which is proportional to |E(N)|, see Alg. 2, line 4).
The Random Forest score, on the other hand, measures to what extent noising, i.e., permuting, the feature values decreases the predictive performance of the tree. In Eq. (4), it is assumed that e is a loss, i.e., lower is better, as is the case, for example, in regression problems where (relative root) mean squared errors are used. Otherwise, e.g., for classification tasks and the F_1 measure, the importance of a feature is defined as −importance_RF from Eq. (4). Originally, this score was designed to explain the predictions of RF ensembles [Breiman, 2001] (hence the name), but it can be used with any predictive model. However, trees are especially appropriate, because their predictions can be obtained quickly, provided the trees are balanced.
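The following sketch shows how the three scores could be accumulated, assuming, purely for illustration, that each tree is summarized by (feature index, number of examples in the node, heuristic value) triples for its internal nodes, and that the trees expose a predict method together with their out-of-bag examples; this is not the actual Clus code.

import numpy as np

def symbolic_and_genie3(ensemble, n_features, n_train):
    # Eqs. (2) and (3): for every internal node that tests feature i, Symbolic adds the
    # fraction of training examples reaching the node, Genie3 adds the heuristic value
    # of the chosen test; both sums are averaged over the trees of the ensemble.
    symb, genie = np.zeros(n_features), np.zeros(n_features)
    for tree in ensemble:                      # tree = [(feature, n_in_node, h_star), ...]
        for feat, n_node, h_star in tree:
            symb[feat] += n_node / n_train
            genie[feat] += h_star
    return symb / len(ensemble), genie / len(ensemble)

def random_forest_score(trees, oob_sets, error, n_features, rng):
    # Eq. (4): the relative increase of the out-of-bag error after permuting the values
    # of feature i, averaged over the trees; `error` is a loss (lower is better) and is
    # assumed to be nonzero on the unpermuted out-of-bag data.
    imp = np.zeros(n_features)
    for tree, (X_oob, y_oob) in zip(trees, oob_sets):
        base = error(y_oob, tree.predict(X_oob))
        for i in range(n_features):
            X_perm = X_oob.copy()
            X_perm[:, i] = rng.permutation(X_perm[:, i])
            imp[i] += (error(y_oob, tree.predict(X_perm)) - base) / base
    return imp / len(trees)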

Ensemble-based ranking for SSL structured output prediction
The PCT ensemble-based feature ranking methods for different structured output prediction (SOP) tasks were introduced in [Petković et al., 2019, Petković et al., 2020] and evaluated for different SL SOP tasks. In that case, the PCTs use a heuristic based on the impurity reduction on the target space, as defined by Eq. (1) in the special case w = 1. For SSL, the general case of Eq. (1) applies. Once we have SSL PCTs, the ensemble-based feature ranking methods technically work out of the box. They have been evaluated in the case of SSL classification; however, they have not been evaluated on SSL STR and SOP tasks.

Does the ensemble method matter?
From Eqs. (2)-(4), it is evident that all three feature ranking scores can, in theory, be computed from a single tree, and averaging them over the trees in the ensemble only gives a more stable estimate of E[importance(x_i)]. However, one might expect that Bagging, RFs and ETs on average yield the same order of features (or even the same importance values), since the latter two are more randomized versions of the Bagging method. Here, we sketch a proof that this is not (necessarily) the case.
One of the cases when the expected orders of features are equal is a dataset where each of the two binary features x_1 and x_2 completely determines the target y, e.g., y = x_1 and y = 1 − x_2, while a third feature x_3 is effectively random noise. It is clear that the expected values of the importances in all cases satisfy importance(x_1) = importance(x_2) > importance(x_3).
One of the cases where Bagging gives rankings different from those of RFs is a dataset where knowing the values of the pairs (x_1, x_2) and (x_3, x_i), for 4 ≤ i ≤ D, completely determines the target value y, and h(x_1) > h(x_i) > max{h(x_2), h(x_3)}, for i ≥ 4. In this case, Bagging will first choose x_1 and then x_2 in the remaining two internal nodes of the tree, so x_1 and x_2 would be the most important features. On the other hand, RFs with D' = 1 and D sufficiently large will in the majority of cases first choose one of the features x_i, i ≥ 4, and then, sooner or later, x_3. Unlike in the Bagging-based ranking, x_3 is now more important than x_1.

Time complexity
In predictive clustering, the attributes in the data belong to three (not mutually exclusive) categories: i) descriptive attributes are those that can appear in the tests of internal tree nodes, ii) target attributes are those for which predictions in the leaf nodes are made, and iii) clustering attributes are those that are used in computing the heuristic when evaluating the candidate tests. Let their numbers be D, T and C, respectively, and let M be the number of examples in D_TRAIN. Note that in the SSL scenario (if w ∉ {0, 1}), we have the relation C = D + T. Assuming that the trees are balanced, growing a single semi-supervised tree takes O(M D log M (log M + C)) [Levatić, 2017].
After growing a tree, the Symbolic and Genie3 ranking scores are updated in time linear in the number of internal nodes of the tree, whereas updating the Random Forest scores takes O(D M log M). Thus, computing the feature ranking scores does not change the O-complexity of growing a tree, and we can compute all the rankings from a single ensemble. Growing an ensemble E and computing the rankings therefore takes O(|E| M D log M (log M + C)).

Relief-based Feature Ranking
The Relief family of feature ranking algorithms does not use any predictive model. Its members can handle various predictive modeling tasks, including classification [Kira and Rendell, 1992], regression [Kononenko and Robnik-Šikonja, 2003], MTR [Petković et al., 2019], MLC [Petković et al., 2018, Reyes et al., 2015], and HMLC [Petković et al., 2020]. The main intuition behind Relief is the following: the feature x_i is relevant if the differences in the target space between two neighboring examples are notable if and only if the differences in their values of x_i are notable.

Supervised Relief
More precisely, if r = (x^1, y^1) ∈ D_TRAIN is randomly chosen and n = (x^2, y^2) is one of its k nearest neighbors, then the computed importances importance_Relief(x_i) of the Relief algorithms equal the estimated value of

importance_Relief(x_i) = P(x_i^1 and x_i^2 differ | y^1 and y^2 differ) − P(x_i^1 and x_i^2 differ | y^1 and y^2 are equal) = P_1 − P_2,    (5)

where the probabilities are modeled by the distances between r and n in the appropriate subspaces. For the descriptive space X, spanned by the domains X_i of the features x_i, we have

d_X(x^1, x^2) = (1/D) Σ_{i=1}^{D} d_i(x^1, x^2), where d_i(x^1, x^2) = 1[x_i^1 ≠ x_i^2] for nominal features and d_i(x^1, x^2) = |x_i^1 − x_i^2| / (max_i − min_i) for numeric ones,    (6)

where 1 denotes the indicator function and max_i, min_i are the extreme values of the feature x_i on D_TRAIN. The definition of the target-space distance d_Y depends on the target domain. In the cases of classification and STR, the categorical and the numeric part of the definition of d_i in Eq. (6) apply, respectively. Similarly, in multi-target regression, d_Y is the analogue of d_X above.
In the cases of MLC and HMLC, we have more than one option for the target distance definition [Petković et al., 2018], but in order to be as consistent as possible with the STR and MTR cases, we use the Hamming distance between the two sets. Recalling that sets S ⊆ L are represented as 0/1 vectors s (Sec. 2), the Hamming distance d_Y is defined as

d_Y(s^1, s^2) = γ Σ_{j=1}^{|L|} α_j |s_j^1 − s_j^2|,    (7)

where the weights α_j are based on the hierarchy and are defined as in Eq. (1), and γ is the normalization factor that assures that d_Y maps to [0, 1]. It equals 1/|L| in the MLC case, and depends on the data in the HMLC case [Petković et al., 2020].
To estimate the conditional probabilities P_1 and P_2 from Eq. (5), they are first expressed in unconditional form, e.g., P_1 = P(x_i^1 ≠ x_i^2 ∧ y^1 ≠ y^2) / P(y^1 ≠ y^2). Then, the numerator is modeled as the product d_i d_Y, whereas the denominator is modeled as d_Y. The probability P_2 is estimated analogously.
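To make the estimation procedure concrete, the following is a compact sketch of the supervised Relief score for STR with numeric features only; the variable names and the restriction to numeric features are our simplifications, and the accumulators directly mirror the P_1/P_2 modelling described above (a non-constant target is assumed, so that the denominators are nonzero).

import numpy as np

def d_numeric(a, b, lo, hi):
    # Per-variable distance, normalized to [0, 1].
    return abs(a - b) / (hi - lo) if hi > lo else 0.0

def relief_str(X, y, m, k, rng):
    # importance(x_i) is estimated as P_1 - P_2 (Eq. (5)); both probabilities are modelled
    # with the distances d_i and d_Y accumulated over m random examples and their k neighbors.
    n, D = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    y_lo, y_hi = y.min(), y.max()
    n_dY = 0.0                                  # accumulated target distances
    n_dA = np.zeros(D)                          # accumulated feature distances
    n_dYdA = np.zeros(D)                        # accumulated products d_Y * d_i
    for _ in range(m):
        r = rng.integers(n)
        d_descr = np.array([sum(d_numeric(X[r, i], X[j, i], lo[i], hi[i])
                                for i in range(D)) / D for j in range(n)])
        d_descr[r] = np.inf                     # do not pick r as its own neighbor
        for nb in np.argsort(d_descr)[:k]:
            dY = d_numeric(y[r], y[nb], y_lo, y_hi)
            n_dY += dY
            for i in range(D):
                di = d_numeric(X[r, i], X[nb, i], lo[i], hi[i])
                n_dA[i] += di
                n_dYdA[i] += dY * di
    return n_dYdA / n_dY - (n_dA - n_dYdA) / (m * k - n_dY)   # P_1 - P_2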

Semi-supervised Relief
In the SSL versions of the above tasks, we resort to the predictive clustering paradigm, using descriptive and clustering attributes instead of descriptive and target ones. More precisely, the descriptive distance is defined as above. As for the clustering distance, it equals d_Y when the target values of both y^1 and y^2 are known, and equals d_X otherwise. The contribution of each pair to the estimates of the probabilities is weighted according to its distance to the labeled data. The exact description of the algorithm is given in Alg. 3.
[Alg. 3: SSL-Relief pseudocode — initialize the importance estimates and the probability accumulators P_diffAttr,diffCluster, P_diffAttr and P_diffCluster to zero; compute the instance influences w = computeInstanceInfluence(D_TRAIN, w_0, w_1); then, for m iterations, pick a random example r, find its k nearest neighbors, and update the accumulators for every neighbor n, weighting each r-n pair by its influence and accumulating the sum of weights s used for normalization.]

SSL-Relief takes as input the standard parameters (D_TRAIN, the number of iterations m, and the number of Relief neighbors k), as well as the interval [w_0, w_1] ⊆ [0, 1] from which the influence levels of the r-n pairs are computed (line 4): first, for every (x, y) ∈ D_TRAIN, we find the distance d_x to its nearest labeled neighbor. If d_x = 0, i.e., the value y is known, the influence w of this example is set to 1. Otherwise, the influence of the example is defined by a linear function d ↦ w(d) that goes through the points (max_(x,?) d_x, w_0) and (min_(x,?) d_x, w_1). Thus, when no target values are missing, the standard supervised version of Relief is obtained.
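A possible implementation of the influence computation (line 4 of Alg. 3) is sketched below; the Euclidean distance and the function name compute_instance_influence are our assumptions, chosen only to illustrate the linear interpolation between w_1 (closest to the labeled data) and w_0 (farthest from it).

import numpy as np

def compute_instance_influence(X, labeled_mask, w0, w1):
    # Labeled examples get influence 1; for an unlabeled example, the influence interpolates
    # linearly between w1 (smallest distance to a labeled example) and w0 (largest distance),
    # based on the distance to its nearest labeled neighbor.
    X_lab = X[labeled_mask]
    d = np.array([np.min(np.linalg.norm(X_lab - x, axis=1)) for x in X])
    w = np.ones(len(X))
    unl = ~labeled_mask
    if unl.any():
        d_min, d_max = d[unl].min(), d[unl].max()
        if d_max > d_min:
            w[unl] = w1 + (d[unl] - d_min) * (w0 - w1) / (d_max - d_min)
        else:
            w[unl] = w1
    return w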

Time complexity
For technical reasons, the actual implementation of SSL-Relief does not follow Alg. 3 word for word; it first computes all the nearest neighbors. This takes O(mMD) steps, since the majority of the steps in this stage are needed for computing the distances in the descriptive space. We use the brute-force method, because, for the data at hand, it is still more efficient than, for example, k-d trees. Since the number of iterations is typically set to be a proportion of M (in our case, m = M), the number of steps is O(M^2 D). When computing the instances' influence (line 4), only the nearest neighbor of every instance is needed, so this can be done after the k nearest neighbors are computed, within a negligible number of steps.
In the second stage, the probability estimates are computed. The worst-case time complexity is achieved when all examples are labeled, since this is the case when we additionally have to compute d_Y (otherwise, we reuse the stored distances d_X). The number of steps needed for a single computation of d_Y depends on the domain: O(1) suffices for classification and STR, whereas O(T) steps are required in the MTR, MLC and HMLC cases.
The estimate updates themselves take O(D) steps per neighbor; thus, the worst-case time complexity is O(mMD + mkC) = O(M^2 D + MkC) (for m = M), where C = D + T is (again) the number of clustering attributes.

Experimental Setup
In this section, we undertake to experimentally evaluate the proposed feature ranking methods. We do so by answering a set of experimental questions listed below. We then describe in detail how the experimental evaluation is carried out.

Experimental questions
The evaluation is based on the following experimental questions:
1. For a given ensemble-based feature ranking score, which ensemble method is the most appropriate?
2. Are there any qualitative differences between the semi-supervised and supervised feature rankings?
3. Can the use of unlabeled data improve feature ranking?
4. Which feature ranking algorithm performs best?

Datasets
All datasets are well-known benchmark problems that come from different domains. For classification, we have included five new datasets (those below the dividing line in Tab. 1), in addition to the previously used ones [Petković et al., 2019].
Since MLC can be seen as a special case of HMLC with a trivial hierarchy, we show the basic characteristics of the considered MLC and HMLC problems in a single table (Tab. 2), separating the MLC and HMLC datasets by a line.
Similarly, the regression problems (for STR and MTR) are shown in Tab. 3. The given characteristics of the data differ from task to task, but the last column of every table (CH) always gives an estimate of how well the clustering hypothesis (Asm. 1) holds. For all predictive modeling tasks, this estimate is based on k-means clustering [Arthur and Vassilvitskii, 2007] or, more precisely, on the agreement between the clustering and the distribution of the target values in the clusters. The number of clusters was set to the number of classes in the case of classification, and to 8 otherwise, i.e., the default Scikit Learn [Pedregosa et al., 2011] parameters were kept. The highest agreement over five runs of the method is reported.
CH computation. In the case of classification, the measure at hand is the Adjusted Rand Index (ARI) [Hubert and Arabie, 1985], which we have already used earlier [Petković et al., 2019]. It computes the agreement between the clusters to which the examples are assigned by the clustering and the actual class values. The optimal value of ARI is 1, whereas the value 0 corresponds to the case when the clustering is independent of the class distribution.
In the other cases, we compute the variance of each target variable, i.e., of an actual target in the STR and MTR cases, and of a component of the 0/1 vector by which a label set is represented in the cases of MLC and HMLC. Let C be the set of the obtained clusters, i.e., c ⊆ D for each cluster c ∈ C. Then, for every target variable y_j, we compute v_j = Σ_{c∈C} p(c) Var(c, y_j) / Var(D, y_j), i.e., the relative amount of variance remaining after the clustering is applied, where p(c) = |c| / |D|. It can be proved (using the standard formula for the estimation of sample variance and some algebraic manipulation) that v_j ≤ 1. Trivially, v_j ≥ 0. We average the contributions v_j over the target variables to obtain the score v. In the case of HMLC, we use a weighted average where the weights are proportional to the hierarchical weights α_j defined in Sec. 3.1. Finally, the tables report the values CH = 1 − v ∈ [0, 1], so that the value 1 is optimal.
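A simplified sketch of the CH computation (with equal target weights; the hierarchical weights α_j would be plugged in for HMLC) could look as follows, using Scikit Learn for k-means and the ARI:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def ch_classification(X, y, n_runs=5, seed=0):
    # CH for classification: the best (over n_runs runs) ARI between the k-means clusters
    # (k = number of classes) and the actual class values.
    k = len(np.unique(y))
    return max(adjusted_rand_score(y, KMeans(n_clusters=k, random_state=seed + r).fit_predict(X))
               for r in range(n_runs))

def ch_variance_based(X, Y, n_clusters=8, seed=0):
    # CH for STR/MTR/(H)MLC: 1 - v, where v averages the relative within-cluster variance
    # of the target variables (columns of Y, e.g., the 0/1 label indicators for (H)MLC).
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    v = []
    for j in range(Y.shape[1]):
        total = np.var(Y[:, j])
        within = sum((np.sum(labels == c) / len(Y)) * np.var(Y[labels == c, j])
                     for c in np.unique(labels))
        v.append(within / total if total > 0 else 0.0)
    return 1.0 - float(np.mean(v))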

Parameter instantiation
We parametrize the used methods as follows. The number of trees in the ensembles was set to 100 [Kocev et al., 2013]. The number of features considered in each internal node was set to ⌈√D⌉ for RFs and to D for ETs [Geurts et al., 2006]. The optimal value of the level-of-supervision parameter w for computing the ensembles of PCTs was selected by internal 4-fold cross-validation. The considered values were w ∈ {0, 0.1, 0.2, . . . , 0.9, 1}.
The amount of supervision in SSL-Relief is adaptive, which allows for a coarser set of values: we consider w_0, w_1 ∈ {0, 0.25, 0.5, 0.75, 1} (where w_0 ≤ w_1). The considered numbers k of Relief neighbors were k ∈ {15, 20, 30}, and the best hyper-parameter setting (the values of w_0, w_1, and k) was again chosen via internal 4-fold cross-validation. Since more is better as far as the number of iterations m in Relief is concerned, this parameter was set to m = |D|.
The evaluation through kNN was chosen for three main reasons. First, it can be used for all the considered predictive modeling tasks. Second, it is a distance-based method and can hence easily make use of the information contained in the feature importances in the learning phase. Third, kNN is simple: its only parameter is the number of neighbors. In the prediction stage, the neighbors' contributions to the predicted value are weighted equally, so we do not introduce additional parameters that would influence the performance.
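A sketch of how a feature ranking can be plugged into kNN is given below; using the importances as per-dimension distance weights and clipping negative scores to zero are our simplifying assumptions, not necessarily the exact scheme used in the experiments.

import numpy as np

def knn_predict(X_train, y_train, X_test, importances, k=20):
    # Distances are weighted per dimension by the feature importances; the prediction is
    # the plain mean of the k nearest neighbors' targets (majority vote for classification),
    # so no additional parameters are introduced.
    w = np.clip(np.asarray(importances, dtype=float), 0.0, None)
    preds = []
    for x in X_test:
        d = np.sqrt((((X_train - x) ** 2) * w).sum(axis=1))
        nearest = np.argsort(d)[:k]
        preds.append(y_train[nearest].mean(axis=0))
    return np.array(preds)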

Evaluation measures
To assess the predictive performance of a kNN model, the following evaluation measures are used: F_1 for classification (macro-averaged for multi-class problems), Root Relative Squared Error (RRMSE) for STR and MTR, and the area under the average precision-recall curve (AU PRC) for MLC and HMLC. Their definitions are given in Tab. 4. In the cross-validation setting, we average the scores over the folds (taking test set sizes into account). For each ranking and dataset, we construct a curve that consists of points (L, performance_L). The comparison of two methods is then based either i) on these curves directly (see Fig. 3), or ii) on the areas under the computed curves.

We observe that for both regression tasks (STR and MTR), RF ensembles almost never perform best (with the exception of the Genie3 SSL-rankings), whereas for the other three classification-like tasks, they quite consistently outperform the other two ensemble methods. The differences among the average ranks are typically not considerable (with the exception of most of the MLC rankings and the supervised MTR rankings), which is probably due to the fact that the split selection mechanisms of the considered ensemble methods are still quite similar, and the trees are fully grown, so sooner or later a relevant feature appears in a node. In the case of ties, we choose the more efficient method (see Tab. 6): RFs are always the most efficient, whereas the second place is determined by the number of possible splits per feature. For lower values (e.g., when most of the features are binary, as is the case in the MLC and HMLC data), Bagging is faster than ETs. To make the later graphs more readable, we plot, for every score, only the curve that corresponds to the most suitable ensemble method for that score.
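The curve-based comparison of two rankings described above can be illustrated with the small sketch below; the trapezoidal aggregation and the example numbers are only illustrative (for RRMSE, where lower is better, the sign of the difference is flipped).

def area_under_performance_curve(labeled_sizes, performances):
    # Area under the (L, performance_L) curve, computed with the trapezoidal rule.
    area = 0.0
    for (l1, p1), (l2, p2) in zip(zip(labeled_sizes, performances),
                                  zip(labeled_sizes[1:], performances[1:])):
        area += (l2 - l1) * (p1 + p2) / 2.0
    return area

# e.g., two rankings evaluated at the same amounts of labeled data L
L = [50, 100, 200, 350, 500]
delta = (area_under_performance_curve(L, [0.61, 0.66, 0.71, 0.74, 0.76])
         - area_under_performance_curve(L, [0.60, 0.64, 0.69, 0.73, 0.76]))
print(delta > 0)   # a positive difference means the first ranking performs better overall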

Qualitative difference between SSL and SL rankings
We first discuss the qualitative difference between the SSL-rankings and their supervised counterparts. In the process of obtaining a feature ranking, the SSL version of a ranking algorithm sees more examples than its supervised version, and it turns out that this is well reflected in the results. Fig. 3 shows the results for five datasets (one per task) and the performance of the rankings, as assessed by kNN models for k ∈ {20, 40}. These two values of k are used to show that SSL-rankings tend to capture a more global picture of the data, whereas the supervised ones reflect a more local one. This phenomenon is most visible in the two regression datasets. In the case of the treasury dataset, SSL-rankings perform worse than the supervised ones on the local scale for smaller numbers L of labeled examples (Fig. 3g), and are equal or better for L ≥ 200. However, on the global scale (Fig. 3h), the SSL-rankings are clear winners. A similar situation is observed for the other datasets in Fig. 3, and also in general. Tab. 7 reveals that for the vast majority of the rankings (and datasets), the SSL-rankings are more global. This proportion is the highest for the STR data (it even equals 100%), and is understandably the lowest for classification, where the datasets have the smallest number of examples on average. From the mainly positive numbers in the table, one can conclude that SSL-rankings successfully recognize the structure of the data, and outperform their supervised analogs even in most of the cases where the CH values are low, e.g., for the digits dataset in Fig. 3a, or, most notably, for pageblocs.
Continuing with the results for MLC (the upper part of Tab. 9), we first see that the CH values are rather low, since, in contrast to the ARI values used for classification, a correction for chance is not incorporated into these CH values. Exceptions are the genbase (see Figs. 3c and 3d) and scene datasets. For both datasets, the SSL versions of the rankings outperform their SL analogs. This also holds for the birds and emotions datasets, for all rankings, and additionally for the medical dataset in the case of Relief.
The bottom part of Tab. 9 gives the results for the HMLC datasets. One can notice that Asm. 1 is never satisfied (low CH values), and that the SSL scores mostly could not overcome this, with the exception of the Relief rankings on the ecogen dataset. However, inspecting the corresponding curves in detail (Fig. 3f) reveals that the negative differences in the performance of SSL-rankings and SL-rankings are mostly due to the bad start of the SSL-rankings: for L ≥ 200, the SSL versions prevail.

We finish this section with the regression results. The upper part of Tab. 10 shows that when CH is well satisfied, i.e., for the datasets mortgage and treasury (see Fig. 3h), the SSL-rankings outperform the SL-rankings. Moreover, this also holds for the pol data (except for the Relief rankings). Inspecting the datasets where negative values are present (most notably the qsar dataset) reveals the same phenomenon as in the HMLC case: for extremely low values of L, e.g., L = 50, the SSL-rankings do not perform well, possibly because knowing the labels of 50 out of approximately 2000 examples simply does not suffice. With more and more labels known, the performance of the SSL-rankings drastically improves, while the performance of the SL-rankings stagnates. Finally, for L ≥ 200 or L ≥ 350, all SSL-rankings again outperform the SL ones.
Similar findings hold for the MTR data and the results in the bottom part of Tab. 10. The SSL-rankings perform well from the very beginning on the three datasets where CH holds the most, i.e., oes10 (see Fig. 3j), atp1d, and edm1, but can only catch up with the SL-rankings (and possibly outperform them) for larger values of L in the other cases.

Which SSL-ranking performs best?
To answer this question, we compare the predictive performances of the corresponding 40NN models and report their ranks in Tab. 11. The results reveal that, for the majority of the tasks, ensemble-based rankings perform best; however, in some cases, the winner is not clear, e.g., in the case of classification. Still, the Symbolic ranking quite clearly outperforms the others on both regression tasks, STR and MTR.
To complement this analysis, we also compute the average ranks of the algorithms for their induction times. As explained in Sec. 7.1, for the ensemble-

Fig. 2: Training and test set creation in SSL cross-validation: In the test fold, all examples keep their labels, whereas the folds that form the training set, together contain (approximately) L labeled examples.

Fig. 3: Comparison of the SL and SSL feature rankings, for different predictive modeling tasks.The curves for the SSL and the SL versions of a ranking are shown as a solid and a dashed line of the same color.The graphs in the left column use 20NN models in the evaluation, whereas those in the right, use 40NN models.

Table 1: Basic properties of the classification datasets: number of examples |D|, number of features D, number of classes (the y-domain size |Y|), the proportion of examples in the majority class (MC), and the CH value.

Table 2: Basic properties of the MLC (above the line) and HMLC (below the line) datasets: number of examples |D|, number of features D, number of labels |L|, label cardinality ℓ_c (average number of labels per example), the depth of the hierarchy, and the CH value.

Table 3: Basic properties of the STR and MTR datasets: number of examples |D|, number of features D, number of targets T, and the CH value.

Table 4: Evaluation measures for different predictive modeling tasks. The F_1 measure and AU PRC are defined in terms of precision p = tp/(tp + fp) and recall r = tp/(tp + fn), where tp, fp and fn denote the number of true positive, false positive and false negative examples, respectively.

Table 5: Average ranks of the considered SSL and SL ensembles, for a fixed ensemble-based score and predictive modeling task. The best ranks are shown in bold, unless all three methods perform equally well. In the case of ties, we bold the most efficient method (see Tab. 6).

Table 6: Average ranks of the ensemble methods, in terms of induction times.

Table 7: Proportions of the computed feature rankings whose SSL version captures more global properties of the data, as compared to its supervised version. The differences δ_20 and δ_40 of the areas under the performance curves of the 20NN and 40NN models are computed (always in such a way that δ > 0 means that the SSL version performs better). Therefore, if ∆ = δ_40 − δ_20 > 0, the SSL version of the ranking is more global, and it is more local if ∆ < 0.

Table 9: The differences ∆ of the areas under the curves of the AU PRC values of the 40NN models whose distance weights are based on the SSL-rankings and the SL-rankings.

Table 10: The differences ∆ of the areas under the curves of the RRMSE values of the 40NN models with distance weights based on the SSL-rankings and SL-rankings.

Table 11: The average ranks of different SSL-ranking algorithms, based on the performance of the corresponding 40NN models.

RFs are always preferable in terms of speed. They can still be outperformed by Relief if the number of features is higher and the number of examples is moderate, which follows directly from the O-values in Secs. 4.3 and 5.3. All these methods are implemented in the Clus system (Java), whereas our implementation of the Laplace score is, as mentioned before, Python-based (Scikit Learn and numpy). Thus, even though Laplace and Relief have the same core operations (finding nearest neighbors), using the highly optimized Scikit Learn methods (such as kNN) puts Laplace in first place, whereas Relief is (second but) last for STR problems.

Table 12: The average ranks of different SSL-ranking algorithms in terms of their induction times. Since the time complexity of computing the ensemble-based rankings (almost) equals the induction time of the ensembles, we report the latter. For each task, we show the ranks for both extreme values of L.