Multi-dimensional Classification via Selective Feature Augmentation

In multi-dimensional classification (MDC), the semantics of objects are characterized by multiple class spaces from different dimensions. Most MDC approaches try to explicitly model the dependencies among class spaces in the output space. In contrast, the recently proposed feature augmentation strategy, which aims at manipulating the feature space, has also been shown to be an effective solution for MDC. However, existing feature augmentation approaches only focus on designing holistic augmented features to be appended to the original features, while better generalization performance could be achieved by exploiting multiple kinds of augmented features. In this paper, we propose the selective feature augmentation strategy, which focuses on synergizing multiple kinds of augmented features. Specifically, by assuming that only part of the augmented features is pertinent and useful for each dimension's model induction, we derive a classification model which can fully utilize the original features while conducting feature selection over the augmented features. To validate the effectiveness of the proposed strategy, we generate three kinds of simple augmented features based on standard kNN, weighted kNN, and maximum margin techniques, respectively. Comparative studies show that the proposed strategy achieves superior performance against both state-of-the-art MDC approaches and its degenerated versions with each single kind of augmented features.


Introduction
Traditional supervised learning tasks usually characterize the semantics of objects with one output variable, i.e., single-output learning, among which multi-class classification is one of the most important learning frameworks. However, in some real-world applications, it is better to use multiple output variables to characterize the rich semantics of objects, which results in the problem of multi-output learning [1]. When the type of each output variable is restricted to be discrete-valued, the multi-dimensional classification (MDC) framework is obtained [2,3]. Under the MDC setting, each object is represented by a single instance while being associated with multiple class variables, each of which corresponds to a specific class space characterizing the object's semantics along one specific dimension. The MDC problem widely exists in many application scenarios, such as bioinformatics [4,5], text classification [6,7], computer vision [8-10], resource allocation [11], etc. Fig. 1 shows an illustrative example of MDC on vehicle classification.
Formally speaking, let $\mathcal{X} = \mathbb{R}^d$ be the $d$-dimensional feature space and $\mathcal{Y} = C_1 \times C_2 \times \cdots \times C_q$ be the output space. Here, $\mathcal{Y}$ corresponds to the Cartesian product of $q$ class spaces, where the $j$th class space $C_j$ consists of $K_j$ possible classes. Given a set of MDC training examples $\mathcal{D} = \{(\boldsymbol{x}_i, \boldsymbol{y}_i) \mid 1 \le i \le m\}$, where $\boldsymbol{x}_i \in \mathcal{X}$ is a $d$-dimensional feature vector and $\boldsymbol{y}_i = [y_{i1}, y_{i2}, \ldots, y_{iq}]^\top \in \mathcal{Y}$ is the $q$-dimensional class vector associated with $\boldsymbol{x}_i$ with each element $y_{ij} \in C_j$, the MDC task aims to learn a predictive model $f: \mathcal{X} \mapsto \mathcal{Y}$ from $\mathcal{D}$ which can return a proper class vector for an unseen instance $\boldsymbol{x}_*$. Obviously, the MDC problem can be solved dimension by dimension, i.e., training a multi-class classifier for each class space. However, this independent decomposition strategy does not consider potential dependencies among class spaces, which might impair the generalization performance of the resulting model. The MDC problem can also be solved by a single multi-class classifier, where each distinct class combination is regarded as a new class. However, this powerset-like strategy cannot consider class combinations not appearing in the training set and usually suffers from high computational complexity due to the possibly large number of new classes. In fact, one of the key challenges for MDC studies is how to model dependencies among class spaces in appropriate ways. Existing works mainly focus on modeling class dependencies in the output space, such as capturing pairwise class dependencies [12-14], specifying a chaining order over class spaces [15,16], learning a directed acyclic graph (DAG) structure over class spaces [17-19], and partitioning class spaces into groups [2].
Recently, the feature augmentation strategy, which aims at manipulating the feature space, has been shown to be an effective solution for MDC. This strategy enriches the original feature space with a set of new features generated by well-established techniques, e.g., kNN [20] or deep learning [21]. Existing works only focus on how to design more informative augmented features, while it might be beneficial to exploit multiple kinds of augmented features generated by different techniques. In this paper, we propose the selective feature augmentation strategy, which makes the first attempt towards synergizing multiple kinds of augmented features. The strategy is abbreviated as SFAM, i.e., Selective Feature Augmentation for Multi-dimensional classification, in the rest of this paper for brevity. Specifically, SFAM assumes that only part of the augmented features is pertinent and useful for each dimension's model induction. To validate the effectiveness of SFAM, three simple kinds of augmented features are generated by making use of standard kNN, weighted kNN, and maximum margin techniques, respectively. After that, for each dimension, SFAM derives a classification model which takes full advantage of the original features via $\ell_2$ regularization and conducts feature selection over the augmented features via $\ell_1$ regularization (i.e., selective feature augmentation). Experimental results demonstrate that SFAM achieves superior performance against both state-of-the-art MDC approaches and its degenerated versions with each single kind of augmented features.
The rest of this paper is organized as follows. Firstly, related works on multi-dimensional classification are briefly discussed. Secondly, technical details of SFAM are introduced. Thirdly, experimental results of comparative studies are reported. Finally, we conclude this paper.

Related Work
The learning framework most related to multi-dimensional classification is the widely studied multi-label classification (MLC) [22-24], which can be regarded as a special case of MDC where the class variable in each dimension is restricted to be binary-valued. However, MDC usually assumes heterogeneous class spaces which characterize the rich semantics of objects from different dimensions, while MLC usually assumes a homogeneous class space in which multiple concepts are relevant to the polysemous objects.
The MDC problem can be solved via the independent decomposition strategy, where a total of $q$ multi-class classifiers are learned independently, one per dimension. However, this intuitive strategy ignores possible dependencies among class spaces, and the induced model would be suboptimal. An improved strategy is to learn the $q$ multi-class classifiers in a chaining order, where the predictions of preceding classifiers are used as extra features by the subsequent ones [15,16] (see the sketch after this paragraph). However, the chaining order largely affects the generalization performance, while determining an optimal one is NP-hard. The MDC problem can also be solved via the powerset transformation strategy, where a single multi-class classifier is learned by regarding all distinct class combinations in the training set as new classes. However, this strategy cannot consider class combinations not appearing in the training set and usually suffers from high computational complexity due to the large number of new classes. An improved strategy is to partition the class spaces into groups according to conditional dependencies [2]. However, the combinatorial nature still exists, so these deficiencies cannot be fully addressed. A family of MDC models called multi-dimensional Bayesian network classifiers [25] aims at learning different kinds of DAG structures over class spaces to explicitly model the class dependencies. However, determining DAG structures is computationally demanding, and generally only nominal features can be handled. The class dependencies can also be modeled in a two-level strategy [12-14], where pairwise dependencies are captured in the first level and high-order dependencies are then captured in the second level based on the predictions of the first level. However, capturing pairwise dependencies requires $O(q^2)$ complexity, which is very time-consuming.
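To make the chaining strategy concrete, the following minimal sketch (hypothetical names, with scikit-learn's LogisticRegression standing in for an arbitrary multi-class base learner) trains the $q$ classifiers along one fixed chain order, feeding each classifier the classes of the preceding dimensions as extra features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_chain(X, Y):
    """X: (m, d) features; Y: (m, q) integer-coded class matrix, one column
    per dimension. Returns a list of per-dimension classifiers."""
    chain = []
    for j in range(Y.shape[1]):
        # At training time, the ground-truth classes of the preceding
        # dimensions serve as the extra features.
        X_ext = np.hstack([X, Y[:, :j]])
        chain.append(LogisticRegression(max_iter=1000).fit(X_ext, Y[:, j]))
    return chain

def predict_chain(chain, X):
    """At test time, the predictions of preceding classifiers take the
    place of the ground-truth classes."""
    preds = np.empty((X.shape[0], 0))
    for clf in chain:
        y_j = clf.predict(np.hstack([X, preds])).reshape(-1, 1)
        preds = np.hstack([preds, y_j])
    return preds.astype(int)  # (m, q) predicted class vectors
```

Note that the quality of the result depends on the (here arbitrary) chain order, which is exactly the deficiency discussed above.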
The aforementioned strategies mainly focus on directly modeling class dependencies in the output space, while the KRAM approach [20] instead manipulates the feature space of MDC examples via feature augmentation by making use of kNN techniques: helpful discriminative information is expected to be brought into the feature space, which facilitates the subsequent MDC model induction. Based on deep learning techniques, the LEFA approach [21] further generates better augmented features which depict the inter-class dependencies and the intra-class exclusiveness simultaneously. However, these approaches simply treat the original and augmented features equally, which might be less reasonable due to the different characteristics of different features. Moreover, it is usually easier to design multiple kinds of simple augmented features than a single sophisticated one, so it might be beneficial to synergize the discriminative information residing in different kinds of augmented features. Fig. 2 shows an intuitive comparison between existing feature augmentation techniques and the one proposed in this paper: existing works usually employ general MDC algorithms to accomplish the training phase, while the proposed one designs a novel training algorithm which accomplishes the selective feature augmentation.

Technical Details of SFAM
This section presents how we implement the selective feature augmentation strategy; in other words, the technical details in this section correspond to the Training Algorithm part in Fig. 2(b). For any instance $\boldsymbol{x}$, let $\boldsymbol{x}^{\mathrm{a}}$ be the concatenation of all the corresponding augmented features generated by the $N$ augmentation models, and let $\hat{\boldsymbol{x}}$ be the concatenation of $\boldsymbol{x}$ and $\boldsymbol{x}^{\mathrm{a}}$, i.e., $\hat{\boldsymbol{x}} = [\boldsymbol{x}; \boldsymbol{x}^{\mathrm{a}}]$. In the following, we derive a regularized classification model which fully utilizes the original features $\boldsymbol{x}$ while employing a feature selection mechanism over the augmented features $\boldsymbol{x}^{\mathrm{a}}$.

Predictive Model

For simplicity, we employ the one-vs-rest decomposition strategy for each dimension, where a classification model with both $\ell_2$ and $\ell_1$ regularization is derived to solve the decomposed binary classification problems. Specifically, for the $a$th decomposed binary classification problem in the $j$th dimension, we determine the optimal model as follows:

$$\min_{\boldsymbol{w}, \boldsymbol{v}, b}\ \mathcal{L}(\boldsymbol{w}, \boldsymbol{v}, b) + \lambda\, \Omega(\boldsymbol{w}, \boldsymbol{v}) \tag{1}$$

where $\lambda$ is a trade-off parameter. The first term denotes the empirical loss function. In this paper, we simply employ the cross-entropy loss, which is defined as follows:

$$\mathcal{L}(\boldsymbol{w}, \boldsymbol{v}, b) = -\sum_{i=1}^{m} \big( [\![y_{ij} = a]\!] \ln \sigma(z_i) + (1 - [\![y_{ij} = a]\!]) \ln (1 - \sigma(z_i)) \big)$$

Here, $z_i = \langle \boldsymbol{w}, \boldsymbol{x}_i \rangle + \langle \boldsymbol{v}, \boldsymbol{x}_i^{\mathrm{a}} \rangle + b$, the predicate $[\![\pi]\!]$ returns 1 if $\pi$ holds and 0 otherwise, and $\sigma(\cdot)$ is the logistic function, which is defined as follows:

$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

where $\langle \cdot, \cdot \rangle$ returns the inner product of two vectors. The second term denotes the regularization term, which is defined as follows:

$$\Omega(\boldsymbol{w}, \boldsymbol{v}) = \|\boldsymbol{w}\|_2^2 + \|\boldsymbol{v}\|_1$$

where $\boldsymbol{w}$ and $\boldsymbol{v}$ denote the first $d$ elements and the remaining elements of the whole weight vector $\boldsymbol{\theta}$ respectively, i.e., $\boldsymbol{\theta} = [\boldsymbol{w}; \boldsymbol{v}]$. It is worth noting that the $\ell_2$ regularization corresponds to the original features while the $\ell_1$ regularization corresponds to the augmented features. By doing this, we employ a feature selection mechanism over the augmented features to synergize multiple kinds of augmented features in a better way, while the original features are still fully utilized. In other words, this is where the selective feature augmentation strategy is implemented.
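As an illustration of problem (1), the following minimal NumPy sketch evaluates the objective of one decomposed binary problem; all names are hypothetical, and the binary targets y encode the predicate $[\![y_{ij} = a]\!]$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, v, b, X, X_aug, y, lam):
    """Objective of one decomposed binary problem, cf. Eq. (1).

    X: (m, d) original features; X_aug: (m, d_a) concatenated augmented
    features; y: (m,) binary targets; lam: trade-off parameter.
    """
    p = sigmoid(X @ w + X_aug @ v + b)
    eps = 1e-12  # numerical guard for the logarithms
    loss = -np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
    reg = np.sum(w ** 2) + np.sum(np.abs(v))  # l2 on w, l1 on v
    return loss + lam * reg
```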
To optimize problem (1), we solve for one of the three sets of parameters $\boldsymbol{w}$, $\boldsymbol{v}$ and $b$ in an alternating fashion, while the remaining parameters are fixed.
(a) Optimizing w.r.t. $\boldsymbol{w}$ when $\boldsymbol{v}$ and $b$ are fixed: When $\boldsymbol{v}$ and $b$ are fixed, the optimization problem (1) can be equivalently reformulated as follows:

$$\min_{\boldsymbol{w}}\ G(\boldsymbol{w}) = \mathcal{L}(\boldsymbol{w}) + \lambda \|\boldsymbol{w}\|_2^2 + c_1 \tag{2}$$

where $c_1$ is a constant which does not depend on $\boldsymbol{w}$. In this paper, we use gradient descent to solve the optimization problem (2). Specifically, the gradient of the objective function $G(\boldsymbol{w})$ is given as follows:

$$\nabla G(\boldsymbol{w}) = \sum_{i=1}^{m} \big( \sigma(z_i) - [\![y_{ij} = a]\!] \big)\, \boldsymbol{x}_i + 2\lambda \boldsymbol{w}$$

(b) Optimizing w.r.t. $\boldsymbol{v}$ when $\boldsymbol{w}$ and $b$ are fixed: When $\boldsymbol{w}$ and $b$ are fixed, the optimization problem (1) can be equivalently reformulated as follows:

$$\min_{\boldsymbol{v}}\ F(\boldsymbol{v}) = f(\boldsymbol{v}) + \lambda \|\boldsymbol{v}\|_1 + c_2 \tag{3}$$

where $f(\boldsymbol{v})$ denotes the (smooth) empirical loss as a function of $\boldsymbol{v}$ and $c_2$ is a constant which does not depend on $\boldsymbol{v}$. Since the $\ell_1$ term is non-smooth, we use the accelerated proximal gradient method [26] to solve it.
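Before turning to the proximal machinery required for $\boldsymbol{v}$, step (a) reduces to plain gradient descent; a one-step sketch (with a hypothetical step size eta) reads:

```python
import numpy as np

def grad_step_w(w, v, b, X, X_aug, y, lam, eta=1e-3):
    """One gradient-descent update on w with v and b fixed.

    X: (m, d) original features; X_aug: (m, d_a) augmented features;
    y: (m,) binary targets of the one-vs-rest subproblem.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w + X_aug @ v + b)))  # logistic outputs
    grad = X.T @ (p - y) + 2.0 * lam * w                # loss gradient + l2 term
    return w - eta * grad
```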

Theorem 1. For the differentiable function $f(\boldsymbol{v})$, its gradient $\nabla f(\boldsymbol{v})$ is Lipschitz continuous, and a Lipschitz constant is:

$$L = \frac{1}{4} \sum_{i=1}^{m} \|\boldsymbol{x}_i^{\mathrm{a}}\|_2^2 \tag{4}$$

where $\nabla$ denotes the differential operator.

Proof. The gradient $\nabla f(\boldsymbol{v})$ can be calculated as:

$$\nabla f(\boldsymbol{v}) = \sum_{i=1}^{m} \big( \sigma(z_i) - [\![y_{ij} = a]\!] \big)\, \boldsymbol{x}_i^{\mathrm{a}}$$

Given any $\boldsymbol{v}_1$ and $\boldsymbol{v}_2$, we have:

$$\|\nabla f(\boldsymbol{v}_1) - \nabla f(\boldsymbol{v}_2)\|_2 \le \sum_{i=1}^{m} \|\boldsymbol{x}_i^{\mathrm{a}}\|_2 \cdot |\sigma(z_i^1) - \sigma(z_i^2)| \le \frac{1}{4} \sum_{i=1}^{m} \|\boldsymbol{x}_i^{\mathrm{a}}\|_2 \cdot |z_i^1 - z_i^2| \le \frac{1}{4} \sum_{i=1}^{m} \|\boldsymbol{x}_i^{\mathrm{a}}\|_2^2 \cdot \|\boldsymbol{v}_1 - \boldsymbol{v}_2\|_2$$

Here, the second "$\le$" is due to the fact that the Lipschitz constant of the logistic function equals $\frac{1}{4}$. It then follows that $\nabla f$ is Lipschitz continuous with the constant $L$ in (4), which completes the proof. □

According to Theorem 1, given any point $\boldsymbol{v}^{(t)}$, the following inequality always holds:

$$f(\boldsymbol{v}) \le f(\boldsymbol{v}^{(t)}) + \langle \nabla f(\boldsymbol{v}^{(t)}), \boldsymbol{v} - \boldsymbol{v}^{(t)} \rangle + \frac{L}{2} \|\boldsymbol{v} - \boldsymbol{v}^{(t)}\|_2^2$$

Then, the quadratic approximation of $f(\boldsymbol{v})$ around $\boldsymbol{v}^{(t)}$ can be given as follows:

$$\hat{f}(\boldsymbol{v}; \boldsymbol{v}^{(t)}) = \frac{L}{2} \left\| \boldsymbol{v} - \left( \boldsymbol{v}^{(t)} - \frac{1}{L} \nabla f(\boldsymbol{v}^{(t)}) \right) \right\|_2^2 + c_3 \tag{5}$$

where $c_3$ is a constant which does not depend on $\boldsymbol{v}$. According to the descent lemma [27], this approximation is an upper bound of the original function, i.e., $f(\boldsymbol{v}) \le \hat{f}(\boldsymbol{v}; \boldsymbol{v}^{(t)})$ always holds. Therefore, we can minimize the original function by iteratively minimizing the approximation. Plugging the above approximation into the optimization problem (3), we obtain the following iterative equation:

$$\boldsymbol{v}^{(t+1)} = \mathcal{S}_{\lambda/L} \left( \boldsymbol{v}^{(t)} - \frac{1}{L} \nabla f(\boldsymbol{v}^{(t)}) \right) \tag{6}$$

Here, $\mathcal{S}_{\tau}(\cdot)$ is the (element-wise) soft-thresholding function, which is defined as follows:

$$[\mathcal{S}_{\tau}(\boldsymbol{u})]_r = \mathrm{sign}(u_r) \cdot \max(|u_r| - \tau,\ 0)$$

In [28], it is shown that the convergence rate of the iterative equation (6) can be improved from $O(1/t)$ to $O(1/t^2)$ if we replace $\boldsymbol{v}^{(t)}$ in (5) with the following $\boldsymbol{u}^{(t)}$:

$$\boldsymbol{u}^{(t)} = \boldsymbol{v}^{(t)} + \frac{\alpha_{t-1} - 1}{\alpha_t} \left( \boldsymbol{v}^{(t)} - \boldsymbol{v}^{(t-1)} \right) \tag{7}$$

where $\alpha_t = \frac{1 + \sqrt{1 + 4\alpha_{t-1}^2}}{2}$ and $\alpha_t = 1$ when $t = 0$.
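Putting (4)-(7) together, the update of $\boldsymbol{v}$ can be sketched as follows (a minimal FISTA-style loop under the notation above; all names are hypothetical and no convergence check is included for brevity):

```python
import numpy as np

def soft_threshold(u, tau):
    """Element-wise soft-thresholding operator S_tau(u)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def fista_v(w, b, X, X_aug, y, lam, n_iter=200):
    """Accelerated proximal gradient updates for v with w and b fixed."""
    L = 0.25 * np.sum(X_aug ** 2)           # Lipschitz constant, cf. Eq. (4)
    v = v_prev = np.zeros(X_aug.shape[1])
    alpha_prev = 1.0
    base = X @ w + b                        # fixed contribution of w and b
    for _ in range(n_iter):
        alpha = (1.0 + np.sqrt(1.0 + 4.0 * alpha_prev ** 2)) / 2.0
        u = v + ((alpha_prev - 1.0) / alpha) * (v - v_prev)   # momentum, Eq. (7)
        p = 1.0 / (1.0 + np.exp(-(base + X_aug @ u)))          # logistic outputs
        grad = X_aug.T @ (p - y)                               # gradient of f at u
        v_prev, v = v, soft_threshold(u - grad / L, lam / L)   # update, Eq. (6)
        alpha_prev = alpha
    return v
```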
(c) Optimizing w.r.t. $b$ when $\boldsymbol{w}$ and $\boldsymbol{v}$ are fixed: When $\boldsymbol{w}$ and $\boldsymbol{v}$ are fixed, the optimization problem (1) can be equivalently reformulated as follows:

$$\min_{b}\ H(b) = \mathcal{L}(b) + c_4 \tag{8}$$

where $c_4$ is a constant which does not depend on $b$. In this paper, we use gradient descent to solve it. Specifically, the gradient of the objective function $H(b)$ is given as follows:

$$\frac{\partial H(b)}{\partial b} = \sum_{i=1}^{m} \big( \sigma(z_i) - [\![y_{ij} = a]\!] \big)$$

Once the above three alternating optimization steps converge, we obtain the optimal values of $\boldsymbol{w}$, $\boldsymbol{v}$ and $b$ for each decomposed binary problem. For an unseen instance $\boldsymbol{x}_*$, let $\boldsymbol{x}_*^{\mathrm{a}}$ be its augmented features; then its class label in the $j$th dimension can be determined based on the augmented instance $\hat{\boldsymbol{x}}_* = [\boldsymbol{x}_*; \boldsymbol{x}_*^{\mathrm{a}}]$ as follows:

$$y_{*j} = \arg\max_{a \in C_j}\ \sigma\big( \langle \boldsymbol{w}_{ja}, \boldsymbol{x}_* \rangle + \langle \boldsymbol{v}_{ja}, \boldsymbol{x}_*^{\mathrm{a}} \rangle + b_{ja} \big)$$

The complete procedure of the proposed SFAM approach is summarized in Algorithm 1. Firstly, SFAM transforms the original MDC training set $\mathcal{D}$ into $\hat{\mathcal{D}}$ by augmenting each instance's feature space (steps 1-7). After that, the predictive model is induced via a classification model with both $\ell_2$ and $\ell_1$ regularization (steps 8-25), where the $\ell_2$-regularized part and the bias term are updated via gradient descent and the $\ell_1$-regularized part is updated via the accelerated proximal gradient method. Finally, the class vector of an unseen instance is predicted based on its augmented features as well (steps 26-30). As shown in Algorithm 1, it is worth noting that SFAM should be regarded as a general framework which can be coupled with any kind of augmented features, while this paper only aims at investigating the feasibility of synergizing different kinds of augmented features.
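The prediction rule can be sketched as follows, assuming the trained one-vs-rest parameters are stored as models[j][a] = (w, v, b) for class a of dimension j (a hypothetical layout); since the logistic function is monotone, taking the argmax over the linear responses is equivalent to taking it over the sigmoid outputs:

```python
import numpy as np

def predict_dimension_wise(models, x, x_aug):
    """Return the predicted class index for each of the q dimensions."""
    y_pred = []
    for dim_models in models:                      # one list per dimension
        scores = [x @ w + x_aug @ v + b for (w, v, b) in dim_models]
        y_pred.append(int(np.argmax(scores)))      # winning one-vs-rest model
    return y_pred
```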

Experiments
This section reports comparative studies whose experimental results validate the superiority and effectiveness of SFAM. Firstly, Subsection 4.1 introduces the experimental setup, including the employed benchmark data sets, the evaluation metrics and the compared approaches. Then, Subsection 4.2 reports the detailed experimental results with statistical tests. Finally, Subsection 4.3 further investigates SFAM's algorithmic design and parameter sensitivity.

Experimental Setup

1) Benchmark Data Sets
In this paper, a total of ten benchmark data sets are collected for comparative studies, whose detailed characteristics are summarized in Table 1. Here, for the #Labels/Dim. column, if all dimensions contain the same number of class labels, only this number is recorded; otherwise, the number of class labels per dimension is recorded in turn. For the #Features column, n and x denote numeric and nominal features respectively.

2) Evaluation Metrics
In this paper, a total of three widely-used evaluation metrics are employed for performance evaluation, including Hamming Score (HS), Exact Match (EM) and Sub-Exact Match (SEM) [2,13,14,20,29]. Specifically, given the test set $\mathcal{S} = \{(\boldsymbol{x}_i, \boldsymbol{y}_i) \mid 1 \le i \le p\}$, for the MDC model $f$ to be evaluated, let $\hat{\boldsymbol{y}}_i = f(\boldsymbol{x}_i)$ be the predicted class vector for $\boldsymbol{x}_i$ while the ground-truth one is $\boldsymbol{y}_i$; then the number of correctly predicted dimensions corresponds to $r_i = \sum_{j=1}^{q} [\![\hat{y}_{ij} = y_{ij}]\!]$. The detailed definitions of the three metrics are given as follows:

$$\mathrm{HS} = \frac{1}{p} \sum_{i=1}^{p} \frac{r_i}{q}, \qquad \mathrm{EM} = \frac{1}{p} \sum_{i=1}^{p} [\![r_i = q]\!], \qquad \mathrm{SEM} = \frac{1}{p} \sum_{i=1}^{p} [\![r_i \ge q - 1]\!]$$

For the three metrics, the larger the values, the better the performance. Ten-fold cross-validation is conducted over each benchmark data set, where both the mean metric value and the standard deviation are recorded for comparative studies.
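For reference, the three metrics can be computed in a few lines; a sketch with Y and Y_hat holding the ground-truth and predicted class vectors as $p \times q$ integer arrays:

```python
import numpy as np

def mdc_metrics(Y, Y_hat):
    """Return (Hamming Score, Exact Match, Sub-Exact Match)."""
    r = np.sum(Y == Y_hat, axis=1)   # correctly predicted dims per example
    q = Y.shape[1]
    hs = np.mean(r / q)              # Hamming Score
    em = np.mean(r == q)             # Exact Match
    sem = np.mean(r >= q - 1)        # Sub-Exact Match
    return hs, em, sem
```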

3) Compared Approaches
In this paper, a total of six state-of-the-art MDC approaches are employed as compared approaches, including BR, CP, ECC [16], ESC [2], gMML [29] and SEEM [13]:

• BR trains one multi-class classifier per dimension independently.
• CP regards each distinct class combination in the training set as a new class and trains a single multi-class classifier.
• ECC learns an ensemble of classifier chains, where the q classifiers are trained along randomly specified chaining orders.
• ESC partitions the class spaces into groups according to conditional dependencies, where each super-class is used as a compound class variable.
• gMML works by learning a regression model for each class label as well as a Mahalanobis metric which can shorten the distance between the regression outputs and the ground-truth label vector.
• SEEM models the class dependencies via a two-level strategy, where the pairwise and high-order class dependencies are modeled in the first and second level respectively.

For BR, CP, ECC, ESC and SEEM, the multi-class base learner is implemented via LIBLINEAR [30] with the parameter setting "L2-regularized logistic regression (primal)" for fair comparison. Following [2], for the ensemble approaches ECC and ESC, a total of 10 base models are trained over 67% of the examples randomly selected from the training set, and the predictive results are combined via majority voting. For gMML and SEEM, the recommended parameters are used according to the respective literature.
To validate the effectiveness of the selective feature augmentation strategy, for the proposed SFAM approach, a total of three simple kinds of augmented features are generated: two of them are generated by making use of standard and weighted kNN techniques respectively, and the remaining one is generated by making use of maximum margin techniques. Specifically, the two kinds of kNN-based augmented features are generated as in KRAM [20]. To be more specific, for each instance $\boldsymbol{x}$, let $N(\boldsymbol{x})$ be the set of indices of the $k$ nearest neighbors of $\boldsymbol{x}$ identified in the training set $\mathcal{D}$. For the $j$th class space, the following discrete counting statistics can be defined:

$$\delta_j^a = \sum_{r \in N(\boldsymbol{x})} [\![y_{rj} = a]\!] \quad (1 \le a \le K_j)$$

where $y_{rj}$ denotes the class of the $r$th neighboring MDC example in the $j$th dimension, i.e., $\delta_j^a$ counts how many of the $k$ nearest neighbors of $\boldsymbol{x}$ take the $a$th class in the $j$th dimension. By concatenating all the counting statistics vectors $\boldsymbol{\delta}_j = [\delta_j^1, \ldots, \delta_j^{K_j}]^\top$, the first kNN-augmented feature vector based on standard kNN techniques for $\boldsymbol{x}$ can be obtained:

$$\boldsymbol{x}^{\mathrm{a}1} = [\boldsymbol{\delta}_1; \boldsymbol{\delta}_2; \ldots; \boldsymbol{\delta}_q] \tag{9}$$

The second kNN-augmented feature vector replaces the unit votes with neighbor-specific weights: the $r$th nearest neighbor contributes a weight $w_r$ which decreases with its distance to $\boldsymbol{x}$, and a bias vector controlled by two hyper-parameters (set to 0.5 and 0 respectively) is further added. Concatenating the resulting weighted counting statistics analogously yields the second kNN-augmented feature vector based on weighted kNN techniques:

$$\boldsymbol{x}^{\mathrm{a}2} = [\tilde{\boldsymbol{\delta}}_1; \tilde{\boldsymbol{\delta}}_2; \ldots; \tilde{\boldsymbol{\delta}}_q] \tag{10}$$

For the maximum margin-augmented features, SFAM employs the real-valued predictions returned by a multi-class support vector machine. Specifically, SFAM solves the following maximum margin formulation [31] for the $j$th dimension ($1 \le j \le q$):

$$\min_{\mathbf{W}_j, \boldsymbol{\xi}}\ \frac{1}{2} \|\mathbf{W}_j\|_F^2 + C \sum_{i=1}^{m} \xi_i \quad \mathrm{s.t.}\ \langle \boldsymbol{w}_{y_{ij}}, \boldsymbol{x}_i \rangle - \langle \boldsymbol{w}_{a}, \boldsymbol{x}_i \rangle \ge e_i^a - \xi_i \ (\forall i,\ \forall a \in C_j)$$

where $\mathbf{W}_j = [\boldsymbol{w}_1, \ldots, \boldsymbol{w}_{K_j}]$ is the weight matrix to be determined and $\boldsymbol{\xi} = [\xi_1, \ldots, \xi_m]^\top$ is the slack variable vector. Here, $e_i^a = 1$ if $a \ne y_{ij}$ and 0 otherwise, and $C$ is a trade-off parameter. The maximum margin-augmented feature vector for each instance $\boldsymbol{x}$ is then defined by concatenating the real-valued outputs over all dimensions:

$$\boldsymbol{x}^{\mathrm{a}3} = [\mathbf{W}_1^\top \boldsymbol{x}; \mathbf{W}_2^\top \boldsymbol{x}; \ldots; \mathbf{W}_q^\top \boldsymbol{x}] \tag{11}$$

Then, the MDC training set can be transformed into:

$$\hat{\mathcal{D}} = \{ (\hat{\boldsymbol{x}}_i, \boldsymbol{y}_i) \mid \hat{\boldsymbol{x}}_i = [\boldsymbol{x}_i; \boldsymbol{x}_i^{\mathrm{a}}],\ 1 \le i \le m \}$$

where $\boldsymbol{x}_i^{\mathrm{a}} = [\boldsymbol{x}_i^{\mathrm{a}1}; \boldsymbol{x}_i^{\mathrm{a}2}; \boldsymbol{x}_i^{\mathrm{a}3}]$ denotes the augmented features of $\boldsymbol{x}_i$. Here, we reiterate that we make use of standard kNN, weighted kNN, and maximum margin techniques to generate the augmented features only for the purpose of simplicity; the experiments in this paper mainly aim at validating the effectiveness of the selective feature augmentation strategy. In the future, it would be interesting to further investigate synergizing multiple kinds of augmented features generated by more advanced techniques such as deep learning [21].

Table 2 Experimental results (mean±std. deviation) of each MDC approach. In addition, the performance rank on each data set is also shown in parentheses.
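To illustrate the counting statistics behind (9), the following sketch computes the discrete kNN-augmented features (hypothetical names; the weighted variant in (10) would replace the unit votes with distance-based neighbor weights):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_counting_features(X, Y, n_classes, k=10):
    """X: (m, d) features; Y: (m, q) class matrix with labels in
    {0, ..., n_classes[j]-1}; returns (m, sum(n_classes)) augmented features."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    idx = idx[:, 1:]                       # drop each instance itself
    feats = []
    for j, K_j in enumerate(n_classes):
        votes = Y[idx, j]                  # (m, k) neighbor labels in dim j
        counts = np.stack([(votes == a).sum(axis=1) for a in range(K_j)], axis=1)
        feats.append(counts)               # per-class neighbor counts
    return np.hstack(feats).astype(float)
```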

Experimental Results
According to the reported experimental results, we can make the following observations:

• Among all the 30 cases (10 data sets × 3 evaluation metrics), SFAM ranks first in 24 cases, ranks second in 1 case, ranks third in 1 case, ranks fourth in 1 case, ranks fifth in 5 cases, and never ranks last.
• BR solves the MDC problem by dealing with each dimension independently, where potential class dependencies are fully ignored. Although SFAM also induces classification models for each dimension independently, class dependencies can be considered via the augmented features [20]. SFAM achieves superior performance against BR in terms of each metric, which reveals that considering class dependencies is important for learning MDC models.
• Both ECC and gMML explicitly consider the class dependencies, where a chaining order over class spaces or a Mahalanobis metric is employed to accomplish this task. SFAM also achieves superior performance against ECC and gMML in terms of each metric, which validates the superiority of SFAM's selective feature augmentation strategy.
• CP solves the MDC problem by dealing with all dimensions jointly via powerset transformation, which can be viewed as optimizing Exact Match. ESC and SEEM can be regarded as two improved versions of CP, where class spaces are grouped into super-classes according to conditional dependencies or each pair of class spaces is considered in the first-level learning. SFAM still achieves comparable performance against CP, ESC and SEEM in terms of Exact Match, and superior performance against them in terms of Hamming Score and Sub-Exact Match.

Further Analysis

1) Effectiveness of Algorithmic Design
In this paper, SFAM generates three kinds of augmented features according to (9), (10) and (11) respectively. To further investigate the effectiveness of SFAM's algorithmic design, we also compare SFAM with its three degenerated versions, each of which uses a single kind of augmented features. The one with the discrete version of kNN-augmented features in (9) is denoted as DeV1, which is also known as the KRAMd approach [20]; the one with the continuous version of kNN-augmented features in (10) is denoted as DeV2, which is also known as the KRAMc approach [20]; and the one with the maximum margin-augmented features in (11) is denoted as DeV3. It is worth noting that the baseline BR actually serves as another degenerated version without any kind of augmented features, whose experimental results have been reported and analyzed in Subsection 4.2.

Table 4 Experimental results (mean±std. deviation) of SFAM and its three degenerated versions. In addition, the performance rank on each data set is also shown in parentheses.

Detailed experimental results are shown in Table 4, and Table 5 summarizes the results of the Wilcoxon signed-ranks test (at 0.05 significance level). It is shown that SFAM achieves superior performance against DeV1 in terms of Hamming Score and Exact Match, and against DeV3 in terms of all metrics. For DeV2, although SFAM only achieves comparable performance in terms of all metrics, as shown in Table 4, among the 19 cases where the performance of SFAM differs from that of DeV2, there are 14 cases where SFAM performs better. These results clearly validate that SFAM can identify the pertinent and useful features from the three kinds of augmented features. Besides, both DeV1 and DeV2 achieve results similar to SFAM; a possible reason is that the two kinds of kNN-augmented features contain more useful discriminative information than the maximum margin-augmented features. Nonetheless, SFAM is able to utilize all the available information at hand to achieve better generalization performance.

Furthermore, Fig. 3 shows the weight matrix (absolute value) of the learned model w.r.t. the $\ell_1$ regularization. Specifically, following the notations in Section 3, Fig. 3 shows the absolute value of the weight matrix for the data sets Voice, TIC2000, Adult, and Default. In each figure, each row corresponds to the binary classification model of one class label; the first third of the columns corresponds to the discrete version of kNN-augmented features in (9), the middle third to the continuous version of kNN-augmented features in (10), and the last third to the maximum margin-augmented features in (11). It is shown that, within each third of the columns, the diagonal element usually takes the largest value in its corresponding row. Note that each element of the three kinds of augmented features (i.e., each column in Fig. 3) corresponds to one class label, and each binary classification model (i.e., each row in Fig. 3) also corresponds to one class label. In other words, the largest value corresponds to the augmented feature w.r.t. the model's own class label. It is also shown that each binary classification model is only related to part of the augmented features, where the model weights w.r.t. the two kinds of kNN-augmented features are usually larger than those w.r.t. the maximum margin-augmented features. This observation further supports the aforementioned conjecture that the two kinds of kNN-augmented features contain more useful discriminative information than the maximum margin-augmented features. Besides, we can also observe that the $\ell_1$-regularized model weights for some binary classification models are almost all zero, which means that not all binary classification models rely on the augmented features.

2) Parameter Sensitivity Analysis
The regularized classification model (1) has one trade-off parameter $\lambda$. In this subsection, we investigate how the performance of SFAM changes with different values of $\lambda$. Fig. 4 illustrates SFAM's performance fluctuation as $\lambda$ ranges over {0.01, 0.1, 1, 10, 100} on the data sets Flare1, WQplants, WQanimals and BeLaE. It is shown that the performance of SFAM generally degenerates with either a small or a large value of $\lambda$, and $\lambda = 1$ is usually a better choice. Therefore, we fix $\lambda$ to 1 in all the previous comparative studies.

Conclusions
Feature augmentation has been shown to be an effective strategy for solving the MDC problem. Existing works only focus on how to generate better augmented features, while it might be beneficial to exploit multiple kinds of augmented features generated by different techniques. This paper makes the first attempt towards synergizing the discriminative information residing in multiple kinds of augmented features. Accordingly, a novel strategy named selective feature augmentation is proposed, which assumes that only part of the augmented features is pertinent and useful for each dimension's model induction.
Comparative studies clearly validate the effectiveness of the proposed strategy.
Current feature augmentation works simply concatenate the original and augmented features, though the proposed SFAM treats them differently via different regularization terms. In fact, the original and augmented features (and even different kinds of augmented features) can be regarded as features from different views [33]. In the future, ensemble strategies borrowed from multi-view learning could be used instead of the mere concatenation operation. Besides, this paper only generates three simple kinds of augmented features to validate the proposed selective feature augmentation strategy; it is also worthwhile to investigate generating more kinds of augmented features.