Sequential approaches for learning datumwise sparse representations
DOI: 10.1007/s10994-012-5306-7
Cite this article as:
Dulac-Arnold, G., Denoyer, L., Preux, P. et al. Mach Learn (2012) 89: 87. doi:10.1007/s10994-012-5306-7
Abstract
In supervised classification, data representation is usually considered at the dataset level: one looks for the “best” representation of data assuming it to be the same for all the data in the data space. We propose a different approach where the representations used for classification are tailored to each datum in the data space. One immediate goal is to obtain sparse datumwise representations: our approach learns to build a representation specific to each datum that contains only a small subset of the features, thus allowing classification to be fast and efficient. This representation is obtained by way of a sequential decision process that sequentially chooses which features to acquire before classifying a particular point; this process is learned through algorithms based on Reinforcement Learning.
The proposed method performs well on an ensemble of medium-sized sparse classification problems. It offers an alternative to global sparsity approaches, and is a natural framework for sequential classification problems. The method extends easily to a whole family of sparsity-related problems which would otherwise require developing specific solutions. This is the case in particular for cost-sensitive and limited-budget classification, where feature acquisition is costly and is often performed sequentially. Finally, our approach can handle non-differentiable loss functions or combinatorial optimization encountered in more complex feature selection problems.
Keywords
Classification, Features selection, Sparsity, Sequential models, Reinforcement learning
1 Introduction
Feature Selection is one of the main contemporary issues in Machine Learning and has been approached from many directions. One modern approach to feature selection in linear models consists in minimizing an \(L_{0}\)-regularized empirical risk. This particular risk encourages the model to have a good balance between a low classification error and a high sparsity (where only a few features are used for classification). As the \(L_{0}\)-regularized problem is combinatorial, many approaches such as the LASSO (Tibshirani 1994) try to address the combinatorial problem by using more practical norms such as \(L_{1}\). These classical approaches to sparsity aim at finding a sparse representation of the feature space that is global to the entire dataset.
We propose a new approach to sparsity where the goal is to limit the number of features used per datapoint classified, hence the name datumwise sparse classification model (DWSM). Our approach allows the choice of features used for classification to vary from one datapoint to another; data points that are easy to classify can be classified with only a few features, while others that require more information can be classified using more features. The underlying motivation is that, while classical approaches balance accuracy and sparsity at the dataset level, our approach optimizes this balance at the individual datum level.
This approach differs from the usual feature selection paradigm, which operates at a global level on the dataset. We believe that several problems could benefit from our approach. For many applications, a decision could be taken by observing only a few features for most items, whereas other items would require closer examination. In these situations, datumwise approaches can achieve higher overall sparsity than classical methods. In our opinion, however, this is not the only important aspect of this model, as there are two primary domains where this alternative approach to sparsity can also be beneficial. First, this is a natural framework for sequential classification tasks, where a decision regarding an item class is taken as soon as enough information has been gathered on this item (Louradour and Kermorvant 2011; Dulac-Arnold et al. 2011). Second, the proposed framework naturally adapts to a variety of sparsity or feature selection problems that would require new specific solutions with classical approaches. It also allows for the handling of situations where the loss function is not continuous, situations that are difficult to cope with through classical optimization. It can be easily adapted, for example, to limited-budget or cost-sensitive problems where the acquisition of features is costly, as is often the case in domains such as diagnosis, medicine, biology or even information extraction problems (Kanani and McCallum 2012). DWSM also easily handles complex problems where features admit a certain inherent structure.
In order to solve the combinatorial feature selection problem, we propose to model feature selection and classification as a single sequential Markov Decision Process (MDP). A scoring function associated with the MDP policy returns at any time the best possible action. Each action consists either in choosing a new feature for a given datum x or in deciding on the class of x if enough information is available to do so. During inference, this score function allows us to greedily choose which features to use for classifying each particular datum. Learning the policy is performed using an algorithm inspired by Reinforcement Learning (Sutton and Barto 1998). In this sequential decision process, datumwise sparsity is obtained by introducing a penalizing reward when the agent chooses to incorporate an additional feature into the decision process. We show that an optimal policy in this MDP corresponds to an optimal classifier w.r.t. the datumwise loss function, which incorporates a sparsity-inducing term.
The contributions of this paper are fourfold:
 1. We formally introduce the concept of datumwise sparse classification, and propose a new classifier model whose goal is twofold: maximize the classification accuracy, while using as few features as possible to represent each datum. The model considers classification as a sequential process, where the model's choice of features is dependent on the current datum's values.
 2. We formalize this sequential model using a Markov Decision Process. Using this MDP, we show the equivalence between learning to maximize a reward function and minimizing a datumwise loss function. This new approach results in a model that obtains good classification performance while maximizing datumwise sparsity, i.e. minimizing the mean number of features used to classify each datum. Our model also naturally handles multiclass classification problems, solving them by using as few features as possible for all classes combined.
 3. We propose a series of extensions to our base model that allow us to deal with variants of the feature selection problem, such as: hard budget classification, group features, cost-sensitive classification, and relational features. These extensions aim at showing that the proposed sequential model is more than a classical sparse classifier, and can be easily adapted to many different classification tasks where the feature selection/acquisition process is complex, all while maintaining good classification accuracy.
 4. We perform a series of experiments on many different corpora: 13 for the base model and additional corpora for the extensions. We then compare our results with those obtained by optimizing the LASSO problem (Efron et al. 2004), an \(L_{1}\)-regularized SVM, and CART decision trees. This provides a qualitative study of the behavior of our algorithm. Additionally, we perform a series of experiments to demonstrate the potential of the various extensions proposed to our model.
2 Datumwise sparse classifiers
Standard empirical risk minimization does not consider any prior assumption or constraint concerning the form of the solution and can result in overfitting models. Moreover, when facing a very large number of features, the solutions obtained usually need to perform computations on all the features for classifying each datum, thus negatively impacting the model's classification speed. We propose a different risk minimization problem where we add a penalization term that encourages the obtained classifier to classify using on average as few features as possible. In comparison to classical \(L_{0}\)- or \(L_{1}\)-regularized approaches, where the goal is to constrain the number of features used at the dataset level, our approach enforces sparsity at the datum level, allowing the classifier to use different features when classifying different inputs. This results in a datumwise sparse classifier that, when possible, uses only a few features for classifying easy inputs, and more features for classifying difficult or ambiguous ones.
This definition of datumwise classifiers has two main advantages: First, as we will see in the next section, because \(f_{\theta}\) can explain its use of features with \(z_{\theta}(\mathbf{x})\), we can add constraints on the features used for classification. This allows us to encourage datumwise sparsity, which we define below. Second, experimental analysis of \(z_{\theta}(\mathbf{x})\) gives a qualitative explanation of how the classification decision has been made, which we discuss in Sect. 6. Note that the way we define datumwise classification is an extension of the usual definition of a classifier.
2.1 Datumwise sparsity
Note that the optimization of the loss defined in Eq. (2) is a combinatorial problem that becomes quickly intractable. In the next section, we propose an original way to deal with this problem, based on a Markov Decision Process.
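For reference, the datumwise regularized problem that Eq. (2) denotes can be written in the same form as the hard budget problem listed in Sect. 4, which differs only by an additional budget constraint. The form below is a sketch reconstructed from that entry, under the assumption that \(\varDelta\) is the classification loss, \(y_{\theta}(\mathbf{x}_{\mathbf{i}})\) the predicted label, \(z_{\theta}(\mathbf{x}_{\mathbf{i}})\) the binary vector of selected features, and \(\lambda\) the trade-off between accuracy and sparsity:

$$ \theta^{*}=\mathop{\mathrm{argmin}}_{\theta} \frac{1}{N} \sum_{i=1}^{N} \varDelta\bigl(y_{\theta}(\mathbf{x}_{\mathbf{i}}),y_{i}\bigr) + \lambda\frac{1}{N} \sum_{i=1}^{N} \bigl\Vert z_{\theta}(\mathbf{x}_{\mathbf{i}}) \bigr\Vert_{0}. $$

The second term is the mean number of features used per datum, which is why the search space grows as the set of all feature subsets for every training example.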
3 Datumwise sparse sequential classification
In this section, we begin by proposing a model that allows us to solve the problem posed in Eq. (2). Then we explain how we can use this model to classify in a sequential manner that allows for datumwise sparsity.
3.1 Markov decision process

A set of states \(\mathcal{X} \times\mathcal{Z}\), where the tuple (x,z) corresponds to the state where the agent is considering datum x and has selected features specified by z. The number of currently selected features is thus ∥z∥_{0}.

A set of actions \(\mathcal{A}\) where \(\mathcal{A}(\mathbf{x},\mathbf{z})\) denotes the set of possible actions in state (x,z). We consider two types of actions:

\(\mathcal{A}_{f}\) is the set of feature selection actions such that, for \(a \in\mathcal{A}_{f}\), choosing action \(a=a_{j}\) corresponds to choosing feature \(f_{j}\). Note that the set of possible feature selection actions on state (x,z), denoted \(\mathcal{A}_{f}(\mathbf{x},\mathbf{z})\), is equal to the subset of currently unselected features, i.e. \(\mathcal{A}_{f}(\mathbf{x},\mathbf{z}) = \{a_{j} \in \mathcal{A}_{f}, \text{ s.t. } \mathbf{z}_{j}=0\}\).

\(\mathcal{A}_{y}\) is the set of classification actions—one action for each possible class—that correspond to assigning a label to the current datum. Classification actions stop the sequential decision process.


A transition function defined only for feature selection actions (since classification actions are terminal):$$\mathcal{T}: \begin{cases} \mathcal{X}\times\mathcal{Z}\times\mathcal{A}_f\rightarrow \mathcal{X}\times\mathcal{Z}, \\ \mathcal{T}((\mathbf{x},\mathbf{z}), a) = (\mathbf{x},\mathbf{z}') \end{cases} $$where \(\mathbf{z}'\) is an updated version of \(\mathbf{z}\) such that, for \(a=a_{j}\), \(\mathbf{z}'=\mathbf{z}+\mathbf{e}_{j}\), with \(\mathbf{e}_{j}\) the vector whose value is 1 for component j and 0 for the other components.
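The state, action sets and transition function listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `State`, `initial_state`, `feature_actions`, `class_actions` and `transition`, as well as the encoding of actions as `("feature", j)` or `("class", k)` tuples, are assumptions made for the example and do not come from the original work.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical encoding: ("feature", j) selects f_j, ("class", k) assigns label k.
Action = Tuple[str, int]


@dataclass(frozen=True)
class State:
    x: Tuple[float, ...]  # the datum under consideration
    z: Tuple[int, ...]    # z[j] == 1 iff feature j has been selected


def initial_state(x: Sequence[float]) -> State:
    """Start of an episode: no feature has been selected yet."""
    return State(tuple(x), (0,) * len(x))


def feature_actions(s: State) -> List[Action]:
    """A_f(x, z): one action per currently unselected feature."""
    return [("feature", j) for j, zj in enumerate(s.z) if zj == 0]


def class_actions(n_classes: int) -> List[Action]:
    """A_y: one terminal action per possible class."""
    return [("class", k) for k in range(n_classes)]


def transition(s: State, a: Action) -> State:
    """T((x, z), a_j) = (x, z + e_j); undefined for classification actions."""
    kind, j = a
    if kind != "feature" or s.z[j] == 1:
        raise ValueError("transition is defined only for unselected features")
    return State(s.x, s.z[:j] + (1,) + s.z[j + 1:])
```

The number of selected features in a state is simply `sum(s.z)`, which plays the role of ∥z∥_{0}.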
3.1.1 Policy
Here \(r_{\theta}^{t} \mid((\mathbf{x},\mathbf{z}),a)\) corresponds to the reward obtained at step t. Taking the sum of these rewards gives us the total reward from state (x,z) until the end of the episode. Since the policy is deterministic, we may refer to a parameterized policy using only θ. Note that the optimal parameterization \(\theta^{*}\) obtained after learning (see Sect. 3.4) is the parameterization that maximizes the expected reward over all possible state-action pairs.
In practice, the initial state of such a process for an input x corresponds to an empty z vector where no feature has been selected. The policy θ sequentially picks, one by one, a set of features pertinent to the classification task, and then chooses to classify once enough features have been considered.
3.1.2 Reward

If a corresponds to a feature selection action i.e. \(a \in \mathcal{A}_{f}\):$$r(\mathbf{x}_{\mathbf{i}},\mathbf{z},a) = -\lambda. $$

If a corresponds to a classification action i.e. \(a \in \mathcal{A}_{y}\):$$r(\mathbf{x}_{\mathbf{i}},\mathbf{z},a) = \left\{ \begin{array}{l@{\quad}l} 0 & \text{if }a = y_i,\\ -1 & \text{if }a \neq y_i. \end{array} \right. $$
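As a small illustration, the reward can be written as a plain function. The sketch assumes that rewards are penalties, i.e. −λ for each acquired feature and −1 for a misclassification, in line with the penalizing reward described in the introduction; the function names and the tuple encoding of actions are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical encoding: ("feature", j) or ("class", k).
Action = Tuple[str, int]


def reward(y_true: int, action: Action, lam: float) -> float:
    """Reward of one action on a training example with label y_true."""
    kind, index = action
    if kind == "feature":
        return -lam                              # penalty for acquiring a feature
    if kind == "class":
        return 0.0 if index == y_true else -1.0  # penalty for a wrong label
    raise ValueError(f"unknown action kind: {kind!r}")


def episode_return(y_true: int, actions: Iterable[Action], lam: float) -> float:
    """Total reward of an episode, given the sequence of actions taken."""
    return sum(reward(y_true, a, lam) for a in actions)
```

For instance, acquiring two features with λ = 0.25 and then predicting the correct class gives a total reward of −0.5, whereas predicting the wrong class after the same two acquisitions gives −1.5.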
3.2 Reward maximization and loss minimization
Here, \(\pi_{\theta}(\mathbf{x}_{\mathbf{i}}, \mathbf{z}_{\theta}^{(t)})\) is the action taken at time t by the policy \(\pi_{\theta}\) for the training example \(\mathbf{x}_{\mathbf{i}}\), and \(T_{\theta}(\mathbf{x}_{\mathbf{i}})\) is the number of features acquired by \(\pi_{\theta}\) before classifying \(\mathbf{x}_{\mathbf{i}}\).
Such an equivalence between risk minimization and reward maximization shows that the optimal classifier \(\theta^{*}\) corresponds to the optimal policy in the MDP defined previously. This equivalence allows us to use classical MDP resolution algorithms (Sutton and Barto 1998) in order to find the best classifier. We detail the learning procedure in Sect. 3.4.
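The equivalence can be checked on a single training example. The derivation below is a sketch under three assumptions: each feature selection action is rewarded −λ, a misclassification is rewarded −1 (so that the classification reward is the negative 0/1 loss \(\varDelta\)), and time steps are indexed from t=1. An episode then consists of \(T_{\theta}(\mathbf{x}_{\mathbf{i}})\) feature selections followed by one classification action:

$$ \sum_{t=1}^{T_{\theta}(\mathbf{x}_{\mathbf{i}})+1} r\bigl(\mathbf{x}_{\mathbf{i}}, \mathbf{z}_{\theta}^{(t)}, \pi_{\theta}(\mathbf{x}_{\mathbf{i}}, \mathbf{z}_{\theta}^{(t)})\bigr) = -\lambda\, T_{\theta}(\mathbf{x}_{\mathbf{i}}) - \varDelta\bigl(y_{\theta}(\mathbf{x}_{\mathbf{i}}), y_{i}\bigr) = -\Bigl(\varDelta\bigl(y_{\theta}(\mathbf{x}_{\mathbf{i}}), y_{i}\bigr) + \lambda \bigl\Vert z_{\theta}(\mathbf{x}_{\mathbf{i}}) \bigr\Vert_{0}\Bigr), $$

since \(T_{\theta}(\mathbf{x}_{\mathbf{i}}) = \Vert z_{\theta}(\mathbf{x}_{\mathbf{i}}) \Vert_{0}\). Averaging over the N training examples, the mean total reward is the negative of the datumwise regularized loss, so maximizing one is the same as minimizing the other.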
3.3 Inference and approximated decision processes
Due to the infinite number of possible inputs x, the number of states is also infinite. Moreover, the reward function r(x,z,a) is only known for the values of x that are in the training set and cannot be computed for any other datum. For these two reasons, it is not possible to compute the score function for all state-action pairs in a tabular manner, and this function has to be approximated.
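Once an approximate score function is available, inference reduces to a greedy loop: score every allowed action, take the best one, and stop at the first classification action. The sketch below illustrates this loop only; `score` stands in for the learned score function and is passed as a hypothetical callable, and the tuple encoding of actions is likewise an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical encoding: ("feature", j) or ("class", k).
Action = Tuple[str, int]
ScoreFn = Callable[[Sequence[float], Sequence[int], Action], float]


def allowed_actions(z: Sequence[int], n_classes: int) -> List[Action]:
    """Unselected features plus every classification action."""
    feats = [("feature", j) for j, zj in enumerate(z) if zj == 0]
    return feats + [("class", k) for k in range(n_classes)]


def greedy_classify(x: Sequence[float], score: ScoreFn,
                    n_classes: int) -> Tuple[int, List[int]]:
    """Follow the greedy policy; return the label and the feature mask z."""
    z = [0] * len(x)
    while True:
        best = max(allowed_actions(z, n_classes),
                   key=lambda a: score(x, z, a))
        kind, index = best
        if kind == "class":
            return index, z      # classification actions end the episode
        z[index] = 1             # acquire the chosen feature and continue
```

The loop always terminates: each iteration either classifies or selects a new feature, and once all features are selected only classification actions remain.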
3.4 Learning
The goal of the learning phase is to find an optimal policy parameterization \(\theta^{*}\) which maximizes the expected reward, thus minimizing the datumwise regularized loss defined in Eq. (2). The combinatorial space consisting of all possible feature subsets for each individual datum in the training set is extremely large. Therefore, we cannot exhaustively explore the state space during training, and thus use a Monte-Carlo approach to sample example states from the learning space.
 1. The algorithm begins by sampling a set of random states: the x vector is sampled from a uniform distribution over the training set, and z is also sampled using a uniform binomial distribution.
 2. For each sampled state, the policy \(\pi_{\theta^{(t-1)}}\) is used to compute the expected reward of choosing each possible action from that state. We now have a feature vector Φ(x,z,a) for each state-action pair in the sampled set, and the corresponding expected reward denoted \(R_{\theta^{(t-1)}}(\mathbf{x},\mathbf{z},a)\).
 3. The parameters \(\theta^{(t)}\) of the new policy are then computed using classical linear regression, with the feature vectors Φ(x,z,a) as inputs and the corresponding expected rewards \(R_{\theta^{(t-1)}}(\mathbf{x},\mathbf{z},a)\) as regression targets. The resulting linear model gives an estimated score to state-action pairs even if we have never seen them previously.
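Steps 1 and 2 above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes rewards of −λ per feature and −1 per misclassification, samples each component of z as an independent fair coin flip, and uses a single deterministic rollout per state-action pair. All names (`rollout_return`, `sample_state`, `regression_targets`) are hypothetical, and the regression of step 3 is left out.

```python
import random
from typing import Callable, List, Sequence, Tuple

Action = Tuple[str, int]  # hypothetical: ("feature", j) or ("class", k)
Policy = Callable[[Sequence[float], Sequence[int]], Action]
Example = Tuple[Sequence[float], int]


def reward(y: int, a: Action, lam: float) -> float:
    # Assumed penalties: -lam per feature, -1 for a wrong label.
    if a[0] == "feature":
        return -lam
    return 0.0 if a[1] == y else -1.0


def rollout_return(x, y, z, a: Action, policy: Policy, lam: float) -> float:
    """Take action a in state (x, z), then follow the policy until it classifies."""
    z = list(z)
    total = 0.0
    while True:
        total += reward(y, a, lam)
        if a[0] == "class":
            return total
        if z[a[1]] == 1:
            raise ValueError("an already selected feature was chosen")
        z[a[1]] = 1
        a = policy(x, z)


def sample_state(data: List[Example], rng: random.Random):
    """Step 1: uniform x from the training set, random binary z (assumed Bernoulli 0.5)."""
    x, y = rng.choice(data)
    z = [rng.randint(0, 1) for _ in x]
    return x, y, z


def regression_targets(data, policy, n_classes, lam, n_states, rng):
    """Step 2: one (x, z, action, estimated return) row per sampled state-action pair."""
    rows = []
    for _ in range(n_states):
        x, y, z = sample_state(data, rng)
        actions = [("feature", j) for j, zj in enumerate(z) if zj == 0]
        actions += [("class", k) for k in range(n_classes)]
        for a in actions:
            rows.append((x, z, a, rollout_return(x, y, z, a, policy, lam)))
    return rows
```

Each row would then be mapped through Φ and fed to a linear regression to obtain the parameters of the next policy.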
RCPI is based on two different hyperparameters that have to be tuned manually: the number of states used for Monte-Carlo simulation and the number of rollout trajectories sampled for each state-action pair. These parameters have a direct influence on the performance of the algorithm and on the time spent learning. As explained in Lazaric et al. (2010), a good choice consists in sampling a large number of states with only a few rollout trajectories each. This is the choice we have made for the experiments.
4 Model extensions
So far, we have introduced the concept of datumwise sparsity, and showed how it can be modeled as a Sequential Decision Process. Let us now show how DWSM can be extended to tackle other types of feature selection problems. This section aims to show that the proposed DWSM model is very general and can easily be adapted to many new feature selection problems, while keeping its datumwise properties. We show how we can address the following classification tasks: hard budget feature selection, cost-sensitive feature acquisition, group feature sparsity, and relational feature sparsity. All of these problems have been derived from real-world applications and have been explored separately in different publications, where each problem is solved by a dedicated approach. We show that our model allows us to address all these tasks by making only a few changes to the original formulation.
We begin by providing an informal description of each of these tasks and describing the corresponding losses that are to be minimized on a training set. Note that these loss minimization problems are datumwise variants inspired by losses found in the literature, and are therefore slightly different. We then describe how these losses can be minimized by making simple modifications to the structure of the decision process described in Sect. 3. Experimental results are presented in Sect. 6.3.
4.1 Definitions

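Hard budget feature selection (Kapoor and Greiner 2005) considers that there is a fixed budget on feature acquisition, be it during training, during inference, per datum, or globally. We choose to put this constraint in place as a per-datum hard budget during inference. The goal is to maximize classification accuracy while respecting this strict limit on the feature budget. With M the maximum number of features allowed per datum, the corresponding loss minimization problem can be written as:$$ \begin{array}{l} \theta^{*}=\mathop{\mathrm{argmin}}_{\theta}\frac{1}{N}\sum_{i=1}^{N} \varDelta(y_{\theta}(\mathbf{x}_{\mathbf{i}}),y_{i}) + \lambda\frac{1}{N} \sum_{i=1}^{N} \Vert z_{\theta}(\mathbf{x}_{\mathbf{i}}) \Vert_{0} \\ \quad\text{subject to } \Vert z_{\theta}(\mathbf{x}_{\mathbf{i}}) \Vert_{0} \leq M. \end{array} $$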

Cost-sensitive feature acquisition and classification (Turney 1995; Greiner 2002; Ji and Carin 2007) is an important domain in both feature selection and active learning. The problem is defined by assigning a fixed cost to each feature the classifier can consider. Moreover, the cost of misclassification errors depends on the error made. For example, false positive errors will have a different cost than false negatives. The goal is thus to minimize the overall cost, which is composed of both the misclassification cost and the sum of the costs of all the features acquired for classifying. This task is well-suited for some medical applications where there is a cost associated with each medical procedure (blood test, x-ray, etc.) and a cost depending on the quality of the final diagnosis.
Let us denote by ξ the vector that contains the cost of each of the possible features. The datumwise minimization problem can be written as:$$ \theta^*=\mathop{\mathrm{argmin}}_\theta \frac{1}{N} \sum_{i=1}^{N} \varDelta_{cost}\bigl(y_\theta(\mathbf{x}_{\mathbf{i}}),y_i \bigr) + \frac{1}{N} \sum_{i=1}^{N} \bigl\langle\xi; z_\theta(\mathbf{x}_{\mathbf{i}}) \bigr\rangle, $$(6)where \(\varDelta_{cost}\) is a cost-sensitive error loss. Let C be a classification cost matrix such that \(C_{i,j}\) is the cost of classifying a datum as i when its real class is j. This cost is generally positive for i≠j, and negative or zero for i=j. We can thus define \(\varDelta_{cost}\) as:$$ \varDelta_{cost}(i,j) = C_{i,j}. $$(7)The matrix C is defined a priori by the problem one wants to solve.
Group feature selection has been previously considered in the context of the Group Lasso (Yuan and Lin 2006). In this problem, feature selection is considered in the context of groups of features; the classifier can choose to use a certain number of groups of features, but cannot select individual features. Many feature selection tasks present a certain organization in the feature space. For example, a subset of features \(f_{s}\) may all be somehow correlated, and need to be selected together. For instance, \(f_{s}\) may represent a discretized real variable, or an ensemble of values that correspond to a single physical test. These groups can either be defined relative to a certain structure already present in the data (Jenatton et al. 2011), or can be used to reduce the dimensionality of the problem. Let us consider the set of n features denoted \(\mathcal{F}\) and a set of g groups of features denoted \(\mathcal{F}_{1} \ldots\mathcal{F}_{g}\) such that \(\bigcup^{g}_{i=1} \mathcal{F}_{i} = \mathcal{F}\). Let us define the set of selected features for a particular datum \(\mathbf{x}_{\mathbf{i}}\) as \(\mathcal{Z}_{\theta}(\mathbf{x}_{\mathbf{i}}) = \{j \in \mathcal{F} \text{ s.t. } z^{j}_{\theta}(\mathbf{x}_{\mathbf{i}})=1\}\). The corresponding datumwise loss, inspired by the Group Lasso, tries to minimize the number of groups \(\mathcal{F}_{t}\) present in the actual set of selected features. A truth function is used to quantify the number of groups that have been chosen in \(\mathcal{Z}_{\theta}(\mathbf{x}_{\mathbf{i}})\), so that we may minimize their number.

Relational feature selection: Finally, we consider a more complex problem where features are organized in a complex structure. This problem, which we call relational feature selection, is inspired by structured sparsity (Huang et al. 2009). We imagine a couple of problems that can fall into this category:

Conditional features, where one or a subset of features can only be selected depending on the previously acquired features.

Constrained features, where the cost of acquiring a particular feature depends on the previously acquired features. For example, in computer vision, one can constrain a system to acquire values of pixels that are close in an image—see Sect. 6.3.3.
Let us define a boolean function that tells us if two features are related:$$ \mathit{Related}{:}\quad \begin{cases} \mathcal{A}_f \times\mathcal{A}_f \rightarrow\{1,0\},\\ \mathit{Related}(f, f') = 1\quad \mbox{if related, else } 0. \end{cases} $$(9)In the case of this function, the relation can be any imaginable constraint that can be calculated simply by considering f and f′. The underlying idea is that acquiring features that are somehow related is less expensive than acquiring features that do not share a relation. The corresponding loss can be written as:$$ \theta^{*}=\mathop{\mathrm{argmin}}_{\theta}\frac{1}{N}\sum_{i=1}^{N} \varDelta(y_{\theta}(\mathbf{x}_{\mathbf{i}}),y_{i}) + \frac{1}{N} \sum_{i=1}^{N} \sum_{f,f' \in\mathcal{Z}_{\theta}(\mathbf{x}_{\mathbf{i}})} \bigl(\mathit{Related}(f,f')(\lambda - \gamma) + \gamma\bigr). $$Here, the term Related(f,f′)(λ−γ)+γ equals λ if f and f′ are related, and γ otherwise. In that definition, the cost of acquiring unrelated features is γ while the cost of related features is λ. Therefore, to encourage the use of related features, one simply needs to set γ>λ.
Proposed tasks and corresponding learning problems

| Task | Loss minimization problem |
| --- | --- |
| Hard budget | \(\begin{array}{l} \theta^{*}=\mathop{\mathrm{argmin}}_{\theta}\frac{1}{N}\sum_{i=1}^{N} \varDelta(y_{\theta}(\mathbf{x}_{\mathbf{i}}),y_{i}) + \lambda\frac{1}{N} \sum_{i=1}^{N} \Vert z_{\theta}(\mathbf{x}_{\mathbf{i}}) \Vert_{0} \\ \quad\text{subject to } \Vert z_{\theta}(\mathbf{x}_{\mathbf{i}}) \Vert_{0} \leq M \end{array}\) |
| Cost-sensitive | \(\theta^{*}=\mathop{\mathrm{argmin}}_{\theta}\frac{1}{N}\sum_{i=1}^{N} \varDelta_{cost}(y_{\theta}(\mathbf{x}_{\mathbf{i}}),y_{i}) + \frac{1}{N} \sum_{i=1}^{N} \langle\xi; z_{\theta}(\mathbf{x}_{\mathbf{i}}) \rangle\) |
| Grouped features | |
| Relational features | \(\begin{array}{l} \theta^{*}=\mathop{\mathrm{argmin}}_{\theta}\frac{1}{N}\sum_{i=1}^{N} \varDelta(y_{\theta}(\mathbf{x}_{\mathbf{i}}),y_{i})\\[3pt] \qquad{}+ \frac{1}{N} \sum_{i=1}^{N} \sum_{f,f' \in\mathcal{Z}_{\theta}(\mathbf{x}_{\mathbf{i}})} \bigl(\mathit{Related}(f,f')(\lambda - \gamma) + \gamma\bigr) \end{array}\) |
4.2 Adapting the datumwise classifier
In the rest of this section, we will show how these different tasks can be easily solved by making slight modifications to the model proposed in Sect. 3. We will first cover the general idea underlying these modifications, and then detail for each of the previously described tasks how they can be handled by the sequential process. This section aims at showing that our approach is not only a novel way to compute sparse models, but also an original and flexible approach that allows one to easily imagine many different models for solving complex classification problems.
4.2.1 General idea
Our model is based on the idea of proposing a sequential decision process where the long-term reward obtained by a policy is equal to the negative loss obtained by the corresponding classifier. With such an equivalence, the optimal policy obtained through learning is thus equivalent to an optimal classifier for the particular loss considered. In order to deal with the previously proposed classification problems, we simply modify DWSM's MDP so that it corresponds to the new loss function.
The main advantage of making only small changes to the structure of the decision process is that we do not need to change the principles of the learning and inference algorithms, thus resulting in new classifiers that are very simple to specify and implement. We believe that this approach is well suited for solving real-world classification tasks that often correspond to complex loss functions.
In order to deal with the four different proposed tasks, we have to make modifications to the MDP by changing: the reward function r(⋅,⋅,⋅), the action set \(\mathcal{A}(\cdot, \cdot)\), and/or the transition function \(\mathcal{T}(\cdot,\cdot)\).
To be able to adapt our model to these new problems, we do not need to modify either the feature projector, Φ(⋅,⋅), the actual definition of the state space, the learning algorithm, or the inference algorithm.
Summary of the modifications made for incorporating the different variants in the original model

| Task | Decision process modification | Commentary |
| --- | --- | --- |
| Hard budget | \(\mathcal{A}(\mathbf{x},\mathbf{z}) = \begin{cases} \mathcal{A}_{f}(\mathbf{x},\mathbf{z}) \cup\mathcal{A}_{y}(\mathbf{x},\mathbf{z}) & \text{if } \Vert \mathbf{z}\Vert_{0} < M \\ \mathcal{A}_{y}(\mathbf{x},\mathbf{z}) & \text{if } \Vert\mathbf{z}\Vert_{0} = M \end{cases}\) | Allows users to choose a minimum level of sparsity. Reduces training complexity. |
| Cost-sensitive | \(r(\mathbf{x}_{\mathbf{i}},\mathbf{z},a) = \begin{cases} -\xi_{j} & \text{if } a = a_{j} \in\mathcal{A}_{f} \\ -C_{a,y_{i}} & \text{if } a \in\mathcal{A}_{y} \end{cases}\) | Well-suited for features with variable costs. |
| Grouped features | \(\mathcal{A}_{f} = \mathcal{A}_{group}\), \(\mathcal{T}\bigl((\mathbf{x},\mathbf{z}), a_{j}\bigr) = \bigl( \mathbf{x},\mathbf{z} + \sum_{i \in\mathcal{F}_{j}} \mathbf{e}_{\mathbf{i}}\bigr)\) | Well adapted to features presenting a grouped nature. Complexity is reduced. |
| Relational features | \(r(\mathbf{x},\mathbf{z},a_{j}) = \begin{cases} -\lambda & \text{if } \forall f \in\mathcal{Z}(\mathbf{x}), \mathit{Related}(f_{j},f) = 1 \\ -\gamma & \text{otherwise} \end{cases}\) | Naturally suited for complex feature interdependencies. |
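The two reward modifications summarized in the table (cost-sensitive and relational) can be illustrated with a short sketch. It assumes, in line with the principle that the long-term reward equals the negative loss, that every cost enters the reward with a negative sign. The function names, argument layout and the `related` callable are hypothetical.

```python
from typing import Callable, Iterable, Sequence


def cost_sensitive_reward(kind: str, index: int, y_true: int,
                          feature_costs: Sequence[float],
                          cost_matrix: Sequence[Sequence[float]]) -> float:
    """Negative feature cost xi_j, or negative cost C[a][y] of predicting a for true class y."""
    if kind == "feature":
        return -feature_costs[index]
    if kind == "class":
        return -cost_matrix[index][y_true]
    raise ValueError(f"unknown action kind: {kind!r}")


def relational_reward(j: int, selected: Iterable[int],
                      related: Callable[[int, int], bool],
                      lam: float, gamma: float) -> float:
    """-lam if feature j is related to every selected feature, -gamma otherwise."""
    if all(related(j, f) for f in selected):
        return -lam
    return -gamma
```

With γ > λ, acquiring a feature unrelated to the current selection is penalized more heavily. Note that `all` over an empty selection is true, so the first feature acquired costs λ under this reading of the table.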
4.2.2 Hard budget
This new action set function allows the model to either choose a new feature or classify as long as the number of selected features ∥z∥_{0} is less than M. Once M features have been selected, only classification actions may be performed by the classifier.
One advantage of this constraint is to reduce the complexity of the training algorithm, since the maximum size of a trajectory in the decision process is now M. This has the effect of limiting the length of each rollout, thus making the training simulation much faster to compute.
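A minimal sketch of the budgeted action set, with a hypothetical function name and tuple encoding of actions, could look as follows:

```python
from typing import List, Sequence, Tuple

Action = Tuple[str, int]  # hypothetical: ("feature", j) or ("class", k)


def budgeted_actions(z: Sequence[int], n_classes: int, budget: int) -> List[Action]:
    """A(x, z) under a per-datum hard budget of `budget` features."""
    class_actions = [("class", k) for k in range(n_classes)]
    if sum(z) >= budget:
        # Budget exhausted: only classification remains possible.
        return class_actions
    feature_actions = [("feature", j) for j, zj in enumerate(z) if zj == 0]
    return feature_actions + class_actions
```

Because feature actions disappear once the budget is reached, no rollout can contain more than `budget` feature acquisitions.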
4.2.3 Costsensitive
4.2.4 Group features
Because the size of \(\mathcal{A}_{group}\) is smaller than the size of \(\mathcal{A}_{f}\), the new MDP has a learning and inference complexity which is greatly reduced relative to the original problem. This allows us to deal both with datasets where features are "naturally" grouped and with datasets with a large number of features, by artificially gathering the features into groups. We consider both of these approaches experimentally in Sect. 6.3.3.
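The grouped transition, in which one action acquires every feature of a group at once, can be sketched as below. The function name and the representation of groups as lists of feature indices are assumptions made for the example.

```python
from typing import List, Sequence


def group_transition(z: Sequence[int], groups: Sequence[Sequence[int]], j: int) -> List[int]:
    """Return z + sum of e_i over the features i of group j, as a new mask."""
    z_new = list(z)
    for i in groups[j]:
        z_new[i] = 1  # every feature of the chosen group becomes selected
    return z_new


def group_actions(z: Sequence[int], groups: Sequence[Sequence[int]]) -> List[int]:
    """Indices of groups that still contain at least one unselected feature."""
    return [j for j, g in enumerate(groups) if any(z[i] == 0 for i in g)]
```

With g groups instead of n individual features, the number of feature selection actions to score at each step drops from at most n to at most g.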
4.2.5 Relational features
4.3 Overview of model extensions
As we have seen in this section, our model is easily adaptable to a large set of sparsity-inspired problems. What we believe to be of particular interest is not only that our model can function under complex constraints, but that adapting it to these constraints is relatively straightforward. Indeed, one can imagine many more sparsity-inspired problems requiring constraints that are not easily dealt with by traditional classification approaches, yet that could be easily expressed with our model. For this reason, we believe that our model's adaptability to more complex problems is one of its strong points.
5 Complexity analysis and scaling
Let us focus on the analysis of the complexity of the proposed model. We detail the complexity concerning the initial datumwise sparse model proposed in Sect. 2, and then detail the complexity of each proposed extension. We discuss the ability of our approach to deal with datasets with a large number of features and propose different possible extensions for reducing the complexity of the approach.
5.1 Complexity of the datumwise sparse classifier
Inference complexity
Inference on an input x consists in sequentially choosing features, and then classifying x. Having acquired t features, on a dataset with n features and c categories in total, one has to perform (n−t)+c linear computations through the \(s_{\theta}\) function in order to choose the best action at each state. The inference complexity is thus \(O(N_{f} \cdot n)\), where \(N_{f}\) is the mean number of features chosen by the system before classifying. In fact, due to the shape of the Φ function presented in Sect. 3.3 and the linear nature of \(s_{\theta}\), the score of the actions can be computed incrementally at each step of the process by only adding the contribution of the newly added feature to each action's score. This complexity allows the model to quickly classify very large datasets with many examples. The inference complexity is the same for all the proposed variants.
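The count behind this bound can be made explicit. As a sketch, suppose the policy acquires T features before classifying a given input, so that it evaluates the score function at steps t=0,…,T with (n−t)+c candidate actions at step t:

$$ \sum_{t=0}^{T} \bigl((n-t)+c\bigr) = (T+1)(n+c) - \frac{T(T+1)}{2} \leq (T+1)(n+c). $$

Averaging over inputs replaces T by \(N_{f}\), which gives on the order of \(N_{f} \cdot (n+c)\) score evaluations, i.e. \(O(N_{f} \cdot n)\) under the assumption that the number of classes c does not dominate the number of features n.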
Learning complexity
Learning complexity of the different variants of the Datum-Wise Model. n is the number of features, c the number of classes, \(N_{s}\) is the number of states used for rollouts

| Variant | Learning complexity | Remarks |
| --- | --- | --- |
| Sparse model | \(\mathcal{O}(N_{s} \cdot T\cdot(n+c)^{2})\) | Limited to hundreds of features, T≈n. |
| Hard budget | \(\mathcal{O}(N_{s} \cdot\bar{T}\cdot(n+c)^{2})\) | Same complexity as the base model, shorter learning time with the budget \(\bar{T} \ll n\). |
| Grouped features | \(\mathcal{O}(N_{s} \cdot T\cdot(\bar{n}+c)^{2})\) | Same complexity as the base model, much shorter learning time with the number of groups \(\bar{n} \ll n\), and \(T\approx\bar{n}\). |
| Relational features | \(\mathcal{O}(N_{s}\cdot T\cdot(n+c)^{2})\) to \(\mathcal{O}(N_{s} \cdot T\cdot c^{2})\) | The complexity depends on the structure of the features. In the extreme case, where features have to be acquired in a fixed order, the complexity is \(\mathcal{O}(N_{s} \cdot T\cdot c^{2})\) and allows the model to scale linearly relative to features. |
5.2 Scalability
Although the learning complexity of our model is higher than that of baseline global linear methods, inference is linear. In practice, during training, most of the baseline methods select a subset of variables in a few seconds to a few minutes, whereas our method is an order of magnitude slower. The main difficulty is the increase in training time with the number of features. Inference, however, is performed at the same speed as baseline methods, which is in our opinion the important factor.

The complexity of the hard budget variant is clearly an order of magnitude lower than the complexity of the initial model. This variant is useful when one wants to obtain a very high sparsity by limiting the maximum number of used features to ten or twenty. In this case, this model is faster than the DWSM model and can be learned on larger datasets

The groupedfeatures model has a complexity which depends on the number of groups of features. So one possible solution when dealing with many features, is to group these features in a hundred of packets. These groups can be formed randomly, or by hand if the features are naturally organized in a complex structure. Such a use of the model on a large dataset is illustrated in Table 7.

When dealing with sparse datasets, the learning algorithm can be easily adapted for reducing its complexity. This acquisition of features can thus be restricted (during learning) to the subset of nonnull values, strongly reducing the number of possible actions in the MDP to the number of nonnull features.

Finally, faster Reinforcement Learning techniques are a possible way to speed up the learning phase. Recently developed techniques (Dulac-Arnold et al. 2012) reduce the complexity of our model from \(\mathcal{O}(N_{s}\cdot(n+c)^{3})\) to \(\mathcal{O}(N_{s}\cdot\log(n+c))\), at the price of a final suboptimal classification policy. These methods will be tested on this task in future work.
6 Experiments
The experimental section is organized as follows. First, we present the results obtained by the basic DWSM on 8 binary classification datasets in Sect. 6.1. We analyze the performance and behavior of this algorithm and compare it with state-of-the-art methods. We then present results for multiclass classification with this base model in Sect. 6.2. After that, we describe experiments performed with the four extensions to the base model proposed in Sect. 4 on the binary datasets and additional corpora. For brevity, we present only representative results in the core article, and provide the results obtained on all binary datasets with the DWSM and its variants in the Appendix.
6.1 Sparse binary classification
Binary classification dataset characteristics

| Name | Number of features | Number of examples | Number of classes | Task |
|---|---|---|---|---|
| Australian | 14 | 690 | 2 | Binary |
| Breast Cancer | 10 | 683 | 2 | Binary |
| Diabetes | 8 | 768 | 2 | Binary |
| Heart | 13 | 270 | 2 | Binary |
| Ionosphere | 34 | 351 | 2 | Binary |
| Liver Disorders | 6 | 345 | 2 | Binary |
| Sonar | 60 | 208 | 2 | Binary |
| Splice | 60 | 1,000 | 2 | Binary |

LARS was used as a baseline linear model with L _{2} loss and L _{1} regularization.

L _{1}-SVM is an SVM classifier with L _{1} regularization, effectively providing LASSO-like sparsity.^{9}

CART (Breiman et al. 1984) was used as a baseline decision tree model.

Datum-Wise Sequential Model (DWSM) is the datum-wise sparse model presented above.
For evaluation, we used a classical accuracy measure, which corresponds to 1 − error rate on the test set of each dataset. Sparsity was measured as the proportion of features not used for the LARS and L _{1}-SVM models, and as the mean proportion of features not used to classify test examples for DWSM and CART. Each model was run with many different hyperparameter values: the C and ϵ values for LARS and L _{1}-SVM, the pruning value for CART, and λ for DWSM.
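The two sparsity measures and the accuracy measure described above can be written down as a short sketch. The function names and the toy inputs are assumptions made for illustration; they are not the evaluation code used for the experiments.

```python
def accuracy(y_true, y_pred):
    """1 - error rate on a test set."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def global_sparsity(weights):
    """Proportion of features not used by one global linear model.

    A feature counts as unused when its weight is exactly zero.
    """
    return sum(1 for w in weights if w == 0) / len(weights)


def datumwise_sparsity(used_features, n_features):
    """Mean, over test examples, of the proportion of features not used.

    used_features holds, for each example, the feature indices acquired
    before that example was classified.
    """
    per_example = [1 - len(set(u)) / n_features for u in used_features]
    return sum(per_example) / len(per_example)


# Toy values: a global model using 2 of 4 features, and a datum-wise model
# using 1, 2 and 4 features on three test examples.
lasso_like = global_sparsity([0.0, 0.7, 0.0, -1.2])
datumwise = datumwise_sparsity([[0], [0, 2], [0, 1, 2, 3]], n_features=4)
```

The global measure is a property of the model alone, while the datum-wise measure is an average over the test set, which is why the two kinds of models are scored differently.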
Concerning our method, the number of rollout states (step 1 of the learning algorithm) is set to ten states for each learning example and the number of policy iterations is set to ten.^{10} Note that experiments with more rollout states and/or more iterations give similar results. Experiments were made using an α-mixture policy^{11} with α=0.7 to ensure the stability of the learning process: a lower α value gives less stability, while a higher value requires more learning iterations for convergence. The following figures present accuracy/sparsity curves averaged over the 30 runs and also show the variance obtained over the 30 runs.
Figures within this experimental section present average accuracies over the 30 different splits. In the case of L _{1}-SVM and LARS, models have a fixed number of features. In the case of DWSM or decision trees, however, the number of features is variable, so results are representative of multiple sparsities within the same experiment. Horizontal error bars are therefore presented for our models.

In Fig. 2 (left), which corresponds to experiments on the breast cancer dataset, one can see that at a sparsity level of 70 % we obtain 96 % accuracy, while the two baselines obtain about 88 % to 90 %. The same observation holds for other datasets, as shown in Fig. 2 (right). Looking at all the curves given in the Appendix, one can see that our model tends to outperform the L _{1} models (LARS and L _{1}-SVM) on seven of eight datasets.^{13} In these cases, at similar levels of global sparsity, our approach tends to maintain higher accuracy.^{14} These results show that DWSM can be competitive w.r.t. L _{1} methods, and can outperform these baseline methods on a certain number of datasets.

We have also compared DWSM with CART. The latter shares some similarities with our sequential approach in the sense that, for both algorithms, only some of the features are considered before classifying a data point. Aside from this point, the two methods differ strongly; in particular, CART builds a global tree with a fixed number of possible paths, whereas DWSM adaptively decides for each pattern which features to use. Note that CART does not incorporate a specific sparsity mechanism and has never been fully compared to L _{1}-based methods in terms of accuracy and sparsity. Figure 3 gives illustrative results obtained on two datasets. In Fig. 3 (left), one can see that our method outperforms decision trees in terms of accuracy at the same level of sparsity. Moreover, DWSM allows one to easily obtain different models at different levels of sparsity, whereas for CART sparsity can only be controlled indirectly through the pruning mechanism. For some datasets, CART outperforms both DWSM and L _{1}-based models. An example is given in Fig. 3 (right), where at the 0.9 level of sparsity, CART achieves about 90 % accuracy, while DWSM achieves performance similar to the baseline methods. CART’s advantage is most likely linked to its ability to create a highly nonlinear decision boundary.
Qualitative results
The histogram in Fig. 4 (right) describes the average number of testing examples classified with a particular number of acquired features. For example, DWSM classifies around 60 % of the examples using no more than 3 features. DWSM mainly uses 1, 2, 3 or 10 features, meaning that it identifies two “levels of difficulty”: some easy examples can be classified using at most 3 features, while for the hard-to-classify examples all the features have to be used. A deeper analysis of this behavior shows that almost all the classification mistakes have been made after looking at all 10 features.
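A histogram of this kind can be computed from the per-example lists of acquired features, as in the sketch below. The function names and the toy data are hypothetical and do not reproduce the values behind Fig. 4.

```python
from collections import Counter


def feature_count_histogram(used_features):
    """Fraction of test examples classified with each number of features."""
    counts = Counter(len(set(u)) for u in used_features)
    total = len(used_features)
    return {k: counts[k] / total for k in sorted(counts)}


def fraction_using_at_most(used_features, k):
    """Fraction of test examples classified with k features or fewer."""
    return sum(1 for u in used_features if len(set(u)) <= k) / len(used_features)


# Toy run on a 10-feature problem: four easy examples and one hard example.
used = [[0], [0, 1], [0, 1, 2], list(range(10)), [3]]
hist = feature_count_histogram(used)
easy = fraction_using_at_most(used, 3)
```

In this toy run, one example out of five requires all 10 features, which mirrors the two levels of difficulty discussed above.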
Note that DWSM is able to attain better accuracy than the LARS for equivalent sparsity levels. This is due to the fact that DWSM is not bound to a strict set of features for classification, whereas the LARS is. Therefore, it can request more features than the LARS has available for a particularly ambiguous datum. This allows DWSM to have more information in difficult regions of the decision space, while maintaining sparsity for a “simple” datum.
6.2 Sparse multiclass classification
Multiclass classification datasets

| Name | Number of features | Number of examples | Number of classes |
|---|---|---|---|
| Segment | 19 | 2,310 | 7 |
| Vehicle | 18 | 846 | 4 |
| Vowel | 10 | 1,000 | 11 |
| Wine | 13 | 178 | 3 |

The sequential model is particularly interesting at high sparsity levels, as it is able to maintain good accuracy even when few features are used. We can see this in Fig. 6 (left), where DWSM maintains ∼90 % accuracy at a sparsity of 0.8, whereas the L _{1}-SVM model has already sharply decreased in performance. Additionally, extending the model to the multiclass case is completely natural and requires nothing more than adding classification actions to the MDP.
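To illustrate how small this change is, the sketch below builds an action set with one acquisition action per feature and one classification action per class. The tuple encoding of actions and the function name are assumptions made for illustration.

```python
def build_actions(n_features, n_classes):
    """Action set of the MDP: acquire a feature, or classify and stop.

    Hypothetical encoding: each action is a (kind, index) pair.
    """
    acquire = [("acquire", i) for i in range(n_features)]
    classify = [("classify", k) for k in range(n_classes)]
    return acquire + classify


# Sizes taken from the dataset tables: Heart (13 features, 2 classes)
# and Wine (13 features, 3 classes).
binary_actions = build_actions(n_features=13, n_classes=2)
multiclass_actions = build_actions(n_features=13, n_classes=3)
```

The two action sets share the same 13 acquisition actions and differ only by the extra classification action.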
6.3 Model extensions
In this section we present a series of results for the model extensions described in Sect. 4. Contrary to what was done for the base model, we did not attempt extensive comparisons with baseline models, but rather wanted to show, using some of the datasets used previously, that the extensions are indeed sound and efficient. The reason for this is twofold. First, there is no baseline able to cope with all the extensions, so a comparison would require a specific algorithm for each extension. Second, for some of the extensions, not all the datasets are pertinent.
6.3.1 Hard budget
6.3.2 Costsensitive feature acquisition
Table 6. Results obtained on the cost-sensitive Pima Diabetes dataset. For reference, results from one of the cost parameterizations of Ji and Carin are presented as well.

| Classifier | Error penalty | Average cost | Accuracy |
|---|---|---|---|
| DWSM | 800 | 181 | 0.75 |
| DWSM | 400 | 74 | 0.76 |
| Ji and Carin | 800 | 180^{a} | 0.75^{a} |
| Ji and Carin | 400 | 75^{a} | 0.75^{a} |

The reference results in Table 6 were extracted from page 18 of the article (Ji and Carin 2007). We can see that our cost-sensitive model obtains, in each of the two cases,^{18} nearly the same average cost as the one obtained by Ji and Carin. Accuracy is also equivalent for both models, showing that DWSM is competitive w.r.t. a cost-sensitive-specific algorithm, while only needing a slight modification to its MDP.
6.3.3 Grouped and relational features
In order to test the ability of our approach to deal with grouped and relational features, we have performed three sets of experiments:
Artificial random groups
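Table 7. Group-feature results for the Gisette and Adult datasets

| Classifier | Dataset | # groups | λ | Sparsity | # of features | Accuracy |
|---|---|---|---|---|---|---|
| DWSM | Gisette | 10 | 0.000 | 0.255 | 7.4 groups | 0.932 |
| DWSM | Gisette | 10 | 0.005 | 0.308 | 6.9 groups | 0.937 |
| DWSM | Gisette | 10 | 0.010 | 0.310 | 6.9 groups | 0.926 |
| DWSM | Gisette | 10 | 0.100 | 0.549 | 4.5 groups | 0.898 |
| LASSO | Gisette | *** | *** | 0.98 | 100 features | 0.962 |
| DWSM | Adult | 14 | 0.000 | 0.41 | 8.75 groups | 0.82 |
| DWSM | Adult | 14 | 0.005 | 0.677 | 4.83 groups | 0.79 |
| DWSM | Adult | 14 | 0.010 | 1.0 | 0 groups | 0.76^{a} |
| LASSO | Adult | *** | *** | 0.32 | 95 features | 0.83 |
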
Grouped features based on the structure of the features
One advantage of the Grouped Features model is that it can handle datasets where features are naturally organized into groups. This is for example the case for the Adult^{20} dataset from UCI, which has 14 attributes, some of which are categorical. These categorical attributes have been expanded to continuous features using discretized quantiles, where each quantile is represented by a binary feature. The set of features that corresponds to the quantiles of a particular categorical attribute naturally forms a group of features. The continuous dataset is composed of 123 real features grouped in 14 groups. We have created a mapping from the expanded dataset back to the original set of features, and run experiments using this feature grouping. We use the LASSO as a baseline. Table 7 presents the results obtained by our model at three different levels of sparsity and shows that our method achieves 79 % accuracy while selecting on average 4.83 of the 14 groups of features. Results for the LASSO are also presented, although the LASSO’s results are not constrained to respect the group structure, and furthermore its sparsity corresponds to the sparsity obtained on the expanded dataset, not the sparsity over the initial dataset.
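The mapping between original attributes and expanded features can be pictured with the following sketch. The toy mapping, the function names and the per-datum group sparsity measure are assumptions made for illustration; the actual mapping used for the Adult dataset is not reproduced here.

```python
def expand_groups(acquired_groups, group_to_features):
    """Expanded feature indices revealed by acquiring the given groups."""
    revealed = set()
    for g in acquired_groups:
        revealed.update(group_to_features[g])
    return sorted(revealed)


def group_sparsity(acquired_groups, n_groups):
    """Proportion of groups not acquired for one datum."""
    return 1 - len(set(acquired_groups)) / n_groups


# Hypothetical toy mapping: 3 original attributes expanded into 7 binary
# features (the dataset in the text has 14 groups and 123 features).
group_to_features = {0: [0, 1, 2], 1: [3, 4], 2: [5, 6]}
revealed = expand_groups([0, 2], group_to_features)
sparsity = group_sparsity([0, 2], n_groups=len(group_to_features))
```

Acquiring a group reveals all of its expanded features at once, so sparsity is counted over groups rather than over expanded features, which is the distinction drawn with the LASSO results above.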
Relational features
Finally, we present an experiment performed on image data which allows us to consider both group and relational constraints. Experiments have been performed on the MNIST (LeCun et al. 1998) dataset with the relational model described in Sect. 4.2.5. We have used the following two constraints: (i) First, we have put in place a group mapping on the pixels that corresponds to a 4×4 grid of blocks. (ii) Secondly, we have forced the model to focus on contiguous blocks of pixels, i.e. the cost of acquiring a block of pixels touching a previously acquired block is lower than the cost of acquiring a block which is further away. We thus make use of spatial relations between groups of pixels. Referring to the relational model in Sect. 4.2.5, the Related(⋅,⋅) function tells us whether two feature blocks are contiguous in the image.
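A possible form of the block mapping and of the Related(⋅,⋅) function is sketched below. Several points are assumptions made for illustration: the 28×28 image is split into 16 blocks of 7×7 pixels, two blocks are taken to be contiguous when they touch at an edge or a corner, and the two cost values are arbitrary.

```python
GRID = 4                          # 4x4 grid of blocks, as in the text
IMAGE_SIDE = 28                   # MNIST images are 28x28 pixels
BLOCK_SIDE = IMAGE_SIDE // GRID   # 7 pixels per block side (assumption)


def block_of_pixel(row, col):
    """Index (0..15) of the block containing the pixel at (row, col)."""
    return (row // BLOCK_SIDE) * GRID + (col // BLOCK_SIDE)


def related(block_a, block_b):
    """True when two distinct blocks touch at an edge or a corner.

    The choice of neighborhood is an assumption; the text only says that
    the blocks must be contiguous.
    """
    ra, ca = divmod(block_a, GRID)
    rb, cb = divmod(block_b, GRID)
    return block_a != block_b and max(abs(ra - rb), abs(ca - cb)) == 1


def acquisition_cost(block, acquired, near_cost=1.0, far_cost=2.0):
    """Lower cost for a block touching an already acquired block.

    Cost values are hypothetical. With no acquired block yet, far_cost applies.
    """
    return near_cost if any(related(block, b) for b in acquired) else far_cost
```

For instance, `acquisition_cost(5, {0})` yields the lower cost because blocks 0 and 5 touch at a corner, while `acquisition_cost(15, {0})` yields the higher one.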
7 Related work
We begin by providing an overview of feature selection techniques that are close to the datum-wise sparse model presented in Sect. 2. Then, for each of the proposed extensions, we describe the works that address similar problems. Note that none of the following citations corresponds to a model that can deal with all these classification problems in a unified way.
Features selection and sparsity
Datum-Wise Feature Selection positions itself in the field of feature selection, a field that has seen a good share of approaches (Guyon and Elisseeff 2003). Our approach positions itself between two main veins in feature selection: embedded approaches and wrapper approaches.
Embedded approaches include feature selection as part of the learning machine. These include algorithms solving the LASSO problem (Tibshirani 1994), and other linear models involving a regularizer with a sparsity-inducing norm (L _{p} norms with p∈[0;1], such as Elastic Net, Zou and Hastie 2005, and group LASSO, Yuan and Lin 2006). These methods are very different from our proposed method, and rely on the direct minimization of a regularized empirical risk. These approaches are very effective at these specific tasks, but are difficult to extend to more complex risks that are neither continuous nor differentiable. Nevertheless, some interesting work has been done along the lines of finding surrogate losses for more structured forms of sparsity (Jenatton et al. 2011). We believe our method is more naturally expressive for these types of problems, as its optimization criterion is not subject to any constraints on continuity or differentiability.
Wrapper approaches aim at searching the feature space for an optimal subset of features that maximizes the classifier’s performance. Searching the entire feature space quickly becomes intractable, and therefore various recent approaches have been proposed to restrict the search using genetic programming (Girgin and Preux 2008) or UCT-based algorithms (Gaudel and Sebag 2010). We were encouraged by these works, as the use of a learning approach to direct the search for feature subsets in the feature graph is very similar in spirit to our approach. We differentiate our approach from these through its datum-wise nature, which is not considered by either of the aforementioned articles.
Regarding the datum-wise nature of our algorithm, the classical model that shares some similarities in terms of inference process is the Decision Tree (Quinlan 1993). During inference with a Decision Tree, feature usage is in effect datum-dependent. In contrast to our method, Decision Trees are highly nonlinear and, as far as we know, have never been studied in terms of sparsity. Moreover, the learning algorithm is very different from the one proposed in this paper, and Decision Trees are not easily generalizable to the more complex problems described in Sect. 4. Nevertheless, Decision Trees prove to perform very well in situations where strong sparsity is imposed.
Cost sensitive classification
The particular extensions presented in Sect. 4 have been inspired by various recent works. Cost-sensitive classification problems have been studied by Turney (1995) and Greiner (2002). The model proposed by Turney (1995) is an extension of decision trees to cost-sensitive problems, using a certain heuristic to discourage the use of costly features. Greiner (2002) models this task as a sequential problem. However, the formalism used by Greiner is different from the one we propose, and restricted to cost-sensitive problems.
Hard budget classification
Hard Budget classification has been considered before by Kapoor and Greiner (2005), in the context of Active Learning. Modeling the problem as an MDP is suggested in that article, but is not performed. Hard Budget classification is primarily motivated by its more fine-grained ability to tune the sparsity, as well as the inherent speedups it provides in the complexity of the learning phase.
Grouped features
Many datasets inherently provide some form of group or relational structure, and grouped features have been recently proposed as an extension to the LASSO problem called Group-LASSO (Yuan and Lin 2006). Relational features have also been studied in different papers about structured sparsity (Huang et al. 2009; Jenatton et al. 2011), which also base themselves on LASSO-derived resolution algorithms. Additionally, these papers consider global sparsity and are not datum-wise. They are based on a continuous convex formulation of the sparse L _{1} regularized loss and are thus very different from our approach. DWSM provides a much richer expressivity relative to these methods, at the cost of a more complex resolution algorithm.
As far as we can tell, the different approaches to our extensions have not previously been brought together under one framework. Additionally, many more extensions can be imagined, with the ability to adapt to the fine-grained constraints of real-world problems.
Classification as a sequential problem
Finally, the idea of using sequential models for classical machine learning tasks has recently seen a surge of interest. For example, sequential models have been proposed for structured classification (Daumé and Marcu 2005; Maes et al. 2009). These methods leverage Reinforcement Learning approaches to solve more ‘traditional’ Structured Prediction tasks. Although they are specialized in the prediction of structured data, and do not concentrate on aspects of sparsity or feature selection, the general idea of applying RL to ML tasks is in the same vein of work as DWSM.
The authors have previously presented an original sequential model for text classification (Dulac-Arnold et al. 2011), and there has been similar work using Reinforcement Learning techniques for self-terminating anytime classification (Póczos et al. 2009). These approaches can be considered as more constrained versions of the problem proposed in this paper, since the only criterion being learned is when to stop asking for more information, but not what information to ask for. Nevertheless, these approaches provide the base intuition for datum-wise approaches.
The most similar Reinforcement Learning works are the paper by Ji and Carin (2007) and the (still unpublished) paper by Rückstieß et al. (2011), which propose MDP models for cost-sensitive classification. Both of these papers have formalizations that are similar to ours, yet concentrate on cost-sensitive problems. We compare ourselves to experiments performed by Ji and Carin in Sect. 6.3.2.
8 Conclusion
In this article we have introduced the concept of datum-wise classification, where we simultaneously learn a classifier and a sparse representation of the data that adapts to each new datum being classified. To solve the combinatorial feature selection problem, we have proposed a sequential approach in which the selection and classification problem is modeled as an MDP. Learning this MDP is performed using an algorithm inspired by Reinforcement Learning. Solving the MDP is shown to be equivalent to minimizing an L _{0} datum-wise regularized loss for the classification problem.
This base model has then been extended to different families of feature selection problems: cost-sensitive, grouped features, hard budget and structured features. The proposed formalism can be easily adapted to any of these problems and thus provides a fairly general framework for datum-wise sparsity.
Experimental results on 12 datasets have shown that the base model is indeed able to learn data-dependent sparse classifiers while maintaining a good classification accuracy. The potential of the 4 extensions to the base model has been demonstrated on different datasets. All of them solve a specific sparsity problem while requiring only slight changes to the initial model. We believe that this model might be easily adapted to other complex classification problems while requiring only slight changes to the MDP.
For inference, the model complexity is similar to that of a classical (non datum-dependent) sparse classifier. Training the MDP, however, remains more costly than for global classification approaches. A couple of directions for future work are being considered. The first consists in using more efficient RL-inspired algorithms such as Fitted Q-Learning, which could greatly reduce the time spent during training. Another possible extension is to remove, during learning, features that are generally judged irrelevant by the system, i.e. features that are never or rarely used for classifying data. In that case, the system only keeps in memory a subset of the possible features and thus reduces the dimensionality of the training space. Finally, a more prospective research direction is to consider a sequential process that is also able to create new features by combining existing ones, opening the way to feature construction.
Acknowledgements
This work was partially supported by the French National Agency of Research (Lampada ANR09EMER007). The authors gratefully acknowledge the many discussions with Dr. Francis Maes.