Let \((\mathbf{x},y)\) be distributed according to an unknown distribution. A data point has K features, \(\mathbf{x}=\{x_1,x_2,\ldots,x_K\}\), and belongs to one of C classes indicated by its label y. The kth feature is extracted from a measurement acquired at the kth stage, and \(x_k\) is allowed to be a vector. We define the truncated feature vector at the kth stage: \(\mathbf{x}^k=\{x_1,x_2,\ldots,x_k\}\), and denote the space of the first k features by \(\mathcal{X}^k\), so that \(\mathbf{x}^k\in\mathcal{X}^k\).
The system has K stages, the order of the stages is fixed, and the kth stage acquires the kth measurement. At each stage k, there is a decision function with a reject option, \(f_k\). It can either classify an example, \(f_k(\mathbf{x}^k)=\hat{y}\in\{1,\ldots,C\}\), or delay the decision until the next stage, \(f_k(\mathbf{x}^k)=r\), and incur a penalty of \(\delta^{k+1}\). Here, r indicates the "reject" decision. \(f_k\) has to make a decision using only the first k sensing modalities. The last stage K is terminal: it is a standard classifier that cannot delay the decision any further. Define the system risk to be,
Here, \(R_k\) is the cost of classifying at the kth stage, and \(S^k(\mathbf{x}^k)\in\{0,1\}\) is the binary state variable indicating whether \(\mathbf{x}\) has been rejected up to the kth stage. If \(\mathbf{x}\) is active and is misclassified, the penalty is 1. If it is rejected, then the system incurs a penalty of \(\delta^{k+1}\), and the state variable for that example remains at 1.
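To make the cost structure concrete, the following minimal sketch (an illustration, not the paper's code) traces how a single example accumulates cost as it moves through the stages: every reject adds the delay penalty \(\delta^{k+1}\), and the stage that finally classifies the example adds a unit penalty if it is wrong. The function names and signatures below are assumptions made for the illustration.

```python
# Minimal sketch of the per-example cost in the K-stage system described above.
# classifiers[k] returns a class label or "r"; delta[k+1] is the penalty for
# delaying the decision from stage k to stage k+1 (delta[0] is unused).

def example_cost(x_stages, y, classifiers, delta):
    """x_stages[k] is the truncated feature vector x^k available at stage k."""
    cost = 0.0
    for k, f_k in enumerate(classifiers):
        decision = f_k(x_stages[k])
        if decision == "r" and k < len(classifiers) - 1:
            cost += delta[k + 1]       # reject: pay the delay penalty, stay active
            continue
        return cost + (0.0 if decision == y else 1.0)  # classify: pay 1 if wrong
    return cost
```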
Bayesian setting
In this section, we will digress from the discriminative setting and analyze the problem under the assumption that the underlying distribution is known. In doing so, we hope to discover some fundamental structure that will simplify our empirical risk formulation in the next section.
If the distribution is known, the problem reduces to a POMDP (partially observable Markov decision process), and the optimal strategy is to minimize the expected risk,
If we allow arbitrary decision functions, then we can equivalently minimize the conditional risk,
By appealing to dynamic programming, this problem remarkably reduces to a single stage optimization problem for a modified risk function. To see this, we denote the cost-to-go,
and the modified risk functional,
and prove the following theorem,
Theorem 1
The optimal solution \(f_1,f_2,\ldots,f_K\) to the multi-stage risk in Eq. (4) decomposes into a single stage optimization,
and the solution is:
Proof
To simplify our derivations, we assume a uniform class prior probability: \(\mathrm{P}_{y} [ y=\hat{y} ]=\frac{1}{C},\ \hat{y}=1, \ldots, C\). However, our results can be easily modified to account for a non-uniform prior. The expected conditional risk can be minimized optimally by a dynamic program, whose recursion is,
Consider the kth stage minimization: \(f_k\) can take C+1 possible values, \(\{1,2,\ldots,C,r\}\), and \(J_k(\mathbf{x}^k,S^k)\) can be recast as a conditional expected risk minimization,
Define,
$$\tilde{\delta}^k\bigl(\mathbf{x}^k\bigr)=\delta^{k+1} + \mathbf{E}_{\mathbf{x}^{k+1} \ldots \mathbf{x}^K} \bigl[ J_{k+1}\bigl(\mathbf{x}^{k+1},S^{k+1}=1 \bigr) \bigm| \mathbf{x}^k \bigr] $$
and rewrite the conditional risk in (9),
Reject is the optimal decision if,
If reject is not the optimal strategy then a class is chosen to maximize the posterior probability:
which is exactly our claim. □
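The displayed inequalities above are not reproduced here; written out, the stage-k rule that the proof arrives at takes the following form (a reconstruction consistent with the uniform-prior assumption and with the classical reject rule analyzed in Chow 1970):

$$f_k\bigl(\mathbf{x}^k\bigr)= \begin{cases} r, & \text{if } \tilde{\delta}^k\bigl(\mathbf{x}^k\bigr) \le 1-\max_{\hat{y}} \mathrm{P}\bigl[y=\hat{y} \mid \mathbf{x}^k\bigr],\\ \arg\max_{\hat{y}} \mathrm{P}\bigl[y=\hat{y} \mid \mathbf{x}^k\bigr], & \text{otherwise.} \end{cases} $$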
The main implication of this result is that if the cost-to-go function \(\tilde{\delta}^{k}(\mathbf{x}^{k})\) is known, then the risk \(\tilde{R}_{k}(\cdot)\) is only a function of the current stage decision \(f_k\). Therefore, we can ignore all of the other stages and minimize a single stage risk. Effectively, we have decomposed the multi-stage problem in Eq. (4) into a stage-wise optimization in Eq. (5).
Note that the modified risk functional, \(\tilde{R}_{k}\), is remarkably similar to \(R_k\), except that the modified reject cost \(\tilde{\delta}^{k}(\mathbf{x}^{k})\) replaces the constant stage cost \(\delta^{k}\). Also, consider the range for which \(\tilde{\delta}^{k}(\mathbf{x}^{k})\) is meaningful. If we have C classes, then a random guessing strategy would incur an average risk of \(1-\frac{1}{C}\). Therefore, the reject cost must satisfy \(\tilde{\delta}^{k}(\mathbf{x}^{k}) \leq 1-\frac{1}{C}\) for rejection to be a meaningful option. The work in Chow (1970) contains a detailed analysis of the single stage reject classifier in a Bayesian setting.
In the analysis of the POMDP, we allowed multiple classes because it is a natural extension of the binary case. However, each stage still has C+1 decisions, and it is unclear how to parameterize such a multi-class classifier with a reject option in an empirical setting. Parameterizing regular multi-class learning is a difficult problem in itself, and most existing techniques (Allwein et al. 2001) reduce the problem to a series of binary learning methods. In our setting, the reject option cannot be treated as an additional class since there are no ground-truth labels for which examples should be rejected. So, in forming the empirical risk problem, we restrict ourselves to the binary setting, since it allows for an intuitive parametrization of a reject option, which we describe in the next section. We leave the multi-class setting as a subject of future research.
Reject classifier as two binary decisions
Consider a stage k classifier with a reject option from Theorem 1 in a binary classification setting, y∈{−1,+1}.
It is clear from the expression that we can express the decision regions in terms of two binary classifiers, \(f_n\) and \(f_p\). Observe that, for a given reject cost \(\tilde{\delta}^{k}(\mathbf{x}^{k})\), the reject region is an intersection of two binary decision regions. To this end, we further modify the risk function in terms of the agreement and disagreement regions of the two classifiers, \(f_n,f_p\), namely,
Note that the above loss function is symmetric in \(f_n\) and \(f_p\), so the two classifiers in any optimal solution can be interchanged. Nevertheless, we claim:
Theorem 2
Suppose \(f_n\) and \(f_p\) are two binary classifiers that minimize \(\mathbf{E}[ L_{k}(\mathbf{x}^{k},y,f_{n},f_{p},\tilde{\delta}^{k}) \mid \mathbf{x}^{k} ]\) over all binary classifiers \(f_n\) and \(f_p\). Then the resulting reject classifier:
is the minimizer of \(\mathbf{E}[ \tilde{R}_{k}(\mathbf{x}^{k},y,f,\tilde{\delta}^{k}) \mid \mathbf{x}^{k} ]\) in Theorem 1 and the kth stage minimizer in Eq. (3).
Proof
For a given \(\mathbf{x}^k\) and \(\tilde{\delta}^k(\mathbf{x}^{k})\),
By inspection, the decomposition in (15) is the optimal Bayesian classifier minimizing \(\mathbf{E}_{y} [ \tilde{R}_{k}(\mathbf{x}^{k},y,f,\tilde{\delta}^{k}) \mid \mathbf{x}^{k} ]\). □
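As a concrete illustration of the composition in Theorem 2, the following Python sketch assumes Eq. (15) takes the natural form implied by the agreement/disagreement description: commit to the common label when \(f_p\) and \(f_n\) agree, and reject when they disagree. The helper names and the two threshold rules are illustrative assumptions.

```python
# Compose two binary classifiers (each returning +1/-1) into a stage
# classifier with a reject option: agreement -> classify, disagreement -> reject.
REJECT = "r"

def reject_classifier(f_p, f_n):
    def f(x):
        yp, yn = f_p(x), f_n(x)
        if yp == yn:          # agreement region: commit to the common label
            return yp
        return REJECT         # disagreement region: defer to the next stage
    return f

# Usage with two hypothetical threshold rules on a scalar feature:
f_p = lambda x: +1 if x > 0.3 else -1   # biased toward predicting +1
f_n = lambda x: +1 if x > 0.7 else -1   # biased toward predicting -1
f = reject_classifier(f_p, f_n)
print([f(x) for x in (0.1, 0.5, 0.9)])  # [-1, 'r', 1]
```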
We refer to Fig. 4 for an illustration. We can express the new loss compactly as follows:
Note that in arriving at this expression we have used:
for binary variables a,b,c.
In summary, in this section we derived the optimal POMDP solution and decoupled the multi-stage risk into single stage optimizations. Then, for the binary classification setting, we derived an optimal representation of the reject region classifier in terms of two biased binary decisions:
$$\min_{f^k} \mathbf{E}\bigl[ R\bigl(\mathbf{x},y, \ldots, f^k, \ldots\bigr) \bigr] \rightarrow\min_{f^k} \mathbf{E}\bigl[ \tilde{R}_k \bigl(\mathbf{x}^k,y, f^k,\tilde{\delta}^k\bigr) \bigr] \rightarrow \min_{f_p^k,f_n^k} \mathbf{E}\bigl[ L_k\bigl( \mathbf{x}^k,y, f_p^k,f_n^k, \tilde{\delta}^k\bigr) \bigr] $$
Stage-wise empirical minimization
In this section, we assume that the probability model is no longer known and cannot be estimated due to the high dimensionality of the data. Instead, our task is to find multi-stage decision rules based on a given training set: \((\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\ldots,(\mathbf{x}_N,y_N)\). Here, we consider the binary classification setting: \(y_i\in\{+1,-1\}\).
We will take advantage of the stage-wise decomposition of the POMDP solution in Theorem 1 and the parametrization of the reject region in Theorem 2 to formulate an empirical version of the stage risk \(L_k(\cdot)\) in Eq. (16). However, this requires knowledge of the cost-to-go, \(\tilde{\delta}^{k}(\mathbf{x}^{k})\). Instead of trying to learn this complex function, we will define a point-wise empirical estimate of the cost-to-go on the training data:
$$\tilde{\delta}^k\bigl(\mathbf{x}^k_i\bigr) \to\tilde{\delta}^k_i,\quad i=1,2, \ldots, N $$
and use it to learn the decision boundaries directly.
Note that, by definition, \(\tilde{\delta}^{k}(\mathbf{x}^{k}_{i})\) is only a function of \(f_{k+1},\ldots,f_{K}\). So the cost-to-go estimate is conveniently defined by the recursion,
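The recursion itself is a displayed expression not reproduced above. As a rough illustration of its structure, the following Python sketch computes point-wise cost-to-go estimates for one training example by a backward pass, charging the delay penalty plus whatever the already-fixed later stages would charge that example. This is an assumed form consistent with the definition of \(\tilde{\delta}^k(\mathbf{x}^k)\), not the paper's exact expression.

```python
# Backward-pass sketch of point-wise cost-to-go estimates for one example.
# classifiers[k](x) returns +1, -1, or "r"; delta[k+1] is the constant penalty
# for delaying from stage k to k+1 (delta[0] unused); stage K-1 never rejects.

def estimate_cost_to_go(x, y, classifiers, delta, K):
    """Return d with d[k] approximating tilde_delta^k_i for example (x, y)."""
    d = [0.0] * K
    # Cost charged by the terminal stage: a unit misclassification penalty.
    downstream = 1.0 if classifiers[K - 1](x) != y else 0.0
    for k in range(K - 2, -1, -1):            # walk backwards over the stages
        decision = classifiers[k + 1](x)      # what the next stage would do
        if decision == "r":                   # next stage defers again:
            cost_next = downstream            #   inherit its own cost-to-go
        else:                                 # next stage commits to a label:
            cost_next = 1.0 if decision != y else 0.0
        d[k] = delta[k + 1] + cost_next       # delay penalty + downstream cost
        downstream = d[k]
    return d
```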
Now, we can form the empirical version of the risk in Eq. (5) and optimize for a solution at stage k over some family of functions. Observe that, as in the standard setting, we need to constrain the class of decision rules here. This is because, with no constraints, the minimum empirical risk is equal to zero and can be achieved in the first stage itself.
Note that our stage-wise decomposition significantly simplifies the ERM. The objective in Eq. (18) is only a function of \(f_{p}^{k},f_{n}^{k}\) given \(\tilde{\delta}^{k}_{i}\) and the state \(S^{k}_{i}\). Minimizing an empirical version of the multi-stage risk in Eq. (3) directly would be much more difficult due to stage interdependencies.
Given \(\tilde{\delta}^{k}_{i}\) and all the stages but the kth, we can solve (18) by iterating between \(f^{k}_{p}\) and \(f^{k}_{n}\). To solve for \(f^{k}_{p}\), we fix \(f^{k}_{n}\) and minimize a weighted error,
We can solve for \(f^{k}_{n}\) in the same fashion by fixing \(f^{k}_{p}\),
To derive these expressions from (18), we used another identity for any binary variables a,b,c
Note the advantage of our parametrization from Theorem 2. We converted the problem from learning a complicated three-region decision to learning two binary classifiers, \((f_p,f_n)\), where learning each of the binary classifiers reduces to solving a weighted binary classification problem. This is desirable since binary classification is a very well studied problem, and existing machine learning techniques can be utilized here, as we will demonstrate in the next section.
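A structural sketch of this alternating scheme is given below in Python. The weight computation is deliberately left as a placeholder for the weighted-error expressions displayed above, which are not reproduced here; the learner and function names are illustrative assumptions, and the point is only that each half-step is an off-the-shelf weighted binary classification.

```python
# Structural sketch (not the paper's implementation) of the alternating
# minimization at stage k. Any learner accepting per-example weights works.
import numpy as np
from sklearn.svm import LinearSVC  # any weighted binary learner would do

def stage_weights(y, fixed_pred, delta_tilde, active):
    """Illustrative placeholder: weight only the still-active examples uniformly.
    In the actual algorithm, the weights depend on the fixed classifier's
    predictions and on the cost-to-go estimates delta_tilde_i."""
    return np.asarray(active, dtype=float)

def fit_stage(Xk, y, delta_tilde, active, n_iters=10):
    """Alternate between f_p^k and f_n^k on the stage-k feature matrix Xk."""
    f_p, f_n = LinearSVC(), LinearSVC()
    f_n.fit(Xk, y)                                       # crude initialization
    for _ in range(n_iters):
        w = stage_weights(y, f_n.predict(Xk), delta_tilde, active)
        f_p.fit(Xk, y, sample_weight=w)                  # solve for f_p, f_n fixed
        w = stage_weights(y, f_p.predict(Xk), delta_tilde, active)
        f_n.fit(Xk, y, sample_weight=w)                  # solve for f_n, f_p fixed
    return f_p, f_n
```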