Classification Under Partial Reject Options

We study set-valued classification for a Bayesian model where data originates from one of a finite number $N$ of possible hypotheses. Thus we consider the scenario where the size of the classified set of categories ranges from 0 to $N$. An empty set corresponds to an outlier, size 1 represents a firm decision that singles out one hypothesis, size $N$ corresponds to a rejection to classify, whereas sizes $2,\ldots,N-1$ represent a partial rejection, where some hypotheses are excluded from further analysis. We introduce a general framework of reward functions with a set-valued argument and derive the corresponding optimal Bayes classifiers, both for a homogeneous block of hypotheses and for the case where hypotheses are partitioned into blocks, such that ambiguity within and between blocks is of different severity. We illustrate classification using an ornithological dataset, with taxa partitioned into blocks and parameters estimated using MCMC. The associated reward function's tuning parameters are chosen through cross-validation.


Introduction
Classification of observations among a finite number $N$ of hypotheses or categories is a well studied problem in statistics, and also a central concept of modern machine learning (Bishop, 2006; Hastie et al., 2009). With a Bayesian approach the maximum a posteriori classifier maximizes the probability of a correct classification (Berger, 2013; Bishop, 2006). For some problems, when there is much ambiguity about which category fits the data best, it is possible to add a reject option, where no classification is made (Chow, 1970; Ripley, 2007; Freund et al., 2004; Herbei and Wegkamp, 2006), and if such a reject option is followed by additional data collection, this leads to Bayesian sequential analysis (Arrow et al., 1949). These types of classifiers were generalized in Karlsson and Hössjer (2021), and formulated in the context of reward functions with a set-valued argument. Each such reward function leads to a set-valued classifier, where the size of the classified set of categories is 0 if the classifier rejects all hypotheses, 1 for a firm decision that singles out one category, strictly between 1 and $N$ for a partial reject option, where ambiguity remains between some but not all categories, and $N$ for a rejection to classify, i.e. when none of the categories are singled out.
In this paper we analyze reward functions with a set-valued argument in more detail, and in particular we derive the associated optimal Bayes classifiers that maximize expected rewards. We start by considering reward functions for a homogeneous collection of categories of the same type, and find explicit formulas for the Bayes classifier in terms of the ordered a posteriori probabilities of all hypotheses and a penalty term for the size of the classified set. It is shown that conformal prediction (Vovk et al., 2005; Shafer and Vovk, 2008) with posterior probabilities as nonconformity measure is a special case of our approach. Then we consider scenarios where the categories are divided into a number of blocks, such that categories (typically) are more similar within than between blocks. The associated Bayes classifiers involve the ordered a posteriori probabilities within each block and penalty terms $a$ and $b$ for adding classified categories from the correct and wrong blocks respectively. In particular, we demonstrate that classification with indifference zones (Bechhofer, 1954; Goldsman, 1986) can be represented as an instance of set-valued classification with two blocks of categories.
Our framework of set-valued classification, with the set of categories partitioned into blocks, is applied to an ornithological data set. Each observation in the data consists of three observed traits for a bird, together with the taxon to which the bird belongs. Based on these data we want to train a classifier that assigns future birds to taxa. The taxa are partitioned into blocks with regard to cross-taxon similarities. The parameters of the underlying statistical model are estimated through a Markov Chain Monte Carlo procedure, and the tuning parameters of the set-valued reward function are estimated through cross-validation.
The article is organized as follows: In Section 2 we introduce the statistical model with $N$ hypotheses, and define the optimal (Bayes) set-valued classifier. Then we introduce a large class of reward functions for models with one block of categories (Section 3) and several blocks of categories (Section 4) respectively, and give explicit expressions for the corresponding Bayes classifiers. The ornithological dataset is analyzed in Section 5 and a discussion in Section 6 concludes the paper.

Statistical model and optimal classifiers
Consider a random variable $Z \in \mathcal{Z}$, whose distribution follows one of $N$ possible hypotheses (or categories)

$$H_i : Z \sim f_i, \quad i = 1, \ldots, N,$$

where $f_i$ is the density or probability function of $Z$ under $H_i$. We will assume a Bayesian framework, and thus the true but unknown hypothesis $I \in \mathcal{N} = \{1, \ldots, N\}$ is a random variable. It is given a categorical prior distribution with parameter vector $\pi = (\pi_1, \ldots, \pi_N)$, where $\pi_i = P(I = i)$ for $i = 1, \ldots, N$, and the corresponding parameter vector of the categorical posterior distribution, given an observed value $z$ of $Z$, has components

$$p_i = p_i(z) = P(I = i \mid Z = z) = \frac{\pi_i f_i(z)}{\sum_{j=1}^N \pi_j f_j(z)}, \quad i = 1, \ldots, N. \tag{2}$$

Our objective is to classify $z$. To this end, a classifier $\hat{I} = \hat{I}(z) \subset \mathcal{N}$ with partial reject options is defined as a subset of categories. When $\hat{I} = \{i\}$, a firm decision of category $i$ is made, whereas $\hat{I} = \mathcal{N}$ corresponds to a reject option, where no decision is made about which categories conform the most with $z$. Note that it is the action "to classify" that is rejected when $\hat{I} = \mathcal{N}$. The intermediate case $2 \le |\hat{I}| = m \le N - 1$ corresponds to a partial reject option, where the classifier excludes $N - m$ hypotheses but rejects discrimination among the remaining $m = m(z)$ categories. Another possibility, $\hat{I} = \emptyset$, corresponds to a scenario where none of the $N$ categories fit the observed data $z$ well, i.e. we exclude all hypotheses. This can be regarded as a safeguard against outliers or otherwise faulty data (Ripley, 2007; Karlsson and Hössjer, 2021).
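Let

$$p_{(1)} \le \ldots \le p_{(N)}$$

refer to the ordered posterior category probabilities, so that $(N)$ is the label of the category with the largest posterior probability. The well-known Maximum A Posteriori (MAP) classifier (Berger, 2013) always makes a firm decision, so that

$$\hat{I} = \{(N)\}. \tag{4}$$

It also maximizes the probability $P(\hat{I} = \{I\})$ of a correct classification, but has a higher probability of misclassification than correct classification when $p_{(N)} < 1/2$. The MAP classifier can be formulated as $\hat{I}$ being the classifier that maximizes the expected value

$$E[R(\hat{I}, I)] \tag{6}$$

of the reward function

$$R(\mathcal{I}, i) = 1(\mathcal{I} = \{i\}) \tag{7}$$

assigned to a classified subset $\mathcal{I} \subset \mathcal{N}$ of categories when the true category is $i$. Since $\hat{I} = \hat{I}(Z)$, the expectation in (6) is with respect to $Z$ and $I$. Let the value function

$$V(z; \mathcal{I}) = E[R(\mathcal{I}, I) \mid Z = z] = \sum_{i=1}^N R(\mathcal{I}, i)\, p_i(z)$$

refer to the conditional expected reward given $Z = z$. It is clear that the optimal Bayes classifier, which maximizes the expected reward, is obtained by maximizing the value function for each $z \in \mathcal{Z}$, i.e.

$$\hat{I}(z) = \arg\max_{\mathcal{I} \subset \mathcal{N}} V(z; \mathcal{I}). \tag{8}$$

Since the reward function (7) only takes non-zero values for singleton choices of $\mathcal{I} \subset \mathcal{N}$, we only need to consider $\mathcal{I} = \{i\}$ for $i = 1, \ldots, N$ in order to find $\hat{I}$. For each choice $\mathcal{I} = \{i\}$, $V(z; \{i\}) = p_i$. Thus we choose $i = (N)$, since $p_{(N)}$ is, for each fixed $z$, the largest value that $V(z; \mathcal{I})$ can return. This shows that the MAP classifier (4) is the optimal classifier under (7).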
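The computation of posterior probabilities and the MAP rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the numerical inputs are hypothetical, categories are indexed from 0 rather than 1, and the value function is taken to be the posterior-weighted sum of rewards.

```python
def posterior(prior, likelihoods):
    """Posterior probabilities p_i proportional to pi_i * f_i(z)."""
    joint = [pi * f for pi, f in zip(prior, likelihoods)]
    total = sum(joint)
    if total <= 0:
        raise ValueError("all weighted likelihoods are zero")
    return [w / total for w in joint]


def map_classifier(p):
    """Singleton set with the (0-based) label of the largest posterior."""
    return {max(range(len(p)), key=lambda i: p[i])}


def value(p, subset, reward):
    """Conditional expected reward: sum over i of reward(subset, i) * p_i."""
    return sum(reward(subset, i) * p_i for i, p_i in enumerate(p))


def singleton_reward(subset, i):
    """Reward 1 only when the classified set is exactly {i}."""
    return 1.0 if subset == {i} else 0.0


# Hypothetical prior and likelihood values f_i(z) for three categories.
p = posterior([0.5, 0.3, 0.2], [0.1, 0.4, 0.2])
print(p, map_classifier(p))
```

Under the singleton reward, the value of a set {i} is simply its posterior probability, which is why the category with the largest posterior is selected.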
In this paper we will consider optimal classifiers (8) for reward functions other than (7); those that allow not only for single classified categories, but also for outliers and (partial) reject options.
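Partial rejection, one block of categories
When $\mathcal{N}$ represents one homogeneous block of categories, the following class of reward functions is natural to use:
Definition 1. Given a set of categories $\mathcal{N} = \{1, \ldots, N\}$, with $i \in \mathcal{N}$ the true category of an observation and $\mathcal{I} \subset \mathcal{N}$ the classified subset of categories, a reward function $R$ that satisfies

$$R(\tau(\mathcal{I}), \tau(i)) = R(\mathcal{I}, i) \tag{9}$$

for all permutations $\tau : \mathcal{N} \to \mathcal{N}$ of the labels, where $\tau(\mathcal{I}) = \{\tau(j) : j \in \mathcal{I}\}$, is called an invariant reward function.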
It is easily seen that

$$R(\mathcal{I}, i) = 1(i \in \mathcal{I}) - g(|\mathcal{I}|) \tag{10}$$

is an invariant reward function. The first term of (10) corresponds to a reward of 1 for a classified set $\mathcal{I}$ that includes the true category $i$, whereas the second term $g(|\mathcal{I}|) \ge 0$ is a penalty based on the size $|\mathcal{I}|$ of the classified set. Of particular interest are the following classifiers:
Definition 2. A classifier

$$\hat{I}_m = \{(N + 1 - m), \ldots, (N)\} \tag{11}$$

containing the $m \in \{0, \ldots, N\}$ categories with the largest posterior probabilities, where $\hat{I}_0 = \emptyset$, is called an $m$ most probable classifier, abbreviated as an m-MP classifier.
The MAP classifier is thus a special case of the more general class of m-MP classifiers: the 1-MP classifier is the MAP classifier, whereas the 0-MP classifier corresponds to the empty set, since we then choose none of the categories. We will now link invariant reward functions of type (10) to m-MP classifiers:
Proposition 1. The optimal classifier for an invariant reward function of type (10) is an m-MP classifier $\hat{I}(z) = \hat{I}_{m(z)}(z)$ with

$$m(z) = \arg\max_{0 \le m \le N} v(m; z), \tag{12}$$

where $v(0; z) = 0$ and

$$v(m; z) = \sum_{j=1}^{m} p_{(N+1-j)} - g(m), \quad m = 1, \ldots, N. \tag{13}$$

Proof. Recall that the optimal classifier $\hat{I}(z)$ maximizes, for each $z \in \mathcal{Z}$, the value function $V(z; \mathcal{I})$ among all nonempty subsets $\mathcal{I}$ of $\mathcal{N}$. For the reward function of equation (10) we have that

$$V(z; \mathcal{I}) = \sum_{i \in \mathcal{I}} p_i - g(|\mathcal{I}|) \le \sum_{j=1}^{|\mathcal{I}|} p_{(N+1-j)} - g(|\mathcal{I}|) = v(|\mathcal{I}|; z). \tag{14}$$

Note that the inequality occurs since we go from considering a subset $\mathcal{I} \subseteq \mathcal{N}$ of categories to considering another subset with equally many, but the most probable, categories. Consequently, among all $\mathcal{I} \subset \mathcal{N}$ of size $|\mathcal{I}| = m$, the value function $V(z; \mathcal{I})$ is maximized by $\hat{I}_m$ in (11), for each $m \in \{0, \ldots, N\}$. Among these subsets, the optimal classifier is $\hat{I} = \hat{I}_{m(z)}$, where $m(z)$ is the value of $|\mathcal{I}|$ that maximizes the right hand side of (14). Since this value of $m(z)$ is identical to the one in (12), this finishes the proof.
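The search over set sizes in Proposition 1 can be illustrated with a short Python sketch. It assumes that $v(m; z)$ equals the sum of the $m$ largest posterior probabilities minus $g(m)$, with $v(0; z) = 0$; the function name `m_mp_classifier`, the 0-based labels and the example penalty are hypothetical choices made for illustration.

```python
def m_mp_classifier(p, g):
    """Return the m most probable categories, with m maximising v(m; z).

    p : list of posterior probabilities (0-based labels)
    g : penalty function of the set size m = 1, ..., N
    """
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    best_m, best_v = 0, 0.0          # v(0; z) = 0 for the empty set
    cumulative = 0.0
    for m, label in enumerate(order, start=1):
        cumulative += p[label]       # sum of the m largest posteriors
        v = cumulative - g(m)
        if v > best_v:
            best_m, best_v = m, v
    return set(order[:best_m]), best_v


# Hypothetical example: four categories and a linear penalty 0.2 * m.
posteriors = [0.05, 0.5, 0.15, 0.3]
classified, v_max = m_mp_classifier(posteriors, lambda m: 0.2 * m)
print(classified, v_max)
```

Only $N + 1$ candidate sets need to be evaluated, instead of all $2^N$ subsets, because for a fixed size the most probable categories always give the largest value.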
Example 1 (Classification with reject options.) Ripley (2007) introduced a reward function with a reject option. In our notation it corresponds to

$$R(\mathcal{I}, i) = \begin{cases} 1, & \mathcal{I} = \{i\}, \\ r, & \mathcal{I} = \mathcal{N}, \\ 0, & \text{otherwise}, \end{cases} \tag{15}$$

with a reward of $1/N < r < 1$ assigned to the reject option. The special case of (15) for $N = 2$ categories was treated by Chow (1970) and Herbei and Wegkamp (2006). Note that (15) is equivalent, in terms of the resulting optimal classifier, to a reward function (10) with a penalty term such as

$$g(m) = \begin{cases} 0, & m \in \{0, 1\}, \\ 1, & m = 2, \ldots, N - 1, \\ 1 - r, & m = N. \end{cases}$$

The corresponding classifier is of interest for Bayesian sequential analysis (Arrow et al., 1949; Berger, 2013) and sequential clinical trials (Carlin et al., 1998), where the reject option $\hat{I} = \mathcal{N}$ corresponds to delaying the decision and collecting more data before the sequential procedure is stopped and a specific category is chosen.
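A minimal Python sketch of a classifier with a reject option follows. It assumes a reward of 1 for a correct singleton, $r$ for the full set of categories and 0 otherwise, so that the value of the most probable singleton is its posterior probability and the value of rejecting is $r$. The function name and the inputs are hypothetical, and ties are resolved in favour of a firm decision.

```python
def reject_option_classifier(p, r):
    """Classify firmly when the largest posterior reaches r, else reject.

    p : list of posterior probabilities (0-based labels)
    r : reward assigned to the reject option, with 1/N < r < 1
    """
    best = max(range(len(p)), key=lambda i: p[i])
    if p[best] >= r:
        return {best}                 # firm decision
    return set(range(len(p)))         # reject option: all categories kept


print(reject_option_classifier([0.6, 0.3, 0.1], 0.7))
print(reject_option_classifier([0.6, 0.3, 0.1], 0.5))
```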
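The next result is a corollary of Proposition 1, and it treats an important class of reward functions with a convex penalty function:
Corollary 1. Consider a classifier $\hat{I}(z)$ based on a reward function (10), for which the penalty term $g(m)$ is a convex function of $m$, with $g(0) = 0$. Then (12) simplifies to $\hat{I}(z) = \hat{I}_{m(z)}$, with

$$m(z) = \max(0, \tilde{m}(z)), \tag{18}$$

where

$$\begin{aligned} \tilde{m}(z) &= \max\{1 \le m \le N : v(m; z) - v(m - 1; z) \ge 0\} \\ &= \max\{1 \le m \le N : p_{(N+1-m)} \ge g(m) - g(m - 1)\} \end{aligned} \tag{19}$$

and $\max \emptyset = -\infty$ in the definition of $\tilde{m}(z)$.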
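Remark 1. Note that in the second line of (19), the inequality constitutes an inclusion criterion that a category needs to fulfill in order to be included in the classifier. As can be seen from the left and right hand sides of the inequality, the category with the largest posterior probability not yet included will be included if its posterior probability is larger than the added penalty $g(m) - g(m - 1)$ associated with enlarging the size of the classifier from $m - 1$ to $m$.
Proof. Since $g$ is convex and $p_{(N+1-m)}$ is non-increasing in $m$, the increments $v(m; z) - v(m - 1; z) = p_{(N+1-m)} - [g(m) - g(m - 1)]$ are non-increasing in $m$, so that $v(m; z)$ is a concave function of $m$. The maximizer in (12) can therefore be written as

$$m(z) = \max\Big(0, \max\{1 \le m \le N : v(m; z) - v(m - 1; z) \ge 0\}\Big), \tag{20}$$

where $\max \emptyset = -\infty$ in the inner maximization. By the definition of $v(m; z)$ in (13), equation (20) is equivalent to the expression for $m(z)$ given in equation (19).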
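The inclusion criterion for a convex penalty can be sketched in Python as follows. The sketch assumes the rule "include the next most probable category as long as its posterior probability is at least $g(m) - g(m - 1)$" and compares it with a direct maximization of $v(m; z)$; both function names and the quadratic penalty are hypothetical.

```python
def convex_penalty_size(p, g):
    """Set size from the inclusion rule; g must be convex with g(0) = 0."""
    ordered = sorted(p, reverse=True)
    m = 0
    for k, prob in enumerate(ordered, start=1):
        if prob >= g(k) - g(k - 1):
            m = k
        else:
            break  # by convexity, less probable categories fail as well
    return m


def argmax_size(p, g):
    """Direct maximisation of v(m; z) over m = 0, ..., N, for comparison."""
    ordered = sorted(p, reverse=True)
    values, cumulative = [0.0], 0.0
    for k, prob in enumerate(ordered, start=1):
        cumulative += prob
        values.append(cumulative - g(k))
    return max(range(len(values)), key=lambda m: values[m])


posteriors = [0.5, 0.3, 0.15, 0.05]
quadratic = lambda m: 0.05 * m ** 2   # hypothetical convex penalty
print(convex_penalty_size(posteriors, quadratic), argmax_size(posteriors, quadratic))
```

The early exit in the loop is what the convexity assumption buys: once one category fails the criterion, no later category can satisfy it.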
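Example 2 (Proportion-based reward functions.) In order to illustrate Corollary 1 we introduce the following class of reward functions with a penalty term $g(m) = c \max(0, m - 1)$ that only includes one cost parameter $c$:
Definition 3. An invariant reward function of the form

$$R(\mathcal{I}, i) = 1(i \in \mathcal{I}) - c \max(0, |\mathcal{I}| - 1), \tag{21}$$

with $c \ge 0$, is called a proportion-based reward function.
Corollary 2. The optimal classifier for a proportion-based reward function (21) is an m-MP classifier with

$$m(z) = \max\{1 \le m \le N : p_{(N+1-m)} \ge c\, 1(m \ge 2)\}. \tag{22}$$

Proof. Since a proportion-based reward function (21) has penalty term $g(m) = c \max(0, m - 1)$, which is convex with $g(0) = 0$ and

$$g(m) - g(m - 1) = c\, 1(m \ge 2), \tag{23}$$

inserting (23) into (19), equation (22) follows.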
Note that the MAP classifier (4) is obtained for $c > p_{(N)}$. For this reason Karlsson and Hössjer (2021) restricted the cost term of (21) to a range $0 \le c \le p_{(N)}$ and reparametrized it as

$$c = \rho\, p_{(N)}$$

with $0 \le \rho \le 1$. The m-MP classifier (11), with $m = m(z)$ as in (18) and (22), then takes the form

$$\hat{I}(z) = \{i \in \mathcal{N} : p_i \ge \rho\, p_{(N)}\}. \tag{25}$$

Example 3 (Conformal prediction.) Conformal prediction (Vovk et al., 2005; Shafer and Vovk, 2008) is a general method for creating a prediction region $\Gamma^\delta = \Gamma^\delta(z)$ for an observation $z$ with a confidence level of $100(1 - \delta)\%$, where $\delta \in (0, 1)$ is chosen freely, typically close to 0. For the case of categorical prediction, the conformal algorithm (Shafer and Vovk, 2008, Section 4.3) uses as input the new observation $z$ that we want to classify, a labeled training data set $B$, the number $\delta$, and a nonconformity measure $A$. Then for each possible label $i \in \mathcal{N}$ of $z$ a decision is made as to whether $i$ should be included in the prediction region $\Gamma^\delta$ or not.
It turns out that conformal prediction is a special case of the classification theory presented in this section, when posterior probabilities are used as nonconformity measure. We choose the penalty term $g(|\mathcal{I}|) = c|\mathcal{I}|$ in (10) and find that $\Gamma^\delta = \hat{I}(z) = \hat{I}_{m(z)}$, where

$$m(z) = |\{i \in \mathcal{N} : p_i(z) \ge c\}|. \tag{26}$$

We can use theory from Shafer and Vovk (2008) to specify how $c = c(\delta)$ in (26) relates to $\delta$. To this end, let $z$ be the observation we want to classify.
For each possible label (or category) $i \in \mathcal{N}$ of $z$, provisionally set $(z, i)$ as a future observation that is part of training data (although in practice $z$ is rather a future observation that we want to classify). The batch of previous examples $B$ corresponds to a very large training data set (which is called $D = D_1 \cup \ldots \cup D_N$ in Section 5.2), with $n_i \to \infty$ observations from each category $i$, which in the limit makes it possible to know the distributions $f_1, \ldots, f_N$ of data under the various hypotheses. If
of the distributions of the $N$ categories. We then use the posterior probabilities $p_i$ to define the nonconformity measure.
Let $F_i$ refer to the distribution function of $p_i(Z)$ when $Z \sim f_i$, and let

$$F(x) = \sum_{i=1}^N \pi_i F_i(x)$$

be the corresponding mixture distribution function. The conformal algorithm then corresponds to choosing a classifier (26) with

$$c = c(\delta) = F^{-1}(\delta), \tag{30}$$

where, as mentioned, $1 - \delta$ is the confidence level of the prediction region.
As a last remark, if the penalty parameter is changed from $c$ to $\rho = c/p_{(N)}$ (cf. Example 2), we notice that (26) is equivalent to (25) for any value of $\rho \ge 0$.
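The two threshold rules discussed in Examples 2 and 3 can be sketched in Python. The sketch assumes that the classified set consists of all categories whose posterior probability is at least $c$ (or at least $\rho$ times the largest posterior), and that the cost $c(\delta)$ is approximated by an empirical $\delta$-quantile of the posterior probabilities assigned to the true category in a calibration sample. The calibration values, the function names and the empirical approximation of the quantile are assumptions made for illustration only.

```python
import math


def threshold_classifier(p, c):
    """All categories (0-based labels) with posterior probability >= c."""
    return {i for i, p_i in enumerate(p) if p_i >= c}


def proportion_classifier(p, rho):
    """Threshold relative to the largest posterior, c = rho * max(p)."""
    return threshold_classifier(p, rho * max(p))


def empirical_quantile(values, delta):
    """Smallest observed value x whose empirical distribution F(x) >= delta."""
    ordered = sorted(values)
    k = max(1, math.ceil(delta * len(ordered)))
    return ordered[k - 1]


# Hypothetical posterior probabilities of the true category, one value per
# calibration observation.
true_label_posteriors = [0.9, 0.8, 0.95, 0.4, 0.7, 0.99, 0.85, 0.6, 0.92, 0.75]
c = empirical_quantile(true_label_posteriors, delta=0.2)
print(c, threshold_classifier([0.65, 0.3, 0.05], c))
```

With this choice of $c$, at most a fraction $\delta$ of the calibration observations have a true category that falls strictly below the threshold.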

Several blocks of categories
This section will cover an extension of the classical classification problem, where observations belong to a category and categories belong to supercategories, or blocks, as we will call them. Thus we partition the $N$ categories into $K$ blocks of sizes $N_1, \ldots, N_K$, with $\sum_{k=1}^K N_k = N$. Without loss of generality, the labels $i$ are defined so that each block consists of adjacent categories. The different scenarios when it is worse to misclassify within a block than between blocks, and vice versa, will be a subject of study later on in the paper. In order to define classifiers that take this into account, we introduce a new type of reward function.
Definition 4. For a set of $N$ categories, partitioned into blocks $\mathcal{N}_k$, $k = 1, \ldots, K$, with $N_k$ categories in block $k$, block invariant reward functions satisfy (9) only for block preserving permutations $\tau$. A class of block invariant reward functions is

$$R(\mathcal{I}, i) = 1(i \in \mathcal{I}) - g_{k(i)}\big(|\mathcal{I}_{k(i)}|,\; |\mathcal{I}| - |\mathcal{I}_{k(i)}|\big), \tag{32}$$

where

$$\mathcal{I}_k = \mathcal{I} \cap \mathcal{N}_k$$

contains the categories of the classified set $\mathcal{I}$ that belong to block $k$, whereas $k(i)$ is the block to which $i$ belongs, i.e. $i \in \mathcal{N}_{k(i)}$. Moreover, $g_k$ is a penalty term for misclassification, when the true category $i$ belongs to $\mathcal{N}_k$. This term is a function of the number of categories $|\mathcal{I}_{k(i)}|$ in the classified set $\mathcal{I}$ that belong to the correct block as well as the number of classified categories $|\mathcal{I}| - |\mathcal{I}_{k(i)}|$ that belong to any of the wrong blocks. In order to define the classifiers of interest we order the posterior probabilities $p_i$, $i \in \mathcal{N}_k$, within each block as $p_{(k1)} \le \ldots \le p_{(kN_k)}$.
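Definition 5. For an integer vector $m = (m_1, \ldots, m_K)$ with $0 \le m_k \le N_k$ for $k = 1, \ldots, K$, let

$$\hat{I}_m = \bigcup_{k=1}^K \{(kj) : N_k + 1 - m_k \le j \le N_k\} \tag{35}$$

be a classifier that includes the $m_k$ categories with the largest posterior probabilities from block $k$. We call this a composite classifier.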
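A composite classifier is a subset of $\mathcal{N}$ of size $|\hat{I}_m| = \sum_{k=1}^K m_k$. In particular, the two extreme scenarios with no categories classified or a rejection to classify correspond to $\hat{I}_{(0,\ldots,0)} = \emptyset$ and $\hat{I}_{(N_1,\ldots,N_K)} = \mathcal{N}$.
Proposition 2. The optimal classifier for a block invariant reward function (32) is a composite classifier $\hat{I}(z) = \hat{I}_{m(z)}$, with

$$m(z) = \arg\max_{m} \sum_{k=1}^K \Big[v_k(m_k; z) - P_k(z)\, g_k\big(m_k, |m| - m_k\big)\Big], \tag{37}$$

where the maximum is taken over integer vectors $m = (m_1, \ldots, m_K)$ with $0 \le m_k \le N_k$, $|m| = \sum_{k=1}^K m_k$, $P_k(z) = \sum_{i \in \mathcal{N}_k} p_i(z)$ is the posterior probability of block $k$,

$$v_k(m_k; z) = \sum_{j=1}^{m_k} p_{(k, N_k+1-j)}$$

for $m_k = 1, \ldots, N_k$, and $v_k(0; z) = 0$ for any $k$.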
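Proof. The proof mimics that of Proposition 1. We start by finding the value function $V(z; \mathcal{I})$ in (8) for the block invariant reward function (32). It is given by

$$V(z; \mathcal{I}) = \sum_{k=1}^K \Big[\sum_{i \in \mathcal{I}_k} p_i - P_k(z)\, g_k\big(|\mathcal{I}_k|, |\mathcal{I}| - |\mathcal{I}_k|\big)\Big] \le \sum_{k=1}^K \Big[v_k(m_k; z) - P_k(z)\, g_k\big(m_k, |m| - m_k\big)\Big], \tag{39}$$

where $m_k = |\mathcal{I}_k|$, $|m| = \sum_{k=1}^K m_k$, $P_k(z) = \sum_{i \in \mathcal{N}_k} p_i(z)$, and $v_k(m_k; z)$ is the sum of the $m_k$ largest posterior probabilities in block $k$. From this it follows that $V(z; \mathcal{I})$ is maximized, among all $\mathcal{I}$ with $|\mathcal{I}_k| = m_k$ for $k = 1, \ldots, K$, by the composite classifier $\hat{I}_m$ in (35). The optimal classifier is therefore $\hat{I}_{m(z)}$, where $m(z)$ is the vector $m$ that maximizes the right hand side of (39). Hence $m(z)$ is given by (37).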
Example 4 (Composite proportion-based reward functions.) We will consider a class of reward functions that are special cases of (32). These reward functions involve two cost parameters $a$ and $b$:
Definition 6. A block invariant reward function of the form

$$R(\mathcal{I}, i) = 1(i \in \mathcal{I}) - a \max\big(0, |\mathcal{I}_{k(i)}| - 1\big) - b\big(|\mathcal{I}| - |\mathcal{I}_{k(i)}|\big), \tag{40}$$

where $0 \le a \le b$ are fixed constants, is called a composite proportion-based reward function.
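It follows from Proposition 2 that composite proportion-based reward functions have very explicit optimal classifiers:
Corollary 3. A composite proportion-based reward function (40) gives rise to an optimal classifier that is a composite classifier $\hat{I}(z) = \hat{I}_{m(z)}$ in (35) with

$$m_k(z) = \arg\max_{0 \le m_k \le N_k} \Big[v_k(m_k; z) - a P_k(z) \max(0, m_k - 1) - b\big(1 - P_k(z)\big) m_k\Big] = \max\big(0, \tilde{m}_k(z)\big) \tag{41}$$

for $k = 1, \ldots, K$, where $v_k(m_k; z)$ is the sum of the $m_k$ largest posterior probabilities in block $k$, $P_k(z) = \sum_{i \in \mathcal{N}_k} p_i(z)$,

$$\tilde{m}_k(z) = \max\left\{1 \le m_k \le N_k : p_{(k, N_k+1-m_k)} \ge \begin{cases} b\big(1 - P_k(z)\big), & m_k = 1, \\ a P_k(z) + b\big(1 - P_k(z)\big), & m_k \ge 2 \end{cases}\right\}, \tag{42}$$

and with $\max \emptyset = -\infty$ used in the definition of $\tilde{m}_k(z)$.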
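Proof. In view of Proposition 2, it suffices to prove that for a composite proportion-based reward function (40), the optimal classifier is of type $\hat{I}(z) = \hat{I}_{m(z)}$, where $m(z) = (m_1(z), \ldots, m_K(z))$ is given by (41). To this end, we first notice, from the right hand side of (39) and the identity $\sum_{k} P_k \sum_{l \ne k} m_l = \sum_{k} (1 - P_k) m_k$, that the value function of a classified set $\hat{I}_m$ is

$$V(z; \hat{I}_m) = \sum_{k=1}^K \Big[v_k(m_k; z) - a P_k \max(0, m_k - 1) - b(1 - P_k) m_k\Big], \tag{43}$$

where $P_k = P_k(z) = \sum_{i \in \mathcal{N}_k} p_i(z)$ and $v_k(m_k; z)$ is the sum of the $m_k$ largest posterior probabilities in block $k$. Since $V(z; \hat{I}_m)$ splits into a sum of $K$ terms that are functions of $m_1, \ldots, m_K$ respectively, it follows that $V(z; \hat{I}_m)$ is maximized, as a function of $m$, by maximizing each term separately with respect to $m_k$. The maximum for term $k$, on the right hand side of (43), is attained for

$$m_k(z) = \arg\max_{0 \le m_k \le N_k} \Big[v_k(m_k; z) - a P_k \max(0, m_k - 1) - b(1 - P_k) m_k\Big], \tag{44}$$

in accordance with the first identity of (41). The second identity of (41) follows from the fact that the function being maximized in (44) is concave in $m_k$.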
To give some more intuition for the choice of $a$ and $b$ in the reward function (40), we will look at the penalty term and the optimal classifier $\hat{I}(z) = \hat{I}_{m(z)}$ defined in equations (41)-(42) of Corollary 3. Note that a cost of $a$ is incurred per extra category from the correct block in the classified set $\hat{I}$, whereas a cost of $b$ is added for each category in the classified set originating from a wrong block. These costs are chosen by the user and can be interpreted as threshold values for including categories in the classifier, especially when looking at (42). From the first row of this equation we notice that a low value of $b$ means that we are more prone to include the most probable category from each block, whereas the second row implies that with a low value of $a$ we are more prone to include several categories from the same block. However, since $b$ occurs in the second row of (42) as well, a small $a$ might not have any effect if $b$ is large. On the other hand, combining a small $b$ with a large $a$, we get a classifier that is composed of categories from many blocks, but few categories from each block. Such a classifier might not include the correct category, but will to a large extent include some category from the correct block, and might be suitable when we want to safeguard in particular against erroneous superclassification. Finally, if we want to ensure that $\hat{I} \ne \emptyset$ in composite classification, we have to choose $b$ small enough for the first row of (42) to hold for at least one block $k$.
Let us end Example 4 by considering two special cases of the proportion-based reward function (40). The first one occurs when $a = b = c$, and it corresponds to a reward function that differs slightly from (21) in that all categories in the classified set $\mathcal{I}$ are penalized by $c$ when none of them belong to the correct block $k(i)$, that is, when $\mathcal{I}_{k(i)} = \emptyset$. The second special case of (40) occurs when it is known that observation $z$ belongs to block $k$, i.e. $P_k(z) = 1$. The classifier $\hat{I}_{m(z)}$ in (35), with $m(z) = (m_1(z), \ldots, m_K(z))$ as in (41), then simplifies (for $b > 0$) to

$$\hat{I}(z) = \{(kj) : N_k + 1 - m_k(z) \le j \le N_k\}, \quad m_k(z) = \max\{1 \le m \le N_k : p_{(k, N_k+1-m)} \ge a\, 1(m \ge 2)\}.$$

This corresponds to an m-MP classifier (11), with $m = m(z)$ as in (18) and (22), when classification is restricted to categories within block $k$, and a penalty $c = a$ is incurred per extra classified category.
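The block-wise structure of the composite classifier can be sketched in Python. The sketch assumes a reward of 1 when the true category is in the classified set, a cost $a$ per extra classified category from the correct block and a cost $b$ per classified category from a wrong block. Under this assumption the value function separates over blocks, so each block size can be chosen on its own. The function names, the 0-based labels and the numerical example are hypothetical.

```python
def composite_classifier(p, blocks, a, b):
    """Choose, block by block, how many of the most probable categories to keep.

    p      : list of posterior probabilities (0-based labels)
    blocks : list of lists of labels, forming a partition of the categories
    a, b   : within-block and between-block cost parameters
    """
    chosen = set()
    for block in blocks:
        block_prob = sum(p[i] for i in block)            # P_k(z)
        order = sorted(block, key=lambda i: p[i], reverse=True)
        best_m, best_val, cumulative = 0, 0.0, 0.0
        for m, label in enumerate(order, start=1):
            cumulative += p[label]
            val = (cumulative
                   - a * block_prob * max(0, m - 1)
                   - b * (1.0 - block_prob) * m)
            if val > best_val:
                best_m, best_val = m, val
        chosen.update(order[:best_m])
    return chosen


def reward(subset, i, blocks, a, b):
    """Assumed composite proportion-based reward for true category i."""
    own = set(next(block for block in blocks if i in block))
    in_own = len(subset & own)
    hit = 1.0 if i in subset else 0.0
    return hit - a * max(0, in_own - 1) - b * (len(subset) - in_own)


def value(p, subset, blocks, a, b):
    """Conditional expected reward of a classified subset."""
    return sum(p[i] * reward(subset, i, blocks, a, b) for i in range(len(p)))


blocks = [[0, 1], [2], [3]]
posteriors = [0.45, 0.35, 0.15, 0.05]
print(composite_classifier(posteriors, blocks, a=0.2, b=0.4))
```

The helper `value` makes it possible to check the block-wise rule against an exhaustive search over all subsets when the number of categories is small.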
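Example 5 (Indifference zones.) Assume that we want to know which of $N$ normally distributed populations with unit variances and expected values $\theta_1, \ldots, \theta_N$ has the largest expected value $\theta_{(N)}$. Let

$$Z = (Z_{ij};\; i = 1, \ldots, N,\; j = 1, \ldots, n_i)$$

be a sample of independent random variables, with $Z_{ij} \sim N(\theta_i, 1)$. Letting $\phi$ be the density function of a standard normal distribution, this gives a likelihood

$$f(z; \theta) = \prod_{i=1}^N \prod_{j=1}^{n_i} \phi(z_{ij} - \theta_i)$$

with a parameter vector $\theta = (\theta_1, \ldots, \theta_N)$ that is assumed to have a prior density $P(\theta)$. Divide the parameter space $\Theta = \mathbb{R}^N$ into a disjoint union

$$\Theta = \Theta_1 \cup \ldots \cup \Theta_{N+1}$$

of $N + 1$ regions, where

$$\Theta_i = \{\theta : \theta_i \ge \max_{j \ne i} \theta_j + \epsilon\}, \quad i = 1, \ldots, N, \qquad \Theta_{N+1} = \Theta \setminus (\Theta_1 \cup \ldots \cup \Theta_N),$$

and $\epsilon > 0$ is a small number. Whereas $\Theta_1, \ldots, \Theta_N$ correspond to all hypotheses where some parameter $\theta_i$ is largest by a margin of at least $\epsilon$, $\Theta_{N+1}$ is the indifference zone, where none of the populations has an expected value that is at least $\epsilon$ units larger than all the others (Bechhofer, 1954; Goldsman, 1986).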
It is possible to put this model into the framework of Section 2, with

$$\pi_i = \int_{\Theta_i} P(\theta)\, d\theta, \qquad f_i(z) = \frac{1}{\pi_i} \int_{\Theta_i} f(z; \theta) P(\theta)\, d\theta$$

for $i = 1, \ldots, N + 1$, where an extra category $N + 1$ is added that represents the indifference zone. We will consider reward functions
where $r > 0$ is the reward of not selecting any population as having the largest mean, when the parameter vector belongs to the indifference zone. Formally, this corresponds to a block invariant reward function with the two blocks

$$\mathcal{N}_1 = \{1, \ldots, N\}, \qquad \mathcal{N}_2 = \{N + 1\} \tag{51}$$

of categories, although misclassification in this example is more serious within than between blocks. However, the reward function (50), with blocks as in (51), does not belong to the class of reward functions in (32). It is therefore not possible to make use of Proposition 2, but it can still be seen that the optimal classifier is
In Section 6 we briefly discuss an extended class of reward functions that includes (50).

Illustration for an ornithological data set
In this section taxon identification will be used as a case study. In particular, we will look at four bird species that are morphologically similar and share three measurable traits. In Karlsson and Hössjer (2021), we treat the underlying fitting problem in detail, with covariates, heteroscedasticity, missing values and imperfect observations, exemplified on data on the same four bird species. Since this paper has a stronger emphasis on developing a theory of classification, we use a subset of data from Karlsson and Hössjer (2021) for the purpose of illustration. This data set includes complete observations only from a certain stratum of the population, eliminating the need for covariates.
To a large extent, our data is the same as in Malmhagen et al. (2013), where the same classification problem is treated with mostly descriptive statistics. The four species considered are Reed warbler, Marsh warbler, Blyth's reed warbler and Paddyfield warbler, and the three shared traits are wing length, notch length and notch position. For details on the measurements of the traits, see e.g. Svensson (1992), Malmhagen et al. (2013) and Karlsson and Hössjer (2021). In total we have 882 complete observations of juvenile birds, with the number $n_i$ of birds of each species given in Table 1. This gives rise to a training data set

$$D_i = \{z_{ij};\; j = 1, \ldots, n_i\}$$

for each species $i = 1, \ldots, 4$.
The species were partitioned as follows: Reed warbler and Marsh warbler constitute the block "common breeders", Blyth's reed warbler constitutes the block "rare breeder" and Paddyfield warbler constitutes the block "rare vagrant". We acknowledge that the grouping is quite arbitrary and that the block names are inaccurate in most places of the world, but here the grouping mainly serves to illustrate classification with a partitioned label space.

Model
We assume that the parameters associated with each category are independent. If $\theta_i \in \Theta_i$ is the parameter vector associated with category $i$, with a prior distribution $P(\theta_i)$, and if $f(D_i; \theta_i)$ refers to the likelihood of training data for taxon $i$, the posterior distribution of $\theta_i$ is

$$P(\theta_i \mid D_i) = \frac{P(\theta_i) f(D_i; \theta_i)}{\int_{\Theta_i} P(\theta) f(D_i; \theta)\, d\theta}. \tag{52}$$

Let $z$ be an observed new data point that we wish to classify, based on training data. If $f(z; \theta_i)$ is the likelihood of the test data point for taxon $i$, we integrate over the posterior distribution (52) in order to obtain the corresponding likelihood

$$f_i(z) = \int_{\Theta_i} f(z; \theta_i) P(\theta_i \mid D_i)\, d\theta_i \tag{53}$$

of $z$ for the hypothesis $H_i$ that corresponds to this taxon. Then we insert (53) into (2) in order to obtain the posterior probabilities $p_i(z)$ of all categories.
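In our example we assume that
We estimate (53) using Monte Carlo simulation, i.e. we simulate $L$ realizations $\theta_i^{(1)}, \ldots, \theta_i^{(L)}$ of $\theta_i$ from the posterior distribution (52), compute

$$\hat{f}_i(z) = \frac{1}{L} \sum_{l=1}^L f(z; \theta_i^{(l)}), \tag{54}$$

and then plug (54) into (2). For a more detailed model setup, see Section 3.1 and Appendix A of Karlsson and Hössjer (2021). All implementation was done in R (R Core Team, 2021), using the package mvtnorm (Genz et al., 2021).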
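The Monte Carlo step can be illustrated with a small Python sketch, given here in Python rather than R purely for illustration. It averages the likelihood of a new observation over draws from a posterior distribution. To keep the example self-contained and checkable, it uses a hypothetical univariate normal model with unit variance and a normal posterior for the mean, for which the exact answer is known; this toy model is an assumption and not the model of the case study.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def monte_carlo_likelihood(z, posterior_draws, density):
    """Average of density(z, theta) over draws of theta from its posterior."""
    return sum(density(z, theta) for theta in posterior_draws) / len(posterior_draws)


# Toy setting (assumption): theta | data ~ N(0, 0.5^2) and z | theta ~ N(theta, 1).
rng = random.Random(1)
post_mean, post_sd = 0.0, 0.5
draws = [rng.gauss(post_mean, post_sd) for _ in range(20000)]

estimate = monte_carlo_likelihood(0.5, draws, lambda z, th: normal_pdf(z, th, 1.0))
# In this toy setting the integral is available in closed form: N(post_mean, 1 + post_sd^2).
exact = normal_pdf(0.5, post_mean, math.sqrt(1.0 + post_sd ** 2))
print(estimate, exact)
```

In practice the draws would come from an MCMC sampler for the chosen model, and the resulting estimates for all categories would be normalized together with the prior to give posterior category probabilities.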

Classifiers based on composite proportion-based reward functions
We will derive composite classifiers from the composite proportion-based reward function (40). This reward function involves the two constants $a \ge 0$ and $b > 0$, and the corresponding classifier of $z$ is denoted $\hat{I}^{(a,b)}(z)$. We will regard $0 \le \varepsilon = a/b$ as a fixed parameter that quantifies how much more severe it is to misclassify a category outside a block than inside it, with severity inversely proportional to $\varepsilon$. If $0 \le \varepsilon < 1$, it is more severe to misclassify between blocks than within, whereas the opposite is true when $\varepsilon > 1$. The parameter $b$ will be chosen through leave-one-out cross-validation. To this end, let $\tilde{R}(\mathcal{I}, i)$ be a binary-valued reward function (to be chosen below) without any penalty term, and let

$$R^{\mathrm{cv}}_{a,b} = \sum_{i=1}^N \frac{w_i}{n_i} \sum_{j=1}^{n_i} \tilde{R}\big(\hat{I}^{(a,b)}_{(-ij)}, i\big) \tag{55}$$

refer to the weighted fraction of observations $z_{ij}$ in the training data set $D = D_1 \cup \ldots \cup D_N$ that return a reward in the cross-validation procedure. That is, $\hat{I}^{(a,b)}_{(-ij)} = \hat{I}^{(a,b)}_{(-ij)}(z_{ij})$ is the classifier of $z_{ij}$ based on the rest of the data. It is further assumed that $w_i$ are non-negative weights that sum to 1, such as $w_i = 1/N$ or $w_i = n_i / \sum_{j=1}^N n_j$. The choice of $b$ will depend on which reward function $\tilde{R}$ is used in (55). Some possible choices of the binary-valued reward function are given in Table 2. For two of them ($\tilde{R}_3$ and $\tilde{R}_4$) the reward $\tilde{R}(\mathcal{I}, i)$ is a non-decreasing function of $\mathcal{I}$ with respect to set inclusion, and therefore the non-reward rate $1 - R^{\mathrm{cv}}_{\varepsilon b, b}$, obtained from the cross-validation procedure (55), is a non-decreasing function of $b$. We will therefore choose $b$ as the largest cost parameter for which the non-reward rate is at most $\delta > 0$, i.e.

$$\hat{b} = \max\{b > 0 : 1 - R^{\mathrm{cv}}_{\varepsilon b, b} \le \delta\}. \tag{56}$$
In particular, it can be shown that when the reward function $\tilde{R}_3$ is used in (56), this generalizes the conformal algorithm (Shafer and Vovk, 2008, Section 4.3) from the case of one block ($K = 1$) to several blocks ($K > 1$), although we use cross-validation from training data rather than prediction of a new observation, as in Shafer and Vovk (2008).
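For the other two reward functions $\tilde{R}_1$ and $\tilde{R}_2$ of Table 2, it is no longer the case that the estimated non-reward rate $1 - R^{\mathrm{cv}}_{\varepsilon b, b}$ in (55) is monotonic in $b$. For these two choices of $\tilde{R}$ we rather choose the cost parameter in order to minimize the estimated non-reward rate, i.e.

$$\hat{b} = \arg\min_{b > 0} \big(1 - R^{\mathrm{cv}}_{\varepsilon b, b}\big). \tag{57}$$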
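The two ways of choosing the cost parameter can be sketched in Python. The sketch assumes that the leave-one-out rewards (0 or 1) have already been computed for each observation, that the weighted reward rate is the weighted average of the per-category reward fractions, and that the non-reward rate is available as a function of $b$ on a grid. All function names and the toy non-reward curves are hypothetical.

```python
def cv_reward_rate(rewards_by_category, weights):
    """Weighted reward rate: sum over categories of w_i * (mean reward in category i)."""
    return sum(w * sum(r) / len(r) for w, r in zip(weights, rewards_by_category))


def largest_b_within_level(grid, non_reward_rate, delta):
    """Largest grid value of b whose non-reward rate is at most delta (None if no such b)."""
    feasible = [b for b in grid if non_reward_rate(b) <= delta]
    return max(feasible) if feasible else None


def minimising_b(grid, non_reward_rate):
    """Grid value of b with the smallest non-reward rate (first one if tied)."""
    return min(grid, key=non_reward_rate)


# Grid 0.01, 0.02, ..., 20.00.
grid = [k / 100 for k in range(1, 2001)]

# Toy non-reward curves (assumptions, standing in for cross-validation output).
def monotone_curve(b):
    return 0.02 if b < 1.0 else (0.04 if b < 3.0 else 0.10)

def u_shaped_curve(b):
    return (b - 1.5) ** 2 + 0.1

print(largest_b_within_level(grid, monotone_curve, delta=0.05))
print(minimising_b(grid, u_shaped_curve))
```

A grid search is used for both rules to keep the sketch simple; a one-dimensional optimizer could replace it for the minimization, at the risk of stopping in a local minimum when the curve is not unimodal.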
We will look at three different prior distributions, namely a uniform prior ($\pi^{(\mathrm{flat})}$), a prior proportional to the number of observations $n_i$ from each category in training data ($\pi^{(\mathrm{prop})}$), and a prior proportional to the number of registered birds of each species at the Falsterbo Bird Observatory throughout its operational history ($\pi^{(\mathrm{real})}$). The purpose of $\pi^{(\mathrm{flat})}$ is to represent a situation of no prior knowledge on how likely any of the categories is to occur. The prior $\pi^{(\mathrm{real})}$ is supposed to exemplify a real world situation of having some commonly occurring species and some very rare ones, whereas $\pi^{(\mathrm{prop})}$ is a middle ground between these two extremes that fits the data set $D$ very well. As weights we choose $w_i^{(\mathrm{bird})} = n_i / \sum_{j=1}^4 n_j$, $w_i^{(\mathrm{spec})} = 1/4$ and a third set of weights $w_i^{(\mathrm{rare})}$, for $i = 1, 2, 3, 4$. We weight each observation equally with $w^{(\mathrm{bird})}$ and each species equally with $w^{(\mathrm{spec})}$, whereas with $w^{(\mathrm{rare})}$ the species are weighted higher the less expected they are. Since the number of observations is not balanced across species, $w^{(\mathrm{bird})}$ will weight species higher the more common they are, i.e. the more observations we have of them. Using $w^{(\mathrm{spec})}$, birds will be weighted unequally due to the same imbalance (the fewer birds of a given species there are, the higher the weights assigned to these birds). Finally, as mentioned above, $w^{(\mathrm{rare})}$ weights less frequently observed species more heavily, the rationale being that we value observations of rarely occurring species, as data on these are scarce.

Binary reward function: Reward criteria
$\tilde{R}_1$: Correct category is in the classifier, and no category from an incorrect block is in the classifier.
$\tilde{R}_2$: The analogy of $\tilde{R}_1$ for block prediction.
$\tilde{R}_3$: Correct category in classifier.
$\tilde{R}_4$: Some category from the correct block in the classifier. The analogy of $\tilde{R}_3$ for block prediction.
Table 2: The four binary reward functions used in the case study. Since $\tilde{R}_1(\mathcal{I}, i) \le \tilde{R}_2(\mathcal{I}, i) \le \tilde{R}_3(\mathcal{I}, i) \le \tilde{R}_4(\mathcal{I}, i)$, it follows that $\tilde{R}_1$ is the least generous reward function and $\tilde{R}_4$ the most generous one. Note that $R^{\mathrm{cv}}_{\varepsilon b, b}(\tilde{R}_3)$ and $R^{\mathrm{cv}}_{\varepsilon b, b}(\tilde{R}_4)$ are both decreasing functions of $b$, so that (56) makes sense for choosing $b$ for any of these two choices of $\tilde{R}$, whereas (57) is more appropriate for choosing $b$ for the other two reward functions.

Results
We will analyze the estimated reward rate (55) for the four choices of $\tilde{R}$ that are listed in Table 2. In Tables 3 and 4 we present the automatic choice of the cost parameter $b$ (cf. (56) and (57)) for two ratios $\varepsilon = 1/2$ and $\varepsilon = 2$ of $a$ and $b$, and for all nine combinations of priors and weights. As can be seen from Table 4, the same optimal $b$-values are found for $\tilde{R}_1$ and $\tilde{R}_2$, with the same non-reward rates. This is mostly due to the small block sizes, meaning that picking the correct species is sometimes equivalent to picking the correct block.
Table 3: Estimates of $b$ obtained using (56) with $\delta = 0.05$. These estimates of $b$ are computed for each combination of $\varepsilon$, prior $\pi_i$, and weights $w_i$. We evaluated (55) for $0.01 \le b \le 20$ with a resolution of 0.01.
In Figures 1 and 2 we plot the value of the estimated non-reward rate for a grid of $b$-values, for $\varepsilon = 1/2$ and $\varepsilon = 2$ respectively. Note the monotone decrease of the non-reward rates of $\tilde{R}_3$ and $\tilde{R}_4$ as $b$ decreases. Also notice the minima of the curves for $\tilde{R}_1$ and $\tilde{R}_2$ in the graphs.
Finally we refer to Appendix A for further visualisations of $\tilde{R}_1$ and $\tilde{R}_2$, evaluated over a lattice of $a$- and $b$-values. It can be seen that for $\tilde{R}_2$ the optimal non-reward rate is achieved by choosing $a = 0$. This is straightforward to explain, as $\tilde{R}_2$ does not punish the inclusion of several categories from the correct block. For this reason the classifier $\hat{I}^{(a,b)}$ with minimal non-reward rate includes as many categories as possible from each block with at least one classified member, corresponding to $a = 0$. Notice also that there are large regions of values of $a$ and $b$ that attain the minimum non-reward rate.
Table 4: Estimates of $b$ obtained using (57). These estimates of $b$ are computed for each combination of $\varepsilon$, prior $\pi_i$, and weights $w_i$. They were found using the optimise-function in R.
Figure 1: Estimated non-reward rate as a function of $b$ for $\varepsilon = 1/2$. (a) Using $w^{(\mathrm{bird})}$ we observe similar curves for $\pi^{(\mathrm{flat})}$ and $\pi^{(\mathrm{prop})}$, whereas $\pi^{(\mathrm{real})}$ gives overall higher non-reward rates. (b) Using $w^{(\mathrm{spec})}$, the curves for $\tilde{R}_1$ and $\tilde{R}_2$ look similar for $\pi^{(\mathrm{flat})}$ and $\pi^{(\mathrm{prop})}$, whereas the large drop in $\tilde{R}_3$ and $\tilde{R}_4$ occurs either for large values of $b$ ($\pi^{(\mathrm{flat})}$) or for small values of $b$ ($\pi^{(\mathrm{prop})}$). Again, the non-reward rates are overall higher using $\pi^{(\mathrm{real})}$. (c) Using $w^{(\mathrm{rare})}$, with a very uneven weighting of species, the curves for $\tilde{R}_1$ and $\tilde{R}_2$, as well as for $\tilde{R}_3$ and $\tilde{R}_4$, take values very close to each other. Thus it seems like these curves overlap, when in reality they do not. This is just a consequence of the extreme amount of up-weighting of rare species, which are both in singleton blocks.
Figure 2: Estimated non-reward rate as a function of $b$ for $\varepsilon = 2$. (a) Using $w^{(\mathrm{bird})}$ we observe similar curves for $\pi^{(\mathrm{flat})}$ and $\pi^{(\mathrm{prop})}$, whereas $\pi^{(\mathrm{real})}$ gives overall higher non-reward rates. (b) Using $w^{(\mathrm{spec})}$, the curves for $\tilde{R}_1$ and $\tilde{R}_2$ look similar for $\pi^{(\mathrm{flat})}$ and $\pi^{(\mathrm{prop})}$, whereas the large drop in $\tilde{R}_3$ and $\tilde{R}_4$ occurs either for large values of $b$ ($\pi^{(\mathrm{flat})}$) or for small values of $b$ ($\pi^{(\mathrm{prop})}$). Again, the non-reward rates are overall higher using $\pi^{(\mathrm{real})}$. (c) Using $w^{(\mathrm{rare})}$, with a very uneven weighting of species, the curves for $\tilde{R}_1$ and $\tilde{R}_2$, as well as for $\tilde{R}_3$ and $\tilde{R}_4$, take values very close to each other. Thus it seems like these curves overlap, when in reality they do not. This is just a consequence of the extreme amount of up-weighting of rare species, which are both in singleton blocks.

Discussion
In this article we introduce a general framework of set-valued classification of data that originates from one of a finite number of possible hypotheses. We introduce reward functions with a set-valued input argument, and derive the optimal (Bayes) classifier by maximizing the expected reward. Explicit formulas for the Bayes classifier are derived for a large class of reward functions. This includes scenarios where hypotheses either constitute one homogeneous block, or consist of several blocks of hypotheses, such that ambiguity within blocks is (typically) less serious than ambiguity between them. Our procedure is illustrated with an ornithological data set, where taxa (hypotheses) are divided into blocks.
As mentioned in Ripley (2007), a possible reason for including reject options is to obtain classifiers that are more reliable but also less expensive to use. In our case study of Section 5, for instance, a possible option is to consult an expert who would be able to identify the bird species morphologically, without using the measured traits. Although expertise does not come cheap, this could still be an alternative when the expected cost of algorithmic classification exceeds the cost of consulting an expert. The latter cost might be independent of, or a function of, the number of hypotheses the expert needs to consider. In the former case, the reject option of Ripley (2007) would suffice, and in the latter case a partial rejection, as proposed in this paper, could be beneficial.
A number of generalizations of our work are possible. Firstly, suppose the new observation $z$ that we want to classify, by means of the optimal classifier $\hat{I} = \hat{I}(z)$ in (8), involves a covariate vector $x$ and a response variable $Y$. Following Karlsson and Hössjer (2021), the most straightforward approach is to include covariates into the observation $z = (x, Y)$ that is to be classified. The covariate information will then be included in the category distributions $f_i$ and posterior probabilities $p_i$ of all categories $i = 1, \ldots, N$, as well as in the resulting $m$-MP and composite classifiers. However, if we also want the ambiguity of the classifier to depend on covariate information, it is possible to let the cost parameters of the reward function $R$ depend on $x$ as well. For instance, the conformal prediction algorithm of Example 3 involves choosing the cost $c(\delta) = F^{-1}(\delta) = Q(\delta)$ of including more labels in the classifier, as a quantile $Q(\delta)$ of the distribution $F$ of posterior probabilities (cf. (30)). In this context it is possible to use quantile regression (Koenker, 2005; Bottai et al., 2010) and choose the cost parameter $c(\delta; x) = Q(\delta \mid x) = g^{-1}[x\beta(\delta)]$ as a conditional quantile function. This is a regression model where the parameter vector $\beta(\delta)$ is a function of the quantile level $\delta$, whereas $g$ is a link function. This approach might be particularly helpful for models with heteroscedasticity.
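As an illustration of a covariate-dependent cost $c(\delta; x)$, the following minimal sketch fits a linear conditional quantile by minimising the pinball (check) loss of quantile regression. It rests on several assumptions that are not part of the paper: a single scalar covariate, an identity link $g$, a small hypothetical calibration sample, and the hypothetical function names `pinball_loss`, `fit_linear_quantile` and `conditional_cost`. The exhaustive search over pairs of points is only meant for small samples; it uses the fact that an optimal quantile regression line can be chosen to interpolate two observations.

```python
from itertools import combinations


def pinball_loss(residual, delta):
    """Check (pinball) loss rho_delta(u) = u * (delta - 1(u < 0))."""
    return residual * (delta - (1.0 if residual < 0 else 0.0))


def fit_linear_quantile(xs, ys, delta):
    """Fit Q(delta | x) = b0 + b1 * x by minimising the total pinball loss.

    Assumes an identity link and at least two distinct x-values. An optimal
    line can be chosen to pass through two data points, so for a small
    sample it suffices to search over all pairs of observations.
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line is not a valid regression line
        b1 = (y2 - y1) / (x2 - x1)
        b0 = y1 - b1 * x1
        loss = sum(pinball_loss(y - (b0 + b1 * x), delta) for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


def conditional_cost(x, beta):
    """Cost parameter c(delta; x) implied by the fitted quantile line."""
    b0, b1 = beta
    return b0 + b1 * x


# Hypothetical calibration data: covariate values and posterior probabilities.
calib_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
calib_p = [0.90, 0.85, 0.70, 0.75, 0.50, 0.55]
beta = fit_linear_quantile(calib_x, calib_p, delta=0.1)
cost_at_x = conditional_cost(2.5, beta)
```

With a link function other than the identity, for instance a logit link that keeps the cost within $(0, 1)$, the same search could be applied to transformed responses; this is a design choice that the sketch leaves open.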
Secondly, it is possible to consider reward functions with other penalty terms. Instead of penalizing the number of classified categories within the correct and wrong blocks respectively, as in (32), one may penalize the number of wrongly classified categories within the correct and wrong blocks. This corresponds to a reward function of the form (59), where $|I_{k(i),(-i)}| = |I_{k(i)} \setminus \{i\}|$ is the number of wrong categories of $I$ that belong to block $k(i)$ when category $i$ is true. For instance, the reward function (50) for indifference zones (Example 5) can be formulated in accordance with (59).
Thirdly, suppose an observation $z$ belongs to several categories $J \subset \mathcal{N}$, such as when $J$ represents the properties associated with $z$. It is natural in this context to have a reward $R(I, J)$ for a classified set $I \subset \mathcal{N}$ of categories. The MAP classifier (7), for instance, corresponds to a reward function $R(I, J) = 1(I = J)$. If we allow for some misclassified categories and all categories belong to one homogeneous block, a natural extension of (10) is a function $R(I, J) = 1(J \subset I) - g(|I|)$, where the first term gives a unit reward if all true categories are included in the classifier, whereas the second term $g(|I|)$ penalizes the size $|I|$ of the classified set.
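The multi-label reward $R(I, J) = 1(J \subset I) - g(|I|)$ can be explored numerically for a small number of categories. The sketch below is not the authors' implementation: it maximises the expected reward by brute force over all subsets $I$, and the posterior distribution over true label sets $J$, the penalty $g$ and all function names are hypothetical choices made for illustration.

```python
from itertools import combinations


def reward(I, J, g):
    """R(I, J) = 1(J is a subset of I) - g(|I|)."""
    return (1.0 if J <= I else 0.0) - g(len(I))


def all_subsets(n):
    """Yield every subset of {1, ..., n}, including the empty set."""
    items = range(1, n + 1)
    for size in range(n + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def expected_reward(I, posterior, g):
    """Expected reward of I under a posterior over true label sets J."""
    return sum(p * reward(I, J, g) for J, p in posterior.items())


def bayes_classifier(posterior, n, g):
    """Brute-force maximiser of the expected reward over all subsets I."""
    return max(all_subsets(n), key=lambda I: expected_reward(I, posterior, g))


def penalty(size):
    """Assumed penalty g: a cost of 0.15 per category beyond the first."""
    return 0.15 * max(size - 1, 0)


# Hypothetical posterior P(J | z) for N = 3 categories.
posterior = {
    frozenset({1}): 0.5,
    frozenset({1, 2}): 0.4,
    frozenset({3}): 0.1,
}
best = bayes_classifier(posterior, 3, penalty)
```

Under these assumed numbers the set $\{1, 2\}$ covers the true label set with probability 0.9 at a penalty of 0.15, which beats both the singleton $\{1\}$ and the full set. The enumeration grows as $2^N$, so it only serves as a check for small $N$.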
A Optimising $R_1$ and $R_2$
In Figure 3 the non-reward rate $1 - R^{\mathrm{cv}}_{ab}(R)$ (cf. (55)) is plotted as a function of the two cost parameters $a$ and $b$ of the classifier $\hat{I}_{(a,b)}$ for the two reward functions $R_1$ and $R_2$ of Table 2. The objective function is not smooth and thus it can be hard to optimize. However, we obtained good results with Nelder-Mead optimization (Nelder and Mead, 1965), as implemented in the optim function in R, with a starting value of $(a, b)$ that corresponds to a small non-reward rate.
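One way to obtain a starting value with a small non-reward rate is to evaluate the objective over a coarse lattice of $(a, b)$-values first. The sketch below illustrates this under stated assumptions: the piecewise-constant function `toy_non_reward_rate` is a hypothetical stand-in for the cross-validated non-reward rate, and the lattice bounds and resolution are arbitrary. The paper does not state how its starting values were chosen.

```python
def lattice(lo, hi, steps):
    """Equally spaced grid of `steps` points from lo to hi (inclusive)."""
    return [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]


def best_on_lattice(objective, a_values, b_values):
    """Return (value, a, b) with the smallest objective value on the lattice."""
    return min((objective(a, b), a, b) for a in a_values for b in b_values)


def toy_non_reward_rate(a, b):
    """Hypothetical non-smooth stand-in for the cross-validated rate."""
    return 0.1 + int(abs(a - 0.2) * 4) / 10 + int(abs(b - 0.6) * 4) / 10


# Coarse search; (a0, b0) can then be passed to a Nelder-Mead routine.
value, a0, b0 = best_on_lattice(
    toy_non_reward_rate, lattice(0.0, 1.0, 11), lattice(0.0, 1.0, 11)
)
```

The pair `(a0, b0)` could then serve as the initial point of a derivative-free optimizer, such as `optim` in R or `scipy.optimize.minimize` with `method="Nelder-Mead"` in Python. Because the objective is flat over large regions, ties on the lattice are broken here by the smallest `a` and then the smallest `b`.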
with $c \geq 0$, is called a proportion-based reward function. Note that a proportion-based reward function has a penalty term that is almost linear in $|I|$. It is only the maximum operator that prevents us from referring to (21) as a reward function with a linear penalty term (abbreviated as a linear reward function). We may interpret $c$ as a cost per extra classified category on top of the first one. For proportion-based reward functions, Corollary 1 simplifies as follows:
Corollary 2. A proportion-based reward function (21) gives rise to an $m$-MP classifier (11) with $m(z)$ as in (18) and

Figure 1: This figure represents the case $\varepsilon = 1/2$. The prior is specified in the title of each graph, whereas the weights are explained in the subcaptions. Each color of the curves in the graphs corresponds to one of the four reward functions $R_1$, $R_2$, $R_3$, $R_4$, as given by the legend above each subfigure. For all priors and weights we observe that $R_3$ and $R_4$ decrease monotonically as $b$ decreases, whereas $R_1$ and $R_2$ have a global minimum. In the subplots of (c), the scale along the vertical axis (for the non-reward rate) is different from that of (a) and (b).

Figure 2: This figure represents the case $\varepsilon = 2$. The prior is specified in the title of each graph, whereas the weights are explained in the subcaptions. Each color of the curves in the graphs corresponds to one of the four reward functions $R_1$, $R_2$, $R_3$, $R_4$, as given by the legend above each subfigure. For all priors and weights we observe that $R_3$ and $R_4$ decrease monotonically as $b$ decreases, whereas $R_1$ and $R_2$ have a global minimum. In the subplots of (c), the scale along the vertical axis (for the non-reward rate) is different from that of (a) and (b).

Figure 3: The figure contains filled contour plots of the estimated non-reward rate $1 - R^{\mathrm{cv}}_{ab}(R)$ (cf. (55)), as a function of the two cost parameters $a$ and $b$ of the classifier $\hat{I}_{(a,b)}$. These estimated non-reward rates make use of weights $w^{(\text{bird})}_i$ (cf. (58)), the reward function $R_1$ (top row) and $R_2$ (bottom row). The columns, from left to right, correspond to the priors $\pi^{(\text{flat})}$, $\pi^{(\text{prop})}$ and $\pi^{(\text{real})}$ (cf. Section 5.3). The levels of the contour plots are crudely drawn, but it can still be seen that $R_1$ attains a low value for a large set of $(a, b)$, whereas $R_2$ attains its lowest values over a small region where $a$ is close to 0.
.., $N_K$) $= N$ respectively. Moreover, (36) reduces to an $m_k$-MP classifier (11) for a particular block $k$ if $m$ has only one non-zero element $m_k$ (cf. Example 4). The following result links composite classifiers to block invariant reward functions of type (32): 2. The optimal classifier, for a reward function (32), is a composite classifier $\hat{I}(z) = \hat{I}_m(z)$ in (35), with

Table 1: The number of observations of each species.

Table 3: The table specifies the estimated values of the cost parameter $b$, using (

Table 4: The table specifies the estimated values of the cost parameter $b$,