Optimally ordering IDK classifiers subject to deadlines

A classifier is a software component, often based on Deep Learning, that categorizes each input provided to it into one of a fixed set of classes. An IDK classifier may additionally output “I Don’t Know” (IDK) for certain inputs. Multiple distinct IDK classifiers may be available for the same classification problem, offering different trade-offs between effectiveness, i.e. the probability of successful classification, and efficiency, i.e. execution time. Optimal offline algorithms are proposed for sequentially ordering IDK classifiers such that the expected duration to successfully classify an input is minimized, optionally subject to a hard deadline on the maximum time permitted for classification. Solutions are provided considering independent and dependent relationships between pairs of classifiers, as well as a mix of the two.


Extended version
The paper "Optimal Synthesis of IDK cascades" by Baruah et al. (2021), published in RTNS 2021, presented analysis and algorithms for determining the optimal sequential ordering of probabilistically independent IDK classifiers, achieving the

Introduction
Software components that are based on Deep Learning and related AI techniques are increasingly being deployed for classification problems in complex resource-constrained Cyber-Physical Systems. Such systems often require accurate predictions to be delivered in real time using limited computational resources.
Much of the recent research into Deep Neural Networks (DNNs) has, however, focused on improving the accuracy of classification. From a real-time perspective this ongoing quest for improved accuracy has arguably gone too far, resulting in DNNs that take substantial time to process even simple inputs that should in fact be relatively straightforward to classify. For example, Wang et al. (2018) showed that an order-of-magnitude increase in the execution time of DNNs has resulted in a negligible improvement in the accuracy of predictions for a considerable fraction of the ImageNet 2012 benchmark of validation images (Russakovsky et al. 2015).
Balancing the trade-off between accuracy and latency becomes important if such DNNs are to be adopted for use in Cyber-Physical Systems (CPS) that are expected to respond in a timely manner; for example, a DNN used for image processing within an autonomous driving CPS (Fujiyoshi et al. 2019). With this goal in mind, Wang et al. (2018) observed that if the advanced but slower DNNs were only used in the more challenging cases, then the time taken to achieve successful classification could be reduced. In effect, fast DNNs are combined with accurate ones to reduce mean latency without any trade-off in accuracy.
This observation motivated Trappenberg and Back (2000) and Khani et al. (2016) to explore the use of IDK classifiers, which may be viewed as bringing some degree of self-awareness to classifiers. An IDK classifier is obtained from an existing base classifier by attaching a computationally light-weight augmenting classifier that enables the base classifier to additionally predict an auxiliary "I Don't Know" (IDK) class depending on the degree of uncertainty in the predictions of the base classifier. Specifically, an IDK classifier classifies an input as being in the IDK class if the base classifier is not able to predict some actual class for that input with a level of confidence that exceeds a predefined threshold value (Baruah et al. 2021).
The use of multiple classifiers to enhance the overall accuracy and precision of classification has been studied since at least the work of Nilsson (1965). An ensemble of classifiers is used to introduce diversity, either in the type of input,
Since it is a requirement that all inputs are successfully classified by an IDK cascade, it is assumed that the last classifier in the cascade always outputs a real class. We refer to a classifier that always outputs a real class as a deterministic classifier. There are various forms that the deterministic classifier can take. Wang et al. (2018) proposed that a human expert could be considered to be the deterministic classifier. For example, the driver of a semi-autonomous vehicle could be called upon to decide, in conditions that caused a camera-based classifier to fail (i.e. output IDK), whether a partially obscured road sign ahead signifies a lower speed limit, and hence whether the vehicle should reduce speed, an action that has a safety-related timing constraint. In another application a fully developed DNN could be sufficiently accurate that it can take on the role of deterministic classifier; however, its computation requirements are such that it should only be executed when absolutely necessary, and other more efficient classifiers should be used if they can cater for typical inputs. To deal with applications that exhibit high levels of uncertainty, it may be necessary to introduce the class unclassifiable that the final arbiter, the deterministic classifier, can output if a real class cannot be identified.
Given a collection of several different IDK classifiers for a particular classification problem, this paper considers how they should be sequentially ordered for execution so as to minimize the expected (i.e. average) duration taken to successfully classify an input, and if a deadline is specified, to additionally guarantee to always meet that deadline. The analysis and the algorithms required to solve these problems are impacted by the relationships between the IDK classifiers concerned.
Two IDK classifiers may behave in a way that is independent of one another. By independent, we mean that the probability that the second classifier will output a real class is independent of whether it is run on all inputs or on only those inputs where the first classifier outputs IDK. For example, the ensemble examples given earlier for autonomous vehicle control and person recognition, which employ diverse inputs (radar, LiDAR and camera; iris images, face recognition and voice recognition), are likely to exhibit behavior that is independent. Indeed, Madani et al. (2012, 2013) show that very different sources of evidence such as text, audio, and video features are effectively independent. At the other extreme, two image-based classifiers that use the same input image but scale it to different resolutions (e.g. 64 × 64 or 256 × 256 pixels) (Hu et al. 2021) may be very similar and exhibit behavior that is fully dependent. By fully dependent, we mean that the first less powerful classifier is only able to successfully classify a strict subset of the inputs that the second more powerful classifier can recognize. This informal notion of similarity among classifiers is formalized in Sect. 2 via the concept of conditional probability.
The remainder of the paper is organized as follows. Section 2 describes the system model, terminology and notation used, along with the definitions of key concepts. Sections 3 and 4 present solutions for collections of IDK classifiers that are, respectively, (i) independent and (ii) fully dependent with respect to one another. Section 5 considers collections of IDK classifiers that have (iii) a mix of dependent and independent relationships. Section 6 extends the analysis and algorithms for all three cases to problems where there is also a deadline constraint. Finally, Sect. 7 concludes with a summary and directions for future research.

System model, terminology, and notation
We consider a collection of $n$ classifiers $K_1, K_2, \ldots, K_n$ that may be used for a given classification problem. Each classifier $K_i$ is characterized by parameters $(C_i, P_i)$, specifying its execution time $C_i$ and its success probability $P_i$. These parameters denote that the classifier takes at most a time $C_i$ to complete execution when invoked on an input, and returns a real class, rather than IDK, with probability $P_i$, where $0 < P_i \le 1$. (For a discussion of how these parameters can be obtained, see Baruah et al. (2021).) We refer to an ordered linear sequence of such classifiers as an IDK cascade.
The actual value of n is application dependent and many different values are to be found in the literature on classifier ensembles. As noted earlier, diversity comes from having independent classifiers with different types of input, different internal models and different training data. Even a single classifier, such as the image-based example described previously, can have a number of different pixel resolutions defined and hence give rise to two, three or more distinct fully dependent classifiers. Combining such fully dependent and independent classifiers can easily lead to IDK cascades with eight or more components. In practice, however, n is unlikely to be greater than ten.
Problem statement: Given a collection of $n$ classifiers $K_1, K_2, \ldots, K_n$ suitable for use on a given classification problem, the objective is to determine which of these classifiers should be executed, and in what order, such that the expected duration to successfully classify the input is minimized. Stated otherwise, the aim is to obtain the optimal IDK cascade. Further, the problem may additionally be subject to the constraint that the maximum time taken to successfully classify the input must be no more than a specified deadline $D$.
For the expected duration of an IDK cascade to be finite, it is necessary that some classifier $K_n$ with $P_n = 1$ is executed. We therefore assume that such a classifier exists, and refer to it as a deterministic classifier; by convention it is always denoted by $K_n$. Further, we assume that there is only one such deterministic classifier, since if there were more than one, the one with the shortest execution time should always be preferred and the others discarded. Similarly, we assume that the deterministic classifier has a longer execution time ($C_n$) than any of the IDK classifiers, since the deterministic classifier should always be preferred over any IDK classifier with the same or longer execution time; such IDK classifiers can therefore be discarded.
Combining probabilities: When computing the expected duration for an IDK cascade, it is essential to understand how the prior execution of one classifier $K_A$ impacts the probability that a subsequent classifier $K_B$ will be able to make a successful classification rather than return IDK. Consider two classifiers $K_A = (C_A, P_A)$ and $K_B = (C_B, P_B)$, neither of which is deterministic (i.e. $P_A < 1.0$ and $P_B < 1.0$). Suppose that classifier $K_A$ is called on some input and returns IDK, and in that case classifier $K_B$ is called next. Let $p(K_B \mid K_A)$ denote the conditional probability that classifier $K_B$ returns an actual class rather than IDK given that classifier $K_A$ failed to do so and therefore returned IDK.
Definition 1 (Independent classifiers) Classifiers $K_A$ and $K_B$ are said to be independent if $p(K_B \mid K_A) = P_B$ and $p(K_A \mid K_B) = P_A$.
That is, the conditional probability that $K_B$ will make a successful classification given that $K_A$ did not is exactly the same as the probability that $K_B$ will make a successful classification when $K_A$ is not called beforehand, and vice versa. Informally speaking, this happens when $K_A$ and $K_B$ are making their classification decisions in very different ways, for example by using completely different (and uncorrelated) attributes of their inputs, and so the fact that one of the classifiers was unable to classify an input has no bearing on the ability of the other classifier to do so.
At the other extreme are fully dependent classifiers. Two classifiers are fully dependent if one successfully classifies only a strict subset of the set of inputs that are successfully classified by the other.
Definition 2 (Dependent classifiers) Classifiers $K_A$ and $K_B$, satisfying $P_A < P_B$, are said to be fully dependent if
$$p(K_A \mid K_B) = 0 \quad \text{and} \quad p(K_B \mid K_A) = \frac{P_B - P_A}{1 - P_A}.$$
So if $K_B$ cannot deliver a successful classification then neither can $K_A$; and the conditional probability that $K_B$ will make a successful classification given that $K_A$ did not is given by the proportion of inputs, $P_B - P_A$, that $K_B$ is able to classify that $K_A$ cannot, divided by the proportion of inputs, $1 - P_A$, that $K_A$ fails to classify. For example, if $P_A = 0.6$ and $P_B = 0.8$, then if $K_A$ fails to make a classification, the probability of success for $K_B$ falls from 0.8 to $(0.8 - 0.6)/(1 - 0.6) = 0.5$.
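The two definitions can be made concrete with a minimal Python sketch. The function names `p_independent` and `p_fully_dependent` are illustrative choices, not notation from the paper, and the second function assumes that a less powerful classifier run after a more powerful one from the same dependent pair has zero probability of success, as stated in Definition 2.

```python
def p_independent(p_b: float, p_a: float) -> float:
    """p(K_B | K_A) for independent classifiers (Definition 1).

    The earlier failure of K_A carries no information about K_B,
    so the conditional probability is simply P_B.
    """
    return p_b


def p_fully_dependent(p_b: float, p_a: float) -> float:
    """p(K_B | K_A) for fully dependent classifiers (Definition 2).

    p_a must be < 1, since a deterministic K_A never returns IDK.
    """
    if p_a >= 1.0:
        raise ValueError("K_A never returns IDK, so p(K_B | K_A) is undefined")
    if p_b <= p_a:
        # K_B classifies only inputs that K_A would already have classified.
        return 0.0
    return (p_b - p_a) / (1.0 - p_a)


# Worked example from the text: P_A = 0.6, P_B = 0.8.
conditional = p_fully_dependent(0.8, 0.6)
print(round(conditional, 6))
```

For the worked example the conditional probability evaluates to 0.5 (up to floating-point rounding), matching the value given in the text.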
Observe that if the second classifier $K_B$ is deterministic, and hence $P_B = 1$, then we have $p(K_B \mid K_A) = \frac{1 - P_A}{1 - P_A} = 1 = P_B$. In other words, deterministic classifiers can be regarded as being either independent (Definition 1) or fully dependent (Definition 2).
Intuitively, IDK cascades can be constructed by placing less effective but faster classifiers earlier in the order, in the hope that they will successfully classify the input most of the time, with more effective but slower classifiers invoked only on those rare occasions that the earlier classifiers fail. This is illustrated by the following example.
Example 1 Suppose, for solving some classification problem, we have a deterministic classifier $K_3$ (we will add to this example later), with parameters $(C_3 = 10, P_3 = 1)$. Since the classifier is deterministic, it will be the last classifier to run in any IDK cascade in which it is included. Further, assume we also have an IDK classifier $K_1$ with parameters $(C_1 = 5, P_1 = 0.6)$.
Consider the IDK cascade $\langle K_1; K_3 \rangle$, which executes $K_1$ first and subsequently executes $K_3$ only if $K_1$ fails to make a successful classification and returns IDK. Since $K_1$ is always executed on all inputs, but $K_3$ only executes when $K_1$ outputs IDK, which happens with probability $(1 - P_1)$, the expected duration of this IDK cascade is given by:
$$C_1 + (1 - P_1)\, C_3 = 5 + 0.4 \times 10 = 9,$$
which is smaller than $C_3 = 10$, the duration of the cascade $\langle K_3 \rangle$ containing only the deterministic classifier. ◻
The downside of using an IDK cascade rather than only executing the deterministic classifier is that the worst-case duration increases. In Example 1, while the IDK cascade $\langle K_1; K_3 \rangle$ completes in 5 time units in 60% of cases, in the remaining 40% of cases it takes 15 time units, whereas executing only the deterministic classifier always takes 10 time units. Whether this matters or not depends on whether classification is required by a specified deadline. Such problems are considered in Sect. 6.
Example 1 considered the combination of one IDK classifier with a deterministic classifier. The problem of finding an optimal IDK cascade becomes much more interesting and challenging when multiple IDK classifiers are available. In that case, the analysis and algorithms required depend on the relationships between the IDK classifiers, i.e. whether they are independent (Sect. 3), fully dependent (Sect. 4), or consist of groups of dependent classifiers where classifiers from different groups are nevertheless independent of one another (Sect. 5).

Independent IDK classifiers
In this section we consider the problem of determining the optimal IDK cascade, i.e. the one that minimizes the expected duration for successful classification, given a set of $n$ independent IDK classifiers $\{K_i = (C_i, P_i)\}_{i=1}^{n}$. We begin with an illustrative example.
Example 2 Consider again the problem instance from Example 1. Suppose that there is an additional IDK classifier $K_2$ with parameters $(C_2 = 3, P_2 = 0.2)$ as set out in Table 1. Further, assume that $K_2$ is known to be independent of $K_1$. It can be verified that IDK cascade $\langle K_1; K_2; K_3 \rangle$ has an expected duration of
$$5 + 0.4 \times 3 + 0.4 \times 0.8 \times 10 = 9.4,$$
IDK cascade $\langle K_2; K_1; K_3 \rangle$ has an expected duration of
$$3 + 0.8 \times 5 + 0.8 \times 0.4 \times 10 = 10.2,$$
and IDK cascade $\langle K_2; K_3 \rangle$ has an expected duration of
$$3 + 0.8 \times 10 = 11.$$
Observe that each of these IDK cascades has an expected duration larger than that of the IDK cascade $\langle K_1; K_3 \rangle$ that, as shown in Example 1, has an expected duration of 9. Hence if minimizing expected duration is the objective, then the IDK classifier $K_2$ should not be used at all. ◻
Example 2 illustrates one way of obtaining optimal IDK cascades: simply enumerate all possibilities, compute the expected duration for each, and choose the IDK cascade with the minimum value. However, such an approach is highly inefficient: given $n$ classifiers, the number of possible IDK cascades is a very rapidly growing combinatorial function, $\sum_{k=0}^{n-1} \frac{(n-1)!}{(n-1-k)!}$. Below, we derive a far more efficient algorithm for synthesizing optimal IDK cascades from a given collection of independent IDK classifiers.
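The exhaustive approach described in Example 2 can be sketched in a few lines of Python. This is an illustration only: the helper names `expected_duration` and `enumerate_cascades` are hypothetical, and the sketch assumes that every classifier is independent of the others and that the deterministic classifier always runs last.

```python
from itertools import permutations
from math import factorial


def expected_duration(cascade):
    """Expected duration of a cascade of independent classifiers (C, P)."""
    total, p_reach = 0.0, 1.0  # p_reach: probability all earlier ones returned IDK
    for c, p in cascade:
        total += p_reach * c
        p_reach *= 1.0 - p
    return total


def enumerate_cascades(idk_classifiers, deterministic):
    """Yield every ordered selection of IDK classifiers followed by the deterministic one."""
    for k in range(len(idk_classifiers) + 1):
        for prefix in permutations(idk_classifiers, k):
            yield prefix + (deterministic,)


# Parameters (C, P) from Examples 1 and 2.
K1, K2, K3 = (5, 0.6), (3, 0.2), (10, 1.0)
cascades = list(enumerate_cascades([K1, K2], K3))
best = min(cascades, key=expected_duration)

n = 3
count = sum(factorial(n - 1) // factorial(n - 1 - k) for k in range(n))
print(len(cascades), count, best, expected_duration(best))
```

For this instance the enumeration visits five cascades, in agreement with the counting formula, and selects the cascade that runs K1 and then K3. The factorial growth of the count is what makes this approach impractical for larger n.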
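Lemma 1 The expected duration of an IDK cascade $\langle K_1; K_2; \ldots; K_n \rangle$ of independent IDK classifiers is given by:
$$\sum_{i=1}^{n} \left( C_i \prod_{j=1}^{i-1} (1 - P_j) \right) \qquad (1)$$
Lemma 1 can be used to compute the expected duration of any linear sequence of independent IDK classifiers.
Proof The first classifier will always execute; the second classifier will execute if and only if the first one fails, which happens with probability $(1 - P_1)$; the third classifier will execute if and only if the first two both fail, which happens with probability $(1 - P_1)(1 - P_2)$; and so on. Hence the expected duration is
$$C_1 + (1 - P_1)\, C_2 + (1 - P_1)(1 - P_2)\, C_3 + \cdots + \left( \prod_{j=1}^{n-1} (1 - P_j) \right) C_n,$$
which is represented compactly by the expression in the Lemma. ◻
Lemma 2 below identifies, for independent IDK classifiers, an important characteristic of any optimal IDK cascade:
Lemma 2 Let classifier $K_j$ be scheduled for execution after classifier $K_i$ in an optimal IDK cascade (i.e. of minimum expected duration). It must be the case that:
$$\frac{C_i}{P_i} \le \frac{C_j}{P_j} \qquad (2)$$
Proof We will establish that any two adjacently scheduled classifiers $K_i$ and $K_j$ satisfy (2). The lemma then follows from the transitivity of the $\le$ relationship on $\mathbb{R}$ (the set of real numbers).
Let $S_{opt}$ denote an optimal IDK cascade, and let $\hat{K}_i$ denote the classifier in the $i$'th position in this cascade for any $i$, $1 \le i \le n$. Further, let $(\hat{C}_i, \hat{P}_i)$ denote the execution time and success probability of classifier $\hat{K}_i$. Finally, let $S_1$ denote an IDK cascade obtained from $S_{opt}$ by swapping the classifiers in the $i$'th and $(i+1)$'th positions in $S_{opt}$:
$$S_{opt} = \langle \hat{K}_1; \ldots; \hat{K}_{i-1}; \hat{K}_i; \hat{K}_{i+1}; \hat{K}_{i+2}; \ldots; \hat{K}_n \rangle, \qquad S_1 = \langle \hat{K}_1; \ldots; \hat{K}_{i-1}; \hat{K}_{i+1}; \hat{K}_i; \hat{K}_{i+2}; \ldots; \hat{K}_n \rangle.$$
Using Lemma 1, the expected duration of $S_{opt}$ can be written as the sum of three terms, representing respectively the outer summation of (1) over positions $1$ to $i-1$, over positions $i$ and $i+1$, and over positions $i+2$ to $n$:
$$\sum_{k=1}^{i-1} \left( \hat{C}_k \prod_{j=1}^{k-1} (1 - \hat{P}_j) \right) + \left( \prod_{j=1}^{i-1} (1 - \hat{P}_j) \right) \left( \hat{C}_i + (1 - \hat{P}_i)\, \hat{C}_{i+1} \right) + \sum_{k=i+2}^{n} \left( \hat{C}_k \prod_{j=1}^{k-1} (1 - \hat{P}_j) \right) \qquad (3)$$
Next, consider the expected duration of $S_1$. Again using Lemma 1, the expected duration of $S_1$ can also be written as the sum of three terms. Since $S_1$ only differs from $S_{opt}$ in that the $i$'th and $(i+1)$'th classifiers are swapped, the first and third terms are the same as the first and third terms of (3). However, the middle term is as follows, since the order of the $i$'th and $(i+1)$'th classifiers has been swapped:
$$\left( \prod_{j=1}^{i-1} (1 - \hat{P}_j) \right) \left( \hat{C}_{i+1} + (1 - \hat{P}_{i+1})\, \hat{C}_i \right) \qquad (4)$$
(The difference between (4) and the middle term of (3) is that the roles of $i$ and $(i+1)$ are swapped in the expression $\hat{C}_x + (1 - \hat{P}_x)\, \hat{C}_y$.)
$S_{opt}$ is by definition an optimal IDK cascade. Its expected duration is therefore no larger than the expected duration of $S_1$, and so the middle term of (3) must be no larger than (4):
$$\left( \prod_{j=1}^{i-1} (1 - \hat{P}_j) \right) \left( \hat{C}_i + (1 - \hat{P}_i)\, \hat{C}_{i+1} \right) \le \left( \prod_{j=1}^{i-1} (1 - \hat{P}_j) \right) \left( \hat{C}_{i+1} + (1 - \hat{P}_{i+1})\, \hat{C}_i \right) \qquad (5)$$
Observing that the term $\prod_{j=1}^{i-1} (1 - \hat{P}_j)$ appears on both sides of (5), we have:
$$\hat{C}_i + (1 - \hat{P}_i)\, \hat{C}_{i+1} \le \hat{C}_{i+1} + (1 - \hat{P}_{i+1})\, \hat{C}_i \;\Leftrightarrow\; \hat{P}_{i+1}\, \hat{C}_i \le \hat{P}_i\, \hat{C}_{i+1} \;\Leftrightarrow\; \frac{\hat{C}_i}{\hat{P}_i} \le \frac{\hat{C}_{i+1}}{\hat{P}_{i+1}},$$
which establishes (2) for any two adjacently scheduled classifiers. ◻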

Algorithm 1 Synthesizing an optimal IDK cascade from independent IDK classifiers
    Sort the classifiers in non-decreasing order of $C_i / P_i$
    Output the classifiers according to their position in the sorted list, stopping at the deterministic classifier
Lemma 2 implies that independent IDK classifiers adjacent to each other in any optimal IDK cascade must have the ratio of their execution time to their success probability in non-decreasing order. An algorithm for synthesizing optimal IDK cascades immediately presents itself: simply determine these ratios for all of the classifiers, and sort the list in non-decreasing order (see Algorithm 1). As depicted, the algorithm need not enumerate any classifiers beyond the deterministic one, since such a classifier is guaranteed to complete successfully. Observe that Algorithm 1 is highly efficient: its run-time complexity is dominated by the sorting step, and is therefore $O(n \log n)$.
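A minimal Python sketch of Algorithm 1 follows. The function name and the `(name, C, P)` tuple layout are illustrative assumptions; the sketch also assumes, as the system model does, that every success probability is strictly positive and that exactly one classifier has P = 1.

```python
def optimal_independent_cascade(classifiers):
    """Order independent classifiers by C/P and stop at the deterministic one.

    classifiers: list of (name, C, P) tuples with 0 < P <= 1.
    Returns the names of the classifiers in cascade order.
    """
    ordered = sorted(classifiers, key=lambda k: k[1] / k[2])
    cascade = []
    for name, _c, p in ordered:
        cascade.append(name)
        if p == 1.0:
            # Nothing placed after the deterministic classifier would ever run.
            break
    return cascade


# Problem instance from Example 2.
instance = [("K1", 5, 0.6), ("K2", 3, 0.2), ("K3", 10, 1.0)]
print(optimal_independent_cascade(instance))
```

On the instance from Example 2 the ratios are roughly 8.33 for K1, 15 for K2 and 10 for K3, so the sketch returns K1 followed by K3 and drops K2, in agreement with the enumeration in that example.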

Fully dependent IDK classifiers
In this section we consider the problem of determining an optimal IDK cascade, i.e. one that minimizes the expected duration for successful classification, given a set of $n$ fully dependent IDK classifiers $\{K_i = (C_i, P_i)\}_{i=1}^{n}$.
Since the classifiers are fully dependent, there is no benefit in executing a classifier with smaller probability $P_j$ after one with a larger probability $P_i > P_j$, since from Definition 2 we have $p(K_j \mid K_i) = 0$. In order to minimize the expected duration we must therefore order whichever classifiers are used by their probability of successful classification, smallest first. Further, the deterministic classifier $K_n$ must appear last, since otherwise there would be a non-zero probability of failing to classify some input, and hence the expected duration would no longer be finite.
Without loss of generality, we now assume that the dependent IDK classifiers are indexed according to strictly increasing values of their success probability, i.e. $P_{i+1} > P_i$ for all $i$, $1 \le i < n$, and $P_n = 1$, where $K_n$ is the deterministic classifier.
We begin the analysis by building an illustrative example. Consider two fully dependent IDK classifiers as set out in Table 2: $K_1$ with parameters $(C_1 = 5, P_1 = 0.5)$ and $K_2$ with parameters $(C_2 = 9, P_2 = 0.8)$. Comparing classifiers $K_1$ and $K_2$, we say that $K_2$ is a more powerful classifier than $K_1$, since $P_2 > P_1$. If we were to schedule classifier $K_1$ to execute only after executing $K_2$ and having $K_2$ output IDK, then the probability that $K_1$ would make a successful classification would be 0, since all of the cases that it can correctly classify would have already been identified by $K_2$. Alternatively, if we execute $K_1$ first and only execute $K_2$ if $K_1$ fails (i.e. returns IDK), then the probability that $K_2$ will make a successful classification needs to account for the fact that we now know its input is not one that $K_1$ is able to classify. In the notation of conditional probability (see Definition 2 in Sect. 2), we can represent this as:
$$p(K_2 \mid K_1) = \frac{P_2 - P_1}{1 - P_1},$$
where the numerator is the probability that an input to the IDK cascade $\langle K_1; K_2 \rangle$ will be classified by $K_2$, and the denominator is the probability that an input to the IDK cascade $\langle K_1; K_2 \rangle$ will not be classified by $K_1$.
Next consider a further classifier $K_3$ added to the IDK cascade, i.e. $\langle K_1; K_2; K_3 \rangle$. The probability that $K_3$ will be executed is given by the probability that both $K_1$ and then $K_2$ output IDK. That is:
$$(1 - P_1)\left(1 - p(K_2 \mid K_1)\right) = (1 - P_1)\left(1 - \frac{P_2 - P_1}{1 - P_1}\right) = 1 - P_2.$$
As expected, this simply equates to the probability that $K_2$ alone is unable to make a classification. This is the case since $K_1$ effectively adds nothing to the set of inputs that can be classified by $K_2$.
Example 3 Consider the problem instance described above, where $K_1$, $K_2$, and $K_3$ are fully dependent classifiers, as set out in Table 2.
Considering all of the possible IDK cascades where the classifiers run in index order, it can be verified that IDK cascade $\langle K_1; K_2; K_3 \rangle$ has an expected duration of
$$C_1 + (1 - P_1)\, C_2 + (1 - P_2)\, C_3 = 5 + 0.5 \times 9 + 0.2 \times 15 = 12.5,$$
IDK cascade $\langle K_1; K_3 \rangle$ has an expected duration of
$$C_1 + (1 - P_1)\, C_3 = 5 + 0.5 \times 15 = 12.5,$$
and IDK cascade $\langle K_2; K_3 \rangle$ has an expected duration of
$$C_2 + (1 - P_2)\, C_3 = 9 + 0.2 \times 15 = 12.$$
Finally, running only the deterministic classifier $K_3$ has an expected duration of 15. Observe that the IDK cascade $\langle K_2; K_3 \rangle$ is optimal, and has an expected duration lower than that of any of the possible IDK cascades that include $K_1$. Hence if minimizing expected duration is the objective then $K_1$ should not be used. ◻
Example 3 again illustrates a simple but inefficient way of synthesizing optimal IDK cascades: enumerate all possible IDK cascades that could potentially be optimal, compute the expected duration for each, and choose the one with the minimum value. However, even though we know that dependent classifiers can only appear in an optimal IDK cascade ordered by increasing probability, the fact that we do not know which classifiers to include means that there are still an exponential number of different IDK cascades to consider: $2^{n-1}$ for $n$ classifiers. Below, we derive a more efficient algorithm for synthesizing optimal IDK cascades from a given collection of fully dependent IDK classifiers. We begin with a definition.
Definition 3 (Optimal sub-sequence) For each $i$, $1 \le i \le n$, let $S(i)$ denote a sub-sequence of $\langle K_1; K_2; \ldots; K_i \rangle$ that ends with $K_i$ and has the minimum expected duration among all such sub-sequences. The sub-sequence may omit zero or more of the classifiers, with the exception of $K_i$, but otherwise retains the same (index) ordering. Let $f(i)$ denote the expected duration of this optimal sub-sequence $S(i)$.
Using the terminology introduced in Definition 3, the aim of determining an optimal IDK cascade equates to determining the optimal sub-sequence $S(n)$. We now describe how this may be achieved by inductively determining the optimal sub-sequences $S(1), S(2), \ldots, S(n)$ in order.
In order to determine $S(i)$ for $i \ge 1$, we observe that optimal sub-sequences satisfy the optimal sub-structure property (Cormen et al. 2009, p. 379): optimal solutions to any problem instance incorporate optimal solutions to sub-instances. Initially, we have:
$$f(0) = 0 \quad \text{and} \quad P_0 = 0,$$
where index 0 denotes the empty sub-sequence. Further, let $K_h$ denote the classifier immediately preceding $K_i$ in the optimal sub-sequence $S(i)$. It must be the case that $S(i)$ equates to the concatenation of $K_i$ to the end of the optimal sub-sequence $S(h)$. Since $K_i$ executes only if the most powerful preceding classifier $K_h$ returns IDK, which happens with probability $1 - P_h$, and recalling that $f(h)$ denotes the expected duration of $S(h)$, we therefore have:
$$f(i) = \min_{0 \le h < i} \left\{ f(h) + (1 - P_h)\, C_i \right\} \qquad (6)$$
(If the minimizing value of $h$ is not unique, then this implies that there are multiple optimal sub-sequences with the same minimum duration, and hence any such $h$ may be chosen.) Example 4 illustrates how this approach works.
Example 4 Consider again the problem instance from Example 3 where $K_1$ and $K_2$ are two fully dependent IDK classifiers, with parameters $(C_1 = 5, P_1 = 0.5)$ and $(C_2 = 9, P_2 = 0.8)$, and $K_3$ is a deterministic classifier with parameters $(C_3 = 15, P_3 = 1.0)$, as set out in Table 2. Let us consider the minimum expected duration $f(i)$ of a sub-sequence ending with each of the three classifiers, i.e. $f(i)$ for $i = 1, \ldots, 3$.
- Starting with classifier $K_1$, trivially we have $f(1) = C_1 = 5$ since no other classifiers can precede $K_1$, and therefore $S(1) = \langle K_1 \rangle$ is an optimal sub-sequence ending in $K_1$.
- Next considering $K_2$, there are two possibilities: (i) $K_2$ is the only classifier in the sub-sequence, in which case $f(2) = C_2 = 9$; or (ii) $K_2$ is preceded by $S(1)$, in which case $f(2) = f(1) + (1 - P_1)\, C_2 = 5 + 0.5 \times 9 = 9.5$. This is the case because $K_2$ runs with a probability of $1 - P_1$, the probability that $K_1$ returned IDK. Since we seek the minimum value, we have $f(2) = 9$. As this minimum value is obtained by executing only $K_2$, $S(2) = \langle K_2 \rangle$ is an optimal sub-sequence ending in $K_2$.
- Finally considering $K_3$, there are three possibilities: (i) $K_3$ is the only classifier in the sub-sequence, in which case $f(3) = C_3 = 15$; (ii) $K_3$ is preceded by $S(1)$, in which case $f(3) = f(1) + (1 - P_1)\, C_3 = 5 + 0.5 \times 15 = 12.5$; or (iii) $K_3$ is preceded by $S(2)$, in which case $f(3) = f(2) + (1 - P_2)\, C_3 = 9 + 0.2 \times 15 = 12$. Since we seek the minimum value, $f(3) = 12$, and $S(3) = \langle K_2, K_3 \rangle$ is an optimal sub-sequence ending in $K_3$.
As any valid IDK cascade must end with the deterministic classifier, $K_3$ in this case, we observe that the optimal IDK cascade for this example is $S(3) = \langle K_2, K_3 \rangle$, with the minimum expected duration of $f(3) = 12$. ◻
Algorithm 2 Synthesizing an IDK cascade with minimum expected duration, from fully dependent classifiers
    …
    do
        Q.prepend($K_r$)
        r = q(r)
    while (r ≠ 0)
    // Q now comprises the optimal IDK cascade
    Output Q
As shown in Algorithm 2, the recurrence in (6) can be evaluated via a loop iterating over the values of k from 1 to n, with the $\min_{0 \le h < i}\{\cdot\}$ term leading to an overall run-time complexity that is quadratic, $O(n^2)$, rather than exponential, in the number of classifiers.
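The dynamic program can be sketched in Python as follows. This is not a transcription of Algorithm 2: it is a sketch under the assumption that the recurrence is f(i) = min over 0 ≤ h < i of f(h) + (1 − P_h)·C_i, with f(0) = 0 and P_0 = 0 standing for "no predecessor". The arrays `f` and `q` (predecessor indices) and the function name are illustrative.

```python
def optimal_dependent_cascade(classifiers):
    """Minimum expected duration cascade for fully dependent classifiers.

    classifiers: list of (C, P) sorted by strictly increasing P, where the
    last entry is the deterministic classifier (P == 1).
    Returns (minimum expected duration, 1-based indices of the cascade).
    """
    n = len(classifiers)
    C = [0.0] + [c for c, _ in classifiers]  # index 0 is the empty prefix
    P = [0.0] + [p for _, p in classifiers]
    f = [0.0] * (n + 1)  # f[i]: best expected duration ending in K_i
    q = [0] * (n + 1)    # q[i]: predecessor of K_i in that best sub-sequence
    for i in range(1, n + 1):
        # K_i runs only if its predecessor K_h returned IDK: probability 1 - P_h.
        f[i], q[i] = min((f[h] + (1.0 - P[h]) * C[i], h) for h in range(i))
    cascade, r = [], n
    while r != 0:  # walk the predecessor links back from K_n
        cascade.insert(0, r)
        r = q[r]
    return f[n], cascade


# Problem instance from Examples 3 and 4.
value, cascade = optimal_dependent_cascade([(5, 0.5), (9, 0.8), (15, 1.0)])
print(round(value, 6), cascade)
```

For the instance of Example 4 the sketch selects classifiers 2 and 3 with an expected duration of 12 (up to floating-point rounding). The nested minimum inside the loop over i is what gives the quadratic running time.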

Independent groups of fully dependent IDK classifiers
In this section, we broaden the scope to include both independent and fully dependent IDK classifiers. We consider the problem of determining an optimal IDK cascade, given multiple independent groups of fully dependent IDK classifiers, and a single deterministic classifier, i.e. we have a collection $\{K_i = (C_i, P_i)\}_{i=1}^{n}$ of $n$ IDK classifiers, such that $K_n$ is a deterministic classifier, with $P_n = 1$, and the remaining classifiers are partitioned into $m < n$ groups $G_1, G_2, \ldots, G_m$ such that for any two distinct classifiers $K_A$ and $K_B$:
- If they are in the same group then they are fully dependent with respect to one another as per Definition 2. Without loss of generality, we assume that such fully dependent classifiers are indexed in increasing order of their probabilities, $P_i$; and
- If they are in different groups then they are independent with respect to one another as per Definition 1.
Real-Time Systems (2023) 59:1-34
From Definition 2, considering two fully dependent IDK classifiers $K_A$ and $K_B$ from the same group, with $P_B > P_A$, any sub-sequence of IDK classifiers where $K_A$ occurs after $K_B$ cannot be part of an optimal IDK cascade, since the probability of $K_A$ making a classification when $K_B$ did not is zero ($p(K_A \mid K_B) = 0$). For the opposite ordering, with $K_A$ first, the conditional probability that $K_B$ will make a classification given that $K_A$ did not is given by:
$$p(K_B \mid K_A) = \frac{P_B - P_A}{1 - P_A}.$$
Further, the probability $p(S)$ that a sub-sequence $S = \langle K_A, K_B \rangle$ of fully dependent IDK classifiers will fail to make a classification can be expressed in terms of conditional probabilities:
$$p(S) = (1 - P_A)\left(1 - p(K_B \mid K_A)\right) = (1 - P_A)\left(1 - \frac{P_B - P_A}{1 - P_A}\right) = 1 - P_B,$$
which equates to the probability that the most powerful of the dependent IDK classifiers will fail to make a classification. Considering a sub-sequence $S = \langle K_A, K_I, K_B \rangle$, where $K_A$ and $K_B$ are fully dependent IDK classifiers in the same group, with $K_B$ the more powerful classifier (i.e. $P_B > P_A$), and $K_I$ is an independent IDK classifier (i.e. from a different group), then the probability that the sub-sequence will fail to make a classification is given by:
$$p(S) = (1 - P_A)(1 - P_I)\left(1 - \frac{P_B - P_A}{1 - P_A}\right) = (1 - P_I)(1 - P_B).$$
In general, considering an IDK cascade $S$ composed of classifiers from independent groups of fully dependent IDK classifiers, the probability that the IDK cascade will fail to make a classification is given by:
$$p(S) = \prod_{K_i \in Z} (1 - P_i),$$
where $Z$ is a set containing for each group only the single most powerful IDK classifier from the group that is present in $S$.
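The failure-probability rule above, under which only the most powerful classifier seen so far in each group matters, also yields the expected duration of a mixed cascade. The Python sketch below illustrates this. The `(C, P, group)` tuple layout, the group labels, the function names, and the parameters of the independent classifier `K_I` are hypothetical choices made for illustration; the deterministic classifier is simply placed in a group of its own.

```python
def failure_probability(cascade):
    """Probability that every classifier in the cascade returns IDK.

    cascade: sequence of (C, P, group); classifiers sharing a group label are
    fully dependent, classifiers in different groups are independent.
    """
    strongest = {}  # group -> largest P present (the set Z in the text)
    for _c, p, group in cascade:
        strongest[group] = max(p, strongest.get(group, 0.0))
    result = 1.0
    for p in strongest.values():
        result *= 1.0 - p
    return result


def expected_duration_mixed(cascade):
    """Expected duration: each classifier runs only if all earlier ones failed."""
    total = 0.0
    for position, (c, _p, _group) in enumerate(cascade):
        total += failure_probability(cascade[:position]) * c
    return total


K_A = (5, 0.5, "g1")
K_B = (9, 0.8, "g1")
K_I = (4, 0.3, "g2")      # hypothetical independent classifier
K_DET = (15, 1.0, "det")  # deterministic classifier in its own group

print(round(failure_probability([K_A, K_B]), 6))
print(round(failure_probability([K_A, K_I, K_B]), 6))
print(round(expected_duration_mixed([K_A, K_B, K_DET]), 6))
```

With these values the first two results are 1 − P_B = 0.2 and (1 − P_I)(1 − P_B) = 0.14, and the third reproduces the expected duration of 12.5 obtained in Example 3 for the cascade that runs all three fully dependent classifiers.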
We illustrate the analysis for a mix of IDK classifiers via Example 5, obtained from Example 4 by renaming the deterministic classifier to $K_4$, and adding an independent classifier $K_3$.
Example 5 Suppose that we have four IDK classifiers with parameters as set out in Table 3, where $K_4$ is the deterministic classifier and the other three IDK classifiers are partitioned into two groups as follows: $G_1 = \{K_1, K_2\}$ and $G_2 = \{K_3\}$. Since sub-sequences where $K_2$ appears before $K_1$ cannot be part of an optimal IDK cascade, we ignore such cases and enumerate all other IDK cascades. Similar to Example 4, it can be verified that the IDK cascade $\langle K_1; K_2; K_4 \rangle$ has an expected duration of
$$5 + 0.5 \times 9 + 0.2 \times 15 = 12.5,$$
IDK cascade $\langle K_1; K_4 \rangle$ has an expected duration of
$$5 + 0.5 \times 15 = 12.5,$$
and IDK cascade $\langle K_2; K_4 \rangle$ has an expected duration of
$$9 + 0.2 \times 15 = 12;$$
finally $\langle K_4 \rangle$ has an expected duration of 15.
Building upon the IDK cascades above and inserting $K_3$, with $C_3$ and $P_3$ as set out in Table 3, it can be verified that IDK cascade $\langle K_3; K_1; K_2; K_4 \rangle$ has an expected duration of
$$C_3 + (1 - P_3)\left( C_1 + (1 - P_1)\, C_2 + (1 - P_2)\, C_4 \right),$$
IDK cascade $\langle K_1; K_3; K_2; K_4 \rangle$ has an expected duration of
$$C_1 + (1 - P_1)\, C_3 + (1 - P_1)(1 - P_3)\, C_2 + (1 - P_2)(1 - P_3)\, C_4,$$
IDK cascade $\langle K_1; K_2; K_3; K_4 \rangle$ has an expected duration of
$$C_1 + (1 - P_1)\, C_2 + (1 - P_2)\, C_3 + (1 - P_2)(1 - P_3)\, C_4,$$
and IDK cascade $\langle K_3; K_1; K_4 \rangle$ has an expected duration of
$$C_3 + (1 - P_3)\left( C_1 + (1 - P_1)\, C_4 \right).$$
Example 5 illustrates a crucial point regarding the difficulty of determining an optimal IDK cascade for independent groups of dependent IDK classifiers. For the sub-problem with only one fully dependent group and the deterministic classifier (i.e. $K_1$, $K_2$, and $K_4$), the optimal IDK cascade is $\langle K_2; K_4 \rangle$, yet once an extra group is added (effectively just the independent IDK classifier $K_3$), the use of $K_1$ becomes essential to obtain an optimal IDK cascade, i.e. either $\langle K_1; K_3; K_4 \rangle$ or $\langle K_1; K_3; K_2; K_4 \rangle$. This shows that a divide-and-conquer approach, first determining the local solution for each group of dependent IDK classifiers and then using only those classifiers in the global solution, is not optimal.
Lemma 2 shows that when we consider only independent IDK classifiers, an optimal IDK cascade can be obtained by considering the classifiers in order of the increasing ratio of their execution time to their success probability, which is simply $C_i / P_i$ in that case. Further, when we consider only fully dependent IDK classifiers, an optimal IDK cascade can be obtained by considering the classifiers in order of their increasing probability $P_i$, with the problem effectively reduced to determining which classifiers, if any, to omit. The problem is however subtly different when we attempt to build an optimal IDK cascade from multiple independent groups of fully dependent classifiers. In this case, there is no single complete ordering of classifiers from which we can make a step-by-step selection of the ones to use. The reason for this is as follows. When selecting between two mutually independent classifiers $K_2$ and $K_3$ from two different groups (e.g. $G_1$ and $G_2$), the classifiers must be ordered according to the ratio of their execution time to their probability of making a successful classification of any as yet unclassified input. (This ordering follows directly from the proof of Lemma 2 for independent classifiers.) However, for an IDK classifier such as $K_2$ that is a member of a group of fully dependent IDK classifiers, this probability is a conditional one that depends on the previous classifier (e.g. $K_1$), if any, from the same group that appears in the IDK cascade. The probability of $K_2$ making a successful classification of any as yet unclassified input, and hence its relative ordering with respect to $K_3$, can therefore depend on whether or not $K_1$ is present in the IDK cascade. Specifically, when
$$\frac{C_2}{P_2} < \frac{C_3}{P_3} < \frac{C_2}{\left( \frac{P_2 - P_1}{1 - P_1} \right)}$$
the ordering of $K_2$ and $K_3$ switches depending on whether or not $K_1$ is present in the IDK cascade. This issue is illustrated in Example 6.

Example 6
The parameters of the classifiers used in this example are as set out in Table 4. IDK classifiers $K_1$ and $K_2$, with parameters $(C_1 = 5, P_1 = 0.5)$ and $(C_2 = 9, P_2 = 0.8)$ respectively, are members of the same group, group 1 $= \{K_1, K_2\}$. IDK classifier $K_3$, with parameters $(C_3 = 10, P_3 = 0.75)$, is a member of a separate group, group 2 $= \{K_3\}$, and hence is independent of $K_1$ and $K_2$. Finally, $K_4$, with parameters $(C_4 = 20, P_4 = 1.0)$, is the deterministic classifier.
In attempting to build an optimal IDK cascade, we can apply the previously derived rules when selecting from classifiers that are dependent. Hence, if more than one dependent IDK classifier from the same group is present in the IDK cascade, then those classifiers must appear in order of their probability i.e. K 1 first then K 2 , since P 2 > P 1 .
Similarly, when selecting between classifiers that are independent of one another, those classifiers must appear in order of the ratio of their execution time to their probability of successfully classifying any as yet unclassified input, noting that this probability is conditional on the previous classifier, if any, from the same dependent group that already appears in the IDK cascade. The probability of $K_2$ correctly classifying any as yet unclassified input is given by $P_2 = 0.8$ if $K_1$ does not appear in the IDK cascade, and by $p(K_2|K_1) = \frac{0.8 - 0.5}{1 - 0.5} = 0.6$ if $K_1$ appears before $K_2$. Further, since $C_2/P_2 = 11.25$, $C_3/P_3 = 13.333$, and $C_2 \big/ \frac{P_2 - P_1}{1 - P_1} = 15$, the relative order of $K_2$ and $K_3$ depends on whether or not $K_1$ is used. If $K_1$ is present, then including $K_3$ before $K_2$ will result in a lower overall expected duration, otherwise it is more effective to include $K_2$ before $K_3$. We enumerate only the IDK cascades relevant to this point. It can be verified that IDK cascade $\langle K_1; K_2; K_3; K_4 \rangle$ has an expected duration of

$$5 + 0.5 \times 9 + 0.2 \times 10 + 0.05 \times 20 = 12.5$$

Switching the order of $K_2$ and $K_3$, IDK cascade $\langle K_1; K_3; K_2; K_4 \rangle$ has an expected duration of

$$5 + 0.5 \times 10 + 0.125 \times 9 + 0.05 \times 20 = 12.125$$

Omitting $K_1$, it can be verified that IDK cascade $\langle K_2; K_3; K_4 \rangle$ has an expected duration of

$$9 + 0.2 \times 10 + 0.05 \times 20 = 12$$

Switching the order of $K_2$ and $K_3$, IDK cascade $\langle K_3; K_2; K_4 \rangle$ has an expected duration of

$$10 + 0.25 \times 9 + 0.05 \times 20 = 13.25$$

Observe that when $K_1$ is included, it is more effective for $K_2$ to appear after $K_3$, whereas if $K_1$ is omitted, then it is more effective for $K_2$ to precede $K_3$, as in the optimal IDK cascade $\langle K_2; K_3; K_4 \rangle$. ◻

The issues of inclusion (illustrated by Example 5) and ordering (illustrated by Example 6) make the problem of determining an optimal IDK cascade for multiple independent groups of fully dependent IDK classifiers much more difficult than the individual cases of all independent or all fully dependent classifiers addressed in Sects. 3 and 4.
    for each dependent group with unused classifiers
        - Identify as a candidate classifier the least-powerful (i.e., smallest $P_i$) as yet unused classifier $K_i$ from the group
        - Let $P_j$ be the probability for the previous classifier $K_j$, if any, from the same group, that is also included according to B[j], with $P_j = 0$ if there is no such classifier
        - Compute the conditional probability $P_i = \frac{P_i - P_j}{1 - P_j}$ for $K_i$ using Definition 2
    Select the candidate classifier that has the minimum ratio of execution time to conditional probability. Let $K_k$ denote this selected classifier, $P_k$ its conditional probability, and $C_k$ its execution time
    Mark $K_k$ as "used".
    length = length + ($C_k$ × prob)   // since $K_k$ is executed with probability prob
    prob = (prob × (1 − $P_k$))        // since $K_k$ fails with probability (1 − $P_k$)
  until (deterministic classifier $K_n$ has been selected)
  return length

Algorithm 3 is a straightforward exponential time algorithm that finds the optimal solution by explicitly considering all possibilities, in the following manner. Letting the $n$-bit array B denote a particular subset of the classifiers (with B[i]==1 indicating that the $i$-th classifier $K_i$ is included in this subset and B[i]==0 indicating that it is not), the procedure OPTORDERMIXED iterates through all $2^{n-1}$ subsets of the classifiers that include the deterministic classifier $K_n$, in order to determine which has the smallest expected duration. The expected duration of the subset denoted by B is computed by procedure COMPUTELENGTH, by repeatedly:
- Determining the set of candidate classifiers that could be used next. This candidate set comprises the first (i.e. lowest power) as yet unused classifier from each group that is present in the trial set according to B. (Note, the deterministic classifier is considered to occupy a group on its own and is therefore always included as a candidate.)
- Selecting from the candidate classifiers the one with the minimum ratio of its execution time to its conditional probability, and marking it as used.
It is evident that the running time of Algorithm 3 is $O(2^{n-1} n^2)$, and hence exponential in the number of classifiers $n$. Note that while Algorithm 3 provides an optimal solution to the simpler problems involving (i) only independent IDK classifiers (Sect. 3), and (ii) only fully dependent IDK classifiers (Sect. 4), it has higher complexity and hence is less efficient than Algorithms 1 and 2 that were specifically designed for those cases.
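The exhaustive procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `compute_length` and `opt_order_mixed`, the `(C, P, group)` tuple layout, and the group labels are chosen here for illustration, and every non-deterministic classifier is assumed to have $P < 1$. The sample data uses the parameters of Example 6 with $P_3$ taken as 0.75, the value implied by the ratio $C_3/P_3 = 13.333$ quoted in that example.

```python
from itertools import product


def compute_length(classifiers, included):
    """Expected duration of the cascade built greedily from one trial subset.

    classifiers: list of (C, P, group) tuples; the last entry is the
                 deterministic classifier (P == 1.0) in a group of its own.
    included:    list of booleans playing the role of the array B.
    Returns (expected duration, list of 0-based indices in execution order).
    """
    # Per group, queue the included classifiers from least to most powerful.
    pending = {}
    for idx, (_, _, group) in enumerate(classifiers):
        if included[idx]:
            pending.setdefault(group, []).append(idx)
    for queue in pending.values():
        queue.sort(key=lambda i: classifiers[i][1])

    last_p = {group: 0.0 for group in pending}  # P of the previously used classifier per group
    length, prob, order = 0.0, 1.0, []
    deterministic = len(classifiers) - 1

    while True:
        best = None
        for group, queue in pending.items():
            if not queue:
                continue
            i = queue[0]
            c, p, _ = classifiers[i]
            # Conditional success probability given the previous classifier of the group.
            cond = (p - last_p[group]) / (1.0 - last_p[group])
            ratio = c / cond if cond > 0 else float("inf")
            if best is None or ratio < best[0]:
                best = (ratio, i, group, cond)
        _, k, group, cond = best
        pending[group].pop(0)
        last_p[group] = classifiers[k][1]
        length += classifiers[k][0] * prob  # K_k runs with probability prob
        prob *= 1.0 - cond                  # K_k fails with probability (1 - cond)
        order.append(k)
        if k == deterministic:
            return length, order


def opt_order_mixed(classifiers):
    """Try all 2^(n-1) subsets that contain the deterministic classifier."""
    n = len(classifiers)
    best_length, best_order = float("inf"), None
    for bits in product([False, True], repeat=n - 1):
        length, order = compute_length(classifiers, list(bits) + [True])
        if length < best_length:
            best_length, best_order = length, order
    return best_length, best_order


# Example 6 parameters (P_3 = 0.75 is an assumption, see the lead-in).
example6 = [(5, 0.5, 1), (9, 0.8, 1), (10, 0.75, 2), (20, 1.0, 0)]
print(opt_order_mixed(example6))
```

The indices in the returned order are 0-based, so `[1, 2, 3]` corresponds to the cascade $\langle K_2; K_3; K_4 \rangle$.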

Timing constraints on classification
Previous sections considered the problem of determining the optimal IDK cascade, that minimizes the expected duration for successful classification, given a set of independent IDK classifiers (Sect. 3), a set of fully dependent IDK classifiers (Sect. 4), or independent groups of fully dependent IDK classifiers (Sect. 5). In each case there was no constraint placed on the maximum duration of the IDK cascade. In this section, we consider a variant of each of these problems in which a timing constraint is specified, thus the objective is to determine the optimal IDK cascade, that minimizes the expected duration for successful classification, subject to the constraint that the maximum duration of the IDK cascade does not exceed a specified hard deadline D. We suspect, but have not yet been able to prove, that the problem of determining an optimal IDK cascade, subject to a deadline constraint, is NP-hard in each of these three cases.
A problem instance is now specified as:

$$\langle (K_1, K_2, \ldots, K_n), D \rangle \qquad (11)$$

where $K_1, K_2, \ldots, K_n$ are $n$ IDK classifiers with $K_n$ the deterministic classifier (i.e., $P_n = 1$), and $D \in \mathbb{N}$ is the specified deadline.
The maximum duration of an IDK cascade is simply the sum of the execution times ($C_i$) of all of the classifiers deployed in that cascade. Since we must ensure that classification is always completed successfully within the deadline $D$, it follows that a problem instance is feasible if and only if $C_n \le D$. In other words, if and only if the deterministic classifier has a duration that does not exceed the deadline. Other problem instances are infeasible and are not considered further.

Independent IDK classifiers with a deadline
In this subsection, we consider the problem of determining an optimal IDK cascade, subject to a deadline constraint, given a set of $n$ independent IDK classifiers. We observe that Lemma 2 (from Sect. 3) continues to hold irrespective of the presence of a deadline. Hence, adjacent classifiers in an optimal IDK cascade must satisfy the property that the ratio of their execution time to their success probability (i.e. $C_i/P_i$) is non-decreasing. We assume, without loss of generality, that the classifiers are indexed according to non-decreasing $C_i/P_i$; i.e., for all $i$, $1 \le i < n$, we have:

$$\frac{C_i}{P_i} \le \frac{C_{i+1}}{P_{i+1}}$$

This assumption can be realized for any problem instance by sorting, in $O(n \log n)$ time. Since any optimal IDK cascade must always end with the deterministic classifier, we assume that any IDK classifier $K_j$ that has a ratio of its execution time to its success probability that is greater than that of the deterministic classifier is discarded. (From Lemma 2, we know that any such IDK classifier cannot appear before the deterministic classifier in any optimal IDK cascade, and hence discarding it has no effect on optimality.) Hence we may potentially reduce the value of $n$, with $K_n$ again representing the deterministic classifier.
Given a modified problem instance as specified above, we apply the technique of dynamic programming (Bellman 1957) to determine an optimal IDK cascade of minimum expected duration, subject to the constraint that the maximum duration does not exceed D. We begin with a definition.
Definition 4 Let $E(k, d)$ denote the minimum expected duration for the following sub-problem of the problem instance specified by (11):

$$\langle (K_k, K_{k+1}, \ldots, K_n), d \rangle$$

That is, only the classifiers $K_k, K_{k+1}, \ldots, K_n$ are available, and there is a sub-deadline of $d$.
Using the notation from Definition 4, the expected duration of the optimal IDK cascade for the complete problem instance is $E(1, D)$, i.e. where the deadline is $D$, and all $n$ classifiers $K_1, K_2, \ldots, K_n$ are available.
In building a solution we first look at the sub-problem where only the deterministic classifier $K_n$ is available, and compute the values of $E(n, d)$ for all values of $d$. We have:

$$E(n, d) = \begin{cases} \infty & \text{if } d < C_n \\ C_n & \text{otherwise} \end{cases} \qquad (12)$$

In other words, if the deadline is smaller than the execution time $C_n$ of the deterministic classifier $K_n$, then the sub-problem instance is infeasible, represented by an expected duration of infinity. Otherwise, $d \ge C_n$ and so the optimal IDK cascade for the sub-problem comprises the deterministic classifier $K_n$, and hence has an expected duration of $C_n$. Now, assuming that we have already determined the values of $E(k+1, d')$ for all $d'$, we can compute the values of $E(k, d)$ for all $d$ as follows:

$$E(k, d) = \min\big( E(k+1, d),\; C_k + (1 - P_k) \times E(k+1, d - C_k) \big) \qquad (13)$$

where
- The first term within the min() reflects the decision not to use classifier $K_k$, and hence the expected duration is equal to the minimum expected duration $E(k+1, d)$ using only the classifiers $K_{k+1}, K_{k+2}, \ldots, K_n$.
- The second term within the min() reflects the decision to use classifier $K_k$. In this case classifier $K_k$ always executes, taking a time $C_k$, since according to Lemma 2 it precedes the remaining classifiers in an optimal IDK cascade. When classifier $K_k$ executes, it fails to make a successful classification with a probability $(1 - P_k)$, and when this happens, the remainder of the IDK cascade is executed with a minimum expected duration of $E(k+1, d - C_k)$, since the classifiers remaining are $K_{k+1}, K_{k+2}, \ldots, K_n$, and the available time remaining is $d - C_k$.

Equation (12) can be used to determine $E(n, d)$ for all $d$. Having done so, repeated application of (13) can be used to determine $E(1, d)$, for all $d$, and hence obtain the value of $E(1, D)$, which as mentioned earlier is the expected duration of an optimal IDK cascade for the complete problem instance. Further, the optimal IDK cascade that has this duration can be deduced by observing which of the first or the second term in the min() is smaller each time that (13) is applied.
Example 7 Suppose that we have three independent IDK classifiers $K_1$, $K_2$, and $K_3$ with parameters $(C_1 = 1, P_1 = 0.4)$, $(C_2 = 3, P_2 = 0.9)$, and $(C_3 = 2, P_3 = 0.5)$ and a deterministic classifier $K_4$ with parameters $(C_4 = 10, P_4 = 1.0)$ as shown in Table 5, and a deadline $D = 13$. Note the classifiers are again ordered according to the ratio $C_i/P_i$, as they must be in any optimal IDK cascade of independent IDK classifiers, with the deterministic classifier last.

Table 5 Parameters of independent IDK classifiers for Example 7

The computed values of $E(k, d)$ are shown in Table 6 and are explained below. Note, the values of $E(k, d)$ are assumed to be $\infty$ for all negative values of $d$.
As any valid IDK cascade must end with the deterministic classifier, $K_4$ in this case, the minimum expected duration for the optimal IDK cascade, subject to a deadline of $D = 13$, is given by $E(1, D) = E(1, 13) = 4$ for this example. Tracing back how that result was obtained, we see that it is for the IDK cascade $\langle K_2, K_4 \rangle$, with classifiers $K_1$ and $K_3$ unused. We note that the optimal IDK cascade is, as expected, dependent on the length of deadline permitted. If the deadline were longer, $D \ge 16$, then all four classifiers could be used with an expected duration of 3.22, whereas with a deadline of $D = 14$ or $D = 15$ the optimal IDK cascade would be $\langle K_1, K_2, K_4 \rangle$, with an expected duration of 3.4. For a shorter deadline of $D = 12$ the optimal IDK cascade would be either $\langle K_1, K_4 \rangle$ or $\langle K_3, K_4 \rangle$, with an expected duration of 7. Further, with a deadline of $D = 11$, only $\langle K_1, K_4 \rangle$ would be optimal with the same expected duration of 7. Finally, with a deadline of $D = 10$, it is only possible to use the deterministic classifier $K_4$ for an expected duration of 10. ◻

Algorithm 4 provides the pseudo-code implementation of the above method. The overall complexity of the algorithm is dominated by the nested for loops. The outer for loop executes $n$ times and the inner one $D$ times, hence the overall complexity is pseudo-polynomial, $O(nD)$, in the number of classifiers, assuming that they are pre-sorted, which can be done in $O(n \log n)$ time.
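The dynamic program of (12) and (13) can be sketched in Python as shown below. This is an illustrative sketch rather than a transcription of Algorithm 4: the function name `optimal_independent_cascade` and the `(name, C, P)` tuple layout are assumptions made here, execution times and the deadline are assumed to be non-negative integers, and every non-deterministic classifier is assumed to have $P < 1$. The sample data is taken from Example 7.

```python
import math


def optimal_independent_cascade(classifiers, deadline):
    """Minimum expected duration cascade of independent IDK classifiers.

    classifiers: list of (name, C, P); the last entry is the deterministic
                 classifier (P == 1.0). C and deadline are integers.
    Returns (expected duration, list of names in execution order), or
    (inf, []) when even the deterministic classifier misses the deadline.
    """
    *idk, det = classifiers
    det_ratio = det[1] / det[2]
    # Sort by C/P and discard classifiers whose ratio exceeds that of K_n.
    idk = sorted((k for k in idk if k[1] / k[2] <= det_ratio),
                 key=lambda k: k[1] / k[2])
    ks = idk + [det]
    n = len(ks)

    # E[k][d]: minimum expected duration using ks[k:], with sub-deadline d.
    E = [[math.inf] * (deadline + 1) for _ in range(n)]
    use = [[False] * (deadline + 1) for _ in range(n)]

    # Base case, equation (12).
    for d in range(det[1], deadline + 1):
        E[n - 1][d] = det[1]
        use[n - 1][d] = True

    # Recurrence, equation (13), filled from the last classifier backwards.
    for k in range(n - 2, -1, -1):
        _, c, p = ks[k]
        for d in range(deadline + 1):
            skip = E[k + 1][d]
            take = c + (1 - p) * E[k + 1][d - c] if d >= c else math.inf
            E[k][d] = min(skip, take)
            use[k][d] = take < skip

    if math.isinf(E[0][deadline]):
        return math.inf, []

    # Trace back which term of the min() was chosen at each step.
    cascade, d = [], deadline
    for k in range(n):
        if use[k][d]:
            cascade.append(ks[k][0])
            d -= ks[k][1]
    return E[0][deadline], cascade


example7 = [("K1", 1, 0.4), ("K2", 3, 0.9), ("K3", 2, 0.5), ("K4", 10, 1.0)]
print(optimal_independent_cascade(example7, 13))
```

When two cascades tie (as for $D = 12$ in Example 7), this sketch keeps the one that skips the earlier classifier; either choice has the same expected duration.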

Fully dependent IDK classifiers with a deadline
In this subsection, we consider the problem of determining an optimal IDK cascade, subject to a deadline constraint, given a set of n fully dependent IDK classifiers.
Since the classifiers are fully dependent, we observe that irrespective of the presence of a deadline, there is no benefit in executing a classifier with smaller probability $P_j$ after one with a larger probability $P_i > P_j$, since from Definition 2 (in Sect. 2), we have $p(K_j|K_i) = 0$. We therefore assume, without loss of generality, that the classifiers are ordered according to increasing values of their probability $P_i$, with the deterministic classifier $K_n$ with $P_n = 1$ last. This assumption can be realized for any problem instance by sorting, in $O(n \log n)$ time.
Similar to the independent case considered in Sect. 6.1, we again apply dynamic programming to determine an optimal IDK cascade of minimum expected duration, subject to the constraint that the maximum duration does not exceed $D$. The following definition is analogous to Definition 4.

Definition 5 Let $F(k, d)$ denote the minimum expected duration of an IDK cascade ending with classifier $K_k$ (and hence achieving a success probability equal to $P_k$) while also guaranteeing to meet a deadline $d$ for the following sub-problem of the problem instance specified in (11):

$$\langle (K_1, K_2, \ldots, K_k), d \rangle$$

That is, only the classifiers $K_1, K_2, \ldots, K_k$ are available, and there is a deadline of $d$.
Using the notation from Definition 5, the expected duration of the optimal IDK cascade for the complete problem instance specified in (11) is F(n, D). In other words where all n classifiers K 1 , K 2 , … , K n are available and the deadline is D.
Real-Time Systems (2023) 59:1-34

In building a solution we first look at the sub-problem where only classifier $K_1$ is available, and compute the values of $F(1, d)$ for all values of $d$. We have:

$$F(1, d) = \begin{cases} \infty & \text{if } d < C_1 \\ C_1 & \text{otherwise} \end{cases} \qquad (14)$$

In other words, if the deadline $d$ is smaller than the execution time $C_1$ of the classifier $K_1$, then the sub-problem instance is infeasible, represented by an expected duration of infinity. Otherwise, $d \ge C_1$ and so the optimal IDK cascade for the sub-problem comprises the classifier $K_1$, and hence has an expected duration of $C_1$. Now, assuming that we have already determined the values of $F(k-1, d')$ for all $d'$, we can compute the values of $F(k, d)$ for all $d$ as follows:

$$F(k, d) = \begin{cases} \infty & \text{if } d < C_k \\ \min\Big( C_k,\; \min_{1 \le i < k} \big\{ F(i, d - C_k) + (1 - P_i) \times C_k \big\} \Big) & \text{otherwise} \end{cases} \qquad (15)$$

where
- A deadline $d$ that is smaller than the execution time $C_k$ of the classifier $K_k$, required by the sub-problem instance to end the IDK cascade, indicates that the sub-problem is infeasible and so $F(k, d) = \infty$.
- The first term within the min() reflects the decision to use only classifier $K_k$, and hence the expected duration is equal to $C_k$.
- The second term within the min() reflects the decision to append classifier $K_k$ to some IDK cascade that ends with a classifier $K_i$, where $i < k$, i.e. $K_i$ appears earlier in the order of the classifiers than $K_k$, and hence $P_i < P_k$. Since classifier $K_k$, with execution time $C_k$, must be appended to this prefix IDK cascade, the deadline available for the prefix IDK cascade is reduced to $d - C_k$. The minimum expected duration of such a prefix IDK cascade is therefore given by $F(i, d - C_k)$. Further, as the classifiers are fully dependent and ordered by their increasing probabilities, the probability that the prefix IDK cascade ending in $K_i$ will fail to make a successful classification, and therefore that $K_k$ will need to execute, is given by $(1 - P_i)$. Hence the overall minimum expected duration when classifier $K_k$ is immediately preceded by $K_i$ is given by $F(i, d - C_k) + (1 - P_i) \times C_k$.
Since we are interested in the minimum expected duration of any IDK cascade ending in K k with duration not exceeding d, we take the minimum over all possible values of i, i.e. 1 ≤ i < k.
We note that the recurrence in (15) can be evaluated within a nested loop, with the outer loop iterating over values of k from 1 to n, and the inner loop iterating over values of d from 0 to D (Table 7).
As any valid IDK cascade must end with the deterministic classifier, $K_4$ in this case, the minimum expected duration for the optimal IDK cascade is given by $F(n, D) = F(4, 16) = 4.08$ for this example. Tracing back how that result was obtained, we see that it is for the IDK cascade $\langle K_1, K_3, K_4 \rangle$, with $K_2$ unused. We note that the optimal IDK cascade is, as expected, dependent on the length of deadline permitted. If the deadline were longer, $D \ge 18$, then all four classifiers could be used with an expected duration of 3.78, whereas with a shorter deadline of $D = 12$ the optimal IDK cascade would be $\langle K_1, K_2, K_4 \rangle$, with $K_3$ unused, and an expected duration of 4.1. Further, with a deadline of $D = 11$, only $\langle K_2, K_4 \rangle$ would be used for an expected duration of 4.6, and with a deadline of 9 or 10, only $\langle K_1, K_4 \rangle$ would be used for an expected duration of 5. ◻

Algorithm 5 provides the pseudo-code implementation of the above method. The overall complexity of the algorithm is dominated by the three nested for loops, hence the overall complexity is pseudo-polynomial, $O(n^2 D)$, in the number of classifiers, assuming that they are pre-sorted, which can be done in $O(n \log n)$ time.
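The recurrence in (14) and (15) can be sketched in Python as follows. This is a hedged illustration, not a transcription of Algorithm 5: the function name `optimal_dependent_cascade` and the `(name, C, P)` tuple layout are assumptions, and execution times and the deadline are assumed to be non-negative integers. The sample parameters are hypothetical; they are not taken from the source, and were chosen here only because they reproduce the expected durations quoted above (4.08, 3.78, 4.1, 4.6 and 5).

```python
import math


def optimal_dependent_cascade(classifiers, deadline):
    """Minimum expected duration cascade of fully dependent IDK classifiers.

    classifiers: list of (name, C, P) sorted by increasing P; the last entry
                 is the deterministic classifier (P == 1.0).
    Returns (expected duration, list of names in execution order), or
    (inf, []) if no cascade ending in the last classifier meets the deadline.
    """
    n = len(classifiers)
    # F[k][d]: minimum expected duration of a cascade ending in classifier k
    # with maximum duration at most d. prev[k][d] records the predecessor.
    F = [[math.inf] * (deadline + 1) for _ in range(n)]
    prev = [[None] * (deadline + 1) for _ in range(n)]

    for k in range(n):
        c_k = classifiers[k][1]
        for d in range(c_k, deadline + 1):   # F stays infinite for d < C_k
            best, arg = float(c_k), None     # first term: K_k on its own
            for i in range(k):               # second term: K_k appended after K_i
                cand = F[i][d - c_k] + (1 - classifiers[i][2]) * c_k
                if cand < best:
                    best, arg = cand, i
            F[k][d], prev[k][d] = best, arg

    if math.isinf(F[n - 1][deadline]):
        return math.inf, []

    # Walk the predecessor links back from the deterministic classifier.
    cascade, k, d = [], n - 1, deadline
    while k is not None:
        cascade.append(classifiers[k][0])
        k, d = prev[k][d], d - classifiers[k][1]
    return F[n - 1][deadline], cascade[::-1]


# Hypothetical parameters (assumption, see the lead-in).
sample = [("K1", 1, 0.5), ("K2", 3, 0.8), ("K3", 6, 0.99), ("K4", 8, 1.0)]
print(optimal_dependent_cascade(sample, 16))
```

For $k = 1$ the inner loop over $i$ is empty, so the sketch reduces to the base case (14) without a separate branch.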

Independent groups of fully dependent IDK classifiers with a deadline
In this subsection, we consider the problem of determining an optimal IDK cascade, subject to a deadline constraint, given multiple independent groups of fully dependent IDK classifiers, and a single deterministic classifier.
In Sects. 6.1 and 6.2, we used dynamic programming techniques to extend the polynomial time methods for obtaining optimal IDK cascades for (i) independent IDK classifiers (Sect. 3) and (ii) fully dependent IDK classifiers (Sect. 4) into pseudo-polynomial methods for solving these problems subject to a hard deadline constraint. However, for the problem considered in this subsection, we take a different approach.
In Sect. 5 we presented the method described by Algorithm 3 for obtaining optimal IDK cascades given multiple independent groups of fully dependent IDK classifiers. Recall that Algorithm 3 examines each possible subset of classifiers that could potentially comprise the optimal IDK cascade. This is controlled via an n-bit array B that is used to indicate which of the n classifiers will be included in a trial solution, i.e. B[j]==1 indicates that the j-th classifier is included and B[j]==0 indicates that it is not. Note, since the deterministic classifier is always included, B[n]==1.
Algorithm 3 is easily adapted to also account for a specified hard deadline $D$: we can simply discard any trial solution where the sum of the execution times of the classifiers used exceeds the deadline, i.e. $\sum_{k : B[k] = 1} C_k > D$. Since the deadline feasibility of each of the $2^{n-1}$ trial solutions can be evaluated in $O(n)$ time, the overall complexity remains $O(2^{n-1} n^2)$. Below we provide an illustrative example.

Example 9
The parameters of the classifiers used in this example are given in Table 9, and are the same as those used in Example 6 in Sect. 5. IDK classifiers $K_1$ and $K_2$, with parameters $(C_1 = 5, P_1 = 0.5)$ and $(C_2 = 9, P_2 = 0.8)$ respectively, are members of the same group, group 1 $= \{K_1, K_2\}$. IDK classifier $K_3$, with parameters $(C_3 = 10, P_3 = 0.75)$, is a member of a separate group, group 2 $= \{K_3\}$, and hence is independent of $K_1$ and $K_2$. Finally, $K_4$, with parameters $(C_4 = 20, P_4 = 1.0)$, is the deterministic classifier.
Assuming a deadline $D = 36$, trial solutions $\{K_1, K_2, K_3, K_4\}$ and $\{K_2, K_3, K_4\}$ are infeasible, since they have a maximum duration of 44 and 39 respectively. Examining all other trial solutions, it can be verified that the optimal IDK cascade compliant with the deadline is $\langle K_1, K_3, K_4 \rangle$ with an expected duration of $5 + 0.5 \times 10 + 0.125 \times 20 = 12.5$ and a maximum duration of 35. ◻

Algorithms 3, 4, and 5, which are used to determine optimal IDK cascades with a bounded maximum duration, have complexity of $O(2^{n-1} n^2)$, $O(nD)$, and $O(n^2 D)$ respectively. In cases where the run time of these algorithms becomes prohibitive, due to large numbers of classifiers or long deadlines, heuristic methods could potentially be employed, as discussed in Baruah et al. (2021). Such heuristics are not explored further here.
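The deadline filter that adapts Algorithm 3, as applied in Example 9, can be sketched in Python as follows. Only the feasibility test is shown; the expected duration of each surviving trial solution would still be computed as in Sect. 5. The function name `split_by_deadline` is hypothetical, and classifiers are identified by their 1-based index.

```python
from itertools import product


def split_by_deadline(exec_times, deadline):
    """Partition the 2^(n-1) trial subsets by deadline feasibility.

    exec_times: list of execution times C_1..C_n, with C_n belonging to the
                deterministic classifier, which is always included.
    Returns (feasible, infeasible), each a list of (subset, max_duration)
    pairs where subset holds the 1-based indices of the classifiers used.
    """
    n = len(exec_times)
    feasible, infeasible = [], []
    for bits in product([0, 1], repeat=n - 1):
        B = list(bits) + [1]                      # B[n] == 1 always
        max_duration = sum(c for c, b in zip(exec_times, B) if b)
        subset = [i + 1 for i, b in enumerate(B) if b]
        if max_duration > deadline:
            infeasible.append((subset, max_duration))   # discard this trial
        else:
            feasible.append((subset, max_duration))
    return feasible, infeasible


# Execution times from Example 9, deadline D = 36.
feasible, infeasible = split_by_deadline([5, 9, 10, 20], 36)
print(infeasible)
```

Each feasibility check is a single pass over B, which is why the filter adds only linear work per trial solution.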

Context and conclusions
Learning-enabled components, particularly those based on Deep Neural Networks (DNNs), are increasingly being used in safety-critical real-time systems. It is imperative that the real-time scheduling theory community respond to this development by coming up with appropriate techniques to enable the offline analysis of systems that use such components.
In this work, we have adapted and applied algorithmic techniques from real-time scheduling theory to a proposed DNN use-case (Wang et al. 2018) that seeks to strike a balance between accuracy and timeliness by arranging individual DNN-based classifiers, augmented by the ability to classify inputs as belonging to an additional IDK class, into IDK cascades. The intuition behind the design of such IDK cascades is simple yet elegant: if the earlier classifiers in an IDK cascade can successfully identify simple-to-classify inputs, then later more sophisticated and hence slower classifiers need only be invoked rarely on truly challenging inputs. We were able to formalize this intuition, and from that develop algorithms that synthesize optimal IDK cascades from a given set of classifiers, both when the sole objective is optimizing expected duration and when there is additionally a hard deadline constraint. We were able to provide optimal algorithms for collections of IDK classifiers that are, respectively, (i) independent, (ii) fully dependent, and (iii) a mix of the two.
The research presented in this paper can be viewed as a proof of concept of the principle that real-time scheduling can contribute to better design of real-time systems that use learning-enabled components, and it behoves us, as a community, to take a closer look at such systems. Future research into the scheduling of IDK classifiers could investigate the complexity of the problem from a theoretical perspective. For example, is the problem of determining an optimal IDK cascade with a deadline constraint NP-hard? Are there more efficient solutions to the problem of determining an optimal IDK cascade with independent groups of dependent classifiers with no deadline constraint than the one presented in this paper? Finally, this paper considered only the sequential execution of IDK classifiers, effectively the single processor case. There is a potentially rich area of associated research that can be undertaken focusing on similar problems in the multiprocessor case, where IDK classifiers can execute in parallel.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Yue Wu is a graduate student in the Department of Mathematics at the University of Washington, Seattle, where she is working towards a PhD in Mathematics. She holds a bachelor's degree in computer science from Washington University in St. Louis, with a second major in mathematics. Her current research interests are in optimization and theoretical computer science.