A stochastic approach to handle resource constraints as knapsack problems in ensemble pruning

Ensemble-based methods are highly popular approaches that increase the accuracy of a decision by aggregating the opinions of individual voters. The common point is to maximize accuracy; however, a natural limitation occurs if incremental costs are also assigned to the individual voters. Consequently, we investigate creating ensembles under an additional constraint on the total cost of the members. This task can be formulated as a knapsack problem, where the energy is the ensemble accuracy formed by some aggregation rules. However, the generally applied aggregation rules lead to a nonseparable energy function, which takes the common solution tools—such as dynamic programming—out of action. We introduce a novel stochastic approach that considers the energy as the joint probability function of the member accuracies. This type of knowledge can be efficiently incorporated in a stochastic search process as a stopping rule, since we have the information on the expected accuracy or, alternatively, the probability of finding more accurate ensembles. Experimental analyses of the created ensembles of pattern classifiers and object detectors confirm the efficiency of our approach over other pruning ones. Moreover, we propose a novel stochastic search method that better fits the energy, which can be incorporated in other stochastic strategies as well.


Introduction
Ensemble-based systems are rather popular in several application fields and are employed to increase the decision accuracy of individual approaches. We also encounter such approaches for pattern recognition purposes (Lam and Suen 1997), using models based on, e.g., neural networks (Cho and Kim 1995;Hansen and Salamon 1990), decision trees (Kong and Dietterich 1995) or other principles (Ho et al. 1994;Huang and Suen 1995;Xu et al. 1992). In the most recent results, we can recognize this approach in the design of state-of-the-art convolutional neural networks (such as GoogLeNet, incorporating the Inception module Szegedy et al. 2015) or the direct combination of them (Harangi et al. 2018). In our practice, we also routinely consider ensemble-based approaches to aggregate the outputs of pattern classifiers (Antal and Hajdu 2014) or detector algorithms (Antal and Hajdu 2012), usually by some majority voting-based rule. During these efforts, we have also faced perhaps the most negative property of creating ensembles, that is, the increasing demand on resources. This type of cost may occur as the execution/training time and the working hours needed to create the ensemble components, etc., according to the characteristics of the given problem. Thus, in addition to the primary aim of the composition of the most accurate ensemble, a natural constraint emerges as a cost limitation for that endeavor.
The current literature mostly refers to the selection of an efficient ensemble from a pool of possible members as ensemble pruning (Zhou 2012). Even if no resource constraints are applied, a subset of possible ensemble members may lead to better performance than selecting all the members. Moreover, the best strategy is to compose an ensemble having such good performing members which also have diverse behavior. To realize this approach, ensemble pruning methods can be categorized in the following three main groups: ordering-, clustering-, and optimization-based pruning. Ordering-based pruning ranks the individual members according to some criterion, and the most highly ranked ones are put into the final ensemble. Clustering-based pruning aims to identify representative prototype individual members to compose the ensemble, while the optimization-based approach sets up an objective function and forms a subset of members by minimizing or maximizing it. Beyond the clustering-based approaches, but still as a stochastic one we can mention (Hernández-Lobato et al. 2009), as an instance-based pruning method, where the members are selected for each instance separately. Double-pruning can also be executed based on Hernández-Lobato et al. (2009) as proposed in Soto et al. (2010), which approach has similar motivation to ensemble distilling (a.k.a. compression) to reduce the size of the ensemble/learning model (Bucilu et al. 2006;Hinton et al. 2015). There are efforts to complement the basic ensemble pruning models to consider possible resource constraints like training/test execution times or memory/storage space (Bucilu et al. 2006;Hinton et al. 2015) as well. To reach this aim a popular approach is to apply multi-objective evolutionary algorithms, like NSGA-II (Deb et al. 2002). NSGA-II is an elitist algorithm that provides fast nondominated sorting and considers density estimation and crowded-comparison to maintain diversity. These positive properties can be exploited in ensemble pruning like in Mousavi and Eftekhari (2015). As the best fits to our single-objective setup we have implemented a general purpose genetic algorithm from Goldberg (1989) (chapters 2-3), and a boosting-based pruning one from Martinez-Munoz and Suarez (2007). In our comparative analyses we will refer to them as Genetic and Pruning, respectively.
In this work, we analyze a single-constraint task on the resources to compose the most accurate ensemble regarding the energy formed by majority voting as the aggregation rule like in Hernández-Lobato et al. (2009). The constraint we consider corresponds to the training time; however, any other type of resources could be considered. We introduce a novel, theoretically well-founded stochastic approach that considers the energy as the joint probability function of the member accuracies. As our main contribution, we show that this type of knowledge can be efficiently incorporated in any stochastic search process as a stopping rule, since we have the information on the expected accuracy or, alternatively, the probability of finding more accurate ensembles. Our empirical analyses also show that including the stochastic estimation as a stopping rule saves a large amount of search time to build accurate ensembles.
We formulate the resource constraint as a knapsack problem, which provides the opportunity of a precise constraint prescription instead of a simple good price/value expectation considered e.g. in Bucilu et al. (2006), Hernández-Lobato et al. (2009), Hinton et al. (2015) to have small, but relatively accurate ensembles. Basically, we follow an orderingbased approach combined with stochastic sampling to compose the ensembles; however, additionally as a novel contribution we suggest a new heuristics for that. Namely, besides its individual accuracy and cost, we calculate such a usefulness value for each possible member during the selection process that reflects its direct behavior according to the objective function, which is based on the majority voting rule in our case. As we will see, our novel stochastic search method is proven to be very competitive with simulated annealing (SA) (Du and Swamy 2016), and the pruning methods (Goldberg 1989;Martinez-Munoz and Suarez 2007). Also, the proposed heuristic can be successfully inserted into these general stochastic search strategies. Our approach was first proposed in our former work (Hajdu et al. 2016) with limited empirical evaluations; however, only heuristic results were achieved there without being able to take advantage of the theoretical model presented here.
The rest of the paper is organized as follows. In Sect. 2, we introduce the basic concepts and notation to formulate our constrained enesemble pruning task as a knapsack problem, where majority voting is applied for the aggregation of member opinions. Section 3 analyzes the maximum accuracy of the common deterministic ensemble creation strategies in the case of limited total cost. Though the cost may relate to any resource demand, in our scenarios it describes the training times of the classifiers that are considered as possible ensemble members. Existing stochastic approaches are described in Sect. 4 with some preliminary simulation results. Moreover, we introduce a novel stochastic search algorithm that determines the expected usefulness of possible members in a way that adapts to the characteristics of the energy function better than other stochastic search methods, e.g., SA. The stochastic estimation of the ensemble energy from the individual accuracies of the components is presented in Sect. 5. 1 Our experimental analyses are enclosed in Sect. 6, including the investigation of the possible creation of ensembles from participating methods of Kaggle challenges and binary classification problems in UCI databases. 2 We also present how the proposed model is expected to be generalized to multiclass classification tasks with a demonstrative example on our original motivating object detection problem. Finally, in Sect. 7, we discuss generalization possibilities and several issues regarding our approach that can be tuned towards special application fields.

3 2 Basic concepts and notation
Let us consider a pool D = {D 1 , … , D n } containing possible ensemble members, where each member D i (i = 1, … , n) is characterized by a pair (p i , t i ) describing its individual accuracy p i ∈ [0, 1] and cost t i ∈ ℝ >0 . The individual accuracies are supposed to be known, e.g., by determining them on some test data and by an appropriate performance metric. In this work, we will focus on the majority voting-based aggregation principle, where the possible ensemble members D i (i = 1, … , n) are classifiers (see Kuncheva 2004). In , we have dealt with the classic case in which the individual classifiers make true/false (binary) decisions. In this model, a classifier D i with accuracy p i is considered as a Bernoulli distributed random variable i , that is, , where i = 1 means the correct classification by D i . In this case, we obtain that the accuracy of an ensemble As an important practical issue, notice that (1) is valid only for independent members to calculate the ensemble accuracy. The dependency of the members can be discovered further by, e.g., using different kinds of diversity measures .
Regarding ensemble-based systems, the standard task is to devise the most accurate ensemble from D for the given energy function. In this paper, we add a natural constraint of a bounded total cost to this optimization problem. That is, we have to maximize (1) under the cost constraint where the total allowed cost T ∈ ℝ >0 is a predefined constant. Consequently, we must focus on those subsets L ⊆ N with cardinalities |L| = ∈ {1, … , n} for which (2) is fulfilled. Let L 0 denote that index set of cardinality | | L 0 | | = 0 , where the global maximum ensemble accuracy is reached. The following lemma states that one can reach L 0 calculating q (L) for odd values of only, which results in more efficient computation, since not all the possible subsets should be checked.

Lemma 1 If
Proof See Appendix 1 for the proof. ◻ The optimization task defined by the energy function (1) and the constraint (2) can be interpreted as a typical knapsack problem (Martello and Toth 1990). Such problems are known to be NP-hard; however, if the energy function is linear and/or separable for the p i 1 3 -s, then a very efficient algorithmic solution can be given based on dynamic programming. However, if the energy lacks these properties, the currently available theoretical foundation is rather poor. As some specific examples, we were able to locate investigations of an exponential-type energy function (Klastorin 1990), and a remarkably restricted family of nonlinear and nonseparable ones (Sharkey et al. 2011). In Klastorin (1990), an approach based on calculus was made by representing the energy function by its Taylor series. Unfortunately, it has been revealed that dynamic programming can be applied efficiently only to at most the quadratic member of the series; thus, the remaining higher-order members had to be omitted. This compulsion suggests a large error term if this technique is attempted to be considered generally. Thus, to the best of our knowledge, there is a lack of theoretical instructions/general recommendations to solve knapsack problems in the case of complex energy functions. As our energy (1) is also nonlinear and nonseparable, we were highly motivated to develop a well-founded framework for efficient ensemble creation. As our main contribution, in this paper, we propose a novel stochastic technique to solve knapsack problems with complex energy functions. Though the model is worked out in detail for (1) settled on the majority voting rule, it can be applied also to other energy functions. Our approach is based on the stochastic properties of the energy q in (1), providing that we have some preliminary knowledge on which distribution its parameters p i (i = 1, … , n) are coming from. We put a special focus on beta distributions that fit practical problems very well. In other words, we estimate the distribution of q in terms of its mean and variance. This information can be efficiently incorporated as a stopping rule in stochastic search algorithms, as we demonstrate it e.g. for SA. The main idea here is to be able to stop building ensembles when we can expect that better ones can be found by low probability only.
As a common empirical approach to find the optimal ensemble, the usefulness p i ∕t i ( i = 1, … , n ) of the possible members are calculated. Then, as deterministic greedy methods, the ensemble is composed of forward/backward selection strategies (see, e.g., Kurz et al. 2013). Since the deterministic methods are less efficient-e.g., the greedy one is proven to have 50% accuracy for the simplest knapsack energy ∑ n i=1 p i -popular stochastic search algorithms are considered instead, such as SA. As a further contribution, we introduce a novel stochastic search strategy, where the usefulness of the components is defined in a slightly more complex way to better fit the investigated energy; the stopping rule can be successfully applied in this approach as well. For the sake of completeness, we will start our theoretical investigation regarding the accuracy of the existing deterministic methods when a cost limitation is also applied.

Deterministic selection strategies
In this section, we address deterministic selection strategies to build an ensemble that has maximal system accuracy q 0 (L 0 ) , applying the cost limitation. However, since we have 2 n different subsets of elements of a pool of cardinality n, this selection task is known to be NP-hard. To overcome this issue, several selection approaches have been proposed. The common point of these strategies is that in general, they do not assume any knowledge on the proper determination of the classification performance q (L) ; rather, they require only the ability to evaluate it. Moreover, to the best of our knowledge, strategies that consider the capability of individual feature accuracies to be modeled by drawing them from a probability distribution, as in our approach, have not yet been proposed. Based on the above discussion, it seems to be natural to ascertain how the widely applied selection strategies work in our setup. The main difference in our case, in contrast to the general recommendations, is that now we can properly formulate the performance evaluation using the exact functional knowledge of q . That is, we can characterize the behavior of the strategy with a strict analysis instead of the empirical tests generally applied.
We start our investigation with greedy selection approaches by discussing them via the forward selection strategy. Here, the most accurate item is selected and put in a subset S first. Then, from the remaining n − 1 items, the component that maximizes the classification accuracy of the extended ensemble is moved to S. This procedure is then iteratively repeated; however, if the performance cannot be increased by adding a new component, then S is not extended and the selection stops. The first issue we address is to determine the largest possible error this strategy can lead to in our scenario.

Proposition 1
The simple greedy forward selection strategy to build an ensemble that applies the majority voting-based rule has a maximum error rate 1/2.
Proof For the proof, see Appendix 2. ◻ As seen from the proof, the error rate of 1/2 holds for the forward strategy independent of the time constraint. As a quantitative example, let p 1 = 0.510 and p 2 = p 3 = p 4 = p 5 = 0.505 . With this setup, where I k = {1, … , k} , we have q 1 (I 1 ) = p 1 = 0.5100 , q 3 (I 3 ) = 0.5100 , and q 5 (I 5 ) = 0.5112 , which shows that the greedy forward selection strategy is stuck at the single element ensemble, though a more accurate larger one could be found.
In addition to forward selection, its inverse variant, the backward selection strategy, is also popular. It puts all the components into an ensemble first, and in every selection step, leaves the worst one out to gain maximum ensemble accuracy. As a major difference from the forward strategy, backward selection is efficient in our case if the time constraint is irrelevant. Namely, either the removal of the worst items will lead to an increase in q defined in (1), or the selection can be stopped without the risk of missing a more accurate ensemble.
However, if the time constraint applies, the same maximum error rate can be proved.

Proposition 2
The simple greedy backward selection strategy considering the individual accuracy values to build an ensemble that applies the majority voting-based rule has a maximum error rate of 1/2.
Proof Proposition 2 is proved in Appendix 3. ◻ Propositions 1 and 2 have shown the worst-case scenarios for the forward and backward selection strategies. However, the greedy approach was applied only regarding the accuracy values of the members, and their execution times were omitted. To consider both the accuracies and execution times of the algorithms in the ensemble pool D = {D 1 = (p 1 , t 1 ), D 2 = (p 2 , t 2 ), … , D n = (p n , t n )} , we consider their usefulness in the selection strategies, defined as which is a generally used definition to show the price-to-value ratio of an object. The composition of ensembles based on similar usefulness measures has also been efficient, e.g., in sensor networks (Kurz et al. 2013). (4) After the introduction of the usefulness (4), the first natural question to clarify is to investigate whether the validity of the error rate of the deterministic greedy forward and backward selection strategies operating with the usefulness measure holds. Through the following two statements, we will see that the 1/2 error rates remain valid for both greedy selection approaches.
Corollary 1 Proposition 1 remains valid when the forward feature selection strategy operates on the usefulness. Namely, as a worst-case scenario, let t 1 = t 2 = ⋯ = t n = T∕n be the execution times in the example of the proof of Proposition 1 while keeping the same p 1 , p 2 , … , p n values. Then, the selection strategy operates completely in the same way on the u i = p i ∕t i values (i = 1, … , n) as on the p i ones, since the t i values are equal. That is, the error rate is 1/2 in the same way.

Proposition 3
The simple greedy backward selection strategy considering the individual usefulness (4) to build an ensemble that applies the majority voting-based rule has a maximum error rate of 1/2.

Proof
The proof is provided in Appendix 4. ◻ We note the analogy between forward and backward selection strategies and approximation algorithms, which find approximate solutions to optimization problems, in particular NP-hard ones. With this respect the most common expectation is to have provable guarantees on the distance of the returned solution to the optimal one (Williamson and Shmoys 2011). With Propositions 1, 2, 3 and Corollary 1 we exactly address this expectation by giving the worst-case scenarios for these selection strategies. A wider theoretical characterization (regarding e.g. the expected error) would need exhaustive knowledge about the specific member accuracies and cost values. However, in the later chapters we will present many empirical results for these greedy approaches in several scenarios. These results also suggest that a deeper error analysis is expected to be interesting from mainly a theoretical point of view, because other existing and the proposed approaches show remarkably better performance.
The main problem with the above deterministic procedures is that they leave no opportunity to find better performing ensembles. Thus, we move on now to the more dynamic stochastic strategies. Keep in mind that since in our model the distribution of q will be estimated, in any of the selection strategies we can exploit this knowledge as a stopping rule. Namely, even for the deterministic approaches, we can check whether the accuracy of the extracted ensemble is already attractive or whether we should continue and search for a better one.

Stochastic search algorithms
As the deterministic selection strategies may have poor performance, we investigate stochastic algorithms to address our optimization problem. Such randomized algorithms, where randomization only affects the order of the internal executions, produce the same result on a given input, which can cause the same problem we have found for the deterministic ones. In case of Monte Carlo (MC) algorithms (Tempo and Ishii 2007), the result of the simulations might change, but they produce the correct result with a certain probability.

3
The accuracy of the MC approach depends on the number of simulations N; the larger N is, the more accurate the algorithm will be. It is important to know how many simulations are required to achieve the desired accuracy. The error of the estimate of the probability failure is found to be u 1− ∕2

√
(1 − P f )∕NP f , where u 1− ∕2 is the 1 − ∕2 quantile of the standard normal distribution, and P f is the true value of the probability of failure. Simulated annealing (SA), as a variant of the Metropolis algorithm, is composed of two main stochastic processes: generation and acceptance of solutions. SA is a generalpurpose, serial search algorithm, whose solutions are close to the global extremum for an energy function within a polynomial upper bound for the computational time and are independent of the initial conditions.
To compare the MC method with SA for solving a knapsack problem, we applied simulations for that scenario in which the deterministic approaches failed to find the most accurate ensemble, that is, when D 1 = (1 − , T) , and D 2 = D 3 = … = D n = (1∕2 + , T∕n) with 0 < < 1∕2, 0 < < 1∕2 . For this setup, we obtained that the precision of the MC method was only 11% , while SA found the most accurate ensemble in 96% of the simulations. Beyond SA, other pruning methods cited in the introduction are naturally based on stochastic methods. Now, we introduce a novel search strategy that takes better advantage of our stochastic approach than, e.g., SA. This strategy builds ensembles using a randomized search technique and introduces a concept of usefulness for member selection, which better adapts to the ensemble energy than the classic one (4). Namely, in our proposed approach, the selection of the items for the ensemble is based on the efficiency of the members determined in the following way: for the i-th item with accuracy p i and execution time t i , the system accuracy q(p i , t i ) of the ensemble containing the maximal number of i-th items characterizes the efficiency (usefulness) of the i-th item, instead of (4).
A greedy algorithm for an optimization problem always chooses the item that seems to be the most useful at that moment. In our selection method, a discrete random variable depending on the efficiency values of the remaining items is applied in each step to determine the probability of choosing an item from the remaining set to add to the ensemble. Namely, in the k-th selection step, if the items i 1 , … , i k−1 are already in the ensemble, then the efficiency values q (k−1) (p i , t i ) of the remaining items are updated to the maximum time The i-th item is selected as the next member of the ensemble with the following probability: where i, j ∈ N�{i 1 , … , i k−1 } . This discrete random variable reflects that the more efficient the item is, the more probable it is to be selected for the ensemble in the next step.
As a new contribution, we have incorporated these probabilities based on the newly introduced member efficiencies first to SA characterizing the acceptance probabilities by them. Same considerations apply to any other stochastic search methods. Besides SA and the genetic , algorithm (Goldberg 1989), these novel stochastic methods (denoted by SA+ and Genetic+) will be compared with our proposed method in the experimental analyses. If t i > T k for all i ∈ N�{i 1 , … , i k−1 } , then our stochastic process ends for the given search step since there is not enough remaining time for any items. Then, we restart the process to extract another ensemble in the next search step. As a formal description of our proposed stochastic search method SHErLoCk, see Algorithm 1; notice that we evaluate the accuracy of ensembles with odd cardinalities only as in Lemma 1. A very important issue regarding both our approach and other search methods (e.g. SA) is the exact definition of the number of search steps, that is, a meaningful STOP parameter-and also an escaping MAXSTEP onefor Algorithm 1. In our preliminary work (Hajdu et al. 2016), we have already tested the efficiency of our approach; however, we tested it empirically with an ad hoc stopping rule. Now, in the forthcoming sections, we present how the proper derivation of the stopping parameters (STOP and MAXSTEP) can be derived.

3 5 Stochastic estimation of ensemble energy
We need to examine and characterize the behavior of q in (1) to exploit these results to find and apply the proper stopping criteria in stochastic search methods.
Let p ∈ [0, 1] be a random variable with mean p and variance 2 p , where p i (i = 1, 2, … , n) are independent and identically distributed according to p, i.e., a sample. Furthermore, let q and 2 q denote the mean and variance of the ensemble accuracy q , respectively. In this case, it is seen that p ≤ 1 and a simple calculation shows that The mean q is monotonic in , except for the case p = 1∕2 . Moreover, we get 0, 1/2 or 1 for the limit of q if → ∞ in case p < 1∕2 , p = 1∕2 , p > 1∕2 , respectively. As a demonstrative example, see Fig. 1 regarding the three possible accuracy limits (0, 1/2 or 1) described in (43) with respective Beta( p , p ) distributions for p. The parameters of the beta distribution ( p , p ) can be chosen arbitrarily to fulfil the condition p < 1∕2 , p = 1∕2 , p > 1∕2 , respectively. Furthermore, the variance 2 q can be expressed by the mean p and variance 2 p , where its limit lim →∞ 2 q = 0 , if p ≠ 1∕2 and s T = 2 p + 2 p ≠ 1∕2 and 1, otherwise. For the exact formula for 2 q and for the proof of these statements, see Lemma 2 in Appendix 5. Notice that the condition → ∞ naturally assumes the same for the pool size with n → ∞. Now, to devise a stochastic model, we start with checking the possible distributions of the member accuracy values p i to estimate the ensemble accuracy. Then, we extend our model regarding this estimation by incorporating time information as well. Notice that the estimation of the ensemble accuracy will be exploited to derive a stopping rule for the ensemble selection process. 1 3

Estimation of the distribution of member accuracies
Among the various possibilities, we have found that the beta distribution is a very good choice to analyze the distribution of member accuracies. The main reason is that beta concentrates on the interval [0, 1], that is, it can exactly capture the domain for the smallest/ largest accuracy. Moreover, the beta distribution is able to provide density functions of various shapes that often appear in practice. Thus, to start the formal description, let the variate p be distributed as Beta p , p with density In this case, and p ∈ (1∕2, 1) if and only if p > p . If p = p , then p = 1∕2 . In the case of p > p , the mode is also greater than 1/2. The mode is infinite if p < 1 ; therefore, we exclude this situation and we assume from now on that The variance of p is Since q , and 2 q depend on p , and 2 p according to (7) and (44) respectively, one can calculate both of them explicitly. The convergence of q to 1 is fast if p is close to 1, i.e., p ≪ p ; for instance, if p = 17 , p = 5 . Simulations show that the speed of the convergence of 2 q is exponential; hence, the usual square-root law does not provide the Central Limit Theorem for q .
In practice, we perform a beta fit on the p i 's (i = 1, … , n) . If a fit is found at least at the confidence level 0.95, we take the parameters p , p provided by the fit and calculate p , 2 p by (9) and (11), respectively. If the beta fit is rejected, then p and 2 p are estimated from the p i 's as the empirical mean and variance: To simplify our further notation we do not indicate whether the mean and variance have been estimated from the fitted distribution or empirically.

Adding time constraints to the model
Now, we turn to the case when together with the item accuracy p i , we consider its running time t i as well. The common distribution of a random time is exponential, so let be an exponential distribution with density exp (− t) . If p is distributed as Beta p , p , then with setting = 1 − p for a given p, the distribution of becomes Beta p , p . This is a reasonable behavior of time because it is quite natural to assume that more accurate components require more resources such as a larger amount of computation times. On the other hand, the selection procedure becomes trivial, if, e.g., the time and accuracy are inversely proportional, since then the most accurate member is also the fastest one; therefore, it should be selected first by following this strategy for the remaining members until reaching the time limit. For some other possible simple accuracy-time relations, see our preliminary work (Hajdu et al. 2016).
For a given time constraint T, consider the random number T such that We provide an estimation for the expected size of the composed ensemble ̂ in Lemma 3 (see Appendix 6) and we incorporate this information into our stochastic characterization of q . Till this point, we have assumed that p is distributed as beta to calculate ̂ T by Lemma 3. If this is not the case, we consider the following simple and obvious calculation for the approximate number of under the time constraint T: another alternative to derive ̂ T in this case is discussed in Sect. 7. In either way it is derived, the value ̂ T will be used in the stopping rule in our ensemble selection procedure; the proper details will be given next.

Stopping rule for ensemble selection
The procedure of finding L 0 , 0 is a selection task that is NP-hard. We propose an algorithm such that we stop the selection when the value of q (L) is sufficiently close to the possible maximum, which is not known. To be able to do so, we must give a proper stochastic characterization of q by also settling on the calculation of q and 2 q via Lemma 2. First, notice that the values of q are in (0, 1) ; indeed, it is positive and For the case when p i 's are beta distributed, the product of independent beta variates can be close to beta again (see Tang and Gupta 1984). We have also performed MC simulation and found that beta distributions fit q particularly well, compared to, e.g., the gamma, normal, Weibull, and extreme-valued distributions. Specifically, though the beta behavior of q was naturally more stable for beta distributed p i 's, the usual behavior of q was also the same for non-beta p i 's.
Thus, to provide a description of the stochastic behavior of q, we consider the following strategy. With a primary assumption on the Beta( q , q ) distribution of q , we calculate q and q as If time information is provided for the pool items, we calculate ̂ T by Lemma 3, and as a simpler notation, we will write ̂ from now on. If time information is not available, we will set ̂ = n.
Next, we decide whether q should be considered as beta with requiring 1 < q < q to be fulfilled to have a mode that is larger than 1/2 and finite. If this condition does not hold, we reject the beta behavior of q , and based on simulations, we characterize it as a normal distribution and stop the search if where 0.9 is the 0.9 quantile of the standard normal distribution. Otherwise, when q is considered beta, we calculate the mode of Beta( q , q ) for q as and the Pearson's first skewness coefficient as Then, we use Table 1 to select the appropriate probability value q̂ ; the entries are determined by simulation in the case of 2 ≤ q < q .
We stop the selection when the ensemble accuracy reaches the value of the inverse cumulative distribution F −1 q , q ( q̂ ) of Beta( q , q ) in the given probability, that is, when In either via (17) or (20), an estimation for the ensemble accuracy is gained; we obtain a STOP value to stop the stochastic search. However, there is some chance that STOP is not exceeded, though in our experiments it has never occurred. Thus, to avoid an infinite loop, we consider a maximum allowed step number MAXSTEP as an escaping stopping rule. Namely, to obtain MAXSTEP, we apply Stirling's approximation assuming that ̂ ∕n → 0 . This is a reasonable approach since ̂ is calculated according to Lemma 3 or (14). The formal description of our proposed ensemble selection method is enclosed in Algorithm 2.
Notice that neither Algorithm 1 nor Algorithm 2 considers freely adjustable parameters beyond the input (p i , t i ) pairs and the total allowed time T. The derivation of all estimated distribution parameters and the stopping related ones are properly referred to in the bodies of the algorithms. If no time condition is provided then any preferred unconstrained algorithm (e.g. the SA, Genetic Goldberg 1989 or Pruning Martinez-Munoz and Suarez 2007 one) can be used as handled by line 17 in Algorithm 2. Similarly, the output (line 19) of Algorithm 2 is the composed ensemble found by either our proposed method or any other preferred one again. Either time condition is provided or not, all the ensemble creator approaches can take advantage of the calculated stopping parameter STOP.
Before providing our detailed empirical results in Sect. 6, in Table 2 we summarize our findings for Algorithm 2 on simulations. Namely, in two respective demonstrative tests with i = 1, … , 30 and i = 1, … , 100 , we have generated the p i 's to come from the same example Beta(17, 5) as before and the execution times t i from conditional exponential distributions with parameters = 1 − p i . The time constraint T was set in seconds to 30% of the total time ∑ 30 i=1 t i for the first, and 20% of ∑ 100 i=1 t i for the second test. Both tests were repeated 100 times, and we have taken the averages of the obtained precisions. The parameters of the beta distribution for the simulations can be chosen arbitrarily to fulfil the condition 1 < p < p derived in Sect. 5.1. As our primary aim, we have checked whether the stopping rule of the stochastic search indeed led to a reasonable computational gain. For the sake of completeness, in Table 2 we have also shown the results for these simulated pairs (p i , t i ) regarding letting the search continue in the long run (stopped by MAXSTEP), though in each of our tests, the STOP value has been exceeded much earlier. Secondarily, we have compared SA with our selection method SHErLoCk given in Algorithm 1.
For Table 2, we can conclude that applying our stopping rule by using STOP saved considerable computational time compared with the exhaustive search that culminated by stopping it with MAXSTEP with a negligible drop in accuracy. Moreover, our approach has found efficient ensembles quicker than SA. These impressions have also been confirmed by the empirical evaluations on real data described in the next section.

Empirical analysis
In this section, we demonstrate the efficiency of our models through an exhaustive experimental test on publicly available data. Our first experiment considers the possibility of organizing competing approaches with different accuracies into an ensemble. In this scenario, accuracy values correspond to final scores of participants of Kaggle 3 challenges without cost/time information provided. Our second setup for ensemble creation considers machine learning-based binary classifiers as possible members; the performance evaluation is performed on several UCI Machine Learning Repository (Dheeru and Karra Taniskidou 2017) datasets with the training times considered as costs.

Kaggle challenges
Kaggle is an open online platform for predictive modeling and analytics competitions with the aim of solving real-world machine learning problems provided by companies or users. The main idea behind this crowd-sourcing approach is that a countless number of different strategies might exist to solve a specific task, and it is not possible to know beforehand which one is the most effective. Though primarily only the scores of the participating algorithms can be gathered from the Kaggle site, as a possible future direction, we are curious regarding whether creating ensembles from the various strategies could lead to an improvement regarding the desired task. Not all the Kaggle competitions are suitable to test our models since in the current content, we focus on majority voting-based ensemble creation. Consequently, we have collected only such competitions and corresponding scores where majority voting-based aggregation could take place. More precisely, we have restricted our focus only to such competition metrics based on which majority voting can be realized. Such metrics include quadratic weighted kappa, area under the ROC curve (AUC), log loss, normalized Gini coefficient. For concrete competitions where these metrics were applied, we analyze the following ones: Diabetic Retinopathy Detection, 4 DonorsChoose.org Application Screening, 5 Statoil/C-CORE Iceberg Classifier Challenge, 6 WSDM -KKBox's Churn Prediction Challenge, 7 and Porto Seguro's Safe Driver Prediction. 8 For our analytics, on the one hand it is interesting to observe the distribution of the final score of the competitors, which is often affected by the volume of the prize money offered to the winner. Moreover, accuracy measurement is usually scaled to the interval [0, 1], with 0 for the worst and 1 for the perfect performance, which allows us to test our results regarding the beta distributions. As a drawback of Kaggle data, access to the resource constraints corresponding to the competing algorithms (e.g., training/execution times) is rather limited; such data are provided for only a few competitions, primarily in terms of execution time interval.
Thus, to summarize our experimental setup, we interpret the competing solutions of a Kaggle challenge as the pool {D 1 , D 2 , … , D n } , where the score of D i is used for the accuracy term p i ∈ [0, 1] in our model. Then, we apply a beta fit for each investigated challenge to determine whether a beta distribution fits the corresponding scores or not. If the test is rejected, we can still use the estimation for the joint behavior q using (12) and (14). If the beta test is accepted, we can also apply our corresponding results using (9), (11), and Lemma 3. Notice that reliably fitting a model for the scores of the competitors might lead to a better insight of the true behavior of the data of the given field, also for the established expectations there. As observed from Table 3, SA was able to stop much earlier with a slight loss in accuracy using the suggested stopping rule (STOP) in finding the optimal ensemble. Our approach SHErLoCk given in Algorithm 1 has been excluded from this analysis since no cost information was available. Though for the lack of cost information our stochasticsbased results can be applied only partially to Kaggle challenges with distribution fitting and suggesting stopping criterion accordingly, we were highly motivated to include this platform as well. Kaggle collects a huge number of different approaches-sometimes also with available implementations-for the same task, so is an excellent platform to create ensembles. Moreover, several accuracy measures considered in these challenges are just completely suitable for stochastic analysis, just like in our approach. If resource information is also provided in the future for the submitted solutions, then our corresponding results also become applicable.

Binary classification problems
The UCI Machine Learning Repository (Dheeru and Karra Taniskidou 2017) is a popular platform to test the performances of machine learning-based approaches, primarily for classification purposes. A large number of datasets are made publicly available here among which our models can be tested on binary classification ones. That is, in this experiment, the members D 1 , D 2 , … , D n of a pool for ensemble creation are interpreted as binary classifiers, whose outputs can be aggregated by the majority voting rule. Using the ground truth supplied with the datasets, the accuracy term p i ∈ [0, 1] stands for the individual classification accuracy of D i .
The number of commonly applied classifiers is relatively low; therefore to increase the cardinality of the pool, we have also considered a synthetic approach in a similar way to Cavalcanti et al. (2016). Namely, we have trained the same base classifier on different training datasets, by which we can synthesize several "different" classifiers. Naturally, this method is able to provide more independent classifiers only if the base classifier is unstable, i.e., minor changes in the training set can lead to major changes in the classifier output; such an unstable classifier is, for example, the perceptron one.
To summarize our experimental setup for UCI binary classification problems, we have considered base classifiers perceptron (Freund and Schapire 1999), decision tree (Quinlan 1986), Levenberg-Marquardt feedforward neural network (Suratgar et al. 2005), random neural network (Timotheou 2010), and discriminative restricted Boltzmann machine classifier (Larochelle and Bengio 2008) for the UCI datasets MAGIC Gamma Telescope, HIGGS, EEG Eye State, Musk (Version 2), Spambase, Breast Cancer Wisconsin, Mushroom, Gisette and Adult; datasets of large cardinalities were selected to be able to train synthetic variants of base classifiers on different subsets. To check our models for different numbers of possible ensemble members, the respective pool sizes were set to n = 30 and n = 100 ; the necessary number of classifiers has been reached via synthesizing the base classifiers with training them on different subsets of the training part of the given datasets. In contrast to the Kaggle challenges, in these experiments we were able to retrieve meaningful cost information to devise a knapsack scenario. Namely, for a classifier D i , its training time was adjusted as its cost t i in our model. Notice that for even the same classifier, it was possible to obtain different t i values with training its synthetic variants on datasets of different sizes. Using this time information, for the estimated size ̂ of the optimal ensemble, we could use Lemma 3 for n = 30 , while (14) for the case n = 100 . This is one of the reasons why we set the different pool sizes to n = 30 and n = 100 . Another point is to get sufficiently large search spaces to show the efficiency of our proposed method. To choose a pool size greater than 100 is rather unrealistic and results in a very time-demanding problem.
We compare the performance of the proposed search strategy (SHErLoCk) with SA, SA+, Genetic (Goldberg 1989), Genetic+, Pruning (Martinez-Munoz and Suarez 2007) on binary classification problems of UCI datasets using an ensemble pool of n = 30 and n = 100 classifiers, respectively. As clearly visible from Tables 4 and 5, our stochastic search strategy SHErLoCk described in Algorithm 1 was reasonably faster than the other ones and slightly dominant in ensemble accuracy found by the different approaches presented in the tables, as well. Moreover, it can be observed again that applying the stopping rule with the threshold STOP led to an enormous computational advantage for either search strategies with only a small drop in accuracy. For the sake of completeness, we have also included the forward and backward selection techniques. As it can be observed from the tables, these deterministic techniques are naturally quicker than the others; however, their accuracies are reasonably low as well. Notice that, as deterministic techniques it is meaningless to limit the search time for them with either MAXSTEP or STOP. Moreover, as for their 50% worst-case accuracy proved in Sect. 2 we can see a better performance in our experiments.
As for evaluating the accuracy of the ensembles of the trained classifiers, we have followed the following protocol. The individual classifiers have been trained according to the common guidelines with splitting all the datasets to training, cross-validation, and testing parts. Then the individual accuracies p i have been determined as their performances on the test set (30% randomly selected parts of the datasets) since these figures are available only for the test sets regarding the Kaggle competitors. The ensemble accuracies gained by all the methods are calculated by aggregating the outputs of the trained classifiers on the test part of the publicly available datasets by applying the classic majority voting rule without any further training processes. This is the reason why we have presented the experimental results (ensemble accuracies and computational times) for the test parts for Kaggle and UCI, as well. The official splitting is available only for the Kaggle datasets, but not for the UCI ones. Thus, for possible future comparability, we give the results for the whole UCI datasets in Tables 7 and 8 in Appendix 7, as well. We have found a small drop or a small rise regarding each ensemble creator method's performance on the whole dataset, so the trends shown in Tables 4 and 5 are not affected. The computational times are naturally the same for both evaluation protocols. For further comparison of the performance of the proposed search strategy (SHErLoCk) with SA, SA+, Genetic, Genetic+, Pruning, Bayesian signed-rank tests (Benavoli et al. 2017) were applied. The results of the tests for training time as cost on UCI test sets using an ensemble pool of n = 30 and n = 100 classifiers are enclosed in Tables 9 and 10 in Appendix 7, respectively. The signed-rank test returns three probability values from which we consider P(practical equivalence) and P(Algorithm1 > Algorithm2) describing the equivalence of the two compared algorithms and the dominance of Algorithm1 over Algorithm2, respectively. The results show that the accuracy of the two compared classifiers are practically different in each case (the largest value for equivalence was 0.0132) and the proposed algorithm SHErLoCk outperformed the other search strategies with high probability in most of the cases.
We have also checked how the ensemble accuracies found by the different approaches increased regarding the elapsed time during the search. Notice that, the accuracies indeed have a monotonously increasing trend, since at each search step we store the best performing ensemble for all the approaches till the stopping condition is met. Our findings are depicted in Fig. 2 (for n = 30 ) according to SHErLoCk, SA+, Genetic+ as best performing algorithms with indicating also the time points with dashed lines when the respective methods were stopped. For this demonstrative analysis we have selected the two datasets EEG/Gisette requiring the largest/smallest computational times. For each of the analyzed approaches we can see sudden jumps coming from their stochastic behaviours.

Optic disc detection
The majority voting rule can be applied in a problem to aggregate the outputs of single object detectors in the spatial domain ; the votes of the members are given in terms of single pixels as candidates for the centroid of the desired object. In this extension, the shape of the desired object defines a geometric constraint, which should be met by the votes that can be aggregated. In , our practical example relates to the detection of a disc-like anatomical component, namely the optic disc (OD) in retinal images. Here, the votes are required to fall inside a disc of diameter d OD to vote together. As more false regions are possible to be formed, the correct decision can be made even if the true votes are not in the majority, as in Fig. 3. The geometric restriction transforms (1) to the following form: In (22), the terms p ,k describe the probability that a correct decision is made by supposing that we have k correct votes out of . For the terms p ,k (k = 0, 1, … , ) , in general, we have that 0 ≤ p ,0 ≤ p ,1 ≤ ⋯ ≤ p , ≤ 1.
In our experiments, the pool consists of eight OD detector algorithms with the following accuracy and running time values: i=1 t i = 301 secs. As we have mentioned earlier we can consider any resource type regarding the parameters t i . For instance, we have considered the running times of the member algorithms here since some of them are not machine learning based ones. We can apply our theoretical foundation with some slight modifications to solve the same kind of knapsack problem for the variant (22), transforming the model to reflect the multiplication with the terms p ,k .
We have empirically derived the values p 8,k = {0, 0.11, 0.70, 0.93, 0.99, 1.00, 1.00, 1.00, 1.00} for (22) in our task. To adopt our approach by following the logic of Algorithm 2, we need to determine a STOP value for the search based on p and p (calculated by (12)), and ̂ (calculated by (14)). However, since now the energy function is transformed by the terms p ,k in (22), we must borrow the corresponding theoretical results from Tiba et al. (2019) to derive the mean q̂ instead of (7) proposed in Algorithm 2. Accordingly, we had to find a continuous function F that fit to the values p ,k , which was evaluated by regression and resulted in F(x) = b∕(b + x a ∕(1 − x) a ) with a = −3.43 and Fig. 4 Determining the constrained majority voting probabilities p ,k for our OD detector ensemble 1 3 b = 101.7 , as also plotted in Fig. 4. Now, by using Theorem 1 from Tiba et al. (2019), we have gained q̂ = F( p ). For our experiment to search for the best ensemble, we have set the time constraint to be 80% of the total running time, with T = 4( ∑ 8 i=1 t i )∕5 . For this setup, we could estimate ̂ = 7 and q̂ = 0.969 for the expected ensemble size and mean accuracy, respectively.
Then, these values have been considered for Algorithm 2 to demonstrate the performance of our stochastic search method SHErLoCk with SA. As shown in Table 6, our search strategy outperformed SA also for the object detection problem both in accuracy and computational time. Because of the smallness of this setup, we have omitted a full experimental evaluation.

Discussion
For the approximate number ̂ T of the ensemble size, we have considered (14) when the member accuracy p is not a beta distribution. As an alternative notice that it is known that for a given , T and an independent exponential distributed j , T is distributed as Poisson with parameter T . We can use Lemma 3 and conclude that for a starting size of the ensemble, one may choose ̂ T such that the remaining possible values are beyond the 5% error. It follows that we apply formula either where m 0.05 is the upper quantile of the Poisson distribution with parameter T∕ ∑ n i=1 p i , or use the normal approximation to the Poisson distribution which provides us the inequality In our experiments we have used (14) instead of (25) to obtain ̂ T , since the latter provided slightly too large estimated size values. However, for other scenarios, it might be worthwhile to try (25) as well.
As some additional arguments, we call attention to the following issues regarding those elements of our approach that might need special care or can be adjusted differently in other scenarios: -We have assumed independent member accuracy behavior, providing solid estimation power in our tests. However, in the case of strong member dependencies, deeper discovery of the joint behavior might be needed. -Stirling's approximation considered in (21) may provide values that are too small for the parameter MAXSTEP in the case of small pools. Since this is an escape parameter, a sufficiently large value should be selected in such cases instead. Note that on the datasets we examined, we found that it is not worth going above 15 000 iterations, because the ensembles' accuracies improved only slightly with higher step counts. -The constraint we consider first in the knapsack problem corresponds to the training time; however, any other type of resources could be studied. For instance, we have examined the testing time as cost on the UCI dataset, as well. Tables 11 and 12 show the results (ensemble accuracy and computational time) of the proposed search strategy (SHErLoCk) compared with other selection methods using an ensemble pool of n = 30 and n = 100 classifiers, respectively. Similar conclusions can be drawn regarding the comparison of the performances of the methods that we have found for the training time. -The time profile = 1 − p in Sect. 5.2 is suited to our data; however, any other relationship between the member accuracy and time can be considered. In case other non-timebased resources are examined in the constraint, the exponential distribution can be changed accordingly. Nevertheless, the similar derivation of the estimation of the ensemble accuracy might be slightly more laborious depending on the selected distribution. -In (17), we have used a one-tailed (left-side) hypothesis since q was close to 1. However, if it is not that close to 1, a two-tailed hypothesis can be meaningful as well. Furthermore, if q is even smaller (say 0.7), then we can search above this mean by considering a rightside hypothesis.
Natural generalizations of the presented results regard multiple objectives and multiclass classification. Within our framework, multiple constraints lead to a multiply-constrained knapsack problem, which can be handled e.g. with merging the constraints into a single one (Pisinger 1995) to make our approach directly applicable. As for multiclass classification, notice that our motivating example given in Sect. 6.3 has already some similarities regarding multiclass problems, since the geometric constraint intuitively describes such a type of task. To realize the classic setup, we have to consider multinomial distribution instead of the binomial one to handle majority voting in the corresponding classification problem. These changes are quite obvious; however, since they raise serious theoretical challenges over the techniques we used in this work, they need a dedicated future research beyond the current scope. Similarly, a complexity analysis of our approach is a natural issue; however, its stochastic nature makes this derivation really hard like in Dzahini (2020).

Appendix 1: Proof of Lemma 1
Proof Consider a subset K when K = {1, 2, … , 2 } (otherwise we can renumerate p i ). We have where (26) that is, we consider a subset K ⊆ N with |K| = 2 , and Q 2 ,k (K, I) is calculated for an index set I ⊆ K with |I| = k . Now, choose an index a from the set N∖K , i.e., a > 2 , and obtain The term Q 2 ,k (K, I)p a = Q 2 +1,k+1 ({K, a}, {I, a}) and Q 2 ,k (K, I) 1 − p a = Q 2 +1,k ({K, a}, I) ; therefore, since q 2 +1 (K) includes some extra additional terms, say Q 2 +1,k ({K, a}, I) , where I contains a. Regarding that n is odd in the series of q (K) , there will be an element with odd following an element of even and the lemma is proved for odd n. For the case when n is even, we consider the q n (N) and q n−1 (L) , where L = {1, 2, … , 2 − 1} . Set n = 2 ; then, with N = {1, 2, … , 2 } and put notice that the number of terms is equal in both sums. If k = 2 , then otherwise, hence for k < 2 , We start summing up q(N, 2 ) from 2 ; then, using (32) and (34), we obtain for the first two terms If we continue summing up one by one, then induction leads to since p 2 < 1 . ◻

Appendix 2: Proof of Proposition 1
Proof We prove the statement with an example describing the worst-case scenario for the greedy selection strategy. Let D = {D 1 = (p 1 , t 1 ), D 2 = (p 2 , t 2 ), … , D n = (p n , t n )} be the pool, where the index set is denoted by I n = {1, 2, … , n} . Let us suppose that that is, the time constraint should not be of concern. Let p 1 = 1∕2 + , where 0 < ≤ 1∕2 , and p 2 = p 3 = ⋯ = p n = 1∕2 + with 0 < < , where the proper selection of will be given below.
The greedy strategy will move D 1 to S as the most accurate item in its first step. Next, we try to extend S by adding more members. Since we require odd members, we try to add 2 items in every selection step. Since all the remaining n − 1 features have the same behavior, we can check whether S should be extended via comparing the performance of S 1 = {D 1 } and S 3 = {D 1 , D 2 , D 3 } . For the performance of the ensemble S 1 , we trivially have q 1 (I 1 ) = p 1 = 1∕2 + , where I 1 = {1} , while for S 3 we can apply (1) for the 3-member ensemble, with I 3 = {1, 2, 3} to calculate q 3 (I 3 ): after the appropriate substitutions and simplifications. Now, if we adjust to have q 1 (I 1 ) = q 3 (I 3 ) , then via solving the equation we obtain That is, with a selection of given in (39), the ensemble S 1 = {D 1 } is not going to be extended since it does not lead to improvement. Thus, the strategy stops after the first step with an ensemble accuracy 1∕2 + .
On the other hand, with a sufficiently large n, a very accurate ensemble could be achieved. More precisely, it can be easily seen that q n (I n ) is strictly monotonically increasing with Now, by letting → 0 , we can see that for the ensemble accuracy found with this strategy while an ensemble of lim n→∞ q n (I n ) = 1 could also be found. Hence, the proposition follows.

Appendix 3: Proof of Proposition 2
Proof We prove the statement with a similar example to that given in the proof of Proposition 1 in Appendix 1 to describe the worst-case scenario. Let D = {D 1 = (p 1 , t 1 ), D 2 = (p 2 , t 2 ), … , D n = (p n , t n )} be the pool and T be the time constraint. Put p 1 = 1∕2 + , where 0 < ≤ 1∕2 , t 1 = T , and p 2 = p 3 = ⋯ = p m = 1∕2 + , t 2 = t 3 = ⋯ = t n = T∕(n − 1) with 0 < < . If is properly selected, then q 1 (I 1 ) = p 1 < q n−1 (I n ⧵I 1 ) . However, because of the time constraint, we must remove elements during the selection procedure, since initially n ∑ i=1 t i = 2T > T . For this requirement, the greedy approach in the first step will remove any two elements from D 2 , … , D n by decreasing the time with 2T∕(n − 1) . This selection will go on until only D 1 remains in the ensemble. With a proper selection of , we have lim n→∞ q n−1 (I n ⧵I 1 ) = 1 and by letting → 0 , the proposition follows. ◻ (37) (40) lim n→∞ q n (I n ) = 1.
(41) lim →0 q 1 (S 1 ) = 1∕2, Proof The first part of the lemma corresponds to Theorem 1 in Lam and Suen (1997). For the rest, let us denote the product of probabilities by  (44) are actually multinomial coefficients. The rest of the proof is based on the approximation of the binomial distribution by the normal distribution. It is not complicated but slightly lengthy; we make it available to the interested readers on request. ◻