Abstract
We develop a hyperparameter optimisation algorithm, Automated Budget Constrained Training, which balances the quality of a model with the computational cost required to tune it. The relationship between hyperparameters, model quality and computational cost must be learnt, and this learning is incorporated directly into the optimisation problem. At each training epoch, the algorithm decides whether to terminate or continue training, and, in the latter case, what values of hyperparameters to use. This decision optimally weighs potential improvements in quality against the additional training time and the uncertainty about the learnt quantities. The performance of our algorithm is verified on a number of machine learning problems encompassing random forests and neural networks. Our approach is rooted in the theory of Markov decision processes with partial information, and we develop a numerical method to compute the value function and an optimal strategy.
1 Introduction
The emergence of Machine Learning (ML) methods as an effective and easy-to-use tool for modelling and prediction has opened up new horizons for users in all aspects of business, finance, health, and research (Ghoddusi et al. 2019; Briganti and Le Moine 2020). In stark contrast to more traditional statistical methods that require relevant expertise, ML methods are largely black boxes and are based on minimal assumptions. This has resulted in the adoption of ML by an audience of users with possible expertise in the domain of application but little or no expertise in statistics or computer science. Despite these features, effective use of ML does require good knowledge of the method used. The choice of the method (algorithm), the tuning of its hyperparameters and the architecture require experience and a good understanding of those methods and the data. Naturally, trial and error might provide possible solutions; however, it can also result in suboptimal use of ML and wasted computational resources.
The goal of this research is to find optimal values of hyperparameters for a given dataset and a modelling objective, given that the relationship between the hyperparameters, model quality (model score) and training cost is not known in advance. This problem is akin to the classical AutoML setup (Hutter et al. 2019) with one crucial difference: if desired, the system has to produce results in a radically shorter time than classical AutoML solutions. This comes at the expense of accuracy, so taking the trade-off between the model quality and the running time explicitly into account lies at the heart of this paper. The relationship between hyperparameters, model quality and computational cost for a particular modelling problem and a particular dataset must be learnt, and this learning is incorporated directly into the optimisation problem.
1.1 Main contribution
The objective of a classical hyperparameter optimisation problem is to find values of hyperparameters that maximise the accuracy or a related score of the trained model:
$$\begin{aligned} \max _{u \in \mathcal {U}} H(u), \end{aligned}$$
where H(u) is the model score given hyperparameters u and \(\mathcal {U}\) is the domain of hyperparameters. As explained above, we are interested in the hyperparameter optimisation problem when the relationship between hyperparameters and the score is not perfectly known and must be learnt by observing the performance of different hyperparameter choices, similarly to reinforcement learning. In contrast to reinforcement learning, we add a constraint on the total optimisation cost, which may be a function of the running time or resources consumed in the course of hyperparameter optimisation. A schematic presentation of the resulting optimisation problem is as follows:
$$\begin{aligned} \max _{\tau ,\, (u_n)}\ H(u_{\tau -1}) \quad \text {subject to} \quad \sum _{n=1}^{\tau } T(u_{n-1}) \le T^*, \end{aligned}$$
(1.1)
where the optimisation is over \(\tau \ge 1\) and \((u_n)_{n=0}^\infty \). Here, \(\tau \) is the number of training steps (a quantity dependent on the course of learning/training), \(u_{n} \in \mathcal {U}\) denotes the choice of hyperparameters at step n (also dependent on the history). The function T(u) is the cost of model training and validation for hyperparameters u.
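As a point of reference, if the maps H and T were known, problem (1.1) would reduce to a constrained search over hyperparameters. A minimal sketch for a single training step; the score map, cost map and budget below are purely hypothetical:

```python
import numpy as np

# Hedged sketch: if the score map H and cost map T were known (they are not,
# in the setting of the paper), problem (1.1) with a single training step
# reduces to a constrained grid search.  H, T and the budget are assumptions.
H = lambda u: 1.0 - (u - 0.7) ** 2      # assumed expected model score
T = lambda u: 0.2 + 0.5 * u             # assumed training-and-validation cost
budget = 0.56                           # the constraint T* on cumulative cost

grid = np.linspace(0.0, 1.0, 101)
feasible = [u for u in grid if T(u) <= budget]
u_star = max(feasible, key=H)           # best affordable hyperparameter
```

In the setting of this paper neither H nor T is known, which is what turns this static search into a sequential learning problem.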
In practice, the observed model score may be random as it depends on the training process of the model (which can itself include randomness, as in the construction of random forests or in the stochastic gradient method for neural networks) and on the choice of training and validation datasets (including K-fold cross-validation). The observed computational cost can also be random due to random variations in running time commonly observed in practical applications, arising both from the hardware and operating system side as well as from random choices in the training process. To account for this randomness and to allow for the incorporation of Bayesian learning, we reformulate the hyperparameter optimisation problem (1.1) as a stochastic control problem using notation which will be used in the paper:
$$\begin{aligned} \sup _{\tau ,\, (u_n)} E\big [ h_{\tau } \big ] \quad \text {subject to} \quad E\Big [\sum _{n=1}^{\tau } t_n\Big ] \le T^*. \end{aligned}$$
(1.2)
The quantity \(h_n\) is the (random) observed model score corresponding to the choice of hyperparameters \(u_{n-1}\). The (random) observed computational cost \(t_n\) also depends on the choice of hyperparameters \(u_{n-1}\). We recall that \(\tau \) indicates the time when the learning terminates and itself depends on the whole history of observations. Similarly, the hyperparameter choice \(u_n\) depends on the history of observations up to time n. We consider the expectation of the score at step \(\tau \) and the expectation of the cumulative computational cost since the randomness in the observation of those quantities can be viewed as external to the problem and not under our control.
We emphasise that the motivation of this paper comes from optimisation problems where \(T^*\) is significantly smaller than commonly used in classical AutoML applications. When model training is performed during an exploratory data analysis, short response times, even at the cost of model accuracy, are key for success (Demšar et al. 2004; Holzinger et al. 2014) and may reduce computational resource usage at early stages of modelling. When \(T^*\) is small, it needs to be directly included in the optimisation criterion as the choice of controls \(u_n\) and the stopping time \(\tau \) is strongly dependent on \(T^*\).
As is common in the stochastic control and stochastic optimisation literature, we will study a Lagrangian relaxation of problem (1.2), which takes the form
$$\begin{aligned} \sup _{\tau ,\, (u_n)} E\Big [ h_{\tau } - \gamma \sum _{n=1}^{\tau } t_n \Big ]. \end{aligned}$$
(1.3)
The constant \(\gamma >0\) is the Lagrange multiplier corresponding to the constraint \(E\big [\sum _{n=1}^{\tau } t_n\big ] \le T^*\). It models our aversion to the computational cost: the higher its value, the more sensitive we are to increases in cost and the more willing we are to accept models with lower scores. We call the optimisation problem (1.3) Automated Budget Constrained Training (AutoBCT).
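The role of \(\gamma \) can be illustrated with hypothetical outcomes of two training strategies; the scores and costs below are made up for illustration only:

```python
# Effect of the cost-aversion multiplier gamma in the Lagrangian objective (1.3):
# J = E[h_tau] - gamma * E[sum of t_n].  Hypothetical outcomes of two strategies.
long_run  = {"score": 0.92, "total_cost": 8.0}   # train for many steps
short_run = {"score": 0.85, "total_cost": 2.0}   # stop early

def objective(outcome, gamma):
    return outcome["score"] - gamma * outcome["total_cost"]

# A cost-insensitive user (small gamma) prefers the long run...
assert objective(long_run, 0.005) > objective(short_run, 0.005)
# ...while a cost-averse user (large gamma) prefers to stop early.
assert objective(long_run, 0.05) < objective(short_run, 0.05)
```

The same pair of strategies is thus ranked differently depending on \(\gamma \), which is exactly the trade-off the multiplier encodes.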
We study (1.3) in the framework of Markov Decision Processes with partially observable dynamics, combining stochastic control, optimal stopping and filtering. We treat hyperparameters as a control variable and \(\tau \) as a stopping time. The choice of hyperparameters \(u_{n-1}\) for step n is based on the observation of model scores \(h_1, \ldots , h_{n-1}\) and costs \(t_1, \ldots , t_{n-1}\) in all the previous steps as well as on the prior information on the dependence of the score and cost on the hyperparameters. The number of training steps \(\tau \) also depends on the history of model scores and computational costs. At the end of each step a decision is made whether to continue training or accept the current model. Hence \(\tau \) is not deterministically chosen before the learning starts and depends not only on the total computational budget but also on observed scores and costs. This may, and does, lead AutoBCT to stop quite early in some problems, when the model happens to be so good that the expenditure on more training is not worthwhile (according to the criterion (1.3), which weighs the score against the cost).
We approximate score and cost as functions of hyperparameters using linear combinations of basis functions. We assume that the prior knowledge about the distribution of the coefficients of these linear combinations is multivariate Gaussian and is updated after each step using Kalman filtering. This update not only changes the mean vector but also adjusts the covariance matrix which represents the uncertainty about the coefficients and, consequently, about the score and cost mappings. Updated quantities feed back into the algorithm which decides whether to continue or stop training, and, in the former case, which hyperparameters to choose for the next step. We show that the updated distributions of coefficients are sufficient statistics of past observations of model scores and costs for the purpose of optimisation of the objective function (1.3).
Our framework requires that a value function, representing the optimum value of the objective function (1.3), is available. It depends only on the dimension of the problem and the trade-off coefficient \(\gamma \), and is score and cost agnostic (in the sense that it does not depend on the particular problem and the choice of score and cost). This allows for efficient offline precomputation and recycling of the value function maps. Users are able to share and reuse the same value function maps for different problems as long as the dimensionality of the hyperparameter space agrees. We demonstrate this in the validation section of this paper.
Similarly to other AutoML frameworks (Feurer et al. 2015; Vanschoren 2019), AutoBCT natively enables meta-learning. Prior distributions for coefficients of basis functions determining the score and cost maps can be combined with the metadata describing already completed learning tasks in publicly available repositories. These data can then be used to select a more informative prior for a given problem and, therefore, to warm-start the training. In turn, this leads to reductions in the computational cost and improvements in the score.
The focus on directly optimising the trade-off between the model score and the total computational cost comes from two directions. Firstly, given the recent emphasis on the eco-efficiency of AI (Strubell et al. 2019; Schwartz et al. 2019), the AutoBCT framework provides effective ways of weighing model quality against the employed computational resources as well as recycling information and, in turn, reducing the computational resources required to train sufficiently good models. Secondly, with the democratisation of data analytics, an increasing amount of data exploration will take place where users need to obtain useful (but not necessarily optimal) results within seconds or minutes (Demšar et al. 2004; Holzinger et al. 2014).
In summary, this paper’s contributions are multifaceted. On the practical level, we develop an algorithm that allows optimal selection of hyperparameters for training and evaluation of models with an explicit trade-off between the model quality and the computational resources. From the eco-efficiency perspective, our framework not only discourages unnecessarily resource-intensive training but also naturally enables the transfer and accumulation of knowledge. On the side of Markov decision processes, we solve a non-standard, mixed stochastic control and stopping problem for a partially observable dynamical system and design an efficient numerical scheme for the computation of the value function and optimal controls. Lastly, the applicability of AutoBCT goes beyond machine learning. Any parameter optimisation problem with learning which can be stated as maximising a certain one-dimensional ‘score’ (representing the quality of an outcome) subject to a constraint on ‘costs’ (representing resources used) fits directly into our framework.
1.2 Related literature
To maintain the appeal of ML to its huge and newly acquired audience and to enable its use in systems that do not require human intervention, a large effort has been put into the development of algorithms and systems that allow automatic selection of an optimal ML method and an optimal tuning of its parameters. This line of research has largely operated under the umbrella name of AutoML (see Hutter et al. 2019), standing for Automated Machine Learning, and has been boosted by a series of data crunching challenges under the same title (see, e.g., Guyon et al. 2019). There are widely available AutoML packages in most statistical and data analytics software, see, e.g., Auto-WEKA (Kotthoff et al. 2017) and auto-sklearn (Feurer et al. 2015), as well as in all major commercial cloud computing environments.
The core of parameter optimisation typically takes a Bayesian approach in the form of alternating between learning/updating a model that describes the dependence of the score on the hyperparameters, and using this model and an acquisition function to select the next candidate hyperparameters. The different algorithms used to support the alternating model fitting and updating of parameters also affect performance and suitability for specific problems. The three predominant algorithms are (for a discussion see Eggensperger et al. 2013): Gaussian Processes (Snoek et al. 2012), the random-forest-based Sequential Model-based Algorithm Configuration (SMAC) (Hutter et al. 2011), and the Tree-structured Parzen estimator (Bergstra et al. 2011).
Our approach borrows from the ideas underlying Sequential Model-based Optimisation (SMBO) (Hutter et al. 2011) and learning curve approximation (Domhan et al. 2015). The surrogate model in our case is the basis-function representation of the score and cost maps. However, unlike SMBO, the choice of hyperparameters is based not only on the surrogate model but also on the expected future performance encoded in the value function. Unlike Domhan et al. (2015), the prior distribution of basis function coefficients is conjugate to the distribution of observations, so that analytical formulas for the posterior distribution are available and the dimensionality of the representation of the posterior distribution of coefficients does not change with the increasing number of observations. The latter feature is key to the tractability of our methodology.
In all the above, time is certainly an important factor. It is, however, mostly treated either implicitly, through the exploration–exploitation trade-off, or through hard bounds, which can sometimes result in the algorithm providing no answer within the allocated time. A more sophisticated optimisation of computational resources has recently started to draw interest. In recent work, Falkner et al. (2018) combine Bayesian optimisation, used to build a model and suggest potential new parameter configurations, with Hyperband, which dynamically allocates time budgets to these new configurations, applying successive halving (Li et al. 2017) to reduce the time allocated to unsuccessful choices. Yang et al. (2019) use meta-learning and impose time constraints on the predicted running time of models. Swersky et al. (2014) discuss a freeze-thaw technique for (at least temporarily) pausing training for unpromising hyperparameters. Another approach prominent in the literature uses learning curves to stop training underperforming neural network models, see Domhan et al. (2015), Chandrashekaran and Lane (2017), Gargiani et al. (2019). Bayesian Optimal Stopping is employed by Dai et al. (2019) to achieve a similar goal. Another related line of research is concerned with the prediction of the running time of algorithms (see Huang et al. 2010; Hutter et al. 2014 and many others). Those algorithms are akin to meta-learning, i.e., running time is predicted based on meta-features of algorithms and their inputs.
Our approach differs from the above ideas in that it explicitly accounts for the trade-off between the model quality (score) and the cumulative cost of training. To the best of our knowledge, this is the first attempt to bring the cost of computational resources to the same level of importance as the model score and include it in the objective function. Our developments are, however, parallel to some of the papers discussed above. For example, our approach would benefit from the employment of learning curves to allow for incomplete training for hyperparameters which appear to produce suboptimal models. Our algorithm could complement OBOE (Yang et al. 2019) by fine-tuning hyperparameters of selected models. Such extensions are beyond the scope of this paper and will be explored in further research.
Closely related to our problem is budget-constrained Bayesian optimisation, see Osborne et al. (2009), Ginsbourger and Le Riche (2010). The budget is stated in terms of the number of evaluations of the unknown function (in our case, the unknown function is the mapping of hyperparameters into the score). This problem is studied in the framework of dynamic programming. The key challenge is to efficiently approximate the so-called continuation value without nested evaluations, which are prohibitively expensive, see, e.g., Lam et al. (2016) and Lee et al. (2020), which employ so-called rollout strategies. As in classical Bayesian optimisation, the unknown function is approximated using Gaussian processes, so the dimensionality of the state increases after each step. In our basis-function approach, the dimensionality of the state space does not change. Notice also that the former approach is similar to ours only if the cost of each evaluation is the same and known.
Our numerical method for the computation of the value function complements a large strand of literature on regression Monte Carlo methods. The most widely studied problem is that of optimal stopping (also known as American option pricing), initiated by Tsitsiklis and Van Roy (2001), Longstaff and Schwartz (2001) and studied continuously to this day (Nadarajah et al. 2017). Stochastic control problems were first addressed in Kharroubi et al. (2014) using control randomisation techniques and later studied in Balata and Palczewski (2017) using a regress-later approach and in Bachouch et al. (2018) employing deep neural networks. The numerical methods designed in our paper are closest in spirit to the regress-later ideas of Balata and Palczewski (2017) (see also Bachouch et al. 2018).
1.3 Paper structure
The paper is structured as follows. Section 2.1 gives details of the optimisation problem. Technical aspects of embedding this problem within the theory of Markov decision processes with partial information come in the following subsection. Due to the noisy observation of the realised model score and cost (partial information), the resulting model is non-Markovian. Section 2.3 reformulates the optimisation problem as a classical Markov decision model with a different state space. The rest of Sect. 2 is devoted to showing that the latter optimisation problem has a solution, developing a dynamic programming equation and showing how to compute optimal controls. Section 3 contains details of the numerical computation of the value function and provides the necessary algorithms. Validation of our approach is performed in Sect. 4, which opens with a synthetic example to present, in a simple setting, the properties of our solution. It is followed by 8 examples encompassing a variety of datasets and modelling objectives. Appendices contain further details. Appendix A derives the dynamics of the Markov decision process of Sect. 2.3. Appendix B contains auxiliary estimates used in the proofs of the main results collected in Appendix C. Detailed settings for all studied models are in Appendices D and E, while the architectures of neural networks used in the examples are provided in Appendix G. Comparisons to Hyperopt are collected in Appendix F.
For the code and value functions (maps) used in the paper, contact the authors.
2 Stochastic control problem
In this section, we embed the optimisation problem in the theory of stochastic control of partially observable dynamical systems (Bertsekas and Shreve 1996, Chapter 10) and lay foundations for its solution.
2.1 Model overview
We start by providing a sketch of our framework. Its construction is motivated by a tradeoff between two opposing goals: the representation of the key features of the trained model’s score and cost functions needed in optimisation of hyperparameters, and the computational tractability.
Recall the optimisation problem (1.3). The random quantities \(h_n\) and \(t_n\) are not known as they depend on a dataset, a particular problem, and the software and hardware used. We therefore embed the optimisation problem in a Bayesian learning setting. Following ideas of linear basis expansion (Hastie et al. 2009, Ch. 5) and regression Monte Carlo (Tsitsiklis and Van Roy 2001), we assume that the true expected model score H(u) and the true expected model cost T(u), given the choice of hyperparameters \(u \in \mathcal {U}\), can be represented as linear combinations of basis functions:
$$\begin{aligned} H(u) = \sum _{j=1}^J \alpha _j \phi _j(u), \qquad T(u) = \sum _{k=1}^K \beta _k \psi _k(u), \end{aligned}$$
where functions \((\phi _j)_{j=1}^J\) and \((\psi _k)_{k=1}^K\) act from \(\mathcal {U}\) to \(\mathbb {R}\); we recall that \(\mathcal {U}\) is the set of admissible controls (hyperparameter values). The coefficients \({\varvec{\alpha }}= (\alpha _1, \ldots , \alpha _J)^T\) and \({\varvec{\beta }}= (\beta _1, \ldots , \beta _K)^T\) evolve while learning takes place. A priori, before learning starts, \({\varvec{\alpha }}\) follows a normal distribution with the mean \(\mu ^\alpha _0\) and the covariance matrix \(\Sigma ^\alpha _0\); the vector of coefficients \({\varvec{\beta }}\) follows a normal distribution with the mean \(\mu ^\beta _0\) and the covariance matrix \(\Sigma ^\beta _0\). These distributions are referred to as the prior distributions. We assume that a priori \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) are independent of each other. This choice of distributions is motivated by computational tractability but comes at the cost of modelling inaccuracy. In particular, we cannot guarantee that \(T(u) \ge 0\) for every \(u \in \mathcal {U}\) and any realisation of \({\varvec{\beta }}\). The consequences will be explored in more depth later in the paper.
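A minimal sketch of this representation; the monomial basis and the prior parameters below are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative monomial basis on U = [0, 1] with J = K = 3 (an assumption,
# not the basis used in the paper).
phi = lambda u: np.array([1.0, u, u ** 2])
psi = lambda u: np.array([1.0, u, u ** 2])

# Hypothetical Gaussian priors on the coefficient vectors alpha and beta.
mu_alpha, Sigma_alpha = np.array([0.5, 0.5, -0.5]), 0.1 * np.eye(3)
mu_beta,  Sigma_beta  = np.array([1.0, 0.5,  0.0]), 0.1 * np.eye(3)

# Each prior draw of the coefficients induces one candidate score map
# H(u) = alpha . phi(u) and one candidate cost map T(u) = beta . psi(u).
alpha = rng.multivariate_normal(mu_alpha, Sigma_alpha)
beta  = rng.multivariate_normal(mu_beta, Sigma_beta)
H = lambda u: alpha @ phi(u)
T = lambda u: beta @ psi(u)

# The prior mean of H(u) is mu_alpha . phi(u).
prior_mean_H = lambda u: mu_alpha @ phi(u)
```

Note that a draw of \({\varvec{\beta }}\) from such a Gaussian prior can make T negative somewhere on \(\mathcal {U}\), which is the inaccuracy mentioned above.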
The observed score \(h_{n}\) and cost \(t_{n}\) experience variation due to inherent randomness in model training and testing or in execution conditions (which is particularly important for short running times), see Domhan et al. (2015), Gargiani et al. (2019) for a similar assumption in the case of learning curve estimation. To account for this variation, we assume that
$$\begin{aligned} h_n = H(u_{n-1}) + \sigma _H(u_{n-1})\, \epsilon _n, \qquad t_n = T(u_{n-1}) + \sigma _T(u_{n-1})\, \eta _n, \end{aligned}$$
where \((\epsilon _n)\) and \((\eta _n)\) are independent sequences of independent standard normal N(0, 1) random variables. The terms \(\sigma _H(u_{n-1}) \epsilon _{n}\) and \(\sigma _T(u_{n-1}) \eta _{n}\) correspond to the aforementioned random fluctuations of observed quantities around \(H(u_{n-1})\) and \(T(u_{n-1})\) and are Gaussian with mean 0 and the known variances \(\sigma ^2_H(u_{n-1})\) and \(\sigma _T^2(u_{n-1})\).
Algorithm 1 provides a sketch of our approach. Given all information at step 0, i.e., given the prior distribution of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) and a precomputed AutoBCT map \(\tilde{V}\) (the computation of the AutoBCT map is detailed in Sect. 3.2), we choose values of hyperparameters \(u_0\) and then train and validate a model. Based on the observed model score \(h_1\) and cost \(t_1\), we update the distribution of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) (the learning step, which is based on the Kalman filter, Sect. 2.3). Following this, we check the stopping condition: on an intuitive level, the algorithm stops if expected potential gains in the score do not justify the additional costs (Sect. 2.5 contains a detailed discussion of stopping conditions). If the decision is to stop, we return the latest values of hyperparameters and further information, for example, the latest distributions of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) or the latest model. Otherwise, we repeat the hyperparameter choice, training, validation and update procedures. We continue until the decision is made to stop.
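This loop can be sketched in code. In the sketch below the hyperparameter choice and the stopping rule are crude stand-ins for the decisions driven by the map \(\tilde{V}\), and all numeric values (true coefficients, noise level, cost) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hedged sketch of the Algorithm 1 loop for a 1-D hyperparameter with basis
# phi(u) = (1, u).  The "true" coefficients below are unknown to the loop.
# Hyperparameters are drawn at random rather than chosen via the value-function
# map, and the stopping rule is a crude confidence threshold, not the optimal one.
true_alpha = np.array([0.3, 0.6])
phi = lambda u: np.array([1.0, u])
sigma_H = 0.05                                   # known observation noise level
gamma, cost_per_step = 0.1, 1.0

mu, Sigma = np.zeros(2), np.eye(2)               # prior on alpha
history = []

for n in range(20):
    u = float(rng.uniform())                     # stand-in for the optimal control
    f = phi(u)
    h = true_alpha @ f + sigma_H * rng.normal()  # "train and validate" the model
    # Learning step: Kalman update of the posterior over alpha.
    s = f @ Sigma @ f + sigma_H ** 2             # innovation variance
    gain = Sigma @ f / s
    mu = mu + gain * (h - f @ mu)
    Sigma = Sigma - np.outer(gain, f @ Sigma)
    history.append(h)
    # Crude stopping rule: stop once residual coefficient uncertainty is small
    # relative to the per-step cost penalty gamma * t.
    if np.sqrt(np.trace(Sigma)) < gamma * cost_per_step:
        break
```

On termination, `mu` and `Sigma` play the role of the returned posterior distribution of \({\varvec{\alpha }}\).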
Remark 2.1
In our Bayesian learning framework, the expectation in (1.3) is not only with respect to the randomness of observation errors \((\epsilon _n)\) and \((\eta _n)\) but also with respect to the prior distribution of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\). Therefore, all possible realisations of the score and cost mappings are evaluated with more probable outcomes (as directed by the prior distribution) receiving proportionally more contribution to the value of the objective function.
In the remainder of this subsection, we derive the optimisation problem that will be solved in this paper. This will result in a modification of both terms in the objective function in (1.3).
The score. In problem (1.3) the optimised quantity is the observed score of the most recently trained model. When the emphasis is on the selection of hyperparameters, one would replace \(h_{\tau }\) by \(H(u_{\tau -1})\), which corresponds to maximisation of the true expected score without the observation error:
$$\begin{aligned} \sup _{\tau ,\, (u_n)} E\Big [ H(u_{\tau -1}) - \gamma \sum _{n=1}^{\tau } t_n \Big ]. \end{aligned}$$
(2.6)
Either case can be easily accommodated by our framework; however, we restrict attention to the objective function (2.6) in this paper.
The cost. The optimal value for the problem (2.6), as well as (1.3), is infinite and yields controls \((u_n)\) of little practical use. This is because \({\varvec{\beta }}\) follows a normal distribution for the sake of numerical tractability, so for any choice of basis functions \({\varvec{\psi }}\) one cannot have \(\sum _{k=1}^K \beta _k \psi _k(u) \ge 0\) for all \(u\in \mathcal {U}\) with probability one. Indeed, for any \(\delta > 0\) we have
$$\begin{aligned} P\Big ( \exists \, u \in \mathcal {U}:\ \sum _{k=1}^K \beta _k \psi _k(u)< -\delta \Big ) > 0. \end{aligned}$$
While learning, one could identify, with increasing probability as n increases, those realisations of \({\varvec{\beta }}\) for which there is \(u \in \mathcal {U}\) with \(\sum _{k=1}^K \beta _k \psi _k(u) < -\delta \) for some fixed \(\delta > 0\). Then, continuing infinitely long with those controls u would lead to a positive infinite value of the expression inside the expectation.
In view of the above remark, we truncate the cost to stay positive. The final form of the optimisation problem is
$$\begin{aligned} \sup _{\tau ,\, (u_n)} E\Big [ H(u_{\tau -1}) - \gamma \sum _{n=1}^{\tau } (t_n)^+ \Big ], \end{aligned}$$
(2.7)
where \((t_n)^+ := \max (t_n, 0)\). We show in Lemma C.1 that the optimal value of this problem is finite.
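A small numerical illustration of the effect of the truncation; the per-step costs below are hypothetical, with one (physically impossible) negative Gaussian draw:

```python
import numpy as np

gamma = 0.1
# Hypothetical observed per-step costs; a Gaussian cost model can produce a
# negative draw such as -0.4 below, even though real costs are non-negative.
t = np.array([1.2, 0.8, -0.4, 1.0])
h_tau = 0.9                                              # score of the final model

untruncated = h_tau - gamma * t.sum()                    # rewards the negative cost
truncated   = h_tau - gamma * np.maximum(t, 0.0).sum()   # objective (2.7)
```

The untruncated objective is inflated by the impossible negative cost; the positive part removes this spurious reward.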
An alternative approach could be to amend the definition of \(t_n\) and the prior distribution of basis function coefficients to ensure non-negativity, but this would invalidate the assumptions of the Kalman filter used in learning and, hence, make the solution of the problem numerically infeasible. Indeed, Domhan et al. (2015) define a prior distribution of coefficients so that particular features of the resulting curves (like monotonicity or non-negativity) are guaranteed, but then need to resort to numerical methods for sampling from the posterior distribution.
2.2 Technical details
Optimisation problem (2.7) involves quantities which are not directly observable, namely, the coefficients of the basis functions making up the true score H and cost T. In the following two subsections, using the theory of optimal control with partial observations, we will derive a fully observed control problem. For the convenience of a reader familiar with the MDP framework, a complete presentation of the dynamics of the latter problem is included in Appendix A.
Let \((\Omega , \mathcal {F}, \mathbb {P})\) be the underlying probability space supporting the sequences \((\epsilon _n)_{n \ge 1}\) and \((\eta _n)_{n \ge 1}\) and the random variables \({\varvec{\alpha }}\) and \({\varvec{\beta }}\). By \(\mathcal {F}_n\) we denote the observable history up to and including step n, i.e., \(\mathcal {F}_n\) is the \(\sigma \)-algebra
$$\begin{aligned} \mathcal {F}_n = \sigma \big (u_0, h_1, t_1, \ldots , u_{n-1}, h_n, t_n\big ), \end{aligned}$$
with \(\mathcal {F}_0 = \{\emptyset , \Omega \}\). The choice of hyperparameters \(u_n\) must be based only on the observable history, i.e., \(u_n\) must be \(\mathcal {F}_n\)-measurable for any \(n \ge 0\). The variable \(\tau \), which governs the termination of training, must be an \((\mathcal {F}_n)\)-stopping time, i.e., \(\{\tau =n\} \in \mathcal {F}_n\) for \(n \ge 0\): the decision whether to finish training is based on the past and present observations only. The difficulty lies in the fact that observations combine evaluations of the true expected score function and the running time function with errors, cf. (2.5). This places us in the framework of stochastic control with partial observation (Bertsekas and Shreve 1996, Chapter 10).
Denote \({\varvec{\phi }}(u) = (\phi _1(u), \ldots , \phi _J(u))^T\) and \({\varvec{\psi }}(u) = (\psi _1(u), \ldots , \psi _K(u))^T\). Equation (2.5) can be written in vector notation as
$$\begin{aligned} h_n = {\varvec{\phi }}(u_{n-1})^T {\varvec{\alpha }} + \sigma _H(u_{n-1})\, \epsilon _n, \qquad t_n = {\varvec{\psi }}(u_{n-1})^T {\varvec{\beta }} + \sigma _T(u_{n-1})\, \eta _n. \end{aligned}$$
We will use this notation throughout the paper.
The power of the theory of Markov decision processes is in its tractability: an optimal control can be written in terms of the current state of the system. However, when the system is not fully observed, there is an obvious benefit in taking all past observations into account to determine possible values of the unobservable elements of the system. Indeed, due to the form of the observations (2.4)–(2.5), one can infer much more about \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) when taking into account all available past readings of model scores and training costs. It turns out that optimal control problems of the form studied in this paper can be rewritten into equivalent systems with a bigger state space but with full observation, for which classical dynamic programming can be applied.
Before we proceed with this programme, we state standing assumptions:
(A):
Basis functions \((\phi _j)_{j=1}^J\) and \((\psi _k)_{k=1}^K\) are bounded on \(\mathcal {U}\).
(S):
Observation error variances are bounded and separated from zero, i.e.,
$$\begin{aligned}&\inf _{u \in \mathcal {U}} \sigma _H(u)> 0, \quad \inf _{u \in \mathcal {U}} \sigma _T(u) > 0,\\&\sup _{u \in \mathcal {U}} \sigma _H(u) + \sigma _T(u) < \infty . \end{aligned}$$
It is not required for the validity of mathematical results that basis functions are orthogonal in a specific sense or even linearly independent. However, linear independence is required for identification purposes and it speeds up learning.
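Linear independence of a candidate basis can be checked numerically by evaluating the functions on a grid and inspecting the rank of the resulting Gram matrix; the monomial basis below is purely illustrative:

```python
import numpy as np

# Numerical check of linear (in)dependence of candidate basis functions:
# evaluate them on a grid over U = [0, 1] and inspect the rank of the Gram
# matrix.  The monomial basis here is an illustrative assumption.
grid = np.linspace(0.0, 1.0, 200)
Phi = np.column_stack([np.ones_like(grid), grid, grid ** 2])   # 1, u, u^2

rank = np.linalg.matrix_rank(Phi.T @ Phi)                      # 3: independent
# Adding 2u + 1, which lies in the span of {1, u}, leaves the rank at 3,
# flagging the fourth function as redundant for identification.
Phi_dep = np.column_stack([Phi, 2.0 * grid + 1.0])
rank_dep = np.linalg.matrix_rank(Phi_dep.T @ Phi_dep)
```

A rank smaller than the number of basis functions signals redundancy of the kind that slows down the identification of the coefficients.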
2.3 Separation principle
Denote by \({\mathcal {A}}_n\) the conditional distribution of \({\varvec{\alpha }}\) given \(\mathcal {F}_n\). It follows from an application of the discrete-time Kalman filter (Bensoussan 2018, Sect. 4.7) that \({\mathcal {A}}_n\) is Gaussian with the mean \(\mu ^\alpha _n\) and the covariance matrix \(\Sigma ^\alpha _n\) given by the following recursive formulas:
$$\begin{aligned} \mu ^\alpha _{n+1}&= \mu ^\alpha _n + \frac{\Sigma ^\alpha _n {\varvec{\phi }}(u_n)\, \big (h_{n+1} - {\varvec{\phi }}(u_n)^T \mu ^\alpha _n\big )}{{\varvec{\phi }}(u_n)^T \Sigma ^\alpha _n {\varvec{\phi }}(u_n) + \sigma ^2_H(u_n)},\\ \Sigma ^\alpha _{n+1}&= \Sigma ^\alpha _n - \frac{\Sigma ^\alpha _n {\varvec{\phi }}(u_n) {\varvec{\phi }}(u_n)^T \Sigma ^\alpha _n}{{\varvec{\phi }}(u_n)^T \Sigma ^\alpha _n {\varvec{\phi }}(u_n) + \sigma ^2_H(u_n)}. \end{aligned}$$
Denote by \({\mathcal {B}}_n\) the conditional distribution of \({\varvec{\beta }}\) given \(\mathcal {F}_n\). By the same arguments as above, it is also Gaussian with the mean \(\mu ^\beta _n\) and the covariance matrix \(\Sigma ^\beta _n\) given by the following recursive formulas:
$$\begin{aligned} \mu ^\beta _{n+1}&= \mu ^\beta _n + \frac{\Sigma ^\beta _n {\varvec{\psi }}(u_n)\, \big (t_{n+1} - {\varvec{\psi }}(u_n)^T \mu ^\beta _n\big )}{{\varvec{\psi }}(u_n)^T \Sigma ^\beta _n {\varvec{\psi }}(u_n) + \sigma ^2_T(u_n)},\\ \Sigma ^\beta _{n+1}&= \Sigma ^\beta _n - \frac{\Sigma ^\beta _n {\varvec{\psi }}(u_n) {\varvec{\psi }}(u_n)^T \Sigma ^\beta _n}{{\varvec{\psi }}(u_n)^T \Sigma ^\beta _n {\varvec{\psi }}(u_n) + \sigma ^2_T(u_n)}. \end{aligned}$$
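A single step of such a scalar-observation Kalman update can be written generically; the basis value, observation and noise level in the example are hypothetical:

```python
import numpy as np

def kalman_step(mu, Sigma, f, y, noise_var):
    """One Kalman-filter update of a Gaussian posterior N(mu, Sigma) over a
    coefficient vector, after observing y = f . coeffs + noise (scalar)."""
    s = f @ Sigma @ f + noise_var          # innovation variance
    gain = Sigma @ f / s                   # Kalman gain
    mu_new = mu + gain * (y - f @ mu)
    Sigma_new = Sigma - np.outer(gain, f) @ Sigma
    return mu_new, Sigma_new

# Hypothetical example: prior N(0, I) over alpha, basis value phi(u) = (1, 0.5),
# observed score h = 0.8 with observation variance sigma_H^2 = 0.01.
mu0, Sigma0 = np.zeros(2), np.eye(2)
f = np.array([1.0, 0.5])
mu1, Sigma1 = kalman_step(mu0, Sigma0, f, 0.8, 0.01)
```

One update pulls the predicted score towards the observation and shrinks the posterior covariance, exactly as in the recursions for \(({\mathcal {A}}_n)\) and \(({\mathcal {B}}_n)\).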
Furthermore, it follows from the definition of \(h_{n+1}\) and \(t_{n+1}\) that the conditional distribution of \(h_{n+1}\) given \(\mathcal {F}_n\) and control \(u_n\) is Gaussian,
$$\begin{aligned} N\big ({\varvec{\phi }}(u_n)^T \mu ^\alpha _n,\ {\varvec{\phi }}(u_n)^T \Sigma ^\alpha _n {\varvec{\phi }}(u_n) + \sigma ^2_H(u_n)\big ), \end{aligned}$$
and the conditional distribution of \(t_{n+1}\) given \(\mathcal {F}_n\) and control \(u_n\) is also Gaussian,
$$\begin{aligned} N\big ({\varvec{\psi }}(u_n)^T \mu ^\beta _n,\ {\varvec{\psi }}(u_n)^T \Sigma ^\beta _n {\varvec{\psi }}(u_n) + \sigma ^2_T(u_n)\big ). \end{aligned}$$
The measure-valued stochastic processes \(({\mathcal {A}}_n)\), \(({\mathcal {B}}_n)\) are adapted to the filtration \((\mathcal {F}_n)\). This would have been of little practical use for us if it were not for the fact that those measure-valued processes take values in the space of Gaussian distributions, so the state is determined by the mean vector and the covariance matrix. It is then clear that the process \((\mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n, t_n)\) is a Markov process (we omit \(h_n\) as it does not appear in the objective function (2.7)). Its dynamics are explicitly given in Appendix A.
The following theorem shows that the optimisation problem can be restated in terms of these variables instead of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\), i.e., in terms of observable quantities only. This reformulates the problem as a fully observed Markov decision problem which can be solved using dynamic programming methods.
Theorem 2.2
Optimisation problem (2.7) is equivalent to
$$\begin{aligned} \sup _{\tau ,\, (u_n)} E\Big [ {\varvec{\phi }}(u_{\tau -1})^T \mu ^\alpha _\tau - \gamma \sum _{n=1}^{\tau } (t_n)^+ \Big ]. \end{aligned}$$
(2.11)
The proof of this theorem, as well as the proofs of other results, is collected in Appendix C.
2.4 Dynamic programming formulation
By inspecting the dynamics of \((\mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n, t_n)\) in Appendix A, one notices that the state in step \(n+1\) depends on step n only via the parameters of the filters \(({\mathcal {A}}_n)\), \(({\mathcal {B}}_n)\), which contain a sufficient statistic for \((h_i)_{i=1}^n\) and \((t_i)_{i=1}^n\). Furthermore, the dynamics are time-homogeneous, i.e., they do not depend on the step n. Hence, the value function of the optimisation problem (2.11) depends only on 4 parameters:
The function V gives an optimal value of the optimisation problem given that the prior distributions of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) are Gaussian with the given parameters. The expression on the right-hand side is well defined and the value function does not admit values \(\pm \infty \), see Lemma C.1.
In preparation for stating the dynamic programming principle, define an operator acting on measurable functions \(\varphi (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta )\) as follows
where the expectation is with respect to \(\mu ^\alpha _1, \Sigma ^\alpha _1, \mu ^\beta _1, \Sigma ^\beta _1, t_1\) conditional on \(\mu ^\alpha _0 = \mu ^\alpha , \Sigma ^\alpha _0 = \Sigma ^\alpha , \mu ^\beta _0 = \mu ^\beta , \Sigma _0^\beta = \Sigma ^\beta \); we will use this shorthand notation whenever it does not lead to ambiguity.
In order to approximate V, consider the value iteration scheme:
The following theorem asserts that the usual properties of the value iteration scheme hold for our problem. The proof is in Appendix C.
Theorem 2.3
The following statements hold true.

1.
The iterative scheme (2.14) is well defined, i.e., \(V_N\) does not take values \(\pm \infty \) and the expression under expectation in \(\mathcal {T} V_N\) is integrable, so that the operator \(\mathcal {T}\) can be applied to \(V_N\).

2.
\(V_N\) is the value function of the problem with at most N training steps, i.e.
$$\begin{aligned}&V_N(\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) = \sup _{(u_{n})_{n \ge 0},\, 1 \le \tau \le N}\\&\quad E\left[ (\mu ^\alpha _{\tau })^T {\varvec{\phi }}(u_{\tau -1}) - \gamma \sum _{n=1}^\tau (t_n)^+ \,\left| \, \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta \right. \right] . \end{aligned}$$ 
3.
\(V = \lim _{N \rightarrow \infty } V_N\) and the sequence of functions \((V_N)\) is non-decreasing in N. Furthermore, \(V < \infty \).

4.
The value function V defined in (2.12) is a fixed point of the operator \(\mathcal {T}\), i.e., satisfies the dynamic programming equation
$$\begin{aligned} V( \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) = \sup _{u\in \mathcal {U}} E \left[ -\gamma ( t_1)^+ + \max \big ( ( \mu ^\alpha _1)^T {\varvec{\phi }}(u), V( \mu ^\alpha _1, \Sigma ^\alpha _1, \mu ^\beta _1, \Sigma ^\beta _1) \big ) \left| \, \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta \right. \right] . \end{aligned}$$(2.15)
Assertion 2 states that \(V_N\) is the value function for the problem with at most N training steps, which suggests what tradeoff is made when using \(V_N\) instead of V. Furthermore, \(V_N \le V\) by Assertion 3, which, as we will see in the next section, may affect the stopping criterion. Assertion 4 is the statement of the dynamic programming equation for the infinite horizon stochastic control problem. We will refer to those results in the numerical section of the paper.
Remark 2.4
The expectation of the first term in (2.13), containing only \(t_1\), can be computed in closed form, improving the efficiency of numerical calculations:
where
and \(\Phi \) is the cumulative distribution function of the standard normal distribution. The justification of this formula is in Appendix C.
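The closed form in Remark 2.4 rests on the standard identity for the mean of the positive part of a Gaussian variable, which is where \(\Phi \) enters. A sketch of that identity, with m and s denoting the conditional mean and standard deviation of \(t_1\) (our parameterisation; the paper's exact expression is the one referenced above):

```python
import math

def positive_part_mean(m, s):
    """E[max(X, 0)] for X ~ N(m, s^2), i.e. m * Phi(m/s) + s * pdf(m/s)."""
    z = m / s
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    return m * Phi + s * pdf
```

The identity degrades gracefully at the extremes: for m far above zero the expectation is essentially m, and for m far below zero it is essentially zero.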
2.5 Identifying control and stopping
If V were available, then the choice of control \(u_n\) and the decision about stopping in Algorithm 1 would follow from classical results in optimal control theory. Indeed, one would choose
and then train and validate the model using hyperparameters \(u_n\). Using the observed score \(h_{n+1}\) and cost \(t_{n+1}\), one would compute \(\mu ^\alpha _{n+1}, \Sigma ^\alpha _{n+1}, \mu ^\beta _{n+1}, \Sigma ^\beta _{n+1}\) via (2.9)–(2.10). If
the algorithm would stop; otherwise, it would continue with the choice of \(u_{n+1}\) as in (2.17) and a repeat of the steps above.
Remark 2.5
A reader familiar with the MDP terminology will notice that \(u_n\) is a function of the state \(\mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n\) at time n and does not itself depend on n, so it is a time-homogeneous feedback policy. The stopping criterion (2.18) is also of a classical form: stop when the continuation value is smaller than the current payoff.
A natural adaptation of this procedure to the case when one only knows \(V_N\) is to replace V by \(V_N\) in both places in the above formulas. We will refer to this ad hoc approach as the relaxed method. Another possibility is to follow an optimal strategy corresponding to the computed value function, i.e., with the given maximum number of training steps. If one knows \(V_1, \ldots , V_N\), then for \(n = 0, \ldots , N-1\),
and stop when
where, with an abuse of notation, we put \(V_0 \equiv -\infty \), i.e., at time \(n=N\) we always stop (with this definition of \(V_0\), we have \(V_1 = \mathcal {T} V_0\)). We will refer to this approach as the exact method.
The advantage of the exact method is that it has strong underlying mathematical guarantees, but it is limited to N training steps with N chosen a priori. The relaxed method does not impose such constraints a priori, but it may stop too early as the right-hand side of the condition (2.18) is replaced by \(V_N \le V\).
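Schematically, the relaxed method's loop can be sketched as follows; every component here is a hypothetical stand-in (a toy one-basis score model, a simple proxy `V_tilde` for the learnt value function, and a crude mean update in place of the Kalman recursion (2.9)–(2.10)):

```python
import numpy as np

rng = np.random.default_rng(0)
U_GRID = np.linspace(0.0, 1.0, 21)               # discretised control set U = [0, 1]

def expected_score(u, mu):                       # toy stand-in for m^alpha(u, x)
    return mu * (1.0 - (u - 0.7) ** 2)

def V_tilde(mu):                                 # toy proxy for the learnt V_N
    return 0.9 * mu

def train_and_validate(u):                       # stand-in for the external ML run
    return 1.0 - (u - 0.7) ** 2 + 0.01 * rng.standard_normal(), 0.1

def relaxed_method(mu=0.5, max_iter=50):
    for n in range(max_iter):
        # choose the control maximising the expected score given the current belief
        u = U_GRID[np.argmax([expected_score(v, mu) for v in U_GRID])]
        h, t = train_and_validate(u)             # observe score and cost
        mu = mu + 0.5 * (h - mu)                 # crude stand-in for the filter update
        if expected_score(u, mu) >= V_tilde(mu): # stop: payoff beats continuation
            return u, mu, n + 1
    return u, mu, max_iter
```

In this toy, the score peaks at \(u = 0.7\), so the loop selects that control and stops as soon as the expected score exceeds the continuation proxy.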
3 Numerical implementation of AutoBCT
In this section, we focus on describing the methodology behind our implementation of AutoBCT. We show how to numerically approximate the iterative scheme (2.14) and how to explicitly account for numerical errors when using estimated \((V_n)_{n \ge 1}\) for the choice of control (hyperparameters) \(u_n\) and for the stopping condition in Algorithm 1. We wish to reiterate that, apart from the on-the-fly method (Sect. 3.5), a numerical approximation of the value functions \(V_n\), \(n \ge 1\), needs to be computed only once and then applied in Algorithm 4 for various datasets and modelling tasks. We demonstrate this in Sect. 4.
Before delving into details, we provide an overview of the numerical implementation. Our main approach relies on the precomputation of the value function \(V_n\), \(n = 1, 2, \ldots \). This is discussed in Sects. 3.2 and 3.4. The latter subsection provides details on the design of the set of points on which the value function is computed, taking into account how this value function is subsequently used. The precomputed value functions form an input to AutoBCT, described in Sect. 3.3, which guides the hyperparameter selection of an external ML model. An implementation of our approach which computes an approximation of the value function on demand instead of relying on precomputed value functions is shown in Sect. 3.5 and called on-the-fly. This algorithm is, however, much more resource intensive at the time of application compared to the negligible cost of running AutoBCT, for which the majority of the computational burden is relegated to the precomputation stage.
3.1 Notation
Denote by \(\mathcal {X}\) the state space of the process \(x_n = (\mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n)\). For \(x = (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) \in \mathcal {X}\), define
Given \(h, t \in \mathbb {R}\) and \(x \in \mathcal {X}\), we write \(\texttt {Kalman}(x, h, t)\) for the output of the Kalman filter (2.9)–(2.10) (with an obvious adjustment to the notation: x is taken to be the state at time n and h, t are the observation of the score \(h_{n+1}\) and the cost \(t_{n+1}\)).
For convenience of implementation, we assume that \(\mathcal {U}= [0,1]^p\), which can be achieved by appropriate normalisation of the control set. We show in Sect. 4 that this assumption is not detrimental even if some of the hyperparameters take discrete values.
3.2 Computation of the value function
We compute the value function via the iterative scheme (2.14). We use ideas of regression Monte Carlo for controlled Markov processes, particularly the regress-later method (Balata and Palczewski 2017; Huré et al. 2018), and approximate the value function \(V_n\), \(n=1, \ldots , N\), learnt over an appropriate choice of states \(\mathcal {C}\subset \mathcal {X}\). The choice of the cloud \(\mathcal {C}\) is crucial for the accuracy of the approximation of the value function at points of interest; it is discussed in Sect. 3.4. We fix a grid of controls \(\mathcal {G}\) for computing an approximation of the Q-value, as in practical applications we found it to perform better than a (theoretically supported) random choice of points in the control space \(\mathcal {U}\).
Computation of the value function is presented in two algorithms. Algorithm 2 finds an approximation of the expression inside the optimisation in \(\mathcal {T} \varphi (x)\) (see (2.16)), while Algorithm 3 shows how to adapt the value iteration approach to the framework of our stochastic optimisation problem. Details follow.
Algorithm 2 defines a function \(\texttt {Lambda}(x, \varphi ; \mathcal {G}, N_s)\) that computes an approximation to the mapping
It is done by evaluating the outcome of each control \(u^i\) in \(\mathcal {G}\) (line 1 in Algorithm 2) using Monte Carlo averaging over \(N_s\) samples from the distribution of the state \(x_{1}\) given \(x_0 = x\) and the control \(u^i\); this averaging approximates the expectation. However, the regression in line 7 employs further cross-sectional information to reduce the error in Monte Carlo estimates of this expectation as long as the sampling of \((h_k)\) and \((t_k)\) is independent between different values \(u^i\). This effect has mathematical grounding as a Monte Carlo approximation to orthogonal projections in an appropriate \(L^2\) space if the loss function is quadratic: \(l_p(p, z) = (p-z)^2\), cf. Tsitsiklis and Van Roy (2001), Balata and Palczewski (2017). In particular, it is valid to use \(N_s=1\), but larger values may result in more efficient computation: one seeks a tradeoff between the size of \(\mathcal {G}\) and the size of \(N_s\) to optimise computational time and achieve sufficient accuracy.
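The cross-sectional smoothing described above can be illustrated in a compressed form: draw noisy Monte Carlo samples of the Q-value for every control on the grid, pool them, and fit a single regression, rather than averaging each control separately. The toy Q-value \(u(1-u)\), the polynomial family and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def lambda_hat(q_sample, grid, n_samples, degree=2):
    """Regress pooled noisy draws of the Q-value on polynomials of u.

    q_sample(u) returns one random draw whose mean is the true Q-value Q(u);
    pooling draws across the whole grid and fitting one regression (the
    'regress-later' idea) smooths the per-control Monte Carlo noise.
    """
    us, qs = [], []
    for u in grid:                        # loop over controls (line 1, Algorithm 2)
        for _ in range(n_samples):        # N_s independent draws per control
            us.append(u)
            qs.append(q_sample(u))
    coefs = np.polyfit(us, qs, degree)    # cross-sectional least squares (line 7)
    return np.poly1d(coefs)               # smooth approximation of u -> Q(u)

# toy example: true Q(u) = u(1 - u), observed with noise of sd 0.05
grid = np.linspace(0.0, 1.0, 51)
q_fit = lambda_hat(lambda u: u * (1.0 - u) + 0.05 * rng.standard_normal(),
                   grid, n_samples=4)
```

Even with only 4 draws per control, the pooled fit recovers the smooth Q-value far more accurately than the raw per-control averages would.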
The space of functions \(\mathcal {F}_p\) over which the optimisation in regression in line 7 is performed depends on the choice of a functional approximation. In regression Monte Carlo it is a linear space spanned by a finite number of basis functions. Other approximating families \(\mathcal {F}_p\) include neural networks, splines and regression Random Forests. Although we do not include it explicitly in the notation, the optimisation may further involve regularisation in the form of elastic nets (Zou and Hastie 2005) or regularised neural networks, which often improves real-world performance.
Algorithm 3 iteratively applies the operator \(\mathcal {T}\) and computes a regression approximation of the value function. The loop in lines 2–5 computes the optimal control and the value for each point from the cloud \(\mathcal {C}\). In line 3, we approximate the mapping
using Algorithm 2. Regression in line 6 uses pairs of the state \(x^j\) and the corresponding optimal value \(q^j\) to fit an approximation \(\widetilde{V}_{n+1}\) of the value function \(V_{n+1}\). This approximation is needed in the following iteration of the loop (in line 3) and as the output of the algorithm to be used in an application of AutoBCT to a particular data problem. The family of functions \(\mathcal {F}_\varphi \) (as \(\mathcal {F}_p\)) includes linear combinations of basis functions (for regression Monte Carlo), neural networks or regression Random Forests. Importantly, the dimensionality of the state x is much larger than that of the control u itself, so efficient approximation of the value function requires more intricate methods than the approximation \(\texttt {Lambda}(x, \varphi ; \mathcal {G}, N_s)\) of the Q-value. As in Algorithm 2, some regularisation may benefit real-world performance. The loss function is commonly the squared error \(l_\varphi (q, z) = (q-z)^2\), but others can also be used (particularly when using neural networks) at the cost of a more difficult, or no, justification of the convergence of the whole numerical procedure.
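A toy sketch of the outer value iteration loop of Algorithm 3, on a one-dimensional stand-in for the state, with an ordinary polynomial least-squares fit replacing the Random Forest or basis-function regression of line 6; the transition model and \(V_0 \equiv 0\) are made-up assumptions, not the paper's dynamics:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.16
GRID = np.linspace(0.0, 1.0, 11)         # control grid G
cloud = np.linspace(0.0, 1.0, 40)        # toy one-dimensional stand-in for C

def step(x, u):
    """Hypothetical one-step model: returns (next state, observed score, cost)."""
    h = x + (1.0 - x) * 0.5 * u + 0.02 * rng.standard_normal()
    return float(np.clip(h, 0.0, 1.0)), h, 0.1 + 0.2 * u

def value_iteration(N, n_samples=20, degree=3):
    V = np.poly1d([0.0])                                # V_0 of the toy problem
    for _ in range(N):                                  # repeated application of T
        qs = []
        for x in cloud:                                 # lines 2-5: optimise per state
            best = -np.inf
            for u in GRID:
                draws = [-gamma * max(c, 0.0) + max(h, V(x1))
                         for x1, h, c in (step(x, u) for _ in range(n_samples))]
                best = max(best, float(np.mean(draws)))
            qs.append(best)
        V = np.poly1d(np.polyfit(cloud, qs, degree))    # line 6: refit V_{n+1}
    return V
```

In this toy the score improves with the state, so the fitted value function should be increasing, mirroring the monotonicity one expects of V in the quality of the current belief.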
3.3 AutoBCT
The core part of AutoBCT is given in Algorithm 4. We use the notation \(\mathcal {M}\) to denote an external Machine Learning (ML) process that outputs score h and cost t for the choice of hyperparameters \(u\in \mathcal {U}\). The ‘raw’ score \(\hat{h}\) and the ‘raw’ cost \(\hat{t}\) obtained from the ML routine are transformed onto the interval [0, 1] with the user-specified map (robustness to a misspecification of this map is discussed below):
For the cost t we usually use an affine transformation based on a best guess of maximum and minimum values. For the score h the choice depends on the problem at hand: in our practice we have employed affine and affine-log transformations. Misspecification of these maps, so that the actual transformed values fall outside the interval [0, 1], is accounted for by learning \(\widetilde{V}\) using a cloud \(\mathcal {C}\) containing states that map onto distributions with resulting functions H and T not bounded by [0, 1] with non-negligible probability. This robustness does not, however, relieve the user from a sensible choice of the above transformations, as significantly bad choices have a tangible effect on the behaviour of the algorithm.
Remark 3.1
The transformation of the raw scores and times into values in (roughly) the interval [0, 1] is necessitated by the model setup, which is not scaling and transformation neutral. The observation errors \(\sigma _H\) and \(\sigma _T\) are fixed functions, so their effect depends on the ranges of values of the score and cost input into the model. Scaling makes sure that the same setup of those functions can be used for multiple datasets and ML models. The score and the cost are represented as linear combinations of basis functions, so a change of range and interpretation of raw quantities (particularly the raw score, which does not always need to be the accuracy of a model) would require a selection of a new set of basis functions. If the raw quantities were only transformed with affine transformations, one would need to adjust appropriately the prior distributions, the training cloud (see Sect. 3.4) and the Lagrange multiplier \(\gamma \). All this would require recomputing the value function, the most resource-hungry step of our approach. Instead, we advocate using the same basis functions, the same range of priors, observation errors \(\sigma _H\) and \(\sigma _T\), coefficient \(\gamma \) and, finally, the same value function, but appropriately transforming the raw score and cost. The reader will observe in Sect. 4 that this can be successfully (and naturally) done in many examples.
We present an algorithm for the relaxed method as introduced in Sect. 2.5; an adaptation of Algorithm 4 to the exact method is easy and has been omitted.
The input of Algorithm 4 consists of an approximation \(\widetilde{V}\) of the value function V. An optimal control \(u_{n}\) at step n given state \(x_n\) should be determined by maximisation of the expression (3.19) over \(u \in \mathcal {U}\), putting \(\varphi = \widetilde{V}\) and \(x = x_n\). For each u, this expression compares the two possibilities available after applying control u: either stopping or continuing training. These two options correspond to strategically different choices of u; the first maximises the score in the next step while the latter prioritises learning. The learning option is preferred if future rewards (as given by \(\widetilde{V}\)) exceed the present expected score. The approximation \(\widetilde{V}\) of V may be burdened by errors, which may result in an incorrect decision and drive the algorithm into an unnecessary ‘learning spree’ (the opposite effect of premature stopping is less dangerous as it does not lead to unnecessarily long computation times). To mitigate this effect, we replace (3.19) with
for an error adjustment \(\epsilon \ge 0\). Notice that \(\epsilon = 0\) brings back the original formula from (3.19). The approximation of the mapping \(u \mapsto \Lambda ^\epsilon (u; x, \epsilon , \widetilde{V})\) is computed by Algorithm 2 with \(\varphi (x) = (1-\epsilon ) \widetilde{V} (x)\). The grid \(\mathcal {G}'\) and the number of Monte Carlo iterations \(N_s'\) are usually taken much larger than in the value iteration algorithm, since one needs only one evaluation at each step instead of the thousands of evaluations needed in Algorithm 3. Furthermore, more accurate computation of the mapping \(\Lambda ^\epsilon (\cdot ; x, \epsilon , \widetilde{V})\) was shown to improve real-life performance.
The stopping condition in line 9 is an approximation of the condition \(m^{\alpha }(u_{n}, x) < V(x)\). Indeed, using that \(V = \mathcal {T} V\), we could write \(m^{\alpha }(u_{n}, x) < \mathcal {T} V(x)\) and approximate the right-hand side by \(\sup _{u \in \mathcal {U}} \Lambda ^\epsilon (u; x, 0, \widetilde{V})\), i.e., with \(\epsilon =0\). Using a non-zero \(\epsilon \) counteracts numerical and approximation errors, as discussed above.
The stopping condition \(m^{\alpha }(u_{n}, x) < V(x)\) could also be implemented as \(m^{\alpha }(u_{n}, x) < (1-\epsilon ) \widetilde{V}(x)\). We do not do this for two reasons. Firstly, \(\tilde{\Lambda }(u_{n+1}')\) is a more accurate approximation of the value function V at the point x due to the averaging effect of evaluating \(\mathcal {T} \widetilde{V}\) at the required point x, with computations done at much higher accuracy stemming from a finer grid \(\mathcal {G}'\) and a larger number of Monte Carlo iterations \(N_s'\). Secondly, this brings the stopping condition in line with the choice of optimal control in line 8, which is also based on \(\tilde{\Lambda }\), and comes at no extra cost.
3.4 States and clouds
Approximations \(\widetilde{V}_n\) in Algorithm 3 are learnt over states in the cloud \(\mathcal {C}\subset \mathcal {X}\), and therefore the quality of these maps depends on the choice of \(\mathcal {C}\). This cloud needs to comprise sufficiently many of the states encountered over the running of the model in order to provide good approximations for the choice of control \(u_n\) and the stopping decision in Algorithm 4. In the course of learning, the norms of the covariance matrices \(\Sigma ^\alpha _n\) and \(\Sigma ^\beta _n\) become smaller (the filter concentrates ever more on the mean), asymptotically approaching a truth:
Definition 3.2
(Truth) A state \(x = (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) \in \mathcal {X}\) is called a truth if \(\Sigma ^\alpha = 0\) and \(\Sigma ^\beta = 0\). The set of all truths in \(\mathcal {X}\) is denoted by \(\mathcal {X}_{\infty }\).
We argue that the learning cloud \(\mathcal {C}\) should contain the following:

(a)
an adequate number of states that are elements of \(\mathcal {X}_{\infty }\),

(b)
states that are suitably close to priors \(x_0\) one will use in the Algorithm 4,

(c)
an adequate number of states ‘lying between’ the above.
A failure to include states which are truths (point a) inhibits the ability of the AutoBCT algorithm to stop adequately. Conversely, a failure to include states which Algorithm 4 encounters in its running (point c) inhibits the ability to continue adequately. A failure to include a sufficient number of states that are close to priors \(x_0\) in (b) impacts the coverage of the state space by states of type (c) and, further, due to errors in extrapolation, the decisions made in the first step of the algorithm.
We suggest the following approach for construction of the cloud \(\mathcal {C}\). For given \(N_c, K \ge 1\), we repeat \(N_c\) times the steps:

(1)
sample \(\mu ^\alpha \) and \(\mu ^\beta \) from appropriate distributions, e.g., multivariate Gaussian,

(2)
sample \(\Sigma ^\alpha \) and \(\Sigma ^\beta \) from appropriate covariance distributions, e.g., Wishart or LKJ (Lewandowski et al. 2009),

(3)
add states
$$\begin{aligned} \Big (\mu ^\alpha , \frac{k}{K} \Sigma ^\alpha , \mu ^\beta , \frac{k}{K} \Sigma ^\beta \Big ), \quad k = 0, \ldots , K, \end{aligned}$$to the cloud \(\mathcal {C}\).
This results in a cloud of size \(N_c (K+1)\).
Note that if in steps 1 and 2 one uses distributions with the means equal to possible priors \(x_0\), point (b) above is fulfilled. At step 3, \(k=0\) gives zero covariance matrices, i.e., truths according to the definition above. It also highlights that distributions used in step 1 should be sufficiently dispersed to cover possible truths for models for which AutoBCT will be used.
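The three steps above can be sketched as follows; for brevity the sketch generates a single \((\mu , \Sigma )\) pair per state rather than the full \((\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta )\), and a simple factor product \(A A^T\) stands in for the Wishart or LKJ sampling mentioned in step 2:

```python
import numpy as np

rng = np.random.default_rng(3)

def build_cloud(n_c, K, dim, mu0, spread=1.0):
    """Sample N_c (mu, Sigma) pairs and add K+1 shrunk copies of each (step 3)."""
    cloud = []
    for _ in range(n_c):
        mu = mu0 + spread * rng.standard_normal(dim)   # step 1: sample the mean
        A = rng.standard_normal((dim, dim))
        Sigma = A @ A.T                                # step 2: a PSD covariance draw
        for k in range(K + 1):                         # step 3: k/K scalings, k=0..K
            cloud.append((mu, (k / K) * Sigma))
    return cloud

cloud = build_cloud(n_c=5, K=3, dim=2, mu0=np.zeros(2))
```

Each group of \(K+1\) states shares a mean, and its \(k=0\) member has zero covariance, i.e., it is a truth in the sense of Definition 3.2.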
The cloud \(\mathcal {C}\) and the basis functions in (2.4) are closely linked. A state in the cloud defines distributions of weights for the basis functions. The ability of the model to represent required shapes of the score and cost functions depends, to a great extent, on the choice of basis functions. As the dimensionality of the state depends quadratically on the number of basis functions, it affects heavily both the computation of the approximate value functions \(\widetilde{V}_n\) and the speed of learning in AutoBCT. We have found that polynomial bases work well in the problems studied in Sect. 4, see Appendix E. We used centred monomials up to degree 3 in the case of a one-dimensional hyperparameter, and centred monomials up to degree 4 and one cross-product when the hyperparameter was two-dimensional.
3.5 Onthefly
The ‘on-the-fly’ method acquires evaluations \(V_{N}(x)\) by performing nested Monte Carlo integrations of the recurrence (2.14) with \(V_0(x) = \sup _{u \in \mathcal {U}} \varphi (u)\). Specifically, \(\texttt {OTF}(x, N, \varphi ; \mathcal {G}, N_s)\) approximates the Q-value \(\mathcal {V}_N (x, \cdot )\) using the recursion
where \(\varphi :\mathcal {X}\rightarrow \mathbb {R}\) is some given initial value as in Algorithm 3. Setting \(\varphi \equiv -\infty \), one can show by induction that \(V_N(x) = \sup _{u \in \mathcal {U}} \mathcal {V}_N (x,u)\), cf. (2.14). Notice also, by comparing to Algorithm 2, that
This motivates two modifications of Algorithm 4. The first one, which we call ‘on-the-fly’ in this paper, replaces calls to \(\texttt {Lambda}\) in lines 1 and 7 by calls to \(\texttt {OTF}(x, N, \varphi ; \mathcal {G}, N_s)\) with \(\varphi \equiv -\infty \). We also increase the grid size \(\mathcal {G}\) and \(N_s\) in the first execution to obtain a finer approximation of \(\mathcal {V}_N\) than of \((\mathcal {V}_n)_{n=1}^{N-1}\), in line with the discussion in Sect. 3.3.
The second modification of Algorithm 4, whose performance is not reported in this paper, is motivated by the accuracy improvement when using \(\texttt {Lambda}(x, \widetilde{V}; \mathcal {G}', N_s')\) instead of \(\widetilde{V}\). In the same manner, calls to \(\texttt {Lambda}\) can be replaced by \(\texttt {OTF}(x, N, \widetilde{V}; \mathcal {G}', N_s')\) with \(N \ge 2\), bringing further improvements but at the cost of a significant increase in computational complexity.
The implementation of \(\texttt {OTF}\) is computationally intensive as it uses multiple nested Monte Carlo integrations and optimisations with the computational complexity exponential in N. Its advantage is that it does not require a good specification of the cloud \(\mathcal {C}\) and precomputed value functions.
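The recursion defining \(\texttt {OTF}\) translates almost directly into a recursive function; the transition sampler and all sizes below are toy assumptions, with the initial value taken as \(-\infty \) in this sketch so that the recursion targets \(V_N\):

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 0.16
GRID = np.linspace(0.0, 1.0, 9)

def simulate(x, u):
    """Hypothetical sampler of (next state, score h_1, cost t_1) given x and u."""
    h = x + (1.0 - x) * 0.4 * u + 0.02 * rng.standard_normal()
    return min(max(h, 0.0), 1.0), h, 0.1 + 0.2 * u

def otf(x, N, phi=lambda x: -np.inf, n_samples=8):
    """Nested Monte Carlo evaluation of V_N(x); cost grows exponentially in N."""
    if N == 0:
        return phi(x)                      # initial value, -inf by default
    best = -np.inf
    for u in GRID:                         # optimise over the control grid G
        total = 0.0
        for _ in range(n_samples):         # N_s samples of the next state
            x1, h, t = simulate(x, u)
            total += -gamma * max(t, 0.0) + max(h, otf(x1, N - 1, phi, n_samples))
        best = max(best, total / n_samples)
    return best
```

The exponential growth in N is visible in the call tree: evaluating \(V_N\) multiplies the work by \(|\mathcal {G}| \cdot N_s\) at every level of the recursion.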
4 Validation
In this section we demonstrate applications of AutoBCT to a synthetic example (to visualise its actions and performance) and a number of real datasets and models. Detailed information about AutoBCT settings for each example is collected in Appendix E. Details of the composition of training and testing sets are listed in Appendix D. We emphasise that our objective is to find good hyperparameters within a short computational time. As explained in the introduction, this computational time is meant to be significantly shorter than required to explore the hyperparameter space sufficiently well to find a (near) optimum. The reader should, therefore, not be surprised to see the algorithm stop after only a few steps. For comparison, for each real dataset we provide the best known (to us) performance reported either in a published paper or, more often, in results of a modelling competition. We also provide a comparison of AutoBCT to Hyperopt in Appendix F.
All results were obtained using a desktop computer with an Intel(R) Core(TM) i7-6700K CPU (4.4 GHz), 32 GB RAM and a GeForce RTX 3090 GPU (24 GB). In the synthetic example, Sect. 4.1, we artificially slowed down computations in order to be able to observe differences in running times between hyperparameter choices.
For the computation of the 1D and 2D value function maps used in the VI/Map algorithm, we employ a cloud \(\mathcal {C}\) following the general guidelines described in Sect. 3.4. However, in addition to those, we enrich the cloud by first generating a small set of 100 realistic unimodal shapes for the score and cost functions, which we propagate via the Kalman filter using a predefined control grid up to a certain depth D. In the 1D case we take depth \(D=3\) and the cloud \(\mathcal {C}\) contains 156,000 states of which 1000 are truths (see Def. 3.2). In the 2D case, the value function is built using a cloud \(\mathcal {C}\) extended by the above procedure with depth \(D=10\); it contains 761,600 states of which 1000 are truths.
In the VI/Map Algorithm 3, the regression in line 6 is performed using Random Forest Regression (RFR) with 100 trees. Functions \(\sigma _H\) and \(\sigma _T\) are taken to be constant (independent of the hyperparameters u).
Raw scores and costs are transformed as explained in Sect. 3.3 and Remark 3.1. We emphasise that in all datasets and modelling tasks below (apart from on-the-fly results presented for comparison), we used the same two precomputed value functions for 1D problems (with two different values of \(\sigma _H\)) and the same value function for 2D problems. In particular, we used the same \(\gamma =0.16\). This is motivated by the fact that the computation of the value function is the most resource-intensive task, which one would like to perform only once. The execution time of each step of AutoBCT is dominated by running the external ML procedure, cf. Alg. 4. Conclusions on the robustness of this approach are made at the end of Sect. 4.
4.1 A synthetic example
We illustrate the steps of the AutoBCT algorithm on a synthetic prediction problem. To make the results easier to interpret visually, we consider a one-dimensional hyperparameter selection problem in which we control the number of trees in a random forest (RF) model and take scaled accuracy as the score. The observed accuracy \(\hat{h}\) is mapped to h by an affine mapping that sends 50% model accuracy to score 0 and 100% to 1, while the real running time \(\hat{t}\) is mapped to t by an affine mapping that sends 0 s to 0 and 0.6 s to 1:
The task at hand is to predict outputs of a function
with \(a=b=10\) (Fig. 1), where \(\lfloor \cdot \rfloor \) is the floor function: \(\lfloor x \rfloor = \max \{m \in \mathbb {Z}: m \le x\}\).
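The two affine maps described above (accuracy 50% to 0 and 100% to 1; running time 0 s to 0 and 0.6 s to 1) can be transcribed directly; this is a transcription of the stated endpoints, not the paper's code:

```python
def score_map(raw_accuracy):
    """Affine map: accuracy 0.5 -> score 0, accuracy 1.0 -> score 1."""
    return (raw_accuracy - 0.5) / 0.5

def cost_map(raw_seconds):
    """Affine map: 0 s -> cost 0, 0.6 s -> cost 1."""
    return raw_seconds / 0.6
```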
We initialise our synthetic experiment with a prior on H(u) being relatively flat but with a large enough spread to cover well the range (0, 1) (see the left panel of Fig. 2). For the cost function T(u) we choose a pessimistic prior (see the right panel of Fig. 2), which reflects our conservative belief that the cost is exponentially increasing as a function of the number of trees \(n_{\max }\) (in practice, it is usually linear). We fix \(\gamma =0.16\) and \(\epsilon = 0\) and set the observation errors \(\sigma _H \equiv 0.05\) and \(\sigma _T \equiv 0.1\) (standard deviation of the observation error for the score and the cost is assumed to be 5% and 10%, respectively, independently of the choice of the hyperparameter). We relate the control \(u\in [0,1]\) with the number of trees \(n_{\text {trees}}\) via the mapping:
Thus, the smallest considered random forest has \(n_{\text {trees}}(0)=1\) tree and the largest has \(n_{\text {trees}}(1)=100\) trees.
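The displayed mapping between u and \(n_{\text {trees}}\) is an affine-and-round choice; the sketch below is an assumption consistent with the stated endpoints and with the controls reported later in this section (e.g. \(u = 0.74\) giving 74 trees), not necessarily the paper's exact formula:

```python
def n_trees(u):
    """Map a control u in [0, 1] to a tree count, with n_trees(0)=1, n_trees(1)=100."""
    return int(round(1 + 99 * u))
```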
Below we present the performance of two algorithms: VI/Map, Algorithm 3 (\(N_s = 100\) with \(N=2\) and \(N=4\)), and on-the-fly (OTF), Algorithm 5 (\(N_s = 1000\) with \(N=2\)). We set \(\mathcal {G}= \{0, 0.01, 0.02, \ldots , 1\}\). The training set comprises 30,000 points and the validation set contains 20,000 points. In both sets, the points were randomly drawn from a uniform distribution on the square \([0,1]^2\).
4.1.1 VI/Map results
In this example, we use the 1D map described at the beginning of Sect. 4. We visualise results for the map produced with \(N=2\) (recall that N denotes the depth of recursion in the computation of the value function) and \(\epsilon = 0\). We also include a summary of results for the \(N=4\), \(\epsilon = 0\) map (Table 2) and show that they are not significantly different from the \(N=2\) case.
One run of the VI/Map (\(N=2\), \(\epsilon =0\)) algorithm, together with the evolution of posterior distributions, is displayed in Fig. 3 with an accompanying summary in Table 1, which shows changes of important quantities across iterations. Notice that AutoBCT stopped at iteration 3 with the control \(u = 0.74\) (corresponding to \(n_{\text {trees}}= 74\)) and the expected posterior score and the realised score of \(\approx 0.98\) (accuracy 99%). Recall that the observed accuracy \(h_n\) is burdened with a random error due to the randomness involved in training the random forest and in validating the trained model. The closeness of the expected posterior score and the realised score at the time of stopping is not a rule and should not be expected in general (see examples further in this section).
We note that the VI/Map with \(N=4\) produces a very similar run (Table 2), so the additional expense of finding the depth \(N=4\) map is not necessary for this example.
Non-zero \(\varvec{\epsilon }\) value. We kept the artificial damping parameter \(\epsilon \), which accounts for errors in the value function map, equal to zero as the quality of the map was very good. It can also be observed in Tables 1 and 2 that any reasonable value of \(\epsilon \) would not change the course of training.
4.1.2 Onthefly results
For the OTF (\(N=2\)) version of the algorithm, we provide a summary of one run in Table 3. AutoBCT stopped at iteration 3 with the control \(u = 0.76\) (corresponding to \(n_{\text {trees}}= 76\)), an expected posterior score of 0.99 (accuracy 99.5%) and a realised score of 0.98 (accuracy 99%). We observe that the VI/Map results are very similar to the OTF results. In particular, both methods completed training in 3 iterations and the difference between the final controls is very small.
4.1.3 Sensitivity analysis
We used the on-the-fly algorithm to explore the dependence of the results on the choice of the standard deviation \(\sigma _H\) of score observations and the Lagrange multiplier \(\gamma \). Unlike the rest of the numerical results in this paper, the sensitivity analysis was performed on the University of Leeds HPC cluster using 4 cores of an Intel Xeon Gold 6138 CPU, so costs should not be compared to those shown earlier in this section. Tables 4 and 5 display the mean and standard deviation (in brackets) from 40 independent runs. All other parameters were the same as above. Notice that the final score h is close to 1 for all parameter settings due to the relative simplicity of the problem. We therefore concentrate the analysis on the total cost and the final control.
Assuming too low a value of \(\sigma _H\) leads to unstable results (see Table 4): the total cost has a standard deviation of 1.72, meaning that some runs took very long to terminate. Similarly, the final control has a standard deviation of 0.22. For all higher values of \(\sigma _H\), the standard deviation of the total cost is around 0.5 and of the final control around 0.1. The large value \(\sigma _H = 0.2\) results in longer running times due to the additional observations needed to obtain a sufficiently precise estimate of the relationship between hyperparameters and model score, but the final control is stable. An excessive choice of \(\sigma _H = 0.4\) destabilises the algorithm, resulting in an unstable final score, control and total cost.
The Lagrange multiplier \(\gamma \) governs the tradeoff between the model score and the total running time. The total cost, as displayed in Table 5, decreases with \(\gamma \). So does the final score of the trained model, with its stability also suffering slightly (the standard deviation increases from 0.01 to 0.02). The final control u decreases with \(\gamma \). This is because the cost is increasing in u, so a larger \(\gamma \) incentivises lower values of u. Importantly, the optimisation problem is relatively robust to the specification of parameters, and the observed behaviours are consistent with intuitive expectations.
In the next section, we proceed to deploy the AutoBCT algorithm on real data, using convolutional neural networks.
4.2 Convolutional neural network
In this section, we focus on the computationally intensive problem of classifying breast cancer specimen images for Invasive Ductal Carcinoma (IDC). IDC is the most common subtype of all breast cancers. The data (Janowczyk and Madabhushi 2016; Cruz-Roa et al. 2014) contains 277,524 image patches of size \(50\times 50\), of which 198,738 are IDC negative and 78,786 IDC positive.
We build a classification model using a convolutional neural network (CNN). We optimise the following metaparameters: batch (batch size) and r (learning rate). The CNN is run twice (2 epochs) over a given set of training data and is evaluated on holdout test data. Since we run the CNN for exactly 2 epochs (and not until a certain accuracy threshold is reached), the choice of the learning rate r has no effect on the running time. This demonstrates that AutoBCT is also able to deal with cases where the cost does not depend on some of the optimised hyperparameters. The architecture of the CNN is shown in Fig. 4 and described in “Appendix 1”.
A naive tuning of the CNN over a two-dimensional predefined grid of hyperparameters would be computationally expensive. In Sects. 4.2.3 and 4.2.4 we present results of 2D AutoBCT (VI/Map) and AutoBCT (OTF), respectively, where nearly optimal controls are obtained in relatively few iterations. However, we start by discussing 1D VI/Map results on batch and r separately. Details about priors and parameter settings are collected in Appendix E. A comparison to Hyperopt is performed in Appendix F.
4.2.1 VI/Map CNN batch
Due to limited computational power, we reduce CNN computation time by working on a subset of the data. The training set consists of 5000 images with a 50/50 split between the two categories; the testing set is also balanced and contains 1000 images. We fix the learning rate \(r=0.0007\).
We use the \(N=2\) map with \(\sigma _H = 0.15\), \(\sigma _T = 0.1\) and \(\gamma =0.16\). For this example, as well as for others involving a neural network, we choose a relatively high \(\sigma _H = 0.15\) to account for the larger randomness in training for some choices of hyperparameters. For the sake of illustration, we allow large batch sizes (relative to the size of the data set), which results in unstable model training. To mitigate the resulting significant variability in observed accuracy and enable interesting observations of the algorithm's performance, for each u the neural network is trained 5 times and the median accuracy is output together with the total accumulated cost of training and testing 5 neural networks. We map the control \(u \in [0,1]\) to the batch size in the following way:
which maps 0 to \(\textit{batch}_{\min }=10\) and 1 to \(\textit{batch}_{\max }=200\) linearly. The accuracy and the running time (in minutes) are mapped into score and cost as follows
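The linear mapping from the control to the batch size can be sketched as follows; the interpolation between the endpoints is as described in the text, while the function name and the rounding to an integer batch size are our assumptions.

```python
def batch_from_control(u, batch_min=10, batch_max=200):
    """Map a control u in [0, 1] linearly to a batch size.

    Linear interpolation as described in the text; rounding to the
    nearest integer is an assumption of this sketch.
    """
    return round(batch_min + u * (batch_max - batch_min))
```

With these defaults, the control 0 gives a batch size of 10 and the control 1 gives 200, matching the endpoints stated in the text.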
Results for this problem, dubbed 1D CNN batch, can be inspected in Table 6. We also visualise the final posterior distributions of \(H(\cdot )\) and \(T(\cdot )\) in Fig. 5. The algorithm returned the control corresponding to a batch size of 89, with a realised posterior score of 0.81 (corresponding to the accuracy of 73%) and an expected posterior score of 0.83 (accuracy 74%).
4.2.2 VI/Map CNN r
Here we explore another 1D case, using the learning rate r as the control. We fix the batch size at 66. It is known that the learning rate does not influence the computational cost of a neural net trained over a fixed number of epochs. However, with the batch size of 66, each run of the neural net is relatively expensive in the context of these examples (\(\approx 2.7\) min). Therefore, our goal is still to obtain an optimal control in as few iterations as possible, to save computational resources.
We keep the same mappings for h and t as above, and the same \(\sigma _H = 0.15\), \(\sigma _T = 0.1\), \(\gamma =0.16\). Using common knowledge about the effect of the learning rate on accuracy and running time, we modify the prior for \(H(\cdot )\) so that the mean is unimodal with a maximum in the interior of (0, 1), and the prior for \(T(\cdot )\) is flat (Fig. 6a). The mapping of the control into the learning rate is linear on the \(\log (r)\) scale, i.e., we set
where \(r_{\min }=10^{-5}\) and \(r_{\max }=0.1\).
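A minimal sketch of this log-linear mapping (the function name is ours):

```python
import math

def lr_from_control(u, r_min=1e-5, r_max=0.1):
    """Map a control u in [0, 1] to a learning rate, linearly in log(r):
    log r = (1 - u) log r_min + u log r_max."""
    return math.exp((1.0 - u) * math.log(r_min) + u * math.log(r_max))
```

For instance, u = 0.5 gives the geometric mean \(\sqrt{r_{\min } r_{\max }} = 10^{-3}\), and u = 0.43 gives approximately \(5\times 10^{-4}\), consistent with the final control reported in Table 7.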
In the run displayed in Table 7, the final control corresponds to \(r = 0.0005\) with the observed score of \(h_5 = 0.794\) (corresponding to the accuracy of 73%) and the expected posterior score of \((\mu ^{\alpha }_{3})^T{\varvec{\phi }}(0.43) = 0.445\) (accuracy 61%). The final posterior distributions, with observed scores and costs indicated by crosses, are shown in Fig. 6b. The final output of the algorithm is good although the posterior distribution in the left panel fits the observed data poorly. The shape of the score curve is so peaked that it cannot be represented accurately with a degree-three polynomial (our choice of basis functions). However, AutoBCT stops when the posterior mean score (not the observed score) exceeds the value function, so it is sufficient that the posterior distribution indicates the position of the hyperparameters for which the true maximum score is attained.
4.2.3 2D AutoBCT (VI/Map) with CNN
We initialise our 2D CNN experiment with a prior on the score \(H(u_1,u_2)\) that is unimodal (left panel of Fig. 7a). For the cost function \(T(u_1,u_2)\) we choose a pessimistic prior (left panel of Fig. 7b), which reflects a conservative belief that the cost is exponentially decreasing in the batch size and indifferent to the choice of the learning rate. As before, for each pair \((u_1,u_2)\) the neural network is trained 5 times to mitigate instabilities with large batches. We map the median raw accuracy \(\hat{h}\) and the raw time \(\hat{t}\) (in minutes) via (4.24). We map the control \(u_1\) into the learning rate via (4.25) with \(r_{\min }=10^{-5}\) and \(r_{\max }=0.1\). The second control, \(u_2\), is mapped into the batch size via (4.23) with \(\textit{batch}_{\min } = 10\) and \(\textit{batch}_{\max } = 200\). These mappings are identical to those in the 1D cases studied above.
Results for an example run of 2D AutoBCT (VI/Map) can be inspected in Table 8. The final control \((u_1,u_2)=(0.5,0.15)\) corresponds to \(r=0.001\) and a batch size of 38, with an expected posterior accuracy of 74% and a realised accuracy of 78% (compared to the accuracy of 84% obtained in Cruz-Roa et al. 2014, which used the full dataset in computations). The surfaces showing the means of the posterior distributions for H and T are displayed in the right panels of Fig. 7.
Table 9 presents a summary of the average performance of 40 runs on this 2D neural network problem. During the 40 VI/Map runs, we recorded a small standard deviation of 0.06 for the final control \(u_{1,n}\) (r), suggesting that the quality of the model is sensitive to the choice of the learning rate. On the other hand, the standard deviation for the final control \(u_{2,n}\) (batch) is much larger, suggesting weaker sensitivity to this parameter. It should, however, be remarked that the training of neural networks tends to be less stable when the batch size is large relative to the total sample size, so good models could have been obtained ‘by chance’.
4.2.4 2D AutoBCT (OTF) with CNN
The OTF version of AutoBCT on the 2D CNN problem is significantly affected by the relatively high noise present in observations of the score, and we observed its tendency to stop more unpredictably. We present summaries of three OTF runs (Table 10). Care should be taken when comparing these results to the ones obtained by AutoBCT (VI/Map) because, due to computational demands, only the \(N=2\) version of OTF is run, while for the VI/Map version we use the \(N=3\) map with 1 OTF step at the end (as always for VI/Map runs, see Algorithm 2), essentially making it \(N=4\) for the comparison with OTF. It can, therefore, be concluded that depth \(N=2\) is insufficient to obtain reliable results in the 2D case and one needs to use depth \(N=4\). Recall that the \(N=2\) version of OTF performed well in the synthetic 1D example.
4.3 AutoBCT validation on popular datasets
We provide summary statistics for the evaluation of AutoBCT's performance on 3 popular datasets employing 3 different machine learning models. In the tables, we provide in brackets: the accuracy for the column Final \(h_n\), the total running time for the column \(\sum _{i=1}^n t_i\), and the corresponding value of the hyperparameter for the column u (\(u_1\), \(u_2\)). Recall that details of parameter mappings and implementation are collected in Appendix E and comparisons to Hyperopt are made in Appendix F.
HIGGS Data Set (Baldi et al. 2014). This is a classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not. The data set contains 11 million rows. Training over the whole dataset is time-consuming, so we aim to find an optimal sample size to be used for training and testing with a Random Forest (ranger), given the tradeoff between the score and the computational cost. Results are collected in Table 11. The final accuracy of 72% puts the model within the top 10 performing models according to openml.org/t/146606 statistics for the HIGGS data set. The original paper (Baldi et al. 2014) reported a maximum AUC of 0.893 for a deep neural network and 0.81 for boosted decision trees. The AUC for the model from Table 11 is 0.797.
Intel image data. This is image data of natural scenes around the world, obtained from the Intel Scene Classification Challenge. The data contains around 25K images of size \(150\times 150\) distributed under 6 categories. The task is to create a classification model. For this example we use the categorical cross-entropy loss (Cox 1958) (\(\mathcal {X}\): population, x: sample, \(\mathcal {C}\): set of all classes, c: a class)
to construct a score (instead of the accuracy). The cross-entropy loss allows us to monitor overfitting of the CNN, while accuracy itself does not provide this information. In this example we choose the number of epochs of the CNN as the control. We obtain results for two architectures. In one architecture (Resnet50) we use the weights of the widely popular resnet50 model, while in the other (Plain) we use an architecture that does not rely on pretrained models. Details of the architectures of the Plain and Resnet50 models are provided in “Appendix 1”. Results are displayed in Table 12. We note that our Resnet50 model produces a result that is 3% short of the challenge winner, who obtained a model with 96% accuracy via a different architecture and ensemble averaging.
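As a rough illustration of a cross-entropy-based score, the sketch below computes the mean negative log-probability of the true class; the data representation and the transformation of the loss into the score h used in the paper are not reproduced here, so names and conventions are our assumptions.

```python
import math

def cross_entropy_score(probs, labels):
    """Mean categorical cross-entropy over a sample.

    probs: per-sample dicts mapping class -> predicted probability;
    labels: the true class of each sample.
    Illustrative only; the paper's exact normalisation may differ.
    """
    return -sum(math.log(p[c]) for p, c in zip(probs, labels)) / len(labels)
```

For example, uniform predictions over two classes give a loss of \(\log 2 \approx 0.693\) regardless of the true labels.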
We note that a more efficient optimisation of a parameter such as epochs is available. Having a fixed holdout (test) dataset, we are able to evaluate the performance of the model on the test set after each training step, without introducing any data leakage to the model. Therefore, if the cost of evaluating the model on the holdout set is relatively small compared to the overall training cost per step, observations of the score (h) and the cost (t) are available at each step. The optimiser's task would therefore simplify to stopping the training optimally, when the optimum tradeoff between the score and the cost is achieved. However, the flow of information would be different than in our model and a new modelling approach would be needed. Furthermore, in a Bayesian setting like ours, it is required that the error of each observation of the score and the cost is independent of the previous ones. This is clearly violated in an optimisation task with incremental improvements of the model. Nevertheless, this is an interesting direction of research which we retain for the future. Note also the similarities to the learning curve methods used by, e.g., Chandrashekaran and Lane (2017).
Credit card fraud data (Pozzolo et al. 2015). This dataset contains credit card transactions from September 2013. It is highly unbalanced, with only 492 frauds out of 284,807 transactions. We build an autoencoder (AE) (Kramer 1991) which is trained on non-fraudulent cases. We attempt to capture fraudulent cases by observing the reconstruction error.
We parametrise the network with two parameters: code, which specifies the size of the smallest layer, and scale, which determines the number of layers in the encoder (the decoder is symmetric). The largest layer of the encoder has the size \(\lceil 20 \times \text {scale} \rceil \) and there are \(\lceil \textit{scale} \rceil \) layers in the encoder spaced equally between the largest layer and the coding layer (\(\lceil x \rceil \) denotes the ceiling function, i.e., the smallest integer dominating x). So an AE described by code equal to 1 and scale equal to 1.9 has hidden layers of the following sizes: 38, 19, 1, 19, 38. We consider code and scale between 1 and 10.
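The layer-size construction above can be sketched as follows; the equal spacing and the truncation of fractional sizes are one plausible reading of the description (it reproduces the 38, 19, 1, 19, 38 example), not a confirmed implementation detail.

```python
import math

def hidden_layer_sizes(code, scale):
    """Hidden-layer sizes of the symmetric autoencoder described above.

    The largest encoder layer has ceil(20 * scale) units and there are
    ceil(scale) encoder layers between it and the coding layer; spacing
    and rounding conventions here are assumptions of this sketch.
    """
    largest = math.ceil(20 * scale)
    n_layers = math.ceil(scale)
    step = (largest - code) / n_layers  # equal spacing towards the code layer
    encoder = [int(largest - i * step) for i in range(n_layers)]
    return encoder + [code] + encoder[::-1]
```

With code = 1 and scale = 1.9 this yields the layer sizes 38, 19, 1, 19, 38 quoted in the text.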
For the 1D example, the optimisation parameter is the scale with \(\textit{code}=2\). For the 2D example, we optimise scale and code.
Given that fraudulent transactions constitute only \(0.172\%\) of all transactions, we do not focus on the pure accuracy of the model, but on a good balance between precision and recall, which are weighted together by the widely used generalized F-score (Dice 1945):
\[
F_\beta = (1+\beta ^2)\,\frac{\text {precision}\cdot \text {recall}}{\beta ^2\cdot \text {precision}+\text {recall}}.
\]
We pick \(\beta =6\) as we are interested in capturing most of the fraudulent transactions (we observe that the recall is at least 80%). In practice, \(\beta \) is chosen somewhat arbitrarily; however, it can also be used as an optimisation parameter (Klinger and Friedrich 2009).
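A minimal sketch of the generalized F-score (larger \(\beta \) attaches more importance to recall):

```python
def f_beta(precision, recall, beta=6.0):
    """Generalised F-score; beta = 6 matches the experiments above.

    For beta > 1 recall is weighted more heavily than precision;
    beta = 1 recovers the usual harmonic mean of the two.
    """
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

Note that when precision equals recall, \(F_\beta \) equals that common value for any \(\beta \).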
Results for the 1D example are displayed in Table 13. We observe that in this particular 1D example the optimal control of 0.01 corresponds to a shallow AE architecture, suggesting that good performance can be achieved with a small AE architecture.
The results for the 2D version of this example are shown in Table 14. AutoBCT decides to stop at the most shallow architecture as well, surprisingly revealing that even choosing a code size of 1 leads to similarly good results.
Summary of findings. Our validation demonstrated that AutoBCT performs well in hyperparameter selection problems for diverse datasets, machine learning techniques and objectives. The VI/Map version of the algorithm is preferred to OTF. The value function map can be precomputed in advance and used for a range of problems. Indeed, we used only 2 value functions: one for 1D problems and one for 2D problems. This means that the cloud \(\mathcal {C}\) of training points, the Lagrange multiplier \(\gamma \) and the basis functions can be chosen so that the value function is robust and applicable to a wide range of problems. The scalings of the raw score and cost are used to encode a user's expectations about the quality of the model and the cost of its training. In the CNN example (Sect. 4.2), which is a classification problem, it is very easy to obtain an accuracy of around 50%, so the observed accuracy is transformed so that this point is mapped close to 0. The scaling of the running time depends on the computational complexity of the problem and may be adjusted when the cost constrains the algorithm too much and the user is willing to endure longer running times. Note here that observations of the raw score and cost obtained for particular scaling maps can be reused to prime the algorithm (update the prior) for a different scaling, so this computational effort is not wasted. This could potentially lead to an automated scheme for adjusting the scaling, but the exploration of this direction is left for further research.
5 Conclusions
We have designed a new modelling framework for studying hyperparameter selection problems with a total cost constraint where the dependence of the objective function and the cost on hyperparameters is unknown and must be learnt. The emphasis is on problems where the cost constraint is very restrictive and prevents thorough exploration of the hyperparameter space. Such problems, for example, may arise in exploratory data analysis when a short response time, potentially at the cost of model quality, is of essence.
We model the dependence of the score and the cost on hyperparameters as a linear combination of basis functions. We embed the learning problem in a Bayesian framework with coefficients of basis functions being multivariate Gaussian distributed a priori. We design a numerical algorithm to solve the resulting optimal control problem and demonstrate its applicability for hyperparameter optimisation for a number of machine learning problems. At each learning step, the strategy either prescribes to terminate training or provides new hyperparameter values to be applied. Those values are selected in the optimal way taking into account explicitly the tradeoff between the improvement of the score (learning) and the cost; this information is encoded directly in the value function.
There are notable open problems left for further research. Our numerical results are for one- and two-dimensional hyperparameters. A larger number of parameters would require a much bigger computational budget to precompute the value function, and potentially call for new functional approximators for the value function and the policy and new ideas for an (approximate) representation of the state of the optimal control problem; in the two-dimensional setting, the dimensionality of the state of the optimal control problem already reaches 75. The choice of basis functions and prior distributions in multidimensional settings requires further research. Finally, the selection of the Lagrange multiplier \(\gamma \), the observation errors \(\sigma _H\) and \(\sigma _T\), and the computation of the score and the cost from observable quantities for a particular machine learning problem need to be further explored, with particular emphasis on meta-learning. It would also be interesting to perform a thorough analysis of the benefits of meta-learning for the selection of priors.
Notes
It will be clear soon why we use the word expected here. For readability, we will often omit it.
References
Bachouch, A., Huré, C., Langrené, N., Pham, H.: Deep neural networks algorithms for stochastic control problems on finite horizon: numerical applications. Methodol. Comput. Appl. Probab. 24, 143–178 (2022). https://doi.org/10.1007/s11009-019-09767-9
Balata, A., Palczewski, J.: Regress-later Monte Carlo for optimal control of Markov processes (2017). arXiv:1712.09705
Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in highenergy physics with deep learning. Nat. Commun. 5(1), 1–9 (2014)
Bensoussan, A.: Estimation and Control of Dynamical Systems. Springer, Berlin (2018)
Bergstra, J., Yamins, D., Cox, D.: Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: Dasgupta, S., McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, vol. 28, pp. 115–123. Atlanta, Georgia, USA: PMLR (2013, 17–19 Jun)
Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyperparameter optimization. Adv. Neural Inf. Process. Syst. 2546–2554 (2011)
Bertsekas, D.P., Shreve, S.E.: Stochastic Optimal Control: The Discrete Time Case. Athena Scientific, Nashua (1996)
Briganti, G., Le Moine, O.: Artificial intelligence in medicine: today and tomorrow. Front. Med. 7, 27 (2020)
Chandrashekaran, A., Lane, I.R.: Speeding up hyperparameter optimization by extrapolation of learning curves using previous builds. In: Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Džeroski, S. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 477–492. Springer International Publishing, Berlin (2017)
Cleveland, W.S., Grosse, E., Shyu, W.M.: Local regression models. In: Chambers, J.M., Hastie, T.J. (eds.) Statistical Models in S. Wadsworth & Brooks/Cole, New York (1992)
Cox, D.R.: The regression analysis of binary sequences. J. Roy. Stat. Soc.: Ser. B (Methodol.) 20(2), 215–232 (1958)
Cruz-Roa, A., Basavanhally, A., González, F., Gilmore, H., Feldman, M., Ganesan, S., et al.: Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. In: Medical Imaging 2014: Digital Pathology, vol. 9041, p. 904103 (2014)
Dai, Z., Yu, H., Low, B.K.H., Jaillet, P.: Bayesian optimization meets Bayesian optimal stopping. In: Chaudhuri, K., Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 1496–1506 (2019, 09–15 Jun). Retrieved from https://proceedings.mlr.press/v97/dai19a.html
Demšar, J., Zupan, B., Leban, G., Curk, T.: Orange: From experimental machine learning to interactive data mining. In: European Conference on Principles of Data Mining and Knowledge Discovery, pp. 537–539 (2004)
Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
Domhan, T., Springenberg, J.T., Hutter, F.: Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3460–3468 (2015)
Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., Leyton-Brown, K.: Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In: NIPS Workshop on Bayesian Optimization in Theory and Practice, vol. 10, p. 3 (2013)
Falkner, S., Klein, A., Hutter, F.: BOHB: Robust and efficient hyperparameter optimization at scale. In: International Conference on Machine Learning, pp. 1437–1446 (2018)
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J.T., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, vol. 2, pp. 2755–2763. MIT Press (2015)
Feurer, M., Springenberg, J.T., Hutter, F.: Initializing Bayesian hyperparameter optimization via meta-learning. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 1128–1135 (2015)
Gargiani, M., Klein, A., Falkner, S., Hutter, F.: Probabilistic rollouts for learning curve extrapolation across hyperparameter settings. In: 6th ICML Workshop on Automated Machine Learning (2019)
Ghoddusi, H., Creamer, G.G., Rafizadeh, N.: Machine learning in energy economics and finance: a review. Energy Econ. 81, 709–727 (2019)
Ginsbourger, D., Le Riche, R.: Towards Gaussian process-based optimization with finite time horizon. In: Giovagnoli, A., Atkinson, A.C., Torsney, B., May, C. (eds.) mODa 9—Advances in Model-Oriented Design and Analysis, pp. 89–96. Physica-Verlag HD, Heidelberg (2010)
Green, P.J., Silverman, B.W.: Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press, Boca Raton (1993)
Guyon, I., Sun-Hosoya, L., Boullé, M., Escalante, H.J., Escalera, S., Liu, Z., et al.: Analysis of the AutoML challenge series 2015–2018. AutoML (2019)
Hastie, T., Tibshirani, R., Jerome, F.: The Elements of Statistical Learning. Springer, New York (2009)
Holzinger, A., Dehmer, M., Jurisica, I.: Knowledge discovery and interactive data mining in bioinformatics: stateoftheart, future challenges and research directions. BMC Bioinform. 15, I1 (2014)
Huang, L., Jia, J., Yu, B., Chun, B.G., Maniatis, P., Naik, M.: Predicting execution time of computer programs using sparse polynomial regression. In: Proceedings of the 23rd International Conference on Neural Information Processing Systems, vol. 1, pp. 883–891 (2010)
Huré, C., Pham, H., Bachouch, A., Langrené, N.: Deep neural networks algorithms for stochastic control problems on finite horizon: convergence analysis. SIAM J. Numer. Anal. 51(1), 525–557 (2021)
Hutter, F., Hoos, H.H., LeytonBrown, K.: Sequential modelbased optimization for general algorithm configuration. In: Coello, C.A.C. (ed.) Learning and Intelligent Optimization, pp. 507–523. Springer, Berlin (2011)
Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning. Methods, Systems, Challenges. Springer (2019)
Hutter, F., Xu, L., Hoos, H.H., LeytonBrown, K.: Algorithm runtime prediction: methods and evaluation. Artif. Intell. 206, 79–111 (2014)
Janowczyk, A., Madabhushi, A.: Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J. Pathol. Inform. 7, 29 (2016)
Kharroubi, I., Langrené, N., Pham, H.: A numerical algorithm for fully nonlinear HJB equations: an approach by control randomization. Monte Carlo Methods Appl. 20(2), 145–165 (2014)
Klinger, R., Friedrich, C.M.: User’s choice of precision and recall in named entity recognition. In: Proceedings of the International Conference RANLP2009, pp. 192–196 (2009)
Kotthoff, L., Thornton, C., Hoos, H.H., Hutter, F., LeytonBrown, K.: AutoWEKA 2.0: automatic model selection and hyperparameter optimization in WEKA. J. Mach. Learn. Res. 18(1), 826–830 (2017)
Kramer, M.A.: Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37(2), 233–243 (1991)
Lam, R.R., Willcox, K.E., Wolpert, D.H.: Bayesian optimization with a finite budget: an approximate dynamic programming approach. In: Proceedings of the 30th International Conference On Neural Information Processing Systems, pp. 883–891. Curran Associates Inc, Red Hook, NY, USA (2016)
Lee, E., Eriksson, D., Bindel, D., Cheng, B., Mccourt, M.: Efficient rollout strategies for Bayesian optimization. In: Peters, J., Sontag, D. (eds.) Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), vol. 124, pp. 260–269 (2020). Retrieved from https://proceedings.mlr.press/v124/lee20a.html
Lewandowski, D., Kurowicka, D., Joe, H.: Generating random correlation matrices based on vines and extended onion method. J. Multivar. Anal. 100(9), 1989–2001 (2009)
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel banditbased approach to hyperparameter optimization. J. Mach. Learn. Res. 18(1), 6765–6816 (2017)
Longstaff, F.A., Schwartz, E.S.: Valuing American options by simulation: a simple leastsquares approach. Rev. Financ. Stud. 14(1), 113–147 (2001)
Nadarajah, S., Margot, F., Secomandi, N.: Comparison of least squares Monte Carlo methods with applications to energy real options. Eur. J. Oper. Res. 256(1), 196–204 (2017)
Osborne, M., Garnett, R., Roberts, S.J.: Gaussian processes for global optimization. In: 3rd International Conference on Learning and Intelligent Optimization (LION3), pp. 1–15 (2009)
Pozzolo, A.D., Caelen, O., Johnson, R.A., Bontempi, G.: Calibrating probability with undersampling for unbalanced classification. In: 2015 IEEE Symposium Series on Computational Intelligence, pp. 159–166 (2015)
Schwartz, R., Dodge, J., Smith, N., Etzioni, O.: Green AI. Commun. ACM 63(12), 54–63 (2020). https://doi.org/10.1145/3381831
Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, vol. 2, pp. 2951–2959 (2012)
Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650. Florence, Italy: Association for Computational Linguistics (2019)
Swersky, K., Snoek, J., Adams, R.P.: Freeze-thaw Bayesian optimization (2014). arXiv:1406.3896
Tsitsiklis, J.N., Van Roy, B.: Regression methods for pricing complex American-style options. IEEE Trans. Neural Netw. 12(4), 694–703 (2001)
Vanschoren, J.: Metalearning. In: Hutter, F., Kotthoff, L., Vanschoren, J. (eds.) Automated Machine Learning: Methods, Systems, Challenges, pp. 35–61. Springer, Berlin (2019)
Yang, C., Akimoto, Y., Kim, D.W., Udell, M.: OBOE: Collaborative filtering for AutoML model selection. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1173–1183 (2019)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)
Acknowledgements
Support of EPSRC grants EP/N013980/1 and EP/N510129/1 is gratefully acknowledged.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
1.1 A Dynamics of the observable Markov process
We derive joint dynamics of the filters \(({\mathcal {A}}_n)\) and \(({\mathcal {B}}_n)\), and observable processes \((h_n)\) and \((t_n)\). Random variables \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) are independent and observations are independent given the control \((u_n)\). Therefore, the dynamics of their filters \(({\mathcal {A}}_n)\) and \(({\mathcal {B}}_n)\) are independent given the control \((u_n)\).
Since the process \(({\mathcal {A}}_n)\) takes values in the space of Gaussian distributions, its dynamics can be fully described by the dynamics of the corresponding stochastic process representing means \((\mu ^\alpha _n)\) and covariances \(\Sigma ^\alpha _n\). The dynamics of \((\mu ^\alpha _n, \Sigma ^\alpha _n, h_n)\) is Markovian, timehomogeneous and depends on the control \((u_n)\): for any Borel measurable sets \(D_1 \subset \mathbb {R}^J\), \(D_2 \subset \mathbb {R}^{J \times J}\) (recall that J is the number of basis functions) and \(D_3 \subset \mathbb {R}\)
where \(\mu ^\alpha _1(\hat{h}, u)\) and \(\Sigma ^\alpha _1 (u) \) are obtained from formulas (2.9) with \(u_0=u\), \(h_1 = \hat{h}\) and \(\Sigma ^\alpha _0=\Sigma \), \(\mu ^\alpha _0 = \mu \):
Notice that the transition function does not depend directly on \(h_n\).
The above formula for \(\Sigma ^\alpha _{1}(u)\) is convenient for proofs. However, the inversion of matrices is numerically undesirable, so in our implementation we use an equivalent formula (Bensoussan 2018, Eq. (4.7.5))
It follows from (A.1) that the simulation of the Markov process \((\mu ^\alpha _n, \Sigma ^\alpha _n)\) may be performed with an intermediate step of generating \(h_{n+1}\). Given \((\mu ^\alpha _n, \Sigma ^\alpha _n)\) and the control \(u_n\), we first draw \(h_{n+1}\) from the normal distribution with the mean \((\mu ^\alpha _n)^T {\varvec{\phi }}(u_n)\) and the variance \({\varvec{\phi }}(u_n)^T \Sigma ^\alpha _n {\varvec{\phi }}(u_n) + \sigma ^2_H(u_n)\). Then we compute \(\mu ^\alpha _{n+1} = \mu ^\alpha _1(h_{n+1}, u_n)\) and \(\Sigma ^\alpha _{n+1} = \Sigma ^\alpha _1(u_n)\) inserting \(\mu = \mu ^\alpha _n\) and \(\Sigma = \Sigma ^\alpha _n\).
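The two-stage simulation described above can be sketched with a standard conjugate Gaussian (Kalman-type) update. The sketch below is an illustration under our assumptions, not the paper's implementation: the covariance update is written in an inversion-free form analogous to the one mentioned after the derivation, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_step(mu, Sigma, phi_u, sigma_H):
    """One simulated transition of the filter (mu^alpha_n, Sigma^alpha_n).

    Draws h_{n+1} from its predictive normal distribution, then applies a
    standard conjugate Gaussian update (a sketch consistent with the
    recursion described in the text, not the exact formulas (2.9)).
    """
    # predictive distribution of the next score observation h_{n+1}
    mean = mu @ phi_u
    var = phi_u @ Sigma @ phi_u + sigma_H**2
    h_next = rng.normal(mean, np.sqrt(var))
    # Kalman gain and posterior update (no matrix inversion required)
    gain = (Sigma @ phi_u) / var
    mu_next = mu + gain * (h_next - mean)
    Sigma_next = Sigma - np.outer(gain, phi_u @ Sigma)
    return mu_next, Sigma_next, h_next
```

In line with Lemma B.1, the largest eigenvalue of the posterior covariance never increases under this update.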
The process \(({\mathcal {B}}_n)\) is described, as above, by the mean \((\mu ^\beta _n)\) and the covariance matrix \((\Sigma ^\beta _n)\). We consider the dynamics of \((\mu ^\beta _n, \Sigma ^\beta _n, t_n)\). For Borel measurable sets \(D_1 \subset \mathbb {R}^K\), \(D_2 \subset \mathbb {R}^{K \times K}\) and \(D_3 \subset \mathbb {R}\) we have
where \(\mu ^\beta _1(\hat{t}, u)\) and \(\Sigma ^\beta _1(u)\) are given by (2.10) with \(u_0=u\), \(t_{1}=\hat{t}\) and \(\Sigma ^\beta _0=\Sigma \), \(\mu ^\beta _0 = \mu \). Notice that the transition function above does not depend directly on the value \(t_n\). Observe also that the simulation of the process \((\mu ^\beta _n, \Sigma ^\beta _n, t_n)\) can be performed in an analogous way as for \((\mu _n^\alpha , \Sigma _n^\alpha )\).
MDP dynamics. For the convenience of the reader familiar with the MDP framework (rather than the stochastic control terminology used in this paper), we provide an MDP formalisation of the dynamics of our Markov model after filtering. The state space of the Markov decision process is \(\mathbb {R}^J \times \mathbb {R}^{J \times J} \times \mathbb {R} \times \mathbb {R}^K \times \mathbb {R}^{K \times K} \times \mathbb {R}\) with the state denoted by \(X_n = (\mu ^\alpha _n, \Sigma ^\alpha _n, h_n, \mu ^\beta _n, \Sigma ^\beta _n, t_n)\). The action space is the set \(\mathcal {U}\). The transition operator is best described in two stages. In the first stage, \(h_{n+1}, t_{n+1}\) are selected with the density
In the second stage, \((\mu ^\alpha _{n+1}, \Sigma ^\alpha _{n+1}, \mu ^\beta _{n+1}, \Sigma ^\beta _{n+1})\) are computed deterministically via formulas (2.9) and (2.10).
1.2 B Auxiliary estimates
Without further mention, we assume that all matrices are symmetric and positive definite. We write \(\lambda _{\max } (\Sigma )\) for the largest eigenvalue of the matrix \(\Sigma \) (or equivalently, the operator norm of this matrix). Analogously, \(\lambda _{\min }(\Sigma )\) denotes the smallest eigenvalue. Recall also the notation for the Kalman filter (2.9) and (2.10).
Lemma B.1
For any \(u_0 \in \mathcal {U}\), \(\lambda _{\max } (\Sigma ^\alpha _1) \le \lambda _{\max } (\Sigma ^\alpha _0)\) and \(\lambda _{\max } (\Sigma ^\beta _1) \le \lambda _{\max } (\Sigma ^\beta _0)\), \(\mathbb {P}\)-a.s.
Proof
Follows directly from (2.9) and the observation that \(\lambda _{\min } \big ((\Sigma ^\alpha _1)^{-1}\big ) \ge \lambda _{\min } \big ((\Sigma ^\alpha _0)^{-1}\big )\). \(\square \)
Lemma B.2
Let a function \(\varphi :\mathbb {R}^J \times \mathbb {R}^{J\times J} \mapsto \mathbb {R}\) satisfy
for some \(a, b, c, p > 0\). Under Assumptions (A) and (S), we have
for some \(\tilde{a}, \tilde{b}, \tilde{c} > 0\) and \(q = \max (3/2, p)\).
Proof
Let \(A = \sup _{u \in \mathcal {U}} \max _j \phi _j(u)\). For any \(u \in \mathcal {U}\), skipping in the notation the conditioning on \(\mu ^\alpha _0 = \mu , \Sigma ^\alpha _0=\Sigma , u_0 = u\), we have
where the second inequality follows from Lemma B.1. It remains to estimate the norm of \(\mu ^\alpha _1\). From (A.1) and the triangle inequality for the Euclidean norm, we have
where \(\Sigma ^\alpha _1\) depends on \(h_1=h\) and \(u_0=u\) via formula (2.9). Applying \(\big \Vert \Sigma ^\alpha _1 {\varvec{\phi }}(u) (h - \mu ^T {\varvec{\phi }}(u)) \big \Vert \le \lambda _{\max }(\Sigma ) A |h - \mu ^T {\varvec{\phi }}(u)|\), we obtain
for \(a_1 = \lambda _{\max }(\Sigma ) A\) and \(b_1 = A^2 / \sigma _H(u)\). Above, the second inequality is by Jensen’s inequality, in the third inequality we use again Assumption (S), and the fourth inequality is due to \(\sqrt{x^2+y^2} \le x + y\) for \(x,y \ge 0\). Inserting the above estimate into (A.3) completes the proof. \(\square \)
Lemma B.3
Let a function \(\varphi :\mathbb {R}^J \times \mathbb {R}^{J\times J} \times \mathbb {R}^K \times \mathbb {R}^{K\times K}\mapsto \mathbb {R}\) satisfy
for some \(a, b, c, p > 0\). Under Assumptions (A) and (S), we have
where \(\tilde{a}, \tilde{b}, \tilde{c} > 0\) and \(q = \max (p, 3/2)\).
Proof
We use the representation (2.16) of the functional \(\mathcal {T}\). For any \(u \in \mathcal {U}\) with \(B = \sup _{u \in \mathcal {U}} \max _l \psi _l(u)\) and \(C = \sup _{u \in \mathcal {U}} \sigma ^2_T(u)\), direct calculations yield
Since
we have, suppressing in the notation the conditioning on \(\mu ^\alpha _0 = \mu ^\alpha \), \(\Sigma ^\alpha _0=\Sigma ^\alpha \), \(\mu ^\beta _0 = \mu ^\beta \), \(\Sigma ^\beta _0=\Sigma ^\beta \), \(u_0 = u\),
where the estimate of the first term was derived in the proof of Lemma B.2 with \(A = \sup _{u \in \mathcal {U}} \max _j \phi _j(u)\), and the estimate of the second term follows analogously as in Lemma B.2. Inserting the above bound and the bound for \(\Upsilon \) into (2.16) completes the proof. \(\square \)
1.3 C Proofs
Proof of Theorem 2.2
We have
Conditioning the nth term of the above sum on \(\mathcal {F}_n\) yields
where we used that \(\{\tau = n\} \in \mathcal {F}_n\) and \(u_{n-1}\) is \(\mathcal {F}_{n-1}\)-measurable. Hence, using the tower property of conditional expectation, the objective function (2.7) can be rewritten as
which completes the proof. \(\square \)
Lemma C.1
The value function defined in (2.12) is bounded from above by
where \({\varvec{\alpha }}\sim N(\mu ^\alpha , \Sigma ^\alpha )\), and bounded from below by
where \({\varvec{\beta }}\sim N(\mu ^\beta , \Sigma ^\beta )\) and \(\Vert \cdot \Vert \) denotes the Euclidean norm. Furthermore, functions \(v^*, v_*\) are finitevalued.
Proof
Recall from the proof of Theorem 2.2 that the first term of (2.12) can be equivalently written as
where \({\varvec{\alpha }}\sim N(\mu ^\alpha , \Sigma ^\alpha )\). By assumption (A), the vector \({\varvec{\phi }}(u_{\tau -1})\) has uniformly bounded entries, so \(C := \sup _{u \in \mathcal {U}} \Vert {\varvec{\phi }}(u)\Vert < \infty \). By the Cauchy–Schwarz inequality,
so
Recalling that \({\varvec{\alpha }}\) is multivariate Gaussian, we have \(E[\Vert {\varvec{\alpha }}\Vert ] < \infty \) and \(v^*(\mu ^\alpha , \Sigma ^\alpha ) < \infty \). Since the second term in (2.12) is nonpositive, V is bounded from above by \(v^*(\mu ^\alpha , \Sigma ^\alpha ) < \infty \).
Taking \(\tau = 1\) and any \(u_0 \in \mathcal {U}\) gives a finite lower bound on V:
\(\square \)
Derivation of formula (2.16)
Recall that the distribution of \(t_1\) is normal with the mean \((\mu ^\beta )^T {\varvec{\psi }}(u)\) and the variance \({\varvec{\psi }}(u)^T \Sigma ^\beta {\varvec{\psi }}(u) + \sigma ^2_T(u)\). The result follows from the following direct calculation: for \(Y \sim N(m, s^2)\)
\(\square \)
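The display with the direct calculation did not survive extraction. Since \(t_1\) is Gaussian, computations of this kind reduce to the Gaussian moment generating function \(E[e^{\lambda Y}] = e^{\lambda m + \lambda ^2 s^2/2}\) for \(Y \sim N(m, s^2)\). A Monte Carlo sanity check of that identity (an illustration of the type of computation involved, not formula (2.16) itself):

```python
import math
import random

def mgf_gaussian(lam, m, s):
    # Closed form of E[exp(lam * Y)] for Y ~ N(m, s^2).
    return math.exp(lam * m + 0.5 * lam ** 2 * s ** 2)

def mgf_monte_carlo(lam, m, s, n=100_000, seed=1):
    # Plain Monte Carlo estimate of the same expectation.
    rng = random.Random(seed)
    return sum(math.exp(lam * rng.gauss(m, s)) for _ in range(n)) / n
```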
Proof of Theorem 2.3
In the proof we use the representation (2.16) of operator \(\mathcal {T}\). From the proof of Lemma B.3, it follows that
for some \(a, b, c \ge 0\). Hence \(\mathcal {T} V_1\) is well defined. By induction and Lemma B.3, an estimate of the form (A.4) holds for \(V_N\), so \(\mathcal {T} V_N\) is well defined. The bound also ensures that infinite values are not admitted, which completes the proof of Assertion 1. Assertion 2 follows from the standard theory of stochastic control (Bertsekas and Shreve 1996).
The fact that \(V_N\) are nondecreasing in N is a direct consequence of the set of controls in (2.12) growing with N. Analogously, \(V_N \le V\). This implies that the limit \(V_\infty = \lim _{N \rightarrow \infty } V_N\) exists and \(V_\infty \le V\). By Assumption (A), the random variable \(H^* = \sup _{u \in \mathcal {U}} H(u)\) is integrable (\(E[H^*]<\infty \)), so by the dominated convergence theorem we have
for any admissible strategy \((\tau , (u_n))\). By the monotone convergence theorem
The above convergence results are applied to obtain the second equality below:
where we use that
for any choice of admissible strategy \((\tau , (u_n))\). Hence, \(V \le V_\infty \). In view of the opposite inequality obtained above, we have \(V = V_\infty \). Lemma C.1 guarantees that \(V < \infty \). This completes the proof of Assertion 3.
Using monotonicity of \(V_N\) in N and \(V_{N+1} = \mathcal {T} V_N\), we have (omitting parameters of functions for clarity of presentation)
where the last equality follows from pointwise convergence of \(V_N\) to V and the monotone convergence theorem whose assumptions are satisfied as
and \(E[V_1( \mu ^\alpha _1, \Sigma ^\alpha _1, \mu ^\beta _1, \Sigma ^\beta _1) ] < \infty \) by (A.4) and Lemma B.3. \(\square \)
1.4 D Training and testing splits for validation examples
Datasets were split into training and testing sets as detailed below:

Synthetic dataset: a randomly generated sample of 50,000 points, split 60/40 into training and testing sets.

CNN Breast Cancer: the training set was composed of 5000 randomly selected images from the pool of 277,524, with 50% IDC negative and 50% IDC positive. The testing set contained 1000 images and was also balanced between the two categories.

Higgs dataset: the sample size is the optimisation parameter. A subset of the data was randomly selected and the train/test split was 60/40.

Intel image data: 14,034 images were used for the training set and 3000 for the test set. Labels for both sets are available on request.

Credit card fraud data: the first 200k rows were used for training and the remaining rows for testing, resulting in an 80/20 split.
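The splits above amount to either shuffling indices (the 60/40 random splits) or cutting the data sequentially (the credit-card data). A minimal sketch with hypothetical helper names; the actual data loading is omitted:

```python
import random

def split_indices(n, test_frac, seed=0):
    """Random train/test split by index, e.g. 60/40 for the synthetic and Higgs data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * (1 - test_frac))
    return idx[:cut], idx[cut:]

def sequential_split(n, n_train):
    """Sequential split: first n_train rows for training (credit card fraud data)."""
    return list(range(n_train)), list(range(n_train, n))
```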
1.5 E AutoBML settings for validation examples
Throughout all examples we set \(\gamma = 0.16\). We also note that we use two value function maps for 1D examples (one with \(\sigma _H = 0.05\) and \(\sigma _T = 0.1\), and another one with \(\sigma _H = 0.15\) and \(\sigma _T = 0.1\)) and one value function map for 2D examples with \(\sigma _H = 0.15\) and \(\sigma _T = 0.1\). Regressions in value function maps (see line 7 in Algorithm 2 and line 6 in Algorithm 3) were performed via Random Forest Regression. Fitting of the function \(\tilde{\Lambda }(\cdot )\) is performed with splines: the smooth.spline function in R, which fits B-splines according to Cleveland et al. (1992), in one-dimensional settings, and the function Tps from the R package fields, which fits thin-plate splines, for two-dimensional problems (see Green and Silverman 1993 for a more detailed analysis of nonparametric regression).
We note that for training (including validation) and testing purposes we use fixed subsets of original data which are independent and do not contain any data leakage. Reported scores are computed on testing sets.
1.5.1 E.1 Common settings for 1D problems
Basis functions:

H: \(\{1, u-0.5, (u-0.5)^2, (u-0.5)^3\}\)

T: \(\{1, u-0.5, (u-0.5)^2, (u-0.5)^3\}\)
Observation error:

H: \(\sigma _H = 0.05\) or \(\sigma _H = 0.15\)

T: \(\sigma _T = 0.1\)
Method of fitting \(\tilde{\Lambda }(\cdot )\):

The smooth.spline function in R, which fits B-splines according to Cleveland et al. (1992).
1.5.2 E.2 Common settings for 2D problems
Basis functions:

H: \(\{(u_1 - 0.5)^k\}_{k=0}^4\), \(\{(u_2 - 0.5)^k\}_{k=1}^4\), \((u_1 - 0.5) \cdot (u_2 - 0.5)\)

T: \(\{(u_1 - 0.5)^k\}_{k=0}^4\), \(\{(u_2 - 0.5)^k\}_{k=1}^4\), \((u_1 - 0.5) \cdot (u_2 - 0.5)\)
Observation error:

H: \(\sigma _H = 0.15\)

T: \(\sigma _T = 0.1\)
Method of fitting \(\tilde{\Lambda }(\cdot )\):

The Tps function from the R package fields, which fits thin-plate splines; see Green and Silverman (1993) for a more detailed analysis of nonparametric regression.
1.5.3 E.3 1D example with Synthetic data (Sect. 4.1)
Observation error:

H: \(\sigma _H = 0.05\)
Prior state \(x_0\):

\(\mu _\alpha \) = \((0.4, 0.1, 0.2, 0.1)\)

\(\mu _\beta \) = (1, 1, 2, 2)

\(\Sigma _\alpha \) = \(I_4\) (\(I_n\) is the \(n\times n\) identity matrix)

\(\Sigma _\beta \) is a diagonal matrix with entries (0.64, 4, 4, 4)
Score (test accuracy) and Cost (running time in seconds) mapping:

H: \(h = (\hat{h} - 0.5)/0.5\)

T: \(t = \hat{t}/0.6\)
Control \(u \in [0,1]\) to hyperparameter \(n_{\text {trees}}\) mapping:

\(n_{\text {trees}}\): \(\lfloor 99u + 1\rfloor \)
1.5.4 E.4 1D example with Breast Cancer data (batch) (Sect. 4.2.1)
Observation error:

H: \(\sigma _H = 0.15\)
Prior state \(x_0\):

\(\mu _\alpha \) = \((0.4,0.1,0.2,0.1)\)

\(\mu _\beta \) = \((1,1,2,2)\)

\(\Sigma _\alpha \) = \(I_4\)

\(\Sigma _\beta \) is a diagonal matrix with entries (0.64, 4, 4, 4)
Score (test accuracy) and Cost (running time in seconds) mapping:

H: \(h = (\hat{h} - 0.45)/0.35\)

T: \(t = \hat{t}/7.5\)
Control \(u \in [0,1]\) to hyperparameter (batch) mapping:

batch: \(\lfloor (200-10) u + 10\rfloor \)
1.5.5 E.5 1D example with Breast Cancer data (r)
Observation error:

H: \(\sigma _H = 0.15\)
Prior state \(x_0\):

\(\mu _\alpha \) = \((0.4,0.1,0.8,0.1)\)

\(\mu _\beta \) = (0.48, 0, 0, 0)

\(\Sigma _\alpha \) = \(I_4\)

\(\Sigma _\beta \) is a diagonal matrix with entries (0.64, 0, 0, 0)
Score (test accuracy) and Cost (running time in seconds) mapping:

H: \(h = (\hat{h} - 0.45)/0.35\)

T: \(t = \hat{t}/7.5\)
Control \(u \in [0,1]\) to hyperparameter (r) mapping:

r: \(\exp [(\log (0.1)-\log (0.00001))u + \log (0.00001)]\)
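The control-to-hyperparameter mappings used in these examples translate directly into code. The helper names below are my own, but the formulas are the ones given above: the floor mappings for \(n_{\text {trees}}\) (E.3) and batch (E.4), and the log-scale mapping for r (E.5):

```python
import math

def n_trees(u):
    # E.3: u in [0, 1] -> number of trees in {1, ..., 100}.
    return math.floor(99 * u + 1)

def batch_size(u):
    # E.4: u in [0, 1] -> batch size in {10, ..., 200}.
    return math.floor((200 - 10) * u + 10)

def learning_rate(u):
    # E.5: log-scale mapping of u in [0, 1] onto [1e-5, 0.1].
    return math.exp((math.log(0.1) - math.log(0.00001)) * u + math.log(0.00001))
```

The log-scale mapping sends the midpoint u = 0.5 to the geometric mean of the bounds, learning_rate(0.5) ≈ 0.001, so equal steps in the control correspond to equal multiplicative steps in the learning rate.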
1.5.6 E.6 2D example with Breast Cancer data
Prior state \(x_0\):

\(\mu _\alpha \) = \((0.4, 0.3, 0.4, 0, 0, 0.2, 0.4, 0, 0, 0)\)

\(\mu _\beta \) = \((3, 0, 0, 0, 0, 3.5, 0.45, 0, 0.5, 0)\)

\(\Sigma _\alpha \) = \(0.6\times I_{10}\)

\(\Sigma _{\beta }\) is a diagonal matrix with entries \((0.6, 0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.001, 0.6)\)
Score (test accuracy) and Cost (minutes) scaling:

H: \(h = (\hat{h} - 0.45)/0.35\)

T: \(t = \hat{t}/7.5\)
Control \(u \in [0,1]\times [0,1]\) to hyperparameters (learning rate and batch size) mappings:

r: \(\exp [(\log (0.1)-\log (0.00001))u_1 + \log (0.00001)]\)

batch: \(\lfloor (200-10) u_2 + 10\rfloor \)
1.5.7 E.7 1D example with Higgs data
Observation error:

H: \(\sigma _H = 0.05\)
Prior state \(x_0\):

\(\mu _\alpha \) = \((0.5, 0.3, 0.2, 0)\)

\(\mu _\beta \) = (1.4, 4, 6, 6)

\(\Sigma _\alpha \) = \(I_4\)

\(\Sigma _\beta \) = \(I_4\)
Score (test accuracy) and Cost (minutes) scaling:

H: \(h = (\hat{h} - 0.6)/0.2\)

T: \(t = \hat{t}/5\)
Control \(u \in [0,1]\) to hyperparameter (sample size) mapping:

sample size: \(\exp [\log _{15}(0.00001) - u \log _{15}(0.00001)]\)
1.5.8 E.8 1D example with Intel Images data
Observation error:

H: \(\sigma _H = 0.15\)
Prior state \(x_0\):

\(\mu _\alpha \) = \((0.9, 0.4, 2, 0)\)

\(\mu _\beta \) = (0.8, 1.5, 0, 0)

\(\Sigma _\alpha \) = \(I_4\)

\(\Sigma _\beta \) = \(I_4\)
Score (test accuracy) and Cost (minutes) scaling:

H: \(h = \hat{h}\)

T: \(t = \hat{t}/12\)
Control \(u \in [0,1]\) to hyperparameter (epoch) mapping:

epoch: \( \lfloor 34u + 1\rfloor \)
1.5.9 E.9 1D example with Credit Card Fraud data
Observation error:

H: \(\sigma _H = 0.15\)
Prior state \(x_0\):

\(\mu _\alpha \) = \((0.9, 0, 2, 0)\)

\(\mu _\beta \) = (1.1, 0.6, 0, 0)

\(\Sigma _\alpha \) = \(I_4\)

\(\Sigma _\beta \) = \(I_4\)
Score (generalized Fscore, \(\beta =6\)) and Cost (minutes) scaling:

H: \(h = \hat{h} / 0.75\)

T: \(t = (\hat{t}-5)/20\)
Control \(u \in [0,1]\) to hyperparameter (scale) mapping:

scale: \( 9u + 1\)
1.5.10 E.10 2D example with Credit Card Fraud data
Prior state \(x_0\):

\(\mu _\alpha \) = \((0.8, 0, 0.8, 0, 0, 0, 0.8, 0, 0, 0)\)

\(\mu _\beta \) = (1.325, 0.75, 0.25, 0, 0, 0.75, 0.25, 0, 0, 0)

The diagonal of \(\Sigma _\alpha \) is \((4, 0.6, 0.62, 0.65, 0.70, 0.6, 0.62, 0.65, 0.70, 0.6)\), and the only nonzero off-diagonal entries are \(\Sigma _\alpha [1,2:10] = \Sigma _\alpha [2:10,1] = 0.1\).

\(\Sigma _{\beta } = I_{10}\)
Score (generalized Fscore, \(\beta =6\)) and Cost (minutes) scaling:

H: \(h = (\hat{h}-0.45)/0.35\)

T: \(t = \hat{t}/7.5\)
Control \(u \in [0,1]\times [0,1]\) to hyperparameters (scale and code) mappings:

scale: \( 9u_1 + 1\)

code: \( \text {round}(9u_2 + 1)\)
1.6 F Hyperopt benchmarks
We benchmark against a Tree-structured Parzen Estimator via the Hyperopt package in Python (Bergstra et al. 2013). As the cost is not part of the optimisation procedure in the Hyperopt implementation of Tree-structured Parzen Estimators, the optimisation routine relies on the user providing the maximum number of function evaluations. We explain how this is set for each optimisation problem. We record the evaluation time of each function call (ML training and testing, as for AutoBCT) together with the cumulative time. We scale real time identically as for the AutoBCT runs in Section 4 to allow for direct comparison.
1.6.1 F.1 CNN Breast Cancer
We initialised the Hyperopt 2D search over hyperparameters r and batch by providing constraints identical to those for AutoBCT. We provide ‘hp.loguniform()’ as a prior for r and ‘hp.choice()’ for batch, as documented by the package; the search space is defined as:
A representative run of Hyperopt for 2D problem is shown in Table 15. Representative run for a 1D problem for optimisation of r is in Table 16 and for optimisation of batch is in Table 17.
In the 2D example, the maximum accuracy obtained in 20 iterations by Hyperopt is 75% with hyperparameters \(r = 0.003\) and \(batch=62\). The total elapsed time is 2.9 times larger than that of the AutoBCT run presented in the paper (Table 8). Had we terminated the Hyperopt search after it reached AutoBCT’s total elapsed time of 1.72, the maximum accuracy would have been 70% with \(r = 0.0006\) and \(batch=132\). The AutoBCT run in Table 8 obtains hyperparameter values \(r = 0.001\) and \(batch = 38\), corresponding to an accuracy of 78%, and stops in 9 iterations. Aggregate results in Table 9 indicate an average accuracy of 72% with the mean of controls corresponding to \(r=0.007\) and \(batch=66\). The average running time is 2.02. A similar accuracy is attained by Hyperopt within the same running time but then, in iterations \(n=11, \ldots , 14\), unpromising values are explored.
In the 1D case of optimisation of r, see Table 16, Hyperopt found a good parameter combination in the second iteration but failed to improve on it until iteration 9, which suggests that this early success was due to chance. The corresponding results for AutoBCT are in Table 7: it achieves an accuracy of 72.8% with \(r=0.0005\) within a (scaled) time of 1.8.
The progress of Hyperopt for batch is displayed in Table 17. AutoBCT concluded with an accuracy of 73% with \(batch=89\) and a total (scaled) running time of 0.45.
1.6.2 F.2 Intel Image data
We ran the Hyperopt optimiser for the simple and Resnet50 architectures. Results are displayed in Tables 18 and 19. The number of iterations was set to 10, given that AutoBCT converged in 9 iterations. Hyperopt obtained the optimal control \(epoch=22\) (accuracy 86.5%) for the simple architecture, and \(epoch=26\) (accuracy 91.8%) for the Resnet50-enhanced architecture. These results are similar to those obtained by AutoBCT.
1.6.3 F.3 Credit card fraud
We collect here example runs of the Hyperopt optimiser for the 1D (scale) and 2D (scale and code) problems, see Tables 20 and 21. The total number of evaluation runs was set to 8 for the 1D problem and 16 for the 2D problem, given that AutoBCT converged in 4 and 8 iterations, respectively. Hyperopt obtained the optimal control \(scale=1.17\) (\(F_{\beta }=72.6\)) for the 1D problem, and \(scale=1.89\), \(code=9\) (\(F_{\beta }=71.8\)) for the 2D problem. Notice that in the 2D problem, Hyperopt took a long time to approach relatively good sets of parameters. AutoBCT completed its search in 2.87, which is equivalent to roughly 4 iterations of Hyperopt. Recall that the prior used in AutoBCT indicated that low values of parameters correspond to a lower score, which is opposite to the dependence experienced in this problem.
1.7 G Model architectures
We provide snippets of Python code defining the architectures of the neural networks used in Section 4.
1.8 G.1 CNN breast cancer example
1.8.1 G.2 Plain and Resnet50 models for Intel data
1.8.2 G.3 Plain
1.8.3 G.4 Resnet50
Cite this article
Cironis, L., Palczewski, J. & Aivaliotis, G. Automatic model training under restrictive time constraints. Stat Comput 33, 16 (2023). https://doi.org/10.1007/s11222-022-10166-3
Keywords
 Machine learning
 Hyperparameter optimization
 Time constraint
 Markov decision process