1 Introduction

The emergence of Machine Learning (ML) methods as an effective and easy-to-use tool for modelling and prediction has opened up new horizons for users in all aspects of business, finance, health, and research (Ghoddusi et al. 2019; Briganti and Le Moine 2020). In stark contrast to more traditional statistical methods that require relevant expertise, ML methods are largely black boxes and are based on minimal assumptions. This has resulted in the adoption of ML by an audience of users with possible expertise in the domain of application but little or no expertise in statistics or computer science. Despite these features, effective use of ML does require good knowledge of the method used. The choice of the method (algorithm), the tuning of its hyperparameters and the architecture require experience and a good understanding of those methods and the data. Naturally, trial and error might provide possible solutions; however, it can also result in sub-optimal use of ML and wasted computational resources.

The goal of this research is to find optimal values of hyperparameters for a given dataset and a modelling objective when the relationship between the hyperparameters, model quality (model score) and training cost is not known in advance. This problem is akin to the classical AutoML setup (Hutter et al. 2019) with one crucial difference: if desired, the system has to produce results in a radically shorter time than classical AutoML solutions. This comes at the expense of accuracy, so taking the trade-off between the model quality and the running time explicitly into account lies at the heart of this paper. The relationship between hyperparameters, model quality and computational cost for a particular modelling problem and a particular dataset must be learnt, and this learning is incorporated directly into the optimisation problem.

1.1 Main contribution

The objective of a classical hyperparameter optimisation problem is to find values of hyperparameters that maximise the accuracy or a related score of the trained model:

$$\begin{aligned} \text {maximise } H(u) \quad \text {subject to}\quad u \in \mathcal {U}, \end{aligned}$$

where H(u) is the model score given hyperparameters u and \(\mathcal {U}\) is the domain of hyperparameters. As explained above, we are interested in the hyperparameter optimisation problem when the relationship between hyperparameters and the score is not perfectly known and must be learnt by observing the performance of different hyperparameter choices, similarly to reinforcement learning. Unlike reinforcement learning, we add a constraint on the total optimisation cost, which may be a function of the running time or resources consumed in the course of hyperparameter optimisation. A schematic presentation of the resulting optimisation problem is as follows:

$$\begin{aligned} \text {maximise } H(u_{\tau -1})\quad \text {subject to } \sum _{n=1}^{\tau } T(u_{n-1}) \le T^*, \end{aligned}$$
(1.1)

where the optimisation is over \(\tau \ge 1\) and \((u_n)_{n=0}^\infty \). Here, \(\tau \) is the number of training steps (a quantity dependent on the course of learning/training), \(u_{n} \in \mathcal {U}\) denotes the choice of hyperparameters at step n (also dependent on the history). The function T(u) is the cost of model training and validation for hyperparameters u.

In practice, the observed model score may be random as it depends on the training process of the model (which can itself include randomness, as in the construction of random forests or in the stochastic gradient method for neural networks) and on the choice of training and validation datasets (including K-fold cross-validation). The observed computational cost can also be random due to random variations in running time commonly observed in practical applications, arising both from the hardware and the operating system as well as from random choices in the training process. To account for this randomness and to allow for the incorporation of Bayesian learning, we reformulate the hyperparameter optimisation problem (1.1) as a stochastic control problem using notation which will be used throughout the paper:

$$\begin{aligned} \begin{aligned} \text {maximise } E [h_\tau ] \quad \text {subject to } E \left[ \sum _{n=1}^{\tau } t_n\right] \le T^*. \end{aligned} \end{aligned}$$
(1.2)

The quantity \(h_n\) is the (random) observed model score corresponding to the choice of hyperparameters \(u_{n-1}\). The (random) observed computational cost \(t_n\) also depends on the choice of hyperparameters \(u_{n-1}\). We recall that \(\tau \) indicates the time when the learning terminates and itself depends on the whole history of observations. Similarly, hyperparameter choice \(u_n\) depends on the history of observations up to time n. We consider the expectation of the score at step \(\tau \) and the expectation of the cumulative computational cost since the randomness in observation of those quantities can be viewed as external to the problem and not under our control.

We emphasise that the motivation of this paper comes from optimisation problems where \(T^*\) is significantly smaller than commonly used in classical AutoML applications. When model training is performed during an exploratory data analysis, short response times, even at the cost of model accuracy, are key for success (Demšar et al. 2004; Holzinger et al. 2014) and may reduce computational resource usage at early stages of modelling. When \(T^*\) is small, it needs to be directly included in the optimisation criterion as the choice of controls \(u_n\) and the stopping time \(\tau \) is strongly dependent on \(T^*\).

As is common in the stochastic control and stochastic optimisation literature, we will study a Lagrangian relaxation of problem (1.2), which takes the form

$$\begin{aligned} \sup _{(u_n), \tau } E \left[ h_\tau - \gamma \sum _{n=1}^{\tau } t_n\right] . \end{aligned}$$
(1.3)

The constant \(\gamma >0\) is the Lagrange multiplier corresponding to the constraint \(E\big [\sum _{n=1}^{\tau } t_n\big ] \le T^*\). It models our aversion to the computational cost: the higher its value, the more sensitive we are to increases in cost and the more willing we are to accept models with lower scores. We call the optimisation problem (1.3) Automated Budget Constrained Training (AutoBCT).

We study (1.3) in the framework of Markov Decision Processes with partially observable dynamics, combining stochastic control, optimal stopping and filtering. We treat hyperparameters as a control variable and \(\tau \) as a stopping time. The choice of hyperparameters \(u_{n-1}\) for the step n is based on the observation of model scores \(h_1, \ldots , h_{n-1}\) and costs \(t_1, \ldots , t_{n-1}\) in all the previous steps as well as on the prior information on the dependence of the score and cost on the hyperparameters. The number of training steps \(\tau \) also depends on the history of model scores and computational costs. At the end of each step a decision is made whether to continue training or accept current model. Hence \(\tau \) is not deterministically chosen before the learning starts and depends not only on the total computational budget but also on observed scores and costs. This may and does lead AutoBCT to stop quite early in some problems when the model happens to be so good that the expenditure on more training is not worthwhile (according to the criterion (1.3) which looks at the trade-off between score and cost).

We approximate score and cost as functions of hyperparameters using linear combinations of basis functions. We assume that the prior knowledge about the distribution of the coefficients of these linear combinations is multivariate Gaussian and is updated after each step using Kalman filtering. This update not only changes the mean vector but also adjusts the covariance matrix which represents the uncertainty about the coefficients and, consequently, about the score and cost mappings. Updated quantities feed back into the algorithm which decides whether to continue or stop training, and, in the former case, which hyperparameters to choose for the next step. We show that the updated distributions of coefficients are sufficient statistics of past observations of model scores and costs for the purpose of optimisation of the objective function (1.3).

Our framework requires that a value function, representing the optimum value of the objective function (1.3), is available. It depends only on the dimension of the problem and the trade-off coefficient \(\gamma \), and is score and cost agnostic (in the sense that it does not depend on the particular problem and the choice of score and cost). This allows for efficient offline precomputation and recycling of the value function maps. Users are able to share and reuse the same value function maps for different problems as long as the dimensionality of the hyperparameter space agrees. We demonstrate this in the validation section of this paper.

Similarly to other AutoML frameworks (Feurer et al. 2015; Vanschoren 2019), AutoBCT natively enables meta-learning. Prior distributions for coefficients of basis functions determining the score and cost maps can be combined with the meta-data describing already completed learning tasks in publicly available repositories. These data can then be used to select a more informative prior for a given problem and, therefore, to warm-start the training. In turn, this leads to reductions in the computational cost and improvements in the score.

The focus on directly optimising the trade-off between the model score and the total computational cost comes from two directions. Firstly, given the recent emphasis on the eco-efficiency of AI (Strubell et al. 2019; Schwartz et al. 2019), the AutoBCT framework provides effective ways of weighing model quality against the employed computational resources as well as recycling information and, in turn, reducing the computational resources required to train sufficiently good models. Secondly, with the democratisation of data analytics, an increasing amount of data exploration will take place where users need to obtain useful (but not necessarily optimal) results within seconds or minutes (Demšar et al. 2004; Holzinger et al. 2014).

In summary, this paper’s contributions are multifaceted. On the practical level, we develop an algorithm that allows optimal selection of hyperparameters for training and evaluation of models with an explicit trade-off between the model quality and the computational resources. From the eco-efficiency perspective our framework not only discourages unnecessarily resource intensive training but also naturally enables transfer and accumulation of knowledge. On the side of Markov decision processes, we solve a non-standard, mixed stochastic control and stopping problem for a partially observable dynamical system and design an efficient numerical scheme for the computation of the value function and optimal controls. Lastly, the applicability of AutoBCT goes beyond machine learning. Any parameter optimisation problem with learning which can be stated as maximising a certain one-dimensional ‘score’ (representing a quality of an outcome) subject to a constraint on ‘costs’ (representing resources used) directly fits into our framework.

1.2 Related literature

To maintain the appeal of ML to its huge and newly acquired audience and to enable its use in systems that do not require human intervention, a large effort has been put into the development of algorithms and systems that allow automatic selection of an optimal ML method and an optimal tuning of its parameters. This line of research has largely operated under the umbrella name of AutoML (see Hutter et al. 2019), standing for Automated Machine Learning, and has been boosted by a series of data crunching challenges under the same title (see, e.g., Guyon et al. 2019). There are widely available AutoML packages in most statistical and data analytics software, see, e.g., Auto-Weka (Kotthoff et al. 2017), Auto-sklearn (Feurer et al. 2015), as well as in all major commercial cloud computing environments.

The core of parameter optimisation typically takes a Bayesian approach in the form of alternating between learning/updating a model that describes the dependence of the score on the hyperparameters and using this model together with an acquisition function to select the next candidate hyperparameters. The algorithm used to support this alternation of model fitting and parameter updating also affects performance and suitability for specific problems. The three predominant algorithms are (for a discussion see Eggensperger et al. 2013): Gaussian Processes (Snoek et al. 2012), the random forest based Sequential Model-based Algorithm Configuration (SMAC) (Hutter et al. 2011), and the Tree Parzen Estimator (Bergstra et al. 2011).

Our approach borrows from the ideas underlying Sequential Model-based Optimisation (SMBO) (Hutter et al. 2011) and learning curve approximation (Domhan et al. 2015). The surrogate model in our case is the basis-functions based representation of the score and cost maps. However, unlike SMBO, the choice of hyperparameters is not only based on the surrogate model but also on the expected future performance encoded in the value function. Unlike Domhan et al. (2015), the prior distribution of basis function coefficients is conjugate to the distribution of observations so that analytical formulas for the posterior distribution are available and the dimensionality of the representation of the posterior distribution of coefficients does not change with the increasing number of observations. The latter feature is key for tractability of our methodology.

In all the above, time is certainly an important factor. It is, however, mostly treated either implicitly, through the exploration–exploitation trade-off, or through hard bounds, which can sometimes result in the algorithm providing no answer within the allocated time. A more sophisticated optimisation of computational resources has recently started to draw interest. In recent work, Falkner et al. (2018) combine Bayesian optimisation, to build a model and suggest potential new parameter configurations, with Hyperband, which dynamically allocates time budgets to these new configurations, applying successive halving (Li et al. 2017) to reduce the time allocated to unsuccessful choices. Yang et al. (2019) use meta-learning and impose time constraints on the predicted running time of models. Swersky et al. (2014) discuss a freeze-thaw technique for (at least temporarily) pausing training for unpromising hyperparameters. Another approach prominent in the literature uses learning curves to stop training under-performing neural network models, see Domhan et al. (2015), Chandrashekaran and Lane (2017), Gargiani et al. (2019). Bayesian Optimal Stopping is employed by Dai et al. (2019) to achieve a similar goal. A further related line of research is concerned with the prediction of running times of algorithms (see Huang et al. 2010; Hutter et al. 2014 and many others). Those algorithms are akin to meta-learning, i.e., running time is predicted based on metafeatures of algorithms and their inputs.

Our approach differs from the above ideas in that it explicitly accounts for the trade-off between the model quality (score) and the cumulative cost of training. To the best of our knowledge, this is the first attempt to bring the cost of computational resources to the same level of importance as the model score and to include it in the objective function. Our developments are, however, parallel to some of the papers discussed above. For example, our approach would benefit from the employment of learning curves to allow for incomplete training for hyperparameters which appear to produce suboptimal models. Our algorithm could complement OBOE (Yang et al. 2019) by fine-tuning hyperparameters of selected models. Such extensions are beyond the scope of this paper and will be explored in further research.

Closely related to our problem is budget-constrained Bayesian optimisation, see Osborne et al. (2009), Ginsbourger and Le Riche (2010). The budget is stated in terms of the number of evaluations of the unknown function (in our case, the unknown function is the mapping of hyperparameters into the score). This problem is studied in the framework of dynamic programming. The key challenge is to efficiently approximate the so-called continuation value without nested evaluations which are prohibitively expensive, see, e.g., Lam et al. (2016); Lee et al. (2020) which employ the so-called rollout strategies. As in the classical Bayesian optimisation, the unknown function is approximated using Gaussian processes so the dimensionality of the state increases after each step. In our basis-function approach the dimensionality of the state space does not change. Notice also that the former approach is similar to ours only if the cost of each evaluation is the same and known.

Our numerical method for computation of the value function complements a large strand of literature on regression Monte Carlo methods. The most widely studied problem is that of optimal stopping (aka American option pricing), initiated by Tsitsiklis and VanRoy (2001), Longstaff and Schwartz (2001) and studied continuously to this day (Nadarajah et al. 2017). Stochastic control problems were first addressed in Kharroubi et al. (2014) using control randomisation techniques and later studied in Balata and Palczewski (2017) using a regress-later approach and in Bachouch et al. (2018) employing deep neural networks. Numerical methods designed in our paper are closest in their spirit to the regress-later ideas of Balata and Palczewski (2017) (see also Bachouch et al. 2018).

1.3 Paper structure

The paper is structured as follows. Section 2.1 gives details of the optimisation problem. Technical aspects of embedding this problem within the theory of Markov decision processes with partial information come in the following subsection. Due to the noisy observation of the realised model score and cost (partial information), the resulting model is non-Markovian. Section 2.3 reformulates the optimisation problem as a classical Markov decision model with a different state space. The rest of Sect. 2 is devoted to showing that the latter optimisation problem has a solution, developing a dynamic programming equation and showing how to compute optimal controls. Section 3 contains details of the numerical computation of the value function and provides the necessary algorithms. Validation of our approach is performed in Sect. 4, which opens with a synthetic example to present, in a simple setting, properties of our solution. It is followed by 8 examples encompassing a variety of datasets and modelling objectives. Appendices contain further details. Appendix A derives the dynamics of the Markov decision process of Sect. 2.3. Appendix B contains auxiliary estimates used in the proofs of the main results collected in Appendix C. Detailed settings for all studied models are in Appendices D and E, while architectures of neural networks used in examples are provided in Appendix G. Comparisons to Hyperopt are collected in Appendix F.

For the code and value functions (maps) used in the paper, contact the authors.

2 Stochastic control problem

In this section, we embed the optimisation problem in the theory of stochastic control of partially observable dynamical systems (Bertsekas and Shreve 1996, Chapter 10) and lay foundations for its solution.

[Algorithm 1]

2.1 Model overview

We start by providing a sketch of our framework. Its construction is motivated by a trade-off between two opposing goals: the representation of the key features of the trained model’s score and cost functions needed in optimisation of hyperparameters, and the computational tractability.

Recall the optimisation problem (1.3). The random quantities \(h_n\) and \(t_n\) are not known as they depend on a dataset, a particular problem, and the software and hardware used. We therefore embed the optimisation problem in a Bayesian learning setting. Following ideas of linear basis expansion (Hastie et al. 2009, Ch. 5) and regression Monte Carlo (Tsitsiklis and VanRoy 2001), we assume that the true expected model score H(u) and the true expected model cost T(u), given the choice of hyperparameters \(u \in \mathcal {U}\), can be represented as linear combinations of basis functions:

$$\begin{aligned} H(u) = \sum _{j=1}^J \alpha _j \phi _j(u), \quad T(u) = \sum _{k=1}^K \beta _k \psi _k(u), \end{aligned}$$
(2.4)

where functions \((\phi _j)_{j=1}^J\) and \((\psi _k)_{k=1}^K\) act from \(\mathcal {U}\) to \(\mathbb {R}\); we recall that \(\mathcal {U}\) is the set of admissible controls (hyperparameter values). The coefficients \({\varvec{\alpha }}= (\alpha _1, \ldots , \alpha _J)^T\) and \({\varvec{\beta }}= (\beta _1, \ldots , \beta _K)^T\) evolve while learning takes place. A priori, before learning starts, \({\varvec{\alpha }}\) follows a normal distribution with mean \(\mu ^\alpha _0\) and covariance matrix \(\Sigma ^\alpha _0\); the vector of coefficients \({\varvec{\beta }}\) follows a normal distribution with mean \(\mu ^\beta _0\) and covariance matrix \(\Sigma ^\beta _0\). These distributions are referred to as the prior distributions. We assume that a priori \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) are independent of each other. This choice of distributions is motivated by computational tractability but comes at the cost of modelling inaccuracy. In particular, we cannot guarantee that \(T(u) \ge 0\) for every \(u \in \mathcal {U}\) and any realisation of \({\varvec{\beta }}\). The consequences will be explored in more depth later in the paper.
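For concreteness, the following minimal sketch (an illustration, not the implementation used in the paper) instantiates the representation (2.4) for a one-dimensional hyperparameter with monomial bases and Gaussian priors; all numerical choices are assumptions made for the example.

```python
import numpy as np

# Monomial bases on a one-dimensional hyperparameter u in [0, 1].
phi = lambda u: np.array([1.0, u, u**2, u**3])   # J = 4 score basis functions
psi = lambda u: np.array([1.0, u, u**2])         # K = 3 cost basis functions

# Prior: alpha ~ N(mu_alpha0, Sigma_alpha0), beta ~ N(mu_beta0, Sigma_beta0).
mu_alpha0, Sigma_alpha0 = np.zeros(4), 0.5 * np.eye(4)
mu_beta0, Sigma_beta0 = np.array([0.2, 0.3, 0.0]), 0.1 * np.eye(3)

rng = np.random.default_rng(0)
alpha = rng.multivariate_normal(mu_alpha0, Sigma_alpha0)  # one realisation of the score coefficients
beta = rng.multivariate_normal(mu_beta0, Sigma_beta0)     # one realisation of the cost coefficients

H = lambda u: alpha @ phi(u)  # true expected score H(u) for this realisation
T = lambda u: beta @ psi(u)   # true expected cost T(u) for this realisation
print(H(0.3), T(0.3))
```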

The observed score \(h_{n}\) and cost \(t_{n}\) experience variation due to inherent randomness in model training and testing or in execution conditions (which is particularly important for short running times), see Domhan et al. (2015), Gargiani et al. (2019) for a similar assumption in the case of learning curve estimation. To account for this variation, we assume that

$$\begin{aligned} \begin{aligned} h_{n}&= H(u_{n-1}) + \sigma _H(u_{n-1}) \epsilon _{n},\\ t_{n}&= T(u_{n-1}) + \sigma _T(u_{n-1}) \eta _{n}, \end{aligned} \end{aligned}$$
(2.5)

where \((\epsilon _n)\) and \((\eta _n)\) are independent sequences of independent standard normal N(0, 1) random variables. The terms \(\sigma _H(u_{n-1}) \epsilon _{n}\) and \(\sigma _T(u_{n-1}) \eta _{n}\) correspond to the aforementioned random fluctuations of observed quantities around \(H(u_{n-1})\) and \(T(u_{n-1})\) and are Gaussian with the mean 0 and the known variances \(\sigma ^2_H(u_{n-1})\) and \(\sigma _T^2(u_{n-1})\).

Algorithm 1 provides a sketch of our approach. Given all information at step 0, i.e., given the prior distribution of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) and a pre-computed AutoBCT map \(\tilde{V}\) (the computation of the AutoBCT map is detailed in Sect. 3.2), we choose values of hyperparameters \(u_0\) and then train and validate a model. Based on the observed model score \(h_1\) and cost \(t_1\), we update the distribution of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) (the learning step that is based on the Kalman filter, Sect. 2.3). Following this, we check the stopping condition: on an intuitive level, the algorithm stops if expected potential gains in the score do not justify additional costs (Sect. 2.5 contains a detailed discussion of stopping conditions). If the decision is to stop, we return the latest values of the hyperparameters and further information, for example, the latest distributions of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) or the latest model. Otherwise, we repeat the hyperparameter choice, training, validation and update procedures. We continue until the decision is made to stop.
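A minimal structural sketch of this loop follows. The helper names (choose_control, train_and_validate, kalman_update, should_stop) are placeholders for the components described in Sects. 2.3–2.5 and 3.3, and the explicit cap on the number of steps is an assumption of the sketch rather than part of Algorithm 1.

```python
# Structural sketch of the AutoBCT loop of Algorithm 1 (illustration only).
def auto_bct_loop(prior_state, choose_control, train_and_validate,
                  kalman_update, should_stop, max_steps=50):
    # prior_state = (mu_alpha_0, Sigma_alpha_0, mu_beta_0, Sigma_beta_0)
    state, u = prior_state, None
    for _ in range(max_steps):
        u = choose_control(state)              # hyperparameters for the next training run
        h, t = train_and_validate(u)           # external ML routine: observed score and cost
        state = kalman_update(state, u, h, t)  # Bayesian update (2.9)-(2.10)
        if should_stop(state, u):              # stop if expected gains do not justify costs
            break
    return u, state
```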

Remark 2.1

In our Bayesian learning framework, the expectation in (1.3) is not only with respect to the randomness of observation errors \((\epsilon _n)\) and \((\eta _n)\) but also with respect to the prior distribution of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\). Therefore, all possible realisations of the score and cost mappings are evaluated with more probable outcomes (as directed by the prior distribution) receiving proportionally more contribution to the value of the objective function.

In the remainder of this subsection, we derive the optimisation problem that will be solved in this paper. This will result in a modification of both terms in the objective function in (1.3).

The score. In Problem (1.3) the optimised quantity is the observed score of the most recently trained model. When the emphasis is on the selection of hyperparameters, one would replace \(h_{\tau }\) by \(H(u_{\tau -1})\) which corresponds to maximisation of the true expected score without the observation error:

$$\begin{aligned} \sup _{(u_n), \tau } E \left[ H(u_{\tau -1}) - \gamma \sum _{n=1}^{\tau } t_n\right] . \end{aligned}$$
(2.6)

Either case can easily be accommodated by our framework; however, we restrict attention to the objective function (2.6) in this paper.

The cost. The optimal value for the problem (2.6), as well as (1.3), is infinite and offers controls \((u_n)\) of little practical use. This is because \({\varvec{\beta }}\) follows a normal distribution for the sake of numerical tractability, so for any choice of basis functions \({\varvec{\psi }}\) one cannot have \(\sum _{k=1}^K \beta _k \psi _k(u) \ge 0\) for all \(u\in \mathcal {U}\) with probability one. Indeed, for any \(\delta > 0\) we have

$$\begin{aligned} \mathbb {P}\left( \sum _{k=1}^K \beta _k \psi _k(u) < -\delta \text { for some}\, u \in \mathcal {U}\right) > 0. \end{aligned}$$

While learning, one could identify, with increasing probability as n increases, those realisations of \({\varvec{\beta }}\) for which there is \(u \in \mathcal {U}\) with \(\sum _{k=1}^K \beta _k \psi _k(u) < -\delta \) for some fixed \(\delta > 0\). Then, continuing infinitely long with those controls u would lead to a positive infinite value of the expression inside the expectation.

In view of the above remark, we truncate the cost to stay positive. The final form of the optimisation problem is

$$\begin{aligned} \sup _{(u_n), \tau } E \left[ H(u_{\tau -1}) - \gamma \sum _{n=1}^\tau (t_n)^+ \right] , \end{aligned}$$
(2.7)

where \((t_n)^+ := \max (t_n, 0)\). We show in Lemma C.1 that the optimal value of this problem is finite.

An alternative approach could be to amend the definition of \(t_n\) and the prior distribution of basis function coefficients to ensure non-negativity, but this would invalidate assumptions of the Kalman filter used in learning and, hence, make the solution of the problem numerically infeasible. Indeed, Domhan et al. (2015) defines a prior distribution of coefficients so that particular features of the resulting curves (like monotonicity or non-negativity) are guaranteed but then needs to resort to numerical methods for sampling from the posterior distribution.

2.2 Technical details

Optimisation problem (2.7) involves quantities which are not directly observable, namely, the coefficients of the basis functions making up the true score H and cost T. In the following two subsections, using the theory of optimal control with partial observations, we derive a fully observed control problem. For the convenience of a reader familiar with the MDP framework, a complete presentation of the dynamics of the latter problem is included in Appendix A.

Let \((\Omega , \mathcal {F}, \mathbb {P})\) be the underlying probability space supporting the sequences \((\epsilon _n)_{n \ge 1}\) and \((\eta _n)_{n \ge 1}\) and the random variables \({\varvec{\alpha }}\) and \({\varvec{\beta }}\). By \(\mathcal {F}_n\) we denote the observable history up to and including step n, i.e., \(\mathcal {F}_n\) is the \(\sigma \)-algebra

$$\begin{aligned} \mathcal {F}_n = \sigma ( h_1, \ldots , h_{n}, t_1, \ldots , t_{n} ), \quad n \ge 1, \end{aligned}$$

with \(\mathcal {F}_0 = \{\emptyset , \Omega \}\). The choice of hyperparameters \(u_n\) must be based only on the observable history, i.e., \(u_n\) must be \(\mathcal {F}_n\)-measurable for any \(n \ge 0\). The variable \(\tau \) which governs the termination of training must be an \((\mathcal {F}_n)\)-stopping time, i.e., \(\{\tau =n\} \in \mathcal {F}_n\) for \(n \ge 0\)—the decision whether to finish training is based on the past and present observations only. The difficulty lies in the fact that observations combine evaluations of the true expected score function and the running time function with errors, c.f., (2.5). This places us in the framework of stochastic control with partial observation (Bertsekas and Shreve 1996, Chapter 10).

Denote \({\varvec{\phi }}(u) = (\phi _1(u), \ldots , \phi _J(u))^T\) and \({\varvec{\psi }}(u) = (\psi _1(u), \ldots , \psi _K(u))^T\). Equation (2.5) can be written in a vector notation as

$$\begin{aligned} \begin{aligned} h_{n+1}&= {\varvec{\alpha }}^T {\varvec{\phi }}(u_n) + \sigma _H(u_n) \epsilon _{n+1},\\ t_{n+1}&= {\varvec{\beta }}^T {\varvec{\psi }}(u_n) + \sigma _T(u_n) \eta _{n+1}. \end{aligned} \end{aligned}$$
(2.8)

We will use this notation throughout the paper.

The power of the theory of Markov decision processes is in its tractability: an optimal control can be written in terms of the current state of the system. However, when the system is not fully observed, there is an obvious benefit in taking all past observations into account to determine possible values of unobservable elements of the system. Indeed, due to the form of observations (2.5)–(2.4), one can infer much more about \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) when taking into account all available past readings of model scores and training costs. It turns out that optimal control problems of the form studied in this paper can be rewritten as equivalent systems with a bigger state space but with full observation, for which classical dynamic programming can be applied.

Before we proceed with this programme, we state standing assumptions:

(A):

Basis functions \((\phi _j)_{j=1}^J\) and \((\psi _k)_{k=1}^K\) are bounded on \(\mathcal {U}\).

(S):

Observation error variances are bounded and separated from zero, i.e.

$$\begin{aligned}&\inf _{u \in \mathcal {U}} \sigma _H(u)> 0, \quad \inf _{u \in \mathcal {U}} \sigma _T(u) > 0,\\&\sup _{u \in \mathcal {U}} \sigma _H(u) + \sigma _T(u) < \infty . \end{aligned}$$

It is not required for the validity of mathematical results that basis functions are orthogonal in a specific sense or even linearly independent. However, linear independence is required for identification purposes and it speeds up learning.

2.3 Separation principle

Denote by \({\mathcal {A}}_n\) the conditional distribution of \({\varvec{\alpha }}\) given \(\mathcal {F}_n\). It follows from an application of the discrete-time Kalman filter (Bensoussan 2018, Sect. 4.7) that \({\mathcal {A}}_n\) is Gaussian with the mean \(\mu ^\alpha _n\) and the covariance matrix \(\Sigma ^\alpha _n\) given by the following recursive formulas:

$$\begin{aligned} \left( \Sigma ^\alpha _{n+1}\right) ^{-1}&= \left( \Sigma ^\alpha _n\right) ^{-1} + \frac{1}{\sigma _H(u_n)^2} {\varvec{\phi }}(u_n) {\varvec{\phi }}(u_n)^T,\\ \mu ^\alpha _{n+1}&= \mu ^\alpha _n + \frac{1}{\sigma _H(u_n)^2} \Sigma ^\alpha _{n+1} {\varvec{\phi }}(u_n) \big ( h_{n+1} - {\varvec{\phi }}(u_n)^T \mu ^\alpha _n \big ). \end{aligned}$$
(2.9)

Denote by \({\mathcal {B}}_n\) the conditional distribution of \({\varvec{\beta }}\) given \(\mathcal {F}_n\). By the same arguments as above, it is also Gaussian with the mean \(\mu ^\beta _n\) and the covariance matrix \(\Sigma ^\beta _n\) given by the following recursive formulas:

$$\begin{aligned} \left( \Sigma ^\beta _{n+1}\right) ^{-1}&= \left( \Sigma ^\beta _n\right) ^{-1} + \frac{1}{\sigma _T(u_n)^2} {\varvec{\psi }}(u_n) {\varvec{\psi }}(u_n)^T,\\ \mu ^\beta _{n+1}&= \mu ^\beta _n + \frac{1}{\sigma _T(u_n)^2} \Sigma ^\beta _{n+1} {\varvec{\psi }}(u_n) \big ( t_{n+1} - {\varvec{\psi }}(u_n)^T \mu ^\beta _n \big ). \end{aligned}$$
(2.10)
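A compact sketch of one such update (an illustration; the basis, prior and noise level below are assumptions of the example, not the paper's settings) is:

```python
import numpy as np

def kalman_step(mu, Sigma, basis_u, obs, sigma_obs):
    """One update of a Gaussian posterior N(mu, Sigma) for the basis-function
    coefficients after observing obs = coeffs @ basis_u + sigma_obs * noise."""
    prec_new = np.linalg.inv(Sigma) + np.outer(basis_u, basis_u) / sigma_obs**2
    Sigma_new = np.linalg.inv(prec_new)
    mu_new = mu + Sigma_new @ basis_u * (obs - basis_u @ mu) / sigma_obs**2
    return mu_new, Sigma_new

# Example: update the score posterior after observing h_1 = 0.7 at u_0 = 0.4.
phi = lambda u: np.array([1.0, u, u**2, u**3])
mu_a, Sig_a = np.zeros(4), 0.5 * np.eye(4)
mu_a, Sig_a = kalman_step(mu_a, Sig_a, phi(0.4), obs=0.7, sigma_obs=0.05)
print(mu_a)
```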

Furthermore, it follows from the definition of \(h_{n+1}\) and \(t_{n+1}\) that the conditional distribution of \(h_{n+1}\) given \(\mathcal {F}_n\) and control \(u_n\) is Gaussian

$$\begin{aligned} N\Big ((\mu ^\alpha _{n})^T {\varvec{\phi }}(u_n),\ {\varvec{\phi }}(u_n)^T \Sigma ^\alpha _{n} {\varvec{\phi }}(u_n) + \sigma ^2_H(u_n) \Big ), \end{aligned}$$

and the conditional distribution of \(t_{n+1}\) given \(\mathcal {F}_n\) and control \(u_n\) is also Gaussian

$$\begin{aligned} N\Big ((\mu ^\beta _{n})^T {\varvec{\psi }}(u_n),\ {\varvec{\psi }}(u_n)^T \Sigma ^\beta _{n} {\varvec{\psi }}(u_n) + \sigma ^2_T(u_n) \Big ). \end{aligned}$$

Measure-valued stochastic processes \(({\mathcal {A}}_n)\), \(({\mathcal {B}}_n)\) are adapted to the filtration \((\mathcal {F}_n)\). This would have been of little practical use for us if it were not for the fact that those measure-valued processes take values in the space of Gaussian distributions, so the state is determined by the mean vector and the covariance matrix. It is then clear that the process \((\mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n, t_n)\) is a Markov process (we omit \(h_n\) as it does not appear in the objective function (2.7)). Its dynamics are explicitly given in Appendix A.

The following theorem shows that the optimisation problem can be restated in terms of these variables instead of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\), i.e., in terms of observable quantities only. This reformulates the problem as a fully observed Markov decision problem which can be solved using dynamic programming methods.

Theorem 2.2

Optimisation problem (2.7) is equivalent to

$$\begin{aligned} \sup _{(u_n), \tau } E \Big [(\mu ^\alpha _\tau )^T {\varvec{\phi }}(u_{\tau -1}) - \gamma \sum _{n=1}^\tau (t_n)^+ \Big ]. \end{aligned}$$
(2.11)

The proof of this theorem as well as other results are collected in Appendix C.

2.4 Dynamic programming formulation

By inspecting the dynamics of \((\mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n, t_n)\) in Appendix A, one notices that the dependence of the state at step \(n+1\) on step n is only via the parameters of the filters \(({\mathcal {A}}_n)\), \(({\mathcal {B}}_n)\), which contain sufficient statistics about \((h_i)_{i=1}^n\) and \((t_i)_{i=1}^n\). Furthermore, the dynamics are time-homogeneous, i.e., they do not depend on the step n. Hence, the value function of the optimisation problem (2.11) depends only on 4 parameters:

$$\begin{aligned} V(\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) = \sup _{(u_{n}), \tau \ge 1} E\Big [(\mu ^\alpha _\tau )^T {\varvec{\phi }}(u_{\tau -1}) - \gamma \sum _{n=1}^{\tau } (t_n)^+ \,\Big |\, \mu ^{\alpha }_0 = \mu ^\alpha , \Sigma ^\alpha _0 = \Sigma ^\alpha , \mu ^\beta _0 = \mu ^\beta , \Sigma ^\beta _0 = \Sigma ^\beta \Big ]. \end{aligned}$$
(2.12)

The function V gives the optimal value of the optimisation problem given that the prior distributions of \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) are Gaussian with the given parameters. The expression on the right-hand side is well defined and the value function does not admit the values \(\pm \infty \), see Lemma C.1.

In preparation for stating the dynamic programming principle, define an operator acting on measurable functions \(\varphi (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta )\) as follows

$$\begin{aligned} \mathcal {T} \varphi (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) = \sup _{u \in \mathcal {U}} E \Big [ - \gamma (t_1)^{+} + \max \big ( (\mu ^\alpha _1)^T {\varvec{\phi }}(u), \varphi (\mu ^\alpha _1, \Sigma ^\alpha _1, \mu ^\beta _1, \Sigma ^\beta _1) \big ) \,\Big |\, \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta \Big ], \end{aligned}$$
(2.13)

where the expectation is with respect to \(\mu ^\alpha _1, \Sigma ^\alpha _1, \mu ^\beta _1, \Sigma ^\beta _1, t_1\) conditional on \(\mu ^\alpha _0 = \mu ^\alpha , \Sigma ^\alpha _0 = \Sigma ^\alpha , \mu ^\beta _0 = \mu ^\beta , \Sigma _0^\beta = \Sigma ^\beta \); we will use this shorthand notation whenever it does not lead to ambiguity.

In order to approximate V, consider the value iteration scheme:

$$\begin{aligned} V_1 (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta )&= \sup _{u \in \mathcal {U}} E\left[ - \gamma (t_1)^+ + (\mu ^\alpha _1)^T {\varvec{\phi }}(u) \,\middle |\, \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta \right] ,\\ V_{N+1}(\cdot )&= \mathcal {T} V_N (\cdot ), \quad N \ge 1. \end{aligned}$$
(2.14)

The following theorem asserts that the usual properties of the value iteration scheme hold for our problem. The proof is in Appendix C.

Theorem 2.3

The following statements hold true.

  1.

    The iterative scheme (2.14) is well defined, i.e., \(V_N\) does not take values \(\pm \infty \) and the expression under expectation in \(\mathcal {T} V_N\) is integrable, so that the operator \(\mathcal {T}\) can be applied to \(V_N\).

  2.

    \(V_N\) is the value function of the problem with at most N training steps, i.e.

    $$\begin{aligned}&V_N(\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) = \sup _{(u_{n})_{n \ge 0},\, 1 \le \tau \le N}\\&\quad E\left[ (\mu ^\alpha _{\tau })^T {\varvec{\phi }}(u_{\tau -1}) - \gamma \sum _{n=1}^\tau (t_n)^+ \left| \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta \right. \right] . \end{aligned}$$
  3.

    \(V = \lim _{N \rightarrow \infty } V_N\) and the sequence of functions \((V_N)\) is non-decreasing in N. Furthermore, \(|V| < \infty \).

  4.

    The value function V defined in (2.12) is a fixed point of the operator \(\mathcal {T}\), i.e., satisfies the dynamic programming equation

    $$\begin{aligned} V( \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) = \sup _{u\in \mathcal {U}} E \left[ - \gamma ( t_1)^+ + \max \big ( ( \mu ^\alpha _1)^T {\varvec{\phi }}(u), V( \mu ^\alpha _1, \Sigma ^\alpha _1, \mu ^\beta _1, \Sigma ^\beta _1) \big ) \,\middle |\, \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta \right] .\end{aligned}$$
    (2.15)

Assertion 2 states that \(V_N\) is the value function for the problem with at most N training steps, which suggests what trade-off is made when using \(V_N\) instead of V. Furthermore, \(V_N \le V\) by Assertion 3, which, as we will see in the next section, may affect the stopping criterion. Assertion 4 is the statement of the dynamic programming equation for the infinite horizon stochastic control problem. We will refer to those results in the numerical section of the paper.

Remark 2.4

The expectation of the first term in (2.13) containing only \(t_1\) can be computed in a closed form improving efficiency of numerical calculations:

$$\begin{aligned} \mathcal {T} \varphi (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) = \sup _{u \in \mathcal {U}} \Big [&- \gamma \Upsilon \Big ((\mu ^\beta )^T {\varvec{\psi }}(u),\, {\varvec{\psi }}(u)^T \Sigma ^\beta {\varvec{\psi }}(u) + \sigma ^2_T(u)\Big )\\&+ E \Big [\max \Big ( (\mu ^\alpha _1)^T {\varvec{\phi }}(u),\, \varphi ( \mu ^\alpha _1, \Sigma ^\alpha _1, \mu ^\beta _1, \Sigma ^\beta _1) \Big ) \,\Big |\, \mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta \Big ]\Big ], \end{aligned}$$
(2.16)

where

$$\begin{aligned} \Upsilon (m, s^2) =\frac{s}{\sqrt{2\pi }} e^{-\frac{m^2}{2s^2}} + m \Phi \Big (\frac{m}{s}\Big ), \end{aligned}$$

and \(\Phi \) is the cumulative distribution function of the standard normal distribution. The justification of this formula is in Appendix C.
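Note that \(\Upsilon (m, s^2) = E[(X)^+]\) for \(X \sim N(m, s^2)\), which is exactly the term replacing \(E[(t_1)^+]\) in (2.16). A small numerical sanity check of this formula (a sketch assuming NumPy and SciPy are available; not part of the authors' code) is:

```python
import numpy as np
from scipy.stats import norm

def upsilon(m, s2):
    """Closed form of E[(X)^+] for X ~ N(m, s2)."""
    s = np.sqrt(s2)
    return s / np.sqrt(2 * np.pi) * np.exp(-m**2 / (2 * s2)) + m * norm.cdf(m / s)

# Monte Carlo verification: the two printed numbers should be close.
rng = np.random.default_rng(1)
m, s2 = 0.3, 0.04
samples = rng.normal(m, np.sqrt(s2), 1_000_000)
print(upsilon(m, s2), np.maximum(samples, 0.0).mean())
```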

2.5 Identifying control and stopping

If V were available, then the choice of control \(u_n\) and the decision about stopping in Algorithm 1 would follow from classical results in optimal control theory. Indeed, one would choose

$$\begin{aligned} u_n = \mathop {\mathrm {arg\, max}}\limits _{u \in \mathcal {U}} E \Big [&- \gamma ( t_{n+1})^+ + \max \Big ( ( \mu ^\alpha _{n+1})^T {\varvec{\phi }}(u), V( \mu ^\alpha _{n+1}, \Sigma ^\alpha _{n+1}, \mu ^\beta _{n+1}, \Sigma ^\beta _{n+1}) \Big )\\&\,\Big |\, \mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n, u_n = u \Big ], \end{aligned}$$
(2.17)

and then train and validate the model using hyperparameters \(u_n\). Using the observed score \(h_{n+1}\) and cost \(t_{n+1}\), one would compute \(\mu ^\alpha _{n+1}, \Sigma ^\alpha _{n+1}, \mu ^\beta _{n+1}, \Sigma ^\beta _{n+1}\) via (2.9)–(2.10). If

$$\begin{aligned} \left( \mu ^\alpha _{n+1}\right) ^T {\varvec{\phi }}(u_n) \ge V\left( \mu ^\alpha _{n+1}, \Sigma ^\alpha _{n+1}, \mu ^\beta _{n+1}, \Sigma ^\beta _{n+1}\right) \end{aligned}$$
(2.18)

the algorithm would stop; otherwise it would continue with the choice of \(u_{n+1}\) as in (2.17) and repeat the steps above.

Remark 2.5

A reader familiar with the MDP terminology will notice that \(u_n\) is a function of the state \(\mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n\) at time n and does not itself depend on n, so it is a time-homogeneous feedback policy. The stopping criterion (2.18) is also of a classical form: stop when the continuation value is smaller than the current payoff.

A natural adaptation of this procedure to the case when one only knows \(V_N\) is to replace V by \(V_N\) in both places in the above formulas. We will refer to this ad-hoc approach as the relaxed method. Another possibility is to follow an optimal strategy corresponding to the computed value function, i.e., with the given maximum number of training steps. If one knows \(V_1, \ldots , V_N\), then for \(n = 0, \ldots , N-1\),

$$\begin{aligned} u_n = \mathop {\mathrm {arg\, max}}\limits _{u \in \mathcal {U}} E \Big [&- \gamma ( t_{n+1})^+ + \max \Big ( ( \mu ^\alpha _{n+1})^T {\varvec{\phi }}(u), V_{N-n}( \mu ^\alpha _{n+1}, \Sigma ^\alpha _{n+1}, \mu ^\beta _{n+1}, \Sigma ^\beta _{n+1}) \Big )\\&\,\Big |\, \mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n, u_n = u \Big ], \end{aligned}$$

and stop when

$$\begin{aligned} \left( \mu ^\alpha _{n+1}\right) ^T {\varvec{\phi }}(u_n) \ge V_{N-n}\left( \mu ^\alpha _{n+1}, \Sigma ^\alpha _{n+1}, \mu ^\beta _{n+1}, \Sigma ^\beta _{n+1}\right) , \end{aligned}$$

where, with an abuse of notation, we put \(V_0 \equiv -\infty \), i.e., at time \(n=N\) we always stop (with this definition of \(V_0\), we have \(V_1 = \mathcal {T} V_0\)). We will refer to this approach as the exact method.

The advantage of the exact method is that it has strong underlying mathematical guarantees, but it is limited to N training steps with N chosen a priori. The relaxed method does not impose such constraints a priori, but it may stop too early as the right-hand side of the condition (2.18) is replaced by \(V_N \le V\).

3 Numerical implementation of AutoBCT

In this section, we focus on describing the methodology behind our implementation of AutoBCT. We show how to numerically approximate the iterative scheme (2.14) and how to explicitly account for numerical errors when using estimated \((V_n)_{n \ge 1}\) for the choice of control (hyperparameters) \(u_n\) and for the stopping condition in Algorithm 1. We wish to reiterate that, apart from the on-the-fly method (Sect. 3.5), a numerical approximation of the value functions \(V_n\), \(n \ge 1\), needs to be computed only once and then applied in Algorithm 4 for various datasets and modelling tasks. We demonstrate this in Sect. 4.

Before delving into details, we provide an overview of the numerical implementation. Our main approach relies on the pre-computation of the value functions \(V_n\), \(n = 1, 2, \ldots \). This is discussed in Sects. 3.2 and 3.4. The latter subsection provides details on the design of the set of points on which the value function is computed, taking into account how this value function is subsequently used. The pre-computed value functions form an input to AutoBCT, described in Sect. 3.3, which guides the hyperparameter selection of an external ML model. An implementation of our approach which computes an approximation of the value function on demand, instead of relying on pre-computed value functions, is shown in Sect. 3.5 and called on-the-fly. This algorithm is, however, much more resource intensive at the time of application compared to the negligible cost of running AutoBCT, for which the majority of the computational burden is relegated to the pre-computation stage.

3.1 Notation

Denote by \(\mathcal {X}\) the state space of the process \(x_n = (\mu ^\alpha _n, \Sigma ^\alpha _n, \mu ^\beta _n, \Sigma ^\beta _n)\). For \(x = (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) \in \mathcal {X}\), define

$$\begin{aligned} m^{\alpha }(u, x)&= (\mu ^{\alpha })^T {\varvec{\phi }}(u),\\ s^{\alpha }(u, x)^2&= {\varvec{\phi }}(u)^T \Sigma ^{\alpha }{\varvec{\phi }}(u) + \sigma _H^2(u),\\ m^{\beta }(u, x)&= (\mu ^{\beta })^T {\varvec{\psi }}(u),\\ s^{\beta }(u, x)^2&= {\varvec{\psi }}(u)^T \Sigma ^{\beta }{\varvec{\psi }}(u) + \sigma _T^2(u). \end{aligned}$$
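These predictive moments are straightforward to compute; a direct transcription as a sketch (with phi, psi, sigma_H and sigma_T assumed to be supplied by the user, as in the illustrations above) reads:

```python
import numpy as np

def predictive_moments(u, x, phi, psi, sigma_H, sigma_T):
    """Mean and variance of the next score and cost observations at control u,
    given the current state x = (mu_alpha, Sigma_alpha, mu_beta, Sigma_beta)."""
    mu_a, Sig_a, mu_b, Sig_b = x
    m_alpha = mu_a @ phi(u)
    s2_alpha = phi(u) @ Sig_a @ phi(u) + sigma_H(u) ** 2
    m_beta = mu_b @ psi(u)
    s2_beta = psi(u) @ Sig_b @ psi(u) + sigma_T(u) ** 2
    return m_alpha, s2_alpha, m_beta, s2_beta
```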

Given \(h, t \in \mathbb {R}\) and \(x \in \mathcal {X}\), we write \(\texttt {Kalman}(x, h, t)\) for the output of the Kalman filter (2.9)–(2.10) (with an obvious adjustment to the notation: x is taken to be the state at time n and h, t are the observations of the score \(h_{n+1}\) and the cost \(t_{n+1}\)).

For convenience of implementation we assume that \(\mathcal {U}= [0,1]^p\), which can be achieved by an appropriate normalisation of the control set. We show in Sect. 4 that this assumption is not detrimental even if some of the hyperparameters take discrete values.

3.2 Computation of the value function

We compute the value function via the iterative scheme (2.14). We use ideas of regression Monte Carlo for controlled Markov processes, particularly the regress-later method of Balata and Palczewski (2017), Huré et al. (2018), and approximate the value function \(V_n\), \(n=1, \ldots , N\), learnt over an appropriate choice of states \(\mathcal {C}\subset \mathcal {X}\). The choice of the cloud \(\mathcal {C}\) is crucial for the accuracy of the approximation of the value function at points of interest; it is discussed in Sect. 3.4. We fix a grid of controls \(\mathcal {G}\) for computing an approximation of the Q-value as in practical applications we found it to perform better than a (theoretically supported) random choice of points in the control space \(\mathcal {U}\).

Computation of the value function is presented in two algorithms. Algorithm 2 finds an approximation of the expression inside optimisation in \(\mathcal {T} \varphi (x)\) (see (2.16)), while Algorithm 3 shows how to adapt value iteration approach to the framework of our stochastic optimisation problem. Details follow.

[Algorithm 2]

Algorithm 2 defines a function \(\texttt {Lambda}(x, \varphi ; \mathcal {G}, N_s)\) that computes an approximation to the mapping

$$\begin{aligned} u \mapsto - \gamma \Upsilon \big (m^\beta (u, x),\, s^\beta (u, x)^2\big ) + E \big [\max \big (m^\alpha (u, x_{1}),\, \varphi (x_1) \big ) \,\big |\, x_{0} = x\big ]. \end{aligned}$$
(3.19)

This is done by evaluating the outcome of each control \(u^i\) in \(\mathcal {G}\) (line 1 in Algorithm 2) using Monte Carlo averaging over \(N_s\) samples from the distribution of the state \(x_{1}\) given \(x_0 = x\) and the control \(u^i\); this averaging approximates the expectation. However, the regression in line 7 employs further cross-sectional information to reduce the error in Monte Carlo estimates of this expectation as long as the sampling of \((h_k)\) and \((t_k)\) is independent between different values \(u^i\). This effect has mathematical grounding as a Monte Carlo approximation of orthogonal projections in an appropriate \(L^2\) space if the loss function is quadratic: \(l_p(p, z) = (p-z)^2\), c.f., Tsitsiklis and VanRoy (2001), Balata and Palczewski (2017). In particular, it is valid to use \(N_s=1\), but larger values may result in more efficient computation: one seeks a trade-off between the size of \(\mathcal {G}\) and the size of \(N_s\) to optimise computational time and achieve sufficient accuracy.

The space of functions \(\mathcal {F}_p\) over which the regression in line 7 is performed depends on the choice of a functional approximation. In regression Monte Carlo it is a linear space spanned by a finite number of basis functions. Other approximating families \(\mathcal {F}_p\) include neural networks, splines and regression Random Forests. Although we do not include it explicitly in the notation, the optimisation may further involve regularisation in the form of elastic nets (Zou and Hastie 2005) or regularised neural networks, which often improves the real-world performance.
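The following sketch condenses this routine for a one-dimensional control with a cubic polynomial regression over the grid; kalman_step and upsilon are the illustrative helpers introduced earlier, and all settings are assumptions of the example rather than the configuration used in the paper.

```python
import numpy as np

def Lambda(x, varphi, grid, N_s, gamma, phi, psi, sigma_H, sigma_T,
           kalman_step, upsilon, rng):
    """Approximate the map u -> (3.19) for the state x and continuation value varphi."""
    values = []
    for u in grid:
        mu_a, Sig_a, mu_b, Sig_b = x
        m_a = mu_a @ phi(u)
        s2_a = phi(u) @ Sig_a @ phi(u) + sigma_H(u) ** 2
        m_b = mu_b @ psi(u)
        s2_b = psi(u) @ Sig_b @ psi(u) + sigma_T(u) ** 2
        cont = 0.0
        for _ in range(N_s):
            h = rng.normal(m_a, np.sqrt(s2_a))   # simulated score observation
            t = rng.normal(m_b, np.sqrt(s2_b))   # simulated cost observation
            mu_a1, Sig_a1 = kalman_step(mu_a, Sig_a, phi(u), h, sigma_H(u))
            mu_b1, Sig_b1 = kalman_step(mu_b, Sig_b, psi(u), t, sigma_T(u))
            x1 = (mu_a1, Sig_a1, mu_b1, Sig_b1)
            cont += max(mu_a1 @ phi(u), varphi(x1))
        values.append(-gamma * upsilon(m_b, s2_b) + cont / N_s)
    # cross-sectional regression over the control grid (line 7 of Algorithm 2)
    coeffs = np.polyfit(np.asarray(grid), np.asarray(values), deg=3)
    return lambda u: np.polyval(coeffs, u)
```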

[Algorithm 3]

Algorithm 3 iteratively applies the operator \(\mathcal {T}\) and computes a regression approximation of the value function. The loop in lines 2–5 computes the optimal control and the value for each point from the cloud \(\mathcal {C}\). In line 3, we approximate the mapping

$$\begin{aligned} u \mapsto - \gamma \Upsilon \big (m^\beta (u, x^j),\, s^\beta (u, x^j)^2\big ) + E \big [\max \big (m^\alpha (u, x_{n+1}),\, \widetilde{V}_n( x_{n+1}) \big ) \,\big |\, x_{n} = x^j\big ], \end{aligned}$$

using Algorithm 2. The regression in line 6 uses pairs consisting of the state \(x^j\) and the corresponding optimal value \(q^j\) to fit an approximation \(\widetilde{V}_{n+1}\) of the value function \(V_{n+1}\). This approximation is needed in the following iteration of the loop (in line 3) and as the output of the algorithm, to be used in the application of AutoBCT to a particular data problem. The family of functions \(\mathcal {F}_\varphi \) (as \(\mathcal {F}_p\)) includes linear combinations of basis functions (for regression Monte Carlo), neural networks or regression Random Forests. Importantly, the dimensionality of the state x is much larger than that of the control u itself, so efficient approximation of the value function requires more intricate methods than in the approximation \(\texttt {Lambda}(x, \varphi ; \mathcal {G}, N_s)\) of the q-value. As in Algorithm 2, some regularisation may benefit real-world performance. The loss function is commonly the squared error \(l_\varphi (q, z) = (q-z)^2\), but others can also be used (particularly when using neural networks) at the cost of a more difficult, or no, justification of convergence of the whole numerical procedure.
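A condensed sketch of this value-iteration loop, using a regression Random Forest (one of the families \(\mathcal {F}_\varphi \) mentioned above) and an illustrative state_features map (e.g., stacked mean vectors and covariance entries), could look as follows; Lambda_fn stands for the Lambda routine sketched earlier with all arguments other than the state and \(\varphi \) fixed. All names and settings are assumptions of the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def value_iteration(cloud, n_iter, grid, Lambda_fn, state_features):
    # V_0 = -infinity encodes "always stop after the last allowed step".
    V_tilde = lambda x: -np.inf
    for _ in range(n_iter):
        X, q = [], []
        for x in cloud:
            lam = Lambda_fn(x, V_tilde)           # q-value map u -> Lambda(u; x, V_tilde)
            q.append(max(lam(u) for u in grid))   # optimal value at this cloud point
            X.append(state_features(x))
        # regression fit of V~_{n+1} over the cloud (line 6 of Algorithm 3)
        reg = RandomForestRegressor(n_estimators=200).fit(np.array(X), np.array(q))
        V_tilde = lambda x, reg=reg: float(reg.predict(np.atleast_2d(state_features(x)))[0])
    return V_tilde
```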

3.3 AutoBCT

The core part of the AutoBCT is given in Algorithm 4. We use the notation \(\mathcal {M}\) to denote an external Machine Learning (ML) process that outputs score h and cost t for the choice of hyperparameters \(u\in \mathcal {U}\). The ‘raw’ score \(\hat{h}\) and the ‘raw’ cost \(\hat{t}\) obtained from the ML routine are transformed onto the interval [0, 1] with the user-specified map (robustness to a mis-specification of this map is discussed below):

$$\begin{aligned} (\hat{h}, \hat{t}) \mapsto (h, t). \end{aligned}$$

For the cost t we usually use an affine transformation based on a best guess of maximum and minimum values. For the score h the choice depends on the problem at hand: in our practice we have employed affine and affine-log transformations. Mis-specification of these maps, so that the actual transformed values fall outside the interval [0, 1], is accounted for by learning \(\widetilde{V}\) using a cloud \(\mathcal {C}\) containing states that map onto distributions with resulting functions H and T not bounded by [0, 1] with non-negligible probability. This robustness does not, however, relieve the user from a sensible choice of the above transformations, as significantly bad choices have a tangible effect on the behaviour of the algorithm.
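For example, a simple affine rescaling (with illustrative guesses of the minimum and maximum raw values) might read:

```python
def to_unit_interval(raw, lo, hi):
    """Affine map of a raw quantity onto (roughly) [0, 1] given guessed bounds."""
    return (raw - lo) / (hi - lo)

h = to_unit_interval(0.83, lo=0.5, hi=1.0)   # raw accuracy -> score in [0, 1]
t = to_unit_interval(12.0, lo=0.0, hi=60.0)  # raw seconds  -> cost in [0, 1]
```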

Remark 3.1

The transformation of the raw scores and times into values in (roughly) the interval [0, 1] is necessitated by the model setup, which is not scaling and transformation neutral. The observation errors \(\sigma _H\) and \(\sigma _T\) are fixed functions, so their effect depends on the ranges of values of the score and cost input into the model. Scaling makes sure that the same set-up of those functions can be used for multiple datasets and ML models. The score and the cost are represented as linear combinations of basis functions, so a change of range and interpretation of raw quantities (particularly the raw score, which does not always need to be the accuracy of a model) would require the selection of a new set of basis functions. If the raw quantities were only transformed with affine transformations, one would need to adjust appropriately the prior distributions, the training cloud (see Sect. 3.4) and the Lagrange multiplier \(\gamma \). All this would require recomputing the value function, the most resource-hungry step of our approach. Instead, we advocate using the same basis functions, the same range of priors, observation errors \(\sigma _H\) and \(\sigma _T\), coefficient \(\gamma \) and, finally, the same value function, but appropriately transforming the raw score and cost. The reader will observe in Sect. 4 that this can be successfully (and naturally) done in many examples.

We present an algorithm for the relaxed method as introduced in Sect. 2.5; an adaptation of Algorithm 4 to the exact method is easy and has been omitted.

The input of Algorithm 4 consists of an approximation \(\widetilde{V}\) of the value function V. An optimal control \(u_{n}\) at step n given state \(x_n\) should be determined by maximisation of the expression (3.19) over \(u \in \mathcal {U}\) putting \(\varphi = \widetilde{V}\) and \(x = x_n\). For each u, this expression compares two possibilities available after applying control u: either stopping or continuing training. These two options correspond to strategically different choices of u; the first one maximises score in the next step while the latter prioritises learning. The learning option is preferred if future rewards (as given by \(\widetilde{V}\)) exceed present expected score. The approximation \(\widetilde{V}\) of V may be burdened by errors which may result in an incorrect decision and drive the algorithm into an unnecessary ‘learning spree’ (the opposite effect of premature stopping is less dangerous as it does not lead to unnecessarily long computation times). To mitigate this effect, we replace (3.19) with

$$\begin{aligned} \Lambda ^\epsilon (u; x, \epsilon , \widetilde{V}) = -\gamma \Upsilon \big (m^{\beta }(u, x),\, s^{\beta }(u, x)^2\big ) + E\Big [\max \big ( m^\alpha (u, x_1),\, (1 - \epsilon )\, \widetilde{V}(x_1) \big )\,\Big |\, x_0 = x\Big ], \end{aligned}$$
(3.20)

for an error adjustment \(\epsilon \ge 0\). Notice that \(\epsilon = 0\) brings back the original formula from (3.19). The approximation of the mapping \(u \mapsto \Lambda ^\epsilon (u; x, \epsilon , \widetilde{V})\) is computed by Algorithm 2 with \(\varphi (x) = (1-\epsilon ) \widetilde{V} (x)\). The grid \(\mathcal {G}'\) and the number of Monte Carlo iterations \(N_s'\) are usually taken much larger than in the value iteration algorithm since one needs only one evaluation at each step instead of the thousands of evaluations needed in Algorithm 3. Furthermore, a more accurate computation of the mapping \(\Lambda ^\epsilon (\cdot ; x, \epsilon , \widetilde{V})\) was shown to improve real-life performance.

The stopping condition in line 9 is an approximation of the condition \(m^{\alpha }(u_{n}, x) < V(x)\). Indeed, using that \(V = \mathcal {T} V\), we could write \(m^{\alpha }(u_{n}, x) < \mathcal {T} V(x)\) and approximate the right-hand side by \(\sup _{u \in \mathcal {U}} \Lambda ^\epsilon (u; x, 0, \widetilde{V})\) with \(\epsilon =0\). Using non-zero \(\epsilon \) is to counteract numerical and approximation errors as discussed above.

The stopping condition \(m^{\alpha }(u_{n}, x) < V(x)\) could also be implemented as \(m^{\alpha }(u_{n}, x) < (1-\epsilon ) \widetilde{V}(x)\). We do not do this for two reasons. Firstly, \(\tilde{\Lambda }(u_{n+1}')\) is a more accurate approximation of the value function V at the point x due to the averaging effect of evaluating \(\mathcal {T} \widetilde{V}\) at the required point x, and with computations done with much higher accuracy stemming from a finer grid \(\mathcal {G}'\) and a larger number of Monte Carlo iterations \(N_s'\). Secondly, this brings the stopping condition in line with the choice of optimal control in line 8, which is also based on \(\tilde{\Lambda }\), and comes at no extra cost.
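A sketch of one decision of the relaxed method, consistent with the description above but not a verbatim transcription of Algorithm 4, is given below; Lambda_fn, kalman and phi denote the illustrative helpers introduced earlier, and eps is the error adjustment of (3.20).

```python
def relaxed_step(x_prev, u_prev, h, t, eps, grid_fine, Lambda_fn, kalman, V_tilde, phi):
    """One decision of the relaxed method after observing (h, t) for control u_prev."""
    x = kalman(x_prev, u_prev, h, t)                        # posterior update after the observation
    lam = Lambda_fn(x, lambda y: (1.0 - eps) * V_tilde(y))  # u -> Lambda^eps(u; x, eps, V~)
    u_next = max(grid_fine, key=lam)                        # candidate control for the next step
    continuation = lam(u_next)                              # approximates the continuation value at x
    stop = x[0] @ phi(u_prev) >= continuation               # m^alpha(u_prev, x) vs. continuation value
    return x, u_next, stop
```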

3.4 States and clouds

Approximations \(\widetilde{V}_n\) in Algorithm 3 are learnt over states in the cloud \(\mathcal {C}\subset \mathcal {X}\), and therefore the quality of these maps depends on the choice of \(\mathcal {C}\). This cloud needs to comprise sufficiently many states encountered over the running of the model in order to provide good approximations for the choice of the control \(u_n\) and for the stopping decision in Algorithm 4. In the course of learning, the norms of the covariance matrices \(\Sigma ^\alpha _n\) and \(\Sigma ^\beta _n\) become smaller (the filter concentrates ever more on the mean), so that the state asymptotically approaches a truth:

Definition 3.2

(Truth) A state \(x = (\mu ^\alpha , \Sigma ^\alpha , \mu ^\beta , \Sigma ^\beta ) \in \mathcal {X}\) is called a truth if \(\Sigma ^\alpha = 0\) and \(\Sigma ^\beta = 0\). The set of all truths in \(\mathcal {X}\) is denoted by \(\mathcal {X}_{\infty }\).

We argue that the learning cloud \(\mathcal {C}\) should contain the following:

  (a) an adequate number of states that are elements of \(\mathcal {X}_{\infty }\),

  (b) states that are suitably close to priors \(x_0\) that one will use in Algorithm 4,

  (c) an adequate number of states ‘lying between’ the above.

A failure to include states which are truths (point a) inhibits the ability of the AutoBCT algorithm to stop adequately. Conversely, a failure to include states which Algorithm 4 encounters in its running (point c) inhibits its ability to continue adequately. A failure to include a sufficient number of states close to the priors \(x_0\) in (b) reduces the coverage of the state space by the states of point (c) and, further, due to extrapolation errors, affects the decisions made in the first step of the algorithm.

We suggest the following approach for construction of the cloud \(\mathcal {C}\). For given \(N_c, K \ge 1\), we repeat \(N_c\) times the steps:

  (1) sample \(\mu ^\alpha \) and \(\mu ^\beta \) from appropriate distributions, e.g., multivariate Gaussian,

  (2) sample \(\Sigma ^\alpha \) and \(\Sigma ^\beta \) from appropriate covariance distributions, e.g., Wishart or LKJ (Lewandowski et al. 2009),

  (3) add the states

      $$\begin{aligned} \Big (\mu ^\alpha , \frac{k}{K} \Sigma ^\alpha , \mu ^\beta , \frac{k}{K} \Sigma ^\beta \Big ), \quad k = 0, \ldots , K, \end{aligned}$$

      to the cloud \(\mathcal {C}\).

This results in a cloud of size \(N_c (K+1)\).

Note that if in steps 1 and 2 one uses distributions with the means equal to possible priors \(x_0\), point (b) above is fulfilled. At step 3, \(k=0\) gives zero covariance matrices, i.e., truths according to the definition above. It also highlights that distributions used in step 1 should be sufficiently dispersed to cover possible truths for models for which AutoBCT will be used.
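A minimal Python sketch of this construction is given below; the prior means, the Wishart degrees of freedom and the state dimensions are illustrative choices rather than the values used in our experiments.

```python
import numpy as np

def build_cloud(n_c, K, dim_alpha, dim_beta, seed=0):
    """Sketch of the cloud construction of Sect. 3.4 (illustrative parameters).

    Means are drawn from Gaussians and covariance matrices from a Wishart
    distribution (sampled as A @ A.T with A having i.i.d. standard normal
    entries).  Each draw is added with covariances shrunk by k/K for
    k = 0, ..., K, so k = 0 yields truths in the sense of Definition 3.2.
    """
    rng = np.random.default_rng(seed)
    cloud = []
    for _ in range(n_c):
        mu_a = rng.normal(0.0, 1.0, size=dim_alpha)
        mu_b = rng.normal(0.0, 1.0, size=dim_beta)
        A = rng.normal(size=(dim_alpha, dim_alpha + 2))
        B = rng.normal(size=(dim_beta, dim_beta + 2))
        sigma_a, sigma_b = A @ A.T, B @ B.T  # Wishart draws
        for k in range(K + 1):
            cloud.append((mu_a, (k / K) * sigma_a, mu_b, (k / K) * sigma_b))
    return cloud  # contains n_c * (K + 1) states

# e.g. cloud = build_cloud(n_c=1000, K=5, dim_alpha=4, dim_beta=4)
```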

The cloud \(\mathcal {C}\) and the basis functions in (2.4) are closely linked. A state in the cloud defines distributions of weights for the basis functions. The ability of the model to represent the required shapes of the score and cost functions depends, to a great extent, on the choice of basis functions. As the dimensionality of the state depends quadratically on the number of basis functions, this number heavily affects both the computation of the approximate value functions \(\widetilde{V}_n\) and the speed of learning in AutoBCT. We have found that polynomial bases work well in the problems studied in Sect. 4, see Appendix E. We used centred monomials up to degree 3 in the case of a one-dimensional hyperparameter, and centred monomials up to degree 4 plus one cross-product when the hyperparameter was two-dimensional.

3.5 On-the-fly

Algorithm 5: On-the-fly (OTF)

The ‘on-the-fly’ method acquires evaluations \(V_{N}(x)\) by performing nested Monte-Carlo integrations of the recurrence (2.14), with \(V_0(x) = \sup _{u \in \mathcal {U}} \varphi (u)\). Specifically, \(\texttt {OTF}(x, N, \varphi ; \mathcal {G}, N_s)\) approximates the q-value \(\mathcal {V}_N (x, \cdot )\) using the recursion

$$\begin{aligned} \mathcal {V}_{1}(x,u)&= -\gamma \Upsilon (m^{\beta }(u, x), s^{\beta }(u, x)^2) + E \left[ \max \left( m^\alpha (u, x_1), \varphi (x_1) \right) \Big | x_0 = x \right] ,\\ \mathcal {V}_{n+1}(x,u)&= -\gamma \Upsilon (m^{\beta }(u, x), s^{\beta }(u, x)^2) + E \left[ \max \left( m^\alpha (u, x_1), \sup _{u' \in \mathcal {U}} \mathcal {V}_{n} (x_1, u') \right) \Big | x_0 = x \right] , \end{aligned}$$
(3.21)

where \(\varphi :\mathcal {X}\rightarrow \mathbb {R}\) is a given initial value as in Algorithm 3. Setting \(\varphi \equiv -\infty \), one can show by induction that \(V_N(x) = \sup _{u \in \mathcal {U}} \mathcal {V}_N (x,u)\), cf. (2.14). Notice also, by comparing to Algorithm 2, that

$$\begin{aligned} \texttt {Lambda}(x, \varphi ; \mathcal {G}, N_s) = \texttt {OTF}(x, 1, \varphi ; \mathcal {G}, N_s). \end{aligned}$$

This motivates two modifications of Algorithm 4. The first one, which we call ‘on-the-fly’ in this paper, replaces calls to \(\texttt {Lambda}\) in lines 1 and 7 by calls to \(\texttt {OTF}(x, N, \varphi ; \mathcal {G}, N_s)\) with \(\varphi \equiv -\infty \). We also increase the size of the grid \(\mathcal {G}\) and the number of Monte Carlo iterations \(N_s\) in the first execution to obtain a finer approximation of \(\mathcal {V}_N\) than of \((\mathcal {V}_n)_{n=1}^{N-1}\), in line with the discussion in Sect. 3.3.

The second modification of Algorithm 4, whose performance is not reported in this paper, is motivated by the accuracy improvement obtained when using \(\texttt {Lambda}(x, \widetilde{V}; \mathcal {G}', N_s')\) instead of \(\widetilde{V}\). In the same manner, calls to \(\texttt {Lambda}\) can be replaced by \(\texttt {OTF}(x, N, \widetilde{V}; \mathcal {G}', N_s')\) with \(N \ge 2\), bringing further improvements but at the cost of a significant increase in computational complexity.

The implementation of \(\texttt {OTF}\) is computationally intensive as it uses multiple nested Monte Carlo integrations and optimisations, with computational complexity exponential in N. Its advantage is that it requires neither a good specification of the cloud \(\mathcal {C}\) nor precomputed value functions.
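For concreteness, the sketch below mirrors the nested Monte Carlo recursion (3.21) in Python; as before, the helper functions are hypothetical stand-ins for the model quantities, and calling it with \(\varphi \equiv -\infty \) recovers \(V_N(x)\). The nested loops make the exponential dependence on N explicit.

```python
def otf(x, N, phi, grid, n_mc, sample_next_state,
        m_alpha, m_beta, s_beta, Upsilon, gamma):
    """Nested Monte Carlo sketch of (3.21); returns sup_u of the depth-N q-value.

    Each level draws n_mc posterior states and recurses on every one of them,
    so the cost grows exponentially in N.  Helper functions are assumed, as in
    the sketch for (3.20).
    """
    def q_value(state, u, n):
        penalty = -gamma * Upsilon(m_beta(u, state), s_beta(u, state) ** 2)
        total = 0.0
        for _ in range(n_mc):
            x1 = sample_next_state(u, state)
            if n == 1:
                continuation = phi(x1)
            else:
                continuation = max(q_value(x1, v, n - 1) for v in grid)
            total += max(m_alpha(u, x1), continuation)
        return penalty + total / n_mc

    return max(q_value(x, u, N) for u in grid)

# phi = lambda _x: float("-inf")   # corresponds to the choice phi = -infinity
```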

4 Validation

In this section we demonstrate applications of AutoBCT to a synthetic example (to visualise its actions and performance) and to a number of real datasets and models. Detailed information about the AutoBCT settings for each example is collected in Appendix E. Details of the composition of training and testing sets are listed in Appendix D. We emphasise that our objective is to find good hyperparameters within a short computational time. As explained in the introduction, this computational time is meant to be significantly shorter than the time required to explore the hyperparameter space sufficiently well to find a (near) optimum. The reader should, therefore, not be surprised to see the algorithm stop after only a few steps. For comparison, for each real dataset we provide the best performance known to us, reported either in a published paper or, more often, in the results of a modelling competition. We also provide a comparison of AutoBCT to Hyperopt in Appendix F.

All results were obtained using a desktop computer with Intel(R) Core(TM) i7-6700K CPU (4.4 GHz), 32 GB RAM and GeForce RTX 3090 GPU (24 GB). In the synthetic example, Sect. 4.1, we artificially slowed down computations in order to be able to observe differences in running times between hyperparameter choices.

For the computation of the 1D and 2D value function maps used in the VI/Map algorithm, we employ a cloud \(\mathcal {C}\) following the general guidelines described in Sect. 3.4. In addition, we enrich the cloud by first generating a small set of 100 realistic unimodal shapes for the score and cost functions, which we propagate via the Kalman filter using a predefined control grid up to a certain depth D. In the 1D case we take depth \(D=3\) and the cloud \(\mathcal {C}\) contains 156,000 states of which 1000 are truths (see Def. 3.2). In the 2D case, the value function is built using a cloud \(\mathcal {C}\) extended by the above procedure with depth \(D=10\); it contains 761,600 states of which 1000 are truths.

In VI/Map Algorithm 3 the regression in line 6 is performed using Random Forest Regression (RFR) with 100 trees. Functions \(\sigma _H\) and \(\sigma _T\) are taken as constant (independent of the hyperparameters u).

Raw scores and costs are transformed as explained in Sect. 3.3 and Remark 3.1. We emphasise that in all datasets and modelling tasks below (apart from the on-the-fly results presented for comparison), we used the same two pre-computed value functions for 1D problems (with two different values of \(\sigma _H\)) and the same value function for 2D problems. In particular, we used the same \(\gamma =0.16\). This is motivated by the fact that the computation of the value function is the most resource-intensive task, which one would like to perform only once. The execution time of each step of AutoBCT is dominated by running the external ML procedure, cf. Alg. 4. Conclusions on the robustness of this approach are drawn at the end of Sect. 4.

4.1 A synthetic example

We illustrate the steps of the AutoBCT algorithm on a synthetic prediction problem. To make the results easier to interpret visually, we consider a one-dimensional hyperparameter selection problem where we control the number of trees in a random forest (RF) model and take scaled accuracy as the score. The observed accuracy \(\hat{h}\) is mapped to h by an affine mapping that sends 50% model accuracy to score 0 and 100% to 1, while the real running time \(\hat{t}\) is mapped to t by an affine mapping that sends 0 s to 0 and 0.6 s to 1:

$$\begin{aligned} h= \frac{\hat{h} - 0.5}{0.5}, \quad t= \frac{\hat{t}}{0.6}. \end{aligned}$$

The task at hand is to predict outputs of a function

$$\begin{aligned} f(x,y) := \big ( \lfloor ax + 1\rfloor + \lfloor by + 1\rfloor \big ) \bmod 2, \qquad (x,y) \in [0,1]\times [0,1], \end{aligned}$$

with \(a=b=10\) (Fig. 1), where \(\lfloor \cdot \rfloor \) is the floor function: \(\lfloor x \rfloor = \max \{m \in \mathbb {Z}\;|\; m \le x\}\).

Fig. 1 Function to be predicted in the synthetic example of Sect. 4.1

We initialise our synthetic experiment with a prior on H(u) that is relatively flat but has a large enough spread to cover well the range (0, 1) (see the left panel of Fig. 2). For the cost function T(u) we choose a pessimistic prior (see the right panel of Fig. 2), which reflects our conservative belief that the cost is exponentially increasing as a function of the number of trees \(n_{\max }\) (in practice, it is usually linear). We fix \(\gamma =0.16\) and \(\epsilon = 0\) and set the observation errors \(\sigma _H \equiv 0.05\) and \(\sigma _T \equiv 0.1\) (the standard deviation of the observation error for the score and the cost is assumed to be 5% and 10%, respectively, independently of the choice of the hyperparameter). We relate the control \(u\in [0,1]\) to the number of trees \(n_{\text {trees}}\) via the mapping:

$$\begin{aligned} n_{\text {trees}}(u) = \lfloor 1 + 99u \rfloor . \end{aligned}$$
(4.22)

Thus, the smallest considered random forest has \(n_{\text {trees}}(0)=1\) tree and the largest has \(n_{\text {trees}}(1)=100\) trees.
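For illustration, one observation of the scaled score and cost in this synthetic example could be generated as in the sketch below; scikit-learn's random forest is used here merely as a stand-in for our implementation, and the data generation follows the chequerboard function f and the affine scalings introduced above.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def evaluate_control(u, X_train, y_train, X_val, y_val):
    """One score/cost observation for the synthetic example (a sketch)."""
    n_trees = int(np.floor(1 + 99 * u))          # mapping (4.22)
    start = time.perf_counter()
    model = RandomForestClassifier(n_estimators=n_trees).fit(X_train, y_train)
    accuracy = model.score(X_val, y_val)
    elapsed = time.perf_counter() - start
    h = (accuracy - 0.5) / 0.5                   # 50% accuracy -> 0, 100% -> 1
    t = elapsed / 0.6                            # 0.6 s -> 1
    return h, t

# Chequerboard data f(x, y) with a = b = 10; 30,000 training and 20,000 validation points
rng = np.random.default_rng(0)
pts = rng.uniform(size=(50_000, 2))
labels = ((np.floor(10 * pts[:, 0] + 1) + np.floor(10 * pts[:, 1] + 1)) % 2).astype(int)
h, t = evaluate_control(0.74, pts[:30_000], labels[:30_000],
                        pts[30_000:], labels[30_000:])
```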

Fig. 2 Score and cost priors for the synthetic example

Fig. 3 One run for the synthetic classification example using AutoBCT (VI/Map \(N=2\), \(\epsilon =0\))

We present below the performance of two algorithms: VI/Map, Algorithm 3 (\(N_s = 100\) with \(N=2\) and \(N=4\)), and on-the-fly (OTF), Algorithm 5 (\(N_s = 1000\) with \(N=2\)). We set \(\mathcal {G}= \{0, 0.01, 0.02, \ldots , 1\}\). The training set comprises 30,000 points and the validation set contains 20,000 points. In both sets, the points were randomly drawn from a uniform distribution on the square \([0,1]^2\).

4.1.1 VI/Map results

In this example, we use the 1D map described at the beginning of Sect. 4. We visualise results for the map produced with \(N=2\) (recall that N denotes the depth of recursion in the computation of the value function) and \(\epsilon = 0\). We also include a summary of results for the \(N=4\), \(\epsilon = 0\) map (Table 2) and show that they do not differ significantly from the \(N=2\) case.

One run of the VI/Map (\(N=2\), \(\epsilon =0\)) algorithm together with the evolution of the posterior distributions is displayed in Fig. 3, with an accompanying summary in Table 1 which shows the changes of important quantities across iterations. Notice that AutoBCT stopped at iteration 3 with the control \(u = 0.74\) (corresponding to \(n_{\text {trees}}= 74\)) and the expected posterior score and the realised score of \(\approx 0.98\) (accuracy 99%). Recall that the observed accuracy \(h_n\) is burdened with a random error due to the randomness involved in training the random forest and in validating the trained model. The closeness of the expected posterior score and the realised score at the time of stopping is not a rule and should not be expected in general (see further examples in this section).

Table 1 AutoBCT (VI/Map, \(N=2\), \(\epsilon = 0\)) Results of one run for the synthetic example

We note that VI/Map with \(N=4\) produces a very similar run (Table 2), so the additional expense of computing the depth \(N=4\) map is not necessary for this example.

Non-zero \(\varvec{\epsilon }\) value. We kept the artificial damping parameter \(\epsilon \), which accounts for errors in the value function map, equal to zero as the quality of the map was very good. It can also be observed in Tables 1 and 2 that any reasonable value of \(\epsilon \) would not change the course of training.

4.1.2 On-the-fly results

For the OTF (\(N=2\)) version of the algorithm we provide a summary of one run in Table 3. AutoBCT stopped at iteration 3 with the control \(u = 0.76\) (corresponding to \(n_{\text {trees}}= 76\)) and the expected posterior score of 0.99 (accuracy 99.5%) and realised score of 0.98 (accuracy 99%). We observe that VI/Map results are very similar to the OTF results. In particular, both methods completed training in 3 iterations and the difference between final controls is very small.

Table 2 AutoBCT (VI/Map, \(N=4\), \(\epsilon = 0\)) results of one run for the synthetic example
Table 3 AutoBCT (OTF, \(\hbox {N}=2\)) results of one run for the synthetic example

4.1.3 Sensitivity analysis

We used the on-the-fly algorithm to explore the dependence of the results on the choice of the standard deviation \(\sigma _H\) of the score observations and on the Lagrange multiplier \(\gamma \). Unlike the rest of the numerical results in this paper, the sensitivity analysis was performed on the University of Leeds HPC cluster using 4 cores of an Intel Xeon Gold 6138 CPU, so costs should not be compared to those shown earlier in this section. Tables 4 and 5 display the mean and standard deviation (in brackets) from 40 independent runs. All other parameters were the same as above. Notice that the final score h is close to 1 for all parameter settings due to the relative simplicity of the problem. We therefore concentrate the analysis on the total cost and the final control.

Table 4 AutoBCT (OTF, \(\hbox {N}=2\)) Sensitivity to the standard deviation \(\sigma _H\) of score observations
Table 5 AutoBCT (OTF, \(\hbox {N}=2\)) Sensitivity to the Lagrange multiplier \(\gamma \)

Assuming too low a value of \(\sigma _H\) leads to unstable results (see Table 4): the total cost has a standard deviation of 1.72, meaning that some runs took very long to terminate. Similarly, the final control has a standard deviation of 0.22. For all higher values of \(\sigma _H\), the standard deviation of the total cost is around 0.5 and that of the final control around 0.1. The large value \(\sigma _H = 0.2\) results in longer running times due to the additional observations needed to obtain a sufficiently precise estimate of the relationship between hyperparameters and model score, but the final control is stable. An excessive choice of \(\sigma _H = 0.4\) destabilises the algorithm, resulting in an unstable final score, control and total cost.

The Lagrange multiplier \(\gamma \) governs the trade-off between the model score and the total running time. The total cost, as displayed in Table 5, decreases with \(\gamma \). So does the final score of the trained model, with its stability also suffering slightly (the standard deviation increases from 0.01 to 0.02). The final control u decreases with \(\gamma \). This is because the cost is increasing in u, so a larger \(\gamma \) incentivises the use of lower values of u. Importantly, the optimisation problem is relatively robust to the specification of parameters and the observed behaviour is consistent with intuitive expectations.

In the next section, we proceed to deploy the AutoBCT algorithm on real data, using convolutional neural networks.

4.2 Convolutional neural network

In this section, we focus on the computationally intensive problem of classifying breast cancer specimen images for Invasive Ductal Carcinoma (IDC). IDC is the most common subtype of all breast cancers. The data (Janowczyk and Madabhushi 2016; Cruz-Roa et al. 2014) contain 277,524 image patches of size \(50\times 50\), of which 198,738 are IDC negative and 78,786 IDC positive.

We build a classification model using a convolutional neural network (CNN). We optimise the following hyperparameters: batch (batch size) and r (learning rate). The CNN is run twice (2 epochs) over a given set of training data and is evaluated on holdout test data. Since we run the CNN for exactly 2 epochs (and not until a certain accuracy threshold is reached), the choice of the learning rate r has no effect on the running time. This demonstrates that AutoBCT is also able to deal with cases where the cost does not depend on some of the optimised hyperparameters. The architecture of the CNN is shown in Fig. 4 and described in “Appendix 1”.

Fig. 4 CNN architecture, where the input is a \(50\times 50\) tissue image with 3 colour channels and the scaling parameter \(s=1\) affects the number of filters in the convolution layers and the sizes of the flattened, densely connected arrays. Detailed information about the architecture of the CNN can be found in “Appendix 1”

A naive tuning of the CNN over a two-dimensional predefined grid of hyperparameters would be computationally expensive. In Sects. 4.2.3 and 4.2.4 we present results of 2D AutoBCT (VI/Map) and AutoBCT (OTF), respectively, where nearly optimal controls are obtained in relatively few iterations. However, we start by discussing 1D VI/Map results on batch and r separately. Details about the priors and parameter settings are collected in Appendix E. A comparison to Hyperopt is performed in Appendix F.

4.2.1 VI/Map CNN batch

Due to limited computational power, we reduce the CNN computation time by working on a subset of the data. The training set consists of 5000 images with a 50/50 split between the two categories, and the testing set is also balanced and contains 1000 images. We fix the learning rate \(r=0.0007\).

We use the \(N=2\) map with \(\sigma _H = 0.15\), \(\sigma _T = 0.1\) and \(\gamma =0.16\). For this example, as well as for others involving a neural network, we choose a relatively high \(\sigma _H = 0.15\) to account for the larger randomness in training for some choices of hyperparameters. For the sake of illustration, we allow large batch sizes (relative to the size of the data set), which results in unstable model training. To mitigate the resulting significant variability in observed accuracy and to enable interesting observations of the algorithm's performance, for each u the neural network is trained 5 times and the median accuracy is output together with the total accumulated cost of training and testing the 5 neural networks. We map the control \(u \in [0,1]\) to the batch size in the following way:

$$\begin{aligned} \textit{batch}(u) = \lfloor \textit{batch}_{\min } + (\textit{batch}_{\max }-\textit{batch}_{\min })\cdot u \rfloor , \end{aligned}$$
(4.23)

which maps 0 to \(\textit{batch}_{\min }=10\) and 1 to \(\textit{batch}_{\max }=200\) linearly. The accuracy and the running time (in minutes) are mapped into score and cost as follows

$$\begin{aligned} h:= \frac{\hat{h} - 0.45}{0.35}, \quad t:= \frac{\hat{t}}{7.5}. \end{aligned}$$
(4.24)

Results for this problem, dubbed 1D CNN batch, can be inspected in Table 6. We also visualise the final posterior distributions of \(H(\cdot )\) and \(T(\cdot )\) in Fig. 5. The algorithm returned the control corresponding to a batch size of 89 with a realised posterior score of 0.81 (corresponding to accuracy 73%) and an expected posterior score of 0.83 (accuracy 74%).

Table 6 AutoBCT (VI/Map, \(\hbox {N}=2\), \(\epsilon =0\)) results on 1D CNN example with batch as control

4.2.2 VI/Map CNN r

Here we explore another 1D case using the learning rate r as the control. We fix the batch size at 66. It is known that the learning rate does not influence the computational cost of a neural net trained over a fixed number of epochs. However, by fixing the batch size at 66 we make each run of the neural net relatively expensive in the context of these examples (\(\approx 2.7\) min). Therefore, our goal is still to obtain an optimal control in as few iterations as possible, to save computational resources.

We keep the same mappings for h and t as above, and the same \(\sigma _H = 0.15\), \(\sigma _T = 0.1\), \(\gamma =0.16\). Using common knowledge about the effect of the learning rate on accuracy and running time, we modify the prior for \(H(\cdot )\) so that the mean is unimodal with a maximum in the interior of (0, 1), and the prior for \(T(\cdot )\) is flat (Fig. 6a). The mapping of the control into the learning rate is linear on the \(\log (r)\) scale, i.e., we set

$$\begin{aligned} r(u) = \exp \left\{ \log (r_{\min }) + \left( \log (r_{\max }) - \log (r_{\min })\right) \cdot u\right\} , \end{aligned}$$
(4.25)

where \(r_{\min }=10^{-5}\) and \(r_{\max }=0.1\).
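The two control mappings (4.23) and (4.25) are straightforward to implement; the sketch below reproduces, for instance, the values \(r=0.001\) and batch size 38 corresponding to the controls \((0.5, 0.15)\) reported in Sect. 4.2.3.

```python
import numpy as np

def batch_of(u, batch_min=10, batch_max=200):
    """Mapping (4.23): linear in u, floored to an integer batch size."""
    return int(np.floor(batch_min + (batch_max - batch_min) * u))

def lr_of(u, r_min=1e-5, r_max=0.1):
    """Mapping (4.25): linear in u on the log(r) scale."""
    return float(np.exp(np.log(r_min) + (np.log(r_max) - np.log(r_min)) * u))

print(lr_of(0.5), batch_of(0.15))   # approximately 0.001 and 38
```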

Fig. 5 Visualisation of the final posterior distribution of the 1D CNN batch example

Fig. 6 Prior and posterior distributions for the 1D CNN r example

In the run displayed in Table 7, the final control corresponds to \(r = 0.0005\) with an observed score of \(h_5 = 0.794\) (corresponding to accuracy 73%) and an expected posterior score of \((\mu ^{\alpha }_{3})^T{\varvec{\phi }}(0.43) = 0.445\) (accuracy 61%). The final posterior distributions, with the observed scores and costs indicated by crosses, are shown in Fig. 6b. The final output of the algorithm is good although the posterior distribution in the left panel fits the observed data poorly. The shape of the score curve is so peaked that it cannot be represented accurately by a degree-three polynomial (our choice of basis functions). However, AutoBCT stops when the posterior mean score (not the observed score) exceeds the value function, so it is sufficient that the posterior distribution indicates the position of the hyperparameters for which the true maximum score is attained.

Table 7 AutoBCT (VI/Map, \(\hbox {N}=2\), \(\epsilon =0\)) results on 1D CNN example with r as control

4.2.3 2D AutoBCT (VI/Map) with CNN

We initialise our 2D CNN experiment with a prior on the score \(H(u_1,u_2)\) that is unimodal (left panel of Fig. 7a). For the cost function \(T(u_1,u_2)\) we choose a pessimistic prior (left panel of Fig. 7b), which reflects a conservative belief that the cost is exponentially decreasing in the batch size and indifferent to the choice of the learning rate. As before, for each pair \((u_1,u_2)\) the neural network is trained 5 times to mitigate instabilities with large batches. We map the median raw accuracy \(\hat{h}\) and the raw time \(\hat{t}\) (in minutes) via (4.24). We map the control \(u_1\) into the learning rate via (4.25) with \(r_{\min }=10^{-5}\) and \(r_{\max }=0.1\). The second control, \(u_2\), is mapped into the batch size via (4.23) with \(\textit{batch}_{\min } = 10\) and \(\textit{batch}_{\max } = 200\). These mappings are identical to those used in the 1D cases studied above.

Fig. 7 Prior and posterior distributions for a 2D CNN run from Table 8

Results of an example run of 2D AutoBCT (VI/Map) can be inspected in Table 8. The final control \((u_1,u_2)=(0.5,0.15)\) corresponds to \(r=0.001\) and a batch size of 38, with an expected posterior accuracy of 74% and a realised accuracy of 78% (compared to the accuracy of 84% obtained in Cruz-Roa et al. 2014, which used the full dataset in computations). The surfaces showing the means of the posterior distributions for H and T are displayed in the right panels of Fig. 7.

Table 8 An example run of AutoBCT on CNN example with the 2D control space

Table 9 presents a summary of the average performance of 40 runs on this 2D neural network problem. Over the 40 VI/Map runs, we recorded a small standard deviation of 0.06 for the final control \(u_{1,n}\) (r), suggesting that the quality of the model is sensitive to the choice of the learning rate. On the other hand, the standard deviation of the final control \(u_{2,n}\) (batch) is much larger, suggesting weaker sensitivity to this parameter. It should, however, be remarked that the training of neural networks tends to be less stable when the batch size is large relative to the total sample size, so good models could have been obtained ‘by chance’.

Table 9 Summary results of the average of 40 runs (standard deviation is given in brackets) for the 2D CNN problem

4.2.4 2D AutoBCT (OTF) with CNN

The OTF version of AutoBCT on the 2D CNN problem is significantly affected by the relatively high noise present in the observations of the score, and we observed its tendency to stop more unpredictably. We present summaries of three OTF runs (Table 10). Care should be taken when comparing these results to the ones obtained by AutoBCT (VI/Map) because, due to computational demands, only the \(N=2\) version of OTF is run, while for the VI/Map version we use the \(N=3\) map with 1 OTF step at the end (as always for VI/Map runs, see Algorithm 2), essentially making it \(N=4\) for the comparison with OTF. It can, therefore, be concluded that depth \(N=2\) is insufficient to obtain reliable results in the 2D case and one needs to use depth \(N=4\). Recall that the \(N=2\) version of OTF performed well in the synthetic 1D example.

Table 10 Final results of 3 OTF \(\hbox {N}=2\) runs with 2D CNN problem

4.3 AutoBCT validation on popular datasets

We provide summary statistics for the evaluation of AutoBCT's performance on 3 popular datasets employing 3 different machine learning models. In the tables, we provide in brackets: the accuracy for the column Final \(h_n\), the total running time for the column \(\sum _{i=1}^n t_i\), and the corresponding value of the hyperparameter for the column u (\(u_1\), \(u_2\)). Recall that details of the parameter mappings and implementation are collected in Appendix E and comparisons to Hyperopt are made in Appendix F.

HIGGS Data Set (Baldi et al. 2014). This is a classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not. The data set contains 11 million rows. Training over the whole dataset is time consuming, so we aim to find an optimal sample size to be used for training and testing with a Random Forest (\(\texttt {ranger}\)), given the trade-off between the score and the computational cost. Results are collected in Table 11. The final accuracy of 72% puts the model within the TOP 10 performing models according to the openml.org/t/146606 statistics for the HIGGS data set. The original paper (Baldi et al. 2014) reported a maximum AUC of 0.893 for a deep neural network and 0.81 for boosted decision trees. The AUC for the model from Table 11 is 0.797.

Table 11 Random forest on ‘Higgs’ data set using sample size as a control

Intel image data. This is image data of natural scenes around the world obtained from the Intel Scene Classification Challenge (see footnote 2). The data contain around 25K images of size \(150\times 150\) distributed across 6 categories. The task is to create a classification model. For this example we use the categorical cross-entropy loss (Cox 1958) (\(\mathcal {X}\) - population, x - sample, \(\mathcal {C}\) - set of all classes, c - a class)

$$\begin{aligned} -\frac{1}{N} \sum _{x \in \mathcal {X}} \sum _{c \in \mathcal {C}} {{\mathbb {1}}_{\{x \in c\}}} \log {\mathbb {P}(x \in c)} \end{aligned}$$
(4.26)

to construct a score (instead of the accuracy). The above entropy loss allows us to monitor overfitting of the CNN, while the accuracy itself does not provide this information. In this example we choose the number of epochs of the CNN as the control. We obtain results for two architectures. In one architecture (Resnet50) we use the weights of the widely popular resnet50 model, while in the other (Plain) we use an architecture that does not rely on pretrained models. Details of the architectures of the Plain and Resnet50 models are provided in “Appendix 1”. Results are displayed in Table 12. We note that our Resnet50 model produces a result that is 3% short of the challenge winner, who obtained a model with 96% accuracy via a different architecture and ensemble averaging.
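A minimal implementation of the loss (4.26) from predicted class probabilities could read as follows; how this loss is then rescaled into a score on (roughly) the unit interval is a user choice along the lines of Sect. 3.3 (cf. the parameter mappings collected in Appendix E).

```python
import numpy as np

def categorical_cross_entropy(probs, labels):
    """Loss (4.26): probs has shape (N, C) with predicted class probabilities;
    labels holds the true class index of each of the N observations."""
    n = len(labels)
    return -float(np.mean(np.log(probs[np.arange(n), labels])))
```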

Table 12 CNN on Intel data set using epochs as a control using two architectures

We note that a more efficient optimisation of a parameter such as epochs is available. Having a fixed holdout (test) dataset, we are able to evaluate the performance of the model on the test set after each training step, without introducing any data leakage into the model. Therefore, if the cost of evaluating the model on the holdout set is relatively small compared to the overall training cost per step, observations of the score (h) and the cost (t) are available at each step. The optimiser's task would then simplify to optimally stopping the training when the best trade-off between the score and the cost is achieved. However, the flow of information would be different from that in our model and a new modelling approach would be needed. Furthermore, in a Bayesian setting like ours, it is required that the error of each observation of the score and the cost is independent of the previous ones. This is clearly violated in an optimisation task with incremental improvements of the model. Nevertheless, this is an interesting direction of research which we leave for the future. Note also the similarities to the learning curve methods used by, e.g., Chandrashekaran and Lane (2017).

Credit card fraud data (Pozzolo et al. 2015) (see footnote 3). This dataset contains credit card transactions from September 2013. It is highly unbalanced, with only 492 frauds out of 284,807 transactions. We build an autoencoder (AE) (Kramer 1991) which is trained on non-fraudulent cases. We attempt to capture fraudulent cases by observing the reconstruction error.

We parametrise the network with two parameters: code, which specifies the size of the smallest layer, and scale, which determines the number of layers in the encoder (the decoder is symmetric). The largest layer of the encoder has size \(\lceil 20 \times \text {scale} \rceil \) and there are \(\lceil \textit{scale} \rceil \) layers in the encoder spaced equally between the largest layer and the coding layer (\(\lceil x \rceil \) denotes the ceiling function, i.e., the smallest integer dominating x). So an AE described by code equal to 1 and scale equal to 1.9 has hidden layers of the following sizes: 38, 19, 1, 19, 38. We consider code and scale between 1 and 10.
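The layer sizes can be reconstructed from code and scale as in the sketch below; the exact spacing and rounding rule is our assumption, chosen so that it reproduces the example of code equal to 1 and scale equal to 1.9 given above.

```python
import math

def autoencoder_layers(code, scale):
    """Hidden layer sizes of the AE parametrised by code and scale (a sketch).

    The largest encoder layer has ceil(20 * scale) units, there are
    ceil(scale) encoder layers spaced (approximately) equally between that
    size and the code layer, and the decoder mirrors the encoder.  The
    rounding used here is an assumption.
    """
    n_layers = math.ceil(scale)
    largest = math.ceil(20 * scale)
    step = (largest - code) / n_layers
    encoder = [int(largest - i * step) for i in range(n_layers)]
    return encoder + [code] + encoder[::-1]

print(autoencoder_layers(code=1, scale=1.9))   # [38, 19, 1, 19, 38]
```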

For the 1D example, the optimisation parameter is the scale with \(\textit{code}=2\). For the 2D example, we optimise scale and code.

Given that fraudulent transactions constitute only \(0.172\%\) of all transactions, we do not focus on a pure accuracy of the model, but on a good balance between precision and recall which is weighted together by the widely used generalized F-score (Dice 1945):

$$\begin{aligned} F_{\beta } = (1+\beta ^2) \frac{{\textit{precision}} \times {\textit{recall}}}{(\beta ^2 \times {\textit{precision}}) + {\textit{recall}}}. \end{aligned}$$
(4.27)

We pick \(\beta =6\) as we are interested in capturing most of the fraudulent transactions (we observe that the recall is at least 80%). In practice, \(\beta \) is chosen somewhat arbitrarily; however, it can also be used as an optimisation parameter (Klinger and Friedrich 2009).
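For completeness, a direct implementation of (4.27) is given below; the precision and recall values in the usage line are purely illustrative and are not taken from our experiments.

```python
def f_beta(precision, recall, beta=6.0):
    """Generalised F-score (4.27); beta > 1 weights recall more heavily."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.05, 0.85), 3))   # illustrative inputs
```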

Results for the 1D example are displayed in Table 13. We observe that in this particular 1D example the optimal control of 0.01 corresponds to a shallow AE architecture, suggesting that good performance can be achieved with a small AE architecture.

Table 13 AE on ‘Credit Card Fraud’ dataset using scale as a control

The results for the 2D version of this example are shown in Table 14. AutoBCT decides to stop at the shallowest architecture as well, surprisingly revealing that even choosing a code size equal to 1 leads to similarly good results.

Table 14 AE on ‘Credit Card Fraud’ dataset using scale and code as a controls

Summary of findings. Our validation demonstrated that AutoBCT performs well in hyperparameter selection problems for diverse datasets, machine learning techniques and objectives. The VI/Map version of the algorithm is preferred to OTF. The value function map can be precomputed in advance and used for a range of problems. Indeed, we used only 2 value functions: one for 1D problems and one for 2D problems. This means that the cloud \(\mathcal {C}\) of training points, the Lagrange multiplier \(\gamma \) and the basis functions can be chosen so that the value function is robust and applicable to a wide range of problems. The scaling of the raw score and cost is used to encode a user's expectations about the quality of the model and the cost of its training. In the CNN example (Sect. 4.2), which is a classification problem, it is very easy to obtain an accuracy of around 50%, so the observed accuracy is transformed so that this point is mapped close to 0. The scaling of the running time depends on the computational complexity of the problem and may be adjusted when the cost constrains the algorithm too much and the user is willing to endure longer running times. Note here that observations of the raw score and cost obtained for particular scaling maps can be reused to prime the algorithm (update the prior) for a different scaling, so this computational effort is not wasted. This could potentially lead to an automated scheme for adjusting the scaling, but the exploration of this direction is left for further research.

5 Conclusions

We have designed a new modelling framework for studying hyperparameter selection problems with a total cost constraint where the dependence of the objective function and the cost on the hyperparameters is unknown and must be learnt. The emphasis is on problems where the cost constraint is very restrictive and prevents a thorough exploration of the hyperparameter space. Such problems may arise, for example, in exploratory data analysis when a short response time, potentially at the cost of model quality, is of the essence.

We model the dependence of the score and the cost on the hyperparameters as a linear combination of basis functions. We embed the learning problem in a Bayesian framework with the coefficients of the basis functions being multivariate Gaussian distributed a priori. We design a numerical algorithm to solve the resulting optimal control problem and demonstrate its applicability to hyperparameter optimisation for a number of machine learning problems. At each learning step, the strategy either prescribes terminating training or provides new hyperparameter values to be applied. Those values are selected optimally, taking explicit account of the trade-off between the improvement of the score (learning) and the cost; this information is encoded directly in the value function.

There are notable open problems left for further research. Our numerical results are for one and two-dimensional hyperparameters. A larger number of parameters would require a much bigger computational budget to pre-compute the value function and potentially call for new functional approximators for the value function and the policy and new ideas for (approximate) representation of the state for the optimal control problem; in the two-dimensional setting, the dimensionality of the state of the optimal control problem already reaches 75. The choice of basis functions and prior distributions in multi-dimensional settings requires further research. Finally, the selection of the Lagrange multiplier \(\gamma \), errors of observations \(\sigma _H\) and \(\sigma _T\) and the computation of the score and the cost from observable quantities for a particular machine learning problem needs to be further explored with particular emphasis on meta-learning. It would also be interesting to perform a thorough analysis of benefits of meta-learning for selection of priors.