Automatic model training under restrictive time constraints

We develop a hyperparameter optimisation algorithm, Automated Budget Constrained Training, which balances the quality of a model with the computational cost required to tune it. The relationship between hyperparameters, model quality and computational cost must be learnt, and this learning is incorporated directly into the optimisation problem. At each training epoch, the algorithm decides whether to terminate or continue training and, in the latter case, which values of the hyperparameters to use. This decision optimally weighs potential improvements in quality against the additional training time and the uncertainty about the learnt quantities. The performance of our algorithm is verified on a number of machine learning problems encompassing random forests and neural networks. Our approach is rooted in the theory of Markov decision processes with partial information, and we develop a numerical method to compute the value function and an optimal strategy.


Introduction
The emergence of Machine Learning (ML) methods as an effective and easy-to-use tool for modelling and prediction has opened up new horizons for users in all aspects of business, finance, health, and research [18,7]. In stark contrast to more traditional statistical methods that require relevant expertise, ML methods are largely black boxes and rest on minimal assumptions. This has led to the adoption of ML by an audience of users with possible expertise in the domain of application but little or no expertise in statistics or computer science. Despite these features, effective use of ML does require good knowledge of the method used. The choice of the method (algorithm), the tuning of its hyperparameters and the design of the architecture require experience and a good understanding of those methods and the data. Naturally, trial and error might provide possible solutions; however, it can also result in sub-optimal use of ML and wasted computational resources.
The goal of this research is to find optimal values of hyperparameters for a given dataset and a modelling objective when the relationship between the hyperparameters, model quality (model score) and training cost is not known in advance. This problem is akin to the classical AutoML setup [25] with one crucial difference: if desired, the system has to produce results in radically shorter time than classical AutoML solutions. This comes at the expense of accuracy, so taking the trade-off between the model quality and the running time explicitly into account lies at the heart of this paper. The relationship between hyperparameters, model quality and computational cost for a particular modelling problem and a particular dataset must be learnt, and this learning is incorporated directly into the optimisation problem.

Main Contribution
We propose Automated Budget Constrained Training (AutoBCT), an algorithm based on the theory of Markov decision processes with partially observable dynamics, combining stochastic control, optimal stopping and filtering. In AutoBCT we treat hyperparameters as a control variable and introduce an objective function that balances the model score with the cost, where both are often functions of the accuracy and the running time, respectively. The optimisation problem takes the form
$$ \sup_{\tau,\,(u_n)} \mathbb{E}\Big[ h_\tau - \gamma \sum_{n=1}^{\tau} c_n \Big], \qquad (1.1) $$
where the optimisation is over stopping times $\tau \ge 1$ and control sequences $(u_n)_{n=0}^{\infty}$. Here, $\tau$ is the number of training epochs (a quantity dependent on the course of learning/training), $u_n \in U$ denotes the choice of hyperparameters at epoch $n$ (also dependent on the history), and $U$ is the set of admissible hyperparameters. The quantity $h_n$ is the model score corresponding to the choice of hyperparameters $u_{n-1}$. The score $h_n$ may be random as it depends on the training process of the model (which can itself include randomness, as in the construction of random forests or in the stochastic gradient method for neural networks) and on the choice of training and validation datasets (including K-fold cross-validation). The computational cost $c_n$ also depends on the choice of hyperparameters $u_{n-1}$ and can be random due to random variations in running time commonly observed in practical applications, arising both from the hardware and the operating system's side as well as from random choices in the training process.
In the objective function we consider the expectation of the score at epoch $\tau$ and of the cumulative computational cost up to epoch $\tau$, since the randomness in the observation of those quantities can be viewed as external to the problem and not under our control. The constant $\gamma > 0$ models our aversion to computational cost: the higher its value, the more sensitive we are to increases in cost and the more willing we are to accept models with lower scores.
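To make the trade-off concrete, the criterion (1.1) can be evaluated in hindsight for a recorded run of scores and costs; the numbers below are purely illustrative and show how a larger $\gamma$ favours earlier stopping.

```python
import numpy as np

def objective(scores, costs, gamma):
    """Value of criterion (1.1) for every candidate stopping epoch tau.

    scores[n-1] = h_n, costs[n-1] = c_n; entry tau-1 of the result is
    h_tau - gamma * sum_{n<=tau} c_n.
    """
    scores = np.asarray(scores, dtype=float)
    cum_cost = np.cumsum(costs)
    return scores - gamma * cum_cost

# A run where the score saturates while cost keeps accruing:
h = [0.55, 0.80, 0.86, 0.88]
c = [0.10, 0.10, 0.10, 0.10]
for gamma in (0.5, 2.0):
    J = objective(h, c, gamma)
    print(gamma, int(np.argmax(J)) + 1)  # best tau in hindsight for this gamma
```

With the larger cost aversion the best hindsight stopping epoch moves from the third epoch to the second, illustrating the score/cost trade-off that the algorithm negotiates online.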
The choice of hyperparameters $u_{n-1}$ for the $n$-th epoch is based on the observation of model scores $h_1, \ldots, h_{n-1}$ and costs $c_1, \ldots, c_{n-1}$ in all the previous epochs as well as on the prior information about the dependence of the score and cost on the hyperparameters. The number of training epochs $\tau$ also depends on the history of model scores and computational costs. At the end of each epoch a decision is made whether to continue training or to accept the current model. Hence $\tau$ is not deterministically chosen before the learning starts and depends not only on the total computational budget but also on the observed scores and costs. This may, and does, lead AutoBCT to stop very early in some problems, when the model happens to be so good that the expenditure on further training is not worthwhile (according to the criterion (1.1), which looks at the trade-off between score and cost).
We approximate the score and the cost as functions of hyperparameters using linear combinations of basis functions. We assume that the prior knowledge about the distribution of the coefficients of these linear combinations is multivariate Gaussian and is updated after each epoch using Kalman filtering. This update not only changes the mean vector but also adjusts the covariance matrix which represents the uncertainty about the coefficients and, consequently, about the score and cost mappings. Updated quantities feed back into the algorithm which decides whether to continue or stop training, and, in the former case, which hyperparameters to choose for the next epoch. We show that the updated distributions of the coefficients are sufficient statistics of past observations of model scores and costs for the purpose of optimisation of the objective function (1.1).
Our framework requires that a value function, representing the optimum value of the objective function (1.1), is available. It depends only on the dimension of the problem and the trade-off coefficient $\gamma$, and is score and cost agnostic (in the sense that it does not depend on the particular problem and the choice of score and cost). This allows for an efficient offline precomputation and recycling of the value function maps. Users are able to share and reuse the same value function maps for different problems as long as the dimensionality of the hyperparameter space agrees. We demonstrate this in the validation section of this paper.
Similar to other AutoML frameworks [17,41], AutoBCT natively enables meta-learning. Prior distributions for the coefficients of basis functions determining the score and cost maps can be combined with the meta-data describing already completed learning tasks in publicly available repositories. These data can then be used to select a more informative prior for a given problem and, therefore, to warm-start the training. In turn, this leads to reductions in the computational cost and improvements in the score.
The focus on directly optimising the trade-off between the model score and the total computational cost comes from two directions. Firstly, given the recent emphasis on the eco-efficiency of AI [38,36], the AutoBCT framework provides effective ways of weighing model quality against the employed computational resources, as well as of recycling information and, in turn, reducing the computational resources required to train sufficiently good models. Secondly, with the democratisation of data analytics, an increasing amount of data exploration will take place where users need to obtain useful (but not necessarily optimal) results within seconds or minutes [12,21].
In summary, this paper's contributions are multifaceted. On the practical level, we develop an algorithm that allows optimal selection of hyperparameters for the training and evaluation of models with an explicit trade-off between the model quality and the computational resources. From the eco-efficiency perspective, our framework not only discourages unnecessarily resource-intensive training but also naturally enables the transfer and accumulation of knowledge. Lastly, on the side of Markov decision processes, we solve a non-standard, mixed stochastic control and stopping problem for a partially observable dynamical system and design an efficient numerical scheme for the computation of the value function and optimal controls.

Related literature
To maintain the appeal of ML to its huge and newly acquired audience, and to enable its use in systems that do not require human intervention, a large effort has been put into the development of algorithms and systems that allow automatic selection of an optimal ML method and optimal tuning of its parameters. This line of research has largely operated under the umbrella name of AutoML (see [25]), standing for Automated Machine Learning, and has been boosted by a series of data-crunching challenges under the same title (see, e.g., [20]). There are widely available AutoML packages in most statistical and data analytics software, see, e.g., Auto-Weka [30], Auto-sklearn [16], as well as in all major commercial cloud computing environments.
The core of parameter optimisation typically takes a Bayesian approach in the form of alternating between learning/updating a model that describes the dependence of the score on the hyperparameters, and using this model together with an acquisition function to select the next candidate hyperparameters. The different algorithms used to support this alternation of model fitting and updating of parameters also affect performance and suitability for specific problems. The three predominant algorithms are (for a discussion see [14]): Gaussian Processes [37], random forest based Sequential Model-based Algorithm Configuration (SMAC) [24], and the Tree Parzen estimator [5].
Our approach borrows from the ideas underlying Sequential Model-based Optimisation (SMBO) [24]. The surrogate model in our case is the basis-function representation of the score and cost maps. However, unlike in SMBO, the choice of hyperparameters is based not only on the surrogate model but also on the expected future performance encoded in the value function. In all of the above, time is certainly an important factor. It is, however, mostly treated either implicitly, through an exploration-exploitation trade-off, or through hard bounds, which can sometimes result in the algorithm providing no answer within the allocated time. A more sophisticated optimisation of computational resources has recently started to draw interest. In a recent work [15], the authors combine Bayesian optimisation, used to build a model and suggest potential new parameter configurations, with Hyperband, which dynamically allocates time budgets to these new configurations, applying successive halving [33] to reduce the time allocated to unsuccessful choices. [42] uses meta-learning and imposes time constraints on the predicted running time of models. [39] discusses a freeze-thaw technique for (at least temporarily) pausing training for unpromising hyperparameters. [8] uses learning curves to stop training under-performing neural network models. Another related line of research is concerned with the prediction of running times of algorithms (see [22], [26] and many others). Those algorithms are akin to meta-learning, i.e., the running time is predicted based on meta-features of algorithms and their inputs.
Our approach differs from the above ideas in that it explicitly accounts for the trade-off between the model quality (score) and the cumulative cost of training. To the best of our knowledge, this is the first attempt to raise the cost of computational resources to the same level of importance as the model score and to include it in the objective function. Our developments are, however, parallel to some of the papers discussed above. For example, our approach would benefit from the employment of learning curves to allow for incomplete training for hyperparameters which appear to produce suboptimal models. Our algorithm could complement OBOE [42] by fine-tuning hyperparameters of selected models. Such extensions are beyond the scope of this paper and will be explored in further research.
Our numerical method for the computation of the value function complements a large strand of literature on regression Monte Carlo methods. The most widely studied problem is that of optimal stopping (a.k.a. American option pricing), initiated by [40,34] and studied continuously to this day [35]. Stochastic control problems were first addressed in [28] using control randomisation techniques, and later studied in [2] using a regress-later approach and in [1] employing deep neural networks. The numerical methods designed in our paper are closest in spirit to the regress-later ideas of [2] (see also [1]).

Paper structure
The paper is structured as follows. Section 2.1 gives details of the optimisation problem. Technical aspects of embedding this problem within the theory of Markov decision processes with partial information come in the following subsection. Due to the noisy observation of the realised model score and cost (partial information), the resulting model is non-Markovian. Section 2.3 reformulates the optimisation problem as a classical Markov decision model with a different state space. The rest of Section 2 is devoted to showing that the latter optimisation problem has a solution, developing a dynamic programming equation and showing how to compute optimal controls. Section 3 contains details of the numerical computation of the value function and provides the necessary algorithms. Validation of our approach is performed in Section 4, which opens with a synthetic example to present, in a simple setting, the properties of our solution.

Stochastic control problem
In this section, we embed the optimisation problem in the theory of stochastic control of partially observable dynamical systems [6, Chapter 10] and lay foundations for its solution.

Model overview
We start by providing a sketch of our framework. Recall the optimisation problem (1.1). The random quantities $h_n$ and $c_n$ are not known in advance as they depend on a dataset, a particular problem, and the software and hardware used. We therefore embed the optimisation problem in a Bayesian learning setting. To this end, we assume that
$$ h_{n+1} = H(u_n) + \sigma_H(u_n)\, \varepsilon^H_{n+1}, \qquad (2.2) $$
$$ c_{n+1} = C(u_n) + \sigma_C(u_n)\, \varepsilon^C_{n+1}, \qquad (2.3) $$
where $(\varepsilon^H_n)$ and $(\varepsilon^C_n)$ are independent sequences of independent standard normal $N(0,1)$ random variables. The terms $\sigma_H(u_n)\varepsilon^H_{n+1}$ and $\sigma_C(u_n)\varepsilon^C_{n+1}$ correspond to the aforementioned random fluctuations of the observed quantities around $H(u_n)$ and $C(u_n)$ and are Gaussian with mean $0$ and the known variances $\sigma_H^2(u_n)$ and $\sigma_C^2(u_n)$. The quantities $H(u)$ and $C(u)$ are the true expected model score and the true expected model cost given the choice of hyperparameters $u$; for readability, we often omit the word expected.
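A minimal simulation of this observation model may help fix ideas; the maps `H`, `C` and the noise scales below are hypothetical stand-ins chosen for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true maps on U = [0, 1]: score saturates in u, cost grows linearly.
H = lambda u: 0.9 * np.sqrt(u)
C = lambda u: 0.1 + 0.5 * u
sigma_H = lambda u: 0.05   # known observation-noise scales sigma_H(u), sigma_C(u)
sigma_C = lambda u: 0.02

def observe(u):
    """One epoch's noisy readings (h_{n+1}, c_{n+1}) for hyperparameter u_n = u."""
    h = H(u) + sigma_H(u) * rng.standard_normal()
    c = C(u) + sigma_C(u) * rng.standard_normal()
    return h, c

h1, c1 = observe(0.5)
```

Repeated calls with the same `u` scatter around `H(u)` and `C(u)`, which is exactly the fluctuation that the filtering step later averages out.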
In order to embed the learning process in a computationally efficient Bayesian setting, we assume that $H(u)$ and $C(u)$ can be represented as linear combinations of basis functions:
$$ H(u) = \sum_{i=1}^{N} \alpha_i \varphi_i(u), \qquad C(u) = \sum_{j=1}^{M} \beta_j \psi_j(u), $$
where the vectors of coefficients $\alpha$ and $\beta$ are unknown and are learnt while learning takes place. A priori, before learning starts, $\alpha$ follows a normal distribution with the mean $\hat\alpha_0$ and the covariance matrix $\Sigma^\alpha_0$; the vector of coefficients $\beta$ follows a normal distribution with the mean $\hat\beta_0$ and the covariance matrix $\Sigma^\beta_0$. These distributions are referred to as the prior distributions. We assume that a priori $\alpha$ and $\beta$ are independent from each other. This choice of distributions is motivated by computational tractability but comes at the cost of modelling inaccuracy. In particular, we cannot guarantee that $C(u) \ge 0$ for any $u \in U$ and any realisation of $\beta$. The consequences will be explored in more depth later in the paper.
Algorithm 1 provides a sketch of our approach. Given all information at epoch 0, i.e., given the prior distributions of $\alpha$ and $\beta$, we choose values of hyperparameters $u_0$ and then train and validate a model. Based on the observed model score $h_1$ and cost $c_1$, we update the distributions of $\alpha$ and $\beta$ (the learning step). Following this, we check whether to continue learning and training. If the decision is to stop, we return the latest values of the hyperparameters and further information, for example, the latest distributions of $\alpha$ and $\beta$ or the latest model. Otherwise, we repeat the choice of hyperparameters, training, validation and update procedures. We continue until the decision is made to stop.

Remark 2.1. In our Bayesian learning framework, the expectation in (1.1) is taken not only with respect to the randomness of the observation errors $(\varepsilon^H_n)$ and $(\varepsilon^C_n)$ but also with respect to the prior distributions of $\alpha$ and $\beta$. Therefore, all possible realisations of the score and cost mappings are evaluated, with more probable outcomes (as directed by the prior distribution) receiving proportionally more contribution to the value of the objective function.
In the remainder of this subsection, we derive the optimisation problem that will be solved in this paper. This will result in modifications of both terms in the objective function (1.1).
The score. In problem (1.1) the optimised quantity is the observed score of the most recently trained model. When the emphasis is on the selection of hyperparameters, one would replace $h_\tau$ by $H(u_{\tau-1})$, which corresponds to the maximisation of the true expected score without the observation error:
$$ \sup_{\tau,\,(u_n)} \mathbb{E}\Big[ H(u_{\tau-1}) - \gamma \sum_{n=1}^{\tau} c_n \Big]. \qquad (2.4) $$
The cost. The optimal value of problem (2.4), as well as of (1.1), is infinite and offers controls $(u_n)$ of little practical use. This is because $\beta$ follows a normal distribution for the sake of numerical tractability. Indeed, one cannot find assumptions on the vector of basis functions to ensure that $\sum_{j=1}^{M} \beta_j \psi_j(u) \ge 0$ for all $u \in U$ with probability one. Conversely, for any $\delta > 0$ we have
$$ P\Big( \sum_{j=1}^{M} \beta_j \psi_j(u) < -\delta \ \text{for some } u \in U \Big) > 0. $$
While learning, one could identify, with increasing probability as $n$ increases, those realisations of $\beta$ for which there is $u \in U$ with $\sum_{j=1}^{M} \beta_j \psi_j(u) < -\delta$ for some fixed $\delta > 0$. Then, continuing infinitely long with those controls would lead to a positive infinite value of the expression inside the expectation.
In view of the above remark, we truncate the cost to stay positive. The final form of the optimisation problem is
$$ \sup_{\tau,\,(u_n)} \mathbb{E}\Big[ H(u_{\tau-1}) - \gamma \sum_{n=1}^{\tau} (c_n)^+ \Big], \qquad (2.5) $$
where $(x)^+ := \max(x, 0)$. We show in Lemma C.1 that the optimal value of this problem is finite.
An alternative approach could be to amend the definition of $C$ to ensure non-negativity, but this would invalidate the assumptions of the Kalman filter used in learning and, hence, make the solution of the problem numerically infeasible.
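The positivity failure of the Gaussian model can be checked numerically: under any Gaussian prior on the coefficients, a non-negligible fraction of realisations of the cost map dips below zero somewhere on the control set. The basis and prior below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Cost map C(u) = beta^T psi(u) with basis (1, u, u^2) and a Gaussian prior on beta.
psi = lambda u: np.stack([np.ones_like(u), u, u * u])   # shape (3, len(u))
beta_mean = np.array([0.3, 0.2, 0.1])                   # hypothetical prior mean
beta_cov = 0.05 * np.eye(3)                             # hypothetical prior covariance
grid = np.linspace(0.0, 1.0, 101)

draws = rng.multivariate_normal(beta_mean, beta_cov, size=5000)
min_cost = (draws @ psi(grid)).min(axis=1)        # min_u C(u) for each realisation
frac_negative = float((min_cost < 0.0).mean())    # strictly positive, per the remark
print(frac_negative)
```

The fraction is strictly between 0 and 1 for any non-degenerate Gaussian prior, which is precisely why the truncation $(c_n)^+$ in (2.5) is needed.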

Technical details
Let $(\Omega, \mathcal{F}, P)$ be the underlying probability space supporting the sequences $(\varepsilon^H_n)_{n\ge1}$ and $(\varepsilon^C_n)_{n\ge1}$ and the random variables $\alpha$ and $\beta$. By $\mathcal{F}_n$ we denote the observable history up to and including epoch $n$, i.e., $\mathcal{F}_n$ is the $\sigma$-algebra generated by the observed scores and costs $(h_k, c_k)_{k \le n}$, with $\mathcal{F}_0 = \{\emptyset, \Omega\}$. The choice of hyperparameters must be based only on the observable history, i.e., $u_n$ must be $\mathcal{F}_n$-measurable for any $n \ge 0$. The variable $\tau$ which governs the termination of training must be an $(\mathcal{F}_n)$-stopping time, i.e., $\{\tau = n\} \in \mathcal{F}_n$ for $n \ge 0$: the decision whether to finish training is based on the past and present observations only. The difficulty lies in the fact that observations combine evaluations of the true expected score function and the running time function with errors, c.f. (2.2). This places us in the framework of stochastic control with partial observation [6, Chapter 10].
Denote $\varphi(u) = (\varphi_1(u), \ldots, \varphi_N(u))^T$ and $\psi(u) = (\psi_1(u), \ldots, \psi_M(u))^T$. Equations (2.2)-(2.3) can be written in vector notation as
$$ h_{n+1} = \varphi(u_n)^T \alpha + \sigma_H(u_n)\, \varepsilon^H_{n+1}, \qquad c_{n+1} = \psi(u_n)^T \beta + \sigma_C(u_n)\, \varepsilon^C_{n+1}. $$
We will use this notation throughout the paper. The power of the theory of Markov decision processes is in its tractability: an optimal control can be written in terms of the current state of the system. However, when the system is not fully observed, there is an obvious benefit in taking all past observations into account to determine possible values of the unobservable elements of the system. Indeed, due to the form of the observations (2.2)-(2.3), one can infer much more about $\alpha$ and $\beta$ when taking into account all available past readings of model scores and training costs. It turns out that optimal control problems of the form studied in this paper can be rewritten into equivalent systems with a bigger state space but with full observation, for which classical dynamic programming can be applied.
Before we proceed with this programme, we state standing assumptions: (A) The basis functions $(\varphi_i)_{i=1}^N$ and $(\psi_j)_{j=1}^M$ are bounded on $U$.
(S) The observation error variances are bounded and separated from zero, i.e., there are constants $0 < \underline{\sigma} \le \overline{\sigma} < \infty$ such that $\underline{\sigma} \le \sigma_H(u),\, \sigma_C(u) \le \overline{\sigma}$ for all $u \in U$.
It is not required for the validity of the mathematical results that the basis functions are orthogonal in a specific sense or even linearly independent. However, linear independence is required for identification purposes and it speeds up learning.

Separation principle
Denote by $\pi_n$ the conditional distribution of $\alpha$ given $\mathcal{F}_n$. It follows from an application of the discrete-time Kalman filter [4, Section 4.7] that $\pi_n$ is Gaussian with the mean $\hat\alpha_n$ and the covariance matrix $\Sigma^\alpha_n$ given by the following recursive formulas:
$$ \hat\alpha_{n+1} = \hat\alpha_n + \frac{\Sigma^\alpha_n \varphi(u_n)}{\varphi(u_n)^T \Sigma^\alpha_n \varphi(u_n) + \sigma_H^2(u_n)} \big( h_{n+1} - \varphi(u_n)^T \hat\alpha_n \big), \quad \Sigma^\alpha_{n+1} = \Sigma^\alpha_n - \frac{\Sigma^\alpha_n \varphi(u_n) \varphi(u_n)^T \Sigma^\alpha_n}{\varphi(u_n)^T \Sigma^\alpha_n \varphi(u_n) + \sigma_H^2(u_n)}. \qquad (2.7) $$
Denote by $\mathcal{B}_n$ the conditional distribution of $\beta$ given $\mathcal{F}_n$. By the same arguments as above, it is also Gaussian with the mean $\hat\beta_n$ and the covariance matrix $\Sigma^\beta_n$ given by the following recursive formulas:
$$ \hat\beta_{n+1} = \hat\beta_n + \frac{\Sigma^\beta_n \psi(u_n)}{\psi(u_n)^T \Sigma^\beta_n \psi(u_n) + \sigma_C^2(u_n)} \big( c_{n+1} - \psi(u_n)^T \hat\beta_n \big), \quad \Sigma^\beta_{n+1} = \Sigma^\beta_n - \frac{\Sigma^\beta_n \psi(u_n) \psi(u_n)^T \Sigma^\beta_n}{\psi(u_n)^T \Sigma^\beta_n \psi(u_n) + \sigma_C^2(u_n)}. \qquad (2.8) $$
Furthermore, it follows from the definition of $h_{n+1}$ and $c_{n+1}$ that the conditional distribution of $h_{n+1}$ given $\mathcal{F}_n$ and control $u_n$ is Gaussian, $N\big(\varphi(u_n)^T \hat\alpha_n,\ \varphi(u_n)^T \Sigma^\alpha_n \varphi(u_n) + \sigma_H^2(u_n)\big)$, and the conditional distribution of $c_{n+1}$ given $\mathcal{F}_n$ and control $u_n$ is also Gaussian, $N\big(\psi(u_n)^T \hat\beta_n,\ \psi(u_n)^T \Sigma^\beta_n \psi(u_n) + \sigma_C^2(u_n)\big)$. The measure-valued stochastic processes $(\pi_n)$, $(\mathcal{B}_n)$ are adapted to the filtration $(\mathcal{F}_n)$. This would have been of little practical use for us if it were not for the fact that those measure-valued processes take values in a space of Gaussian distributions, so the state is determined by the mean vector and the covariance matrix. It is then clear that the process $(\hat\alpha_n, \Sigma^\alpha_n, \hat\beta_n, \Sigma^\beta_n, c_n)$ is a Markov process (we omit $h_n$ as it does not appear in the objective function (2.5)). Its dynamics are explicitly given in Appendix A.
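The recursions (2.7)-(2.8) are the standard Kalman update for a static Gaussian coefficient vector. A compact sketch, using an illustrative quadratic basis and synthetic observations (all specific numbers below are assumptions for the demo):

```python
import numpy as np

def kalman_update(m, S, x, y, noise_var):
    """One step of the recursions (2.7)-(2.8): posterior of Gaussian
    coefficients after observing y = x @ coef + Gaussian noise.

    m, S: current mean and covariance of the coefficient vector;
    x: basis vector phi(u) (score) or psi(u) (cost);
    y: observed score h (or cost c); noise_var: sigma^2(u).
    """
    s = float(x @ S @ x) + noise_var        # predictive variance of y
    K = (S @ x) / s                         # Kalman gain
    m_new = m + K * (y - float(x @ m))      # mean update
    S_new = S - np.outer(K, x @ S)          # covariance update
    return m_new, S_new

# Illustrative one-dimensional control with basis (1, u, u^2) and synthetic data.
phi = lambda u: np.array([1.0, u, u * u])
alpha_true = np.array([0.1, 1.2, -0.6])     # hypothetical true coefficients
m, S = np.zeros(3), np.eye(3)               # prior N(0, I)
rng = np.random.default_rng(1)
for _ in range(1000):
    u = rng.uniform()
    h = phi(u) @ alpha_true + 0.05 * rng.standard_normal()
    m, S = kalman_update(m, S, phi(u), h, 0.05 ** 2)
```

After enough epochs the posterior mean concentrates near the true coefficients while the posterior covariance, and hence the predictive uncertainty about the score map, shrinks; this shrinkage is what the continuation value trades off against the cost of further epochs.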
The following theorem shows that the optimisation problem can be restated in terms of these variables instead of $\alpha$ and $\beta$, i.e., in terms of observable quantities only. This reformulates the problem as a fully observed Markov decision problem, labelled (2.9), which can be solved using dynamic programming methods. The proof of this theorem, as well as of other results, is collected in Appendix C.

Dynamic programming formulation
By inspecting the dynamics of $(\hat\alpha_n, \Sigma^\alpha_n, \hat\beta_n, \Sigma^\beta_n, c_n)$ in Appendix A, one notices that the dependence of the state at time $n+1$ on time $n$ is only via the parameters of the filters $(\pi_n)$, $(\mathcal{B}_n)$, which contain sufficient statistics about $(h_k)_{k=1}^n$ and $(c_k)_{k=1}^n$. Furthermore, the dynamics are time-homogeneous, i.e., they do not depend on the epoch $n$. Hence, the value function of the optimisation problem (2.9) depends only on 4 parameters:
$$ V(\hat\alpha, \Sigma^\alpha, \hat\beta, \Sigma^\beta). \qquad (2.10) $$
The function $V$ gives the optimal value of the optimisation problem given that the distributions of $\alpha$ and $\beta$ are Gaussian with the given parameters. The expression on the right-hand side is well defined and the value function does not admit values $\pm\infty$, see Lemma C.1.
In preparation for stating the dynamic programming principle, define an operator $\mathcal{T}$ acting on measurable functions $v(\hat\alpha, \Sigma^\alpha, \hat\beta, \Sigma^\beta)$ as follows:
$$ (\mathcal{T}v)(\hat\alpha, \Sigma^\alpha, \hat\beta, \Sigma^\beta) = \sup_{u \in U} \mathbb{E}\Big[ -\gamma\, (c_1)^+ + \max\big( \varphi(u)^T \hat\alpha_1,\ v(\hat\alpha_1, \Sigma^\alpha_1, \hat\beta_1, \Sigma^\beta_1) \big) \Big], \qquad (2.12) $$
where the expectation is with respect to $h_1, c_1$ under $\hat\alpha_0 = \hat\alpha$, $\Sigma^\alpha_0 = \Sigma^\alpha$, $\hat\beta_0 = \hat\beta$, $\Sigma^\beta_0 = \Sigma^\beta$ and $u_0 = u$; we will use this shorthand notation whenever it does not lead to ambiguity. The expectation of the first term, containing only $c_1$, can be computed in a closed form, improving the efficiency of numerical calculations:
$$ \mathbb{E}\big[(c_1)^+\big] = \Upsilon(m, s^2) := m\, \Phi\Big(\frac{m}{s}\Big) + \frac{s}{\sqrt{2\pi}}\, e^{-m^2/(2s^2)}, $$
where $m = \psi(u)^T \hat\beta$, $s^2 = \psi(u)^T \Sigma^\beta \psi(u) + \sigma_C^2(u)$, and $\Phi$ is the cumulative distribution function of the standard normal distribution. The justification of this formula is in Appendix C. In order to approximate $V$, consider an iterative scheme:
$$ v_{k+1} = \mathcal{T} v_k, \quad k \ge 0, \qquad \text{with } v_0 \equiv -\infty. \qquad (2.13) $$
The following theorem guarantees that the above scheme is well defined, i.e., $v_k$ does not take values $\pm\infty$ for $k \ge 1$, and the expression under the expectation in $\mathcal{T}v_k$ is integrable, so that the operator $\mathcal{T}$ can be applied to $v_k$. The proof is in Appendix C.
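The closed-form expectation of the positive part of a Gaussian used above is standard and straightforward to implement (a standalone sketch, using only the standard library):

```python
import math

def upsilon(m, s2):
    """E[max(X, 0)] for X ~ N(m, s2): m * Phi(m/s) + (s / sqrt(2*pi)) * exp(-m^2 / (2 s^2))."""
    s = math.sqrt(s2)
    z = m / s
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal cdf
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return m * Phi + s * density

# Sanity checks: a centred unit Gaussian has E[X^+] = 1/sqrt(2*pi),
# and for m >> s the truncation is inactive, so E[X^+] is close to m.
print(upsilon(0.0, 1.0), upsilon(5.0, 1e-6))
```

Having this term in closed form removes one layer of Monte Carlo averaging from every evaluation of the operator, which matters since the operator is applied at every cloud point and grid control.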

2. $v_K$ is the value function of the problem with at most $K$ training epochs, i.e.,
$$ v_K(\hat\alpha, \Sigma^\alpha, \hat\beta, \Sigma^\beta) = \sup_{1 \le \tau \le K,\ (u_n)} \mathbb{E}\Big[ H(u_{\tau-1}) - \gamma \sum_{n=1}^{\tau} (c_n)^+ \Big]. $$

4. The value function $V$ defined in (2.10) is a fixed point of the operator $\mathcal{T}$, i.e., it satisfies the dynamic programming equation $V = \mathcal{T}V$.

Assertion 4 is of theoretical interest: it is the statement of the dynamic programming equation for the infinite-horizon stochastic control problem. Assertion 2 states that $v_K$ is the value function for the problem with at most $K$ training epochs, which suggests what trade-off is made when using $v_K$ instead of $V$. Furthermore, $v_K \le V$ by Assertion 3, which, as we will see in the next section, may affect the stopping. We will refer to those results in the numerical section of the paper.

Identifying control and stopping
If $V$ was available, then the choice of control and the decision about stopping in Algorithm 1 would follow from classical results in optimal control theory. Indeed, at epoch $n$ one would choose $u_n$ as a maximiser of the expression inside the supremum in (2.12) evaluated at the current state, then train and validate the model using hyperparameters $u_n$, compute $\hat\alpha_{n+1}, \Sigma^\alpha_{n+1}, \hat\beta_{n+1}, \Sigma^\beta_{n+1}$ via (2.7)-(2.8), and stop when
$$ \varphi(u_n)^T \hat\alpha_{n+1} \ge V(\hat\alpha_{n+1}, \Sigma^\alpha_{n+1}, \hat\beta_{n+1}, \Sigma^\beta_{n+1}). \qquad (2.16) $$
A natural adaptation of this procedure to the case when one only knows $v_K$ is to replace $V$ by $v_K$ in both places in the above formulas. Another possibility is to follow an optimal strategy corresponding to the computed value functions, i.e., with a given maximum number $K$ of training epochs. If one knows $v_1, \ldots, v_K$, then for $n = 0, \ldots, K-1$ one chooses $u_n$ as a maximiser in the definition of $v_{K-n}$, and stops when the analogue of (2.16) holds with $v_{K-n-1}$ on the right-hand side, where, with an abuse of notation, we put $v_0 \equiv -\infty$, i.e., at time $n = K$ we always stop and we take that into account in choosing $u_{K-1}$ (c.f. the definition of $v_1$). Let us refer to the first approach as the relaxed method and to the second as the exact method. The advantage of the exact method is that it has strong underlying mathematical guarantees, but it is limited to $K$ training epochs with $K$ chosen a priori. The relaxed method does not impose such constraints a priori, but it may stop too early as the right-hand side of the condition (2.16) is replaced by $v_K \le V$.
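The exact method can be sketched as a generic loop; the callables below (`q_values`, `v_values`, `step`) stand in for the precomputed value functions and the train-validate-update step and are purely illustrative placeholders.

```python
def exact_method(q_values, v_values, x, step, K):
    """Exact method of Section 2.5 for a horizon of at most K epochs (sketch).

    q_values(k, x) -> (u*, continuation value) maximising the k-epoch Q-value at x,
    v_values(k, x) -> value v_k(x) of the problem with at most k epochs (v_0 = -inf),
    step(x, u)     -> (next filter state, posterior mean score after training with u).
    All three are assumed to be supplied by the offline value-iteration stage.
    """
    for n in range(K):
        u, _ = q_values(K - n, x)                    # control prescribed by v_{K-n}
        x, stop_value = step(x, u)                   # train, observe, Kalman-update
        if stop_value >= v_values(K - n - 1, x):     # v_0 = -inf forces a stop at n = K-1
            return u, stop_value, n + 1              # accepted control, score, tau
    return u, stop_value, K                          # defensive fallback

# Toy instantiation: the posterior-mean score rises by 0.3 per epoch,
# the continuation value is a constant 0.5 until the horizon is exhausted.
q = lambda k, x: (0.5, 0.0)
v = lambda k, x: 0.5 if k > 0 else float("-inf")
s = lambda x, u: (x + 1, 0.3 * (x + 1))
u, val, tau = exact_method(q, v, 0, s, 5)
```

In the toy run the loop stops at the second epoch, the first time the posterior-mean score exceeds the continuation value; with `K = 1` the sentinel `v_0 = -inf` forces a stop after the single permitted epoch.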

Numerical implementation of AutoBCT
In this section, we describe the methodology behind our implementation of AutoBCT. We show how to numerically approximate the iterative scheme (2.13) and how to explicitly account for numerical errors when using the estimated $(v_k)_{k \ge 1}$ for the choice of control (hyperparameters) and for the stopping condition in Algorithm 1. We wish to reiterate that, apart from the on-the-fly method (Section 3.5), a numerical approximation of the value functions $v_k$, $k \ge 1$, needs to be computed only once and can then be applied in Algorithm 4 to various datasets and modelling tasks. We demonstrate this in Section 4.

Notation
Denote by $\mathcal{X}$ the state space of the process $X_n = (\hat\alpha_n, \Sigma^\alpha_n, \hat\beta_n, \Sigma^\beta_n)$. For $x = (\hat\alpha, \Sigma^\alpha, \hat\beta, \Sigma^\beta) \in \mathcal{X}$, given $h, c \in \mathbb{R}$ and $u \in U$, we write $\mathrm{Kalman}(x, h, c)$ for the output of the Kalman filter (2.7)-(2.8) (with an obvious adjustment to the notation: $x$ is taken to be the state at time $n$, and $h$, $c$ are the observations of the score $h_{n+1}$ and the cost $c_{n+1}$).
For convenience of implementation we assume that $U = [0, 1]^d$, which can be achieved by an appropriate normalisation of the control set. We show in Section 4 that this assumption is not detrimental even if some of the hyperparameters take discrete values.
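A possible normalisation layer looks as follows; the hyperparameter names and bounds are hypothetical, chosen only to illustrate mixing a discrete and a (log-scaled) continuous coordinate.

```python
import numpy as np

# Hypothetical bounds: a discrete tree depth and a learning rate handled on log scale.
BOUNDS = {"max_depth": (2, 20), "log10_lr": (-4, -1)}

def to_unit(params):
    """Map raw hyperparameters to the normalised control u in [0, 1]^d."""
    return np.array([(params[k] - lo) / (hi - lo) for k, (lo, hi) in BOUNDS.items()])

def from_unit(u):
    """Map a control u back to raw values, rounding the discrete coordinate."""
    raw = {k: lo + ui * (hi - lo) for ui, (k, (lo, hi)) in zip(u, BOUNDS.items())}
    raw["max_depth"] = int(round(raw["max_depth"]))
    return raw
```

Rounding the discrete coordinate after the optimisation over $[0,1]^d$ is the simplest treatment of integer-valued hyperparameters; Section 4 indicates that this does not noticeably hurt performance.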

Computation of the value function
We compute the value function via the iterative scheme (2.13). We use ideas of regression Monte Carlo for controlled Markov processes, particularly the regress-later method [2,23], and approximate the value functions $v_k$, $k = 1, \ldots, K$, learnt over an appropriate cloud of states $\mathcal{C} \subset \mathcal{X}$. The choice of the cloud $\mathcal{C}$ is crucial for the accuracy of the approximation of the value function at points of interest; it is discussed in Section 3.4. We fix a grid of controls $G \subset U$ for computing an approximation of the Q-value, as in practical applications we found it to perform better than a (theoretically supported) random choice of points in the control space $U$. The computation of the value function is presented in two algorithms. Algorithm 2 finds an approximation of the expression inside the optimisation in $(\mathcal{T}v)$ (see (2.12)), while Algorithm 3 shows how to adapt the value iteration approach to the framework of our stochastic optimisation problem. Details follow.
Algorithm 2 defines a function $\mathrm{Lambda}(x, v; G, M)$ that computes an approximation to the mapping of a control $u$ to the expression inside the supremum in (2.12). It is done by evaluating the outcome of each control $u$ in $G$ (line 1 of Algorithm 2) using Monte Carlo averaging over $M$ samples from the distribution of the state $X_1$ given $X_0 = x$ and the control $u$; this averaging approximates the expectation. However, the regression in line 7 employs further cross-sectional information to reduce the error in the Monte Carlo estimates of this expectation as long as the sampling of $(h)$ and $(c)$ is independent between different values $u$. This effect has a mathematical grounding as a Monte Carlo approximation of orthogonal projections in an appropriate $L^2$ space if the loss function is quadratic, $\ell(y, \hat y) = (y - \hat y)^2$, c.f. [40,2]. In particular, it is valid to use $M = 1$, but larger values may result in more efficient computation: one seeks a trade-off between the size of $G$ and the size of $M$ to optimise computational time and achieve sufficient accuracy.
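The variance-reduction effect of the cross-sectional regression can be seen in a toy stand-in for the Q-value: one noisy sample per grid point ($M = 1$), then a projection onto a small basis recovers the smooth mapping far more accurately than the raw estimates. The target function and noise level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for the Q-value: q(u) = E[one-step reward], observed with one
# independent noisy sample per grid point (the M = 1 regime).
q_true = lambda u: u * (1.0 - u)
grid = np.linspace(0.0, 1.0, 200)
raw = q_true(grid) + 0.1 * rng.standard_normal(grid.size)   # raw Monte Carlo estimates

# Cross-sectional regression (line 7 of Algorithm 2): project the noisy
# estimates onto a small polynomial basis.
coef = np.polyfit(grid, raw, 2)
smoothed = np.polyval(coef, grid)

rmse = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))
print(rmse(raw, q_true(grid)), rmse(smoothed, q_true(grid)))
```

The smoothed estimate pools information across the whole grid, so its error shrinks roughly with the total number of grid points rather than with the per-point sample count — this is why `M = 1` remains a valid choice.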
The space of functions $\mathcal{F}_\Lambda$ over which the optimisation in the regression in line 7 is performed depends on the choice of functional approximation. In regression Monte Carlo it is a linear space spanned by a finite number of basis functions. Other approximating families $\mathcal{F}_\Lambda$ include neural networks, splines and regression Random Forests. Although we do not include it explicitly in the notation, the optimisation may further involve regularisation in the form of elastic nets [43] or regularised neural networks, which often improves the real-world performance.
Algorithm 3 iteratively applies the operator $\mathcal{T}$ and computes a regression approximation of the value function. The loop in lines 2-5 computes the optimal control and the value for each point $x$ from the cloud $\mathcal{C}$. In line 3, we approximate the Q-value mapping using Algorithm 2. The regression in line 6 uses pairs of the state and the corresponding optimal value to fit an approximation $\tilde v_{k+1}$ of the value function $v_{k+1}$. This approximation is needed in the following iteration of the loop (in line 3) and as the output of the algorithm, to be used in applications of AutoBCT to a particular data problem. The family of functions $\mathcal{F}_V$ (as $\mathcal{F}_\Lambda$) includes linear combinations of basis functions (for regression Monte Carlo), neural networks or regression Random Forests. Importantly, the dimensionality of the state is much larger than that of the control itself, so an efficient approximation of the value function requires more intricate methods than the approximation $\mathrm{Lambda}(x, v; G, M)$ of the Q-value. As in Algorithm 2, some regularisation may benefit real-world performance. The loss function is commonly the squared error $\ell(y, \hat y) = (y - \hat y)^2$, but others can also be used (particularly when using neural networks) at the cost of a more difficult, or no, justification of the convergence of the whole numerical procedure.

AutoBCT
The core part of AutoBCT is given in Algorithm 4. We use the notation $\mathcal{M}$ to denote an external Machine Learning (ML) process that outputs the score $h$ and the cost $c$ for a choice of hyperparameters $u \in U$. The 'raw' score $\hat h$ and the 'raw' cost $\hat c$ obtained from the ML routine are transformed onto the interval $[0, 1]$ with a user-specified map (robustness to a mis-specification of this map is discussed below): $(\hat h, \hat c) \mapsto (h, c)$.
For the cost we usually use an affine transformation based on a best guess of the maximum and minimum values. For the score $h$ the choice depends on the problem at hand: in our practice we have employed affine and affine-log transformations. Mis-specifications of these maps, whereby actual transformed values fall outside of the interval $[0, 1]$, are accounted for by learning with a cloud $\mathcal{C}$ containing states that map onto distributions under which the resulting functions $H$ and $C$ are not bounded by $[0, 1]$ with non-negligible probability. This robustness does not, however, relieve the user from a sensible choice of the above transformations, as significantly bad choices have a tangible effect on the behaviour of the algorithm.
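An illustrative pair of user-specified transformations: an affine map for the score and an affine-log map for the running time. The bounds are best guesses the user supplies up front (assumptions, not values from the paper), and mis-specifying them simply pushes values outside $[0,1]$, which the learning step is designed to tolerate.

```python
import math

# User-supplied best guesses of the observable ranges (assumptions).
H_MIN, H_MAX = 0.5, 1.0          # raw accuracy range
C_MIN, C_MAX = 1.0, 600.0        # raw running time range, in seconds

def transform(raw_score, raw_cost):
    """Map raw readings onto [0, 1]: affine for the score, affine-log for the cost."""
    h = (raw_score - H_MIN) / (H_MAX - H_MIN)
    c = (math.log(raw_cost) - math.log(C_MIN)) / (math.log(C_MAX) - math.log(C_MIN))
    return h, c
```

The log scale for the cost compresses the long right tail of running times, so that a model taking ten times longer does not dominate the cost coordinate entirely.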
We present an algorithm for the relaxed method introduced in Section 2.5; an adaptation of Algorithm 4 to the exact method is straightforward and has been omitted.
The input of Algorithm 4 consists of an approximation of the value function. An optimal control at a given epoch and state should be determined by maximising expression (3.17) over the controls. For each control, this expression compares the two possibilities available after applying it: either stopping or continuing training. These two options correspond to strategically different choices: the first maximises the score in the next epoch, while the latter prioritises learning. The learning option is preferred if the future rewards (as given by the value function) exceed the present expected score. The approximation of the value function may be burdened by errors, which may result in an incorrect decision and drive the algorithm into an unnecessary 'learning spree' (the opposite effect of premature stopping is less dangerous, as it does not lead to unnecessarily long computation times). To mitigate this effect, we replace (3.17) with the error-adjusted expression (3.18), in which the value function is damped by the factor (1 − ε) for an error adjustment ε ≥ 0; setting ε = 0 recovers the original formula (3.17). The approximation of the adjusted mapping is computed by Algorithm 2 with the damped value function. The grid and the number of Monte Carlo iterations are usually taken much larger than in the value iteration algorithm, since one needs only a single evaluation at each epoch instead of the thousands of evaluations needed in Algorithm 3. Furthermore, a more accurate computation of the mapping Λ was shown to improve real-life performance.
The stopping condition in line 9 is an approximation of the condition that the expected score is dominated by the value function, with the right-hand side approximated by the supremum of Λ over the controls with the error adjustment set to 0. Using a non-zero adjustment counteracts numerical and approximation errors, as discussed above.
The stopping condition could alternatively be implemented directly in terms of the precomputed value function approximation damped by (1 − ε). We do not do so for two reasons. Firstly, Λ is a more accurate approximation of the value function at the required point, due to the averaging effect of evaluating exactly at that point and to computations being done with much higher accuracy, stemming from a finer grid and a larger number of Monte Carlo iterations. Secondly, this brings the stopping condition in line with the choice of the optimal control in line 8, which is also based on Λ, and it comes at no extra cost.

States and clouds
The approximations in Algorithm 3 are learnt over the states in a cloud, and therefore the quality of these maps depends on the choice of that cloud. The cloud needs to comprise sufficiently many states encountered over the running of the model in order to provide good approximations for the choice of control and for the stopping decision in Algorithm 4. In the course of learning, the norms of the covariance matrices become smaller (the filter concentrates ever more on the mean), asymptotically approaching a truth:

Definition 3.1 (Truth). A state is called a truth if both of its covariance matrices are zero. The set of all truths is denoted by ∞.
We argue that the learning cloud should contain: a) an adequate number of states that are truths; b) states that are suitably close to the priors one will use in Algorithm 4; c) an adequate number of states 'lying between' the above.
A failure to include states which are truths (point a) inhibits the ability of the AutoBCT algorithm to stop adequately. Conversely, a failure to include states which Algorithm 4 encounters in its running (point c) inhibits the ability to continue adequately. A failure to include a sufficient number of states that are close to the priors (point b) impacts the coverage of the state space by the states of point (c) and, further, due to errors in extrapolation, the decisions made in the first step of the algorithm.
This construction results in a cloud whose size is proportional to the number of sampled priors and shrinkage levels. Note that if in steps 1 and 2 one uses distributions with means equal to the possible priors, point (b) above is fulfilled. At step 3, shrinking the covariances all the way to zero gives zero covariance matrices, i.e., truths according to the definition above. This also highlights that the distributions used in step 1 should be sufficiently dispersed to cover possible truths for the models to which AutoBCT will be applied.
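A cloud along the lines sketched above might be generated as follows; `build_cloud`, its arguments and the covariance-shrinkage scheme are illustrative assumptions rather than the paper's exact three-step procedure. Shrinking a covariance pair all the way to zero produces a truth in the sense of Definition 3.1.

```python
import numpy as np

def build_cloud(prior_means, prior_covs, shrink_factors):
    """Hypothetical sketch of cloud construction: for each candidate prior
    (mean, covariance) and each shrink factor lam in [0, 1], add the state
    with the covariance scaled by lam.  lam = 0 gives a zero covariance,
    i.e. a 'truth' (Definition 3.1); lam = 1 keeps the original prior."""
    cloud = []
    for m, S in zip(prior_means, prior_covs):
        for lam in shrink_factors:  # e.g. np.linspace(1.0, 0.0, K + 1)
            cloud.append((np.asarray(m, dtype=float),
                          lam * np.asarray(S, dtype=float)))
    return cloud
```

Dispersed `prior_means`/`prior_covs` then cover both the priors used in Algorithm 4 (point b), the truths (point a) and states in between (point c).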
This motivates two modifications of Algorithm 4. The first, which we call 'on-the-fly' (OTF) in this paper, replaces the calls to Lambda in lines 1 and 6 by calls to OTF with the continuation value set to −∞. We also increase the grid size in the first execution to obtain a finer approximation, in line with the discussion in Section 3.3. The second modification, whose performance is not reported in this paper, is motivated by the accuracy improvement obtained when using Lambda instead of the precomputed value function. In the same manner, the calls to Lambda can be replaced by calls to OTF with depth at least 2, bringing further improvements but at the cost of a significant increase in computational complexity.
The implementation of OTF is computationally intensive, as it uses multiple nested Monte Carlo integrations and optimisations, with computational complexity exponential in the depth. Its advantage is that it requires neither a good specification of the cloud nor precomputed value functions.

Validation
In this section we demonstrate applications of AutoBCT to a synthetic example (to visualise its actions and performance) and to a number of real datasets and models. Detailed information about the settings for each example is collected in Appendix D. All results were obtained using a desktop computer with an Intel(R) Core(TM) i7-6700K CPU (4.4GHz), 32GB RAM and a GeForce RTX 3090 GPU (24GB). In the synthetic example, Section 4.1, we artificially slowed down computations in order to be able to observe differences in running times between hyperparameter choices.
For the computation of the 1D and 2D value function maps used in the VI/Map algorithm, we employ a cloud following the general guidelines described in Subsection 3.4. In addition, we enrich the cloud by first generating a small set of 100 realistic unimodal shapes for the score and cost functions, which we propagate via the Kalman filter using a predefined control grid up to a certain depth. In the 1D case we take depth 3 and the cloud contains 156,000 states, of which 1000 are truths. In the 2D case, the value function is built using a cloud extended by the above procedure with depth 10; it contains 761,600 states, of which 1000 are truths.
In the VI/Map Algorithm 3, the regression in line 6 is performed using Random Forest Regression (RFR) with 100 trees. The observation error functions are taken as constant (independent of the hyperparameters).

A synthetic example
We illustrate the steps of the AutoBCT algorithm on a synthetic prediction problem. To make the results easier to interpret visually, we consider a one-dimensional hyperparameter selection problem in which we control the number of trees in a random forest (RF) model and take scaled accuracy as the score. The observed accuracy ĥ is mapped to h by an affine mapping that sends 50% model accuracy to score 0 and 100% to 1, while the real running time ĉ is mapped to c by an affine mapping that sends 0 seconds to 0 and 0.6 seconds to 1: h = (ĥ − 0.5)/0.5 and c = ĉ/0.6. The task at hand is to predict the outputs of a function defined via the floor function ⌊·⌋ (Figure 1). We initialise our synthetic experiment with a prior on the score that is relatively flat but with a large enough spread to cover well the range (0, 1) (see the left panel of Figure 2). For the cost function we choose a pessimistic prior (see the right panel of Figure 2), which reflects our conservative belief that the cost is exponentially increasing in the number of trees (in practice, it is usually linear). We fix one model parameter at 0.16 (the value used throughout all examples, see Appendix D) and the error adjustment at 0, and set the observation errors to 0.05 and 0.1 (the standard deviation of the observation error for the score and the cost is assumed to be 5% and 10%, respectively, independently of the choice of the hyperparameter). We relate the control u ∈ [0, 1] to the number of trees via the mapping trees(u) = 1 + ⌊99u⌋. Thus, the smallest considered random forest has trees(0) = 1 tree and the largest has trees(1) = 100 trees.

Table 1 (AutoBCT, VI/Map, N = 2) reports the results of one run for the synthetic example. The first column displays the iteration of the algorithm and the second the realised score, the third the realised cost, followed by the total cost, the control applied in a given iteration and the expected posterior score corresponding to that control. The final column shows the value function map evaluated at the posterior distribution after each iteration.
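The mappings of the synthetic example can be written out directly; this minimal sketch follows the affine score/cost maps and the control-to-trees rule exactly as stated above.

```python
import math

def n_trees(u):
    """Control u in [0, 1] -> number of trees: n_trees(0) = 1, n_trees(1) = 100."""
    return 1 + math.floor(99 * u)

def score(raw_accuracy):
    """Affine score map: 50% accuracy -> 0, 100% accuracy -> 1."""
    return (raw_accuracy - 0.5) / 0.5

def cost(raw_seconds):
    """Affine cost map: 0 s -> 0, 0.6 s -> 1."""
    return raw_seconds / 0.6
```

For instance, the final control 0.74 of the run in Table 1 corresponds to `n_trees(0.74) = 74` trees, and 99% accuracy corresponds to `score(0.99) = 0.98`.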

VI/Map results
In this example, we use the 1D map described at the beginning of Section 4. We visualise results for the map produced with depth N = 2 (recall that N denotes the depth of recursion in the computation of the value function) and zero error adjustment. We also include a summary of results for the N = 4 map (Table 2) and show that they are not significantly different from the N = 2 case.
One run of the VI/Map (N = 2) algorithm, together with the evolution of posterior distributions, is displayed in Figure 3, with an accompanying summary in Table 1 showing the changes of important quantities across iterations. Notice that AutoBCT stopped at iteration 3 with the control 0.74 (corresponding to 74 trees) and with the expected posterior score and the realised score both ≈ 0.98 (accuracy 99%). Recall that the observed accuracy is burdened with a random error due to the randomness involved in training the random forest and in validating the trained model. The closeness of the expected posterior score and the realised score at the time of stopping is not a rule and should not be expected in general (see further examples in this section).
We note that VI/Map with N = 4 produces a very similar run (Table 2), so the additional expense of computing the depth-4 map is not necessary for this example.
Non-zero damping value. We kept the artificial damping parameter, which accounts for errors in the value function map, equal to zero, as the quality of the map was very good. It can also be observed in Tables 1 and 2 that any reasonable value of this parameter would not change the course of training.

On-the-fly results
For the OTF (N = 2) version of the algorithm we provide a summary of one run in Table 3. AutoBCT stopped at iteration 3 with the control 0.76 (corresponding to 76 trees), an expected posterior score of 0.99 (accuracy 99.5%) and a realised score of 0.98 (accuracy 99%). We observe that the VI/Map results are very similar to the OTF results. In particular, both methods completed training in 3 iterations and the difference between the final controls is very small. In the next section, we proceed to deploy the AutoBCT algorithm on real data, using convolutional neural networks.

Convolutional neural network
In this section, we focus on the computationally intensive problem of classifying breast cancer specimen images for Invasive Ductal Carcinoma (IDC). IDC is the most common subtype of all breast cancers. The data [27,11] contain 277,524 image patches of size 50 × 50, of which 198,738 are IDC negative and 78,786 IDC positive.
We build a classification model using a convolutional neural network (CNN). We optimise the following hyperparameters: the batch size and the learning rate. The CNN is run twice (2 epochs) over a given set of training data and is evaluated on holdout test data. Because we run the CNN for exactly 2 epochs (and not until a certain accuracy threshold is reached), the choice of the learning rate has no effect on the running time. This demonstrates that AutoBCT is also able to deal with cases where the cost does not depend on some of the optimised hyperparameters. The architecture of the CNN is shown in Figure 4 and described in Appendix E.1.
A naive tuning of the CNN over a two-dimensional predefined grid of hyperparameters would be computationally expensive. In Sections 4.2.3 and 4.2.4 we present results of 2D AutoBCT (VI/Map) and AutoBCT (OTF), respectively, where nearly optimal controls are obtained in relatively few iterations. However, we start by discussing 1D VI/Map results.

VI/Map CNN batch
Due to limited computational power, we expedite the CNN computations by working on a subset of the data. The training set consists of 5000 images with a 50/50 split between the two categories; the testing set is also balanced and contains 1000 images. We fix the learning rate at 0.0007.
We use the N = 2 map, setting the score and cost observation errors to 0.15 and 0.1, respectively, and keeping the value 0.16 for the remaining parameter as before. For this example, as well as for the others involving a neural network, we choose the relatively high score observation error of 0.15 to account for the larger randomness in training for some choices of hyperparameters. For the sake of illustration, we allow large batch sizes (relative to the size of the data set), which results in unstable model training. To mitigate the resulting significant variability in observed accuracy, and to enable interesting observations of the algorithm's performance, for each control the neural network is trained 5 times and the median accuracy is output, together with the total accumulated cost of training and testing the 5 neural networks. The control in [0, 1] is mapped to the batch size via (4.21) with bounds 10 and 200. Results for this problem, dubbed 1D CNN batch, can be inspected in Table 4. We also visualise the final posterior distributions of the score and the cost in Figure 5. The algorithm returned the control corresponding to a batch size of 89, with a realised posterior score of 0.81 (corresponding to accuracy 73%) and an expected posterior score of 0.83 (accuracy 74%).
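A sketch of the batch-size mapping (4.21): the bounds batch_min = 10 and batch_max = 200 are those stated for this problem, but the affine form with rounding down is an assumption of this sketch, as the exact rounding used in (4.21) is not reproduced here.

```python
import math

BATCH_MIN, BATCH_MAX = 10, 200  # bounds stated for mapping (4.21)

def batch_size(u):
    """Assumed form of (4.21): affine in the control u with rounding down,
    so that batch_size(0) = BATCH_MIN and batch_size(1) = BATCH_MAX."""
    return BATCH_MIN + math.floor((BATCH_MAX - BATCH_MIN) * u)
```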

VI/Map CNN r
Here we explore another 1D case, using the learning rate r as the control. We fix the batch size at 66. The learning rate does not influence the computational cost of a neural net trained over a fixed number of epochs. However, with batch size 66, each run of the neural net is relatively expensive in the context of these examples (≈ 2.7 minutes), so our goal is still to obtain an optimal control in as few iterations as possible to save computational resources. We keep the same mappings for h and c as above, and the same parameter values 0.15, 0.1 and 0.16. Using common knowledge about the effect of the learning rate on accuracy and running time, we choose the prior for the score so that its mean is unimodal with a maximum in the interior of (0, 1), and a flat prior for the cost (Figure 6a). The mapping of the control into the learning rate is linear on the log(r) scale between r_min = 10⁻⁵ and r_max = 0.1. In the run displayed in Table 5, the final control corresponds to r = 0.0005, with an observed score of 0.794 (corresponding to accuracy 72%) and an expected posterior score of 0.445 (accuracy 61%). The final posterior distributions, with observed scores and costs indicated by crosses, are shown in Figure 6b. The final output of the algorithm is good even though the posterior distribution in the left panel fits the observed data poorly. The shape of the score curve is so peaked that it cannot be represented accurately by a degree-three polynomial (our choice of basis functions). However, AutoBCT stops when the posterior mean score (not the observed score) exceeds the value function, so it is sufficient that the posterior distribution indicates the position of the hyperparameters for which the true maximum score is attained.
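Mapping (4.23), linear in the control on the log(r) scale, can be written as follows; with the stated endpoints it reproduces the values quoted in the text (control 0.5 gives r = 0.001, and control 0.43 gives r ≈ 0.0005).

```python
R_MIN, R_MAX = 1e-5, 0.1

def learning_rate(u):
    """Mapping (4.23): linear in u on the log(r) scale, so that
    learning_rate(0) = R_MIN and learning_rate(1) = R_MAX."""
    return R_MIN * (R_MAX / R_MIN) ** u
```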

2D AutoBCT (VI/Map) with CNN
We initialise our 2D CNN experiment with a unimodal prior on the score (left panel of Figure 7a). For the cost function we choose a pessimistic prior (left panel of Figure 7b), which reflects a conservative belief that the cost is exponentially decreasing in the batch size and indifferent to the choice of the learning rate. As before, for each pair of controls the neural network is trained 5 times to mitigate instabilities with large batches. We map the median raw accuracy ĥ and the raw time ĉ (in minutes) via (4.22). We map the first control into the learning rate via (4.23) with r_min = 10⁻⁵ and r_max = 0.1. The second control is mapped into the batch size via (4.21) with batch_min = 10 and batch_max = 200. These mappings are identical to those in the 1D cases studied above.
Results for an example run of the 2D AutoBCT (VI/Map) can be inspected in Table 6. The final control (0.5, 0.15) corresponds to a learning rate of 0.001 and a batch size of 38, with an expected posterior accuracy of 74% and a realised accuracy of 78% (compared with the accuracy of 84% obtained in [11], which used the full dataset in computations). The surfaces showing the means of the posterior distributions of the score and the cost are shown in the right panels of Figure 7.
Table 7 presents a summary of the average performance of 40 runs on this 2D neural network problem. During the 40 VI/Map runs, we recorded a small standard deviation of 0.06 for the final control corresponding to the learning rate r, suggesting that the quality of the model is sensitive to this choice. On the other hand, the standard deviation for the final control corresponding to the batch size is much bigger, suggesting weaker sensitivity to this parameter. It should, however, be remarked that the training of neural networks tends to be less stable when the batch size is large relative to the total sample size, so good models could have been obtained 'by chance'.

2D AutoBCT (OTF) with CNN
The OTF version of AutoBCT on the 2D CNN problem is significantly affected by the relatively high noise present in observations of the score, and we observed its tendency to stop more unpredictably. We present summaries of three OTF runs (Table 8). Care should be taken when comparing these results to those obtained by AutoBCT (VI/Map) because, due to computational demands, only the N = 2 version of OTF is run, while for the VI/Map version we use an N = 3 map with one OTF step at the end (as always for VI/Map runs, see Algorithm 2), essentially making it N = 4 for the purpose of comparison with OTF. It can therefore be concluded that depth N = 2 is insufficient to obtain reliable results in the 2D case and one needs to use depth N = 4. Recall that the N = 2 version of OTF performed well in the synthetic 1D example.

AutoBCT validation on popular datasets
We provide summary statistics for the evaluation of AutoBCT's performance on 3 popular datasets, employing 3 different machine learning models. In the tables, we provide in brackets: the accuracy for the final score column, the total running time for the total cost column, and the corresponding value of the hyperparameter for the control column. Recall that details of parameter mappings and implementation are collected in Appendix D.

Table 7: Summary of (VI/Map, N=3, ε = 0.02) AutoBCT runs for the 2D CNN problem.

HIGGS data set [3]. This is a classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not. The data set contains 11 million rows. Training over the whole dataset is time consuming, so we aim to find an optimal sample size to be used for training and testing with a Random Forest (ranger), given the trade-off between the score and the computational cost. Results are collected in Table 9.

HIGGS Data Set
to construct a score (instead of the accuracy). The cross-entropy loss allows us to monitor overfitting of the CNN, while the accuracy itself does not provide this information. In this example we choose the number of epochs of the CNN as the control. We obtain results for two architectures. In one architecture (Resnet50) we use the weights of the widely popular ResNet50 model, while in the other (Plain) we use an architecture that does not rely on pretrained models. We note that our Resnet50 model produces a result that is 3% short of the challenge winner, who obtained a model with 96% accuracy via a different architecture and ensemble averaging. We note that a more efficient optimisation of a parameter such as the number of epochs is available. Having a fixed holdout (test) dataset, we are able to evaluate the performance of the model on the test set after each training epoch without introducing any data leakage into the model. Therefore, if the cost of evaluating the model on the holdout set is relatively small compared to the overall training cost per epoch, observations of the score and the cost are available at each epoch. The optimiser's task would then simplify to optimally stopping the training when the optimal trade-off between the score and the cost is achieved. However, the flow of information would be different than in our model and a new modelling approach would be needed. Furthermore, in a Bayesian setting like ours, it is required that the error of each observation of the score and the cost be independent of the previous ones; this is clearly violated in an optimisation task with incremental improvements of the model. Nevertheless, this is an interesting direction of research which we leave for the future. Note also the similarity to the learning curve methods used by, e.g., [8].
Credit card fraud data (https://www.kaggle.com/mlg-ulb/creditcardfraud). This dataset contains credit card transactions from September 2013. It is highly unbalanced, with only 492 frauds out of 284,807 transactions. We build an autoencoder (AE) [31] which is trained on non-fraudulent cases and attempt to capture fraudulent cases by observing the reconstruction error.
We parametrise the network with two parameters: code, which specifies the size of the smallest (coding) layer, and scale, which determines the number of layers in the encoder (the decoder is symmetric). The largest layer of the encoder has size ⌈20 × scale⌉ and there are ⌈scale⌉ layers in the encoder, spaced equally between the largest layer and the coding layer (⌈·⌉ denotes the ceiling function, i.e., the smallest integer dominating its argument). So an AE described by code equal to 1 and scale equal to 1.9 has hidden layers of the following sizes: 38, 19, 1, 19, 38. We consider code and scale between 1 and 10.
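The layer-size rule above can be sketched as follows; the rounding of the intermediate sizes (floor) is an assumption of this sketch, chosen to reproduce the 38, 19, 1, 19, 38 example.

```python
import math

def encoder_layer_sizes(code, scale):
    """Encoder layer sizes for the AE parametrisation above: the largest
    layer has ceil(20 * scale) units and there are ceil(scale) encoder
    layers spaced equally down to the coding layer of size `code`."""
    n_layers = math.ceil(scale)
    largest = math.ceil(20 * scale)
    step = (largest - code) / n_layers  # equal spacing down to the code
    sizes = [math.floor(largest - i * step) for i in range(n_layers)]
    return sizes + [code]

def ae_layer_sizes(code, scale):
    """Full symmetric autoencoder: encoder sizes followed by the mirrored decoder."""
    enc = encoder_layer_sizes(code, scale)
    return enc + enc[-2::-1]
```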
For 1D example, the optimisation parameter is the scale with code = 2.For 2D example, we optimise scale and code.
Given that fraudulent transactions constitute only 0.172% of all transactions, we do not focus on the pure accuracy of the model, but on a good balance between precision and recall, weighted together by the widely used generalised F-score [13]. We pick β = 6, as we are interested in capturing most of the fraudulent transactions (we observe that the recall is then at least 80%). In practice, β is chosen somewhat arbitrarily; however, it can also be used as an optimisation parameter [29].
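The generalised F-score [13] is the weighted harmonic mean of precision and recall; a minimal implementation:

```python
def f_beta(precision, recall, beta=6.0):
    """Generalised F-score: beta > 1 weights recall more heavily than
    precision (the text picks beta = 6 to favour catching frauds)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this reduces to the ordinary F1 score; for large beta the score is dominated by the recall.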
Results for the 1D example are displayed in Table 11. We observe that in this particular example the optimal control of 0.01 corresponds to a shallow AE architecture, suggesting that good performance can be achieved with a small AE.
The results for the 2D version of this example are shown in Table 12. AutoBCT decides to stop at the most shallow architecture as well, surprisingly revealing that even choosing a code of size 1 leads to similarly good results.

Conclusions. Our validation demonstrated that AutoBCT performs well in hyperparameter selection problems for diverse datasets, machine learning techniques and objectives. The VI/Map version of the algorithm is preferred to OTF. The value function map can be precomputed in advance and used for a range of problems.

Appendices

A Dynamics of the observable Markov process
We derive the joint dynamics of the filters and of the observable score and cost processes. The underlying random elements are independent, and the observations are independent given the control; therefore, the dynamics of the two filters are independent given the control.
Since the filter process takes values in the space of Gaussian distributions, its dynamics can be fully described by the dynamics of the corresponding stochastic processes of means and covariances. This joint dynamics is Markovian, time-homogeneous and depends on the control; the transition on Borel measurable sets of means and covariances is given by the Kalman filter update. Under Assumptions (A) and (S), the stated estimates hold with positive constants.
Proof. We use the representation (2.12) of the functional. For any epoch we have, suppressing in the notation the conditioning on the initial state, a bound in which the estimate of the first term was derived in the proof of Lemma B.2 and the estimate of the second term follows analogously to Lemma B.2. Inserting the above bound and the bound for Υ into (2.12) completes the proof.

C Proofs
Proof of Theorem 2.2. Conditioning each term of the sum on ℱ and using monotonicity, we obtain (omitting the parameters of functions for clarity of presentation) the claimed identity, where the last equality follows from pointwise convergence and the monotone convergence theorem, whose assumptions are satisfied.

D AutoBCT settings for validation examples
Throughout all examples we set the parameter 0.16. We also note that we use the smooth.spline function in R, which fits B-splines according to [9].

D.4 2D example with Breast Cancer data
Basis functions: {( 1 − 0. …}. We use the Tps function in R from the package fields, which fits thin-plate splines; see [19] for a more detailed analysis of nonparametric regression.

Figure 1 :
Figure 1: Function to be predicted in the synthetic example of Section 4.1.

Figure 2 :
Figure 2: Score and cost priors for the synthetic example.

Figure 4 :
Figure 4: CNN architecture, where the input is a 50 × 50 tissue image with 3 colour channels and a scaling parameter equal to 1 affects the number of filters in the convolution layers and the sizes of the flattened, densely connected arrays. Detailed information about the architecture of the CNN can be found in Appendix E.1.
(a) Prior distributions for and .(b) Final posterior distribution of and .

Figure 6 :
Figure 6: Prior and posterior distributions for 1D CNN r example.
(a) Surfaces corresponding to prior (left) and posterior (right) means of the score.(b) Surfaces corresponding to prior (left) and posterior (right) means of the cost.

Figure 7 :
Figure 7: Prior and posterior distributions for a 2D CNN run from Table 6.
It is followed by 8 examples encompassing a variety of datasets and modelling objectives. The appendices contain further details. Appendix A derives the dynamics of the Markov decision process of Section 2.3. Appendix B contains auxiliary estimates used in the proofs of the main results collected in Appendix C. Detailed settings for all studied models are in Appendix D, while the architectures of the neural networks used in the examples are provided in Appendix E.

Table 3 :
AutoBCT (OTF, N=2) Results of one run for the synthetic example.

Table 6 :
An example run of AutoBCT on CNN example with the 2D control space.1, −1 corresponds to the learning rate and 2, −1 corresponds to the batch size batch.

Table 8 :
Final results of 3 OTF N=2 runs with 2D CNN problem.

Table 9 .
The final accuracy of 72% puts the model within the TOP 10 performing models according to the statistics for the Higgs data set at openml.org/t/146606.

Table 9 :
Random forest on the 'Higgs' data set using sample size as a control.

Intel image data. This is image data of natural scenes around the world obtained from Kaggle (https://www.kaggle.com/puneet6060/intel-image-classification). The data was part of a challenge created by https://datahack.analyticsvidhya.com and was sourced by Intel. It contains around 25K images of size 150 × 150 distributed under 6 categories, and the task is to create a classification model. For this example we use the categorical cross-entropy loss

Table 10 :
CNN on the Intel data set using epochs as a control, with two architectures. In brackets in the first column we show the test set accuracy, while the score corresponds to the cross-entropy loss. In the Plain architecture we use a network that does not rely on pretrained models; details of the Plain and Resnet50 architectures are provided in Appendix E.2. Results are displayed in Table 10.

Table 11 :
1D (VI/Map, N=3, ε = 0.02) AutoBCT with AE on the 'Credit Card Fraud' dataset, using scale as a control.

Table 12 :
AE on the 'Credit Card Fraud' dataset using scale and code as controls.