OPTCON3: An Active Learning Control Algorithm for Nonlinear Quadratic Stochastic Problems

In this paper, we describe the new OPTCON3 algorithm, which serves to determine approximately optimal policies for stochastic control problems with a quadratic objective function and nonlinear dynamic models. It includes active learning and the dual effect of optimizing policies, whereby optimal policies are used to learn about the stochastics of the dynamic system in addition to their immediate effect on the performance of the system. The OPTCON3 algorithm approximates the nonlinear model with a time-varying linear model and applies a procedure similar to that of Kendrick to the series of linearized models to calculate approximately optimal policies. The results for two simple economic models serve to test the OPTCON3 algorithm and compare it to previous solutions of the stochastic control problem. Initial evaluations show that the OPTCON3 approach may be promising to enhance our understanding of the adaptive economic policy problem under uncertainty.

relations between different variables which are reflected in the probability distributions of the parameters of the econometric (or calibrated) model of the economy. Stochastic optimal control theory is therefore an appropriate framework to deal with such policy problems when the policy maker's aim is to obtain the best policy according to his/her preferences.
Unfortunately, stochastic optimal control theory has not succeeded in deriving precise solutions for even very simple analytical problems and even less so for the complex problems involving large models which are characterized by nonlinearities and various sources of uncertainty. One of the reasons for this is the so-called dual effect of controls in a stochastic dynamic system: controls do not only serve to optimize the instantaneous objective in each period but may also be used to learn about the reactions of the economy to policy measures, which in turn can contribute to improved policies in later periods. This interdependence between considerations of direct optimization and experimentation to learn about policy effects makes the stochastic optimal control problem intractable, as has been recognized by several authors in the past (Fel'dbaum 1965; Aoki 1989). One is therefore restricted to numerical investigations determining approximations to the unobtainable truly optimal policies. So far, the most ambitious work on optimal stochastic control for economic policy problems has been done by Kendrick (1981), who developed several algorithms, including one for active learning, based on Bar-Shalom and Tse (1976), in which the dual effect of controls is explicitly taken into account. Further work with these algorithms revealed interesting problems, such as the occurrence of nonconvexities in linear-quadratic stochastic control problems under active learning (Mizrach 1991; Amman and Kendrick 1995; Tucci 1998; Amman et al. 2018). So far, these algorithms have been confined to linear dynamic models, which is a severe restriction as even the simplest econometric models contain some nonlinearities. In this paper, we extend the Kendrick algorithm with active learning to a class of nonlinear models which can be approximated by time-varying linear models.
We first review previous research with the OPTCON algorithms (versions OPTCON1 and OPTCON2) and then present the new OPTCON3 algorithm which includes active learning. Initial evaluations show that this approach may be promising to enhance our understanding of the adaptive economic policy problem under uncertainty.

The Problem
The OPTCON algorithms are designed to achieve approximate solutions to optimal control problems with a quadratic objective function (a loss function to be minimized) and a nonlinear multivariate discrete-time dynamic system under additive and parameter uncertainties. The intertemporal objective function is formulated in quadratic tracking form, which is quite often used in applications of optimal control theory to econometric models.
Thus, it is required to find values for the control variables ($u_t$) and the corresponding state variables ($x_t$) which minimize the function

$$J = E\left[\sum_{t=1}^{T} L_t(x_t, u_t)\right] \quad (1)$$

with

$$L_t(x_t, u_t) = \frac{1}{2} \begin{pmatrix} x_t - \tilde{x}_t \\ u_t - \tilde{u}_t \end{pmatrix}' W_t \begin{pmatrix} x_t - \tilde{x}_t \\ u_t - \tilde{u}_t \end{pmatrix} \quad (2)$$

and satisfy conditions in the form of a dynamic system of nonlinear difference equations:

$$x_t = f(x_{t-1}, x_t, u_t, \theta, z_t) + \varepsilon_t, \quad t = 1, \ldots, T. \quad (3)$$

$x_t$ is an n-dimensional vector of state variables that describes the state of the economic system at any point in time t. $u_t$ is an m-dimensional vector of control variables. $\tilde{x}_t \in R^n$ and $\tilde{u}_t \in R^m$ are given 'ideal' (desired, target) levels of the state and control variables respectively. T denotes the terminal time period of the finite planning horizon. $W_t$ is an $((n + m) \times (n + m))$ matrix specifying the relative weights of the state and control variables in the objective function. Quite often, $W_t$ is a matrix including a discount factor $\alpha$ with $W_t = \alpha^{t-1} W$. $W_t$ (or $W$) is symmetric.
Moreover, $\theta$ is a p-dimensional vector of parameters whose values are assumed to be constant but unknown to the decision maker (parameter uncertainty), $z_t$ denotes an l-dimensional vector of non-controlled exogenous variables, and $\varepsilon_t$ is an n-dimensional vector of additive disturbances (system error). $\theta$ and $\varepsilon_t$ are assumed to be independent random vectors with expectations $\hat{\theta}$ and $0_n$ respectively and covariance matrices $\Sigma^{\theta\theta}$ and $\Sigma^{\varepsilon\varepsilon}$ respectively. $f$ is a vector-valued function and $f^i(\ldots)$ is the i-th component of $f(\ldots)$, $i = 1, \ldots, n$.
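To make the quadratic tracking form concrete, the following minimal Python sketch evaluates the loss for one given realization of the state and control paths (not the expectation itself). The function name `tracking_objective` and the array layout are assumptions made for illustration and are not part of the OPTCON code.

```python
import numpy as np

def tracking_objective(xs, us, x_tgt, u_tgt, W, alpha=1.0):
    """Sum over t of 0.5 * d_t' W_t d_t with W_t = alpha**(t-1) * W.

    xs, x_tgt: arrays of shape (T, n); us, u_tgt: arrays of shape (T, m);
    W: symmetric ((n+m) x (n+m)) weight matrix (illustrative layout).
    """
    d = np.hstack([xs - x_tgt, us - u_tgt])        # stacked deviations, shape (T, n+m)
    T = d.shape[0]
    discounts = alpha ** np.arange(T)              # alpha^(t-1) for t = 1, ..., T
    per_period = 0.5 * np.einsum("ti,ij,tj->t", d, W, d)
    return float(discounts @ per_period)

# Two periods, one state and one control, zero targets, unit weights.
xs = np.array([[1.0], [2.0]])
us = np.array([[0.0], [1.0]])
zeros = np.zeros((2, 1))
print(tracking_objective(xs, us, zeros, zeros, np.eye(2)))             # 3.0
print(tracking_objective(xs, us, zeros, zeros, np.eye(2), alpha=0.5))  # 1.75
```

With unit weights the two periods contribute 0.5 and 2.5, so the undiscounted loss is 3.0; with a discount factor of 0.5 the second period is halved, giving 1.75.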

Versions 1 and 2 of the OPTCON Algorithm
This section gives a brief description of the two previous versions of the OPTCON algorithm, with the open-loop and then with the passive learning strategy. The first version of OPTCON, OPTCON1, delivers an open-loop (OL) solution and is described in detail in Matulka and Neck (1992). The open-loop strategy either ignores the stochastics of the system altogether or assumes the stochastics (expectation and covariance matrices of additive and multiplicative disturbances) to be given for all time periods at the beginning of the planning horizon. The problem with the nonlinear system is tackled iteratively, starting with a tentative path of the control and state variables. The tentative path of the control variables is given for the first iteration. In order to find the corresponding tentative path for the state variables, the nonlinear system is solved numerically using the Levenberg-Marquardt method or trust region methods. Then, the iterative approximation of the optimal solution starts. The solution is iterated from one time path to the next until the algorithm converges or the maximum number of iterations is reached. During the optimization process the system is linearized around the previous iteration's result as a tentative path and the problem is solved for the resulting time-varying linearized system. The optimal solution of the problem for the linearized system is found under the above-mentioned simplifying assumptions about the information pattern; this solution is then used as the tentative path for the next iteration, starting off the procedure all over again. In every iteration, i.e. for every solution of the problem for the linearized system, the objective function is minimized using Bellman's principle of optimality to obtain the parameters of the feedback control rule.
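The iteration described above can be illustrated with a deliberately simplified Python sketch. It is not the OPTCON1 implementation and rests on several assumptions: the system is deterministic and written in explicit form `x_t = f(x_{t-1}, u_t)` (the paper's system also contains `x_t` on the right-hand side), the weight matrix is block-diagonal with blocks `Q` and `R`, and the derivatives are taken by central differences. All function names are hypothetical.

```python
import numpy as np

def jacobians(f, x_prev, u, h=1e-6):
    """Central-difference Jacobians of f(x_prev, u) w.r.t. x_prev and u."""
    n, m = x_prev.size, u.size
    A, B = np.zeros((n, n)), np.zeros((n, m))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        A[:, i] = (f(x_prev + e, u) - f(x_prev - e, u)) / (2 * h)
    for i in range(m):
        e = np.zeros(m); e[i] = h
        B[:, i] = (f(x_prev, u + e) - f(x_prev, u - e)) / (2 * h)
    return A, B

def simulate(f, x0, us):
    """State path implied by a control path."""
    xs, x = [], x0
    for u in us:
        x = f(x, u)
        xs.append(x)
    return np.array(xs)

def open_loop(f, x0, us, x_tgt, u_tgt, Q, R, tol=1e-8, max_iter=100):
    """Iterate: linearize around the tentative path, solve the LQ problem, update."""
    T, n = len(us), x0.size
    us = np.array(us, dtype=float)
    xs = simulate(f, x0, us)                       # tentative state path
    for _ in range(max_iter):                      # nonlinearity loop
        H, hv = np.zeros((n, n)), np.zeros(n)      # cost-to-go: 0.5 x'Hx + hv'x
        G, g = [None] * T, [None] * T
        for t in reversed(range(T)):               # backward recursion (Bellman)
            x_prev = x0 if t == 0 else xs[t - 1]
            A, B = jacobians(f, x_prev, us[t])
            c = xs[t] - A @ x_prev - B @ us[t]     # intercept of the linearized system
            K, k = Q + H, hv - Q @ x_tgt[t]
            Huu = R + B.T @ K @ B
            G[t] = -np.linalg.solve(Huu, B.T @ K @ A)
            g[t] = -np.linalg.solve(Huu, B.T @ (K @ c + k) - R @ u_tgt[t])
            A_cl, c_cl = A + B @ G[t], B @ g[t] + c
            H = A_cl.T @ K @ A_cl + G[t].T @ R @ G[t]
            hv = A_cl.T @ (K @ c_cl + k) + G[t].T @ R @ (g[t] - u_tgt[t])
        new_us, x = np.empty_like(us), x0
        for t in range(T):                         # feedback rule on the nonlinear system
            new_us[t] = G[t] @ x + g[t]
            x = f(x, new_us[t])
        new_xs = simulate(f, x0, new_us)
        change = max(np.abs(new_us - us).max(), np.abs(new_xs - xs).max())
        xs, us = new_xs, new_us
        if change < tol:                           # stop criterion
            break
    return xs, us
```

For a linear `f` the linearization is exact, so the first pass already yields the optimum and the second pass only confirms convergence; for a nonlinear `f` the loop repeats until successive paths differ by less than `tol` or `max_iter` is reached.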
Finally, the value of the objective function is calculated for the obtained solution. Figure 1 summarizes the OPTCON1 algorithm. The second version of the algorithm, called OPTCON2 and described in Blueschke-Nikolaeva et al. (2012), includes the passive learning strategy (also named open-loop feedback (OLF)), which uses the idea of re-estimation of the model at the end of each time period. For this re-estimation the model builder (and hence the control agent) observes what has happened and uses the current values of the state variables, that is, the new information, to improve his/her knowledge of the system.
The stochastics in the problem is again represented by two kinds of errors, namely additive (random system errors) and multiplicative ('structural' errors in parameters). It is assumed that 'true' parameters $\hat{\theta}$ generate the model. However, the policy maker does not know these true parameters $\hat{\theta}$ and works with the 'wrong' parameters $\theta^m$ resulting from the estimates using the realization of the random variable $\mu^m$: $\theta^m = \hat{\theta} + \mu^m$.
The passive learning strategy has the following structure: a forward loop is started from time 1 to T. In each time period S an (approximately) optimal open-loop solution for the subproblem is determined, i.e. the problem for the time periods from S to T. Then the predicted $x^*_S$ and $u^*_S$ are fixed for the time period S. At the end of each time period the policy maker observes the realized values of the state variables $x^{a*}_S$, which are, however, disturbed by the additive errors. The difference between $x^{a*}_S$ and $x^*_S$ comes from the realization of the random numbers $\varepsilon^m_S$ and $\mu^m$. Next, the new information is used by the policy maker to update and adjust the parameter estimate $\theta^m$. After that, the same procedure is applied to the remaining subproblem from S + 1 to T, and so on. The update of the parameter estimates is conducted via the Kalman filter.
The same update procedure is used in the next version of the OPTCON algorithm (which is called OPTCON3) as well.
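A minimal sketch of such a Kalman-type parameter update is given below in Python. It is an illustration under stated assumptions, not the OPTCON code: the previous state is taken as known, the observed state is linear in the parameters around the nominal path with sensitivity matrix `F_theta` (a hypothetical name for the derivative of the system function with respect to the parameters), and the covariance update has the same structure as the update formula given in the appendix.

```python
import numpy as np

def update_parameters(theta, sigma_theta, F_theta, sigma_eps, x_observed, x_predicted):
    """One Kalman-type update of the parameter estimate and its covariance."""
    sigma_theta_x = sigma_theta @ F_theta.T                    # cross-covariance of theta and x
    sigma_xx = F_theta @ sigma_theta @ F_theta.T + sigma_eps   # covariance of the predicted state
    # gain = sigma_theta_x @ inv(sigma_xx); sigma_xx is symmetric
    gain = np.linalg.solve(sigma_xx, sigma_theta_x.T).T
    theta_new = theta + gain @ (x_observed - x_predicted)      # correct by the prediction error
    sigma_new = sigma_theta - gain @ sigma_theta_x.T           # uncertainty shrinks
    return theta_new, sigma_new

# Scalar toy example with made-up numbers.
theta_new, sigma_new = update_parameters(
    theta=np.array([1.0]), sigma_theta=np.array([[0.5]]),
    F_theta=np.array([[2.0]]), sigma_eps=np.array([[0.2]]),
    x_observed=np.array([3.1]), x_predicted=np.array([2.0]))
print(theta_new, sigma_new)
```

In the toy example the prediction error of 1.1 moves the estimate from 1.0 to 1.5, and the parameter variance falls from 0.5 to 0.5 - 1/2.2, roughly 0.045.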

Description
The new version of the OPTCON algorithm includes an active learning strategy (also called closed-loop, adaptive dual or dual control) and is named OPTCON3. The active learning strategy lets the policy maker face the dual problem of choosing the best strategy and reducing the uncertainty about the system. It is expected that such a strategy can help improve the performance of the control process and give more reliable policy recommendations. The active learning method differs from the passive learning method in the OPTCON2 algorithm in the following way. When using the passive learning method, new observations are obtained each period and are used to update the parameter estimates; however, no effort is made to choose control variables with the aim of improving the learning process about the dynamic system to be controlled. In contrast, in the active learning methods, control variables are chosen with the dual purpose of moving the system in the desired direction and perturbing the system to improve the parameter estimates. Thus, the active learning strategy delivers an optimal solution where the control is chosen with a view to reaching the desired states in the present and reducing uncertainty through learning, permitting an easier attainment of desired states in the future. This lets the policy maker cope with the dual problem of choosing the best strategy and reducing the uncertainty about the system simultaneously. The key idea is to make some use of information about future measurements as well.
The procedure of finding the closed-loop solution in this paper corresponds to Kendrick (1981). The approximate cost-to-go is broken down into three terms:

$$J_d = J_D + J_C + J_P,$$

where $J_d$ is the total cost-to-go with T periods remaining; the deterministic component $J_D$ includes only non-stochastic terms; the cautionary component $J_C$ includes the stochastic component of the system known in the current period; and the probing term $J_P$ contains the effect of dual learning on the future time periods. Each of these components is difficult to compute because of the nonlinearity of the system. The probing term in particular captures the motivation to perturb the controls in the present time period in order to reduce future uncertainty about the parameter values and can therefore be considered the most challenging task. Thus, the terms $J_C$ and $J_P$ constitute a separate optimization problem with a quadratic criterion which is minimized subject to the nonlinear system. The system equations are derived from the expansion of the original system and can be calculated by rewriting the Taylor expansion of the nonlinear system in the perturbation form $\delta x_t$. Instead of the original problem with the system (3), the problem in perturbation form has to be solved: $\min_{\delta u_t} \Delta J^*_t$.

Fig. 2 Flow chart of OPTCON3
After some calculations the solution $\Delta J^*_t$ is presented in quadratic form as a function of $\delta x_{t-1}$. The original $J^*_t$ can be derived from $\Delta J^*_t$ and can be decomposed into three terms. Moreover, all the terms and formulas need to be adjusted to the augmented system.
Next, a schematic structure of the OPTCON3 algorithm is presented. It follows the simplified flow chart presented in Fig. 2 and is used as the basic structure for the implementation.
The optimization is carried out in a forward loop from 1 to T. In each time period S (S = 1, ..., T) the following search procedure is conducted. The subproblem from S to T is solved via the open-loop (OL) strategy (see Fig. 1 in Sect. 3). The OL solution $(x^*_S, u^*_S)$ for the time period S is fixed. After that the core part of the dual control starts. The idea is to actively search for solution paths which best deal with the dual problem of minimizing the current objective function and the future uncertainty in the model. In this paper a grid search method is used. For this purpose we create a grid of possible solutions around the existing path $(x^*_S, u^*_S)$. We denote the grid search procedure as the "π-loop". In each iteration (π = 1, ..., Π) the approximate objective function corresponding to the current search value of the control is evaluated. The evaluation is repeated until the approximately optimal control is found. Inside the search loop (for each π) the following steps are to be performed.
An (approximately) optimal open-loop solution for the subproblem (i.e. the problem for the time periods from S + 1 to T) is determined. Then the OL solution $(x^{*\pi}_{S+1}, u^{*\pi}_{S+1})$ for the time period S + 1 is fixed. Next, after some auxiliary calculations (Riccati matrices) the deterministic, cautionary and probing terms of the cost-to-go are determined. In the process, a new loop is introduced, where the terms $J_D$, $J_C$ and $J_P$ for the time periods j = S + 1, ..., T are calculated step by step from time period S + 1 to T using the updated covariances. Once the π-loop has been completed, the total approximate objective function $J_d = J_D + J_C + J_P$ can be obtained. The evaluation of the function is done at each iteration in the π-loop. When the search is completed, i.e. the approximately optimal path with min $J_d$ is found, the new information is used by the policy maker to update and to adjust the parameter estimate $\theta^m$, whereby the Kalman filter is used. After that, the same procedure is applied for the remaining subproblems from S + 2 to T, and so on. The OPTCON3 algorithm essentially uses the approach introduced by Bar-Shalom and Tse (1976) and Kendrick (1981) but augments it by approximating, in each step, the nonlinear system by a series of linear systems (replacing the nonlinear autonomous system by a linear time-varying one).
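The structure of the π-loop can be sketched in a few lines of Python. This is only a schematic illustration for a scalar control: `cost_terms` is a hypothetical callable that returns the three components of the approximate cost-to-go for a candidate control, whereas in OPTCON3 these components come from the Riccati recursions and the projected covariances.

```python
import numpy as np

def grid_search_control(cost_terms, u_nominal, half_width, n_points):
    """Evaluate J_d = J_D + J_C + J_P on a grid around u_nominal (scalar control).

    cost_terms: hypothetical callable u -> (J_D, J_C, J_P).
    Returns the best grid point, its total cost and the table of all evaluations.
    """
    grid = np.linspace(u_nominal - half_width, u_nominal + half_width, n_points)
    table = np.array([cost_terms(u) for u in grid])   # one row (J_D, J_C, J_P) per grid point
    totals = table.sum(axis=1)                        # J_d for every candidate control
    best = int(np.argmin(totals))
    return grid[best], totals[best], table

# Toy cost components, chosen only to exercise the loop.
toy = lambda u: ((u - 1.0) ** 2, 0.0, 0.0)
u_best, j_best, table = grid_search_control(toy, u_nominal=0.5, half_width=1.0, n_points=21)
```

Evaluating every grid point rather than following a local descent direction keeps the search usable when the approximate cost-to-go has several local minima, which the literature cited in the introduction reports for active learning problems. The price is a number of evaluations that grows quickly with the number of controls.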
The OPTCON3 algorithm (Steps I-IV in the appendix) describes how to obtain an approximately optimal dual control solution of a stochastic problem. In the optimization process one has to observe the current state of the system, which is crucial for the learning procedure. Because it is not possible to observe current and true values for a performance test, one has to resort to Monte Carlo simulations. In this way, some "quasi-real" values can be created and used to compare the performance of an optimization without learning (both open-loop (OL) and certainty equivalence (CE) alternatives), passive learning (OLF) and active learning (AL).
Thus, a large number M (usually between 100 and 1000) of realizations of the random noises $(\varepsilon^m_t)_{t=1}^{T}$ and $\mu^m$, m = 1, ..., M, are generated. It is assumed that there is an unknown 'real' model with the 'true' constant parameter vector $\hat{\theta}$. But the policy maker does not know these 'true' parameters $\hat{\theta}$ and works with the 'wrong' parameters $\theta^m$ resulting from the estimates using the realization of the random variable $\mu^m$: $\theta^m = \hat{\theta} + \mu^m$. For better understanding, a brief scheme is sketched in Algorithm 1.
Algorithm 1 Pseudocode of the overall experiment
    Generate M sets of $(\varepsilon^m_t)_{t=1}^{T}$ and $\mu^m$ (i.e. M sets of $\theta^m = \hat{\theta} + \mu^m$)
    for m = 1 : M do
        calculate stochastic CE, OL, OLF and AL solutions using the OPTCON3 algorithm (and the generated values $\varepsilon^m_t$ and $\theta^m$)
    end for
    compare CE, OL, OLF, AL

Algorithm 1 is used in the next section to test the performance of the new algorithm.
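The experiment in Algorithm 1 could be organized along the following lines in Python. The sketch is hypothetical in several respects: the strategies are passed in as callables that return an objective value for one Monte Carlo draw, the random draws are assumed to be normal (the text only specifies expectations and covariances), and the summary statistics are chosen to mirror those reported in the next section.

```python
import numpy as np

def run_experiment(strategies, theta_hat, sigma_theta, sigma_eps, T, M, seed=0):
    """Draw M sets of noises and evaluate every strategy on the same draws.

    strategies: dict name -> callable(theta_m, eps_m) returning a cost
                (hypothetical interface standing in for CE, OL, OLF, AL).
    """
    rng = np.random.default_rng(seed)
    p, n = theta_hat.size, sigma_eps.shape[0]
    costs = {name: np.empty(M) for name in strategies}
    for m in range(M):
        mu_m = rng.multivariate_normal(np.zeros(p), sigma_theta)         # parameter noise
        eps_m = rng.multivariate_normal(np.zeros(n), sigma_eps, size=T)  # system noise, shape (T, n)
        theta_m = theta_hat + mu_m                                       # 'wrong' parameters
        for name, solve in strategies.items():
            costs[name][m] = solve(theta_m, eps_m)
    return costs

def summarize(costs):
    """Mean, standard deviation and share of runs in which each strategy is cheapest."""
    names = list(costs)
    stacked = np.column_stack([costs[k] for k in names])   # shape (M, number of strategies)
    winners = np.argmin(stacked, axis=1)
    return {k: {"mean": float(stacked[:, i].mean()),
                "std": float(stacked[:, i].std(ddof=1)),
                "share_min": float(np.mean(winners == i))}
            for i, k in enumerate(names)}
```

Evaluating all strategies on identical draws means that differences in the summary statistics reflect the strategies rather than sampling noise between runs.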

Applications
We apply Algorithm 1 to two different models and test the performance of active learning in terms of the objective function value and influence on the control variable. In a simple linear model, the MacRae model, we observe a relatively small effect of active learning. In contrast, in a more sophisticated model ATOPT, using the active learning strategy leads to a more active use of the control variable.
The MacRae model
The MacRae model, as used by MacRae (1972) and Kendrick (1981), is a theoretical model for two periods only. It includes one control variable and one state variable and consists of one equation only, Eq. (4). The model does not have exogenous (non-controlled) variables. One of the parameters is treated as unknown. The objective function penalizes deviations of objective variables from their target values. The target values of the state and control variables ($\tilde{x}_t$ and $\tilde{u}_t$ respectively) are assumed to be zero. The weight matrix W is assumed to be constant over time (no discounting). The weights for the state and the control variables (the values in W) are chosen to be 1, which reflects the same importance for all variables. The optimization horizon consists of 2 periods.
The aim of the application is to determine approximately optimal policies under the assumed objective function and the dynamic system given here by Eq. (4) using the three versions of the OPTCON algorithm, i.e. the three strategies: certainty equivalence (CE), open-loop feedback (OLF) and active learning (AL). Figure 3 summarizes, in the form of a boxplot, the optimization results, i.e. the optimal values of the control variable in different Monte Carlo runs for all three strategies.
In the boxplot in Fig. 3, 50% of all scores lie within the boxes and the median is shown by the line that divides each box into two parts. It shows that the results of the OLF strategy are more concentrated (the OLF box is smaller) than those of CE. The results of the AL strategy are even more concentrated than those of CE and OLF. This is to some extent due to the simplicity of the model and the fact that there is just one optimization period for active learning. We can also observe the following: CE has the minimum cost in approximately 66% of the cases, AL in 22% and OLF in the remaining cases. This may be compared to the results in Kendrick's DUAL code, which are 60%, 25%, and 15%, respectively.
The mean and the standard deviation of the optimal values of the objective function are given by mean($J_{OLF}$) = 20.11, mean($J_{AL}$) = 20.18, mean($J_{CE}$) = 22.18, and std($J_{OLF}$) = 6.35, std($J_{AL}$) = 5.93, std($J_{CE}$) = 11.36, respectively. As with Kendrick's DUAL software, the difference in the average cost of the three procedures is rather small and the AL algorithm gives the minimum standard deviation, whereas the cost of CE has the highest standard deviation, with larger differences than for the mean. In particular, Fig. 3 suggests that the control associated with AL for t = 1 is less active than the others, almost half that of the CE control, and its standard deviation is also much lower than all the others. From Fig. 4, we see that the differences in the controls and standard deviations are less pronounced for the final period. The results for OLF and AL in particular are very similar.
As far as the state is concerned (Figs. 5 and 6), all three procedures show similar results at the end of the time horizon. AL performs slightly better than the others at t = 1. Again, standard deviations are lower when using AL. However, the differences are much smaller than for the control variable. Summarizing, by and large we can confirm the results from Kendrick's DUAL code. Figure 7 shows the values of the three components of the objective function, the deterministic ($J_D$), cautionary ($J_C$) and probing ($J_P$) terms, and the total objective function ($J_d$) [see Eqs. ...] ... Kendrick (1981) and Kendrick (1982). Thus, an application to a more sophisticated, nonlinear model will be carried out in the next step.
The ATOPT model
Next, we apply the algorithm to a nonlinear dynamic model of the Austrian economy (ATOPT model) created by Blueschke et al. (2018), which analyzes the output-public debt trade-off. The model consists of three equations, i.e. three endogenous variables: output growth ($y_t$), public debt ($d_t$), and the interest rate ($r_t$), given by Eqs. (5)-(7).
Austria is a small open economy; thus its economic performance depends to a large extent on the economic situation in the world. The correlation coefficient ($a_1$) between Austria's and the world's GDP growth (between 1996 and 2017) is 0.7266. The fiscal policy instrument ($g_t$) is the primary fiscal surplus (or deficit if negative). For the Austrian economy, an acceptable debt level is assumed to be given by the Maastricht criterion of 60% of GDP. As a threshold for normal output growth, a value slightly above the historical average (1996-2017) of 1.84 percent is assumed here, namely 2 percent annual growth ($\bar{y} = 0.02$). The fiscal multiplier parameter ($\theta_1$) is one of the two stochastic parameters in the model and is assumed to be 1.2 with variance $\Sigma_{\theta_1} = 0.5$. The second stochastic parameter ($\theta_2$) is the link between output growth and the interest rate and is equal to 0.1 with variance $\Sigma_{\theta_2} = 0.1$.
Equations (5)-(7) give a very simplified description of the Austrian economy with an output growth-public debt trade-off. Using its instrument, namely fiscal policy g, the government aims at maintaining a high GDP growth of 3% ($\tilde{y} = 0.03$) and a steady decrease in public debt from 78.4% of GDP in 2017 to 60% of GDP at the end of the planning horizon, namely in 2022 (T = 5). At the same time, the government prefers to have a balanced budget (g = 0). The former two targets are represented by state variables in the ATOPT model, while the latter objective variable is the control variable.
Thus, the task is to find an optimal path for the control variable, in order to minimize the sum of the squared differences between the outcome of the system and the given targets. The optimal control problem is solved again using the CE, OLF and AL strategies. Fig. 8 illustrates the values of the first control variable in the first time period and the box plot in Fig. 9 shows the same control variable in period 5. We see that in period 1 the AL strategy allows for a lot of variation or probing. This is in line with the idea of active learning. As a result of this probing the AL strategy delivers much better results in time period 5 than the other two strategies. The AL results of different MC runs are not spread out as much as those of CE and OLF in the last period. This qualitative behaviour of the active-learning control (relatively strong variations at the beginning to elicit reactions from the system from which to learn and to get closer to the "true" system parameters at the end) has also been observed by Kendrick (1982) in a linear model.
In Fig. 10 we can see that in contrast to the MacRae problem (Fig. 7), the cautionary component is much bigger and the deterministic term is smaller. Thus the largest part of the total objective function is due to the cautionary term. The explanation is that here we have a more complex optimization problem (compared to the MacRae problem) and, in particular, optimization over a longer planning horizon. These insights are in line with the results in Kendrick (1981).

Conclusion
In this paper, we reviewed the algorithms OPTCON1 for open-loop and OPTCON2 for open-loop feedback (passive learning) control for stochastic economic policy models and described in detail the new OPTCON3 algorithm, which includes active learning and the dual effect of optimizing policies. The OPTCON algorithms are applicable to nonlinear models which can be approximated by time-varying linear models. A computer program was created to implement approximately optimal policies according to the OPTCON3 algorithm. The results from calculating these policies for two simple economic models served to test the OPTCON3 algorithm and compare it to the previous solutions of the stochastic control problem. Initial evaluations show that the OPTCON3 approach may be promising to enhance our understanding of the adaptive economic policy problem under uncertainty.
$\Sigma^{\theta\theta}_0 = \Sigma^{\theta\theta}$: covariance matrix of system parameters
$\Sigma^{\varepsilon\varepsilon}$: covariance matrix of system noise
$\theta^m$: (random) parameter noises
$(\varepsilon^m_t)_{t=1}^{T}$: (random) system noises
$(z_t)_{t=1}^{T}$: path of exogenous variables
$(\tilde{u}_t)_{t=1}^{T}$: target path for control variables
$(\tilde{x}_t)_{t=1}^{T}$: target path for state variables
$W^{xx}$, $W^{ux}$, $W^{uu}$: weighting matrices of objective function
$\alpha$: discount rate of objective function

At the end of the algorithm the following optimal values have to be obtained:
Step I: For each S from 1 to T do the following search steps [1]-[3]:
Step I-1: Find an open-loop solution for the subproblem (S, ..., T): apply the procedure already implemented in OPTCON1; cf. Sect. 3 above. Fix $(x^*_S, u^*_S)$.
• The nonlinearity loop is run until the stop criterion is fulfilled, i.e. until the difference between the values of the current and the previous iteration is smaller than a pre-specified number or the maximum number of iterations is achieved. The approximately optimal solution (x * t , u * t ) T S+1 has been found when the stop criterion has been achieved. Then go to the next step I-2B. It should be noted that after several runs of the nonlinearity loop only the solution (x * S+1 , u * S+1 ) for the time period S + 1 will be taken as the optimal (nominal) solution. The calculations of the pairs (x * t , u * t ) for other periods (t > S +1) have to be done again, taking into account the re-estimated parameters for all periods.
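Notice that the parameter matrices for the linearized system of equations are built from $F_{x_t}$, $F_{x_{t-1}}$ and $F_{u_t}$, the derivatives of the system function $f(\ldots)$ with respect to $x_t$, $x_{t-1}$ and $u_t$ respectively (see footnote 7).
Step I-2B: Calculate the Riccati matrices $K^{xx}$, $K^{x\theta}$ and $K^{\theta\theta}$ and the auxiliary matrices $\Lambda^{xx}$, $\Lambda^{xu}$, $\Lambda^{uu}$, $\lambda^x$ and $\lambda^u_t$ for time periods t = S + 1, ..., T (see footnote 8).
Footnote 7: For a detailed definition see Matulka and Neck (1992), Eqs. (16)-(18).
Footnote 8: The terms $w^u_t$ and $w^x_t$ are defined as follows. In order to simplify notation and computation the following general quadratic form will be used:

$$L_t(x_t, u_t) = \frac{1}{2} \begin{pmatrix} x_t \\ u_t \end{pmatrix}' W_t \begin{pmatrix} x_t \\ u_t \end{pmatrix} + \begin{pmatrix} x_t \\ u_t \end{pmatrix}' \begin{pmatrix} w^x_t \\ w^u_t \end{pmatrix} + w^c_t,$$

where

$$\begin{pmatrix} w^x_t \\ w^u_t \end{pmatrix} = -W_t \begin{pmatrix} \tilde{x}_t \\ \tilde{u}_t \end{pmatrix} \quad \text{and} \quad w^c_t = \frac{1}{2} \begin{pmatrix} \tilde{x}_t \\ \tilde{u}_t \end{pmatrix}' W_t \begin{pmatrix} \tilde{x}_t \\ \tilde{u}_t \end{pmatrix}.$$

The equivalence between the quadratic tracking form and the general quadratic form is shown, for instance, in Blueschke-Nikolaeva (2013).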
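How the three derivative matrices can be combined into reduced-form matrices of a linearized system is illustrated by the following Python sketch. It is an assumption-laden illustration rather than the definition used in Matulka and Neck (1992): exogenous variables and parameters are held fixed inside `f`, the derivatives are approximated by central differences, and the names `A` and `B` for the resulting matrices are chosen here for convenience. The reduced form follows from solving the linearized implicit equation for the current state.

```python
import numpy as np

def numerical_jacobian(func, point, h=1e-6):
    """Central-difference Jacobian of a vector-valued function at a point."""
    point = np.asarray(point, dtype=float)
    rows = np.asarray(func(point)).size
    jac = np.zeros((rows, point.size))
    for i in range(point.size):
        step = np.zeros(point.size)
        step[i] = h
        jac[:, i] = (func(point + step) - func(point - step)) / (2 * h)
    return jac

def linearize_implicit(f, x_prev, x_cur, u):
    """Reduced-form matrices of x_t = f(x_{t-1}, x_t, u_t) around a tentative point.

    Linearizing gives dx_t = F_prev dx_{t-1} + F_cur dx_t + F_u du_t, hence
    dx_t = (I - F_cur)^{-1} (F_prev dx_{t-1} + F_u du_t).
    """
    F_prev = numerical_jacobian(lambda v: f(v, x_cur, u), x_prev)
    F_cur = numerical_jacobian(lambda v: f(x_prev, v, u), x_cur)
    F_u = numerical_jacobian(lambda v: f(x_prev, x_cur, v), u)
    lhs = np.eye(x_cur.size) - F_cur
    A = np.linalg.solve(lhs, F_prev)     # effect of the lagged state
    B = np.linalg.solve(lhs, F_u)        # effect of the control
    return A, B
```

For a system that is already linear, `A` and `B` reproduce the exact reduced form up to rounding error, which gives a simple way to check such a routine.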

[c]: Calculate the probing component $J^P_j$.
Step I-2F: Calculate the sum of the deterministic, cautionary and probing terms over the periods S, ..., T: $J_d = J_D + J_C + J_P$.
Step I-2G: Take a new control $\mathring{u}_S = \mathring{u}^{\pi+1}_S$ (new point of the grid search) and go to step I-2A.
Step I-3: Choose an optimal $u^*_S$ with $\min J = J^*_d(u^*_S)$. End of grid search.
$$\Sigma^{\theta\theta}_{S/S} = \Sigma^{\theta\theta}_{S/S-1} - \Sigma^{\theta x}_{S/S-1} \left(\Sigma^{xx}_{S/S-1}\right)^{-1} \Sigma^{x\theta}_{S/S-1} \quad (20)$$

Step IV: Set $\theta^m = \theta^m_{S/S}$ and $\Sigma^{\theta\theta} = \Sigma^{\theta\theta}_{S/S}$, go to Step I and run the procedure for the time period S + 1. The loop of Step I and the OPTCON3 algorithm is finished when S = T and the approximately optimal dual control and state variables for all periods have been found.