Model-based deep reinforcement learning for accelerated learning from flow simulations

In recent years, deep reinforcement learning has emerged as a technique to solve closed-loop flow control problems. Employing simulation-based environments in reinforcement learning enables a priori end-to-end optimization of the control system, provides a virtual testbed for safety-critical control applications, and allows to gain a deep understanding of the control mechanisms. While reinforcement learning has been applied successfully in a number of rather simple flow control benchmarks, a major bottleneck toward real-world applications is the high computational cost and turnaround time of flow simulations. In this contribution, we demonstrate the benefits of model-based reinforcement learning for flow control applications. Specifically, we optimize the policy by alternating between trajectories sampled from flow simulations and trajectories sampled from an ensemble of environment models. The model-based learning reduces the overall training time by up to $85\%$ for the fluidic pinball test case. Even larger savings are expected for more demanding flow simulations.


Introduction
According to the 2022 IPCC report, smart control technologies are a key enabler in mitigating the impact of global warming and climate change [1].Control tasks in which the manipulation of fluid flows can significantly reduce the carbon dioxide footprint are ubiquitous in society and industries, e.g., reducing aerodynamic forces acting on vehicles [2], maximizing the coefficient of performance of heat pumps [3], or adjusting the operating conditions of process engineering systems in response to the availability of renewable energy resources [4], to name a few.To achieve optimal control in the aforementioned example applications, it is necessary that the controller can respond to changes in the ambient conditions.However, the design and implementation of closedloop active flow control (AFC) systems is highly complex.As an example, consider the flow past a truck.At highway speeds, such a flow is fully 3D and comprises turbulent boundary layers, flow separation, reattachment, and large turbulent wakes [5,6].To implement closed-loop AFC, this complexity must be captured with a limited amount of sensors placed on the vehicle's surface.Similarly, the actuators must be designed and placed sensibly to control the dominant flow structures.Moreover, a suitable control law must be derived, i.e., a mapping from the sensor reading to the optimal actuation.Finally, the nonlinear flow dynamics create a tight coupling between sensors, actuators, and control law.Hence, the AFC system should be designed and optimized as a whole.
Recently, reinforcement learning (RL), and in particular deep RL (DRL), started to emerge as a technology in fluid mechanics with the potential to tackle the complexity of closed-loop AFC.Viquerat et al. [7] provide an extensive account of DRL-based applications in fluid mechanics.DRL solves control problems through trial-and-error interactions with an environment, where environment is the DRL term for the system to be controlled.The unique advantage of learning from simulation-based environments is the possibility of end-to-end optimization of the full control system, i.e., sensor locations, control law, and actuator positions [8][9][10].However, a critical challenge of simulation-based RL is the high computational cost and turnaround time of flow simulations.Even for relatively simple flow control benchmark problems, stateof-the-art RL algorithms require O(100) episodes to converge [7], which is far from practical for real-world AFC applications.
To emphasize the need for sample-efficient learning, consider again the application of DRL-based closed-loop AFC to vehicles.The current gold standard for accurate and fast simulations of vehicle aerodynamics are hybrid approaches like improved delayed detached eddy simulations (IDDES) [11].Computing converged mean force coefficients for the DrivAer test case at Re = 768000, required approximately two days on 700 CPU cores in 2016 [11].It is conceivable that roughly five sequential IDDES can be performed per day because it is not necessary to obtain converged statistics in every run.Moreover, hardware and scalability have improved significantly over the past years.Assuming that 10 simulations are performed in parallel per episode to generate more data, each simulation being run on 1000 CPU cores, the computational cost for 100 episodes is approximately 5M CPU hours, and the training lasts 20 days.Cloud providers charge 0.01-0.05euros per CPU hour, which puts a price tag of 0.5M-2.5Meuros on a single training at a single flow condition.Additional hyperparameter optimization, sensor/actuator variants, and an extension to multiple flow conditions can easily increase the cost by a factor of 10-100.This cost, combined with a turnaround time of 20 days, is far from practical for the majority of potential users.
Several remedies exist to reduce the computational cost and turnaround time, e.g., by exploiting invariances [12,13] or by pre-training on simulations with very coarse meshes [14].While effective, the applicability of the aforementioned techniques strongly depends on the control problem, i.e., the problem must exhibit invariances, and the mesh coarsening must not change the flow characteristics.A more general approach to improve data efficiency is model-based DRL (MBDRL) [15].The main idea is to substitute the expensive simulation-based environment with one or more surrogate models.The surrogate models are optimized based on data coming from the high-fidelity environment.Additional data to optimize the control law can be created with little effort by querying the optimized surrogate model.A plethora of MBDRL algorithms exist that differ in the type of surrogate model and how the control law is derived from the model.Moerland et al. [15] provide an extensive overview.
Key challenges in MBDRL are i) the efficient creation of accurate surrogate models and ii) dealing with the presence of model error.Due to their flexibility, neural networks are a common choice for building data-driven surrogate models.However, the models must be created on the fly based on data that is unavailable for prior tests, so the optimization pipeline must be robust and efficient.Moreover, the models are auto-regressive.Hence, long-term stability is a persistent challenge.If the prediction accuracy is not sufficiently high, querying the models repeatedly leads to vanishing or exploding solutions.Finally, even if the model creation workflow is robust and efficient, there is always epistemic uncertainty due to the alternating iterative updates of environment models and control law.With each update of the control law, previously unknown actuations and states will emerge, and consequently, the environment models will eventually become too inaccurate.Estimating whether the surrogate models are still reliable or should be updated with high-fidelity data from the true environment is essential to making MBDRL robust.
In this contribution, we present a modified version of the model ensemble trust region policy optimization (METRPO) [16] algorithm and demonstrate the benefits of MBDRL for fluid-mechanical applications.Specifically, we compare model-free (MF) and model-based (MB) learning on the flow past a circular cylinder [17,18] and the fluidic pinball configuration [9,19].The appendix describes various implementation details and serves as a reference for the abbreviations used throughout the text.Moreover, we provide a complimentary code repository with instructions to reproduce the numerical experiments and data analyses presented hereafter [20].A persistent record of all scientific results, including a snapshot of the code repository, is publicly accessible [21].

Theory
This section starts with a very short introduction to essential RL concepts and then describes the MF algorithm, proximal policy optimization (PPO), employed to solve the flow control problems.Thereafter, we explain the creation of model ensembles, which emulate the simulation outputs, and how these ensembles are employed in an MBDRL variant of PPO. Figure 1 shows how the different algorithmic components are connected within one episode of model-based PPO.

Reinforcement learning
There are two main entities in RL: the agent and the environment.The agent contains the control logic, while the environment represents the simulation or experiment that shall be controlled.The environment at a discrete time step n ∈ N is characterized by a state S n ∈ S, where S is the space of all possible states.Formally, there is a difference between the system's state and a partial observation of the state, since the full state is often inaccessible.Nonetheless, we follow standard practice and refer to partial observations and full states alike simply as states.In fluid mechanics, S n could represent instantaneous pressure or velocity values at one or more sensor locations in the region of interest.The agent can manipulate the state via an action A n ∈ A, where A is the space of all possible actions, upon which the environment transitions to a new state S n+1 .Common actions (means of actuation) for flows are suction, blowing, rotation, or heating.The control objective in RL is expressed by the reward R n+1 ∈ R, where R is the space of all possible rewards, and is assessed based on the transition from S n to S n+1 .For example, if the control objective is drag reduction, the reward would be defined such that large drag forces yield low rewards while small drag forces yield high rewards.The combination of S n , A n , R n+1 , and S n+1 is called an experience tuple [22].The aggregation of experience tuples over N interactions between agent and environment forms a so-called trajectory [22], Trajectories form the basis for optimizing the policy π(A n = a|S n = s) (the control law), which predicts the probability of taking some action a given that the current state is s.Optimality in RL is defined in terms of cumulative rewards, the so-called return G n = N i=n+1 γ i R i , because the system's response to an action may be delayed.The discount factor γ ∈ (0, 1] introduces a notion of urgency, i.e., the same reward counts more if achieved early in a trajectory.Given the same policy π and the same state S n , the return G n might vary between trajectories.Incomplete knowledge of the state is one common source of uncertainty.To consider uncertainty in the optimization, RL ultimately aims to find the policy that maximizes the expected return v π (s) under policy π over all possible states s ∈ S [22]: where E π [•] denotes the expected value under policy π, e.g., the mean value for a normally distributed random variable, and v * π (s) is the optimal state-value function.Equation (1) forms the basis for all RL algorithms.

Policy learning
Expression (1) implicitly defines the optimal policy π * (a|s), which is the policy associated with the optimal value function v * π (s).However, additional algorithmic building blocks are required to derive a workable policy.PPO is the most common algorithm to solve problem (1) for fluid-mechanical systems [7].PPO employs deep neural networks as ansatz for the value function v π (s) and the policy π(a|s).Hence, PPO belongs to the group of actor-critic DRL algorithms.PPO is relatively straightforward to implement and enables accelerated learning from multiple trajectories, i.e., multiple trajectories may be sampled and processed in parallel rather than sequentially to produce new training data T = [τ 1 , τ 2 , . . .τ K ].It is important to note that there are many PPO details that vary across different implementations.Many of these details are not thoroughly described in the standard PPO reference article [23], but they are nonetheless of utmost importance.Andrychowicz et al. [24] provide guidelines based on comprehensive test results for a multitude of algorithmic variants.Our implementation is based on the textbook by Morales [25].We outline only those elements needed for the MBDRL version.
Based on the trajectories, the free parameters θ v of the parametrized value network v θv (s) are optimized to predict the expected return over all trajectories.The value function's subscript π has been dropped to simplify the notation.The loss function includes some additional clipping to limit the change between two updates [25]: where v old θv is the value network's state before optimization, and the clip function is defined as: Similar to the value function network, the policy network π θπ (a|s) parametrizes a multivariate Beta distribution over possible actions.During exploration, actions are sampled at random from this distribution, i.e., A n ∼ π θπ (S n ).We do not consider a potential cross-correlation between actions for exploration.To evaluate if a sampled action performs better than expected, the reward is compared to the expected reward, i.e. [23,25]: where δ n is called the 1-step advantage estimate at time step n.Actions with a positive advantage estimate should be favored.Rather than looking one step ahead, one could also look l steps ahead to consider delayed effects.The generalized advantage estimate Â balances between short-term and long-term effects using the hyperparameter λ ∈ (0, 1] [23,25]: Finally, the free parameters of the policy network, θ π , are optimized to make favorable actions ( Ân > 0) more likely and poor actions ( Ân ≤ 0) less likely.This concept is combined with a clipping mechanism that restricts policy changes [25]: where π θπ /π old θπ is the probability ratio between the current and old (before optimization) policy, ε is the clipping parameter, E(π) is the policy's entropy [25], and β is the entropy's weighting factor.The additional entropy term avoids a premature stop of the exploration [15].
One sequence of generating trajectories, updating the value network according to equation ( 2), and updating the policy according to equation ( 6) is called an episode.Since the optimization starts with randomly initialized value and policy networks, multiple episodes are required to find the optimal policy.

Model learning
The type of environment model employed here is a simple feed-forward neural network with weights θ m that maps from the last d + 1 states and a given action to the next state and the received reward: To simplify the notation, we introduce the current extended state Ŝn , which comprises the last d + 1 visited states: Arranging the current extended state and the current action into a feature vector, x n = [ Ŝn , A n ], and the next state and the received reward into a label vector, y n = [S n+1 , R n+1 ], the model weights θ m can be optimized employing a mean squared error (MSE) loss of the form: where D is a set of feature-label-pairs constructed from high-fidelity trajectories, and |D| is the overall number of feature-label-pairs.A single trajectory of length N yields |D| = N − (d + 1) such pairs.To obtain the model weights based on equation ( 9), we employ state-of-the-art deep learning techniques, e.g., normalization of features and labels, layer normalization, batch training, learning rate schedule, and early stopping.
More details about the model training are provided in appendix A.

Model-based proximal policy optimization
Given a trained model of the form (7), it becomes straightforward to sample new (fictitious) trajectories from the model by employing the model recursively.This process is described in detail in algorithm 1.The initial extended state Ŝ0 is selected from the existing high-fidelity trajectories.Note that the action sampling makes the trajectory sampling stochastic, so even when starting from the same Ŝ0 , executing algorithm 1 K times generates K different trajectories.For efficiency, the sampling of multiple trajectories should be performed in parallel.
Algorithm 1 Sampling of a trajectory from an environment model.
▷ predict the next state and the reward ▷ append experience tuple to trajectory Ŝn ← [S n−d+1 , . . ., S n , S n+1 ] ▷ overwrite current extended state end while In principle, MBDRL can work by sampling trajectories from a single model.However, the policy changes with each episode, which leads to newly explored states and actions.Naturally, the environment model's prediction error increases with each episode, and it becomes increasingly challenging to sample meaningful trajectories from the model.Therefore, it is vital to monitor the model's confidence in a prediction to decide at which point new high-fidelity episodes should be generated.There are two pathways to include model uncertainty: one could train a fully probabilistic environment model, i.e., a model that predicts a distribution over the next possible states, or one could train an ensemble of regular models [15].Here, we follow the METRPO algorithm [16] and train an ensemble of N m simple environment models.Each member in the ensemble M = [m 1 , m 2 , . . ., m Nm ] has a different set of weights θ i , which we drop from the index to simplify the notation.The ensemble approach introduces a new hyperparameter, namely the number of models, but training the ensemble is typically easier than training a fully probabilistic network.Moreover, it is straightforward to optimize the models in parallel.Each model is trained as described before, i.e., employing loss function (9).Even though the training procedure stays the same, the optimized models differ from one another for several reasons: There are several options to generate new trajectories from the model ensemble.The most obvious one is to repeat algorithm 1 once per model.In contrast, the original METRPO algorithm mixes the models within the trajectory sampling to improve robustness [16].The authors state that different models are likely to have different model biases due to the training on a different subset of the data.When mixing the models, the different biases can partially cancel out.The alternation between models is achieved by sampling a different model for each step from a categorical distribution P M over N m classes, where each class i has an equal selection probability of p i = 1/N m .The modified sampling strategy, which we also employ here, is depicted in algorithm 2.
Algorithm 2 Sampling of a trajectory from an ensemble of environment models.Differences compared to algorithm 1 are marked in bold.▷ randomly select a model from the ensemble

Input
▷ overwrite current extended state end while Besides the robust generation of trajectories, the ensemble allows to quantify prediction uncertainty.More precisely, we would like to determine, at which point the ensemble becomes unsuitable to generate new trajectories.Therefore, in each modelbased episode, K trajectories are generated from the ensemble employing algorithm 2, and N m additional trajectories, one for each model, are generated employing algorithm 1.The ensemble trajectories are used to update policy and value networks.The model-specific trajectories are used to evaluate the policy loss (6) individually for each model to obtain a scalar value expressing the model's quality (not for gradient descent).Comparing the ith model's loss values of the current episode, i.g., L new pol,i , with the loss of the same model obtained in the previous episode, i.e., L old pol,i , allows assessing the quality of the policy update in the current episode.The loss trends for all models in the ensemble are combined by evaluating: where H is the Heaviside step function.In simple terms, N pos is the number of models in the ensemble for which the policy loss improves after an update of the policy.

Results
We demonstrate the MEPPO algorithm on two flow configurations, namely a rotating cylinder in channel flow and the fluidic pinball.The numerical setups are described in section 3.1.In section 3.2, we compare MF and MB trainings and investigate the influence of ensemble size N m and switching criterion N thr .Section 3.3 presents a short discussion of the optimal policies found for each test case.The numerical simulations are performed with the OpenFOAM-v2206 toolbox [26].The orchestration of simulations and the DRL logic are implemented in the drlFoam package [27].Further implementation details are available in the complementary code repository [20].

Cylinder flow
The flow past a circular cylinder has become an established AFC benchmark.Originally, the setup was introduced by Schäfer et al. [28] as a benchmark for numerical methods.Rabault et al. [17] were the first to adopt the setup for DRL-based AFC.Since then, many variants with different means of actuation, sensor positions, and Reynolds numbers have been investigated [7].The setup used in the present study is depicted in figure 2. The inlet velocity profile is parabolic [28], while the velocity vector at the upper and lower domain boundary is zero.At the outlet boundary, the velocity gradient is set to zero.For the pressure, a fixed reference value is applied at the outlet, while the gradient is set to zero on all other boundaries.The Reynolds number based on the mean inlet velocity U in , the cylinder diameter d and the kinematic viscosity ν is Re = U in d/ν = 100.To solve the incompressible Navier Stokes equations, we employ the pimpleFoam solver with residual control.The pressure-velocity coupling is stopped once the initial residuals for pressure and momentum equations drop below 10 −4 .Since the flow is laminar, no additional turbulence modeling is necessary.The mesh consists of approximately 5.3 × 10 3 hexahedral cells and is created using the blockMesh utility.For completeness, we note that the mesh has one cell layer in the third spatial direction of depth ∆z = 0.1d, even though the simulation is 2D.The extension in the third direction is a technical requirement of 2D simulations in OpenFOAM.To discretize convective and diffusive fluxes, we employ pure linear interpolation.An implicit first-order Euler method is used for the temporal discretization.
The state (observation) used as input for control is formed by 12 pressure probes placed in the cylinder's wake.The flow is actuated by rotating the cylinder.The control aims to minimize the forces acting on the cylinder.Given the density-normalized integral force vector F = [F x , F y ] T , the corresponding force coefficients along each coordinate are: where the reference area is A ref = d∆z.The instantaneous reward at time step n is computed as [18]: Note that we do not use time or window-averaged coefficients but the instantaneous values c i,n .The magical constants in equation ( 12) simply scale the reward to a value range near zero, which avoids additional tuning of PPO hyperparameters.Given the convective time scale t conv = d/U in , the control starts once the quasisteady state is reached, which is at 40t conv .The policy is queried once every 20 numerical time steps, which corresponds to a control time step of ∆t c = 0.1t conv .Within the control interval, the angular velocity linearly transitions from ω n to ω n+1 [18].The value range of the normalized angular velocity, ω * = ωd/U in , is limited to ω * ∈ [−0.5, 0.5].Overall, N = 400 experience tuples are generated within one trajectory, which corresponds to a duration of 40t conv .The generation of a single trajectory takes approximately 4min on two MPI ranks.

Fluidic pinball
Noack et al. [29] introduced the fluidic pinball, which is a triangular configuration of three rotating cylinders.The unforced flow undergoes several different vortex shedding regimes, namely symmetric, asymmetric, and chaotic vortex shedding.The cylinderbased Reynolds number of the present setup is Re = 100, which places the flow in the asymmetric vortex shedding regime.
The setup is depicted in figure 3. To reduce the initial transient simulation phase, we apply a step function velocity profile at the inlet with ε = 0.01U in .A constant velocity vector is set on the lower and upper domain boundaries, i.e., 1.01U in and 0.99U in , respectively.The velocity gradient at the outlet boundary is set to zero.Similar to the cylinder setup, the pressure is fixed at the outlet, and the pressure gradient on all other boundaries is set to zero.The remaining details of the setup are largely the same as those presented in section 3.1.1.Due to the more complex geometry, the mesh consists of approximately 5.3 × 10 4 control volumes.For the convective term of the momentum equation, we employ a limitedLinear scheme with a coefficient of 1.0.A total of 14 pressure sensors form the state.The positions depicted in figure 3 are loosely inspired by a preliminary optimal sensor placement study [9].The precise numerical coordinates of each sensor can be inferred from the setup files in the complementary code repository.The policy parametrizes a trivariate Beta distribution over the angular velocities ω i , i ∈ {1, 2, 3}.As the control objective, we aim to minimize the cumulative forces acting on all three cylinders.Given the density-normalized integral force T of the ith cylinder, the corresponding force coefficients are defined as: where the reference area is the projected area of all three cylinders in x-direction, namely A ref = 2.5d∆z.Definition (13) ensures that the sum over all cylinders recovers the correct total force coefficients: Based on the cumulative force coefficients, the instantaneous reward is computed as: The reward definition is very similar to the one used for the single cylinder.Note that there is no particular reason for this definition other than that it was employed in a previous study [19].Alternatively, we could have also summed up the magnitudes of the individual force coefficients.However, the focus of this work is on the demonstration of MBDRL.The magical constants in equation 15 have the same purpose as before, namely normalizing the reward values.Keeping the same definition of the convective time scale as before, the control starts at the end of the initial transient regime, which is at 200t conv .The policy is queried every 50 numerical time steps, corresponding to a control interval of ∆t c = 0.5t conv .The trajectory extends over 100t conv such that N = 200 experience tuples are generated.The dimensionless angular velocities are limited to the range ω * i ∈ [−5, 5].Generating a single trajectory with eight MPI ranks requires approximately 40min.The main parameters relevant to AFC are summarized in table 1 for both test cases.

General remarks about the training
We compare the control performance and the computational cost between MF and MB trainings.Each training is executed in isolation on a dedicated compute node with 96 CPU cores.To discuss the schematics presented later on, it is necessary to outline several implementation details.In each episode, exactly ten trajectories are generated in parallel before updating the policy and value networks.The buffer size and the parallel trajectory generation are the same for CFD and MB trajectories.We perform a very rough check of the force coefficients in the MB trajectories and discard trajectories that are obviously extremely inaccurate.Specifically, we only keep an MB trajectory for the cylinder flow if all coefficients fulfill the criteria 2.

Cylinder flow
Figure 4 shows the episode-wise mean rewards received with different training configurations.Within 200 episodes almost all configurations achieve a similar maximum reward.The maximum reward in the MF training is slightly lower, but the reward still keeps increasing moderately toward the end of the training.Surprisingly, all MB trainings reach the maximum reward earlier than the MF training.We noticed that the MB trajectories display less small-scale variance than the CFD trajectories.Presumably, this indirect filtering performed by the environment models affects the policy and value network updates positively.In general, The MB trainings display a higher variance in the mean reward, especially within the first 50 episodes, where the changes in the reward are the strongest.This behavior is related to individual MB trajectories with poor prediction quality.Luckily, these trajectories seem to have a limited impact on the learning.The markers in the MB training in figure 4 indicate CFD trajectories.For the MB training with one model, we switch back to simulation-based sampling every fourth episode, so 75% of all episodes use the environment model.The same ratio results in the MEPPO training with N m = 5 and N thr = 3.Interestingly, the automated switching between CFD and model sampling leads to a relatively regular alternation.The same number of models with a lowered threshold value of N thr = 2 reduces the amount of CFD episodes to 15%, notably without performance degradation.The training with N m = 10 and N thr = 5 also employs the model ensemble in 75% of the episodes.Reducing the threshold value in the latter case to N thr = 3 reduces the number of CFD episodes to 6%.However, the optimization becomes significantly less stable and does not reach optimal control performance.Note that the amount of training data per model decreases as the number of models increases.Therefore, increasing the number of sampled trajectories per episode or allowing overlapping training data could potentially stabilize trainings with a lowered threshold value.The increased amount of training data when employing only a single model could explain the quick convergence of the corresponding training after approximately 100 episodes.Assuming that the creation of environment models is reliable and much less costly than a CFD simulation, the more CFD episodes can be replaced with MB episodes, the lower the overall training time.This trend is reflected by the normalized MB training time depicted in figure 5. Remarkably, all configurations reduce the training time by at least 65%, even though the computational cost to simulate the cylinder flow is fairly small.To better understand differences in the overall training time, figure 5 also shows the time spent on the most relevant parts of the training.For all configurations, the simulation runs remain the most time-consuming element.The time spent updating the networks for environment, policy, and value function remains fairly constant.
A noticeable increase can be observed in the time spent to sample trajectories from a single environment model.This increase, as well as other variations in the sampling time, are related to the removal of obviously corrupted trajectories, as explained in section 3.2.1.Figure 6 shows the number of discarded trajectories per training.Clearly, the ensemble models lead to a significantly more robust trajectory sampling.Moreover, ensembles with more models produce fewer invalid samples.It is noteworthy to mention that the majority of invalid trajectories result in the initial training phase, e < 50, presumably because of the strong exploration and policy changes.In our implementation, trajectories are sampled in parallel.However, the constraint of executing the PPO update with exactly ten valid trajectories leads to repeated sampling in the case of discarded trajectories.Of course, this extra time could be avoided by sampling more trajectories than required in parallel or by loosening the constraint.We note that there are also small variations in the measured execution times that are not straightforward to explain.One means to reduce these variations would be executing additional training runs with varying seeds.However, the main trends in terms of control performance and training time are clearly visible and explainable.Moreover, there are nontrivial coupling effects between simulations, environment models, and PPO updates.As discussed before, policy updates differ between MF and MB trainings and also depend to some degree on ensemble size and switching criterion.The policy employed in a simulation impacts the execution time, e.g., by subsequent changes in the pressure-velocity-coupling iterations, or the linear solver iterations.

Fluidic pinball
Due to the computational cost of the pinball setup, we do not perform multiple training runs and investigate fewer hyperparameter configurations.Figure 7 shows the episode-wise mean rewards of all trainings.The final reward is similar for all tested configurations.The MF training reaches the maximum reward after approximately 60 episodes.Thereafter, the control performance drops slightly.The highest overall reward is achieved by the MB training with N m = 10.For the MB training with a single model, we switch every fourth episode to CFD trajectories.Also the automated switching criterion in the training with N m = 10 and N thr = 5 leads to 25% CFD episodes.The smaller ensemble with five models switches approximately every fifth episode back to the high-fidelity samples.
Figure 8  runtime reduction is even more significant due to the higher computational cost of performing the pinball simulation.Remarkably, all configurations reduce the training time by at least 80%.The remaining CFD episodes dominate the overall time consumption.
As for the cylinder test case, the time spent on sampling model trajectories increases when using only a single model.Ten trajectories were discarded during the training and had to be re-sampled.In the MEPPO runs, no invalid trajectories occurred.The increased time spent on simulations in the training with ten models is a result of the non-trivial interaction between policy learning and the numerical properties of the simulation, as discussed before.

Cylinder flow
In this section, we compare the final policies of the MF and MB trainings that achieved the highest mean reward.Note that the achieved force reduction and the corresponding control are by no means new [7].However, we would like to point out a few subtle differences in the final MF and MB policies.It is noteworthy that all MB policies achieved a similar force reduction, even though we present only results for the MEPPO training with ten models.Figure 9 shows the optimal control and the corresponding force coefficients.Both MF and MB policies reduce the drag force by approximately 7%.The second force coefficient remains near zero after 20 convective time units.Both the MF and MB policies perform a very similar actuation within the first five convective time units.Thereafter, the MB policy rotates the cylinder slightly less and archives an almost perfectly steady force balance.Small fluctuations remain in the angular velocity and the side force when employing the MF policy.The cylinder rotation steadies and extends the wake.The pressure drop over the cylinder decreases, as can be inferred visually in figure 10.The figure also shows the significant suppression of vortex shedding.While small pressure fluctuations remain when employing the MF policy, the MB policies leads to an almost perfectly steady flow field.It is conceivable that the MF training could achieve a similar control performance with more training episodes and additional PPO hyperparameter tuning.However, the MB runs achieve the presented control performance with even less than 200 episodes.

Fluidic pinball
Several established control mechanisms exist for the fluidic pinball, e.g., boat tailing and base bleeding [30].The strategy learned by the policies in the present work is referred to as boat tailing.Figure 11 compares the control and the resulting force coefficients of the best-performing MF and MB policies.Both policies follow a similar strategy, i.e., rotating cylinders 2 and 3 with maximum speed in opposite directions while keeping the cylinder in the front nearly still.This actuation leads to boat tailing and reduces the drag coefficient by approximately 87%.Of course, the reduction strongly depends on the allowed maximum rate of rotation and would be smaller for smaller rates of rotation.
While the drag reduction is fairly similar, the MF policy introduces a nonzero net force in y-direction.This asymmetry is caused by a subtle difference in the absolute values of ω * 2 and ω * 3 , i.e., the absolute value of ω * 2 is marginally larger.The same small difference between ω * 2 and ω * 3 is present in the MB policy.However, the MB  policy applies a small positive spin to the cylinder in the front, which balances the side forces to net zero.The reader might be wondering if the same drag reduction and net zero force in y-direction could be achieved by setting ω * 1 = 0 and ω * 2 = −ω * 3 = 5.The short answer is yes.However, the latter control leads to relatively strong c y fluctuations whose amplitude decays only slowly.Approximately 150 convective time units are necessary to reach the steady state with open-loop control, which is longer than the prescribed trajectory in the training.The asymmetric control performed by the MB policy achieves the same force reduction in about 20 convective time units.Finally, we want to shed some light on the local flow dynamics created by the spinning cylinders.Figure 12 shows the velocity field in the vicinity of the cylinders.As mentioned before, the uncontrolled state is asymmetric, as can be observed by looking at the wakes of the two cylinders in the back.In the controlled flow, the rotation of cylinders two and three pushes fluid between them in upstream direction.This suction significantly reduces the extent of the wake and the cumulative pressure drop over the cylinders.The fluid then passes through the gaps between the rear cylinders and the front cylinder.For the MF policy, the fluxes in positive and negative y-direction are fairly balanced.Instead, the MB policy hinders the flux between cylinders one and two and diverts more fluid to the gap between cylinders one and three.This small imbalance is enough to compensate for the different rotation speeds of the rear cylinders.

Conclusion
DRL-based AFC is a promising approach to enable smart control technology.A current limitation of simulation-based DRL is the high computational cost and turnaround time associated with the optimization.The idea of MBDRL is the creation of lowcost environment models that partially replace the costly sampling of high-fidelity trajectories from simulations.On two common benchmark AFC problems, we demonstrate that the MEPPO algorithm adopted here can tremendously reduce the training cost and time while achieving optimal control performance.The relative reduction of training time increases with the cost of the CFD simulation.Therefore, we expect the MEPPO algorithm, or variants thereof, to be a key enabler in performing DRL-based AFC on realistic, 3D simulations at an industrial scale.Of course, further tests on more complex flow configurations are required to consolidate this hypothesis.
There are a number of promising options to reduce the training cost even further.Creating accurate auto-regressive environment models on the fly posed a significant challenge.Automating the model creation, e.g., by means of Bayesian hyperparameter optimization, could be beneficial for the models' accuracy and adaptability to new control problems.The more accurate the models can extrapolate, the more savings are possible.More advanced recurrent network architectures or transformers [31] might be capable of achieving higher accuracy than the simple, fully connected networks employed here.However, such advanced networks are typically also harder to train.Alternatively, more straightforward approaches to creating reduced-order models might be worthwhile to explore, too.For example, dynamic mode decomposition with control [32] could be a drop-in replacement for the models employed here.
We also noticed a strong dependency of the learning progress on the PPO hyperparameters.Due to the high training cost, automated tuning of these parameters is infeasible for complex simulations.However, recent improvements in the PPO algorithm, e.g., the PPO-CMA variant [33], reduce the number of hyperparameters, improve robustness, and accelerate learning.Such improvements would be extremely beneficial both for MF and MB learning.

Appendix B PPO hyperparameters
Most PPO hyperparameters employed here are standard values suggested in the literature [23,25].In contrast to the original implementation, we use two separate networks for policy and value function.The two networks are optimized sequentially by two different optimizers.Since the optimization is inexpensive, we allocate only 5 CPU cores.As suggested in [25], the policy optimization is stopped if the difference between old and new policy, measured in terms of KL divergence, exceeds a prescribed threshold.However, we noticed that this criterion is rarely triggered.

1 .
The dataset D is split randomly into N m subsets, and each model is trained on a different subset.2. Each model has a different set of randomly initialized weights.Since the optimization is nonlinear, the optimized weights likely differ, even if the same dataset is used for training.3. The gradient descent algorithm updates the model weights multiple times per epoch (iteration) based on batch gradients.The mini-batches are chosen at random from the training data (batch training).
85 ≤ c x ≤ 3.5 and |c y | ≤ 1.3.The analogous bounds for pinball trajectories are −3.5 ≤ c x,i ≤ 3.5 and |c y,i | ≤ 3.5.Note that these bounds are really generous and filter out only extreme prediction errors.If one or more trajectories are discarded, the sampling is repeated until ten valid trajectories are available.The training is stopped after 200 and 150 episodes for the cylinder and pinball cases, respectively.Finally, we note that all parameter configurations for the cylinder flow are repeated with five different seed values.All results presented in section 3.2.2 are averaged over these five runs.Due to the computational cost of the fluidic pinball, we execute only a single training run per parameter configuration.

N m = 10 N thr = 5 N thr = 3 eRFig. 4
Fig. 4 Cylinder flow: episode-wise mean reward R for different training configurations; the shaded area encloses one standard deviation below and above the mean; mean and standard deviation are computed over all trajectories of all seeds; for the MB training, the markers indicate CFD-based trajectory sampling.

Fig. 5
Fig. 5 Cylinder flow: composition of the total MB training time T MB normalized with the MF training time T MF ≈ 14h.

10 N thr = 5 NFig. 6
Fig. 6 Cylinder flow: number of discarded trajectories N d per training; the box plot shows the outcome of five independent runs (seed values); the orange line indicates the median value, and the markers indicate extreme outcomes.

N thr = 5 N m = 10 eRFig. 7
Fig. 7 Fluidic pinball: episode-wise mean reward R for different training configurations; the shaded area encloses one standard deviation below and above the mean; mean and standard deviation are computed over all trajectories; for the MB training, the markers indicate CFD-based trajectory sampling.

Fig. 9
Fig. 9 Cylinder velocity and force resulting from the best MF and MB policies; the dimensionless time is t * = t/tconv.

Fig. 10
Fig. 10 Temporal mean and standard deviation of the pressure fields with and without control; both mean and standard deviation are normalized with the minimum (blue) and maximum (yellow) values of the uncontrolled case; the coordinates are normalized with the diameter.

Fig. 11
Fig. 11 Fluidic pinball: angular velocities and summed force coefficients resulting from the best MF and MB policies; the dimensionless time is t * = t/tconv.

Fig. 12
Fig.12Comparison of instantaneous velocity fields with and without control; the velocity field is normalized with the inlet velocity.
Values of N pos /N m close to zero indicate that the models' generalization capabilities are exhausted.Consequently, new high-fidelity data should be generated to update the ensemble.We define the condition N pos ≥ N thr to switch between model-based and simulation-based trajectory sampling.Values of N thr /N m close to zero encourage learning from the models at the risk of decreased robustness, whereas values close to one ensure high model quality at the risk of decreased computational efficiency.It is conceivable that N thr /N m ≈ 0.5 might be a suitable compromise between efficiency and control performance.The final model ensemble PPO (MEPPO) workflow is depicted in algorithm 3.
pos < N thr or e < 2 then ▷ sample high-fidelity data

Table 1
Characteristic parameters of the two AFC setups; ranks refers to the number of MPI ranks; CV is the number of control volumes, and Ttr is the time required to execute one simulation (to sample one trajectory).
Fluidic pinball: composition of the total MB training time T MB normalized with the MF training time T MF ≈ 130h.

Table A1
Hyperparameters for the training of the environment models.

Table C3
Abbreviations used throughout the article.