Deep reinforcement learning for turbulent drag reduction in channel flows

We introduce a reinforcement learning (RL) environment to design and benchmark control strategies aimed at reducing drag in turbulent fluid flows enclosed in a channel. The environment provides a framework for computationally efficient, parallelized, high-fidelity fluid simulations, ready to interface with established RL agent programming interfaces. This allows both for testing existing deep reinforcement learning (DRL) algorithms against a challenging task, and for advancing our knowledge of a complex, turbulent physical system that has been a major topic of research for over two centuries and remains, even today, the subject of many unanswered questions. The control is applied in the form of blowing and suction at the wall, while the observable state is configurable, so that different variables, such as velocity and pressure, can be sampled at different locations of the domain. Given the complex nonlinear nature of turbulent flows, the control strategies proposed so far in the literature are physically grounded, but too simple. DRL, by contrast, enables leveraging the high-dimensional data that can be sampled from flow simulations to design advanced control strategies. In an effort to establish a benchmark for testing data-driven control strategies, we compare opposition control, a state-of-the-art turbulence-control strategy from the literature, and a commonly used DRL algorithm, deep deterministic policy gradient. Our results show that DRL leads to 43% and 30% drag reduction in a minimal and a larger channel (at a friction Reynolds number of 180), respectively, outperforming the classical opposition control by around 20 and 10 percentage points, respectively.


Introduction
Turbulent flows are ubiquitous both in nature and in engineering applications, from climate and weather dynamics [1] to wind-turbine engineering [2], and from turbulent blood streams in the human body [3] to hypersonic flows around re-entry vehicles [4]. Hence, turbulence is highly relevant for both economic and environmental reasons. Depending on the intended outcome, it can be desirable to promote turbulence, for example to enhance mixing in combustion engines [5], or to hinder it, as a means to reduce the drag on airplane wings and thus the overall fuel consumption [6]. In both cases, some form of flow control needs to be designed and deployed. Flow control has attracted widespread attention over the years from different fields of the scientific community. However, turbulent flows are chaotic, multiscale, highly nonlinear and high-dimensional phenomena; furthermore, they are very expensive to simulate accurately. Still today, the challenges brought by turbulence prevent us from finding effective flow-control strategies in most realistic applications.
Deep reinforcement learning (DRL) is a mathematical framework that has been used to design and learn control policies, including in physics research. This framework was successfully applied in optics [7], plasma physics [8], and thermodynamics [9]. In fluid dynamics, the potential of DRL algorithms has been assessed for turbulence modelling [10] and drag reduction [11]. Extending the latter application, in this work we introduce an RL environment to simulate a turbulent flow in a channel. The setup chosen for the simulation is more computationally efficient than other, more realistic geometries (e.g. a wing or a turbine blade), while still being able to capture all the flow features in the vicinity of the wall. The environment is based on the numerical solver SIMSON [12], which implements an efficient pseudo-spectral method to solve the Navier-Stokes equations for incompressible wall-bounded flows. The solver performs direct numerical simulations (DNSs), in which all the time and length scales are resolved without any approximation. This is essential to design a control strategy that does not exploit the limitations introduced by modelling some of the flow scales. A policy that did exploit them would underperform when applied to a more realistic flow.
This article is organized as follows: in section 2, we review the most recent applications of deep learning and deep reinforcement learning in fluid mechanics, as well as the different approaches to reduce drag in channel flows from the literature. In section 3, we describe the simulation setup and how the numerical solver interacts with the deep reinforcement learning agent. The details of the learning setup are also provided. In section 4, we validate our code against the opposition-control results available in the literature. Furthermore, the learning results in the minimal channel are reported, along with the drag reduction achieved by the DRL policy in two channel flow simulations of different sizes. Finally, in section 5 we summarize our findings and outline some possible future developments.
2 Related work

Deep learning in fluid mechanics
Fluid-dynamics problems offer many opportunities and challenges for data-driven techniques. Recent research works published in machine-learning venues focused on improving the accuracy of coarse simulations [13,14] or approximating the dynamics of partial differential equations (PDEs) [15,16]. At the same time, the application of deep-learning methods to turbulence requires specific domain knowledge, as testified by the large number of research works that have been published by domain specialists. A comprehensive review by Vinuesa and Brunton [17] identifies three main areas of application of deep learning for fluid mechanics. The first is the acceleration of direct numerical simulations. The second includes all the studies in which deep learning is used to enhance turbulence models in simulations, in order to make them less computationally expensive. Finally, neural-network architectures such as autoencoders (AEs) can be used to develop reduced-order models of fluid flows [18]. Beyond these, further applications have been envisioned [19], including but not limited to temporal predictions of reduced-order models or spatial reconstructions of turbulent flows.
Reconstruction of turbulent flow fields at a given instant has been attempted using convolutional networks [20]. The possibility to perform non-intrusive sensing of the flow, i.e. sampling quantities without disrupting it, is an essential element to implement flow-control systems, which typically rely on velocity fields sampled at a given distance from the wall, as detailed in subsection 2.3. Convolutional neural networks can also be trained to increase the resolution of coarse flow fields [21]. Note, however, that generative adversarial networks (GANs) have been proven to be more effective for this task [22].

Reinforcement learning in fluid mechanics
The application of (deep) reinforcement learning to fluid mechanics is still in its early phase compared with traditional supervised learning [23,24]. Two main categories of contributions can be identified: the first focuses on modelling turbulence with the aid of DRL [10], while the second focuses on active flow control. In the following, we will focus on active flow control applications.
A number of contributions in this domain aim to control the movement of an agent in a fluid flow, for example, representing a fish swimming in a turbulent flow or in a fish school, where the reward aims to maximize the swimming efficiency of the agent [25,26]. Another category of contributions uses DRL to control the dynamics of the flow. The present work falls into this last subcategory. A notable example of such an application is provided by Ref. [11], where the flow around a cylinder is controlled using jets orthogonal to the main flow direction. This case has been used as a benchmark in a number of extension works [27-32], a fact that illustrates both the growing interest in DRL for active flow control, and the importance of providing benchmark cases that can be used as a starting point by the community. Recent works [33] highlighted how different control strategies are selected by the DRL agent depending on the physical features of the flow to be controlled, and showed clearly that DRL agents can discover complex strategies not limited to simple opposition control.
In this study, we extend the application of DRL to active flow control in another category of flows, namely wall-bounded turbulent flows. This is a significant jump in complexity for at least two reasons. First, the two-dimensional (2D) cylinder case previously considered exhibits a well-defined shedding pattern at a specific frequency, which is dominant with respect to all the other dynamics, and suppressing it provides a straightforward drag-reduction strategy. If a higher Reynolds number is considered, a different policy is chosen: instead of reducing the shedding, the agent energizes the boundary layer on the cylinder surface to trigger the drag crisis [33]. The channel flow, by contrast, is an inherently multi-scale phenomenon in which there are no obvious frequencies or mechanisms that can be targeted in order to achieve drag reduction. Second, previous works focused on a 2D environment, while the channel flow is inherently three-dimensional (3D). Three-dimensionality is a requisite for the development of fully-featured turbulence structures, and it provides a more nonlinear, chaotic, and challenging benchmark to test novel DRL algorithms and control strategies, while also being closer to realistic full-scale configurations. Very recent works have started exploring similar flow cases. For example, in Ref. [41] the use of DRL was tested in a standard channel flow (note that in this study we consider an open channel flow, as detailed in subsection 3.1). Another notable example is Ref. [42], where a Couette flow is controlled by means of two streamwise parallel slots. Note that in the latter case the DRL agent is trained on a reduced-order model of the problem and then applied to the actual case.

Drag reduction in channel flows
Given the high importance of flow control in several fields, different techniques and frameworks have been presented in the literature. These control strategies are simple in nature and exhibit a somewhat limited performance, but they are well established. We discuss these traditional control strategies in the following subsection, and we implement them in the solver to serve as a baseline. In wall-bounded flows (such as our channel flow) the actuation is usually performed by means of a wall-normal velocity distribution applied at the wall, which corresponds to blowing when the wall-normal velocity is positive, and suction when it is negative. The control law is traditionally obtained by rescaling the wall-normal or spanwise velocity fluctuations at a given wall-normal sampling location $y_s$, and using them, with opposite sign, as the actuation value. The aim is to suppress the near-wall turbulent structures [43-45]. This strategy provides a drag reduction that depends on the height of the sensing plane, as detailed in subsection 4.1. Similar drag reduction, at the corresponding Reynolds number, is reported for different flow cases, such as a turbulent boundary layer flow [46].

Simulation setup
The geometry under consideration is an open channel flow. The wall is represented as a no-slip condition at the lower boundary, while a symmetry condition is imposed at the upper boundary. Previous works [41,43] used a standard channel flow for their investigations, with a no-slip condition applied to both the upper and lower boundaries. We opted for an open channel flow because it allows for a better isolation of the flow features close to a solid boundary. In fact, in the full channel the large turbulent structures can extend beyond the channel centerline, with an effect also on the other wall. The dynamics of the near-wall turbulence close to the lower boundary is not affected by the boundary condition on the other side, providing a meaningful comparison between the two simulations and control techniques.
Throughout the paper, we use $(x, y, z)$ to indicate the streamwise, wall-normal and spanwise directions, respectively, and $(u, v, w)$ to indicate the corresponding velocity components. We consider two different simulation domains. The first one is a minimal channel [47] with size $\Omega = L_x \times L_y \times L_z = 2.67h \times h \times 0.8h$ (where $h$ is the open-channel height). This domain size is large enough to simulate all the relevant near-wall statistical features of turbulence, while being computationally cheap. Note, however, that the limited spanwise dimension of the channel limits the simulation to a single low-speed streak, whereas in the larger domain we are able to simulate the interaction of several of them. The second domain is a larger channel of size [...]. The friction Reynolds number is defined as $Re_\tau = u_\tau h/\nu$, where $u_\tau = \sqrt{\tau_w/\rho}$ is the friction velocity (based on the wall-shear stress $\tau_w$ and the fluid density $\rho$) and $\nu$ is the kinematic viscosity of the fluid. The Reynolds number represents the ratio between the inertial and viscous forces in the flow, determining the flow characteristics and how turbulent the flow is. We consider $Re_\tau = 180$ in both domains, to compare the drag reduction with the corresponding results in Ref. [46]. The solver we used, SIMSON, is a pseudo-spectral code that uses Chebyshev polynomials in the wall-normal direction. The resolution of the simulation is given by $N_x \times N_y \times N_z$, where $N_x$ and $N_z$ are the numbers of Fourier modes in the streamwise and spanwise directions, while $N_y$ is the number of Chebyshev modes in the wall-normal direction. The minimal channel is simulated with a resolution of $16 \times 65 \times 16$, while the resolution in the larger domain is $64 \times 65 \times 64$. The time-advancement scheme is a second-order Crank-Nicolson algorithm for the linear terms and a third-order Runge-Kutta method for the nonlinear terms.
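To make the quoted resolution more concrete, the following minimal sketch converts the minimal-channel domain size and mode counts into wall-parallel grid spacings in viscous units, using $\Delta x^+ = (L_x/h)\,Re_\tau/N_x$. It assumes that the spacing is computed from the number of Fourier modes as quoted, without accounting for any dealiasing the solver may apply; the function name `spacing_plus` is illustrative only.

```python
# Inner-scaled (viscous-unit) grid spacing for the minimal-channel resolution quoted above.
RE_TAU = 180.0  # friction Reynolds number, Re_tau = u_tau * h / nu

def spacing_plus(length_over_h: float, n_modes: int, re_tau: float = RE_TAU) -> float:
    """Uniform grid spacing in viscous units: (L / h) * Re_tau / N (assumed definition)."""
    return length_over_h * re_tau / n_modes

# Minimal channel: 2.67h x h x 0.8h resolved with 16 x 65 x 16 modes.
dx_plus = spacing_plus(2.67, 16)  # streamwise spacing
dz_plus = spacing_plus(0.8, 16)   # spanwise spacing
print(f"minimal channel: dx+ = {dx_plus:.1f}, dz+ = {dz_plus:.1f}")
```

The same helper can be applied to the larger domain once its dimensions are known.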

Reinforcement-learning environment and algorithm
The simulation discussed above needs to be cast as a reinforcement-learning problem, defining the actions that can be performed, the state space and the reward function.
The control is performed by imposing a wall-normal velocity distribution. We cast the simulation as a multi-agent reinforcement-learning (MARL) problem, which implies that several independent agents cooperate in order to maximize the chosen reward function. In our setup all the agents operate locally and thus share the same actuation policy, which determines the action to perform based on the local observation of the environment provided to the agent. This approach has two important consequences: it allows us to avoid the curse of dimensionality on the DRL-control space dimension and to re-use the knowledge of the properties of the flow across the domain, as discussed in Ref. [34]. We consider a grid of $N_{\mathrm{CTRL}x} \times N_{\mathrm{CTRL}z}$ agents that cover the entire lower wall of the channel. Here $N_{\mathrm{CTRL}x}$ and $N_{\mathrm{CTRL}z}$ represent the number of agents in the streamwise and spanwise directions, respectively. The policy learnt by the individual agent is translationally invariant in both the streamwise and spanwise directions. Each agent computes the wall-normal velocity intensity to be applied at each evaluation. The control policy is evaluated at fixed time intervals, and the actual control varies linearly from the old value to the new one in order to avoid numerical instabilities related to sudden variations of the wall-normal velocity at the boundary. In our numerical experiments we use $\Delta t^+ = 0.6$ (where the time units are scaled with $t^* = \nu/u_\tau^2$). The actuation value is limited to a prescribed range: in our case the agent can apply wall-normal velocities between $-u_\tau$ and $u_\tau$.
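The two safeguards described above, limiting the action to $[-u_\tau, u_\tau]$ and ramping the wall velocity linearly between consecutive actions, can be sketched as follows. This is an illustration rather than the solver's implementation: the number of solver sub-steps per actuation interval (`n_substeps`) and the choice to reach the new value exactly at the end of the interval are assumptions.

```python
import numpy as np

U_TAU = 1.0  # friction velocity; actions are expressed in units of u_tau

def clip_action(raw_action, limit=U_TAU):
    """Restrict the requested wall-normal velocity to [-u_tau, u_tau]."""
    return np.clip(raw_action, -limit, limit)

def ramp(v_old, v_new, n_substeps):
    """Yield the wall velocity for each solver step, moving linearly from v_old to v_new.

    n_substeps is a hypothetical number of solver time steps per actuation interval.
    """
    for k in range(1, n_substeps + 1):
        yield v_old + (v_new - v_old) * k / n_substeps

v_old = np.zeros((2, 2))                                   # previous actuation (2 x 2 agents)
v_new = clip_action(np.array([[1.5, -0.2], [0.4, -3.0]]))  # out-of-range values are clipped
steps = list(ramp(v_old, v_new, n_substeps=4))             # four intermediate wall velocities
```

With four sub-steps the wall velocity reaches half of the new value after the second step and the full value after the fourth, so the boundary condition never jumps discontinuously.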
The agent does not have access to the entire velocity field, and the observed state consists of a portion of one or more wall-parallel planes of sampled flow quantities. The portion of the flow field that is observable by each agent is the one above the actuation. It is possible to sample any of the three velocity components at a given wall-normal location. The wall-normal location of the sampled plane is defined in the input file of the simulation and it is rounded up to the closest Chebyshev collocation node. For the remainder of the paper, we will refer to the different sampling heights using inner-scaled wall-normal coordinates: $y^+ = y u_\tau/\nu = y/\ell^* \in [0, Re_\tau]$, where $\ell^* = \nu/u_\tau$ denotes the viscous length.
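As an illustration of how a requested sampling height maps onto the wall-normal grid, the sketch below builds a set of Chebyshev (Gauss-Lobatto) nodes and picks the first node at or above the requested $y^+$. The mapping of the nodes onto $[0, h]$ is an assumption made for this example and may differ from the one used inside the solver; the function names are hypothetical.

```python
import numpy as np

RE_TAU = 180.0   # friction Reynolds number from the text
N_Y = 65         # number of Chebyshev modes in the wall-normal direction

def chebyshev_nodes_plus(n_y: int = N_Y, re_tau: float = RE_TAU) -> np.ndarray:
    """Inner-scaled heights of Gauss-Lobatto nodes mapped onto [0, h] (assumed mapping)."""
    j = np.arange(n_y)
    y_over_h = 0.5 * (1.0 - np.cos(np.pi * j / (n_y - 1)))  # 0 at the wall, 1 at the top
    return re_tau * y_over_h

def sampling_node(y_plus_target: float, nodes: np.ndarray) -> int:
    """Index of the first node at or above the requested height (one reading of "rounded up")."""
    idx = int(np.searchsorted(nodes, y_plus_target, side="left"))
    return min(idx, len(nodes) - 1)

nodes = chebyshev_nodes_plus()
j_s = sampling_node(15.0, nodes)
print(f"requested y+ = 15.0, node {j_s} at y+ = {nodes[j_s]:.2f}")
```

Because the nodes cluster near the wall, the difference between the requested and the actual sampling height stays small in the near-wall region of interest.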
The same global reward value is provided to all MARL agents, defined as the percentage reduction of the wall-shear stress with respect to the uncontrolled flow, averaged over the entire wall. We do not consider the instantaneous value of this quantity; rather, a value averaged over the interval between two actions is returned. The solver performs multiple time-step advancements between one action and the next one because the time step required for the stability of the numerical method is much smaller than the actuation time. The actuation interval is also a parameter that can be defined in the input file. It should be small enough to allow the control to react to changes of the small flow scales, and sufficiently large to observe the effect of the action at the intermediate scales [34]. Since turbulence is a multi-frequency phenomenon, the reward signal reflects this characteristic, and the reward averaging helps to provide a more reliable estimate of the actual wall-shear-stress reduction than an instantaneous value would. Note that the presence of turbulence in the flow induces a higher drag with respect to a laminar flow and the implemented control effectively reduces the level of turbulence. In fact, the control can theoretically induce a re-laminarization of the flow. This condition represents an upper bound for the drag reduction achievable with the control [48]. This value depends on the chosen Reynolds number and in our case is 73.9%. Note, however, that there is no guarantee that this value is ever attainable with any control law.
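The reward described above can be written compactly. The sketch below assumes that the wall-averaged shear stress is available at every solver step between two actions and that the reward is expressed in percent; the function and argument names are illustrative and not taken from the environment code.

```python
def drag_reduction_reward(tau_w_samples, tau_w_uncontrolled):
    """Percentage reduction of the wall-shear stress, averaged between two actions.

    tau_w_samples: wall-averaged shear stress at each solver step of the interval.
    tau_w_uncontrolled: reference wall-averaged shear stress of the uncontrolled flow.
    """
    mean_tau_w = sum(tau_w_samples) / len(tau_w_samples)
    return 100.0 * (1.0 - mean_tau_w / tau_w_uncontrolled)

# Example: the shear stress fluctuates around 60% of the uncontrolled value.
r = drag_reduction_reward([0.5, 0.7], tau_w_uncontrolled=1.0)
```

A negative value indicates a drag increase, as observed during the initial transient of the controlled simulations.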
The deep reinforcement learning algorithm used to train the agent is the deep deterministic policy gradient (DDPG) [49]. As the name suggests, this algorithm optimizes a deep neural network that approximates the relation between the state and the action to be performed (i.e. a policy function). After learning, the policy is deterministic. During learning, Gaussian noise with zero mean and variance $0.1u_\tau$ is added to foster exploration. The policy is updated every $\Delta t^+ = 180$, with 64 mini-batch gradient updates. The mini-batches are sampled from a replay buffer that holds up to 5,000,000 (state, action, reward) tuples.
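The bookkeeping implied by these settings can be sketched with the standard library alone. Dividing the update interval by the actuation interval gives the number of interactions between two policy updates, and a bounded queue plays the role of the replay buffer. The mini-batch size is not stated in the text, so `BATCH_SIZE` below is a placeholder, and the stored tuples are dummies; this is not the training code used in the study.

```python
import random
from collections import deque

ACTUATION_DT_PLUS = 0.6    # time between two actions (from the text)
UPDATE_DT_PLUS = 180.0     # time between two policy updates (from the text)
GRADIENT_STEPS = 64        # mini-batch updates per policy update (from the text)
BUFFER_SIZE = 5_000_000    # replay-buffer capacity (from the text)
BATCH_SIZE = 4             # hypothetical; the mini-batch size is not stated

# Number of agent-environment interactions between two policy updates.
interactions_per_update = round(UPDATE_DT_PLUS / ACTUATION_DT_PLUS)

replay_buffer = deque(maxlen=BUFFER_SIZE)  # oldest tuples are discarded when full

def store(state, action, reward):
    """Append one (state, action, reward) tuple to the buffer."""
    replay_buffer.append((state, action, reward))

def sample_minibatches(rng: random.Random):
    """Draw GRADIENT_STEPS uniform mini-batches from the buffer."""
    return [rng.sample(replay_buffer, BATCH_SIZE) for _ in range(GRADIENT_STEPS)]

for i in range(interactions_per_update):
    store(state=float(i), action=0.0, reward=0.0)  # dummy transitions

batches = sample_minibatches(random.Random(0))
print(interactions_per_update, len(batches), len(batches[0]))
```

In a multi-agent setting every interaction contributes one tuple per agent, so the buffer fills much faster than the interaction count alone suggests.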

Solver-DRL interface
The reinforcement-learning framework chosen for this work is Stable-baselines [50], while the environment is coded as a custom PettingZoo/Gym environment [51]. Both Stable-baselines and the Gym environment are written in Python, whereas the fluid solver is coded in FORTRAN 77/90. An interface between the two programming languages is therefore needed in order to communicate the quantities (state, control values, reward) that are necessary for the learning to take place. While previous studies have coupled the solver and the RL algorithm using an input/output (I/O) stream [52], in this case the interface is based on the message-passing interface (MPI). The fluid solver is spawned as a child process of the main learning process and an intercommunicator is created between the solver and the Python main program. After the initialization, the solver waits for information requests from the main program. Once the request is received by the solver, further MPI-based messages are exchanged between the agent and the solver, depending on the requested information. Communication requests are handled using five-character strings, which determine the sequence of instructions that the solver has to perform for each interaction of the agent with the environment. The admissible values for the request string are:
• STATE: this string is used to request the simulation to communicate the state to the agent.
• CNTRL: after this request, the agent communicates the new action to the environment, in particular the new values of the controllable parameters are passed to the fluid solver.
• EVOLVE: once the control parameters are updated with CNTRL, several time iterations of the solver are computed in order to observe the effect of the chosen action on the environment. The instantaneous wall-shear stress is passed to the Python program and it is used to compute the reward.
• TERMN: this request interrupts the solver and closes the FORTRAN program. This request is used at the end of each episode before restarting the environment.
The initial condition is the same for every episode unless stated otherwise.
The combination of these messages is used to define the required functions for the Gym environment application programming interface (API).
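How the request strings could map onto the usual `reset`/`step`/`close` functions of a Gym-style environment is sketched below. To keep the example runnable without MPI or the FORTRAN solver, a stand-in class (`FakeSolver`) answers the requests with dummy data; this class, the class `ChannelEnv`, the ordering of the requests inside `step`, and the reward scaling are all assumptions made for illustration, not the actual environment code.

```python
class FakeSolver:
    """Stand-in for the FORTRAN solver: records requests and returns dummy data."""

    def __init__(self, n_agents):
        self.n_agents = n_agents
        self.control = [0.0] * n_agents
        self.log = []
        self.running = True

    def request(self, name, payload=None):
        self.log.append(name)
        if name == "STATE":
            return [0.0] * self.n_agents      # dummy local observations
        if name == "CNTRL":
            self.control = list(payload)      # store the new wall velocities
            return None
        if name == "EVOLVE":
            # Dummy wall-shear stress: lower when the mean actuation magnitude is larger.
            mean_abs = sum(abs(c) for c in self.control) / self.n_agents
            return 1.0 - 0.1 * min(1.0, mean_abs)
        if name == "TERMN":
            self.running = False              # the real solver would exit here
            return None
        raise ValueError(f"unknown request: {name}")


class ChannelEnv:
    """Gym-like wrapper (hypothetical) built from the four request strings."""

    def __init__(self, n_agents, tau_w_ref=1.0):
        self.n_agents = n_agents
        self.tau_w_ref = tau_w_ref            # uncontrolled wall-shear stress
        self.solver = None

    def reset(self):
        self.close()                          # terminate a previous episode, if any
        self.solver = FakeSolver(self.n_agents)
        return self.solver.request("STATE")

    def step(self, action):
        self.solver.request("CNTRL", action)
        tau_w = self.solver.request("EVOLVE")
        obs = self.solver.request("STATE")
        reward = 1.0 - tau_w / self.tau_w_ref  # fractional drag reduction
        return obs, reward

    def close(self):
        if self.solver is not None and self.solver.running:
            self.solver.request("TERMN")


env = ChannelEnv(n_agents=4)
obs = env.reset()
obs, reward = env.step([1.0, 1.0, 1.0, 1.0])
requests_so_far = list(env.solver.log)
env.close()
```

In the real setup each `request` call would correspond to an MPI message exchanged over the intercommunicator, and `reset` would spawn a new solver process instead of constructing a Python object.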

Validation and baseline
As mentioned in section 2, opposition control is a simple, established control benchmark that is well understood theoretically and can be used as a reference to assess DRL control laws. Given the velocity distribution at the sensing plane $y_s$, the actuation at the wall $v_w$ applied by opposition control can be computed with the expression:
$$v_w(x, z, t) = -\alpha \left[ v(x, y_s, z, t) - \langle v(x, y_s, z, t) \rangle \right].$$
Here $\langle v(x, y_s, z, t) \rangle$ is the spatial mean of the wall-normal velocity field, and the second term of the equation enforces zero net mass flux within the domain, so that the actuation plane as a whole does not add fluid mass to the flow or remove it. The positive scaling parameter $\alpha$ is fixed in space and time and its typical value is $\alpha = 1$, independently of the height of the sampled velocity plane. In order to validate our code implementation we reproduced this control strategy in our solver, letting the environment evolve with constant $\alpha$ for a sufficiently long time, such that the drag-reduction rate has reached a stationary value. In Ref. [43], the highest drag reduction was found by sampling the wall-normal velocity at $y^+ = 10$. The computational domain had size $\Omega = L_x \times L_y \times L_z = 4\pi h \times h \times (4/3)\pi h$, with resolution $N_x \times N_y \times N_z = 128 \times 129 \times 128$. The friction Reynolds number $Re_\tau$ was reported to be 180 and the Reynolds number based on the centerline velocity $U_{cl}$ (the one at the top boundary, in the case of the open channel) is in close agreement with our setup: the reported value was $Re_{cl} = 3300$, while our simulation is performed with $Re_{cl} = 3273$. The drag reduction reported in Ref. [43] was $\approx 14\%$ when sampling at $y^+ = 10$. Our result in the larger channel is 17.73%. This is in acceptable agreement with the literature value and further validates our simulation, considering the lower resolution and taking into account the small difference in spanwise size of the domain.
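The opposition-control law described above (the sampled wall-normal velocity with its spatial mean removed, scaled by $\alpha$ and applied with opposite sign) is short enough to state in code. The sketch below uses a random field in place of a real velocity plane, and the function name is illustrative.

```python
import numpy as np

def opposition_control(v_plane: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Wall actuation from the wall-normal velocity sampled at the sensing plane.

    Subtracting the plane average enforces zero net mass flux through the wall.
    """
    return -alpha * (v_plane - v_plane.mean())

rng = np.random.default_rng(0)
v_s = rng.normal(size=(16, 16))   # stand-in for v(x, y_s, z, t) on a 16 x 16 grid
v_w = opposition_control(v_s)     # blowing where the flow moves towards the wall, suction otherwise

net_flux = v_w.mean()             # zero up to round-off
```

Because the mean is removed before the sign change, the actuation sums to zero over the wall regardless of the value of $\alpha$.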
A different sampling plane and a slightly lower $Re_{cl}$ were used in Ref. [44]. The domain here is [...]. When rescaling the velocity at $y^+ = 15$ with $Re_{cl} = 3240$, the resulting drag reduction was $\approx 25\%$ and our own test resulted in a reduction of 25.67%, which is in very good agreement. Finally, in Ref. [46], the field at $y^+ = 15$ at $Re_\tau = 180$ is sampled. In this case, the resolution is slightly lower than those of the previous studies in the wall-parallel directions ($N_x \times N_y \times N_z = 160 \times 257 \times 128$). On the other hand, the channel size is also smaller, being exactly the same as our larger channel. They reported $\approx 24\%$ drag reduction, and our result is also in good agreement with theirs. Overall, the results from our implementation match the ones reported in the literature, giving us confidence in our simulation setup and control implementation. The drag reduction obtained with these settings is used as a baseline for comparison with the DRL results.

Minimal channel learning and testing
We first consider the smaller domain, i.e. the minimal-channel configuration. In this case, the number of agents is set to $N_{\mathrm{CTRL}x} = N_x$ and $N_{\mathrm{CTRL}z} = N_z$. The learning is performed using several initial conditions for 400 episodes. The streamwise and wall-normal components of the velocity at $y^+ = 15$ are provided to the agents as the observable state. In each learning run, all the episodes are initialized with the same flow field. Figure 1 highlights the sensitivity of the learning to the initial condition of the episode. One important remark is that (D)RL algorithms are based on a trial-and-error approach to the reward optimization. This means that not all learning runs provide a drag-reducing policy, as shown in the aforementioned figure. Note that the unsuccessful runs (shown in the left panel) are removed from the average for clarity.
Even though the learning is strongly sensitive to the initialization of the episode, the learnt policies perform consistently during testing, regardless of the selected initial condition. Figure 2 shows the drag reduction achieved with one of the best-performing policies. The policy test is repeated for six different initial conditions, in order to assess the generalization of the learnt control policy. The performance is compared with the baseline strategy (opposition control, with sensing plane at $y^+ = 15$) for the same set of initial conditions. Initially, the DRL policy produces a strong increase in the drag for a brief time, corresponding to a negative value for the drag reduction. After this initial transient, the average drag reduction is consistently higher than the one obtained with opposition control: the DDPG policy provides 43% drag reduction, while opposition control is limited to 26%, as mentioned in subsection 4.1. Figure 2 (right) shows the effect of the control on the velocity-fluctuations distribution. Using the friction velocity of the uncontrolled case, it is possible to observe how the two control approaches produce different changes in the fluctuations distribution. With opposition control, the range of the fluctuations is reduced, but the overall shape of the distribution is unchanged. In fact, from the perspective of quadrant analysis [53,54], sweeps and ejections remain the dominant features in the near-wall region. On the other hand, DRL has a more significant effect on the distribution: at the sensing plane, we can observe a distribution that is almost symmetric with respect to the wall-normal fluctuations. With the DRL control, the range of the streamwise fluctuations is greatly reduced, while the wall-normal fluctuations are increased. Furthermore, the predominance of sweeps and ejections vanishes, with a more even distribution of events among the four quadrants. Consequently, DRL learns a control strategy with a profound impact on the flow physics. Analyzing the sensing plane at $y^+ = 15$ when the control is applied reveals that the wall streak in the minimal channel is significantly attenuated by a streamwise travelling wave.
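The quadrant classification referred to above can be illustrated with a few lines of code. The sketch below follows the standard convention in which ejections are events with $u' < 0$, $v' > 0$ (Q2) and sweeps are events with $u' > 0$, $v' < 0$ (Q4); the sample values are made up for the example and do not come from the simulations.

```python
import numpy as np

def quadrant_fractions(u_fluct, v_fluct):
    """Fraction of samples in each quadrant of the (u', v') plane."""
    u = np.asarray(u_fluct, dtype=float).ravel()
    v = np.asarray(v_fluct, dtype=float).ravel()
    masks = {
        "Q1": (u > 0) & (v > 0),   # outward interactions
        "Q2": (u < 0) & (v > 0),   # ejections
        "Q3": (u < 0) & (v < 0),   # inward interactions
        "Q4": (u > 0) & (v < 0),   # sweeps
    }
    return {name: float(mask.sum()) / u.size for name, mask in masks.items()}

# Five made-up samples of the streamwise and wall-normal fluctuations.
fractions = quadrant_fractions([1.0, -1.0, -1.0, 1.0, -2.0],
                               [1.0, 1.0, -1.0, -1.0, 3.0])
```

A flow dominated by sweeps and ejections shows large Q2 and Q4 fractions, whereas a more even split among the four quadrants corresponds to the behaviour described for the DRL-controlled case.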

Larger channel testing
One of the relevant features of the learnt policy is that it is local and translationally invariant, meaning that it can be applied with no modification to different domain sizes and flow cases, provided that the sizes of the state observations and of the reward are the same. The most important assumption behind the application of the same policy is that the underlying physical features exploited by the agent in the environment where the learning is performed are also present in the new environment. When applying the policy to a larger domain, we still considered $N_{\mathrm{CTRL}x} = N_x$ and $N_{\mathrm{CTRL}z} = N_z$. Note, however, that this implies that a larger number of agents cooperate within a single instance of the simulation. The left panel of figure 3 shows the drag reduction achieved in the larger domain when the policy learnt in the minimal channel is applied. Remarkably, the policy provides a drag reduction that is higher than the opposition-control baseline, regardless of the flow initial condition. The effectiveness of the control policy is tested without any further tuning in the larger domain. The success of this transfer application shows that the DRL agent has been able to learn a control strategy that robustly exploits the features in the flow, so that the learnt policy can be effective in different cases which have a similar physical characterization. The magnitude of the reduction in this case is smaller than the one obtained in the minimal channel, with the DDPG policy providing 30% drag reduction, still performing better than opposition control, which yields 20%. One possible explanation for this result is the fact that the policy is learnt in the minimal channel, where some physical features, such as the interaction of the streaks, cannot be experienced by the agent. Further evidence of this difference can be found by assessing the effect of the two control strategies on the velocity-fluctuations distribution shown in the right panel of figure 3. For the DRL control, the symmetry with respect to the wall-normal fluctuations is more pronounced than before, showing a high probability of strong wall-normal fluctuations, coupled with weak streamwise fluctuations. In the larger domain, the predominance of sweeps and ejections is also reduced by the DRL control, but it is not eliminated to the extent that is observed in the minimal channel.
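The reason a local, shared policy transfers to a different grid without modification can be shown with a toy example. In the sketch below, `local_policy` is a hypothetical stand-in for the trained network (it is not the learnt policy); it maps the two observed velocity components at each grid point to one action, so the same function can be evaluated on a $16 \times 16$ or a $64 \times 64$ grid of agents.

```python
import numpy as np

def local_policy(obs: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the trained network: (u', v') at a point -> action."""
    u, v = obs[..., 0], obs[..., 1]
    return np.clip(-v + 0.1 * u, -1.0, 1.0)   # actions limited to [-u_tau, u_tau]

rng = np.random.default_rng(1)
obs_small = rng.normal(size=(16, 16, 2))   # minimal channel: 16 x 16 agents
obs_large = rng.normal(size=(64, 64, 2))   # larger channel: 64 x 64 agents

act_small = local_policy(obs_small)        # one action per agent, same function
act_large = local_policy(obs_large)

# Shifting the flow field shifts the actuation by the same amount.
shifted = local_policy(np.roll(obs_small, shift=3, axis=0))
```

Since every agent evaluates the same function on its own local observation, the number of agents only changes the number of evaluations, not the policy itself.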

Conclusions
In this paper we have described the implementation of a pseudo-spectral solver for DNS of fluid flows as a reinforcement-learning environment, with which the agents interact to maximize the drag reduction within the domain. The environment supports three-dimensional, fully-turbulent flow simulations, allowing for the discovery of physically-accurate control policies. The environment can be customized with different flow cases and adapted to tasks of varying difficulty, for instance by selecting a different number of input quantities or a different number of agents. Using the environment, we applied DRL for drag reduction in an open channel flow, with two different domain sizes and resolutions. The DDPG algorithm allows the identification of control policies that on average show a higher drag reduction than the one provided by opposition control, used here as a baseline. In the minimal channel, the control strategy yields 43% drag reduction, while in the larger domain the achieved reduction is smaller (30%), but still consistently better than the baseline. The performance improvement with respect to the baseline is around 20 percentage points in the minimal channel and 10 percentage points in the larger domain. None of the policies learnt or used as a baseline considers the energetic cost of the actuation, meaning that the net energy saving is lower than the figures provided here. In this regard, reinforcement learning could in further work help design more efficient control strategies by explicitly incorporating the energy cost of the actuation as part of the reward function. Currently, the learnt policies are not able to re-laminarize the flow. Since this condition represents the upper bound for drag reduction, an improvement of the control policy is still theoretically possible, although there is no guarantee that this is attainable through DRL control or other techniques. The current study focuses on an open-channel flow with uncontrolled friction Reynolds number $Re_\tau = 180$. Our setup can also be adapted to different flow cases, paving the way to applying similar techniques to increasingly complex systems such as boundary layers or flows at higher Reynolds numbers. Simulating increasingly high Reynolds numbers becomes progressively more computationally expensive; however, the simulation wall-clock time can be reduced by using the MPI parallelization of the solver and HPC clusters. The possibility to change the Reynolds number of the flow simulation represents a way to tune the difficulty of the control problem, making the drag reduction in an open channel flow an ideal benchmark to test new policies. Increasing the Reynolds number is a necessary step in order to design a control strategy that can be used in practical applications; on the other hand, it also represents an appealing research direction as the maximum achievable drag reduction increases with the Reynolds number. It must be noted that a control policy that yields drag reduction at a given Reynolds number may not be as effective at a higher Re because different physical mechanisms can be responsible for the drag [55]. In this regard, deep reinforcement learning constitutes a promising framework, thanks to its end-to-end approach: all the physical features of the turbulent flow are represented in the environment, allowing for new and possibly more effective control strategies.
Azizpour (HA) provided feedback. The paper was written by LG and edited by RV and JR. Comments were provided by RV, JR, HA and PS.

Appendix A Effect of the episode length
In subsection 4.2, we consider episodes of 3000 interactions between the environment and the agent. This corresponds to a simulation time of $t^+ \approx 1500$.
Here we verify the effect of the episode length on the training by reducing the number of interactions to 1000 and 2000, as shown in figure A1. The initial choice of the episode length is designed to include the initial transient (typically $t^+ \approx 500$) and a sufficiently long time after that. However, in our experiments, longer episodes do not provide a significant advantage in terms of drag reduction. The learning appears to improve faster at the very beginning, but this is simply related to the higher number of interactions per episode. Shorter episodes might provide a way to reduce the overall time and computational cost of the learning, which increase quickly with the number of agents and the resolution/Reynolds number, respectively.

Fig. 1 (Left) Running mean of the drag reduction with respect to the reference uncontrolled case during agent learning in the minimal channel. Each panel shows 3 different learning runs for 6 initial conditions. (Right) Running mean of the drag reduction with respect to the reference uncontrolled case during learning in the minimal channel, averaged over different learning runs. The shaded area represents the variance of the drag reduction with the different runs.

Fig. 2 (Left) Drag reduction with respect to the uncontrolled case obtained in the minimal channel, when using the policy learnt using DRL or opposition control. The result is averaged over 6 different initial conditions. The shaded area represents the variance of the drag reduction with the different initial conditions. (Right) Distribution of the inner-scaled velocity-fluctuation components after the initial transient ($t^+ > 500$) in the streamwise ($u$) and wall-normal ($v$) directions, for the uncontrolled case (top), with opposition control (middle) and when using DRL (bottom).

Fig. 3 (Left) Drag reduction with respect to the uncontrolled case obtained in the larger channel, when using the DRL policy learnt in the minimal channel versus using opposition control. The result is averaged over 6 different initial conditions. The shaded area represents the variance of the drag reduction with the different initial conditions. (Right) Distribution of the inner-scaled velocity-fluctuation components after the initial transient ($t^+ > 500$) in the streamwise ($u$) and wall-normal ($v$) directions, for the uncontrolled case (top), with opposition control (middle) and when using DRL (bottom).
Fig. A1 Running mean of the drag reduction with respect to the reference uncontrolled case during learning in the minimal channel, averaged over different learning runs. Each panel shows 3 different learning runs for 6 initial conditions.