Introduction

Reduced order models (ROMs) are simplifications of high-fidelity models that capture the dominant system dynamics using minimal computational resources. Over the past decades, we have witnessed an ever-increasing number of reduced order modeling approaches and their enormous impact on fluid dynamics research1,2,3,4. A chief motivation for introducing these approaches is to enable the use of ROMs in multi-query applications such as control and optimization5,6.

Broadly speaking, closure modeling in fluid flow simulations refers to parameterizing the interactions between high-fidelity and coarse-grained descriptions. Although projection based ROMs have been utilized extensively in many fluid dynamics applications1,2,3,4, they might yield inaccurate results when used in the under-resolved regime, i.e., when the number of modes is not large enough to capture the parameterized or transient dynamics of the underlying system7. Prior studies have suggested that closure models are effective in decreasing such modal truncation errors4. In fact, ROM closures can be viewed as correction or residual terms that are added to classical ROMs in order to model the effect of the discarded ROM modes in under-resolved simulations.

Consequently, an emerging thrust in modern ROM closure development efforts is to incorporate machine learning (ML) models4,8,9,10,11,12. The last decade has seen the growth of data-driven modeling technologies (e.g., deep neural networks). So far, a substantial body of closure modeling work has focused on supervised learning9,13,14. A detailed discussion of these models can be found in a recent survey4. In principle, the problem of building a data-driven closure model from multimodal datasets can be posed as an optimization and ML task. Although supervised learning techniques have become commodity tools, reinforcement learning (RL) remains a relatively uncharted approach in the computational science community. In contrast to many other data-driven supervised learning based approaches introduced for closures4, RL can be formulated to tackle the closure problem in an automated fashion.

RL provides an iterative computational framework built on modular, goal-directed interactions of an agent with its environment. More recently, Novati et al.15 demonstrated the power of RL in discovering turbulence closure models for large-eddy simulations of turbulent flows. The RL framework introduced in that turbulence modeling context is quite general and might apply equally well to other coarse-grained reduced order modeling approaches. In related works16,17, the feasibility of using RL to learn optimal ROM closures has been discussed. We also refer to a recent review18 for the theory and application of RL approaches in fluid mechanics.

A fundamental question in RL is how to construct robust state, action and reward definitions relevant to the underlying problem. In this study, we put forth and examine scale-aware RL mechanisms that automate closure modeling of projection based ROMs. Specifically, we first introduce a multi-modal RL (MMRL) approach, which discovers mode-dependent policies to stabilize the evolution of the truncated Galerkin system. One of the main objectives of our study is therefore to demonstrate how physical insights play a key role in designing an effective RL environment that converges to a robust control policy for learning and parameterizing the closure terms.

In fact, a key element in forging a robust RL agent is an appropriate reward function, which can, in principle, be constructed from any difference metric between the RL model and high fidelity simulation data. In this study, we instead explore how RL can be formulated using a variational multiscale approach to discover closure models without requiring access to high fidelity data when designing the reward function. The general concept of the variational multiscale framework was first introduced in the finite element community19,20,21,22,23, and its underpinning idea was later adopted by ROM modellers24,25,26,27,28,29. In our study, we put forth a variational multiscale RL (VMRL) approach by leveraging this multiscale concept, which introduces a natural hierarchy to quantify the difference between modal interactions in Galerkin projection based ROM systems. Specifically, our work addresses the following questions:

  • How can RL be used to discover reliable closure models in reduced order models of transport equations?

  • Which parameterization processes lead to improved closure approaches that may reduce uncertainty in the evolution of projection coefficients of the Galerkin ROM systems?

  • How does a mode-dependent closure formulation affect overall predictive performance?

  • What are the design considerations for formulating a feasible reward function that does not require access to the supervised training data?

Therefore, the main goal of this paper is to address these questions in the context of closure model discovery for complex nonlinear spatiotemporal systems.

Methods

Reduced order modeling

To illustrate the proposed approaches, we focus on the viscous Burgers equation, a generic partial differential equation that represents broad nonlinear transport phenomena in fluid dynamics, which is given as4

$$\begin{aligned} \frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} = \nu \frac{\partial ^2 u}{\partial x^2}, \end{aligned}$$
(1)

where u refers to velocity, and \(\nu \) is the kinematic viscosity (i.e., \(\nu = 1/\text{Re }\) in dimensionless form, where Re refers to the Reynolds number). From a model reduction perspective, we highlight that Eq. (1) possesses the hallmarks of general nonlinear multidimensional advection-diffusion problems30. For the spatiotemporal domain \(x \in [0, 1]\) and \(t \in [0, 1]\), the viscous Burgers equation admits an analytical solution of the form30,31

$$\begin{aligned} u(x,t) = \frac{\frac{x}{t+1}}{1+\sqrt{\frac{t+1}{t_0}}\exp ( \frac{x^2}{4\nu (t+1)})}, \end{aligned}$$
(2)

where \(t_0 = \exp (\frac{1}{8\nu })\). This closed form expression is used to generate snapshot data for our forthcoming model order reduction analysis. The database is generated from Eq. (2) using \(N = 1024\) spatial collocation points per snapshot. We set \(\nu =0.001\) (i.e., \(Re =1000\)) to generate our training data, and our training database consists of 500 snapshots from \(t=0\) to \(t=1\). For such a spatiotemporal system, we apply a modal decomposition to the field u(x, t)

$$\begin{aligned} u(x,t) =\sum _{k=1}^{R} \alpha _k(t) \psi _k(x), \end{aligned}$$
(3)

where \(\alpha _k(t)\) and \(\psi _k(x)\) refer to the kth modal coefficient and kth proper orthogonal decomposition (POD) basis function, respectively. Figure 1 shows the first eight most energetic POD basis functions utilized in this study. We note that, without losing generality, one can also use Fourier harmonics9 or randomized orthogonal functions32 to forge a set of basis functions. Here, we note that access to a set of previously recorded data snapshots is necessary in order to generate the POD basis functions. The proposed framework, however, is not tied to the POD procedure and is easily adaptable to different basis functions. In other words, our problem formulation consists of the following steps: (i) create basis functions, (ii) develop a projection-based reduced order model, (iii) establish an ansatz/model for closures, and (iv) use reinforcement learning to learn the parameters of the ansatz/model. We stress that, since we concentrate on the POD-based model reduction strategy, the data is only used in step (i) when constructing the bases. Our approach, however, can be used equally well with different bases, such as Fourier basis functions.
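For concreteness, the following minimal sketch illustrates step (i) and the data preparation described above: it samples Eq. (2) on the stated grid and extracts a POD basis via the singular value decomposition of the snapshot matrix. The variable names, the SVD-based construction, and the choice of the test-scale truncation \({\tilde{R}}\) are our assumptions rather than details taken from the original implementation.

```python
import numpy as np

# A minimal sketch of snapshot generation (Eq. (2)) and POD basis construction;
# R and R_tilde below are assumed values, not taken from the original study.
nu = 0.001                        # kinematic viscosity (Re = 1000)
N, M = 1024, 500                  # spatial points per snapshot, number of snapshots
x = np.linspace(0.0, 1.0, N)
t = np.linspace(0.0, 1.0, M)
t0 = np.exp(1.0 / (8.0 * nu))

def exact_solution(x, tk, nu, t0):
    """Analytical solution of the viscous Burgers equation, Eq. (2)."""
    return (x / (tk + 1.0)) / (1.0 + np.sqrt((tk + 1.0) / t0)
                               * np.exp(x**2 / (4.0 * nu * (tk + 1.0))))

# Snapshot matrix: each column is the velocity field at one time instant.
A = np.stack([exact_solution(x, tk, nu, t0) for tk in t], axis=1)   # shape (N, M)

# POD basis from the thin SVD; the first R columns span the resolved scales,
# columns R+1..R_tilde provide the test scales used later by VMRL.
Psi, S, _ = np.linalg.svd(A, full_matrices=False)
R, R_tilde = 8, 16                # cut-off and test-scale truncations (assumed)
Psi_R = Psi[:, :R_tilde]

# True projection (modal) coefficients of the snapshots, alpha_k(t) = (u, psi_k).
alpha_true = Psi_R.T @ A          # shape (R_tilde, M)
```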

Figure 1. Illustration of the first eight POD basis functions generated using a total of 500 data snapshots at \(Re = 1000\).

Once a set of spatial orthonormal modes (i.e., \(\psi _k(x)\) for \(k=1,2, \ldots , R\)) is obtained from the snapshot data, we apply the Galerkin projection (GP) to obtain a dynamical system that evolves in the latent space of \(\alpha _k(t)\). The resulting ROM, denoted GP in this study, becomes

$$\begin{aligned} \frac{d \alpha _k}{d t} = \sum _{i=1}^{R} {\mathfrak {L}}^{i}_{k}\alpha _{i} + \sum _{i=1}^{R}\sum _{j=1}^{R} {\mathfrak {N}}^{ij}_{k}\alpha _{i}\alpha _{j}, \quad \forall \ k = 1, \ldots , R \end{aligned}$$
(4)

where

$$\begin{aligned}&{\mathfrak {L}}^{i}_{k} = \bigg ( \nu \frac{\partial ^2 \psi _i(x) }{\partial x^2}, \psi _{k}(x) \bigg ), \end{aligned}$$
(5)
$$\begin{aligned}&{\mathfrak {N}}^{ij}_{k} = \bigg ( - \psi _i(x)\frac{\partial \psi _j(x)}{\partial x}, \psi _{k}(x) \bigg ), \end{aligned}$$
(6)

where \((\cdot , \cdot )\) denotes the standard inner product. We note that these tensorial coefficients in the GP model depend only on the spatial modes and are often precomputed from the available snapshot data when designing projection based ROMs.
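Continuing the sketch above, the following snippet assembles discrete approximations of the tensorial coefficients in Eqs. (5)-(6) and evaluates the right-hand side of Eq. (4); the finite-difference derivatives, the Riemann-sum inner product, and all names are illustrative assumptions rather than the authors' implementation.

```python
# Assumes x, nu, Psi_R, R and R_tilde from the previous sketch.
dx = x[1] - x[0]

def inner(f, g):
    """Discrete approximation of the standard inner product (f, g)."""
    return np.sum(f * g) * dx

dpsi = np.gradient(Psi_R, dx, axis=0)         # d(psi_i)/dx for each retained mode
d2psi = np.gradient(dpsi, dx, axis=0)         # d^2(psi_i)/dx^2

# Tensors are built up to R_tilde so that the same arrays can also serve the
# test-scale model in Eq. (21).
L_lin = np.zeros((R_tilde, R_tilde))          # L^i_k, Eq. (5)
N_nl = np.zeros((R_tilde, R_tilde, R_tilde))  # N^{ij}_k, Eq. (6)
for k in range(R_tilde):
    for i in range(R_tilde):
        L_lin[k, i] = inner(nu * d2psi[:, i], Psi_R[:, k])
        for j in range(R_tilde):
            N_nl[k, i, j] = inner(-Psi_R[:, i] * dpsi[:, j], Psi_R[:, k])

def gp_rhs(alpha):
    """Right-hand side of the Galerkin system, Eq. (4), for the first len(alpha) modes."""
    r = alpha.shape[0]
    return (L_lin[:r, :r] @ alpha
            + np.einsum('kij,i,j->k', N_nl[:r, :r, :r], alpha, alpha))
```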

Closure modeling

We first illustrate the underlying closure modeling concept with a prototype demonstration, as shown in Fig. 2. To formulate our ROM closure problem, we modify Eq. (4) by adding a functional form of distributed control. This control is often referred to as an eddy viscosity approach, which has strong roots in large eddy simulations to model or compensate for the residual effects of the truncated scales33,34,35,36. Therefore, the modified ROM becomes

$$\begin{aligned} \frac{d \alpha _k}{d t} = \sum _{i=1}^{R} {\mathfrak {L}}^{i}_{k}\alpha _{i} + \sum _{i=1}^{R} \tilde{{\mathfrak {L}}}^{i}_{k}\alpha _{i} + \sum _{i=1}^{R} \sum _{j=1}^{R} {\mathfrak {N}}^{ij}_{k}\alpha _{i}\alpha _{j}, \quad \forall \ k = 1, \ldots , R \end{aligned}$$
(7)

where the proposed closure term can be parameterized by defining an eddy viscosity coefficient \(\eta \) as follows

$$\begin{aligned} \tilde{{\mathfrak {L}}}^{i}_{k} = \bigg (\eta \frac{\partial ^2 \psi _i(x) }{\partial x^2}, \psi _{k}(x) \bigg ). \end{aligned}$$
(8)

Several techniques have been introduced to improve the accuracy of closure parameterizations, including the definition of a nonlinear eddy viscosity model37 or dynamic closure models38,39 that allow the eddy viscosity to vary in time (i.e., \(\eta (t) \leftarrow \eta \)). In this paper, we first formulate an RL environment and design an agent to discover this eddy viscosity parameter \(\eta (t)\). We call this approach linear-modal RL (LMRL).

Figure 2. A schematic overview of closure modeling in a hypothetical three-dimensional latent space. When we truncate a model spanned in a higher dimensional state space (i.e., \({\tilde{R}}=3\) in the figure, a three-dimensional model spanned by \(\alpha _1,\alpha _2,\alpha _3\)) to a lower dimensional latent space (i.e., \(R=2\) in the figure, a reduced order two-dimensional model spanned only by \(\alpha _1,\alpha _2\)), a closure error is introduced due to the underlying nonlinear interactions. The main goal in closure modeling is to find a parameterized model, which is a function only of the resolved modal coefficients (i.e., \(\alpha _1\), \(\ldots \), \(\alpha _R\)), that minimizes this closure error. Therefore, in this paper we formulate an RL problem to discover this closure model using the state variables (i.e., the resolved modal coefficients).

In their seminal work, Östh et al.40 further enhanced the closure theory by emphasizing the modal eddy viscosity concept. The roots of such mode-dependent corrections go back to the work of Rempfer and Fasel41 on improved closure models. These multi-modal closures can be specified as

$$\begin{aligned} \tilde{{\mathfrak {L}}}^{i}_{k} = \bigg (\eta _k \frac{\partial ^2 \psi _i(x) }{\partial x^2}, \psi _{k}(x) \bigg ). \end{aligned}$$
(9)

where \(\eta _k\) refers to the kth modal eddy viscosity coefficient. In our current work, we formulate an RL framework to learn \(\eta _k(t)\), and call this approach multi-modal RL (MMRL). Although the proposed closure problem can be formulated using more traditional adjoint based37 or sensitivity based approaches42, our chief motivation in this study is to explore the feasibility of RL workflows for ROM closure problems. More specifically, in this paper we aim to introduce a variational multiscale RL (VMRL) approach by formulating a new procedure to forge a reward function that does not require access to the training data. Our approach therefore facilitates new RL workflows, since RL enhanced computational frameworks might play an integral role in designing many end-to-end data-driven approaches for broader optimization and control problems.
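To make the parameterization concrete, the sketch below (continuing the earlier snippets) precomputes the unscaled viscous tensor \((\partial ^2 \psi _i/\partial x^2, \psi _k)\) and scales it by the mode-dependent coefficients \(\eta _k(t)\) supplied by the RL agent, yielding the closure term in Eqs. (7) and (9); the function names are our assumptions.

```python
# Assumes inner, d2psi, Psi_R, R and gp_rhs from the previous sketches.
L_visc = np.zeros((R, R))                  # (d^2 psi_i / dx^2, psi_k), unscaled
for k in range(R):
    for i in range(R):
        L_visc[k, i] = inner(d2psi[:, i], Psi_R[:, k])

def closure_rhs(alpha, eta):
    """Closure term of Eq. (7) with mode-dependent eddy viscosities (Eq. (9)).

    eta is a length-R vector of eta_k(t) chosen by the RL agent; for a single
    scalar eddy viscosity, pass eta = eta_t * np.ones(R).
    """
    return eta * (L_visc @ alpha)          # eta_k * sum_i (d^2 psi_i, psi_k) alpha_i

def rom_rhs(alpha, eta):
    """Right-hand side of the closure-augmented ROM, Eq. (7)."""
    return gp_rhs(alpha) + closure_rhs(alpha, eta)
```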

Deep reinforcement learning

In our context, deep RL presents a modular computational framework to learn \(\eta _k(t)\) in Eq. (9). Here, we briefly describe the formulation of the RL problem and the proximal policy optimization (PPO) algorithm43. In RL, at each time step t, the agent observes some representation of the state of the system, \(s_t \in {\mathscr {S}}\), and based on this observation selects an action, \(a_t \in {\mathscr {A}}\). The agent receives a reward, \(r_t \in {\mathscr {R}}\), as a consequence of the action, and the environment enters a new state \(s_{t+1}\). Therefore, the interaction of an agent with the environment gives rise to a trajectory as follows

$$\begin{aligned} \tau = \{ s_0,a_0,r_0,s_1,a_1,r_1,\dots \}. \end{aligned}$$
(10)

The goal of RL is to find an optimal strategy for the agent that maximizes the expected discounted reward over the trajectory \(\tau \), which can be written mathematically as

$$\begin{aligned} {\mathfrak {R}}(\tau ) = \sum _{t=0}^T \gamma ^t r_t, \end{aligned}$$
(11)

where \(\gamma \in [0, 1]\) is a parameter called the discount rate, and T is the horizon of the trajectory. The discount rate determines how much weight is assigned to long-term rewards compared to immediate rewards.
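As a small illustration, the discounted return of Eq. (11) can be evaluated for a recorded reward sequence as follows; the reward values and the discount rate are placeholders.

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Discounted return of Eq. (11): sum_t gamma^t * r_t over one trajectory."""
    return float(np.sum([gamma**t * r for t, r in enumerate(rewards)]))

# Example with placeholder rewards; later rewards are weighted less.
print(discounted_return([10, -10, 10, 10], gamma=0.95))
```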

In RL, the agent’s decision making strategy is characterized by a policy \(\pi (s,a) \in {\Pi }\). The RL agent is trained to find a policy that optimizes the expected return when starting in state s at time step t; this expected return is called the state-value function and can be written as

$$\begin{aligned} V^{\pi }(s) = {\mathbb {E}}_\pi \left[ \sum _{k=0}^\infty \gamma ^k r_{t+k} | s_t=s, \pi \right] . \end{aligned}$$
(12)

Similarly, the expected return when starting in a state s, taking an action a, and thereafter following a policy \(\pi \) is called the action-value function and can be written as

$$\begin{aligned} Q^{\pi }(s,a) = {\mathbb {E}}_\pi \left[ \sum _{k=0}^\infty \gamma ^k r_{t+k} | s_t=s, a_t=a, \pi \right] . \end{aligned}$$
(13)

We also define an advantage function, which can be considered a lower-variance version of the action-value function obtained by taking the state-value function as the baseline. The advantage function can be written as

$$\begin{aligned} A^{\pi }(s,a) = Q^\pi (s,a) - V^\pi (s). \end{aligned}$$
(14)

We use \(\pi _w(\cdot )\) to denote that the policy is parameterized by \(w \in {\mathbb {R}}^d\) (i.e., the weights and biases of the neural network in deep RL). The agent is trained with an objective function defined as44

$$\begin{aligned} J(w)~ \dot{=} ~ V^{\pi _w}(s_0), \end{aligned}$$
(15)

where an episode starts in some particular state \(s_0\), and \(V^{\pi _w}\) is the value function for the policy \(\pi _w\). The policy parameters w are updated by estimating the gradient of an objective function and plugging it into a gradient ascent algorithm as follows

$$\begin{aligned} w \leftarrow w + \beta \nabla _w J(w), \end{aligned}$$
(16)

where \(\beta \) is the learning rate of the optimization algorithm. The gradient of an objective function can be computed using the policy gradient theorem45 as follows

$$\begin{aligned} \nabla _w V^{\pi _w}(s_0) = {\mathbb {E}}_{\pi _w}\big [ \nabla _w \big (\log \pi _w(s,a) \big ) Q^{\pi _w}(s,a) \big ]. \end{aligned}$$
(17)

The accurate calculation of the empirical expectation in Eq. (17) requires a large number of samples. Furthermore, the performance of policy gradient methods is highly sensitive to the learning rate, making stable and steady improvement difficult to obtain. The PPO algorithm introduces a clipped surrogate objective function43 to avoid excessively large policy updates as follows

$$\begin{aligned} J^{\mathrm{clip}}(w) = {\mathbb {E}}\big [ \mathrm{min}(r_t(w) {A}^{\pi _w}(s,a), {\mathrm{clip}} (r_t(w),1-\varepsilon ,1+\varepsilon ) {A}^{{\pi _w}}(s,a) ) \big ], \end{aligned}$$
(18)

where \(r_t(w)\) denotes the probability ratio between new and old policies as given below

$$\begin{aligned} r_t(w) = \frac{\pi _{w+\Delta w}(s,a)}{\pi _{w}(s,a)}. \end{aligned}$$
(19)

The \(\varepsilon \) in Eq. (18) is a hyperparameter that controls how much the new policy can deviate from the old one. This is done through the function \(\mathrm{clip}(r_t(w),1-\varepsilon ,1+\varepsilon )\), which constrains the ratio between the new and old policies, \(r_t(w)\), to stay within the interval \([1-\varepsilon ,1+\varepsilon ]\).
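A minimal NumPy sketch of the clipped surrogate objective in Eqs. (18)-(19) is given below; in practice this quantity is evaluated over mini-batches by the PPO implementation, and the array names and the default \(\varepsilon \) are illustrative assumptions.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective J^clip(w) of Eq. (18), to be maximized."""
    ratio = np.exp(logp_new - logp_old)                  # r_t(w), Eq. (19)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # empirical expectation
```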

Variational multiscale approach

Here we present a two-scale variational multiscale formulation, as depicted in Fig. 3, which utilizes two orthogonal spaces, \({\varvec{X}}_A\) and \({\varvec{X}}_B\). Since the POD basis is orthonormal by construction, we can build these two orthogonal spaces in a natural, straightforward way: \({\varvec{X}}_A := \text {span} \{ \psi _1, \psi _2, \ldots , \psi _{R} \}\), which represents the resolved ROM scales, and \({\varvec{X}}_B := \text {span} \{ \psi _{R+1}, \psi _{R+2},\ldots , \psi _{{\tilde{R}}} \}\), which represents the test scales (i.e., the unresolved ROM scales). Following Eq. (4), we next use the ROM approximation of u in the space \({\varvec{X}}_A \oplus {\varvec{X}}_B\),

$$\begin{aligned} u(x,t) =\sum _{k=1}^{R} \alpha _k(t) \psi _k(x) + \sum _{k=R+1}^{{\tilde{R}}} \alpha _k(t) \psi _k(x), \end{aligned}$$
(20)

where the first term on the right hand side of Eq. (20) represents the resolved ROM components of u, and the second term represents the unresolved test scales. Plugging the ROM approximation of u into Eq. (1), projecting it onto \({\varvec{X}}_A\), and using ROM basis orthogonality, we obtain

$$\begin{aligned} \frac{d \alpha _k}{d t} = \sum _{i=1}^{{\tilde{R}}} {\mathfrak {L}}^{i}_{k} \alpha _{i} + \sum _{i=1}^{{\tilde{R}}}\sum _{j=1}^{{\tilde{R}}} {\mathfrak {N}}^{ij}_{k} \alpha _{i}\alpha _{j}, \quad \forall \ k = 1, \ldots , R. \end{aligned}$$
(21)

To describe the hierarchical structure of the ROM basis, we make use of the variational framework. In the first step, the scales are naturally divided into two groups using the ROM projection: (i) resolved scales and (ii) unresolved scales. In the second step, the terms describing the interactions between the two categories of scales are explicitly identified. The novelty of the proposed framework is demonstrated in the third step, where we build our reinforcement learning based closure models for the interaction between the two types of scales. We also highlight that the choice of the cut-off scale R and the test scale \({\tilde{R}}\) might have an impact on the results of the approach. In POD-based reduced order models, increasing R typically improves accuracy4, whereas \({\tilde{R}}\) can be kept modest for efficient learning. Furthermore, there is an analogy between our approach and large eddy simulations, where a test filter scale has traditionally been used to estimate the coefficients of dynamic models39,46.

Figure 3. Relative information content (RIC) values as a function of the POD index. Test scales, which are used to build our reward function, represent the contribution of the under-resolved ROM scales.

In summary, our RL environment consists of three model definitions for the evolution of the state variables \(\alpha _1, \alpha _2, \ldots , \alpha _R \): (i) the base ROM given by Eq. (4), (ii) the ROM improved by the closure model given by Eq. (7), and (iii) the test model given by Eq. (21). Our key hypothesis is that the proposed closure model is accurate and representative if the difference between the states obtained by Eqs. (7) and (21) is minimized. Therefore, the reward function in our RL framework can be constructed by exploiting the difference between these resolved and test scale modal coefficients. More precisely, let us define the following states at time t: \(s_t^{base} := \{\alpha _1, \alpha _2, \ldots , \alpha _R\}\) as the solution of Eq. (4), \(s_t^{ROM} := \{\alpha _1, \alpha _2, \ldots , \alpha _R\}\) as the solution of our ROM given by Eq. (7), and \(s_t^{test} := \{\alpha _1, \alpha _2, \ldots , \alpha _R\}\) as the solution of Eq. (21). Then we can reward our closure policy according to the following definition of the binary reward function:

$$\begin{aligned} r_t= {\left\{ \begin{array}{ll} +10 , &{} \text {if } \sigma ||s_t^{base} - s_t^{ROM}|| < ||s_t^{base} - s_t^{test}|| \\ -10, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(22)

where \(\sigma > 1\) is a scaling factor that can be chosen between 1 and 2 in practice. In our calculations, we set \(\sigma =1.6\). We note that this binary definition of the reward function eliminates the need for access to the snapshot data, as we detail further in our results section. The selection of \(\pm 10\) in our binary definition is arbitrary, since the RL workflows are designed to maximize the sum of the rewards over each episodic experiment.
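A sketch of this binary reward is shown below; here `s_base`, `s_rom` and `s_test` denote the first R modal coefficients obtained at time t from Eqs. (4), (7) and (21), respectively, and the function name is our assumption.

```python
import numpy as np

def vmrl_reward(s_base, s_rom, s_test, sigma=1.6):
    """Binary reward of Eq. (22): no access to true snapshot data is needed."""
    if sigma * np.linalg.norm(s_base - s_rom) < np.linalg.norm(s_base - s_test):
        return 10.0
    return -10.0
```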

Results

Figure 4 displays the complete deep RL framework for the MMRL approach, where the agent observes the POD modal coefficients as the state of the system and takes the action of selecting the modal eddy viscosity coefficients. Due to the modal truncation, the effect of the unresolved scales on the resolved scales is not captured, and there is a discrepancy between the true modal coefficients and the ROM modal coefficients. The goal of the agent is to minimize this difference, and therefore the \(l_2\) norm of the deviation between the true and ROM modal coefficients is used as the reward function. Since we are maximizing the reward, the negative of the \(l_2\) norm is assigned as the reward at each time step, given by

$$\begin{aligned} r_t = - ||s_t^{ROM} - ({\hat{u}},\psi _k)|| \end{aligned}$$
(23)

where \({\hat{u}} :={u(x,t)}\) refers to the snapshot data at time step t. The choice of the reward function can have a significant effect on the performance of the agent15 and needs to be carefully designed for the problem at hand. The performance of the MMRL approach is compared against the linear-modal RL (LMRL) approach, where the agent selects only a scalar eddy viscosity amplitude as an action and a linear viscosity kernel35 is utilized to assign the modal eddy viscosity coefficients. Specifically, in the LMRL approach we have \(\eta _k(t)=\eta _e(t)(k/R)\), where \(\eta _e\) is the eddy viscosity amplitude selected by the agent as an action.
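The sketch below shows the data-based reward of Eq. (23), which requires the true snapshot data, together with the linear viscosity kernel used by LMRL; `u_hat` is the true field at time t, and the discrete projection and function names are our assumptions.

```python
import numpy as np

def data_reward(s_rom, u_hat, Psi_R, R):
    """Data-based reward of Eq. (23): requires access to the true snapshot u_hat."""
    alpha_hat = Psi_R[:, :R].T @ u_hat        # projection (u_hat, psi_k), k = 1..R
    return -np.linalg.norm(s_rom - alpha_hat)

def lmrl_kernel(eta_e, R):
    """LMRL parameterization: a single amplitude eta_e gives eta_k = eta_e * (k / R)."""
    return eta_e * np.arange(1, R + 1) / R
```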

Figure 4. Schematic of the deep RL framework for the MMRL ROM closure approach. The RL agent observes the modal coefficients and selects the modal eddy viscosity coefficients as an action.

In Fig. 5, the trajectory of the mean reward is shown for training an RL agent with the LMRL and MMRL approaches. The agent is trained at Reynolds number \(Re = 1000\). It can be seen that the maximum reward attained with the MMRL approach is almost twice the magnitude of the reward achieved with the LMRL approach. Figure 6 depicts the evolution of selected POD modal coefficients for \(Re = 1500\). The predictions from both the LMRL and MMRL approaches are in better agreement with the true projection modal coefficients than GP, with the MMRL prediction being more accurate than LMRL. However, we highlight that both the LMRL and MMRL approaches utilize the reward function given by Eq. (23), which requires access to the true snapshot data. In our evaluation, both the mean and the two-standard-deviation band (i.e., 95% confidence interval) of 10 different RL models are shown (e.g., see the red symbols in Fig. 5 for those models).

Figure 5. Evolution of the moving average mean reward for the LMRL (left) and MMRL (right) approaches. The models used for testing are indicated by red symbols.

Figure 6. Evolution of the second, fourth, sixth and last POD modal coefficients at \(Re = 1500\) for the LMRL (left) and MMRL (right) approaches.

On the other hand, Fig. 7 illustrates the trajectory of the mean reward for training an RL agent with the VMRL approach, which utilizes the binary reward function given by Eq. (22). Figure 8 shows a comparison between the MMRL and VMRL approaches at \(Re = 1500\). We highlight that both approaches utilize a multi-modal action space (i.e., discovering \(\eta _k(t)\) for \(k=1,2, \ldots , R\)). Figure 8 clearly demonstrates that the VMRL approach obtains an accurate policy without requiring access to true labeled data in defining the reward function. This key aspect of the proposed VMRL approach paves the way for designing novel RL workflows exploiting the modal interaction between resolved and test scales.

Figure 7. Evolution of the moving average mean reward for the VMRL approach. The models used for testing are indicated by red symbols.

Figure 8. Evolution of the second, fourth, sixth and last POD modal coefficients at \(Re = 1500\) for the MMRL (left) and VMRL (right) approaches.

The spatiotemporal velocity field for the different ROMs is shown in Fig. 9. It should be noted that we can at most recover the true projection of the full order model (FOM) solution. This true reduced order representation (ROR) is also shown in Fig. 9. In the analogy given in Fig. 2, the blue curve represents the ROR and the red curve represents the GP model. For quantitative analysis, the root mean squared error (RMSE) for the different ROM approaches at different Reynolds numbers (different from the training setting at \(Re=1000\)) is reported in Table 1. The RMSE for the LMRL, MMRL, and VMRL approaches is significantly smaller than that of the GP model. We also observe that the VMRL approach provides a marginally more accurate solution than the LMRL and MMRL approaches at higher Reynolds numbers. The trends for the LMRL and MMRL techniques indicate that the error increases as the Reynolds number rises, as expected. However, the error at \(Re=1200\) is slightly larger for the VMRL technique than at \(Re=1500\). This might be related to the underlying modeling hyperparameters, and somewhat different results might be obtained with a different RL architecture.

Figure 9. Spatiotemporal visualization of the velocity field at \(Re = 1500\) for different modeling approaches.

Table 1 \(\ell _2\) norm for the deviation of the velocity with respect to its true projection value for \(t\in [0,1]\).

Discussion

This study introduces a scale-aware reinforcement learning (RL) framework to automate the discovery of closure models in projection based ROMs. We treat the closure as a control input in the latent space of the ROM and build the parameterized model with a dissipation term. The feasibility of the RL framework is first demonstrated with linear-modal RL (LMRL), where a linear eddy viscosity constraint is utilized for the parameterization, and with multi-modal RL (MMRL), which finds mode-dependent eddy viscosity model coefficients. The agent is incorporated in a reduced-order solver, observes the POD modal coefficients, and accordingly computes the closure term. Both of these RL approaches minimize the discrepancy between the true POD modal coefficients and the prediction from the closure ROM, and the obtained closure model generalizes to different Reynolds numbers. We then demonstrate how to formulate an RL framework without requiring access to the true data by using the variational multiscale formalism. We find that this variational multiscale RL (VMRL) is a robust closure discovery framework that utilizes a reward function based on the modal energy transfer effect. Building on the promising results presented in this study, our future work will concentrate on incorporating the uncertainties associated with observations when selecting the action of an agent. Although our report demonstrates the feasibility of the proposed RL framework for the Burgers equation, more complex examples need to be examined in order to conclude whether the proposed methods are indeed applicable to simulations of complex phenomena, a subject we intend to investigate in the future. The key benefit of utilizing reinforcement learning in this reduced order modeling context is its modular nature, which allows for the identification and learning of new closure modeling approaches, even though RL is likely one of the most computationally time- and resource-intensive branches of machine learning.