1 Introduction

Optimal control problems involve finding controls for a dynamical system such that a certain objective function is optimized over a predefined simulation time. Recently, reinforcement learning (RL) has been demonstrated to be an effective method for solving stochastic optimal control problems in fields like manufacturing (Dornheim et al. 2020), energy (Anderlini et al. 2016) and fluid dynamics (Rabault et al. 2019). RL, being essentially a stochastic optimization method, involves a large number of exploration and exploitation attempts to learn the optimal control policy. As a result, the learning process for the optimal policy comprises a large number of simulations of the controlled dynamical system, which is often computationally expensive.

Various research studies have shown the effectiveness of using multigrid methods to improve the convergence rate of reinforcement learning. Anderson and Crawford-Hines (1994) extend Q-learning by casting it as a multigrid method and show a reduction in the number of updates required to reach a given error level in the Q-function. Ziv and Shimkin (2005) and Pareigis (1996) formulate the value function learning process as a Hamilton-Jacobi-Bellman (HJB) equation, which is solved using algebraic multigrid methods. However, despite the effectiveness of this strategy, the HJB formulation is only feasible when the model dynamics is well defined. As a result, these methods cannot be applied to problems where the model dynamics is an approximate representation of reality. Li and Xia (2015) used a multigrid approach to compute tabular Q-values for energy conservation and comfort of HVAC in buildings, which is applicable to certain simple RL problems with finite and discrete state-action spaces. In this paper, the aim is to present a generalized multigrid RL approach that can be applied to both discrete and continuous state and action spaces where an HJB formulation may not be possible, for instance, when the transition in the model dynamics is not necessarily differentiable and/or when the model is stochastic.

In the context of the reinforcement learning literature, the proposed multigrid learning process can be categorized as a framework for transfer learning. In transfer learning, the agent is first trained on one or more source task(s), and the acquired knowledge is then transferred to aid in solving the desired target task (Taylor and Stone 2009). In the presented study, the highest-fidelity simulation corresponds to the target task, which is assumed to use the fine-grid discretization. The fine-grid discretization is presumed to guarantee a good approximation of the output quantities of interest with the accuracy required by the problem at hand. On the other hand, low-grid-fidelity simulations that compromise the accuracy of these quantities correspond to source tasks. These low-grid-fidelity simulations are generated using a degree-of-freedom parameter called the grid fidelity factor (much like the study by Narvekar et al. (2016)). Transfer learning is a much broader subdomain of RL that covers knowledge transfer in the form of data samples (Lazaric et al. 2008), policies (Fernández et al. 2010), models (Fachantidis et al. 2013), or value functions (Taylor and Stone 2005). In this study, knowledge transfer is done in the form of a policy for a model-free, on-policy algorithm called proximal policy optimization (PPO). Since the policy is defined on the states and actions corresponding to the highest-fidelity simulation, a predefined mapping function is used, which maps states and actions from low-fidelity simulations to high-fidelity simulations, and vice versa. This is done by defining restriction (mapping from high- to low-fidelity simulation) and prolongation (mapping from low- to high-fidelity simulation) operators, which are normally found in classical geometric multigrid methods.

The effectiveness of this multigrid RL framework is demonstrated for the robust optimal well control problem, which is a subject of intensive research activity in subsurface reservoir management (van Essen et al. 2009; Roseta-Palma and Xepapadeas 2004; Brouwer et al. 2001). Recently, several researchers have proposed the use of reinforcement learning to solve the optimal well control problem (Miftakhov et al. 2020; Nasir et al. 2021; Dixit and ElSheikh 2022). For this study, the dynamical system under consideration is non-linear and, in practice, partially observable since data are only available at a sparse set of points (i.e., well locations). Furthermore, the subsurface model parameters are highly uncertain due to the sparsity of available field data. The optimal well control problem consists of optimizing control variables, such as the valve openings of wells, in order to maximize the sweep efficiency of the injected fluid throughout the reservoir life. The reservoir permeability field is considered an uncertain model parameter for which the uncertainty distribution is known. Two test cases, each representing a distinct model parameter uncertainty and control dynamics, are used to demonstrate the computational gains of the multigrid idea.

In summary, a multigrid reinforcement learning framework is proposed to solve the optimal well control problem for subsurface flow with uncertain parameters. This framework is essentially inspired by the principles of geometric multigrid methods used in iterative numerical algorithms. The optimal policy learning process is initiated using a low-fidelity simulation that corresponds to a coarse-grid discretization of the underlying partial differential equations (PDEs). The learned policy is then reused to initialize training against a high-fidelity simulation environment in an adaptive and incremental manner. That is, the shift from a low-fidelity to a higher-fidelity environment is made adaptively after the learned policy has converged on the low-fidelity environment. Due to this adaptive learning strategy, most of the initial policy learning takes place against lower-fidelity environments, yielding a minimal computational cost in the initial (and significant) part of the reinforcement learning process. The robustness of the policy learned using this framework is finally evaluated against uncertainties in the model dynamics.

The outline of the remainder of this paper is as follows. Section 2 provides the description of the problem and the proposed framework to solve the robust optimal well control problem. Section 3 details the model parameters for the two case studies designed for demonstration. Results of the proposed framework on these two case studies are demonstrated in Sect. 4. Finally, Sect. 5 concludes with a summary of the research study and an outlook on future research directions.

2 Methodology

Fluid flow control in subsurface reservoirs has many engineering applications, ranging from the financial aspects of efficient hydrocarbon production to the environmental problem of contaminant removal from polluted aquifers (Whitaker 1999). In this paper, a canonical single-phase subsurface flow control problem (also referred to as the robust optimal well control problem) is studied, where water is injected into porous media to displace a contaminant. This process is commonly modeled using an advection equation for tracer flow through porous media (also called Darcy flow through porous media) over the temporal domain \({\mathcal {T}} = [t_0, t_M] \subset {\mathbb {R}}\) and spatial domain \({\mathcal {X}} \subset {\mathbb {R}}^2\). In the context of fluid displacement (e.g., groundwater decontamination), the tracer corresponds to clean water injected into the reservoir from the injector wells, and the non-traced fluid corresponds to the contaminated water displaced from the reservoir through the producer wells. The source and sink locations within the model domain correspond to the injector and producer wells, respectively. Tracer flow models water flooding with the fractional variable \(s(x,t) \in [0,1]\) (also known as saturation). The saturation \(s(x,t)\) represents the fraction calculated as the ratio of injected clean water mass to the displaced contaminated water mass at location \(x \in {\mathcal {X}}\) and time \(t \in {\mathcal {T}}\). The flow of fluid in and out of the domain is represented by \(a(x,t)\), which is treated as the source/sink term of the governing equation. The set of well locations is denoted as \(x' \in \mathcal {X'}\) (where \(\mathcal {X'} \subset {\mathcal {X}}\)). In other words, \(a(x,t)\) is set to zero everywhere in the domain \({\mathcal {X}}\) except at the set of locations \(x'\). The controls \(a^+(x,t)\) (formulated as \(\max (0,a(x,t))\)) and \(a^-(x,t)\) (formulated as \(\min (0,a(x,t))\)) represent the injector and producer flow controls, respectively (note that \(a=a^+ + a^-\)). The task of the problem under consideration is to find the optimal controls \(a^*(x',t)\), which are the solution of the closed-loop optimization problem defined as

$$\begin{aligned}&\max _{s(\cdot ), a(\cdot )} \int _{t_0}^{t_M} \left( \sum _{x'}a^-(x',t)(1-s(x',t)) \right) \text {d}t,&x' \in \mathcal {X'}, \ t \in {\mathcal {T}} \end{aligned}$$
(1a)
$$\begin{aligned}&\frac{\text {d}s}{\text {d}t} = \frac{1}{\phi } \left( a^+ + sa^- - \nabla \cdot sv \right) ,&x \in {\mathcal {X}}, \ t \in {\mathcal {T}} \end{aligned}$$
(1b)
$$\begin{aligned}&s(\cdot ,t_0) = s_0,\ \ v \cdot {\textbf {n}} = 0,&\end{aligned}$$
(1c)
$$\begin{aligned}&\sum _{x'} a^+(x',t) = -\sum _{x'} a^-(x',t) = c.&x' \in \mathcal {X'}, \ t \in {\mathcal {T}} \end{aligned}$$
(1d)

The objective function defined in Eq. 1a represents the total flow of fluid displaced from the reservoir (for example, contaminated water production) and is maximized over a finite time interval \({\mathcal {T}}\). The integrand in this function is referred to as the Lagrangian term in control theory and is often denoted by L(s, a). The water flow trajectory \(s(x,t)\) is governed by the advection Eq. 1b, which is solved given the velocity field v, obtained from Darcy's law: \(v = - (k/\mu ) \nabla p\). The pressure \(p(x,t) \in {\mathbb {R}}\) is obtained from the pressure equation \(-\nabla \cdot (k/\mu ) \nabla p = a \). Porosity \(\phi (x,\cdot )\), permeability \(k (x,\cdot )\), and viscosity \(\mu (x,\cdot )\) are the model parameters. The permeability k represents the model uncertainty and is treated as a random variable that follows a known probability density function \({\mathcal {K}}\) with K as its domain. The initial and no-flow boundary conditions are defined in Eq. 1c, where \({\textbf {n}}\) denotes the outward normal vector on the boundary of \({\mathcal {X}}\). The constraint defined in Eq. 1d represents the fluid incompressibility assumption along with the fixed total source/sink term c, which represents the total water injection rate in the reservoir. In a nutshell, the optimization problem given in Eqs. 1 is solved to find the optimal controls \(a^*(x',t)\) such that they are robustly optimal over the entire uncertainty domain of permeability, K.

2.1 RL Framework

According to RL convention, the optimal control problem defined in Eq. 1 is modelled as a Markov decision process, which is formulated as a quadruple \(\left\langle {\mathcal {S}},{\mathcal {A}},{\mathcal {P}},{\mathcal {R}} \right\rangle \). Here, \({\mathcal {S}} \subset {\mathbb {R}}^{n_s}\) is the set of all possible states with dimension \(n_s\), and \({\mathcal {A}} \subset {\mathbb {R}}^{n_a}\) is the set of all possible actions with dimension \(n_a\). The state S is represented by the saturation \(s(x,\cdot )\) and pressure \(p(x,\cdot )\) values over the entire domain \({\mathcal {X}}\). The action A is represented by an array of well control values \(a(x',\cdot )\). More details of this array, such as the representation of the action, are presented in Sect. 3.3. The optimal control problem defined in Eq. 1 is discretized into M control steps and, as a result, its solution is a set of optimal control values \(\{a^*(x',t_1), a^*(x',t_2), \ldots , a^*(x',t_M)\}\) where \(t_0<t_1<t_2<\cdots <t_{M}\). The transition function \({\mathcal {P}}:{\mathcal {S}}\times {\mathcal {A}} \rightarrow {\mathcal {S}}\) is assumed to follow the Markov property. That is, the transition to the state \(S(t_{m+1})\) is obtained by executing the action \(A(t_{m})\) when in the state \(S(t_{m})\). Such a transition function is obtained by discretizing Eq. 1b. For a transition from state \(S(t_{m})\) to state \(S(t_{m+1})\), the real-valued reward \(R(t_{m+1})\) is calculated as \(R(t_{m+1})={\mathcal {R}}(S(t_{m}), A(t_{m}), S(t_{m+1}))\), where \({\mathcal {R}}:{\mathcal {S}} \times {\mathcal {A}} \times {\mathcal {S}} \rightarrow {\mathbb {R}}\) is the reward function. The reward function is obtained by discretizing the objective function (Eq. 1a) into control steps such that

$$\begin{aligned} R(t_{m+1})= \int _{t_m}^{t_{m+1}} L(s,a) \text {d}t. \end{aligned}$$
(2)

Optimal controls are obtained by learning a control policy function, which is defined as \(\pi :{\mathcal {S}} \rightarrow {\mathcal {A}}\). This function is denoted as \(\pi (A|S)\) and is generally represented by a neural network. Essentially, the control policy \(\pi (A|S)\) maps a given state \(S(t_m)\) to an action \(A(t_m)\). For an optimal control problem with M control steps, the goal of reinforcement learning is to find an optimal policy \(\pi ^*(A|S)\) such that the expected return \(G = \sum _{m=1}^M \gamma ^{m-1} R(t_{m})\) is maximized. Note that immediate rewards R are exponentially decayed by the discount rate \(\gamma \in [0,1]\). The discount rate represents how myopic the learned policy is; for example, a learned policy is completely myopic when \(\gamma =0\). The controller, also referred to as an agent, follows the policy and explores various control trajectories by interacting with the environment, which consists of the transition function \({\mathcal {P}}\) and the reward function \({\mathcal {R}}\). The data gathered from these control trajectories are used to update the policy towards optimality. Each such update of the policy is called a policy iteration. In the RL literature, a single complete control trajectory is referred to as an episode. Essentially, RL algorithms attempt to learn the optimal policy \(\pi ^*(A|S)\), starting from a randomly initialized policy \(\pi (A|S)\), by executing a large number of episodes to explore the state-action space.

In order to represent the variability in permeability, a finite number of well-spread samples is chosen from the predefined uncertainty distribution. This is achieved with a cluster analysis (see Appendix 1 for the formulation of the cluster analysis) of the distribution domain K. The sample vector \({\textbf {k}}= \{k_1, k_2, \ldots , k_l\}\) is constructed with the samples of the distribution \({\mathcal {K}}\) located nearest to the cluster centres. The policy \(\pi ^*(A|S)\) is learned by randomly selecting the parameter k from the training vector \({\textbf {k}}\) at the beginning of each episode. The policy return \(R^{\pi (A|S)}\) is computed by averaging the returns of the policy \(\pi (A|S; k_i)\) (the policy applied to the simulation where the permeability is set to \(k_i\)) over the l simulations, which is formulated as

$$\begin{aligned} R^{\pi (A|S)} = \frac{1}{l} \sum _{i=1}^l \sum _{m=0}^{M-1} \int _{t_m}^{t_{m+1}} L(s, \pi (A|S; k_i)) \text {d}t. \end{aligned}$$
(3)
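As a minimal illustration of Eq. 3, the sketch below averages episode returns over the l training permeability samples; the helper simulate_episode is a hypothetical stand-in for running one full episode with the permeability fixed to \(k_i\) and accumulating the reward of Eq. 2, not the exact implementation used in the study.

```python
def policy_return(policy, simulate_episode, k_samples):
    """Estimate the policy return R^pi of Eq. 3 by averaging the episode
    return of `policy` over the training permeability samples k_1, ..., k_l.

    simulate_episode(policy, k) is assumed to run one full episode with the
    permeability fixed to k and return the accumulated reward (Eq. 2 summed
    over the M control steps).
    """
    returns = [simulate_episode(policy, k) for k in k_samples]
    return sum(returns) / len(returns)
```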

In optimal well control problems, the system is partially observable; that is, reservoir information is only available at well locations throughout the reservoir life cycle. To accommodate this fact, the agent is provided with the available observation as its state. For this study, the observation is represented by a set of saturation and pressure values at the well locations \(x'\). This is also apparent in the definition of the Lagrangian term L(s, a), where the values of s and a are taken at the well locations \(x'\), as defined in Eq. 1a. Note that with such a representation of states, the underlying Markov property of the transition function holds only approximately. Such a system is referred to as a partially observable Markov decision process (POMDP). By the definition of a POMDP, the policy requires the observations and actions of all previous control steps to return the action for a certain control step. However, for the presented case studies, the observation from only the previous control step is found to be sufficient for the policy representation.

2.2 Learning Convergence Criteria

The convergence of the optimal policy is detected by monitoring the policy return \(R^{\pi (A|S)}\) after every policy iteration. Conventionally, when this value converges to a maximum, the optimal policy is assumed to be learned. The convergence criterion for the ith policy iteration is defined as

$$\begin{aligned} \delta _i = \left| \frac{R^{\pi (A|S)}_i - R^{\pi (A|S)}_{i-1} }{\max (R^{\pi (A|S)}_{i-1}, \epsilon ) } \right| < \delta , \end{aligned}$$
(4)

where \(\delta _i\) is the return tolerance at the ith policy iteration, \(\delta \) is the stopping tolerance and \(\epsilon \) is a small non-zero number used to avoid division by zero. The convergence of the policy learning process is often flat near the optimal result. For this reason, the convergence criterion defined in Eq. 4 is checked for the last n consecutive policy iterations. For example, if \({\textbf {r}}\) is the array of monitored values of \(R^{\pi (A|S)}\) at all policy iterations, the policy \(\pi (A|S)\) is considered converged when the criterion in Eq. 4 is met for the last n consecutive policy iterations. Algorithm 1 delineates the pseudocode for this convergence check.

Algorithm 1 Policy convergence check
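A minimal Python sketch of this check is given below, assuming the monitored returns are collected in a list; the function name and arguments are illustrative rather than the exact implementation of Algorithm 1.

```python
import numpy as np

def has_converged(r, delta, n, eps=1e-8):
    """Check Eq. 4 for the last n consecutive policy iterations.

    r     : sequence of policy returns, one value per policy iteration
    delta : stopping tolerance
    n     : number of consecutive iterations that must satisfy Eq. 4
    eps   : small constant to avoid division by zero
    """
    r = np.asarray(r, dtype=float)
    if len(r) < n + 1:          # not enough iterations to evaluate the criterion
        return False
    for i in range(len(r) - n, len(r)):
        delta_i = abs((r[i] - r[i - 1]) / max(r[i - 1], eps))
        if delta_i >= delta:    # a single violation fails the check
            return False
    return True
```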

Figure 1 illustrates the effect of n and \(\delta \) on the convergence criterion for an example reinforcement learning process.

Fig. 1 Plot of policy returns versus number of training episodes: (a) illustrates the effect of \(\delta \) on the convergence criterion and (b) illustrates the effect of n on the convergence criterion

The policy return is plotted in blue, where the value at each policy iteration is shown with a dot. The corresponding return tolerance is plotted in gray and expressed as a percentage (\(\delta _i\times 100\), where \(\delta _i\) is calculated from Eq. 4). It can be seen that the convergence criterion (denoted with markers on these plots) takes longer to be satisfied when the stopping tolerance \(\delta \) is smaller and the number of consecutive policy iterations n is higher.

2.3 Adaptive Multigrid RL Framework

An adaptive multigrid RL framework is proposed in which, essentially, the policies learned using lower-grid-fidelity environments are transferred to and trained with higher-fidelity environments. The grid fidelity of an environment is described by the factor \(\beta \in (0,1]\). The environment with \(\beta =1\) is assumed to use the fine-grid discretization, which guarantees a good approximation of the fluid flow produced out of the domain as defined in Eq. 1a. For any environment where \(\beta <1\), the environment grid is coarsened by the factor \(\beta \). For example, if a high-fidelity environment with \(\beta =1\) corresponds to a simulation with grid size \(64\times 64\), the simulation grid size is reduced to \(32\times 32\) when \(\beta \) is set to 0.5. The restriction operator \(\Phi _{\beta }()\) is used to coarsen the high-fidelity simulation parameters by the factor \(\beta \). This is done by partitioning a finer grid of size \(m \times n\) (corresponding to \(\beta =1\)) into coarser dimensions \(\lfloor \beta m \rfloor \times \lfloor \beta n \rfloor \) (corresponding to \(\beta < 1\), where \(\lfloor \cdot \rfloor \) is the floor operator) and computing the value of each coarse grid cell as a function \({\textbf {f}}\) of the values in the corresponding partition. Figure 2a illustrates this restriction operator for a variable \(x \in {\mathbb {R}}^{n\times m}\).

Fig. 2 Illustration of the restriction operator \(\Phi _{\beta }\) (a) and prolongation operator \(\Phi _{\beta }^{-1}\) (b) for a parameter x

The functions \({\textbf {f}}\) used for the different reservoir simulation parameters are listed in Table 1.

On the other hand, the prolongation operator \(\Phi ^{-1}_{\beta }()\) maps the coarse-grid environment parameters to the fine grid, as shown in Fig. 2b.
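The sketch below shows one way these two operators could be implemented for a two-dimensional parameter array, assuming the aggregation function \({\textbf {f}}\) is a simple mean and the prolongation is piecewise-constant; the parameter-specific aggregation functions actually used are those listed in Table 1.

```python
import numpy as np

def restrict(x, beta, f=np.mean):
    """Restriction operator: map a fine-grid array x (m x n) to a coarse grid
    of size floor(beta*m) x floor(beta*n), aggregating each partition with the
    function f (a mean is shown here as an illustrative choice)."""
    m, n = x.shape
    cm, cn = int(np.floor(beta * m)), int(np.floor(beta * n))
    row_edges = np.linspace(0, m, cm + 1).astype(int)   # partition boundaries (rows)
    col_edges = np.linspace(0, n, cn + 1).astype(int)   # partition boundaries (columns)
    coarse = np.empty((cm, cn), dtype=float)
    for i in range(cm):
        for j in range(cn):
            block = x[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            coarse[i, j] = f(block)
    return coarse

def prolong(xc, fine_shape):
    """Prolongation operator: map a coarse-grid array back to the fine grid by
    assigning each fine cell the value of the coarse cell that contains it
    (piecewise-constant interpolation)."""
    m, n = fine_shape
    cm, cn = xc.shape
    rows = np.minimum((np.arange(m) * cm) // m, cm - 1)
    cols = np.minimum((np.arange(n) * cn) // n, cn - 1)
    return xc[np.ix_(rows, cols)]
```

For example, restrict(k_fine, 0.5) would map a \(64\times 64\) permeability field to \(32\times 32\), and prolong(s_coarse, (64, 64)) would map a coarse saturation field back onto the fine grid.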

A typical agent-environment interaction using this framework is illustrated in Fig. 3. Note that the transition function \({\mathcal {P}}\) and the reward function \({\mathcal {R}}\) are subscripted with \(\beta \) to indicate the grid fidelity of the environment. The state \(S(t_m)\), action \(A(t_m)\) and reward \(R(t_m)\) are denoted with the shorthand notations \(S_m\), \(A_m\) and \(R_m\), respectively. Throughout the learning process, the policy is represented with states and actions corresponding to the high-fidelity grid environment. As a result, the actions and states passed to and from the environment undergo the restriction \(\Phi _{\beta }\) and prolongation \(\Phi ^{-1}_{\beta }\) operations at each time step, as shown in the environment box in Fig. 3.

Table 1 Restriction operator function for simulation parameters
Fig. 3 A typical agent-environment interaction in the proposed multigrid RL framework

The proposed framework is demonstrated for the PPO algorithm. PPO (Schulman et al. 2017) is a policy gradient algorithm that models a stochastic policy \(\pi _\theta (A|S)\) with a neural network (also known as the actor network). Essentially, the network parameters \(\theta \) are obtained by optimizing the objective function defined as

$$\begin{aligned} J_\textrm{ppo}(\theta ) = \hat{{\mathbb {E}}}_t \left[ \min \left( r_t(\theta )\, {\hat{Adv}}(S_t,A_t),\ \text {clip}\left( r_t(\theta ),1-\epsilon ,1+\epsilon \right) {\hat{Adv}}(S_t,A_t) \right) \right] , \end{aligned}$$
(5)

where \(r_t(\theta ) = \pi _{\theta }(A_t|S_t)/\pi _{\theta _\mathrm{old}}(A_t|S_t)\) and \(\theta _\mathrm{old}\) are the policy parameters before the policy update. The advantage function estimator \({\hat{Adv}}\) is calculated using the generalized advantage estimator (Schulman et al. 2015) derived from the value function \(V_t\). The value function estimator \({\hat{V}}_t\) is learned through a separate neural network, termed the critic network. The definitions of the advantage and value functions are provided in Appendix 2. In practice, a single neural network is used to represent both the actor and critic networks. The objective function for this integrated actor-critic network is the sum of the actor loss term (Eq. 5), the value loss term and the entropy loss term. For brevity, these latter loss terms are omitted and the policy network's objective function is treated as \(J_\textrm{ppo}(\theta )\) in the following discussion; however, they are taken into account when executing the framework. The reader is referred to Schulman et al. (2017) for a detailed definition of the policy network loss term. Algorithm 2 presents the pseudocode for the proposed multigrid RL framework.

Algorithm 2 Adaptive multigrid RL framework

The framework uses, in total, m values of the grid fidelity factor, which are represented by an array \(\varvec{\beta }=[\beta _1, \beta _2,\ldots ,\beta _m]\), where \(\beta _m=1\) and \(\beta _1<\beta _2<\cdots <\beta _m\). The environment with grid fidelity factor \(\beta _i\) is denoted as \({\mathcal {E}}_{\beta _i}\). The policy \(\pi _{\theta }(A|S)\) is initially learned with the environment \({\mathcal {E}}_{\beta _1}\) until the convergence criteria are met. The convergence criteria are checked using Algorithm 1 with predefined parameters \(\delta \) and n. Upon convergence, further policy iterations are learned using the environment \({\mathcal {E}}_{\beta _2}\), and so on, until the convergence criteria are met for the highest-grid-fidelity environment \({\mathcal {E}}_{\beta _m}\). A limit on the number of episodes to be executed at each grid level is also set. This is done by defining an episode limit array \({\textbf {E}}=[E_1, E_2,\ldots , E_m]\), where \(E_m\) is the total number of episodes to be executed and \(E_1<E_2<\cdots <E_m\). That is, for every environment with grid fidelity factor \(\beta _j\), the maximum number of episodes to be trained is limited to \(E_j\).
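The control flow of Algorithm 2 can be summarized by the following Python sketch, which reuses has_converged from the earlier sketch; make_env, train_iteration and evaluate_return are hypothetical placeholders for constructing the environment \({\mathcal {E}}_{\beta }\) (with its restriction/prolongation wrappers) and for running one PPO policy iteration, rather than the stable baselines API used in the study.

```python
def adaptive_multigrid_training(agent, make_env, betas, episode_limits,
                                delta, n, episodes_per_iteration):
    """Sketch of Algorithm 2: train the policy against environments of
    increasing grid fidelity, moving to the next level either when the
    convergence criterion (Eq. 4) is met or when the cumulative episode
    budget E_j of the current level is exhausted.

    betas          : [beta_1, ..., beta_m] with beta_1 < ... < beta_m = 1
    episode_limits : [E_1, ..., E_m], cumulative episode budgets per level
    """
    episodes = 0
    returns = []                                  # policy return after each iteration
    for beta, limit in zip(betas, episode_limits):
        env = make_env(beta)                      # environment E_beta (hypothetical helper)
        while episodes < limit:
            agent.train_iteration(env)            # one PPO policy iteration (hypothetical API)
            episodes += episodes_per_iteration
            returns.append(agent.evaluate_return(env))
            if has_converged(returns, delta, n):  # Algorithm 1 convergence check
                break                             # switch to the next (finer) grid level
    return agent
```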

3 Case Studies

Two test cases are designed, representing two distinct uncertainty distributions of permeability and control dynamics. For both cases, the values of the model parameters emulate those of the benchmark reservoir simulation case SPE-10 model 2 (Christie et al. 2001). Table 2 delineates these values for test cases 1 and 2. As per the convention in geostatistics, the distribution of \(\log {(k)}\) is assumed to be known and is denoted by \({\mathcal {G}}\). As a result, \(g=\log (k)\) is treated as the random variable in the problem description defined in Eq. 1. The uncertainty distributions for test cases 1 and 2 are denoted by \({\mathcal {G}}_1\) and \({\mathcal {G}}_2\), respectively.

Table 2 Reservoir model parameters

3.1 Uncertainty Distribution for Test Case 1

The log-permeability uncertainty distribution for test case 1 is inspired by the case study of Brouwer et al. (2001). Figure 4a shows a schematic of the spatial domain for this case. In total, 31 injector wells (illustrated with blue circles) and 31 producer wells (illustrated with red circles) are placed on the left and right edges of the domain, respectively. As illustrated in Fig. 4a, a linear high-permeability channel (shown in gray) passes from the left to the right side of the domain.

Fig. 4 Schematic of the spatial domain for test case 1 (a) and test case 2 (b)

The parameters \(l_1\) and \(l_2\) represent the distance of the channel from the upper edge of the domain on the left and right sides, respectively, while the width of the channel is denoted by w. These parameters follow uniform distributions defined as \(w \sim U(120, 360)\), \(l_1 \sim U(0,L-w)\) and \(l_2 \sim U(0,L-w)\), where L is the domain length. In other words, the random variable g follows the probability distribution \({\mathcal {G}}_1\), parameterized with w, \(l_1\) and \(l_2\), which is described as

$$\begin{aligned} g \sim {\mathcal {G}}_1(w, l_1, l_2). \end{aligned}$$

To be specific, the log-permeability g at a location (x, y) is formulated as

$$\begin{aligned} g(x, y) = \left\{ \begin{array}{ll} \log {(245)} &{} \text {if } \frac{l_2-l_1}{L}x+l_1 \le y \le \frac{l_2-l_1}{L}x+l_1+w, \\ \log {(0.14)} &{} \text {otherwise}, \end{array}\right. \end{aligned}$$

where x and y are the horizontal and vertical distances from the upper-left corner of the domain illustrated in Fig. 4a. The permeability values in the channel (245 mD) and in the rest of the domain (0.14 mD) are inspired by the peak values of the Upper Ness log-permeability distribution specified in the SPE-10 model 2 case.
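For illustration, a sample from \({\mathcal {G}}_1\) could be generated as in the sketch below; the default grid size and domain length are placeholders (not taken from the paper), while the channel geometry, the uniform sampling of w, \(l_1\) and \(l_2\), and the two permeability values follow the description above.

```python
import numpy as np

def sample_channel_log_perm(nx=64, ny=64, L=640.0, rng=None):
    """Draw one log-permeability sample from the channelized prior G1.

    A straight high-permeability channel (245 mD) of width w runs from depth
    l1 on the left edge to depth l2 on the right edge; the remainder of the
    domain has low permeability (0.14 mD). nx, ny and L are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = rng.uniform(120.0, 360.0)
    l1 = rng.uniform(0.0, L - w)
    l2 = rng.uniform(0.0, L - w)
    # cell-centre coordinates measured from the upper-left corner
    x = (np.arange(nx) + 0.5) * (L / nx)          # horizontal distance
    y = (np.arange(ny) + 0.5) * (L / ny)          # vertical distance
    X, Y = np.meshgrid(x, y, indexing="xy")
    upper = (l2 - l1) / L * X + l1                # upper boundary of the channel
    in_channel = (Y >= upper) & (Y <= upper + w)
    return np.where(in_channel, np.log(245.0), np.log(0.14))
```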

3.2 Uncertainty Distribution for Test Case 2

Test case 2 represents the uncertainty distribution of a smoother permeability field. Figure 4b illustrates the reservoir domain for this case. It comprises 14 producers (illustrated with red circles) located symmetrically on the left and right edges (7 on each edge) of the domain and 7 injectors (illustrated with blue circles) located on the central vertical axis of the domain. A prior distribution F is assumed over all locations \(x \in {\mathcal {X}}\) as

$$\begin{aligned}&F(x) = \mu + Z(x), \quad \text {where} \nonumber \\&{\mathbb {E}}(Z(x))=0, \nonumber \\&\text {Cov} (Z(x), Z({\tilde{x}})) = \sigma ^2 k(x,{\tilde{x}}), \end{aligned}$$
(6)

where the process variance \(\sigma \) is set to 5 and the exponential covariance function (kernel) \(k(x,{\tilde{x}})\) is defined as

$$\begin{aligned} k(x,{\tilde{x}}) = \exp \left[ - \left( \frac{(x_1 - {\tilde{x}}_1)^2}{l_1^2} + \frac{(x_2 - {\tilde{x}}_2)^2}{l_2^2} \right) ^{1/2} \right] , \end{aligned}$$

where the parameters \(l_1\) and \(l_2\) are set to 620 ft (the width of the domain) and 62 ft (10% of the domain width), respectively. The posterior distribution is conditioned on the observed log-permeability vector \({\textbf {g}}(x') = [ g(x'_1), g(x'_2), \ldots , g(x'_n)]\), where each observation corresponds to a log-permeability value of 2.41 at a well location (that is, \(n=21\), since there are, in total, 21 wells in this case). From the principle of ordinary kriging, the posterior distribution \({\mathcal {G}}_2\) for the log-permeability at a location \(x \in {\mathcal {X}}\) is a normal distribution defined as

$$\begin{aligned} g(x) \sim&\ {\mathcal {G}}_2( {\hat{g}}(x), {\hat{s}}^2(x)), \quad \text {where} \\ {\hat{g}}(x) =&\ {\hat{\mu }} + {\textbf {k}}(x', x)^\intercal {\textbf {k}}(x',x')^{-1} ({\textbf {g}}(x')-{\textbf {1}}{\hat{\mu }}), \\ {\hat{s}}^2(x) =&\ \sigma ^2 \Bigg [ 1 - {\textbf {k}}(x',x)^\intercal {\textbf {k}}(x',x')^{-1} {\textbf {k}}(x',x)\\&+ \frac{(1-{\textbf {1}}^\intercal {\textbf {k}}(x',x')^{-1} {\textbf {k}}(x',x))^2}{ {\textbf {1}}^\intercal {\textbf {k}}(x',x')^{-1} {\textbf {1}}} \Bigg ], \end{aligned}$$

where \({\textbf {k}}(x',x)\) is the n-dimensional vector whose ith entry is \(k(x'_i, x)\), \({\textbf {k}}(x',x')\) is the \(n\times n\) matrix whose entry at (i, j) is \(k(x'_i, x'_j)\), \({\textbf {1}}\) is an n-dimensional vector of ones (\({\textbf {1}} = [1,1,\ldots , 1]^\intercal \)) and \({\hat{\mu }}\) is an estimate of the global mean \(\mu \), obtained from the kriging model based on the maximum likelihood estimate of the distribution F(x) (from Eq. 6) for the observations \({\textbf {g}}(x')\), and is formulated as

$$\begin{aligned} {\hat{\mu }} = \frac{{\textbf {1}}^\intercal {\textbf {k}}(x',x')^{-1} {\textbf {g}}(x')}{{\textbf {1}}^\intercal {\textbf {k}}(x',x')^{-1} {\textbf {1}}}. \end{aligned}$$

The log-permeability distribution \({\mathcal {G}}_2\), is created with an ordinary kriging model using the geostatistics library gstools (Müller and Schüler 2019). In the simulation, samples of the permeability fields are obtained with a clockwise rotation angle of \(\pi /8\).
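For completeness, the posterior mean and variance above can also be evaluated directly with a few lines of linear algebra, as in the NumPy sketch below (the study itself uses the ordinary kriging model in gstools). The anisotropic length scales, the value \(\sigma =5\) and the observed value 2.41 follow the text; the function names and argument layout are illustrative.

```python
import numpy as np

def exp_kernel(p, q, l1=620.0, l2=62.0):
    """Anisotropic exponential covariance kernel k(x, x~) from the text."""
    d = np.sqrt(((p[:, None, 0] - q[None, :, 0]) / l1) ** 2
                + ((p[:, None, 1] - q[None, :, 1]) / l2) ** 2)
    return np.exp(-d)

def kriging_posterior(wells, g_obs, targets, sigma2=25.0):
    """Ordinary kriging posterior mean g_hat and variance s_hat^2 at the
    target locations, following the equations above.

    wells   : (n, 2) array of well coordinates x'
    g_obs   : (n,) observed log-permeabilities g(x'), e.g. all equal to 2.41
    targets : (q, 2) array of prediction locations x
    sigma2  : process variance sigma^2 (sigma = 5 as stated in the text)
    """
    K = exp_kernel(wells, wells)                  # k(x', x'), n x n
    k = exp_kernel(wells, targets)                # k(x', x),  n x q
    ones = np.ones(len(wells))
    K_inv = np.linalg.inv(K)
    mu_hat = (ones @ K_inv @ g_obs) / (ones @ K_inv @ ones)
    g_hat = mu_hat + k.T @ K_inv @ (g_obs - mu_hat * ones)
    s2 = sigma2 * (1.0 - np.einsum("iq,ij,jq->q", k, K_inv, k)
                   + (1.0 - ones @ K_inv @ k) ** 2 / (ones @ K_inv @ ones))
    return g_hat, s2
```

Evaluating this on the simulation grid yields the per-cell posterior mean and variance of \({\mathcal {G}}_2\) given the 21 well observations.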

3.3 State, Action and Reward Formulation

The PPO algorithm attempts to learn the parameters \(\theta \) of the policy neural network \(\pi _{\theta }(A|S)\). Each episode (i.e., an entire simulation over the temporal domain \({\mathcal {T}}\)) is divided into five control steps. The episode time corresponding to the mth control step is denoted by \(t_m\), where \(m \in \{1,2,\ldots , 5 \}\). The state S is represented by an observation vector that consists of the saturation and pressure values at the well locations \(x'\). Since the saturation values at the injector wells are always one, regardless of the time \(t_m\), they are omitted from the observation vector. Consequently, the observation vector is of size \(2n_p + n_i\) (i.e., \(n_s = 93\) for test case 1 and \(n_s = 35\) for test case 2). Note that this observation vector forms the input to the policy network \(\pi _{\theta }(A|S)\). The action A is represented as a vector of flow control values for all the injector and producer wells. The action vector A consists of, in total, \(n_p + n_i\) values (that is, \(n_a=62\) for test case 1 and \(n_a=21\) for test case 2). To maintain the constraints defined in Eq. 1d, the action vector is represented by a weight vector \(w \in {\mathbb {R}}^{n_a}\), such that \(0.001\le w_j \le 1\). Each weight value \(w_j\) corresponds to the proportion of flow through the jth well. As a result, the action vector is written as \(( w_1,\ldots ,w_{n_i},w_{n_i+1},\ldots ,w_{n_i+n_p} )\). The flow through the jth injector, \(A_j\), is computed such that the constraint defined in Eq. 1d is satisfied

$$\begin{aligned} A_j = -\frac{w_j}{\sum _{i=1}^{n_i}w_i}c. \end{aligned}$$

Similarly, the flow through the jth producer, \(A_{j+n_i}\), is calculated as

$$\begin{aligned} A_{j+n_i} = \frac{w_{j+n_i}}{\sum _{i=1}^{n_p}w_{i+n_i}}c. \end{aligned}$$
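A small sketch of this mapping from the policy's weight vector to well flow rates is given below; it follows the two expressions above, including their sign convention, and the function name is illustrative.

```python
import numpy as np

def weights_to_flows(w, n_i, n_p, c):
    """Map the policy output w (one weight per well, clipped to [0.001, 1])
    to well flow rates that satisfy the total-rate constraint of Eq. 1d.

    The first n_i entries of w belong to injectors, the remaining n_p entries
    to producers; signs follow the equations in the text.
    """
    w = np.clip(np.asarray(w, dtype=float), 0.001, 1.0)
    w_inj, w_prod = w[:n_i], w[n_i:n_i + n_p]
    a_inj = -c * w_inj / w_inj.sum()     # injector flows (their magnitudes sum to c)
    a_prod = c * w_prod / w_prod.sum()   # producer flows (sum to c)
    return np.concatenate([a_inj, a_prod])
```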

The reward function, as defined in Eq. 2, is divided by the total pore volume (\(\phi \times l_x \times l_y \)) as a form of normalization, so that the reward lies in the range [0, 1]. The normalized reward represents the recovery factor, or sweep efficiency, of the contaminated fluid, i.e., the fraction of contaminants swept out of the domain. For example, a recovery factor of 0.65 means that 65% of the contaminants are swept out of the domain using water flooding. In the context of a groundwater decontamination problem, the optimal controls correspond to the well controls that maximize the percentage of contaminants swept out of the reservoir.

3.4 Multigrid Framework Formulations

The proposed framework is demonstrated using three levels of simulation grid fidelity. Note that the \(\beta =1\) case corresponds to the finest level of simulation, which provides an accurate estimate of the recovery factor. In this study, the aim is to exploit the simulations with \(\beta <1\), which by definition correspond to computationally cheaper simulation run times. Furthermore, the \(\beta <1\) simulations are deliberately chosen such that their recovery factor estimates deviate from the accurate \(\beta =1\) estimates. Table 3 lists the discretization grid sizes corresponding to these grid fidelity factors for both test cases.

Table 3 Grid fidelity factor and corresponding grid size

Figures 5a and b plot the recovery factor estimates (with all wells equally open) for these grid fidelity factors for each permeability sample \(k_i\) in test cases 1 and 2, respectively.

Fig. 5 Comparison of recovery factor estimates with \(\beta =1\), \(\beta =0.5\) and \(\beta =0.25\) for test case 1 (a) and test case 2 (b)

The deviation from the accurate recovery factor (that is, for \(\beta =1\)) is visible for both \(\beta =0.5\) and \(\beta =0.25\) in both cases. As expected, the recovery factor estimates with \(\beta =0.25\) show a higher deviation from the \(\beta =1\) estimates than those with \(\beta =0.5\). To show the effectiveness of the proposed framework, the obtained results are compared across single-grid and multigrid frameworks. The results for a single-grid framework are the same as those obtained using the classical PPO algorithm, where the environment has a fixed fidelity factor throughout the policy-learning process. This is done by setting the grid fidelity factor array \(\varvec{\beta }\) and the episode limit array \({\textbf {E}}\) to a single value each in Algorithm 2. The parameter n in the convergence criterion procedure (delineated in Algorithm 1) is set to infinity. In other words, the convergence criterion is not checked and the policy learning takes place for a predefined number of episodes. In total, three such single-grid experiments are performed, corresponding to \(\beta =0.25\), \(\beta =0.5\), and \(\beta =1.0\). In addition, two multigrid experiments are performed to demonstrate the effectiveness of the proposed framework. The first multigrid experiment is referred to as fixed, where the convergence criterion is left unchecked, just as in the single-grid frameworks. Multiple grid levels are defined by setting the grid fidelity factor array \(\varvec{\beta }\) and the episode limit array \({\textbf {E}}\) as arrays of multiple values, one fidelity factor and its corresponding episode count per level. In the fixed multigrid framework, policy learning takes place by updating the fidelity factor of the environment according to \(\varvec{\beta }\) without checking the convergence criterion (i.e., by setting \(n=\infty \)). Second, the parameters of the adaptive multigrid framework are set similarly to those of the fixed multigrid framework, except for the convergence criterion parameters n and \(\delta \). Table 4 delineates the experiments and their corresponding parameters for test cases 1 and 2.

Table 4 Multigrid framework experiments

Figure 6 provides visualization for the effect of fidelity factor \(\beta \), on the simulation in test case 1.

Fig. 6 Effect of grid fidelity factor \(\beta \) on the environment for test case 1: (a) on a sample of log-permeability, (b) on the corresponding saturation and (c) on the simulation run time

Figure 6a and b show the log-permeability and saturation plots corresponding to \(\beta =0.25\), \(\beta =0.5\) and \(\beta =1.0\). Furthermore, Fig. 6c illustrates the effect of grid fidelity on the simulation run time for a single episode (shown on the left with a box plot of 100 simulations) and the equivalent \(\beta =1\) simulation run time for each grid fidelity factor (shown on the right). The equivalent \(\beta =1\) simulation run time is defined as the ratio of the average simulation run time for a grid fidelity factor \(\beta \) to that corresponding to \(\beta =1\). This quantity is used as a scaling factor to convert the number of simulations for any value of \(\beta \) into its equivalent number of simulations as if they were performed with \(\beta =1\). Similar plots for test case 2 are shown in Fig. 7.

Fig. 7 Effect of grid fidelity factor \(\beta \) on the environment for test case 2: (a) on a sample of log-permeability, (b) on the corresponding saturation and (c) on the simulation run time

The results obtained using the proposed framework are evaluated against benchmark optimization results. These benchmark optimal results are obtained using the differential evolution (DE) algorithm (Storn and Price 1997). For both optimization methods (PPO and DE), multi-processing is used to reduce the total computational time. However, the parallelism behavior differs considerably between the PPO and DE algorithms. For instance, the neural networks are back-propagated synchronously at the end of each PPO policy iteration, which incurs extra computational time in waiting and data distribution. For this reason, the computational cost comparison among the various experiments is performed by comparing the number of simulation runs in each experiment. The PPO algorithm for the proposed framework is executed using the stable baselines library (Raffin et al. 2019), while Python's SciPy library (Virtanen et al. 2020) is used for the DE algorithm. Appendix 3 delineates all the algorithm parameters used in this study.

4 Results

The control policy in which the injector and producer wells are equally open throughout the entire episode is called the base policy. Under such a policy, water flooding takes place predominantly in the high-permeability region, leaving the low-permeability region inefficiently swept. The optimal policy for these test cases is to control the producer and injector flows so as to mitigate this imbalance in water flooding. The optimal policy learned using reinforcement learning for test case 1 shows, on average, around a 12% improvement in the recovery factor over the base policy, while for test case 2 the average improvement is on the order of 25%.

Figure 8 illustrates the policy return \(R^{\pi (A|S)}\) corresponding to all the frameworks listed in Table 4 for test case 1. At the beginning of the learning process, the policy return values for the single-grid frameworks keep improving and eventually converge to a maximum value as the policy converges to an optimal policy. Note that for lower values of the grid fidelity factor \(\beta \), the optimal policy return is also lower. In other words, coarsening the simulation grid discretization is also reflected in an overall reduction in the recovery factor. This is due to the lower accuracy of the state and action representations for environments with \(\beta <1\). However, an overall computational gain is observed as a result of the coarser grid sizes. The simulation run times corresponding to \(\beta =0.25\) and \(\beta =0.5\) show reductions of around 66% and 54%, respectively, compared to \(\beta =1\). The results of the multigrid frameworks are compared with the single-grid framework corresponding to \(\beta =1\), which refers to the classical PPO algorithm using an environment with a fixed high-fidelity grid factor. As shown in the plots at the center and right of Fig. 8, both multigrid frameworks converge to the optimal policy achieved using the high-fidelity single-grid framework. In the fixed multigrid framework, the fidelity factor is incremented at a fixed interval of 25,000 episodes. The adaptive framework is provided with the same interval but with an additional convergence check within each interval. For the multigrid learning plots shown in Fig. 8 (center and right), the equivalent number of episodes corresponding to the environment with \(\beta =1\) is shown as a secondary horizontal axis.

Fig. 8 Plots of policy return versus number of episodes for test case 1

In this way, the computational cost of the multigrid frameworks can be directly compared to that of the single-grid framework (with \(\beta =1\)). The equivalent number of \(\beta =1\) episodes corresponding to episodes with a certain value of \(\beta \) is computed by multiplying that number by the equivalent \(\beta =1\) simulation run time. For example, the number of episodes with \(\beta =0.25\) is multiplied by 0.37. The fixed multigrid framework takes 45,496 equivalent \(\beta =1\) episodes to achieve an equally optimal policy to the one obtained with 75,000 episodes using the single-grid (\(\beta =1\)) framework. Similarly, the same is achieved with just 28,303 equivalent \(\beta =1\) episodes using the adaptive multigrid framework. In other words, reductions of around 38% and 61% in simulation run time are observed using the fixed and adaptive multigrid frameworks, respectively. Further, the robustness of the policies learned using these frameworks is compared by applying them to the highest-fidelity environment with random permeability samples from the distribution \({\mathcal {G}}_1\) that were never seen during the policy learning process. Figure 9a shows the plots of these unseen permeability fields, while the corresponding results obtained using these frameworks are plotted in Fig. 9b.

Fig. 9 Evaluation of learned policies for test case 1: (a) evaluation samples of the log-permeability distribution \({\mathcal {G}}_1\), (b) recovery factor (in %) versus evaluation sample index (from a)

The optimal results obtained using the differential evolution (DE) algorithm are provided as a benchmark (marked as DE in Fig. 9b). Note that the DE algorithm is not a suitable method for solving the robust optimal control problem, since it can provide optimal controls only for particular permeability samples, as opposed to the PPO algorithm, where the learned policy is applicable to all samples of the permeability distribution. However, the DE results are used as reference optimal results, obtained by direct optimization on a sample-by-sample basis. The equivalence in the optimality of the policies learned using these three experiments can be observed from the closeness of their corresponding optimal recovery factors. Figure 10 visualizes the learned policy for an example permeability sample in test case 1.

Fig. 10 Illustration of learned optimal control policies for test case 1

In this figure, the results are shown for permeability sample index 4 from Fig. 9a, where a high-permeability channel passes through the lower region of the domain. The optimal policy in this case is to restrict the flow through the injector and producer wells in the vicinity of the channel. The superposed comparison of the optimal results for the base case, differential evolution, the single-grid framework (with \(\beta =1\)), the fixed multigrid framework and the adaptive multigrid framework shows that the optimal policy is learned successfully using the proposed framework.

For test case 2, similar results are observed, as shown in Fig. 11.

Fig. 11 Plots of policy return versus number of episodes for test case 2

The single-grid algorithms converge to an optimal policy in a total of 150,000 episodes. The fixed multigrid algorithm is trained with a 50,000-episode interval for each grid fidelity factor, as shown in the central plot of Fig. 11. The optimal policy is learned in 94,141 equivalent \(\beta =1\) episodes, thus saving around 38% of the simulation run time. The adaptive multigrid framework further reduces the computational cost by achieving the optimal policy in 39,582 equivalent \(\beta =1\) episodes (a simulation time reduction of about 76% with respect to the \(\beta =1\) single-grid framework). Figure 12 illustrates the results of the policy evaluation on unseen permeability samples from the distribution \({\mathcal {G}}_2\).

Fig. 12 Evaluation of learned policies for test case 2: (a) evaluation samples of the log-permeability distribution \({\mathcal {G}}_2\), (b) recovery factor (in %) versus evaluation sample index (from a)

The permeability samples are shown in Fig. 12a and the optimal recovery factors corresponding to the learned policies are plotted in Fig. 12b. Figure 13 shows the optimal controls for permeability sample index 5 from Fig. 12a.

Fig. 13 Illustration of learned optimal control policies for test case 2

The optimal policy obtained using the differential evolution algorithm corresponds to increasing the flow through the injector wells located in the low-permeability region while restricting the flow through the producer wells for which the water cut-off is reached. The policies learned using the RL framework take advantage of the location and orientation of the high-permeability regions. In this case, the optimal policy is achieved by controlling the well flows such that the flow traverses the permeability channels (that is, the flow is more or less perpendicular to the permeability orientation).

5 Conclusion

An adaptive multigrid RL framework is introduced to solve a robust optimal well control problem. The proposed framework is designed to be general enough to be applicable to similar optimal control problems governed by sets of time-dependent nonlinear PDEs. Numerically, a significant reduction in the computational cost of policy learning is observed compared to the results of the classical PPO algorithm. In the presented case studies, computational savings in simulation runtime of 61% for test case 1 and 76% for test case 2 are observed. However, note that these results are highly dependent on the right choice of the algorithm hyperparameters (e.g., \(\delta \), n, \(\varvec{\beta }\) and \({\textbf {E}}\)), which were tuned heuristically. As a future direction for this research, the aim is to find the values of \(\varvec{\beta }\) that maximize the overall computational savings. Furthermore, the policy transfer was performed sequentially in the current framework, which appeared to work well. However, to improve the generality of the proposed framework, it would be important to study the effect of the sequence of policy transfers on the overall performance.