Robust optimal well control using an adaptive multi-grid reinforcement learning framework

Reinforcement learning (RL) is a promising tool to solve robust optimal well control problems where the model parameters are highly uncertain, and the system is partially observable in practice. However, RL of robust control policies often relies on performing a large number of simulations. This could easily become computationally intractable for cases with computationally intensive simulations. To address this bottleneck, an adaptive multi-grid RL framework is introduced which is inspired by principles of geometric multi-grid methods used in iterative numerical algorithms. RL control policies are initially learned using computationally efficient low fidelity simulations using coarse grid discretization of the underlying partial differential equations (PDEs). Subsequently, the simulation fidelity is increased in an adaptive manner towards the highest fidelity simulation that correspond to finest discretization of the model domain. The proposed framework is demonstrated using a state-of-the-art, model-free policy-based RL algorithm, namely the Proximal Policy Optimisation (PPO) algorithm. Results are shown for two case studies of robust optimal well control problems which are inspired from SPE-10 model 2 benchmark case studies. Prominent gains in the computational efficiency is observed using the proposed framework saving around 60-70% of computational cost of its single fine-grid counterpart.


Introduction
Optimal control problem involves finding controls for a dynamical system such that a certain objective function is optimized over a pre-defined simulation time. Recently, reinforcement learning (RL) has been demonstrated as an effective method to solve stochastic optimal control problems in fields like manufacturing (Dornheim et al., 2020), energy (Anderlini et al., 2016) and fluid dynamics (Rabault et al., 2019). RL, being virtually a stochastic optimisation method, involves a huge number of exploration and exploitation attempts in order to learn the optimal control policy. As a result, learning the optimal policy requires a large number of simulations of the controlled dynamical system which is often computationally expensive. In this paper, an adaptive multi-grid RL framework is introduced to reduce overall computational cost of number of simulations required to learn the optimal control policy.
Various research studies have shown the effectiveness of using multi-grid method to improve the convergence rate of reinforcement learning. Anderson and Crawford-Hines (1994) extend Q-Learning by casting it as a multi-grid method and has shown a reduction in the number of updates required to reach a given error level in the Q-function. Ziv and Shimkin (2005) and Pareigis (1996) formulated the value function learning process with a Hamilton-Jacobi-Bellman (HJB) equation which is solved using algebraic multi-grid methods. Albeit the effectiveness of this strategy, HJB formulation is only feasible when the model dynamics are well defined. As a result, these methods cannot to be applied to problems where the model dynamics are an approximate representation of reality. Li and Xia (2015) used multi-grid approach to compute tabular Q values for energy conservation and comfort of HVAC in buildings which is applicable to certain simple RL problems with finite and discrete state-action space. In this paper, the aim is to present a generalized multi-grid RL approach which can be applied on both, discrete and continuous, state and action space where HJB formulation may not be possible for instance, when the transition in model dynamics is not necessarily differentiable and/or when the model is stochastic. This framework is essentially inspired by the principles of geometric multi-grid methods used in iterative numerical algorithms. The optimal policy learning process is initiated using a low fidelity simulation that correspond to a coarse grid discretization of the underlying partial differential equations (PDEs). This learned policy is then reused to further train it using high fidelity simulations in an adaptive and incremental manner. Robustness of the policy learned using this framework is finally evaluated against uncertainties in the model dynamics.
In reinforcement learning literature, such a learning process is categorized as transfer learning. The idea behind transfer learning is that instead of learning directly on the target task, the agent can first train on one or more source task(s), and transfer the knowledge acquired to aid in solving the target task (Taylor and Stone, 2009). In the context of current study, highest fidelity simulation correspond to the target task which is assumed to have the fine-grid discretization which guarantees good approximation of the output quantities of interest with the accuracy required by the problem at hand. Low grid fidelity simulations that compromises on the accuracy of these quantities, on the other hand, correspond to source tasks. These low grid fidelity simulations are generated using a degree of freedom parameter called grid fidelity factor (much like in the study done by Narvekar et al. (2016)). Transfer learning is a much broader sub-domain of RL that covers knowledge transfer in the form of data samples (Lazaric et al., 2008), policies (Fernández et al., 2010), models (Fachantidis et al., 2013) or value functions (Taylor and Stone, 2005). In this study, the knowledge transfer is done in the form of the policy for a model-free, on-policy algorithm called proximal policy optimisation (PPO). Since the policy is designed for the state and actions corresponding to the highest fidelity simulation we employ a mapping function that maps states and actions from low fidelity simulations to high fidelity simulations and vice versa. This is done by defining restriction (mapping from high to low fidelity simulation) and prolongation (mapping from low to high fidelity simulation) operators which are normally found in classical geometric multi-grid methods.
Effectiveness of this multi-grid RL framework is demonstrated for robust optimal well control problem which is a subject of intensive research activities in subsurface reservoir management (van Essen et al., 2009;Roseta-Palma and Xepapadeas, 2004;Brouwer et al., 2001). For this problem, the dynamical system under consideration is non-linear and, in practice, is partially observable since the data is only available at a sparse set of points (i.e. well locations). Furthermore, the subsurface model parameters are highly uncertain due to sparsity of the available field data. Optimal well control problem consists of optimizing the control variables like valve openings of wells in order to maximize sweep efficiency of injector fluid throughout the reservoir life. Reservoir permeability field is considered as an uncertain model parameter for which the uncertainty distribution is known. Although the proposed framework is demonstrated for robust optimal well control problem, it is designed to be general enough to be applicable to similar optimal control problems governed by a set of PDEs. Two test cases -both representing a distinct model parameter uncertainty and control dynamics -are used to demonstrate the computational gains of using the multi-grid idea.
The outline of the rest of this paper is as following: Section 2 provides the problem description and proposed framework to solve robust optimal well control problem. Section 3 details the model parameters for the two case studies designed for demonstration. Results of the proposed framework on these two case studies are demonstrated in section 4. Finally, section 5 concludes with the research study summary and an outlook of future research directions.

Methodology
Fluid flow control in subsurface reservoirs has many engineering applications ranging from the financial aspects of efficient hydrocarbon production to the environmental problems of contaminated removal from polluted aquifers (Whitaker, 1999). In this paper, a canonical single-phase subsurface flow control problem (also referred as robust optimal well control problem) is studied where water is injected in porous media to displace a contaminant. This process is commonly modeled using an advection equation for tracer flow through porous media (also referred as Darcy flow through porous media) over the temporal domain T = [t 0 , t M ] ⊂ R and spatial domain X ⊂ R 2 . In the context of fluid displacement (e.g. groundwater decontamination), the tracer corresponds to clean water injected in the reservoir from the injector wells and the non-traced fluid corresponds to the displaced contaminated water from the reservoir through producer wells. The source and sink locations within the modeled domain correspond to injector and producer wells, respectively. The tracer flow models water flooding with the fractional variable s(x, t) ∈ [0, 1] (also referred as saturation) which represents the fraction of injected clean water to the displaced contaminated water at location x ∈ X and time t ∈ T . The fluid flow in and out of the domain is represented with a(x, t) which is treated as source/sink terms of the governing equation. Set of well locations are denoted as x ∈ X (where X ⊂ X ). In other words, a(x, t) is assigned to zero everywhere in the domain X except the set of locations x . The controls a + (x, t) (formulated as max(0, a(x, t))) and a − (x, t) (formulated as min(0, a(x, t))) represent the injector and producer flow controls, respectively (note that a = a + + a − ). Task of the problem under consideration, is to find optimal controls a * (x , t) which is the solution of following closed-loop optimisation problem: The objective function defined in equation (1a) represents the total displaced fluid flow out of the reservoir (e.g. contaminated water production) and is maximized on the finite time interval T . The intigrand in this function is referred as Lagrangian term in control theory and is often denoted by L(s, a). The water flow trajectory s(x, t), is governed by advection equation (1b) which is solved given the velocity field v, which is obtained from the Darcy's law: v = −(k/µ)∇p. The pressure p(x, t) ∈ R, is obtained from the pressure equation, −∇ · (k/µ)∇p = a. Porosity φ(x, ·), permeability k(x, ·), and viscosity µ(x, ·), are the model parameters. Permeability k, represents the model uncertainty and is treated as a random variable that follows a known probability density function K with K as its domain. The initial and no flow boundary conditions are defined in equation (1c), where n denotes outward normal vector from the boundary of X . The constraint defined in equation (1d) represent the fluid incompressibility assumption along with the fixed total source/sink term c which represents total water injection rate in the reservoir. In a nutshell, the optimisation problem provided in equations (1) is solved to find the optimal controls a * (x , t) such that they are robustly optimal over the entire permeability uncertainty domain, K.

RL framework
According to RL convention, the optimal control problem defined in equation (1) is modeled as a Markov decision process which is defined as a quadruple S, A, P, R . Here, S ⊂ R ns is set of all possible states with the dimension n s , A ⊂ R na is a set of all possible actions with the dimension n a . The state S, is represented with the saturation s(x, ·) and pressure p(x, ·) values over the entire domain X . The action A, is represented with an array of well control values a(x , ·). More details of this array like representation of action are presented in section 3.3. The optimal control problem defined in equation (1) is discretized into M control steps and as a result, its solution is a set of optimal control values a * (x , t 1 ), a * (x , t 2 ), . . . , a * (x , t M ) where t 0 < t 1 < t 2 < · · · < t M . The transition function P : S × A → S, is assumed to follow Markov property. That is, transition to the state S(t m+1 ) is obtained by executing the actions A(t m ) when in the state S(t m ). Such transition function is obtained by discretizing equation (1b). For a transition from the state S(t m ), to the state S(t m+1 ), the real valued reward R(t m+1 ), is calculated as R(t m+1 ) = R(S(t m ), A(t m ), S(t m+1 )), where R : S × A × S → R is the reward function. The reward function is obtained by discretizing the objective function (equation (1a)) into control steps such that, The optimal controls are obtained by learning a control policy function which is defined as π : S → A. This function is denoted as π(A|S) and is generally represented with a neural network. Essentially, the control policy π(A|S), maps a given state S(t m ), into an action A(t m ). For an optimal control problem, with M control steps, the goal of reinforcement learning is to find an optimal policy π * (A|S) such that the expected reward G = M m=1 γ m−1 R(t m ), is maximized. Note that immediate rewards R, are exponentially decayed by the discount rate γ ∈ [0, 1]. The discount rate represents how myopic the learned policy is, for instance, learned policy is considered completely myopic when γ = 0. The controller, which is also referred to as an agent, follows the policy and explores various control trajectories by interacting with the environment which consists of a transition function P and a reward function R. The data gathered by these control trajectories are used to update the policy towards optimality. Each such update of the policy is referred to as the policy iteration. In RL literature, a single complete control trajectory is referred to as an episode. Essentially, RL algorithms attempt to learn the optimal policy π * (A|S) from a randomly initialized policy π(A|S), by exploring state-action space by executing a high number of episodes.
In order to represent the variability in permeability, a finite number l, of well spread uncertainty distribution samples is chosen. This is achieved with a clustering analysis (please refer appendix A for cluster analysis formulation used in this paper) of the domain K. The sample vector k = {k 1 , k 2 , · · · k l }, is constructed with samples of the distribution K, which are located nearest to the cluster centers. The policy π * (A|S), is learned by randomly selecting the parameter k from the training vector k at the beginning of every episode. The policy return R π(A|S) , is computed by averaging the returns of policy π(A|S; k i ) (policy applied on the simulation where permeability is set to k i ) on l simulations, which is formulated as, In optimal well control problems, the system is partially observable, that is, reservoir information is only available at well locations throughout the reservoir life cycle. In order to accommodate this fact, the agent is provided with the available observation as its state. For this study, observation is represented with a set of saturation and pressure values at the well locations x . Note that, with such representation of states, the underlying assumption of Markov property of the transition function is approximated.

Learning convergence criteria
The optimal policy convergence is detected by monitoring the policy return R π(A|S) , after every policy iteration. Conventionally, when this value converges to a maximum value, the optimal policy is assumed to be learned. The convergence criteria for ith policy iteration is defined as, where δ i is the return tolerance at ith policy iteration, δ is stopping tolerance and is a small nonzero number used to avoid division by zero. The convergence of policy learning is often flat near the optimal result. For this reason, the convergence criteria defined in equation (4) is checked for the latest n consecutive policy iterations. For instance, if r is the array of monitored values of R π(A|S) at all policy iterations, the policy π(A|S) is considered converged when the convergence criteria (equation (4)) for last n policy iterations is met. Algorithm 1 delineates the pseudocode for this convergence criteria. Figure 1 illustrates effect of n and δ on convergence criteria for an example of end if 10: end procedure reinforcement learning process. Policy return plot is shown in blue color where each value at policy iteration is shown with a dot. The corresponding return tolerance is plotted in gray color which is represented in percentage format (δ i × 100, where δ i is computed from equation (4)). It can be seen that the convergence criteria (denoted with markers on these plots) is more stringent when the stopping tolerance δ, is smaller and consecutive policy iteration steps n, are higher.

Adaptive multi-grid RL framework
An adaptive multi-grid RL framework is proposed where, essentially, the policies learned using lower grid fidelity environments are transferred and trained with higher fidelity environments. The grid fidelity for an environment is described with the factor β ∈ (0, 1]. The environment with β = 1 is assumed to have the fine-grid discretization which guarantees good approximation fluid flow production out of the domain as defined in equation (1a). For any environment where β < 1, the environment grid-size is coarsened with the factor of β. For instance, if a high fidelity environment where β = 1 corresponds to simulation with grid size 64 × 64, the simulation grid size is reduced to 32 × 32 when β is set to 0.5. Restriction operator Φ β (), is used to coarsen the high fidelity simulation parameters with the factor of β. This is done by partitioning a finer grid of size m × n (corresponding to β = 1) into the coarser dimensions βm × βn (corresponding to β < 1 where · is the floor operator) and computing these coarse grid cell values as a function f, of values in the corresponding partition. Figure 2a illustrate this restriction operator for a variable x ∈ R n×m . The function f, for different parameters of the reservoir simulation are listed in table 1. On the other hand, prolongation operator Φ −1 β (), maps a coarse grid environment parameters to fine grid as shown in figure 2b. A typical agent-environment interaction using this framework is illustrated in figure  3. Note that the transition function P, and reward function R, are sub-scripted with β to indicate the grid fidelity of the environment. State S(t m ), action A(t m ) and reward R(t m ) are denoted with shorthand notations, S m , A m and R m , respectively. Throughout the learning process the policy is represented with states and actions corresponding to high fidelity grid environment. As a result, actions and states, to and from the environment, undergo the restriction Φ β and prolongation Φ −1 β operations at each time-step as shown in the environment box of the figure 3.
The proposed framework is demonstrated for PPO algorithm. PPO (Schulman et al., 2017) is a policy gradient algorithm that models the stochastic policy π θ (A|S), with a neural network (also referred to as the actor network). Essentially, the network parameters θ, are obtained by optimizing for the objective function, where r t (θ) = π θ (A t |S t )/π θ old (A t |S t ) and θ old correspond to the policy parameters before the policy update. The advantage function estimatorÂdv, is computed using generalized advantage estimator (Schulman et al., 2015) which is derived from the value function V t . The value function estimatorV t is learned through a separate neural network termed as the critic network. Definitions of advantage and value functions are provided in appendix B. In practice, a single neural network is used to x 11 x 12 x 1m x 21 x 22 x 2m x 11 x 1m x n 1 x n m n × m x 11 x 11 x 1m x 11 x 11 x 1m Figure 2: illustration for the restriction operator Φ β and prolongation operator Φ −1 β for a parameter x Environment, E β Figure 3: A typical agent-environment interaction in the proposed multi-grid RL framework represent both, actor and critic networks. The objective function for this integrated actor-critic network is the summation of actor loss term (equation (5)), value loss term and entropy loss term. For the purpose of maintaining brevity in our description these latter loss terms are omitted and the policy network's objective function is treated as J ppo (θ) in further discussion. However, please note that they are considered while executing the framework. Readers are referred to Schulman et al. (2017) for the detailed definition of policy network loss term. Algorithm 2 presents the pseudocode for the proposed multi-grid RL framework. The framework consists of, in total, m values of grid fidelity factor which are represented with an array β = [β 1 , β 2 , . . . , β m ], where β m = 1 and β 1 < β 2 < . . . < β m . The environment is denoted as E βi , which represents the environment with the grid fidelity factor β i . The policy π θ (A|S) is learned initially with environment, E β1 , until the convergence criteria is met. The convergence criteria is checked using the algorithm 1 with predefined parameters δ and n. Upon convergence, further policy iterations are learned using the environment E β2 , and so on until the convergence criteria is met for the highest grid fidelity environment E βm . A limit for number of episodes to be executed at each grid level is also set. This is done by defining an episode limit array E = [E 1 , E 2 , . . . , E m ], where E m is total number of episodes to be executed and E 1 < E 2 < . . . < E m . That is, for every environment with grid fidelity factor β j the maximum number of episodes to be trained is limited to E j .

Case studies
Two test cases are designed representing two distinct permeability uncertainty distributions and control dynamics. For both cases, the values for model parameters emulate those in the benchmark reservoir simulation cases, SPE-10 model 2 (Christie et al., 2001). Table 2 delineates these values for test case 1 and 2. As per the convention in geostatistics, the distribution of log (k) is assumed to be known and is denoted by G. As a result, g = log(k) is treated as a random variable in the problem description defined in equation (1). Uncertainty distributions for test case 1 and 2 are denoted with G 1 and G 2 , respectively.

12:
end for 13: Optimize J ppo (θ) with K epochs and minibatch size M ≤ N T

15:
Compute the policy return R π θ (A|S) and append it in r

Uncertainty distribution for test case 1
The log-permeability uncertainty distribution for test case 1 is inspired from the case study done by Brouwer et al. (2001). Figure 4a shows schematics of the spatial domain for this case. In total, 31 injector wells (illustrated with blue circles) and 31 producer wells (illustrated with red circles) are placed at the left and right edge of the domain, respectively. As illustrated in Figure 4a, a linear high permeability channel (shown in gray color) passes from the left to right side of the domain. l 1 and l 2 represent the distance from the top edge of the domain on the left and right side while the channel width is denoted with w. These parameters follow uniform distributions defined as, w ∼ U (120, 360), l 1 ∼ U (0, L − w) and l 2 ∼ U (0, L − w), where L is domain length. In other words, the random variable g follows the probability distribution G 1 which is parameterized with w, l 1 and l 2 : g ∼ G 1 (w, l 1 , l 2 ).
To be specific, log permeability g at a location (x, y) is formulated as: where x and y are horizontal and vertical distances from the upper left corner of the domain illustrated in figure 4a. The values for permeability at the channel (245 mD) and the rest of the domain (0.14 mD) are inspired from Upperness log-permeability distribution peak values specified in SPE-10 model 2 case. Test case 2 represents uncertainty distribution of a smoother permeability field. Figure 4b illustrates reservoir domain for this case. It comprises of 14 producers (illustrated with red circles) located symmetrically on left and right edges (7 on each edge) of the domain and 7 injectors (illustrated with blue circles) located at the central vertical axis of the domain. A prior distribution F is assumed over all the locations x ∈ X as, where the process variance, σ, is assigned as 5 and the exponential covariance function (kernel), k(x,x), is defined as, where the parameters l 1 and l 2 are assigned to be 620ft (width of the domain) and 62ft (10% of domain width), respectively. The posterior distribution given the observed log-permeability vector, g(x ) = [g(x 1 ), g(x 2 ), · · · , g(x n )], where each observation correspond to a log-permeability value of 2.41 at a well location (i.e., n = 21 since there are, in total, 21 number of wells in this case). From the principle of ordinary kriging, the posterior distribution, G 2 , for log-permeability at a location x ∈ X is a normal distribution which is defined as, [1, 1, · · · , 1] ) andμ is an estimate of the global mean µ, which is obtained from the kriging model based on the maximum likelihood estimation of the distribution F (x) (from equation (6)) for the observations g(x ), and is formulated as, The log-permeability distribution G 2 , is created with an ordinary kriging model using the geostatistics library gstools (Müller and Schüler, 2019). In the simulation, samples of the permeability fields are obtained with a clockwise rotation angle of π/8.

State, action and reward formulation
PPO algorithm attempts to learn the parameters θ of the policy neural network π θ (A|S). The episodes (i.e. the entire simulation temporal domain T ) are divided in five control steps. Each episode timestep corresponding to a control step is denoted with t m , where m ∈ {1, 2, · · · , 5}. The state S, is represented by an observation vector which consists of saturation and pressure values at well locations, x . Since the saturation values at injector wells are always one, irrespective of the time t m , they are omitted from the observation vector. Consequently, the observation vector is of the size 2n p + n i (i.e., n s = 93 for test case 1 and n s = 35 for test case 2). Note that this observation vector forms the input to the policy network π θ (A|S). A vector of flow control values of all the injector and producer wells, denoted by A, is represented as the action. The action vector A, consists of in total n p + n i values (i.e., n a = 62 for test case 1 and n a = 21 for test case 2). In order to maintain constraint defined in equation (1d), the action vector is represented with a vector of weights w ∈ R na , such that 0.001 ≤ w j ≤ 1. Each weight value w j , corresponds to the proportion of flow through the jth well. As a result, the values in the action vector are written as, (w 1 , · · · , w ni , w ni+1 , · · · , w ni+np ). Flow through jth injector A j , is computed such that the constraint defined in (1d) is satisfied: Similarly, flow through jth producer, A j+ni , is written as, The reward function, as defined in equation (2), is divided by total pore volume (φ × lx × ly) as form of normalization to obtain a reward function in the range [0,1]. The normalized reward represents recovery factor or sweep efficiency of the contaminated fluid. Recovery factor represents the total amount of contaminants swept out of the domain. For instance, the recovery factor of 0.65 means that in total of 65% of contaminants are swept out of the domain using waterflooding. To put it in the context of ground water decontamination problem, the optimal controls correspond to the well controls that maximize the percentage of contaminants swept out of the reservoir.

Multi-grid framework formulations
The proposed framework is demonstrated using three levels of grid fidelity corresponding to β = 0.25, β = 0.5 and β = 1.0. Table 3 lists the discretization grid size corresponding to these grid fidelity factors for both test cases. In order to show the effectiveness of the proposed framework, the obtained results are compared with single grid and multi-grid frameworks. The results for single grid framework are same as if they were obtained using classical PPO algorithm where the environment has a fixed fidelity factor throughout the policy learning process. This is done by setting the grid fidelity factor array β, and episode limit array E, with a single value in algorithm 2. The factor n in convergence criteria procedure (delineated in algorithm 1) is set to infinity. In other words, convergence criteria is unchecked and the policy learning take place for a predefined number of episodes. In total three such single-grid experiments are done corresponding to β = 0.25, β = 0.5 and β = 1.0. Further, two multi-grid experiments are performed to demonstrate the effectiveness of the proposed framework. The first multi-grid experiment is referred as "fixed" where convergence criteria is kept unchecked just like single-grid frameworks. The multiple levels of grids are defined by setting the grid fidelity factor array β, and episode limit array E, as an array of multiple values corresponding to each fidelity factor value and its corresponding episode count. In the fixed multigrid framework, policy learning takes place by updating the environment fidelity factor according to β without checking the convergence criteria (i.e. by setting n = ∞). Secondly, the "adaptive" multi-grid framework parameters are set similar to those used in fixed multi-grid framework except for the convergence criteria parameters n and δ. Table 4 delineates number of experiments and their corresponding parameters for test case 1 and 2. Figure 5 provide visualization of effect of fidelity factor β, on the simulation of test case 1. Figure 5a and 5b show log-permeability and saturation plots corresponding to β = 0.25, β = 0.5 and β = 1.0. Further, figure 5c illustrate the effect of grid fidelity on simulation run time for single episode (shown on left with a box plot with 100 simulation run trials) and "equivalent β = 1 simulation run time" for each grid fidelity factor (shown on right). Equivalent β = 1 simulation run time is defined as the ratio of average simulation run time for a grid fidelity factor β, to that corresponding to β = 1. This quantity is used as a scaling factor to   Results obtained using the proposed framework are evaluated against the benchmark optimisation results obtained using differential evolution (DE) algorithm (Storn and Price, 1997). For both optimisation methods (PPO and DE) multiprocessing is employed to reduce total computational time. However, parallelism behaviour is quite varied between PPO and DE algorithms. In PPO algorithms, neural networks are back propagated synchronously at the end of each policy iteration which causes extra computational time in waiting and data distribution. As a results, in order to compare computational efforts irrespective of computational resources and parallelism behaviours, it is fair to compare number of simulation runs which is a major source of computational cost in these algorithms. The PPO algorithm for the proposed framework is executed using the stable baselines library (Raffin et al., 2019), while python's SciPy (Virtanen et al., 2020) library is used for DE algorithm. Appendix C delineate all the algorithm parameters used in this study.

Results
The control policy where injector and producer wells are equally open throughout the entire episode is referred to as the base policy. Under such policy, the water flooding prominently takes place in the high permeability region leaving the low permeability region swept inefficiently. The optimal policy for these test cases would be to control the producer and injector flow to mitigate this imbalance in water flooding. The optimal policy, learned using reinforcement learning for test case 1, show on   an average around 12% improvement with respect to recovery factor achieved using the base policy. While for test case 2, the average improvement in the order of 25% is observed. Figure 7 illustrates the plots for policy return R π(A|S) , corresponding to all the frameworks listed in table 4 for test case 1. At the beginning of the learning process, the policy return values for single-grid framework keeps improving and eventually converge to a maximum value when the policy converges to an optimal policy. Note that for lower value of grid fidelity factor β, the optimal policy return is also low. In other words, the coarsening of simulation grid discretization also reflects in overall reduction in recovery factor. This is due to the low accuracy of states and actions representation for environments with β < 1. On the other hand, the overall computational gain is observed due to coarser grid sizes. Simulation run time corresponding to β = 0.25 and β = 0.5 show around 66% and 54% reduction as compared to that with β = 1. the results of multi-grid frameworks are compared with the single grid framework corresponding to β = 1 which refers to classical PPO algorithm using the environment with a fixed high fidelity grid factor. As shown in the plots at the center and right of figure 7, both multi-grid frameworks show convergence to the optimal policy which is achieved using high fidelity single grid framework. In the fixed multi-grid framework the fidelity factor, is incremented at a fixed interval of 25000 number of episodes. The adaptive framework is also provided with the same interval but with additional convergence check within each interval. For multi-grid learning plots shown in figure 7 (center and right plots), equivalent number of episodes corresponding to the environment with β = 1 is illustrated as a secondary horizontal axis. This way, the computational effect of multi-grid frameworks is directly compared to single-grid (with β = 1) framework. The equivalent number of β = 1 episodes corresponding to episodes with certain β value are computed by multiplying it with the equivalent β = 1 simulation run time. For instance, number of episodes with β = 0.25 are multiplied with 0.37. For fixed multi-grid framework, it takes 46264 number of equivalent β = 1 episodes to achieve an equally optimal policy that is obtained with 75000 number of episodes using single grid (β = 1) framework. Similarly, the same is achieved with just 28907 number of equivalent β = 1 episodes using adaptive multi-grid framework. In other words, around 38% and 61% reduction is observed in simulation run time using fixed and adaptive multigrid frameworks, respectively. Further, the robustness of the policy learned using these frameworks is compared by applying it on a highest fidelity environment with random permeability samples from the distribution G 1 , which were never seen during the policy learning process. Figure 8a shows the plots of these unseen permeability fields, while the corresponding results obtained using these frameworks are plotted in figure 8b. Optimal results obtained using differential evolutionary (DE) algorithms are provided as benchmark (marked as DE in figure 8b). Note that DE algorithm, in itself, is not a suitable method to solve the robust optimal control problem since it can provide optimal controls only for certain permeability samples as opposed to PPO algorithm where the learned policy is applicable to all samples of permeability distribution. However, DE results are used as the reference optimal results which are achieved by direct optimization on sample by sample basis. Equivalence in the optimality of learned policies obtained using these three experiments can be observed from the closeness in their corresponding optimal recovery factors. Figure 9 demonstrate the policy visualization for an example of permeability sample in case 1. In this figure, the results are shown for permeability sample index 4 from the figure 8a where a high permeability channel passes through lower region of the domain. The optimal policy, in this case, would be to restrict the flow through injector wells and producer wells which are in the vicinity of the channel. The super-positioned comparison of optimal results for base case, differential evolution, single-grid framework (where β = 1), fixed multi-grid framework and adaptive multi-grid framework shows that the optimal policy is learned successfully using the proposed framework.
For test case 2, similar results are observed as shown in figure 10. The single-grid algorithms converge to an optimal policy in total 150000 number of episodes. The fixed multi-grid algorithm  is trained with 50000 episode interval for each grid fidelity factor as shown in the central plot in figure 10. The optimal policy is learned in 92657 equivalent β = 1 episodes thus saving around 38% of simulation run time. The adaptive multi-grid framework further reduces computational cost by achieving the optimal policy in 36618 number of equivalent β = 1 episodes (simulation time reduction of about 76% with respect to β = 1 single-grid framework). Figure 11 illustrate the results of policy evaluation on an unseen permeability samples from the distribution G 2 . The permeability samples are shown in figure 11a and the optimal recovery factor corresponding to learned policies are plotted in figure 11b. Figure 12 demonstrate the optimal controls for an example of permeability sample index 5 from figure 11a. The optimal policy learned using differential evolution algorithm refers to increasing the flow through injector wells which are in low permeability region while restricting the flow through producer wells for which the water cutoff is reached. The policies learned using RL framework takes advantage of the default location and orientation of high permeability regions. In this case, the optimal policy is achieved by controlling the well flow control such that the flow traverses through the permeability channels (that is, the flow is more or less perpendicular to the permeability orientation).

Conclusion
An adaptive multi-grid RL framework is introduced to solve robust optimal well control problem. The proposed framework results in significant reduction in computational cost of policy learning process as compared to classical PPO algorithm results. In the presented case studies, 61% computational savings in simulation runtime for test case 1 and 76% for test case 2 is observed. The results are highly dependent on the right choice of the algorithm hyper-parameters (e.g. δ, n, β and E) which were tuned heuristically. As a future direction to this research study, the aim is to find the optimal values for β that maximizes the overall computational savings. Furthermore, policy transfer was performed sequentially in the current framework which seemed to have worked optimally. However, to improve the generality of the proposed framework it would be important to study the effect of sequence of policy transfer on the overall performance.      A Cluster analysis of permeability uncertainty distribution training vector k, is chosen to represent the variability in the permeability distribution K. For the optimal control problem, our main interest is in the uncertainty in the dynamical response of permeability rather than the uncertainty in permeability itself. As a result, connectivity distance (Park, 2011) is used as a measure of distance between permeability field samples. The connectivity distance matrix D ∈ R N ×N among N samples of K is formulated as, where N correspond to a large number number of samples of uncertainty distribution, s(x , t; k i ) is saturation at location x , and time t, when the permeability is set to k i and all wells are open equally. Multi-dimensional scaling of the distance matrix D is used to produce N two-dimensional coordinates d 1 , d 2 , · · · , d N , each representing a permeability sample. The coordinates d 1 , d 2 , · · · , d N are obtained such that the distance between d i and d j is equivalent to D(k i , k j ). In the k-means clustering process, these coordinates are divided in l sets S 1 , S 2 , · · · , S l , obtained by solving the optimisation problem: where µ Si is average of all coordinates in the set S i . The training vector k is a set of l samples of K where each of its value k i correspond to the one nearest to µ Si . Total number of samples N and clusters l are chosen to be 1000 and 16 for both uncertainty distributions, G 1 and G 2 . Training vector k is obtained with samples k 1 , · · · , k 16 each corresponding to a cluster center. Figure 13a and 13b show cluster plots for G 1 and G 2 permeability distribution samples, respectively. Further, 16 permeability samples, each randomly chosen from a cluster, are chosen to evaluate the learned policies. Figures 8a and 11a illustrate these samples for test case 1 and 2, respectively.

B Definitions of value and advantage function
In RL, the policy π(A|S) is said to be optimal if it maps the state S t with an action A t that correspond to maximum expected return value. These return values are learned through the data obtained in agent-environment interactions. Following are some definition of return values typically used in RL: Value function is the expected future return for a particular state S t and is defined as, where E π [· · · ] denotes expected value given that the agent follows the policy π. As a short hand notation, V (S) at state S t is denoted as V t . Q function is similar to value function except that it represent the expected return when the agent takes action a t in the state S t . It is defined as, Advantage function is defined as the difference between Q function and value function and is denoted by Adv(S, A) at state S and action A.

C Algorithm parameters
Parameters used for PPO are tabulated in table 5 which were tuned using trial and error. For PPO algorithm, parameters were tuned in order to find least variability in learning plots. Figures 14 and  15 show learning plots corresponding to three distinct seeds to show the stochasticity of the obtained results. The DE algorithm's parameters are delineated in table 6. The code repository for both the test cases presented in this paper can be found on the link: https://github.com/atishdixit16/ ada_multigrid_ppo.     0.9 mutation factor (0.5,1) (0.5,1)