
1 Introduction

In recent decades, there has been renewed interest in the decision-making process for optimal well placement. The reasons for this are twofold. First, there has been a proliferation in the amount of data collected during reservoir development that can help guide the decision-making process. Second, improvements in computing hardware have enabled the use of artificial intelligence algorithms that exploit the collected data to maximize recovery. These improvements have led to faster and computationally cheaper flow simulations, making it practical to consider the various scenarios that represent uncertainty in the underlying reservoir properties. Faster computers also facilitate the training process that is intrinsic to most machine learning methods.

Due to improvements in horizontal drilling technologies, several unconventional reservoir plays have become profitable. Economic return from these assets is maximized through the strategic placement of wells in high-productivity zones, minimizing the number of wells required for efficient recovery of hydrocarbons from the reservoir. Optimal well placement also emphasizes the strategic extraction of knowledge about the subsurface, which in turn promotes the development of superior reservoir models and reduces the uncertainty in future well placement decisions.

Early methods for well location optimization focused on considering a wide variety of constraints for adjoint or mixed-integer optimization. These constraints include geologic uncertainties, cost estimates, fluid properties, facilities, etc. Methods such as mixed-integer programming have been used to optimize well locations [1,2,3] by expressing the objective function as a linear combination of the variables. These methods have the advantage of being fast and giving easily interpretable solutions to the problem, but they fail to characterize the highly non-linear relationships between reservoir variables, and between these variables and the dynamic response. Gradient-based optimization methods allow for the inclusion of non-linear relationships between variables and rely on the computation of the gradient of a prespecified objective function. The adjoint-based formulation for gradient-based optimization allows for a simpler interpretation of the constraints and for the identification and isolation of interesting well regions in the reservoir [4]. It has the added advantage of faster convergence compared to other gradient methods while remaining versatile and interpretable. These methods allow for the inclusion of diverse decision variables and are usually combined with conventional flow simulation. Because the well placement problem is inherently a dynamic programming problem, any increase in the range of decision variables leads to an exponential increase in the number of choices that must be considered, and therefore in the number of flow simulations required for the optimization. Due to these computational costs, such methods employ ingenious approaches to reduce the search space, but they are heavily reliant upon the definition of the adjoint equation.

One of the most popular methods for well location optimization is the use of population-based optimization algorithms, more specifically genetic algorithms (GA). GA-based optimization aims to replicate the process of natural selection in a controlled environment. It functions by initializing a population of candidate well locations and evaluating the value of the objective function (also known as determining the fitness of the population). The ‘fittest’ wells are then selected for the reproduction step, in which new wells are generated using ‘mutations’ and ‘crossovers’. Bittencourt and Horne [5] introduced GA to the oil and gas domain for the purpose of well location optimization. Their approach aimed to reduce the search space by using Tabu search [6, 7]. To include geostatistical inputs into the GA formulation, several authors [8, 9] have incorporated kriging into their framework. This helped improve the interpretability of GA results at the cost of increased computational requirements. Other authors have studied the explanatory power of GA and the effect of hyper-parameters [10], extensions to horizontal wells [11], gas condensates [11, 12], etc. GA-based algorithms rely on sampling-based optimization, but the approach is plagued by issues such as non-optimality of solutions, a vast number of hyper-parameters, and high computational costs for evaluating candidate well locations.

Metrics such as productivity potential maps [13] can be extremely useful in speeding up the evaluation of well locations. In addition, research into the use of neural networks for forecasting production [14] and developing earth models [15, 16] can be included in the proxy model formulation to develop a holistic view of the uncertainty in the reservoir properties while doing away with the need for full flow simulations. These deep learning models enable the use of transfer learning [17], accelerating the training process. In addition, such transfer learning models can be built independently by private oil and gas companies in conformance with their data privacy policies.

The field of reinforcement learning is uniquely positioned to address dynamic programming problems. With recent advancements in the use of deep learning for functional approximation of reinforcement learning solutions, the applicability of DRL methods has expanded dramatically. Popular uses include playing chess [18], controlling robots [19], improving cybersecurity [20], etc. A key insight into the use of reinforcement learning for well location optimization is that these methods have been extensively utilized for addressing multi-stage decision-making under uncertainty. Recent applications of reinforcement learning to the geosciences include slope stability analysis for landslide prevention [21] and the determination of first arrivals in seismic image processing [22]. Extensive research into improving sampling [23], memory buffers [24], addressing biases [25, 26], etc. has improved the applicability of DRL to problems previously perceived to be too large and too complex for reinforcement learning.

Here, we show the application of reinforcement learning methods to the well location problem. In Sect. 2, we take an in-depth look at the theory of the reinforcement learning algorithms applied in this research. Section 3 discusses the well location problem and highlights the applicability of the reinforcement learning approach to addressing it. Section 4 demonstrates the application of reinforcement learning to two case studies: the first set of case studies (Case 1) addresses the optimization of wells in a 2-D reservoir, while Case 2 addresses the optimization of a 3-D reservoir. Sections 5 and 6 discuss the case studies and provide concluding remarks.

2 Theory

Reinforcement learning involves an agent (or a decision maker) interacting with its environment \(\mathcal{E}\), usually formulated as a Markov decision process \((\mathcal{S},\mathcal{A},T, r,\gamma )\) [27] with state space \(\mathcal{S}\), action space \(\mathcal{A}\), reward \(r\), transition function \(T({s}^{{\prime}}|s,a)\) and discount factor \(\gamma \in [{0,1}]\). At each time step \(t\), the agent takes an action \(a\in \mathcal{A}\) in state \(s\in \mathcal{S}\) and transitions to state \(s{^{\prime}}\) while receiving a reward \(r\left(s,a,{s}^{{\prime}}\right)\in {\mathbb{R}}\). The goal of the agent is to maximize the expected cumulative discounted future reward, or return, \({R}_{t}={\sum }_{{t}^{{\prime}}=t}^{T}{\gamma }^{{t}^{{\prime}}-t}{r}_{{t}^{{\prime}}}\), where \(T\) is the time step at termination. The maximum expected return achievable by any strategy, after being in state \(s\) and then taking some action \(a\), is defined as the optimal action-value function \({Q}^{*}\left(s,a\right)\). It can be mathematically represented as

$$\begin{array}{c}{Q}^{*}\left(s,a\right)=\underset{\pi }{\mathrm{max}}{\mathbb{E}}\left[{R}_{t}|{s}_{t}=s,{a}_{t}=a,\pi \right] \end{array}$$
(1)

where \(\pi :\mathcal{S}\to \mathcal{A}\) is a policy mapping states to actions. It defines the mechanism by which the agent selects an action in state \(s\). If \({Q}^{*}(s{^{\prime}},{a}^{{\prime}})\) has been determined for all possible actions \(a^{\prime}\), then the optimal policy is to select the \(a^{\prime}\) that maximizes the expected value of the future reward, \(r+\gamma {Q}^{*}(s^{\prime},{a}^{{\prime}})\). The optimal action-value function obeys the Bellman equation [28]:

$$\begin{array}{c}{Q}^{*}\left(s,a\right)={\mathbb{E}}_{{s}^{{^{\prime}}}}\left[r+\gamma \underset{{a}^{{^{\prime}}}}{\mathrm{max}}{Q}^{*}\left({s}^{{^{\prime}}},{a}^{{^{\prime}}}\right)|s,a\right] \end{array}$$
(2)

The \(\gamma \underset{{a}^{{\prime}}}{\mathrm{max}}{Q}^{*}\left({s}^{{\prime}},{a}^{{\prime}}\right)\) term in the equation highlights how future actions in future states are taken into account when determining the value of the current action.

$$\begin{array}{c}\underset{{s}^{{\prime}},{a}^{{\prime}}}{\mathrm{max}}{Q}^{*}\left({s}^{{\prime}},{a}^{{\prime}}\right)\ge 0 \to {\mathbb{E}}_{{s}^{{\prime}}}\left[r+\gamma \underset{{a}^{{\prime}}}{\mathrm{max}}{Q}^{*}\left({s}^{{\prime}},{a}^{{\prime}}\right)|s,a\right]\ge {\mathbb{E}}\left[r\right]\end{array}$$
(3)

This differentiates DRL from conventionally employed optimization techniques, which instead focus on the maximization of the immediate expected reward. We can then determine the action-value function iteratively, \({Q}_{i+1}={\mathbb{E}}\left[r+\gamma \underset{{a}^{{\prime}}}{\mathrm{max}}{Q}_{i}\left({s}^{{\prime}},{a}^{{\prime}}\right)|s,a\right]\). When the action-value function converges to \({Q}^{*}\), the optimal policy can be determined as \({\pi }^{*}\left(s\right)=\mathrm{arg}\underset{a}{\mathrm{max}}{Q}^{*}\left(s,a\right)\). This convergence usually takes place when the state-action space has been thoroughly sampled. One way to derive a behavior policy from an action-value function is to act \(\epsilon \)-greedily with respect to the action values, i.e., take the action with the highest action value with probability \(1-\epsilon \) and a random action with probability \(\epsilon \). The \(\epsilon \) component introduces exploration: a sub-optimal action may be selected with the objective of sampling the action-value space more broadly, with the expectation that the agent will then be able to improve its value function estimates. During the initial training of the DRL algorithm, a high \(\epsilon \) allows for better exploration of the state-action space; as the RL algorithm learns more about the environment, a lower \(\epsilon \) allows for fine-scale refinement of the action-value function.

Q-learning, a form of temporal-difference (TD) learning [29], is frequently used to estimate the optimal action values. Because of the computational requirements associated with sampling each state-action pair, function approximators are usually employed to estimate the action-value function; e.g., the Deep Q-Network (DQN) algorithm [30] uses neural networks as function approximators. The action-value function is approximated using a neural network (or Q-network) with parameters \(\theta \). The Q-network is trained by minimizing the loss function \({L}_{i}({\theta }_{i})\) at every iteration \(i\), described by the following equation:

$$\begin{array}{c}{L}_{i}\left({\theta }_{i}\right)={\mathbb{E}}_{s,a}\left[{\left({\mathbb{E}}_{{s}^{{\prime}}}\left[r+\gamma \underset{{a}^{{\prime}}}{\mathrm{max}}Q\left({s}^{{\prime}},{a}^{{\prime}};{\theta }_{i-1}\right)|s,a\right]-Q\left(s,a;{\theta }_{i}\right)\right)}^{2}\right]\end{array}$$
(4)
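
Before introducing the mini-batch updates that follow, a minimal PyTorch-style sketch of the \(\epsilon \)-greedy action selection and of the loss in Eq. (4) is given below; the networks `q_net` and `q_net_prev` (a frozen copy holding the previous-iteration parameters), as well as all tensor shapes, are illustrative assumptions rather than the architecture used in this work.

```python
import torch
import torch.nn.functional as F

def select_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy behavior policy: random action with prob. epsilon, else argmax_a Q(s, a)."""
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(n_actions, (1,)).item())
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())

def dqn_loss(q_net, q_net_prev, states, actions, rewards, next_states, dones, gamma=0.99):
    """Squared TD error of Eq. (4): the target uses the previous-iteration parameters theta_{i-1}."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
    with torch.no_grad():                                             # no gradient flows through the target
        target = rewards + gamma * (1.0 - dones) * q_net_prev(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)
```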

To optimize the performance of the neural network, mini-batch stochastic gradient descent can be conducted [31]. Under this approach, the NN parameters are updated using a stochastic approximation (computed over a vectorized mini-batch) of the gradient of the loss function. To improve convergence with the function approximator, an experience replay memory can be generated in which the agent's experiences \(({s}_{t},{a}_{t},{r}_{t},{s}_{t+1})\) are stored at every time step. The Q-learning updates are then conducted using random mini-batch samples from the memory. The mechanism for the update of the neural network parameters \(\theta \) is given as:

$$\begin{array}{c}\Delta \theta =\alpha \left[r+\gamma \underset{{a}^{{\prime}}}{\mathrm{max}}Q\left({s}^{{\prime}},{a}^{{\prime}};{\theta }^{{\prime}}\right)-Q\left(s,a;\theta \right)\right] {\nabla }_{\theta }Q\left(s,a;\theta \right)\end{array}$$
(5)
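
The experience replay memory described above can be sketched minimally as follows; the capacity, batch size and class name are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s_t, a_t, r_t, s_{t+1}, done) transitions for minibatch updates."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform random minibatch
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```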

The target in the loss function depends on the trained neural network weights from the previous iteration and can therefore change over multiple iterations. As the neural network minimizes the loss after every training step, the update to the parameters \(\theta \) causes the target, \(r+\gamma \underset{{a}^{{\prime}}}{\mathrm{max}}Q\left({s}^{{\prime}},{a}^{{\prime}};\theta \right)\), to change or ‘move’ as well. This problem of moving targets can be addressed by dissociating the training of the target and the estimate using separate neural networks. In the update equation above, the learning rate \(\alpha \) controls how strongly the model weights respond to the estimated error. Hyperparameter tuning is conducted to select the learning rate; usually, a smaller \(\alpha \) leads to more reliable convergence to good weights, but at the cost of longer training time.

Another issue with the DQN formulation is the overestimation bias in the maximization step, which can lead to poor learning. One way to address this issue is to decouple the selection of the action from its evaluation. This is known as Double DQN (DDQN) [25], and the new target is

$$\begin{array}{c}{Y}_{i}=r+\gamma Q\left({s}^{{\prime}},\mathrm{arg}\underset{{\mathrm{a}}^{{{\prime}}}}{\mathrm{max}}Q\left({s}^{{\prime}},{a}^{{\prime}};{\theta }_{i}\right);{\theta }_{i}^{{\prime}}\right)\end{array}$$
(6)

The action is selected using the online weights \({\theta }_{i}\), while the second set of weights \({\theta }_{i}^{{\prime}}\) is used to evaluate the value of that action. The weights \({\theta }_{i}^{{\prime}}\) are updated by periodically switching the roles of the target and value networks; these updates are conducted every \(\tau \) steps to prevent the problem of moving targets.
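
A sketch of the Double DQN target of Eq. (6), together with a periodic refresh of the second set of weights every \(\tau \) steps, is given below; `q_online` and `q_target` are assumed to be PyTorch-style networks mapping a batch of states to a vector of action values, and the copy-based refresh shown is one common variant of the update described above.

```python
import torch

def ddqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    """Double DQN target of Eq. (6): the online net selects a', the target net evaluates it."""
    with torch.no_grad():
        best_actions = q_online(next_states).argmax(dim=1, keepdim=True)        # selection with theta_i
        next_values = q_target(next_states).gather(1, best_actions).squeeze(1)  # evaluation with theta_i'
        return rewards + gamma * (1.0 - dones) * next_values

def sync_target(q_online, q_target, step, tau=1000):
    """Refresh the second set of weights every tau steps to avoid the moving-target problem."""
    if step % tau == 0:
        q_target.load_state_dict(q_online.state_dict())
```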

3 Well Location Problem

During the early phases of reservoir development, minimal information regarding the reservoir properties is available, leading to decision-making for reservoir development under uncertainty. This problem of placing wells sequentially in an uncertain environment can be formulated as a Markov decision process. A reservoir DRL agent can then be trained to solve the problem of optimally exploiting the reservoir resources.

In the early phases of reservoir development, well data is not available and seismic data forms the basis for any decision-making. Existing geostatistical techniques such as stochastic simulation can be utilized to generate multiple reservoir realizations using the seismic data as secondary data. These realizations reflect the uncertainty in the reservoir properties, which are simulated conditioned to little or no hard data. Also, in many cases the semi-variogram inferred and modeled from the secondary data does not accurately represent the spatial trends in the reservoir property being simulated. In these situations, inputs from geologists, petrophysicists and reservoir engineers are critical for accurately identifying trends in reservoir properties. Usually, the reservoir models guide the well placement decisions, and the additional information derived after conducting well surveys is used to update the reservoir models. As additional data regarding the reservoir becomes available, the uncertainty in reservoir properties reduces, which in turn allows for better decision-making.

This forms the framework for training the reservoir DRL agent (see algorithm in [25]). The agent starts in an unexplored state, i.e., a state where no well data is available, with the goal of maximizing the cumulative hydrocarbon production after the placement of \(N\) wells. This ensures that the problem is of the fixed-horizon type, i.e., the end state is well defined. At any given time, the reservoir DRL agent acts by placing a well at a valid location. After the action is selected, the agent receives information regarding the petrophysical properties at the drilled point along with a reward reinforcing the goodness of the selected action.

The paper presents a simplified approach for addressing uncertainty in reservoir properties by focusing on the lithofacies as an indicator of the goodness of a selected action. The reason for this is twofold: first, several studies have emphasized the relationship between lithofacies and other petrophysical properties such as porosity and permeability [32,33,34,35,36,37,38]; second, categorical variables like lithofacies dramatically reduce the state space in the DRL formulation. The reservoir DRL agent attempts to place wells in high-productivity regions and balances the additional information gained by placing a well against the economic profitability of the hydrocarbon resources extracted from the selected regions.

Several case studies have been conducted to test the efficacy of the recommended policies and their robustness to the initial assumptions about the reservoir. The goal of the DRL agent is not to present the best policy for all possible assumptions regarding the geospatial distribution of the lithofacies. Training a DRL agent requires several assumptions regarding the development of reservoir models and the role that experts in geology, petroleum engineering and earth science play in the decision-making process. Errors in the decisions taken to build the ensemble of reservoir models will affect the developed policies.

4 Case Studies

Case 1A focuses on the application of DRL for the determination of an optimal policy using the Double Deep Q-Network (Double DQN or DDQN). Cases 1B, 1C and 1C_alternate focus on the applicability of the developed policy in cases where the initial assumptions regarding the reservoir were flawed. The neural network architecture and the hyperparameters for the simulations can be found in the appendix. The process flow is demonstrated in Fig. 1, considering various pathways for simulating the environment. The update to the prior probabilities of pay facies is crucial to the determination of the optimal policy. There are several methodologies for conducting this update.

Fig. 1 Workflow for determination of optimal policy for exploiting the reservoir using reinforcement learning

  1. Using fast-variogram computations and generating an updated ensemble of reservoir realizations. This method is the slowest but would yield the most statistically accurate updates.

  2. Using the initial ensemble to derive correlations between locations and using data assimilation tools, such as ensemble Kalman filtering (EnKF), to update the reservoir realizations. This method requires some preprocessing to derive the necessary statistics and relies on a multi-Gaussian assumption.

  3. Using individual realizations for DRL training in each episode. This presumes that the initial ensemble contains a set of models that are consistent with the newly extracted information, so updates to the initial ensemble of reservoir realizations are not required. This is the most computationally efficient method for deriving the policy, though it relies on the exhaustive character of the initial set of reservoir realizations. It also allows for parallelization of the learning process (Table 1).

    Table 1 Single stage optimization algorithm

For the case studies, the dynamics of the environment are simulated by considering a randomly selected reservoir realization for every episode (one pass through the training process, from the initial unexplored state to the terminal state) from the initial ensemble generated using seismic data. Using a lookup table (with the facies values stored for the selected realization) allows for fast computation of the next state and the reward delivered to the reservoir DRL agent. The encoding of the state and the definition of the reward function are vital to the implementation of the deep reinforcement learning algorithm. The state encodes the new information gained by placing a well in the reservoir. This information can be well log data, production data, core analysis information, etc. Information regarding the lithofacies at the drilled location is considered for the case studies presented here. This information can be encoded as a vector in two ways.

  1. As a vector consisting of facies information at all grid points, including placeholders for grid points where the information is not available. This formulation is efficient in cases where the reservoir is discretized into few grid points.

    $$\begin{array}{c}{s}_{t}=\left({f}_{1},{f}_{2},\ldots ,{f}_{n}\right) \end{array}$$
    (7)

    where \({f}_{i}\) represents the facies at the ith grid point.

  2. As a vector consisting of the well locations and the facies determined at the drilled well locations. This formulation works better for large reservoirs discretized into several grid blocks.

    $$\begin{array}{c}{s}_{t}=\left({x}_{1},{x}_{2},\dots ,{x}_{N},{y}_{1},{y}_{2},\dots ,{y}_{N}, {f}_{1},{f}_{2},\dots ,{f}_{N}\right) \end{array}$$
    (8)

    where \({x}_{i},{y}_{i}\) and \({f}_{i}\) represent the x-coordinate, y-coordinate, and facies at the ith drilled well location.

The latter formulation has been considered for the case studies. Recent research into deconvolutional neural networks [39] has shown their efficacy for extending point information in multiple dimensions to regions where the information is unavailable and/or uncertain [40, 41]. Though this has not been explored in the current research, the authors aim to explore it in future work.
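
A minimal sketch of the fixed-length state vector of Eq. (8), assuming a placeholder value of \(-1\) for wells that have not yet been drilled; the function name and placeholder choice are illustrative.

```python
import numpy as np

def encode_state(drilled, n_wells=5, placeholder=-1.0):
    """Build (x_1..x_N, y_1..y_N, f_1..f_N) per Eq. (8); undrilled slots keep the placeholder."""
    xs = np.full(n_wells, placeholder)
    ys = np.full(n_wells, placeholder)
    fs = np.full(n_wells, placeholder)
    for k, (x, y, facies) in enumerate(drilled):   # drilled: [(x, y, facies), ...] in drilling order
        xs[k], ys[k], fs[k] = x, y, facies
    return np.concatenate([xs, ys, fs])            # fixed length 3 * n_wells, fed to the Q-network
```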

The agent receives a reward reflecting the desirability of placing the well at the selected grid location, and, as described earlier, the reservoir DRL agent's sole goal is to maximize the cumulative expected reward over \(N\) well placement decisions. The definition of the reward function dictates the key criterion that the reservoir DRL agent focuses upon to optimize the well locations. Though running full-physics flow simulations to compute the reward would be the most accurate option given perfect information regarding the subsurface, it is computationally infeasible after the placement of every single new well in an episode. Hence, a proxy linear regression model has been developed to return a reward to the reinforcement learning agent.
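
Given such a proxy (detailed below and in Eq. (9)), the lookup-table episode dynamics described earlier can be sketched as follows; the class name, the `proxy_reward` callable and the array shapes are assumptions for illustration rather than the exact implementation used in this work.

```python
import numpy as np

class WellPlacementEnv:
    """Toy episode dynamics: one realization per episode, actions are (i, j) grid indices."""

    def __init__(self, ensemble, proxy_reward, n_wells=5, seed=None):
        self.ensemble = ensemble            # list of 2-D facies arrays (one per realization)
        self.proxy_reward = proxy_reward    # callable(realization, i, j, drilled) -> float
        self.n_wells = n_wells
        self.rng = np.random.default_rng(seed)

    def reset(self):
        idx = self.rng.integers(len(self.ensemble))      # random realization for this episode
        self.realization = self.ensemble[idx]
        self.drilled = []                                 # [(i, j, facies), ...] observed so far
        return list(self.drilled)

    def step(self, action):
        i, j = action
        facies = int(self.realization[i, j])              # lookup table: facies at the drilled point
        reward = self.proxy_reward(self.realization, i, j, self.drilled)
        self.drilled.append((i, j, facies))
        done = len(self.drilled) == self.n_wells          # fixed horizon: terminate after N wells
        return list(self.drilled), reward, done
```

During training, `reset` is called at the start of every episode to draw a new realization, and `step` is called once per well placement until the fixed horizon of \(N\) wells is reached.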

Several factors affect the desirability of a well location, such as the facies at the well location, the facies of surrounding grid blocks, the distance to other wells, the distance to the reservoir boundary, etc. The proxy model was developed by conducting flow simulations for several reservoir models, keeping the fluid properties constant and varying the prespecified parameters. The correlation between these variables and the target variable, the total hydrocarbon output, is shown in Fig. 2.

Fig. 2 Correlation heat map between well location and facies connectivity parameters and well production

The following regression equation is used as a proxy for the flow response:

$$\begin{array}{c}{C}_{0} + {C}_{{w}_{fac}}{X}_{{w}_{fac}}+ {C}_{n{h}_{fac}}{X}_{n{h}_{fac}}+ {C}_{b}{X}_{b}+{C}_{{w}_{PPM}}{X}_{{w}_{PPM}}= reward \end{array}$$
(9)
where

  • \({w}_{fac}\) = facies of the grid block at the well location,

  • \(n{h}_{fac}\) = facies of the neighboring grid blocks,

  • \(b\) = distance to the reservoir boundary,

  • \({w}_{PPM}\) = productivity potential metric to account for well spacing.

It is to be noted that the developed proxy model assumes a linear relationship between the variables. This allows for faster computation of the reward function. Non-linear reward functions accounting for the non-linear relationships between reservoir properties and fluid production would provide more refined reward estimates, and such relationships can be captured with deep learning models, though this has been left for future research. The reward function also lends itself to the addition of expert information regarding the reservoir into the reinforcement learning framework. Expert information can be incorporated by modifying the relationships between the input variables (facies in the case studies considered in this paper) and the reward derived; for example, well pattern constraints can be enforced by penalizing wells that are not in the desired pattern.
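
A minimal sketch of the linear proxy of Eq. (9); the coefficient values are placeholders rather than the regression coefficients fitted from the flow simulations of Fig. 2.

```python
def proxy_reward(x_w_fac, x_nh_fac, x_b, x_w_ppm,
                 c0=0.0, c_w_fac=1.0, c_nh_fac=0.5, c_b=0.1, c_w_ppm=0.3):
    """Linear proxy of Eq. (9); the coefficients here are illustrative, not the fitted values."""
    return c0 + c_w_fac * x_w_fac + c_nh_fac * x_nh_fac + c_b * x_b + c_w_ppm * x_w_ppm
```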

The effect of the overlapping influence of neighboring wells on the production of a placed well is included in the proxy model in the form of productivity potential maps (PPM). The regression equation for the reward is developed by conducting flow simulations for various well locations across various reservoir realizations conditioned to the initial seismic data. Reinforcement learning based methods allow extensive customization of the well location optimization through the modification of the state definition, the action space and, most importantly, the reward function. In this research, we have implemented the \(\epsilon \)-greedy policy, which involves the selection of the action associated with the highest action value with a probability of \(1-\epsilon \) and a random action with a probability of \(\epsilon \). Other choices for the exploration (or behavior) policy include the Boltzmann policy, the Boltzmann Gumbel policy [42], the SoftMax policy [43], etc. Our selection of the \(\epsilon \)-greedy policy is based on its ease of application and low memory requirements compared to other choices of behavior policy.
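
For contrast with the \(\epsilon \)-greedy choice, a sketch of the Boltzmann (softmax) exploration policy mentioned above is given below, assuming a single temperature hyperparameter.

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Boltzmann/softmax exploration: sample actions in proportion to exp(Q / temperature)."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                                  # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))
```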

The reward function for the optimization of a 2-D reservoir needs to account for information from surrounding grid points, but the formulation of the state vector includes only the facies and coordinates of the well locations. For the optimization of a 3-D reservoir, the state formulation needs to account for the facies of individual grid points across the vertical section of the reservoir, as a simple aggregation of the facies along the well path would lead to the loss of vital information regarding the distribution of facies across the horizontal layers and the correlation of facies between reservoir layers. Another point to consider is the greater variety of potential well paths that can be drilled in a 3-D reservoir (inclined and lateral wells); however, the inclusion of non-vertical well paths leads to a combinatorial explosion and, while discussed in [44], is not included here for the sake of brevity.

The choice of the number of decision stages can also have an adverse effect on the computational time. When each decision stage places a single well, the RL problem has to account for the value functions of \({}^{n}{P}_{k}\) possible placement sequences, where \(k\) is the total number of wells and \(n\) is the total number of possible well locations. For the case studies considered, the number of wells has been set to 5 (placed in sequential decision-making stages), which allows for an interesting comparison of DRL with single-stage optimization techniques while remaining computationally efficient.
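
For a sense of scale, the sketch below counts the ordered 5-well placement sequences on a 25 × 25 grid of candidate locations (the grid size of the first case study, used here purely for illustration).

```python
from math import perm

n_locations = 25 * 25      # candidate grid points in a 25 x 25 reservoir
n_wells = 5
print(perm(n_locations, n_wells))   # ~9.4e13 ordered 5-well placement sequences
```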

Case 1A considers a 2-D reservoir (25 × 25 grid points), shown in Fig. 3, with the initial seismic impedance map shown in Fig. 4. The regions in red represent pay facies (encoded as 1 in the DRL state function) and those in purple represent non-pay facies (encoded as 0). As evident from the figures, the base reservoir realization has three distinct pay facies regions (represented by the three red bands). The upper channel is not well represented in the seismic impedance map, while the lower two channels are slightly displaced in the seismic map. The issue of imprecise seismic data is frequently encountered in reservoir characterization, and reservoir modeling methods need to be able to identify such translations and diffusions of features.

Fig. 3 Original reservoir lithofacies for Case 1A

Fig. 4 Seismic impedance map for Case 1A

The goal of the reservoir DRL agent is to maximize the cumulative reward for the sequential placement of 5 wells. An initial ensemble of 1000 reservoir realizations is generated using sequential indicator simulation with a locally varying mean, utilizing the seismic data as secondary data. The size of the ensemble used to train the agent is a compromise between processing time and realized reward (as discussed in [44]). Recognizing that an adequate ensemble is necessary to reach an optimum reward, we selected an ensemble size of 1000 realizations. The suite of reservoir models generated is unconditional (no hard data is available) but, importantly, the realizations adhere to the distribution of pay facies probabilities developed using the seismic impedance maps. The reservoir realizations can be made geologically realistic, and this process can be further improved by considering expert information regarding the subsurface. This would then enable the use of training images for multi-point simulation with locally varying mean data [45]. As mentioned earlier, the assumptions governing the ensemble generation process can have dramatic effects on the policy that the DRL agent eventually converges to. The DRL agent attempts to identify the trends in channel connectivity in the reservoir models, assuming that the channels are the preferred pay facies in the reservoir.

The pixel-wise average of pay facies across the reservoir realizations is shown in Fig. 5. The reservoir DRL agent is trained following the methodology shown in Fig. 1 and is then compared to a single-stage optimization policy. The single-stage policy maximizes the reward over the placement of the 5 wells by placing wells in the grid points with the highest probability of locating a pay facies region (the same reward function formulation is considered for both the single-stage optimization technique and the DRL technique) and is shown in Fig. 6. As evident from this simultaneous well placement policy, general trends in the seismic impedance are used to build a mechanism to exploit the reservoir. In the case where the initial assumptions regarding the reservoir are incorrect (due to imprecision and translation of features), this short-horizon policy suggests well placements that are not profitable and can skew the expectation of optimality (as evident from the placement of Well 4 in a non-pay facies region). In addition, due to the incremental nature of geospatial analysis, future information is not considered for present decision making.

Fig. 5 Pixelwise average of reservoir facies for the ensemble of models generated using sequential indicator simulation using locally varying mean

Fig. 6 Representation of the single-stage optimization policy (following conventional geostatistical well placement methodologies which involve the selection of the most probable well location under well spacing and productivity constraints) on the ground reality

Figure 7 demonstrates the manner in which the reservoir DRL agent learns to address the problem. The figure shows the increase in the cumulative return with the number of episodes for which the agent is trained. The expected return tapers off asymptotically, and at this stage the policy is assumed to have converged. The asymptotic nature of the convergence also depends on the exploration hyperparameter \(\epsilon \) and requires annealing of the parameter to transition from exploration to exploitation. The trained policy is demonstrated on the ground reality and on several reservoir realizations in Figs. 8 and 9, respectively. The sequential placement of wells contingent upon the previously placed wells can clearly be seen in the figures. Across all reservoir realizations, the first well is placed in the same location. This is because the initial well placement depends solely on the prior seismic data, and the state encoding contains no information regarding the reservoir facies. After the first well, the additional information from that well influences the reservoir models, and the subsequent well is placed according to the developed model.
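
A sketch of a simple linear \(\epsilon \)-annealing schedule of the kind referred to above; the start value, end value and decay horizon are assumptions rather than the schedule used in the experiments.

```python
def annealed_epsilon(episode, eps_start=1.0, eps_end=0.05, decay_episodes=5000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_episodes, then hold it."""
    fraction = min(episode / decay_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```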

Fig. 7 Moving average cumulative reward per episode for Case 1A

Fig. 8 Well placement based upon policy generated by DRL agent

Fig. 9 Policy demonstrated on selection of realizations from the ensemble generated using indicator simulation

In addition to the placement of wells in the true reservoir, it is interesting to visualize the performance of the DRL agent on the ensemble of reservoir realizations (as demonstrated in Fig. 10). The agent is not able to suggest perfect placement of wells across all realizations: in most cases the DRL agent generates high rewards, but in a few realizations (~12% of them) the wells are placed sub-optimally. The placement of subsequent wells depends upon the results of the reservoir property analysis at the given well locations. For a 2-D binary facies case with the objective of placing 5 wells, there are at most 16 facies combinations at the well locations \(\left({n}_{facies}^{{n}_{wells}-1}\right)\), and the developed policy must account for all of them. The process for determining the optimal policy is similar to pruning the leaf nodes in a decision tree. Figure 11 demonstrates the process for developing the policy, and the well configuration developed for the ground truth case has been highlighted. As the major channel is aligned at \(45^\circ \) with respect to the horizontal, successful wells are placed at coordinates where \(x\approx y\). When the DRL agent initially fails to place wells in pay regions, it moves towards exploring regions where \(x\) deviates from \(y\) (the downward-facing branches of the tree).

Fig. 10 Histogram of cumulative rewards from the developed DRL policy, tested on the ensemble of reservoir realizations (for Case 1A)

Fig. 11 Policy for exploitation of the reservoir (NP representing that the placed well is in a non-pay facies region and P representing a pay facies region). The coordinates of the next well have been shown within the boxes

Parameters that govern the performance of the deep Q-learning process, such as the learning rate (\(\alpha \)), the mini-batch size, the replay buffer size and the exploration parameter, can alter the speed of learning and the eventual policy convergence. The effects of the hyperparameters on the developed policy, assuming the same neural network model configuration and computational resources (2.5 GHz Intel Xeon processor, 2 Nvidia Tesla K80 computing modules, FDR InfiniBand, 10 Gbps Ethernet), are shown in Table 2. With an increase in batch size, there is a decrease in the number of episodes required for convergence and an overall improvement in the final converged policy. The time taken for the convergence of the policy scales nearly hyperbolically with respect to \(\tau \), the number of episodes between updates to the target network under the DDQN approach, with no dramatic reduction in the reward at convergence.

Table 2 Effect of hyperparameters on the convergence of the final policy in terms of the quality of the developed policy (demonstrated using the converged cumulative reward) and the computational time

Case 1B aims to demonstrate the dependence of the developed policy, and of the training of the agent, on the initial seismic information and modeling assumptions. A diffused seismic impedance map, generated via dilation of the seismic impedance map of Case 1A, is now considered, leading to greater uncertainty in the developed ensemble of reservoir realizations. The seismic impedance map and the pixel-wise average of pay facies are shown in Fig. 12. With the ground truth the same as in Case 1A, the pay facies probability map shows an increase in the uncertainty of the pay facies probability at all locations (demonstrated by probability values being closer to \(0.5\)). The reservoir DRL agent is trained on the ensemble of realizations developed using the diffused seismic impedance map and is then tested on the ground model.

Fig. 12 Seismic impedance map for Case 1B showing diffused channel case and the pixel-wise average of pay-facies across the ensemble of reservoir realizations. Consider this against the crisper channel description in Case 1A

The histogram of the reward distribution for this case is shown in Fig. 13. The result shows that the new policy performs worse than the one trained on the ensemble of reservoir realizations generated using the more precise seismic map. As the DRL agent bases its policy on reservoir realizations generated by modelers and attempts to mimic human decision-making, the increase in uncertainty regarding the position of the channels leads to a reduction in the quality of the developed policy.

Fig. 13 Histogram of cumulative rewards from the developed DRL policy trained using the ensemble of reservoir realizations generated using the diffused channel seismic impedance map, tested on the ensemble of reservoir realizations (for Case 1B)

Another significant difference is the computational speed that results from considering more precise information: there was a \(19.8\%\) increase in the computational time required for convergence of the policy when considering the ensemble of models conditioned to the less precise seismic information. This is because the DRL agent trains on markedly different realizations from one episode to the next (high uncertainty in the pay facies probability). Figure 14 shows the developed policy tested on the ground truth. The wells are placed further apart, as they were trained on realizations in which the pay facies are aligned roughly along the \(45^\circ \) azimuth direction. However, there is no guarantee that the wells are placed in regions of high channel connectivity in individual realizations, due to the high uncertainty associated with the realizations.

Fig. 14 Well placement on ground truth suggested by DRL agent trained using the ensemble of realizations generated conditioned to the diffused seismic data

Cases 1C and 1C_alternate consider the robustness of the training to faulty interpretations of channel orientation from seismic data. The goal in this case is to demonstrate the variation in the generated policy with increasing uncertainty associated with the detected seismic features. The seismic impedance map from Case 1A is considered along with two other seismic impedance maps (shown in Fig. 15). These maps contain the main channel oriented at \(35^\circ \) and \(15^\circ \) with respect to the x-axis (the original seismic has the main channel oriented at \(45^\circ \) with respect to the x-axis) and are used to generate two distinct sub-cases, labeled 1C for the \(35^\circ \) seismic and 1C_alternate for the \(15^\circ \) seismic. The seismic impedance maps for Cases 1C and 1C_alternate were generated by rotating individual sections of the seismic impedance map for Case 1A by the prescribed angle, followed by diffusion to blend in the gaps. The pixel-wise averages of the pay facies across the ensembles of generated realizations are shown in Fig. 16. By training the DRL agent on these variations of the pay facies map, the effect of the initial assumptions on the developed policy is demonstrated.

In addition to demonstrating the developed policy on the ground truth model (shown in Fig. 17), the histogram of episode rewards for the application of the DRL agent to the realizations developed using the seismic impedance maps is shown in Fig. 18. As the channel orientation deviates increasingly from the orientation in the “ground truth” model (from a \(10^\circ \) deviation in Case 1C to a \(30^\circ \) deviation in Case 1C_alternate), the developed policy performs worse, and wells may end up being placed in non-pay regions. In both sub-cases the reservoir DRL agent attempts to place the wells along the orientation of the major channel in the ensemble it was trained on. Though the policy developed for Case 1C enables the agent to place wells in pay facies regions, these wells are placed on the periphery of the major channel. In Case 1C_alternate, the first well is placed in a non-pay facies location, followed by 3 wells placed in pay facies regions. This leads the agent to falsely ‘believe’ that Well 1 was an anomaly and that the major channel may yet be found at a \(15^\circ \) orientation with respect to the x-axis. This is ultimately proven false with the 5th and final well. Hence, the initial assumptions upon which the ensemble of reservoir realizations is built can have a dramatic effect on the eventual policy, with the associated error increasing significantly if the models are conditioned to incorrectly interpreted information. This demonstrates that the DRL agent attempts to mimic human decision-making and falls into the same pitfalls as decision-makers would if poor reservoir models are used to guide the decision-making.

Fig. 15 Seismic impedance maps for Cases 1C on the left and 1C_alternate on the right considering imprecise seismic data that incorrectly identifies the channel angle

Fig. 16 Pixel-wise average of pay facies across the ensemble of reservoir realizations generated using seismic data that do not reflect the correct channel orientation for Case 1C (left) and Case 1C_alternate (right)

Fig. 17 Developed policy demonstrated on the ground truth for Cases 1C and 1C_alternate respectively

Fig. 18 Histogram of cumulative rewards from the developed DRL policy trained using the ensemble of reservoir realizations generated using the rotated channel seismic impedance map, tested on the ensemble of reservoir realizations generated using the original seismic data for Cases 1C and 1C_alternate, respectively

Case 2 considers the placement of vertical wells in a 3-D reservoir (\(100\times 120\times 10\)). The 3-D case considers the Stanford V reservoir with the goal of placing 5 vertical wells in an uncertain environment. To extend the 2-D case to a 3-D one, a modification is made to the state vector representation. The facies at the well location can be represented as an aggregate of the facies encoding at all grid blocks contacted by the vertical well. While treating the 3-D model as a stack of 2-D models requires fewer computational resources for the determination of the optimal policy, that assumption may fail to capture the spatial correlations of facies between different z-slices. The 3-D formulation addresses this deficiency, albeit at the expense of additional computational resources.
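
A sketch of the 3-D extension of the state vector discussed above, assuming that the facies of every z-layer contacted by a vertical well is retained rather than aggregated; the shapes and placeholder value are illustrative.

```python
import numpy as np

def encode_state_3d(drilled, n_wells=5, n_layers=10, placeholder=-1.0):
    """Keep per-layer facies for each vertical well instead of a single aggregated value."""
    xs = np.full(n_wells, placeholder)
    ys = np.full(n_wells, placeholder)
    layer_facies = np.full((n_wells, n_layers), placeholder)   # preserves inter-layer correlation info
    for k, (x, y, facies_column) in enumerate(drilled):        # facies_column: facies in each z-layer
        xs[k], ys[k] = x, y
        layer_facies[k, :] = facies_column
    return np.concatenate([xs, ys, layer_facies.ravel()])
```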

The ground truth and the pixel-wise average of the ensemble of reservoir realizations are shown in Figs. 19 and 20. Using the process workflow shown in Fig. 1, the DRL agent attempts to maximize the cumulative expected reward. The 3-D formulation converges to a better policy than the case with multiple stacked 2-D layers, owing to the better representation of spatial correlation between reservoir layers. It is clear from the ensemble average map in Fig. 20 that the stochastic realizations conditioned only to the seismic impedance information do not accurately depict the channel trends in the ground truth model. The major channel features are displaced to the northern part of the reservoir in Fig. 20, and several key high-productivity regions are consistently missing in the generated ensemble. Though in real-world cases experts add information to the ensemble generation process in the form of prior geologic constraints, the current work has deliberately not attempted to do so, in order not to introduce bias in the generated ensemble. Constraints such as well productivity, reservoir boundary and channel connectivity have been considered in this formulation.

Fig. 19 Pixel-wise addition of the ground truth for Case 2

Fig. 20 Pixel-wise average of pay facies (standardized by dividing by the number of layers) across layers averaged over the ensemble of reservoir realizations generated for Case 2

Figures 21 and 22 demonstrate the well locations corresponding to the optimum DRL policy and the single-stage optimization policy, respectively, on the ground truth. Due to the uncertainty represented in the ensemble, the second and third wells in the optimum DRL policy are placed in an unproductive region (a high-probability pay facies region in the ensemble, as seen in Fig. 20). The well placement policy generated using DRL shows a gradual update of the beliefs regarding the pay facies locations. The DRL algorithm learns about the non-existence of high-productivity regions at the locations of wells 2 and 3 and applies that knowledge to the placement of wells 4 and 5, which are located well within channel regions. In the single-stage optimization policy (implementing gradient-based optimization with the same reward formulation as the DDQN method), every single well placement has an exaggerated effect on the beliefs regarding the location of channels, and the well placement oscillates wildly from one region of the reservoir to the next. In real-world scenarios, field development decisions may involve the initial drilling of an exploratory well; then, depending on the observation of reservoir properties at the well location, a decision would be taken either to step out and drill an offset well near the first well or, in case the first well happens to be dry, to explore a different region of the reservoir. While the single-stage optimization policy and the DRL policy derive well placement strategies with similar overall reward, the DRL policy mimics this human decision-making process.

As mentioned in Sect. 2, RL techniques are guaranteed to deliver the optimal well placement policy, and to outperform techniques that do not account for future actions in multi-stage well location optimization problems, only when the value functions for all state-action pairs are accurately quantified. The similarity in performance between the single-stage and DRL policies can therefore be attributed to the early termination and incomplete exploration of the state-action space by the DRL agent. Future research will attempt to quantify the variation in the total reward with an increasing number of well placement decision stages. This will aid in quantifying the advantages and disadvantages of human-mimicry in well placement policies. The efficacy of the DRL model has also been demonstrated in another 3-D case study, which can be found in [44].

Fig. 21 Demonstration of the developed DRL policy on the ground truth in the 3-D reservoir case

Fig. 22 Demonstration of the single-stage optimization policy on the ground truth in the 3-D reservoir case

To extend the formulation to continuous state properties, such as permeability and porosity, and continuous action properties, such as well operating conditions, greater granularity in well location, and the optimization of horizontal well trajectories, the existing formulation can be modified to include such features in the reward function. Proxy models to evaluate such actions can also be developed. Another avenue forward is to utilize policy gradient methods [46] to directly estimate the optimal policy without evaluating the value functions. These extensions would speed up the DRL process, albeit at the cost of some loss in the optimality of the developed policy. These improvements will be areas of study for future research.

5 Discussion

Through the case studies presented in this paper, the authors aim to demonstrate the applicability of reinforcement learning to well location optimization. The paper also presents the pitfalls of poor modeling assumptions that result in sub-optimal well placements. It is evident that, through intelligent sampling, the DRL agent can identify high-productivity reservoir regions while accounting for constraints on well placement. The advantages of reinforcement learning compared to other well location optimization techniques include:

  1. The ability to consider expert information and ease of integration with existing procedures for well location optimization.

  2. Taking a long-term view of the well placement problem. Algorithms for well optimization studied in the literature take a short-horizon view of the problem and do not account for the information gained from placing a well; doing so in traditional workflows would require retraining after every well placement action.

  3. The ease of integration of optimization methods with existing deep learning infrastructure. The field of reinforcement learning is one of the biggest beneficiaries of advances in deep learning research.

  4. Policy gradient methods can aid in the consideration of continuous state-action spaces, allowing a more diverse set of reservoir properties to be considered. Though this comes at the expense of the guarantee of an optimal solution, policy gradient approximations can deliver excellent solutions to the problem at hand when computational resources are scarce.

The authors plan to study policy gradient methods and extend the current formulation to account for the placement of horizontal wells. A comparison between the results developed using temporal difference (TD) methods (as shown in the current work) and policy gradient methods can lay the path for further studies into the use of reinforcement learning for well location optimization.

Reinforcement learning is not a panacea and will not solve all issues associated with the well placement problem. Some of the limitations of reinforcement learning include:

  1. Its inherent dependence on the assumptions behind the modeling of the environment (which dictates the transition probabilities, \(T({s}^{{\prime}}|s,a)\)), leading to the development of sub-optimal policies if based on unrealistic modeling assumptions.

  2. Its high computational expense. It is suitable for multi-step optimization problems but not recommended for single-stage optimization problems.

  3. Its need for reward proxies to speed up the evaluation of candidate wells (or actions in a particular state), because full reservoir simulation for each well placement decision can be computationally expensive.

  4. Its low interpretability due to the utilization of deep learning models as functional approximators.

6 Conclusion

Reinforcement learning is useful for finding optimal solutions to multi-stage decision problems, especially those with feedforward effects of decisions, i.e., where future decisions are affected by decisions taken in past stages. Due to the computational intensity of applying RL, it may not be useful for single-stage optimization problems or for problems with extremely low uncertainty regarding the state-action space.

The problem of placing wells in an uncertain environment can be formulated as a Markov decision process. The authors have demonstrated a novel approximate dynamic programming framework for addressing the problem. Reinforcement learning provides a unique framework for automating decision-making by considering several scenarios and extending the optimal solutions of the sub-problems (i.e., the location of each well) into a comprehensive solution for exploiting the reservoir. Also, due to the ease of integrating expert information and reusing existing tools, the policies developed using reinforcement learning provide a geostatistically and petrophysically consistent framework for addressing the problem of well location optimization. Through the addition of intelligent sampling techniques and the use of approximately greedy methods for policy determination, the process of locating wells for exploiting reservoir resources can be sped up. The work also utilizes proxy models developed using regression and deep learning models that allow for the faster evolution of optimum well locations. The selection of the reward function dictates the convergence of the DRL algorithms, and a poor reward formulation may lead to the development of non-optimal policies. The paper also presents cases demonstrating that the policies developed using reinforcement learning are superior to existing single-stage optimization techniques, but the quality of the solutions developed depends on the accuracy and precision of the prior models in the ensemble and on the parameters that drive the DRL process.