Introduction

The data analytic lifecycle for big data problems and data science projects is divided into six phases (EMC 2015): discovery, data preparation, model planning, model building, communication of results, and operationalization. Ideally, a project moves in the feed-forward direction (from discovery to operationalization), but in practice it often moves back and forth between phases throughout its duration. The goal of the discovery phase is to develop a problem statement and formulate an initial hypothesis that we test using data. The data preparation phase includes gathering the needed data from various sources, cleaning the data and performing the necessary transformations, and coming up with a plan for handling and storing both the collected data and the data generated during the project lifecycle. In the third phase, model planning, we assess methods for building the model in the next phase and develop a workflow for building it. The primary activity in this phase is exploratory data analysis (EDA), with the aim of examining relationships among variables and selecting the variables that are of interest in the project or that show promise during EDA for further consideration. This phase can also suggest appropriate models for consideration during the model-building phase. In the model-building phase, we construct the model using the information from the previous phases and the workflow developed in the model planning phase (Wigwe et al. 2020).

In the oil and gas industry, spatio-temporal and other machine learning models have been applied in a range of projects. The general area of application of data-driven analytics is called “Petroleum Data Analytics” (Mohaghegh 2016). Ettehadtavakkol and Jamali (2019) presented a spatio-temporal analysis of water production from the Marcellus shale using kriging estimation. Siddiqui et al. (2019) used machine learning models to study fluid type variation and completion optimization in the Eagleford. Zhou et al. (2014) applied data mining techniques to evaluate gas production performance in the Marcellus. Wigwe et al. (2019a, b) presented both spatial and neural network techniques to analyze Bakken oil production, while Zargari and Mohaghegh (2010) showed an application of machine learning models to Bakken field development planning. Simha et al. (2019) integrated a spatio-temporal unsupervised learning method with reservoir simulation to identify a unique scenario for assessing the impact of uncertainties on production. Although the modeling workflow presented in this paper is similar to that of other data analytic applications, the specific methods implemented and their formulation are not native to the oil and gas industry. We therefore present a brief mathematical background and some details on the two spatio-temporal modeling techniques used in this paper before presenting the results of their application to oil and gas production prediction.

Spatio-temporal statistics

Almost all data are associated with space and time. In “non-spatio-temporal” data, the spatio-temporal (ST) components were either not recorded or were discarded because space and time were not of interest to the observer or the research objective. In a spatio-temporal dataset, we include information about where and when the data were collected (Cressie and Wikle 2011). Spatio-temporal statistics is the statistical analysis of such data. In ST data analysis, the analyst may be interested in one or more of the following goals (Wikle et al. 2019):

  1. Gaining more understanding of the data.
  2. Looking for a relationship between two ST processes.
  3. Making predictions in space and time.
  4. Making inference on model parameters.
  5. Forecasting in time, etc.

Two traditional approaches to spatio-temporal modeling are the descriptive approach and the dynamic approach. In the descriptive approach, the spatio-temporal model uses the mean and covariance functions to characterize the ST process (Cressie and Wikle 2011; Wikle et al. 2019). The kriging method is based on this approach. Variability (or uncertainty) is captured through a marginal probability distribution. There are several reasons why this modeling approach is used in practice, one of them being a lack of understanding of the spatio-temporal process under study. The dynamic approach further incorporates how spatio-temporal processes evolve over time and is built from conditional probability distributions (Cressie and Wikle 2011). Stroud et al. (2001) proposed a modeling framework for space–time data that accounts for spatial variability by modeling the mean function at each time as a locally weighted mixture of regression surfaces, and for temporal variability by allowing the component surfaces to evolve through time.

Methodology

Spatio-temporal exploratory data analysis (EDA)

Visualization of spatio-temporal data poses specific challenges because of the number of dimensions needed to present the plots. At least three dimensions should be displayed at the same time (Cressie and Wikle 2011), representing two- or three-dimensional space and time. Some of these plots come in the form of maps, colors, and animations, and enable a simple presentation of important information that leads to the development of appropriate spatio-temporal models (Wikle et al. 2019). Static maps, multi-panel plots (Pebesma 2012), Hovmöller diagrams (Cressie and Wikle 2011; Hovmöller 1949; Pebesma 2012; Wikle et al. 2019), and animations are all space–time plots. We generally use histograms and boxplots to show the distribution of a single continuous variable (EMC 2015; Navidi 2015; Westfall and Henning 2013; Wikle et al. 2019). For details and applications of these ST visualizations to oil and gas datasets, see Wigwe and Watson (2021).

Spatio-temporal models

At a minimum, a dynamic spatio-temporal model in a hierarchical modeling framework requires the specification of a data model and a process model. The data model is a conditional model of the data, conditioned on the true process of interest and some model parameters. The process model captures how the spatio-temporal process evolves with time, given some parameters. Either a model for the parameters is specified, which yields the Bayesian hierarchical model (BHM), or estimates of the parameters are plugged in, which yields the empirical hierarchical model (EHM). In general, we represent a spatio-temporal model by the stochastic process in Eq. (1) (Blangiardo and Cameletti 2015):

$$Y\left( {s, t} \right) \equiv \left\{ {y\left( {s, t} \right),\left( {s, t} \right) \in {\mathcal{D} } \subset {\mathbb{R}}^{2} \times {\mathbb{R}}} \right\}$$
(1)

where \({\mathcal{D}}\) is the space–time domain, a subset of \({\mathbb{R}}^{2} \times {\mathbb{R}}\) (two spatial dimensions and time), and \(y\left( {s,t} \right)\) indicates that the process is indexed by space and time. In the framework presented by Cressie and Wikle (2011), at the top level of the hierarchical model, the data model is given by Eq. (2):

$$\left[ {\left\{ {Z\left( {x; r} \right):x \in D_{s} , r \in D_{t} } \right\}|\left\{ {Y\left( {s; t} \right):s \in {\mathcal{N}}_{s} , t \in {\mathcal{N}}_{r} } \right\}, \theta_{D} } \right]$$
(2)

where \(Z\left( {x;r} \right)\) is an observation (data) at spatial location \(x\) and time \(r\), \(Y\left( {s;t} \right)\) is the process of interest at spatial location \(s\) and time \(t\) and \(\theta_{D}\) represents the model parameters for the data model, which could vary with space and/or time. At the second level, the process model is given by Eq. (3):

$$\left[ {Y\left( {s;t} \right)|\left\{ {Y\left( {w;t - \tau_{1} } \right): w \in {\mathcal{N}}_{s}^{\left( 1 \right)} , \ldots , Y\left( {w;t - \tau_{p} } \right): w \in {\mathcal{N}}_{s}^{\left( p \right)} } \right\}, \theta_{P} } \right]$$
(3)

where \({\mathcal{N}}_{s}^{\left( 1 \right)} , \ldots , {\mathcal{N}}_{s}^{\left( p \right)}\) are neighborhoods of spatial location \(s\) corresponding to the time lags \(0 < \tau_{1} < \cdots < \tau_{p}\), and \(\theta_{P}\) represents the parameters of the process model, which could also vary with space and/or time. At the third and final level, the parameter model is given by \(\left[ {\theta_{D} , \theta_{P} |\theta_{h} } \right],\) where \(\theta_{h}\) represents the hyper-parameters.
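The data and process levels can be illustrated with a toy simulation. The sketch below is not the model fitted in this paper: it assumes a first-order autoregressive process at each site (one lag, \(\tau_{1} = 1\), with the neighborhood reduced to the site itself), Gaussian noise at both levels, and fixed plug-in parameter values in the spirit of an EHM. The function name `simulate_ehm` and all parameter values are hypothetical.

```python
import random

def simulate_ehm(n_sites=5, n_times=10, rho=0.8, sigma_eta=0.5,
                 sigma_eps=0.2, seed=0):
    """Simulate a toy two-level hierarchical ST model with fixed parameters.

    Process model (assumed): Y(s; t) = rho * Y(s; t - 1) + eta, eta ~ N(0, sigma_eta^2)
    Data model (assumed):    Z(s; t) = Y(s; t) + eps,           eps ~ N(0, sigma_eps^2)
    """
    rng = random.Random(seed)
    # Initial state of the latent process at every site.
    y = [[rng.gauss(0.0, sigma_eta) for _ in range(n_sites)]]
    # Process level: each time step is conditioned on the previous one.
    for _ in range(1, n_times):
        prev = y[-1]
        y.append([rho * prev[s] + rng.gauss(0.0, sigma_eta)
                  for s in range(n_sites)])
    # Data level: observations are the latent process plus measurement error.
    z = [[y_st + rng.gauss(0.0, sigma_eps) for y_st in row] for row in y]
    return y, z

latent, observed = simulate_ehm()
```

Placing prior distributions on `rho`, `sigma_eta`, and `sigma_eps` instead of fixing them would correspond to the parameter model of a BHM.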

Fixed rank Kriging (FRK)

FRK facilitates optimal spatial prediction for large spatial and spatio-temporal (ST) datasets (Wikle et al. 2019; Zammit-Mangion and Cressie 2017). It constructs a spatial random effects model on a fine-resolution discretization of the spatial domain into basic areal units (BAUs), whose primary use is to account for problems related to change of support (Wikle et al. 2019). The model decomposes the spatial random process using spatial (or ST) basis functions. Model parameters are estimated using the expectation–maximization (EM) algorithm. This prediction framework is computationally efficient because the basis functions reduce the dimensionality of the problem. If covariates are included in the model, they must be specified at the BAU level.

Generalized additive models

Generalized additive models (GAMs) are generalized linear models in which the linear predictor includes smooth functions of the covariates (Hastie and Tibshirani 1986, 1990; Wood 2017). The model is given by Eq. (4):

$$g\left( {\mu_{i} } \right) = \varvec{A}_{\varvec{i}}\varvec{\theta}+ f_{1} \left( {x_{1i} } \right) + f_{2} \left( {x_{2i} } \right) + f_{3} \left( {x_{3i} , x_{4i} } \right) + \ldots$$
(4)

where \(\mu_{i} \equiv {\mathbb{E}}\left( {Y_{i} } \right)\) and \(Y_{i} \sim EF\left( {\mu_{i} , \phi } \right)\). Here \(Y_{i}\) is a response variable, \(EF\left( {\mu_{i} , \phi } \right)\) denotes an exponential family distribution with mean \(\mu_{i}\) and scale parameter \(\phi\), \(\varvec{A}_{\varvec{i}}\) is a row of the model matrix for any parametric model components, \(\varvec{\theta}\) is the corresponding parameter vector, and the \(f_{j}\) are smooth functions of the covariates \(x_{k}\). The model allows a flexible specification of the dependence of the response on the covariates by specifying the model only in terms of smooth functions rather than detailed parametric relationships. The smooth functions are represented using penalized regression splines (or other splines) with a specified basis dimension, \(k\), which controls the maximum flexibility of each smooth. Thin plate regression splines, cubic regression splines, and splines on the sphere are some of the most popular bases used in GAMs. Basis functions (Wood 2000, 2003, 2017) enable efficient approximation in this setup; the degree of smoothing is controlled by a smoothing parameter, which we choose by minimizing the generalized cross-validation (GCV) score.
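For reference, the GCV score is commonly written in the following form. The symbols \(D\), \(\varvec{A}\), and \(n\) are introduced here for illustration and are not part of Eq. (4): \(D\) is the model deviance evaluated at the fitted coefficients, \(\varvec{A}\) is the influence (hat) matrix of the fitted model, whose trace gives the effective degrees of freedom, and \(n\) is the number of observations.

$${\mathcal{V}}_{g} = \frac{{n\,D\left( {\hat{\varvec{\beta }}} \right)}}{{\left[ {n - {\text{tr}}\left( \varvec{A} \right)} \right]^{2} }}$$

For a Gaussian response with identity link, the deviance reduces to the residual sum of squares, \(\sum_{i} \left( {y_{i} - \hat{\mu }_{i} } \right)^{2}\). A smoother fit lowers \({\text{tr}}\left( \varvec{A} \right)\) but raises the deviance, so minimizing the score trades goodness of fit against model complexity.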

Model selection, validation, and diagnostics

Training-data validation, within-sample validation, forecast validation, hindcast validation, and cross-validation are the approaches used to compare model predictions with real-world observations (Hastie et al. 2009; Wikle et al. 2019). We compare the performance of several models on the training and test data using model diagnostics, which are also used to select the best model. Graphically, for regression problems, residual plots are useful for checking model assumptions. The conditional quantile plot (Wilks 2011) is another useful diagram; it plots the predicted values on the x-axis and, on the y-axis, the quantiles of the empirical predictive distribution of the observations associated with those predictions. Any bias in the model predictions shows up as a departure from the 45° diagonal line. The mean squared prediction error (MSPE), or its square root (RMSPE), is the most widely used diagnostic/model validation statistic. It captures issues related to both bias and variance. The predictive cross-validation score (PCV) and the standardized cross-validation score (SCV) are also used to evaluate model performance (Kang et al. 2009). Lower values of PCV and MSPE, and SCV closer to 1, indicate better model performance. Scoring rules for spatio-temporal predictions compare the predictive distribution with a validation observation. A commonly used scoring rule for continuous variables is the continuous ranked probability score (CRPS); models with lower CRPS are better. The Akaike information criterion (AIC) is a model selection criterion computed from the maximized likelihood on the training data; it guards against overfitting by penalizing the number of parameters used in fitting the model. When comparing several models, the model with the lowest AIC is preferred. For a detailed treatment, see Hastie et al. (2009), Hooten and Hobbs (2015), Westfall and Henning (2013), Wikle et al. (2019), and Wilks (2011).
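Three of these diagnostics can be sketched in a few lines. The code below is illustrative only and is not the implementation used for the results in this paper. The CRPS function assumes a Gaussian predictive distribution, for which a closed form exists; other predictive distributions would need a different formula or a sample-based estimate. The function names are hypothetical.

```python
import math

def mspe(obs, pred):
    """Mean squared prediction error over paired observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)

def rmspe(obs, pred):
    """Root mean squared prediction error (same units as the response)."""
    return math.sqrt(mspe(obs, pred))

def crps_gaussian(y, mu, sigma):
    """CRPS of a N(mu, sigma^2) predictive distribution at observation y.

    Assumes a Gaussian predictive distribution; lower values are better.
    """
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log(L) at the maximum likelihood fit."""
    return 2.0 * n_params - 2.0 * log_likelihood
```

For a fixed observation and predictive mean, the Gaussian CRPS grows when the predictive standard deviation is much too large or much too small, which is why it rewards predictions that are both accurate and well calibrated.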

Results and discussion of case studies

This section presents the application of spatio-temporal models to the characterization of well performance. First, we present a geological overview of the five formations used as case studies. Using the available data, we evaluate and compare well production performance across these formations on a yearly basis. We also carry out this comparative analysis using the first-year production data by completion year to capture any improvement in the performance of new wells over time. To build the spatio-temporal models, we develop a workflow for each method that is applied across all formations. Using this workflow, we build the spatio-temporal models, evaluate their performance using the model diagnostics presented under methodology, and discuss the results. The plots and discussion of model results are for the major fluid phases in these formations. Models for the minor phases were also developed, and their results are presented in tabular form at the end.

Geology of formations and data preparation

The Bakken formation lies in the Williston Basin and covers the western part of North Dakota and parts of Montana, as well as Manitoba and Saskatchewan in Canada. It was discovered in 1953. It has an areal coverage of about 200,000 sq. miles, and the formation thickness ranges from 0 to 140 ft. The depth of the formation ranges from about 1000 ft in parts of Canada to about 15,000 ft in some areas of North Dakota (Kuhlman et al. 1992). The upper and lower members consist mostly of shale, while the middle member consists mostly of sandstone, limestone, and siltstone. The upper and lower Bakken are organic-rich marine shales and are the petroleum source rock for the oil and gas produced from the Bakken petroleum system (Kumar et al. 2013; Li et al. 2015; Sonnenberg 2014; USGS 2008). The middle Bakken reservoir has been the focus of most development activity in recent years (Kumar et al. 2013). Hydrocarbon generation overpressured the formation, which in turn created natural fractures. These natural fractures have been the main cause of increased permeability and productivity within the Bakken formation (Jin et al. 2015; Tomomewo et al. 2019; Tran et al. 2011). In 2008, the United States Geological Survey (USGS) estimated that the Bakken formation contains between 3 and 4.3 billion barrels of recoverable oil and 1.85 trillion cubic feet of associated gas (USGS 2008). We retrieved production and completion data from the North Dakota Industrial Commission (NDIC) and DrillingInfo. The Bakken analysis focuses on McKenzie County, in which 2349 horizontal wells with completion data are available for study.

The Marcellus formation is an organic-rich shale that occurs in the subsurface of five states in the USA: Ohio, West Virginia, Pennsylvania, Maryland, and New York (Bartuska et al. 2012; Koesoemadinata et al. 2011; Yildirim et al. 2019; Zamirian et al. 2016). The formation is divided into the Upper Marcellus and the Lower Marcellus, with the Lower Marcellus having a significantly higher concentration of organic matter than the Upper Marcellus. It covers an area of more than 100,000 sq. miles. The EIA (2017) estimates oil reserves of 143 million barrels (MMbbls) and 410 trillion cubic feet (Tcf) of gas in place, with recoverable gas at 50 Tcf. The Marcellus dataset contains 2020 horizontal wells completed since 2008 in three counties in the south-west corner of Pennsylvania: Washington, Greene, and Fayette.

The Eagle Ford Shale is a hydrocarbon-bearing formation in South Texas. It is best known for producing variable amounts of dry gas, wet gas, natural gas liquids, and condensates, and more oil than other traditional shale plays. It is believed to be the source rock for many conventional oil and gas fields in the Texas Gulf Coast. The formations usually targeted are the Lower Eagle Ford, the Upper Eagle Ford, and the Austin Chalk (Shelley et al. 2012). It extends across 26 counties from East Texas to the Mexican border, covering about 20,000 sq. miles, with a thickness ranging from 50 to 300 ft according to the Railroad Commission of Texas (RRC 2020). The formation is estimated to contain 66 trillion cubic feet of natural gas, 8.5 billion barrels of oil, and 1.9 billion barrels of natural gas liquids (USGS 2018). Most of the rock within the Eagle Ford formation is very brittle, which makes it a good candidate for horizontal drilling and hydraulic fracturing (Jaripatke and Pandya 2013; Lalehrokh and Bouma 2014; Nwabuoku 2011; Siddiqui et al. 2019). The Eagleford dataset contains 3413 horizontal wells located in La Salle County, TX.

The Delaware Basin is one of the prolific basins in the USA and comprises stacked reservoirs: the Wolfcamp shale, the Bone Spring formation, and the Avalon shale. The Wolfcamp shale and the Bone Spring formation are collectively called the “Wolfbone” play (Lohoefer et al. 2014a, b; Lalehrokh and Bouma 2014; Sharma et al. 2014; Yates et al. 2013). The Delaware Basin is in southeastern New Mexico (Eddy, Chaves, and Lea Counties) and West Texas (Culberson, Pecos, Loving, Terrell, Reeves, Ward, and Winkler Counties). The USGS estimates recoverable hydrocarbons in excess of 19 billion barrels of oil, 1.6 billion barrels of natural gas liquids, and 16 trillion cubic feet of natural gas (EIA 2018, 2019). The Wolfcamp and Bone Spring dataset contains 2295 horizontal wells drilled in Lea County, NM.

We gathered data for 10,077 horizontal wells from January 2008 to June 2019 from state agencies and DrillingInfo. The data were cleaned, combined in a long format, and stored in a database for analysis (Table 1). The basic covariates needed for spatio-temporal modeling are space and time, where space is location (longitude and latitude) and time is the number of months since January 2008. Hence, \(t = 1\) for January 2008, \(t = 12\) for December 2008, and so on. The space–time covariates are available for all formations and were used to characterize the spatio-temporal process for each formation. The Bakken formation has additional information: the wells have completion data (stages, perforated interval, pounds of proppant, and volume of fluid) and geologic data (total organic carbon (TOC) content of the upper and lower Bakken, along with isopachs for the upper, middle, and lower Bakken; source: Nordeng and Helms 2010). These additional covariates were included in the Bakken model. It was necessary to normalize the covariates before modeling, as this expedited computation and removed any scaling effect. In this analysis, we used min–max normalization to scale the variables to \(\left[ {0, 1} \right]\) using \(n\_X_{i} = \left( {X_{i} - \hbox{min} \left( X \right)} \right)/\left( {\hbox{max} \left( X \right) - \hbox{min} \left( X \right)} \right)\). The six-month cumulative oil and gas production are the dependent variables. We apply a log transformation (or a log-link function for the GAM model) to the dependent variables because they are approximately lognormally distributed.
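The time index and the min–max scaling described above can be sketched as follows. This is a minimal illustration rather than the code used in the study; the function names `month_index` and `min_max` and the stage counts are hypothetical.

```python
from datetime import date

def month_index(d, origin=date(2008, 1, 1)):
    """Month counter with the origin month as t = 1 (January 2008 -> 1)."""
    return (d.year - origin.year) * 12 + (d.month - origin.month) + 1

def min_max(values):
    """Scale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant column")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical stage counts for four wells (illustrative values only).
stages = [20, 30, 35, 50]
scaled_stages = min_max(stages)

t_jan_2008 = month_index(date(2008, 1, 15))
t_dec_2008 = month_index(date(2008, 12, 1))
t_jun_2019 = month_index(date(2019, 6, 30))
```

Because the scaling depends on the minimum and maximum of the fitted data, the same two values would need to be reused when scaling covariates at new prediction locations.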

Table 1 Number of horizontal wells in each formation

Comparative production analysis

Figure 1 shows a comparative plot of the production performance of wells in each formation. The Bakken, Eagleford, and Wolfbone formations primarily produce oil while the Marcellus is a gas-rich formation, with the Eagleford producing a good amount of gas, especially during its initial development years. Gas production from the Eagleford has been on a decline since 2010 with most of the new wells targeting oil-rich portions of the formation. The number of active wells has steadily increased in all formations while new wells are on the rise for the Permian basin formation. This plot is informative but does not fully display the performance of the new wells shown on the bottom right in Fig. 1.

Fig. 1 Comparison of production from all formations. Oil and gas yearly productions are reported on a per well basis to eliminate the effect of well count. Colors used are consistent across all four plots as shown in the legend

Figure 2 shows the 12-month cumulative production per well for new wells only. Oil and gas production from new wells increases each year for all formations, except for gas production from the Eagleford, for the reasons mentioned above. The increased productivity of new wells could be a direct result of technological improvements in drilling and completion design: longer laterals/extended reach wells, more stages, more pounds of sand and a larger injected volume of frac fluid, fracture complexity (zipper fracs) resulting in the opening of natural fractures, etc. The Bakken formation consistently outperforms the other oil-producing formations, with comparable production from the Wolfbone formation since 2016.

Fig. 2 Comparison of 12-month cumulative production from new wells for all formations as a function of completion year. Oil and gas production are reported on a per well basis to eliminate the effect of well count. Colors used are consistent across all plots. 2019 data contains at best four to six months of production data and hence could equal or surpass prior year’s results

Model setup

Figure 3 shows the modeling workflow. For the FRK model, we construct the BAUs using the data. BAUs are the basic framework on which we build the model and carry out predictions. Figure 4 shows the constructed BAUs. In Fig. 5, we show the spatial basis functions constructed for each formation using two resolutions. Adding a second resolution helps capture finer details across space. With two resolutions, we generate a total of 77 basis functions across the spatial domain. There are 12 temporal basis functions with an aperture size of 6 months spanning the 11.5 years of production data (Fig. 6). Taking the tensor product of the spatial and temporal basis functions results in \(77 \times 12 = 924\) spatio-temporal basis functions that we supply to the FRK model.
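The tensor-product construction can be sketched on a toy example. The code below is illustrative only: it assumes bisquare basis functions, and the centres, apertures, and evaluation point are hypothetical. Only the counts of 77 spatial and 12 temporal basis functions come from the text.

```python
def bisquare(distance, aperture):
    """Bisquare kernel: (1 - (d / aperture)^2)^2 inside the aperture, 0 outside."""
    if abs(distance) >= aperture:
        return 0.0
    return (1.0 - (distance / aperture) ** 2) ** 2

def tensor_product(spatial_vals, temporal_vals):
    """All pairwise products phi_i(s) * psi_j(t); the spatial index varies slowest."""
    return [a * b for a in spatial_vals for b in temporal_vals]

# Hypothetical toy setup: 3 spatial centres on a line and 2 temporal centres.
spatial_centres = [0.0, 1.0, 2.0]
temporal_centres = [0.0, 6.0]

s, t = 0.5, 3.0  # a single space-time point
phi = [bisquare(s - c, aperture=1.5) for c in spatial_centres]
psi = [bisquare(t - c, aperture=6.0) for c in temporal_centres]
st_basis = tensor_product(phi, psi)  # 3 * 2 = 6 space-time basis values

# Counts reported in the text: 77 spatial and 12 temporal basis functions.
n_st = 77 * 12
```

Each space-time basis function is nonzero only where both its spatial and its temporal factor are nonzero, which keeps the resulting basis matrix sparse when the apertures are small relative to the domain.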

Fig. 3 Modeling workflow/flowchart

Fig. 4 Basic areal units (BAUs) generated for: (a) Bakken, (b) Wolfbone, (c) Eagleford, and (d) Marcellus. We can perform FRK predictions on these grids

Fig. 5 FRK model spatial basis functions for: (a) Bakken, (b) Wolfbone, (c) Eagleford, and (d) Marcellus. Two resolutions produce 77 basis functions across each spatial domain

Fig. 6 Temporal basis functions for the FRK models, with 12 basis functions. The same specification was used across all four formations. The tensor product of the spatial and temporal basis functions yields 924 ST basis functions for the FRK model

For the GAM model, the type of basis and its dimension enable a reasonable approximation of the underlying data-generating process. We used the thin plate regression spline, “\(tp\)”, for the spatial dimension and the cubic regression spline, “\(cr\)”, for the temporal dimension, and took their tensor product to obtain the ST basis functions. Selection of the basis dimension, \(k\), which is the number of basis functions to construct, is an iterative process. It enables the model to capture the inherent variability in the data. After several iterations, we arrived at \(k = \left( {50, 20} \right)\) as the appropriate basis dimension, leading to \(50 \times 20 - 1 = 999\) spatio-temporal basis functions. This is comparable in size to the FRK setup (924 basis functions). Once these components are determined, we build the GAM model. If any of the covariates or smooth functions are unimportant, as reported by the p value at the 5% level, we drop them and re-fit the model. Finally, we make predictions at test locations, visualize the model results, and save the best model for deployment at undrilled locations.
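The count of 999 can be read off the tensor-product form of the space–time smooth. The notation below is introduced for illustration only: \(a_{i}\) denotes the 50 spatial (thin plate) basis functions, \(b_{j}\) the 20 temporal (cubic regression) basis functions, and \(\beta_{ij}\) their coefficients.

$$f\left( {s,t} \right) = \sum\limits_{i = 1}^{50} {\sum\limits_{j = 1}^{20} {\beta_{ij} \,a_{i} \left( s \right)\,b_{j} \left( t \right)} }$$

This gives \(50 \times 20 = 1000\) coefficients. Assuming the usual sum-to-zero identifiability constraint on the smooth, which separates it from the model intercept, one degree of freedom is removed and \(1000 - 1 = 999\) free coefficients remain.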

Bakken formation

Figure 7 shows a multi-panel, faceted spatial plot of the Bakken data. It shows the yearly oil production in barrels per day per stage, where the days are the total number of days the well was online in the specified year. Production is represented by the colors, while the size of each bubble corresponds to the number of stages. Each facet is annotated with the year, the total number of wells, and how many of those wells were new completions. In 2013, for instance, the map contains 986 wells, of which 406 are new completions. Although we see larger bubbles as we go across time, there does not seem to be any spatial correlation with the number of stages. Similar observations were made using the other completion variables available (the pounds of proppant, the volume of fluid, the perforated interval, and the per-stage versions; plots not shown).

Fig. 7 Spatial map of wells in the Middle Bakken in McKenzie County. Colors represent yearly oil production in barrels per day per stage while the size is mapped to the number of stages. See Wigwe et al. (2020)

Another observation from Fig. 7 is the increase in activity around the north-east corner of the county. Wells drilled in this area have a higher initial rate than wells drilled in the western part, regardless of the number of stages used in completion. Geologically, the Middle Bakken is thicker in this area. Figure 8 shows the distribution of six-month cumulative oil production. The left panel shows actual production values, which suggest that, on average, oil production increases with time as the distribution shifts to the right. Each plot has a similar distributional shape but different characteristics (average, standard deviation, kurtosis, and skewness). Adding this temporal dimension gives further insight into incremental production, or the addition of reserves, as new wells come online. What the plot does not show is that technology has changed: operators have increased the length of laterals, the number of stages, the volume of fluid, and the pounds of proppant pumped in these completion-dependent unconventional wells. Consequently, a similar plot that accounts for the size of the completion becomes necessary (right panel). It shows that the temporal dimension does not have a notable effect on per-stage normalized production, as the six-month cumulative oil production per stage has a distribution whose parameters are in a comparable range across time. If completion data are not available, then modeling oil production as dependent on time is advisable.

Fig. 8 Temporal histogram showing the distribution of Bakken oil production data

The correlation plot matrix (Fig. 9) shows that several variables are correlated. One solution is to use principal components (PCs) as variables in the model (Everitt and Hothorn 2009; Hastie et al. 2009; Jolliffe 2002; Zhou et al. 2014). An additional advantage of PCs is dimensionality reduction when the number of covariates is high. There are 9 covariates in the Bakken data, which would yield 9 PCs; this is a moderate number of covariates. Some authors have argued for using all of the PCs in regression analysis rather than only the first few that capture enough variability in the data, because the lower-variance PCs may still have a significant influence on the model at the 5% level (see Jolliffe 2002). Following this suggestion would yield a result similar to that obtained with the original covariates. Hence, we carry out this analysis with the original scaled variables regardless of the correlation.

Fig. 9 Correlation plot of Bakken variables

The temporal histogram (Fig. 8) suggests an approximately lognormal distribution, but a gamma distribution gave the best fit and is therefore more appropriate (Fig. 10).

Fig. 10 Suggested choice of distribution for fitting six-month oil. Legend is ordered from best to worst fit

Table 2 shows a summary of the full model for the six-month oil production. This result implies that only the completion variables are significant at the 5% level in this model (as shown by the probability column) and the completion year does not influence the model performance. The result also shows that including the ST component would significantly improve the model. Consequently, the final model contains the significant variables. We used the final model for prediction at test locations. Figure 11 shows the actual and predicted values for the GAM and FRK model. The FRK model outperforms the GAM model and would be the preferred model for this dataset. We evaluated other model diagnostics and selection criteria and present the results in the “Model summary” section.

Table 2 GAM full model summary for the Bakken

Fig. 11 Actual versus predicted six-month oil for GAM (left) and FRK (right) for the Bakken formation

Eagleford formation

Figure 12 shows a map of the Eagleford with horizontal wells drilled since 2008. Wells with higher gas production are in the southern portion. In a previous study (Wigwe and Watson 2021), we observed increased activity from 2008 to 2019, with most wells brought online in 2013 and 2014. We also found a great deal of variability in oil and gas production, both spatially and temporally, across La Salle County. Hence, the goal of this study is to develop ST models that capture this variability in order to predict oil and gas production for undrilled locations. We present and discuss the results of the FRK and GAM spatio-temporal models of six-month gas production. A summary that includes model results for oil production is presented at the end.

Fig. 12 Map of Eagleford showing horizontal wells by major phases. We analyzed wells located in La Salle County in this study

Figure 13 shows the distribution of six-month oil and gas production using temporal histograms. A lognormal distribution, or a gamma distribution with a log-link function, would be appropriate for modeling this data. The left panel shows gas production decreasing with time as the distribution shrinks. This reduction in gas production is most likely due to the drilling of more wells in the oil-rich sections of the formation. The oil production (right panel) has not increased appreciably, but the minimum has moved to the right with time (Wigwe and Watson 2021).

Fig. 13 Temporal histogram for the Eagleford oil and gas data. Each plot shows a similar distribution across time

Figures 14 and 15 show model results for FRK and GAM, respectively. Both models result in a good match with the FRK performance slightly better for this formation. The right figures show the distribution of prediction error for each model. As we will show later, the model predictions tend to give a better match for gas production compared to oil production across all the formations. This is related to the better response of gas to completion than oil due to its smaller molecular size.

Fig. 14 Actual versus predicted six-month gas (left) along with the prediction error (right) for the Eagleford formation from the FRK model

Fig. 15 Actual versus predicted six-month gas (left) along with the prediction error (right) for the Eagleford formation from the GAM model

Marcellus formation

The space–time plot in Fig. 16 shows the development of the Marcellus in three counties in Pennsylvania. Fourteen of the wells drilled in 2008 clustered around the center of Washington County. Over time, development extended into the other two counties, such that by June 2019 there were 1976 active wells producing from the Marcellus in these three counties. Of the 12,000 points plotted on the map, only 1000 (about 8%) represent annual production below 100 MMcf, indicating how prolific these wells were.

Fig. 16 ST map of the Marcellus wells in the three Pennsylvania counties studied. Colors represent yearly gas production. Each panel shows the number of wells and new drills

Figure 17 shows the temporal boxplot of six-month gas production. Average production increased from 250 to 2000 MMcf, and the distribution for each successive period is more right-skewed because of higher-producing wells. This increase is due to technological improvements in drilling and completion design. In line with the previous analyses, a gamma or lognormal-type distribution is appropriate for modeling these data.

Fig. 17 Temporal boxplot showing bi-annual Marcellus gas production. The dashed lines represent the group average

Figures 18 and 19 show the results of the FRK and GAM models. Although the FRK model tends to underestimate the high gas volumes, resulting in a higher prediction error, both models sufficiently capture the spatio-temporal variability in the data. Figure 20 shows diagnostic conditional quantile plots of the results. These plots are useful for revealing any bias in the predicted values. As shown, the GAM model shows no bias in its predictions, while the FRK model shows bias when predicting gas production above 3 Bcf. We therefore select the GAM model for predicting the expected performance of undrilled locations.

Fig. 18 Actual versus predicted six-month gas for GAM (left) and FRK (right) for the Marcellus formation

Fig. 19 Distribution of prediction error using GAM (left) and FRK (right) for the Marcellus formation

Fig. 20 Conditional quantile plot of six-month gas in Bcf for GAM (left) and FRK (right) showing that both models produced predictions with no bias for the Marcellus dataset
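The idea behind a conditional quantile diagnostic can be sketched in a few lines: cases are grouped into bins by predicted value, and the quartiles of the observed values in each bin are compared with the mean prediction for that bin. An unbiased model has observed medians close to the predictions (the 1:1 line). The example below is an illustration under stated assumptions, not the code used to produce Fig. 20. The function name, the equal-count binning, and the toy data (a hypothetical model that cannot predict above 3 Bcf) are all assumptions.

```python
import statistics


def conditional_quantiles(predicted, observed, n_bins=5):
    """Bin cases by predicted value and summarize the observations per bin.

    Returns one dict per bin with the mean prediction and the 25th, 50th
    and 75th percentiles of the matching observed values.
    """
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) // n_bins
    if size < 2:
        raise ValueError("need at least two cases per bin")
    rows = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any remainder.
        stop = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[start:stop]
        obs = [o for _, o in chunk]
        q25, q50, q75 = statistics.quantiles(obs, n=4)
        rows.append({
            "pred_mean": statistics.fmean([p for p, _ in chunk]),
            "obs_q25": q25,
            "obs_median": q50,
            "obs_q75": q75,
        })
    return rows


# Toy data in Bcf (hypothetical): predictions are capped at 3 Bcf,
# so the model underestimates the most productive wells.
observed = [0.25 * i for i in range(1, 21)]
predicted = [min(o, 3.0) for o in observed]

rows = conditional_quantiles(predicted, observed, n_bins=4)
for row in rows:
    bias = row["pred_mean"] - row["obs_median"]
    print(f"pred {row['pred_mean']:.2f}  obs median {row['obs_median']:.2f}  bias {bias:+.2f}")
```

In this toy case the lowest bin lies on the 1:1 line, while the highest bin shows the observed median well above the mean prediction, which is the signature of underestimation at high volumes.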

Delaware basin: Wolfcamp and Bonespring – “Wolfbone”

We present data for oil production on the New Mexico side of the Delaware Basin (in Lea County). Average oil production increased gradually with time, with more drastic increases recorded from 2016 to 2019 (Fig. 21). A gamma distribution provided the best fit to the data for modeling oil production. Figure 22 shows perspective and contour plots of the GAM model. The plots are 3-D visualizations of predicted oil production, and the contour lines represent predicted six-month oil production. They show the flexibility of the GAM model in characterizing the variability in the data across the spatio-temporal domain. In this respect, the behavior of the model mimics reality. Figure 23 shows the predicted and actual oil production from the GAM and FRK models. Both models perform reasonably well and tend to overestimate production (biased high), although the FRK model generally performs better for this formation.

Fig. 21 Temporal boxplot showing bi-annual Wolfbone oil production, with the largest increase seen between the 2014–2015 and 2016–2017 periods

Fig. 22 GAM model predictions capture spatio-temporal variability in six-month oil production (in barrels) across the formation. The un-contoured regions indicate no activity or absence of data

Fig. 23 Actual versus predicted six-month oil for GAM (left) and FRK (right) for the Wolfcamp and Bonespring formations

Model summary

The formations studied differ in their spatio-temporal development and production response. The Bakken development has focused on the north-eastern part of the formation in McKenzie County. The Eagleford development covers La Salle County, with active development focused on the SW-NE diagonal. For the Marcellus, initial development was in Washington County and spread to the other two counties over time. Table 3 presents a summary of all final models constructed for the unconventional plays, along with the model diagnostics discussed in the “Model selection, validation, and diagnostics” section. We constructed separate models for oil and gas production. The mean is the mean of the data, while the predicted mean is the mean of the predicted values from each model. The GAM model performs poorly at predicting individual data points but performs well when estimating the mean of the data. Hence, if the goal of the study is to capture the mean, the GAM model should be selected; otherwise, the FRK model is better suited for the analysis.

Table 3 Summary of model diagnostics and selection criteria
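The distinction between matching the mean and matching individual data points can be made concrete with a small numerical sketch. The example below is illustrative only: the data are hypothetical, the function name is an assumption, and RMSE and R² stand in for pointwise diagnostics without implying that they are the exact quantities reported in Table 3.

```python
import math
import statistics


def summarize(observed, predicted):
    """Return the data mean, predicted mean, RMSE and R^2 for one model."""
    mean_obs = statistics.fmean(observed)
    mean_pred = statistics.fmean(predicted)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "mean": mean_obs,
        "predicted_mean": mean_pred,
        "rmse": math.sqrt(ss_res / len(observed)),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical observations and two hypothetical models.
observed = [10.0, 20.0, 30.0, 40.0, 50.0]
model_a = [30.0, 30.0, 30.0, 30.0, 30.0]  # reproduces the mean only
model_b = [13.0, 23.0, 33.0, 43.0, 53.0]  # tracks each point, biased high

summary_a = summarize(observed, model_a)
summary_b = summarize(observed, model_b)
print("model A:", summary_a)
print("model B:", summary_b)
```

Model A recovers the mean exactly but explains none of the point-to-point variability, whereas model B overestimates the mean by a constant amount yet follows the individual values closely. Which behavior is preferable depends on whether the study targets the mean or the individual predictions.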

The importance of drilling and completion technology, and of geological considerations, to the successful development of unconventional resources is well documented (Chong et al. 2010; Kolawole et al. 2019; Kolawole and Ispas 2019; Maity and Ciezobka 2019; Pope, et al. 2010; Sharma et al. 2014; Soliman, et al. 2014). For this reason, completion and geologic parameters were included in the Bakken models, because these variables were available. However, the performance of the models did not improve much over a case (results not presented) that used only location and time parameters. This could be because the ST parameters implicitly account for geology and technology, since the data reflect how changes in these factors have improved well performance over time (Wigwe et al. 2019a, b). As a result, we expect the models for the other formations to reasonably describe geological and technological variations across each play over time, even though these variables are absent from the models.

Conclusion

We presented an application of spatio-temporal models for production evaluation in a multi-basin study of four unconventional formations in the USA: the Bakken, Marcellus, Eagleford, and Wolfcamp and Bonespring formations. For each formation, we presented ST plots, including multi-panel space–time plots, temporal histograms, and temporal boxplots. These visualizations enabled us to select a suitable distribution and suggest appropriate covariates to include in the model. Using the workflow presented in Fig. 3, we fit two ST models (fixed rank kriging, FRK, and a spatio-temporal generalized additive model, GAM) and compared their predictions. The FRK model produced better results than the GAM model. Both models overestimated production, with more bias in the GAM model predictions than in the FRK model results. The GAM model performed better at predicting the mean of the data, so we select the GAM model if that is the goal of the study. In summary:

  1. From the production evaluation, we observed incremental average oil and gas production for new wells in each formation from 2008 to 2019. The Bakken formation consistently outperformed the other oil-producing formations during this period for new wells (Fig. 2).

  2. Spatio-temporal models have applications in the oil and gas industry. With properly formulated models, we can perform spatio-temporal production evaluation successfully, regardless of the specific basin studied, as shown in this paper. These techniques highlight the importance of space and time in production prediction, as they take geology and technological changes with time into account.

  3. The tensor product of space and time showed a strong influence on the GAM model.

  4. Overall, the models account for between 60 and 95% of the variability in the six-month production in the oil-producing and gas-producing formations.

  5. The FRK model performs better than the GAM model across all formations and production streams evaluated, except for the Marcellus, where the computed model diagnostics in Table 3 favor the GAM model. This could be because the FRK model is a specialty model designed specifically for spatial and spatio-temporal datasets, and as a result is able to capture the covariance structure of the dataset. Hence, the FRK model would be the preferred model for predicting the performance of undrilled locations.

  6. Both models tend to overestimate oil and gas production, with the GAM model showing a much higher bias compared to the FRK model.

  7. Overall, the gas models capture the variability in gas production better than the oil models capture the variability in oil production. This behavior of the models is consistent across all four formations. We posit that this observation is related to the response of the fluid phase to completion, with the molecular size of the gas phase playing a key role in this regard.

  8. Several variables could be included in a model formulation, as is apparent when carrying out a reservoir simulation study. The results presented in this paper suggest that space and time correlate strongly with oil and gas production, which is based on sound scientific principles in line with Tobler’s First Law. Hence, we recommend these methods to provide a first-pass result when studying a given field before commissioning a full-scale reservoir (simulation) study. With the availability of more covariates that influence the dependent variable, we expect model performance to improve.