A simulation model for dynamic behavior of directional sucker-rod pumping wells: implementation, analysis, and optimization

Sucker-rod pumping wells can be either vertical or directional. Over time, research efforts on the functioning of vertical wells led to a well-established set of mathematical models and practical tools. When it comes to directional wells, however, no general agreement has been reached, and the topic remains in active discussion. This paper revisits, extends, implements and optimizes an overlooked model, initially devised in 1995, whose computational complexity resulted in long processing times that stymied its adoption. This model fully utilizes the 3D trajectory of the rod string, allowing for the use of two viscous friction models and proposing its own formulation for downhole boundary conditions. The resulting model can be used to efficiently simulate the dynamic behavior of directional sucker-rod pumping wells taking into account the fluid flow inside the rod-tubing annulus. We present and analyze a serial and a parallel software implementation of this CPU-intensive model based on an explicit finite-difference method. We also describe our contributions to the accuracy and performance of the original model and software implementation. A rough approximation shows that the proposed serial version is about 200 times faster than the legacy original code, if we were to run the latter in a modern processor. On top of that our parallel implementation achieved a 6.5×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times $$\end{document} speedup over the serial version in a shared-memory system, making it a suitable tool for well design and optimization. The research contributes to the discussions on mathematical modeling of directional sucker-rod pumping wells, and illustrates how performance-focused techniques can enable the effective use of computationally demanding models to facilitate further refinements and applications.


List of symbols
Volumetric fraction of gas within the downhole pump (percentage varying from 0 to 1, where 1 means 100%) s Length of the segment chosen to divide the rod string into adjacent points (m) t Time interval between steps of the simulation (s) (s) Fluid viscosity at a given point in the rod string (Pa ⋅ s) Coulomb friction coefficient (dimensionless) Angular speed of the polished rod (rad/s) r (s) Density of the rod that a given point in the rod string is located on (kg/m 3 ) (s) Unit binormal vector at a given point in the rod string Acceleration of gravity (m/s 2 ) 1 3 (s) Curvature vector at a given point in the rod string (s) Unit normal vector at a given point in the rod string (s) Unit tangent vector at a given point in the rod string A 1 to A 6 , B 1 to B 6 Fourier coefficients (dimensionless) A p Area of the transversal section of the plunger (m 2 ) A r (s) , A rk (s) Area of the transversal section of the rod that a given point in the rod string is located on (m 2 ) Axial force at point i of the rod string, at time j (N) F s (s, t) Axial force at a given point in the rod string, at time t (N) f v (s, t) Force of viscous friction per unit of length at a given point in the rod string, at time t (N/m) h b Vertical depth of installation of the downhole pump (m) i anc Integer flag indicating whether well tubing is anchored -if it is anchored, the value of this flag is 0, otherwise, its value is 1 K 1 (s) , K 2 (s) , K 3 (s) , K 4 (s) Geometric factors, derived from the diameters of the well tubing and rods, for a given point in the rod string k L Average compressibility of the liquid phase of the fluid (Pa −1 ) Distance between the standing valve and traveling valve in the downhole pump at time t (m) L k Length of the rod string section that a given point in the rod string is located on (m) n Total number of adjacent points used to divide the rod string p b (t) Absolute pressure within the downhole pump, at time t (Pa) p d (t) Discharge pressure of the downhole pump, at time t (Pa) p f Dynamic pressure exerted by the fluid (Pa) p g Pressure gradient of water (9.794 kPa/m) p s Intake pressure of the downhole pump (Pa) p wh Well tubing pressure, measured at the wellhead (Pa) r c (s) Radius of curvature of the well at a given point in the rod string (m) S Stroke length of the pumping unit (m) s Measured length, starting from the downhole pump, until a given point in the rod string along the well tubing, at t = 0 (m) t Elapsed time since motion start (s)

TF(t)
Value of truncated Fourier series, at time t (rad) u (s, t) Displacement of a given point in the rod string, starting from its initial position s, at time t (m) U rk (s) Perimeter of the circular section of the rod that a given point in the rod string is located on (m) v Speed of sound within the rod string (m/s) v j ri Longitudinal speed of the rod that a point i of the rod string is located on, at time j (m/s) v p (t) Plunger speed, at time t (m/s) v r (s, t) Longitudinal speed at a given point in the rod string, at time t (m/s)

Introduction
Sucker-rod pumping is the oldest and most common artificial lift system used in the oil industry worldwide (Costa 2008). The production strings attached to sucker-rod wells can be either vertical or directional. There are several mathematical models that accurately describe the behavior of vertical sucker-rod wells, and well-established design techniques are readily available (Gibbs (1963); Miska et al. (1997); API (2008)). Directional sucker-rod wells, however, do not behave the same as vertical wells and require design techniques of their own. Evchenko and Zakharchenko (1984) reported that Peslyak created one of the first known models aimed at directional wells in Russia in 1964. They proposed a quasi-static model, equivalent to Peslyak's that computes axial forces along deviated wells whose trajectory is contained in a vertical plane, i.e., a 2D problem. Lukaziewicz (1991) presented a dynamic model that takes the motion of the rod string into account while allowing for a deviated 2D trajectory, using boundary conditions described by Doty and Schmidt (1983) and Everitt and Jennings (1992). Gibbs (1992) formulated a 3D dynamic model based on boundary conditions he had previously developed for vertical wells (Gibbs 1963(Gibbs , 1977. Xu (1994) presented a diagnostic method for vertical and straight inclined wells; this was shortly followed by a 3D dynamic model (Xu 1994) that used boundary conditions developed by Lukaziewicz (1991) and Xu and Hu (1993). Costa (1995), on whose research we base this study, devised a new dynamic model that couples directional trajectories in 3D to fluid flow in the rod-tubing annulus. This model differs from prior works in essentially three aspects. Firstly, in its proposed motion equation, it considers more information on the 3D trajectory of the well (for instance, the unit binormal vector along the rod string in the Coulomb friction term). Secondly, it allows for the use of two viscous friction models, based on the work of Gibbs (1963) and Lea (1991). Finally, it states its own downhole boundary conditions to describe the behavior of the downhole pump. Costa's model remains unpublished in English, which made it harder for the wider scientific community to find his work and build upon it, and unfortunately we are not aware of any derived studies. Since then, no clear consensus has been reached on the problem of accurately modeling the behavior directional sucker-rod wells, and the topic remains active. An example of later research is a comprehensive model presented by (Xu et al. 1999). Some more recent works are a simulation model proposed for oil-producing wells that does not cover directional wells (Dong et al. 2013); a patent (Pons 2014) and an algorithm (Araujo et al. 2015) to compute the downhole dynamometer card of a directional sucker-rod well based on its surface dynamometer card; and a novel mathematical model to numerically compute sucker-rod string dynamics in deviated wells in predictive or diagnostic mode (Moreno and Garriz 2020).
Turning our attention to Costa's work, the mathematical formulation of the model was accompanied by a reference implementation in software. Figure 1 describes how this implementation worked. Whereas the results of Costa's simulations had favorable matches to actual field data, the computational complexity of his model was much higher than earlier approaches. The core of the model is a second-order partial differential equation (PDE), solved numerically through an explicit finite-difference method (FDM). With the computing resources available to Costa in 1995, it could take nearly 109 seconds to simulate a single well. The activity of well design routinely demands experimenting with many different configurations of equipment and operating variables, and the long processing times required by Costa's reference implementation hampered its adoption for routine usage.
In this study, we examine Costa's model and make contributions to it with the aim of improving its performance and making it a viable option for frequent use. To achieve this goal, we use contemporary programming tools to implement serial and parallel versions of this model, and we examine their performance and scalability. The results obtained are encouraging, especially when using multiple threads. They show that the model can be implemented more efficiently, enabling further extensions and refinements. Additionally, the proposed implementation is capable of processing many different well configurations in several levels of accuracy in a time-effective manner, making it a suitable choice for tasks like well parameter optimization.
This work is structured as follows. First, the background is explained. Then, we present Costa's model and its numerical solution. We proceed to describe our proposed implementation and contributions. Next, we discuss the results of our experiments. We finish by presenting some conclusions and suggesting directions for future work. This paper is a revised and extended version of (Araújo and Xavier-de-Souza 2016).

Background
According to Takács (2015), the individual components of a sucker-rod well can be divided into two major groups: surface and downhole equipment. The main elements found in a common installation are shown in Fig. 2.
The set of components that make up the surface equipment, starting at the prime mover and proceeding toward the polished rod, is commonly referred to as a pumping unit or pumpjack. The pumping unit converts the rotary motion of the prime mover into a periodic up-and-down movement of the polished rod and rod string, thus lifting the fluids contained in the reservoir toward the wellhead.
Based on a specific configuration of equipment and operating variables, Costa's model attempts to estimate a number of operating parameters for the well, as follows: -PPRL (peak polished rod load); -MPRL (minimum polished rod load); -PT (peak torque on the gear reducer); -PRHP (polished rod horsepower); -PD (daily fluid displacement performed by the downhole pump).
The computation of those parameters requires an intermediate step. The primary values computed by the model as it simulates the pumping cycles of the well are surface dynamometer cards and downhole dynamometer cards. At any point of the simulation, the operating parameters can be extracted from the cards. Surface cards correlate traction forces acting on the polished rod to the corresponding positions along the stroke length of the pumping unit. Downhole cards are analogous, but the traction forces they consider act on the downhole plunger. A pumping cycle has two stages, upstroke and downstroke. They correspond to when the polished rod is moving up or down, respectively. The polished rod traverses the entire stroke length twice in each pumping cycle, first in upstroke and then in downstroke. Consequently, dynamometer cards associate two force values to each position along stroke length, one for each cycle stage, as shown in Fig. 3.
Costa points out a number of factors we need to consider for an accurate simulation of directional sucker-rod wells. They are: -Directional rod string dynamics; -Directional friction between rods and tubing; -Viscous friction between rods and fluids; -Whether gas is present at the level of the downhole pump, and how much of it there is; -Whether well tubing is anchored.  Costa's work considers all of the above factors simultaneously. It also shows that given appropriate constraints, we can characterize the previously mentioned models developed by Lukaziewicz (1991) and Gibbs (1992) as particular cases of Costa's integrated model.

The simulation model
Costa's work proposes a mathematical model for the motion of the rod string in directional wells and also a solution for it, which is based on a numerical approach using an explicit FDM (Costa 1995). This paper shows the key points of the model on account of it being presented in English for the first time. We describe the model in the next subsection, and its solution in Subsect. 3.2.

The motion equation
Costa proposed to describe the motion of the rod string in directional wells through a damped wave equation, shown in the second-order PDE below. The full list of symbols is available at the beginning of this paper.
Term u(s, t) is the displacement of a given point in the rod string (m), starting from its initial position s (m), at time t (s). Term , a Coulomb friction coefficient (dimensionless), multiplies a nonlinear expression. The last term in the (1) , is meant to represent viscous friction ("VF") between rods and fluids. Instead of a direct expression to be applied as written, this term is more of an adaptable modeling block that the engineer may use to emphasize certain aspects of the simulation while downplaying the influence of others. We can replace VF by the expression below, proposed by Gibbs (1963): where c is a damping coefficient given by: where v is the speed of sound within the rod string (m/s); c D is a damping factor (dimensionless); and D b is the measured depth of installation of the downhole pump (m). Note that Eq.
(3) encompasses both viscous and directional friction, which requires choosing proper values for the damping factor c D . Lea (1991) presents another formulation for where is the viscosity of the fluid (Pa ⋅ s). For the expressions involved in computing geometric factors K 1 (s) and K 2 (s) , we refer the interested reader to Lea (1991) or Costa (1995). It is the engineer's decision whether to use Gibbs's or Lea's VF model. It is necessary to keep in mind, however, that this choice affects the computation of discharge pressure of the downhole pump when calculating the relevant boundary conditions. This has important consequences to the computational complexity of the simulation, and we analyze them in Sects. 4.1, 4.3 and 5 . As for the boundary conditions of the motion equation, for the sake of brevity, we do not discuss them here. We examine them, along with initial conditions, in Appendix A.

Numerical solution
Costa describes an explicit FDM for the numerical solution of Eq. (1). In sucker-rod wells, the rod string is comprised of a number of sections, and each section is made up of one or more rods connected to each other. The FDM in question divides each section of the rod string into a number of points, in such a way that the segments between each point have equal length. The model presents the following equations: ( (1). As for Eq. (6), when we apply Hooke's law to the displacement of a point located on the rod string in relation to its starting position and proceed by taking the derivative of the resulting expression with respect to time, we obtain: We can restate Eq. (1), a second-order PDE, as an equivalent system of first-order PDEs defined by Eqs. (5) and (6). We solve this system numerically in two steps. First, we compute the speed at each point that the rod string was divided into. Following that, we compute the corresponding axial forces. Let i be a point within the rod string and j a moment in time. In that context, let v j ri be the longitudinal speed of the rod that the point is located on (m/s). Consider also the corresponding axial force (N) and force of Coulomb friction per unit of length (N/m), written as F j Si and f j ci , respectively. We proceed by defining the length of the segment chosen to divide the rod string into adjacent points as s (m), and the time interval between steps of the simulation as t (s). Under Gibbs's VF model, we can use the below stencils to compute the rod speeds and axial forces for each point i at moment j: In case we are using Lea's VF model, the stencil to use for rod speed is: Upon using Eq. (11), It is also necessary to take the boundary conditions of the motion equation into account when computing the numerical solution. For reasons of conciseness, we leave the discussion of the necessary procedures to Appendix B.

Tapered rod strings
Until now, it was assumed that the diameters of all sections of the rod string were equal. When working with tapered rod strings, however, the diameter is different between rod sections, and the simulation model has to be adjusted accordingly.
We analyze the speeds and forces in the point of transition between two sections. Whereas the speed of the last rod in the first section and that of the first rod in the second section are the same, axial forces in each of these rods may be different on account of a small effect of pressure. Let k be a section of the rod string, and k + 1 the section following it. Each section is divided into n k points. At moment j in time, let p j k be the pressure at the lower extremity of rod section k (Pa), considering the dynamic behavior of the fluid. The below equations describe the relation between rod speeds and axial forces in the point of transition between these sections: If we discretize Eq. (6) and then apply a backward difference stencil for section k + 1 , the below equation computes the forces in the point of transition: We repeat this procedure for section k: , Eq. (16) is approximately equal to: These equations allow to properly take account of tapered rod strings.

Numerical stability and stopping criteria
The condition of numerical stability for the simulation model is as suggested by Laine et al. (1989). According to Schafer and Jennings (1987), setting s to about 200 m is sufficient to obtain an adequate approximation. That said, Costa's model sets s to 20 m or less as to avoid the possibility of overlooking any points of depth measurement along the directional trajectory of the well. If we assume v to be about 4968.24 m/s (16300 ft/s), as proposed by Takács (2015), Eq. (18) requires t to be 0.004 s or lower.
The maximum pumping speed of the wells tested by Costa was 14 cycles per minute, thus leading a pumping cycle to last around 4.28 seconds. Dividing that by t = 0.0004 s splits each pumping cycle into about 1065 time intervals. A higher number of time intervals improves accuracy. Schafer and Jennings (1987) also affirm that for wells with a pumping speed up to 15 cycles per minute starting from rest, after three pumping cycles the computed dynamometer cards show no meaningful change. That notwithstanding, Costa suggests running the simulation for five pumping cycles out of caution to ensure a steady state is achieved.

Performance optimizations
We implemented the simulation model using the C++ language in serial and parallel versions. The parallel version employs multithreading capabilities provided by OpenMP technology (OpenMP 2020) in addition to the standard C++ features used in the serial version. The structure of the program is as follows: 1. Setup: configuration parsing, variable assignment, and initial computations; 2. Well configuration traversal, with each iteration performing the following steps: Algorithm 1 details The Main Simulation Loop, without a doubt the most CPU-intensive section of the program. It must be executed for each well configuration we want to simulate. The well configuration data, in their turn, are slightly different according to the VF model we choose to use. As Gibbs's model requires a Coulomb friction coefficient and a damping factor, its traversal of well configurations is arranged as shown in Algorithm 2. Whereas Lea's model also uses a Coulomb coefficient, its direct use of fluid viscosity makes the damping factor unnecessary, and the corresponding traversal of well configurations is laid out as listed in Algorithm 3.
The best way to check the quality of the simulation results is to use a well configuration based on an actual functioning oil well, and then to compare the operating parameters derived from the computed dynamometer cards to the corresponding values acquired in the field. Such values are typically obtained using automation sensors or specialized tools. Unfortunately, no values for either the Coulomb friction coefficient or the damping factor are known to work reliably for every oil field, and therefore it is necessary to test ranges of values for both variables. In order to find adequate values for each scenario of interest, it is suggested to compare simulation results to actual operating data of functioning oil wells. For the description of a mathematical method to identify the damping factors of directional sucker-rod pumping wells, we refer the interested reader to Liu (2007).
Costa verified the accuracy of his model by finding good agreement between estimated and actual values for a number of wells. Consequently, we assumed the model to be sound and focused our research on the efficient implementation of the model rather than a reassessment of its quality. That said, we made some contributions to the accuracy and performance of the model, as described in Subsect. 4.3. In addition, Appendix C shows results obtained through the model by presenting a comparison between real and simulated data of three functioning oil wells not included in those originally used by Costa.

Optimization and parallelization considerations
There are several immediate opportunities for optimization in the Main Simulation Loop shown in Algorithm 1. For instance, as many terms of the equations used in Parts 1 through 7 do not change between iterations, we can compute them in advance to use repeatedly later. The savings in total computing time are worthwhile, greatly outweighing the small increases in memory consumption needed to store the precomputed values.
Costa's original implementation in Fortran used twodimensional matrices extensively. Upon reimplementing the model in C++ language, we chose to use equivalent one-dimensional arrays instead. The main advantage gained with this change of representation is to be able to easily redirect large arrays through a few memory pointer swap operations. When processing a time interval of a pumping cycle, we must access two-dimensional speed and force matrices for both the current and previous time intervals, the first dimension being the rod section and the second dimension a point within the rod section. Upon completing the computations of a time interval, we need to copy the speed and force matrices of the current time interval over the corresponding matrices of the previous time interval, so their data can be used in the time interval that comes next. A naive approach to this task would be to use raw memory copy operations. If we were to do so, however, the copy would take increasingly longer times whenever the simulation would require larger matrices. A better method is to use memory pointers to simply swap the addresses pointed to by the speed and force matrices of the current and previous time intervals. The result is the same as a copy operation: the most recent data computed in the current time interval will be available to the next time interval in the matrices corresponding to the previous time interval, at the fixed and modest cost of a simple swap in memory pointers.
In addition to such optimizations as described above, parallel processing is essential to further improve processing speeds. For a long time, CPU manufacturers succeeded to obtain better performance by simply boosting clock speeds. However, an upper limit was reached in operating frequency, beyond which processors would either overheat or consume inordinate amounts of power. This problem was addressed by the development of multiple processor systems, which require programmers to subdivide problems into simpler chunks to be computed simultaneously by various threads running in parallel. Multiple processor systems have become ubiquitous, and compute-capable GPUs are increasingly common (Sutter 2004(Sutter , 2012. Consequently, to fully exploit modern computer hardware, we must keep parallel processing in mind when analyzing performance-sensitive problems.
We proceed to parallelize the Main Simulation Loop by analyzing which parts of it can be subdivided into simpler pieces to be computed simultaneously. By analyzing the model and inspecting the equations used at each part, we conclude that Parts 1 and 6 have a very modest computational cost. Consequently, there is no incentive to alter them, and we leave them in serial form.
Parts 2 and 5, however, are well-suited to run in parallel. The former uses either Eq. (8) or (11), and the latter uses Eq. (10). They compute speed and force at each point by traversing nearly all points that the rod string was divided into, excluding only those points located at transitions between sections. The processing of a point does not interfere with other points, which is a helpful feature when it comes to parallel processing.
Part 3, which uses Eq. (17), and Part 7, which uses Eqs. (13) and (14), are not good targets for parallelization. They require traversing all sections of the rod string, with the exception of the last one, performing some simple computations for each section. According to Costa, the number of rod sections in a sucker-rod well usually does not exceed four. As the number of rod sections to process is likely small, and the computing requirements at each section are low, there is no point in processing them in multiple threads. Therefore, they remain in serial form.
Part 4, which involves an iterative process of convergence (see Appendix B), needs to be considered under two different contexts. When Gibbs's VF model is used, the discharge pressure of the downhole pump will remain fixed during the entire simulation, as described in Eq. (31) in Appendix A. As the computations required by Part 4 aside that of discharge pressure have low processing costs, under these circumstances Part 4 has no work suitable to parallelization. Under Lea's VF model, however, the computation of discharge pressure becomes much more expensive, as Eq. (32) in Appendix A requires traversing all points that the rod string was divided into. The processing of a point does not interfere with other points; whereas the computing effort at each point is relatively low, parallelizing it may still be helpful.

Complexity analysis
Let c be the number of pumping cycles to simulate, and i the number of time intervals per cycle. Parts 1 through 7 in Algorithm 1 are executed c ⋅ i times.
The processing costs of Parts 1 and 6, as previously mentioned, are modest. The computational complexity of the equations used by these parts is not affected by the scale of the simulation, and our experiments show their execution time to be very low and essentially constant. Therefore, we regard their combined complexity as O(1).
Parts 2 and 3, combined, traverse all points that the rod string was divided into. Parts 5 and 7, also combined, exhibit that same behavior. It also occurs in Part 4, if the engineer chooses to work with Lea's VF model. Part 4 involves an iterative process of convergence (see Appendix B); let j be the average number of iterations needed for convergence to happen. All points that the rod string was divided into are, therefore, traversed (2 + j) times at each time interval of the simulation. Let e be the total number of points; the cumulative complexity of these parts is then O(2 + j) ⋅ O(e) . In our experiments, Part 4 converged in four iterations most of the time. Assuming j = 4 to be a typical value, in the average case all points that the rod string was divided into are traversed six times at each time interval.
Combining the previous results, the serial complexity of the Main Simulation Loop is: If we assume the cost of Parts 1 and 6, represented by O(1), to be negligible, and apply j = 4 as a typical value, the average serial complexity is: We gain further insight about the serial complexity by examining e. The value of e is inversely proportional to the length of the segment chosen to divide the rod string, i.e., s in Eq. (18). s , in turn, is directly proportional to t by that same equation. As t is inversely proportional to the number of time intervals per cycle, i, we finally conclude that e is directly proportional to i. Therefore, as Eq. (20) multiplies e by i, Eq. (20) should exhibit quadratic behavior for fixed values of c.

Parallel complexity
As the results of each time interval depend on those of the previous interval, the iterations of pumping cycles and time intervals per cycle are strictly serial. Therefore, the term O(c ⋅ i) cannot be decreased by parallel execution.
We make no attempt to rewrite Parts 1 and 6 in parallel, and we regard their combined parallel complexity the same as their serial complexity, i.e., O(1).
The traversal of all points that the rod string was divided into occurs at Parts 2 and 3, combined; at Part 4; and at Parts 5 and 7, also combined. Unfortunately, we cannot make these traversals parallel to each other, as Part 4 depends on the results of Parts 2 and 3, and Parts 5 and 7 depend on the results of Part 4. Consequently, their parallel complexity is equal to their serial complexity, O(2 + j).
Whereas the traversals mentioned in the previous paragraph cannot be made in parallel to each other, it is possible to parallelize the processing performed within each of them. Let p be the number of processors available. The parallel complexity within each traversal is 1 : We multiply this expression by the previous results, and find the parallel complexity of the Main Simulation Loop to be: Assuming again the cost of Parts 1 and 6, represented by O(1), to be negligible, and applying j = 4 as a typical value, the average parallel complexity is: Equation (23) retains the quadratic behavior of Eq. (20) for fixed values of c, but in a smaller scale. Additionally, the parallelization should become more efficient as e increases.

Contributions to the original work
In reviewing Costa's model, we identified some opportunities for improvement. Firstly, in the original work, in a discussion about average fluid speed at a given section of the rod string, we find the equation below: where v p (t) is the plunger speed at moment t (m/s). This equation assumes two simplifying hypotheses. The first is that the fluid is incompressible, and the second is that the speed of the rod string and that of the plunger are approximately equal. We can improve the accuracy of this equation by considering the pressure within the downhole pump. Equation (24) assumes that when v p ≥ 0 , i.e., the plunger is moving upward, the traveling valve is closed and fluid is moving toward the wellhead; and that when v p < 0 , i.e., the plunger is moving downward, the traveling valve is open and no fluid is moving toward the wellhead. However, the traveling valve is closed only when the pressure within the downhole pump is lower than the discharge pressure, regardless of whether the plunger is moving upward or downward. Consequently, rather than using the speed of the plunger to infer the state of the traveling valve, it is more accurate to do so by checking the previously mentioned pressures. When the pressure within the downhole pump, p b (t) , is lower than the discharge pressure-i.e., the pressure at the lower end otherwise.
, of the rod section that the plunger is located on, p 1 (t) -the traveling valve is closed; otherwise, it is open. Taking into account the considerations about the pressure within the downhole pump, we propose to rewrite Eq. (24) as: A second point of improvement is in the calculation of the speed of the polished rod when evaluating the boundary conditions of the surface equipment. Costa's original work described only the use of Eqs. (26) and (27) in Appendix A to this end. In this paper, as mentioned in Appendix A, we recommend to use position/speed profiles measured in the field when they are available. In the absence of such profiles, we should compute the speed of the polished rod using the detailed equations available in the API Specification for Pumping Units (API 2013) in case the geometrical specifications of the pumping unit are at hand. Equations (26) and (27) then become a solution to be employed only as a last resort, since they are less accurate than the other two methods.
Moving on to the software implementation, we point out two differences in the evaluation of downhole boundary conditions, i.e., Part 4 of the Main Simulation Loop. The first difference stems from the fact that in the iterative process to compute plunger speed and the axial force acting on it at each step of the simulation (see Appendix B), we need to make an initial guess for the value of plunger speed. Costa's implementation simply used an initial value of 0 at each step. In the proposed implementation, we use an initial value of 0 in the first step only, and, in the later steps, the initial value is the plunger speed found in the previous step. With this change, convergence is slightly faster than in the original implementation. Since the intermediate loops of the iterative process are strictly sequential, by speeding up this otherwise. process, we lower the time required by the serial portion of the program. Consequently, the parallelizable portion of the program grows in size, which may increase parallel performance as more parallel resources are added to the system. The second difference in Part 4 has to do with the evaluation of Eq. (32) (see Appendix A), which is used to calculate discharge pressure with Lea's VF model. One of the terms of Eq. (32) is p f s k , defined in Eq. (33). This latter equation requires reading the value of longitudinal speed v r (s, t) at all points that the rod string was divided into, grouped by rod section. Costa's implementation performed the traversal of all such points in order to read v r (s, t) at each intermediate loop of the iterative process of Part 4, which was very time-consuming. We mitigate this cost in our implementation by taking advantage of the fact that since v r (s, t) remains constant for the whole duration of the iterative process, we can precompute the sum of v r (s, t) for each   section of the rod string only once before the iterative process and reuse it as many times as needed. Strictly speaking, the need to traverse all points that the rod string was divided into, even if only a single time before the iterative process, could be an opportunity for parallel processing. That said, as such a parallelization would require opening and closing a multi-threaded execution region for each section of the rod string to perform only a simple operation (computing a sum via parallel reduction), we chose not to do this. The rationale for this decision was that the overhead of synchronizing all threads at the end of the reduction operation outweighed any performance savings yielded by parallelization.  Still on the topic of Part 4, when using Gibbs's VF model, discharge pressure is given by Eq. (31) in Appendix A. As the result of this last equation remains fixed during the entire simulation, we compute it only once before the beginning of the simulation and reuse it repeatedly. Apart from the computation of discharge pressure using either Gibbs's or Lea's VF model, the remaining operations of Part 4 have very modest processing requirements, and therefore we also left them in serial form.

Results and discussion
We executed simulations in a GNU/Linux system running CentOS 6.5 64-bit equipped with an Intel Xeon E5-2698 v3 CPU, with sixteen physical cores and two hardware threads per core. The clock speed of all cores was fixed at 2.3 GHz. We used G++ version 8.3.0 to compile both serial and parallel versions of the program.
We simulated a single well configuration under both Gibbs's and Lea's VF models for a workload of five pumping cycles with an increasing number of time intervals per cycle, using both serial and parallel versions. As the number of time intervals per cycle is inversely proportional to the length of the segment used to divide the rod string, i.e., s , an increase in the number of time intervals also increases the number of points along the rod string, as the value of s becomes lower. The engineer may gain insight about the operating conditions of the rod string by analyzing the axial forces, longitudinal speeds and forces of Coulomb friction that the simulator computes for all of these points at each time interval. However, using lower values for s may not be helpful in all circumstances, as our experiments show that a greater number of points only benefits the computed dynamometer cards and the operating parameters derived from them to a certain extent. For Gibbs's VF model, we did not identify any meaningful changes in these results for values of s lower than 0.75 meters. In the case of Lea's model, the lower bound for s under which we observed no significant changes was 0.30 meters. That notwithstanding, if the engineer is interested in analyzing the operating conditions of particular segments of the rod string during the simulation, it can be helpful to use lower values of s . This should be specially desirable regarding the upper and lower extremities of the rod string, which, because of their specialized functions and characteristics, have a strong influence on the global behavior of the rod string, despite being usually much shorter than the remaining rod sections. However, higher values of s should suffice in case the engineer is only interested in deriving operating parameters from the dynamometer cards computed in the simulation. We used a directional well located at the on-shore portion of the Potiguar basin in northeastern Brazil as a model for the well configuration data. The true vertical depth of this well was 696.5 meters, and its measured depth was 876.3 meters. Its rod string was comprised of two sections, with the first section being 411.5 meters long and the second 464.8 meters. Table 1 shows the full list of configuration properties for the well. The well profile data-i.e., the segments of the directional trajectory of the production stringare listed in Table 2.
We executed five runs of the simulation for each number of time intervals, and averaged the resulting execution times. We included the segment length of 14.67 meters in the simulations in consideration of that being the standard value used in Costa's original implementation. This segment length corresponds to 2000 time intervals per pumping cycle. The execution results are listed in Tables 3 through 6, and plotted in Figs. 4 through 9.
We achieved considerable improvements in comparison to the initial results reported in 1995. The execution times for Gibbs's and Lea's VF models in Costa's original implementation were about 104.3 and 108.9 seconds, respectively. Those results were obtained using the standard value of 14.67 meters for s . Upon looking for that same value of s in Tables 3 and 4, the execution times of the serial version of our implementation are, respectively, 0.011 and 0.015 seconds for each VF model, i.e., 9481 and 7260 times faster than the original implementation. If we assume that the frequency of operation was the main and dominant factor of performance improvements from one generation of processors to another (and also abstracting away advances such as more efficient mathematical coprocessors, more specialized instruction sets, instruction-level parallelism and smarter compilers), in order to Costa's implementation to have equivalent performance, it would have to be run in a processor with a speed of 2.3GHz ÷ 9481 = 242.5KHz and 2.3GHz ÷ 7260 = 316.8KHz. Considering that the frequency of the processor used by Costa in 1995 was 66MHz, this results in a rough estimate of improvement for our proposed implementation of about a factor of 208 (66MHz ÷ 316.8KHz) over the original program. That said, we now proceed to analyze the parallel implementation. Starting with Gibbs's VF model, Table 3 and Fig. 4 show improvements in parallel execution time over serial time for s equal to 0.05 meters with two threads, and then for s less than or equal to 0.5 meters between four and sixteen threads. With thirty-two threads, we observe such improvements only for s less than or equal to 0.3 meters. Additionally, using thirty-two threads is always slower than sixteen threads. Moving on to Lea's VF model, Table 4 and Fig. 5 show improvements in parallel execution time over serial time for s less than or equal to 0.5 meters under all multi-thread counts. Using thirty-two threads is only faster than sixteen threads when s is less than or equal to 0.1 meters. For both models, the overall pattern is that parallel performance gains are more pronounced as the value of s gets lower and therefore there is more work to execute. Under Gibbs's VF model, using multiple threads with s lower than or equal to 0.5 meters is advantageous under most cases, except for five scenarios under two threads and two scenarios under thirty-two threads. Lea's VF model, on the other hand, yields favorable speedups with s lower than or equal to 0.5 meters in all scenarios.
We also observe that the processing times for Lea's VF model are considerably higher than those for Gibbs's model, and that the performance of Lea's model scales better in the parallel case. The parallel speedups shown in Tables 5 and  6 and Figs. 6 and 7 provide these same insights in a different format. As discussed previously, Lea's model has higher processing requirements than Gibbs's model because of more complex calculations for the discharge pressure of the downhole pump. Whereas Lea's model uses the viscosity of the fluid and needs to traverse all points that the rod string was divided into, Gibbs's model simplifies these details through the use of the damping coefficient.
Tables 5 and 6 and Figs. 8 and 9 provide information on parallel efficiency. Under both Gibbs's and Lea's models, whereas parallel efficiency under two threads shows very narrow fluctuations, it scales along with problem size at nearly all remaining thread counts. With Gibbs's model, efficiency values for s less than or equal to 0.1 meters under four and eight threads are notable outliers. Finally, the efficiency results of Lea's model are in a clearly lower range than those of Gibbs's.
Performance profiling is of fundamental importance for software optimization, as it enables to diagnose execution bottlenecks and to assess the results of improvement efforts. The absolute execution times, and their percentage counterparts, for a single run of Gibbs's VF model when s is equal to 0.05 meters are shown in Tables 7 and 8, illustrating how much time each part of the simulation requires. Our previous analysis in Section 4.1 suggested to direct parallelization efforts toward Parts 2 and 5, which took 99% of total execution time in the serial case according to Table 8. We regard the 1% of total execution time required by the remaining parts as the strictly serial fraction of the problem under Gibbs's VF model.
The inspection of parallel times at Tables 7 and 8 shows improvements in total execution time; however, the percentage of time required by the strictly serial parts becomes increasingly higher along with the number of threads, the only exception being one case under thirty-two threads. The absolute variations in the time required by the strictly serial parts were moderate, except for a slightly higher uptick in Part 4. As for the parts that underwent parallelization, the percentage of execution time in Part 2 declined, while that of Part 5 had moderate fluctuations between two and sixteen threads, followed by a steeper increase under thirtytwo threads. We conclude that on the whole, the simulation under Gibbs's VF model attained good scalability.
The absolute execution times, and their percentage counterparts, for a single run of Lea's VF model when s is equal to 0.05 meters are shown in Tables 9 and 10 , illustrating how much time each part of the simulation requires. Our previous analysis in Section 4.1 suggested to direct parallelization efforts toward Parts 2, 4 and 5, which took 99.66% of total execution time in the serial case according to Table 9. We regard the 0.34% of total execution time required by the remaining parts as the strictly serial fraction of the problem under Lea's VF model.
The behavior of Part 4 demands special attention when using Lea's VF model. As discussed previously, the computation of discharge pressure under Lea's model is much more demanding than under Gibbs's model. That said, we chose not to parallelize this calculation, and therefore the execution performance of Part 4 in the serial and parallel cases should exhibit very similar results. Regarding the other parallel parts, the implementation of Part 2 under Lea's VF model closely resembles that found under Gibbs's model, but there is an extra term to compute in the numerator of Eq.
(11) when compared to Eq. (8). Finally, Part 5 is completely identical under both models. The inspection of parallel times at Tables 9 and 10 shows that whereas total execution time improved, the percentage of time required by the strictly serial parts grew modestly for all thread counts. The absolute variations observed in the time required by the strictly serial parts were moderate as well. As for the parts identified as parallelization candidates, the overall observation is that whereas the percentage of execution time of Part 2 had noticeable decreases under all multi-thread counts, that of Part 5 had lower decreases between two and sixteen threads. Part 4, which could have been parallelized but was ultimately left in sequential form, had noticeable increases in percentage of execution time under all multi-thread counts. Regarding absolute execution times, whereas they decline under all multi-thread counts for Part 2 and nearly all counts for Part 5, at Part 4 they show only limited change. Part 2 is considerably slower under Lea's VF model than under Gibbs's model, which we attribute to two reasons. The first is that there is an extra term to compute; as for the second reason, although the compiler succeeds in generating fast, vectorized binary instructions for Part 2 under Gibbs's model, unfortunately this does not happen under Lea's model. Regarding Part 4, the mostly uniform behavior observed is in accordance with the fact that as mentioned previously, its implementation remained the same in serial and parallel versions of the simulation program. We conclude, based on the previous numbers, that the simulation under Lea's VF model attained fair scalability. However, Part 4 stands out as a point of attention to be further investigated in future optimization work.

Summary and Conclusions
In this work, we presented Costa's model for the motion of directional sucker-rod pumping wells (Costa 1995). We analyzed the computational behavior of this CPU-intensive model, proposed contributions to its accuracy and reimplemented it with a focus on improved execution speed. The main conclusions are as follows: 1. Whereas the model had long processing times in its original implementation, we have shown that it can be efficiently implemented in contemporary hardware. We can use it to quickly process many different well configurations in varied levels of accuracy. This makes it a suitable tool for well design and optimization, as we can adapt it to seek goals such as maximizing oil production, minimizing energy consumption or extending the useful life of production equipment; 2. The use of Gibbs's VF model results in faster execution times and better parallel efficiency than Lea's VF model, given that the calculation of downhole boundary conditions under Lea's model is much more demanding. The engineer must keep this in mind in case execution speed is a concern.
We point out the following suggestions for future work: 1. The spare computing power made available through the proposed software optimizations facilitates further expansions to the model to include other realistic behavior, such as motor slip and reservoir coupling. It should also be straightforward to incorporate other VF models by adapting the motion equation and the downhole boundary conditions accordingly; 2. The downhole boundary conditions formulated by Costa assume that the ensemble of subsurface equipment is working normally. Those conditions can be expanded to simulate anomalous scenarios, such as fluid pound and leaking valves (Doty and Schmidt 1983). That way, in addition to well design and optimization, the model can be used as a diagnostic tool; 3. The simulator has only been used with oil wells located in the Potiguar basin in northeastern Brazil. It can be further developed by testing it with wells of other geographic areas under different operating conditions.

A boundary and initial conditions of the motion equation
As is typical of PDEs, the motion equation presented in Costa's model for directional wells (Costa 1995) has boundary and initial conditions to consider, which we discuss in this appendix. We start by the boundary conditions for the surface equipment, where it is necessary to be able to determine the speed of the polished rod for any moment in time. There are several alternatives to do this. If the engineer would like to run simulations based on a functioning oil well and field measurements can be made to obtain a profile that correlates positions along the stroke length of the pumping unit with the speed of the polished road, this should be the preferred approach. If a position/speed profile measured from an actual working well is not available but the geometrical specifications of the pumping unit are at hand, the recommended procedure is to use the detailed equations described in the API Specification for Pumping Units (API 2013). A third choice, less accurate than the previous ones, is to approximate the motion of the polished rod by a Fourier series truncated to six terms, as proposed by Laine et al. (1989). The below equation computes the value of the truncated Fourier series at a given moment in time: where A 1 to A 6 , B 1 to B 6 are Fourier coefficients, and is the angular speed of the polished rod (rad/s). The speed of the polished rod (m/s) at a given moment in time is: (26) TF(t) = A 1 cos( t) + ⋯ + A 6 cos(6 t) where S is the stroke length of the pumping unit (m). The geometry of the pumping unit dictates the value of the Fourier coefficients. Table 11 shows average values recommended by Laine et al. (1989) for conventional pumping units. We proceed to discuss the boundary conditions for the downhole equipment. To compute the distance between the standing valve and traveling valve in the downhole pump at a given moment in time, we use: where e m is the "dead space" in the downhole pump-i.e., the distance between the standing valve and the traveling valve when the system is idle and the polished rod is in its bottom-most position (m); and e t (t) is the well tubing elongation, at moment t (m).
The equation for the well tubing elongation is: The below equation gives the change in absolute pressure within the downhole pump: where p s is the intake pressure of the downhole pump (Pa); p b (t) is the absolute pressure within the downhole pump from which change will be computed, at moment t (Pa);  where p wh is the well tubing pressure, measured at the wellhead (Pa); d f is the relative density of the fluid (dimensionless); p g is the pressure gradient of water (9.794 kPa/m); and h b is the vertical depth of installation of the downhole pump (m).
The expression to use if the engineer prefers to employ Lea's model is: where L k is the length of the rod string section that a given point in the rod string is located on (m), and the loss of pressure caused by friction at a given point in the rod string (Pa/m) is given by:  where K 3 (s) and K 4 (s) are geometric factors, derived from the diameters of the well tubing and rods, for a given point in the rod string. We point the interested reader to Lea (1991) or Costa (1995) for the expressions involved in their computation.
In addition to the boundary conditions, there are some initial conditions to consider when t = 0 . They are: where F s (s, t) is the axial force at a given point in the rod string, at moment t (N).

B numerical solution of the boundary conditions
In this appendix, we describe how to numerically compute the surface and downhole boundary conditions of the motion equation in Costa's model for directional wells (Costa 1995). Starting with the surface boundary conditions, when they are applied to the numerical solution, the speed of the polished rod should ideally be retrieved from a position/speed profile obtained from field measurements. If such a profile is not available, the engineer may use the detailed equations described in the API Specification for Pumping Units (API 2013), provided that the geometrical specifications of the pumping unit are at hand. In the absence of said geometrical data, the engineer may obtain approximate (albeit less accurate) values by resorting to Eqs. (26) and (27): To find the force on the polished rod, we use: (39) TF j n+1 = A 1 cos( j t) + ⋯ + A 6 cos(6 j t) + B 1 sin( j t) + ⋯ + B 6 sin(6 j t) ,   Proceeding to the evaluation of downhole boundary conditions, we employ an iterative process to compute both the speed of the plunger and the axial force acting on it, given an initial guess for the value of the plunger speed. This is caused by a circular dependency when analyzing downhole conditions: the force on the plunger depends on its speed, and the plunger speed, in turn, depends on the force. We show the necessary stencils below. The stencil for the plunger displacement is: where According to the VF model we choose to use, the discharge pressure of the downhole pump, p j+1 d , comes from using either Eq. (31) or Eq. (32).
We need the plunger displacement and discharge pressure to compute pressure within the downhole pump through the stencil below: where We use the pressure within the downhole pump to compute the force at the plunger: Following that, we use the force at the plunger to find the plunger speed: (42) u j+1 1 = u j 1 + u 1 ,

C real and simulated data of three functioning oil wells
This appendix lists real and simulated data for three functioning on-shore oil wells located in the Potiguar basin in northeastern Brazil. The following information is available: -Well configuration properties used in the simulations (see Tables 12, 15 and 18 ); -Well profile data-i.e., the segments of the directional trajectories of the production strings (see Tables 13, 16 and 19 ); -Real and simulated data corresponding to the operating parameters of the wells (see Tables 14, 17 and 20 ); -Real and simulated dynamometer surface cards (see Figs. 10,11 and 12 ).
The real data for the wells were obtained from a supervisory system for automated wells, based on automation sensors installed in the field. All simulations were done using Gibbs's VF model. According to Costa, which recommends a range of values between 0.05 and 0.15 for the damping coefficient, lower values for this configuration property give more accurate results for Peak Torque and PPRL, whereas higher values produce more accurate results for PRHP and greater similarity between simulated and measured surface cards. Also, given that Peak Torque values computed by the simulator assume that the pumping unit is perfectly balanced, significant deviations are to be expected when comparing actual and simulated Peak Torque values in case the pumping unit is not balanced in the field.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.