MIN3P-HPC: A High-Performance Unstructured Grid Code for Subsurface Flow and Reactive Transport Simulation

The numerical simulation of flow and reactive transport in porous media with complex domains is nontrivial. This paper presents a method to implement fully unstructured grid capabilities into the well-established software ParMIN3P-THCm, a process-based numerical model designed for the investigation of subsurface fluid flow and multicomponent reactive transport in variably saturated porous media with parallelization capability. The enhanced code, MIN3P-HPC, is modularized to support different cell types, spatial discretization methods and gradient reconstruction methods. MIN3P-HPC uses a vertex-centered control volume method with consideration of both vertex-based and cell-based material properties (e.g., permeability). A flexible parallelization scheme based on domain decomposition and thread acceleration was implemented, which allows the use of OpenMP, MPI and hybrid MPI-OpenMP, making optimized use of computer resources ranging from desktop PCs to distributed memory supercomputers. The code was verified by comparing the results obtained with the unstructured grid version to those produced by the structured grid version. Numerical accuracy was also verified against analytical solutions for 2D and 3D solute transport, and by comparison with third-party software using different cell types. Parallel efficiency of OpenMP, MPI and hybrid MPI-OpenMP versions was examined through a series of solute transport and reactive transport test cases. The results demonstrate the versatility and enhanced performance of MIN3P-HPC for reactive transport simulation.


Introduction
Numerical models for subsurface flow and reactive transport have become important tools in helping researchers and engineers to gain a better understanding of physical, chemical and biological processes in the field of earth and environmental sciences. These models include, but are not limited to, CORE 2D V4 (Samper et al. 2012), CRUNCHFLOW (Steefel and Molins 2016), STOMP (White and Oostrom 2006), HYDROGEOCHEM (Yeh et al. 2004), HYTEC (van der Lee et al. 2003), HPx (Šimůnek et al. 2012), IPARS (Wheeler et al. 2012), OpenGeoSys (Kolditz et al. 2012), ORCHESTRA (Meeussen 2003), PFLOTRAN (Lichtner et al. 2018a, b;Hammond et al. 2012;Trinchero et al. 2018), PHREEQC3 (Parkhurst and Appelo 2013), PHT3D (Appelo and Rolle 2010), RT3D (Clement and Johnson 2012), TOUGHREACT Wei et al. 2015), NUFT (Hao et al. 2011), MIN3P (Mayer 1999;Mayer et al. 2002) and other codes aimed at the micro scale (Mostaghimi et al. 2016;Kang et al. 2006). For spatial discretization of reactive transport problems, structured grids have commonly been the method of choice. This discretization method is sufficient for simulating problems that focus on geochemical aspects because locally mass-conservative formulations are readily available, and because this method is easier to implement. However, many advances have been made in recent years, and the field of reactive transport modeling has matured to a degree that allows for application of these models on a larger scale, requiring the consideration of more complex geometry. Compared to structured grids, the unstructured grid approach allows the user to follow domain or material boundaries and to adjust the resolution wherever needed, without refining the entire domain. Unstructured grid methods originally emerged as a viable alternative to the structured or block-structured grid techniques for discretizing complex geometry in fluid mechanics problems (Mavriplis 1997). Unstructured grid capabilities are particularly beneficial for applications on the field scale, where subsurface flow and reactive transport often require discretization of complex domains. The spatial discretization of such domains in 2D and 3D can be cumbersome, in particular when using a structured grid that only allows for the generation of rectangular meshes. Generally, a better mesh quality provides better convergence and more precise solutions, especially in regions with high pressure, thermal or concentration gradients. Without local refinement, these areas are more likely to cause larger numerical errors. In such situations, an unstructured grid, which can more easily replicate irregular geometries, is beneficial. Unstructured grid capabilities not only provide greater flexibility for discretizing complex domains, but also enable straightforward implementation of adaptive meshing techniques, in order to enhance solution accuracy and efficiency.
Due to the relative complexity of the numerical algorithms, capabilities for simulating reactive transport using unstructured grids are not well developed. This is in part because unstructured grid techniques generate increased computational overhead compared to structured grid methods, a fact that has limited their use particularly for large 3D simulations (Mavriplis 1997). In addition, excessive CPU time due to a lack of parallelization and the requirement for locally mass-conservative solutions also limit the use of unstructured meshes. To the authors' knowledge, only a few of the aforementioned codes Yeh et al. 2004;Lichtner et al. 2018a;Kolditz et al. 2012) can be used for 2D/3D unstructured grid problems (Steefel et al. 2014). Despite the differences in numerical capabilities and key features of flow, transport, and geochemical and microbial reactions, there are still computational limitations that remain, such as flexibility in cell types and parallelization methods.
This paper presents an approach to implement fully unstructured grid capabilities in ParMIN3P-THCm (Su et al. 2017), the parallel version of the MIN3P code (Mayer 1999;Mayer et al. 2002). The key features of the MIN3P code include 3D saturated/unsaturated fluid flow, biogeochemical reactions, heat transport, solute and gas transport, density coupling between flow and transport, and 1D hydromechanical coupling (Mayer et al. 2002;Henderson et al. 2009;Mayer and MacQuarrie 2010;Bea et al. 2012Bea et al. , 2016Su et al. 2017). A global implicit method has been implemented using the direct substitution approach for solving the multicomponent advection-dispersion equation and biogeochemical reactions. However, the requirement for a structured grid approach limits the application of ParMIN3P-THCm in some cases. The enhanced code is named MIN3P-HPC, where HPC stands for high-performance computing as well as high-performance code for complex geometry. The objectives of this contribution are to present (1) the development of a flexible method supporting different cell types in 2D and 3D, while maintaining high-order numerical accuracy; and (2) the implementation of a high-performance parallelization approach that takes advantage of cutting-edge computer architecture, and allows the use of OpenMP, MPI or hybrid MPI-OpenMP for acceleration and scaling.

Methods
In this section, the main governing equations solved in MIN3P-THCm are summarized, followed by an outline of the methods used for unstructured grid implementation and parallelization.

Model Equations
To provide context for the unstructured grid development, the governing equations of the MIN3P-THCm model are briefly reviewed below. The key model components consist of density-dependent variably saturated flow, energy transport and multicomponent reactive transport in porous media. The flow of water and solute mass transport above the ground surface, or surface flow over fracture/macropore interfaces, is not considered. For additional details on the underlying theory and governing equations for biogeochemical reactions, activity corrections and mechanical loading calculations, we refer to (Mayer et al. 2002;Henderson et al. 2009;Mayer and MacQuarrie 2010;Bea et al. 2012Bea et al. , 2016Su et al. 2017).

Density-Dependent Variably Saturated Flow Equations
The implementation of non-isothermal and density-dependent variably saturated flow is described in (Henderson et al. 2009) and (Bea et al. 2012 where φ [L 3 void L −3 bulk] is porosity, S a [L 3 water L −3 void] is the saturation of the aqueous phase and ρ a [M L −3 ] is the pore water density, which is a function of temperature and solution composition.
is the vertical component of the mean total stress and q a [L T −1 ] is the aqueous phase flux, which can be defined as is a source-sink term for the aqueous phase (Bea et al. 2012). The saturation and relative permeability are a function of the aqueous phase pressure based on the relationships given by (Wösten and van Genuchten 1988). Additional details on governing equations for flow and related constitutive relationships can be found in (Henderson et al. 2009) and (Bea et al. 2012(Bea et al. , 2016. Equation (1) is based on Richards' equation written in the head form. It should be noted that the nonlinearity between saturation S a and pressure P makes it impossible to maintain the derivation of water content equality uniformly at the discrete level. In general, though, high accuracy in the time integration is essential to avoid mass balance errors (Farthing and Ogden 2017). For the control volume method, the solution is mass conservative in the sense that fluxes across interfaces between cells are matched.

Energy Balance Equations
The formulation for energy transport implemented in MIN3P-THCm follows the formulation of (Bea et al. 2012) where c a , c g and c s [E M −1 • C] are the heat capacities for aqueous, gas (vapor) and solid phases, respectively; ρ g and ρ s [M L −3 ] are the density of gas (vapor) and solid phase, respectively; S g [L 3 gas L −3 void] is the gas (vapor) phase saturation; ∇ · J e [E L −3 T −1 ] is the energy flux; L w [E M −1 ] is the latent heat of vaporization for water; and Q e [E L −3 T −1 ] is an energy source-sink term.

Reactive Transport Equations
The global mass conservation equations for reactive transport in variably saturated porous media, which contain the contributions for all mobile, adsorbed and mineral species, can be written as (Mayer et al. 2002) where [mol L −3 T −1 ] are external source and sink terms for the aqueous phase and gas phase, respectively.

Unstructured Grid Implementation
The structured grid version of MIN3P-THCm uses the finite volume method (FVM) for spatial discretization, allowing for simulations in one, two and three spatial dimensions. An efficient "influence coefficient" technique is used in the code to avoid costly numerical integration and substantial computational cost in computing matrix entries (Forsyth 1993;Huyakorn et al. 1984Huyakorn et al. , 1986Mayer and MacQuarrie 2010). The influence coefficient represents the geometric characteristics of the different cell types and material properties that are used in flux calculations. The hydraulic conductivity tensor, heat conductivity tensor and solute dispersion tensor can be integrated into the influence coefficient for different cell types without requiring a separate code for each cell type. This technique significantly reduces work associated with code development and makes the code more extensible (e.g., support of hybrid cell types). Compared to the structured grid implementation, the difference in numerical implementation for the unstructured grid lies mainly in the calculation of interfacial fluxes between control volumes. For the unstructured grid implementation, interfacial fluxes can be obtained by defining control volumes around each unknown and by integrating the fluxes over the boundary of a control volume using a midpoint integration rule, where the value of the flux at each control volume face is computed by upwind or central differencing schemes (Mavriplis 1997). Such schemes can be applied for both structured grids and unstructured grids. However, for the unstructured grids, additional cross-diffusion terms are required to ensure stability and accuracy (Mavriplis 1997;Nishikawa 2014b;Pasdunkorale and Turner 2005), which can be obtained by gradient reconstruction and interpolation. A brief introduction to the spatial discretization, gradient reconstruction and evaluation of interfacial flux and cross-diffusion terms is given below.

Spatial Discretization
The spatial discretization of MIN3P-HPC is based on a vertex-centered scheme in which the unknowns of Eqs. (1), (3) and (4) are stored at the vertices of the mesh. A dual control volume for the unstructured mesh (dual mesh), which represents the control volume of each vertex, is created accordingly. Material properties such as hydraulic conductivity can be stored either at the vertices or for the cells. Two methods are used to define the dual mesh in MIN3P-HPC. One method is referred to as the medial-dual (MD) method (Nishida 2010), which joins the midpoint of each edge of the primary grid with the center of the cell. Another method is referred to as the Voronoi-dual (VD) method (Freedman et al. 2014), in which case the segment that joins the two adjacent cell circumcenters is perpendicular to their common edge and crosses this edge at the midpoints, as shown in Fig. 1 for a 2D triangular mesh. Here we revised the general VD dual mesh with consideration of low-quality meshes, where the circumcenter is located outside the cell. For this circumstance, the midpoint of the largest edge is used to build the dual mesh. The revised VD method is valid for any triangular mesh in 2D and tetrahedral mesh in 3D, including meshes containing cells with obtuse angles. These kinds of meshes are not allowed in the conventional VD method. Either the revised VD or MD method can be used in the code; however, we prefer to use the revised VD method because it uses an elegant characteristic of the FVM, i.e. that the control volume interfaces are orthogonal to the line segment between adjacent vertices, which is crucial because the divergence theorem is based on the outward normal vector of the control volume interface. Based on numerical experiments conducted by Su et al. (2020), a revised VD method has been chosen for this study, providing good numerical accuracy, stability and convergence rates. In order to best represent complex geometry, different cell types, including triangular and quadrilateral cells in 2D, tetrahedral and hexahedral, and prism cells in 3D, can be considered in MIN3P-HPC.  (4) can be rewritten in the integral conservative form for an arbitrary control volume V i , enclosed by its boundary Γ i , for detail see (Mayer 1999;Mayer et al. 2002;Henderson et al. 2009;Bea et al. 2012). The time-related term ( ∂ ∂t ) and source-sink term (Q) are identical for both structured and unstructured grids. It is the gradient-related flux term (∇) that differentiates the unstructured grid discretization from the structured grid discretization.
Taking the density-dependent flow Eq. (1) as an example, by applying the Gauss divergence theorem, the governing equation for an arbitrary control volume V i , enclosed by its boundary Γ i and time t, can be written aŝ where d S is the integral with respect to the interfacial area, dV is the integral with respect to the control volume and dt is the integral with respect to the time increment. By applying fully implicit time weighting, the storage terms are approximated using the values of the unknown functions at the vertex inside the control volume where N +1 represents the new time level and N defines the old time level; l represents the l th face of control volume i; S l is the area of the control volume face;n is the unit outward normal of the control volume face; and q N +1 a,l can be expressed as Discretization of Eqs. (3) and (4) can be carried out in a similar way as Eq.
(1). Among Eqs. (6) to (9), only the flux-related term in Eq. (8) differentiates the unstructured grid implementation from the structured grid formulation. This term requires the estimation of the gradient at the control volume face.

Gradient Reconstruction
To calculate the flux term in Eq. (8), a robust and accurate gradient reconstruction method is required. Estimation of fluxes in unstructured grids is a nontrivial task, especially in the case of highly anisotropic media. Taking the l th control volume face, for example, and ignoring the time integral, substituting Eq. (10) into Eq. (8) yields is the hydraulic conductivity tensor at the center of the l th control volume face; S l [L in 2D or L 2 in 3D], is the area of the l th control volume face; and ψ = (Pl+ρa,l gz) ρ a,l g is the pressure head. Since S l is a geometry parameter that can be calculated based on a given mesh, and ρ a,l is the averaged aqueous phase density at the l th control volume face using the arithmetic mean, the accuracy of the flux across the l th control volume face depends on K l ∇ψ ·n l , which can be rewritten as The vector w l = K T ln l can be further decomposed as w l = w l ·û l û l + w l ·n l n l , as shown in Fig. 2.û l is the unit vector of w l on the l th control volume face andn l is the unit outward normal. The vector direction of u l depends on the w l direction and n l direction. Similarly, for v l = v l ·û l û l + v l ·n l n l , we obtain The flux term in Eq. (12) can be rewritten as where w l , v l ,n l andû l are the geometry parameters that can be calculated based on the given mesh geometry and material properties (e.g., hydraulic conductivity tensor, dispersion tensor). Equation (14) includes two terms: the primary term ∇ψ · v l can be approximated by Taylor's theorem, and the gradient required for the secondary term or cross-diffusion term can be computed using either Green-Gauss gradient reconstruction or least-squares method (Nishikawa 2014a;Pasdunkorale and Turner 2005;Sejekan 2016). For structured and orthogonal meshes, u l is perpendicular to v l and the gradient ∇ψ is perpendicular to u l based on first-order Taylor's expansion; Eq. (14) can be reduced to K l ∇ψ ·n l = w l ·n l v l ·n l (Mayer 1999;Mayer et al. 2002). In this case, a gradient reconstruction term is not required, as two-point flux approximation is accurate. For general meshes, where orthogonality does not hold, multi-point flux approximation with gradient reconstruction is needed for better accuracy (Aavatsmark 2002). These methods are known for not being monotonic. In this contribution, we use a revised Voronoi-dual mesh method, together with multi-point upstream weighting, which ensures stability and monotonicity while providing good convergence rates (Su et al. 2020).
Here we provide an alternative implementation based on a higher-order gradient reconstruction for both the primary term and the cross-diffusion term, following the work of (Pasdunkorale and Turner 2005). As shown in Fig. 2, let x F denote the location of center point F in the l th face of control volume V i , δx − be the vector from F to node i, δx + be the vector from F to node j. Consider the Taylor expansion of function ψ where the remainder R has the Lagrange form, which can be expressed as One can substitute the vectors δx − and δx + for δx in the above equation to obtain the function values at node i and node j, which yields where ε ji is the kth-order Taylor's expansion for the high-order least-squares gradient reconstruction.
For a fine mesh (e.g., orthogonal mesh), it is possible to assume ε ji ≈ 0; however, this may not hold for a mesh with large expansion factors or aspect ratios. Substitution of Eq. (17) into Eq. (14) gives the following expression for the flux term where (18) can be determined by the higherorder least-squares gradient reconstruction method (Pasdunkorale and Turner 2005;Sejekan 2016). This method is used to reduce local mass balance errors, which is a necessary requirement for the solution of reactive transport, in particular when the global implicit solution method is used. Alternatively, the gradient at the point F can be calculated from base functions defined on the elements (Helmig 1997;Bas 1999). The energy balance Eq.
(3) and reactive transport Eq. (4) are discretized in a similar way by replacing the hydraulic conductivity tensor K with a heat conductivity tensor or dispersion tensor, respectively. Discretization into quadrilateral, tetrahedral, hexahedral and prism meshes can be carried out in a similar way.

Parallel Implementation
Parallelization of MIN3P-HPC was based on ParMIN3P-THCm (Su et al. 2017), a general-purpose parallel code based on the domain decomposition method using MPI. For MIN3P-HPC, the parallel scheme was further optimized by adding thread acceleration using OpenMP. The optimized parallel code allows the use of OpenMP, MPI and hybrid MPI-OpenMP. With these features, different parallel schemes can be used for simulation tasks with different scale and complexity.
An overlapping domain decomposition of the unstructured grid is implemented based on PETSc's (Balay et al. 1997(Balay et al. , 2018a DMPlex module. Overlapping domain decomposition methods are used for efficiency and flexibility. Domain decomposition for an unstructured grid requires overlap of one or more nodes if using the Green-Gauss gradient reconstruction or second-order least-squares gradient reconstruction method. For the higher-order least-squares gradient reconstruction, two or more nodes must overlap between adjacent subdomains. There is no additional message communication required to obtain data from a neighboring subdomain, since each subdomain has local copies of all data associated with the ghost nodes/cells. For each subdomain, OpenMP threading acceleration is used. The parallelized version of the code can be run in three modes: with thread acceleration only, the code runs in OpenMP parallelization; with domain decomposition only, the code runs in MPI parallelization; and by combination of thread acceleration and domain decomposition, the code runs in hybrid MPI-OpenMP parallelization. The hybrid MPI-OpenMP parallelization can take advantage of MPI scalability but simultaneously reduces communication overhead by using OpenMP. In addition to the default PETSc parallel solver packages included in ParMIN3P-THCm, an alternative linear iterative solver (LIS) (Nishida 2010; Kotakemori et al. 2005Kotakemori et al. , 2008Fujii et al. 2005) has been implemented in the code that can support OpenMP, MPI and hybrid MPI-OpenMP. System input and output (IO) is usually a bottleneck in the parallel code. In MIN3P-HPC, HDF5 and XDMF data file format are used to support efficient parallel IO. The workflow of MIN3P-HPC parallelization remains the same as ParMIN3P-THCm (Su et al. 2017), with the aforementioned new features added.
The general geometric multigrid method for unstructured meshes (Bastian and Helmig 1999) has not been implemented in this research. Alternatively, algebraic multigrid methods are available to solve the linearized equations. Although algebraic multigrid methods commonly require substantial indirect addressing and integer computations, MIN3P-HPC is able to solve the system of equations efficiently. MIN3P-HPC is built on the PETSc library, in which various algebraic multigrid solvers (e.g., Hypre, BoomerAMG) are optimized by the developers to effectively solve the linearized equations (May et al. 2016). These solvers provide different multigrid approaches yielding excellent performance for a broad range of problems.

Verification and Demonstration
The aims of this section are to (1) provide verification of the code for general reactive transport problems using the unstructured mesh, and (2) demonstrate the advantages and suitability of the unstructured mesh implementation for reactive transport problems with irregular material property distributions. The purpose of the simulations is not the assessment of real-world reactive transport problems or interpretation of data, although the input parameters are physically realistic. Since the structured grid version of MIN3P-THCm has been verified against many other reactive transport codes (Carrayrou et al. 2010;Marty et al. 2015;Perko et al. 2015;Şengör et al. 2015), and ParMIN3P-THCm has also been verified by comparing the results to those of the sequential version for a wide range of verification problems (Su et al. 2017), the verification of MIN3P-HPC was carried out by comparing the results to analytical solutions, structured grid solutions obtained with MIN3P-THCm and ParMIN3P-THCm, and another reactive transport code, namely PFLOTRAN (Lichtner et al. 2018a, b). It should be mentioned that the control volume method is common in the field of reactive transport modeling due to the property of being locally mass conservative (Steefel et al. 2014) and that the intercomparisons are made using only control volume numerical methods.
To demonstrate the advantages and suitability of the unstructured mesh implementation, two synthetic reactive transport cases were created. We use a set of hypothetical parameters to define the simulation cases. The chosen parameters are within the range of values observed in the field (Bain et al. 2001;Mayer et al. 2001Mayer et al. , 2002Bea et al. 2016Bea et al. , 2018 and thus are suitable for demonstrating the capabilities of our reactive transport code.

Verification with Analytical Solutions
For basic model verification of flow and transport capabilities, a 3D solute transport case, for which an analytical solution is available, is considered.The test case is based on 3D migration of chloride from a landfill, created by filling in a gravel pit excavated in a valley-fill aquifer (Wexler 1992). The solute source is at the inflow boundary on the left side of the simulation domain. Physical parameters are summarized in Table  1.
Molecular diffusion is added to account for the effect of transverse dispersion, via an upscaled diffusion coefficient. The analytical solution is based on the algorithm provided in (Wexler 1992). A structured grid mesh, an unstructured mesh consisting of prisms and an unstructured tetrahedral mesh are used for both MIN3P-HPC and PFLOTRAN, as shown in Fig. 3.
Concentrations from the analytical solution are calculated for locations defined by the structured grid in Fig. 3a. The normalized concentration C/C 0 distribution after 3,000 days is shown in Fig. 4. The results from MIN3P-HPC and PFLOTRAN both agree well with the analytical solution. It should be noted that the solution obtained from (Wexler 1992) is semi-analytical because the integral in the solute concentration formula must be evaluated numerically using the Gauss-Legendre numerical integration technique. Because of this, the analytical solution is not mesh-free, and a similar structured grid is needed for calculating the analytical solution. This makes point-to-point comparisons between the analytical solution and unstructured grid solution impossible. There are small differences in the source concentrations on the left boundary that are caused by the different numerical schemes of MIN3P-HPC and PFLOTRAN. PFLOTRAN uses a cell-centered scheme where the unknowns (here chloride concentrations) are set at the cell center, while MIN3P-HPC uses a vertexcentered scheme, where the unknowns are set at the vertices. Compared to the solutions obtained analytically and with the structured grid method, the unstructured grid solutions obtained with both MIN3P-HPC and PFLOTRAN show a low level of numerical dispersion, which cannot be avoided unless the mesh is fully orthogonal. The tetrahedral mesh produces more significant dispersion than the structured grid or prism mesh, which implies that the mesh quality of the orthogonal and prism meshes is better than the discretization based on tetrahedral grids. Nevertheless, MIN3P-HPC using the tetrahedral mesh still generates smooth results of acceptable quality without oscillation.

Verification Against Structured Grid Solutions
Considering that analytical solutions for reactive transport problems with heterogeneous material properties, complex reactions or density-dependent flow do not exist, an alternative option for verification is by comparison of the results obtained from the unstructured grid implementation with those obtained from the structured grid version. For this purpose, a density-dependent flow problem is selected to evaluate the accuracy of the code. In this context, it is worth mentioning that density-dependent flow and transport are very sensitive to the simulation mesh, thus proving a suitable test for the unstructured grid implementation. The density-dependent flow and transport problem presented here is the Elder problem, a well-known 2D fluid convection problem affected by solute concentration gradients (Voss and Souza 1987;Simmons and Elder 2017). The Elder problem domain consists of a cross section containing a zone of freshwater underlying a source of brine. A concentration of 0 g/L was assigned to the lower part of the domain, and the concentration was set to C s at the upper boundary. No flow boundaries were assigned to any of the four sides of the domain. A constant fluid pressure of 0 Pa was assigned to the upper left corner of the domain (Voss and Souza 1987). The physical input parameters are provided in Table 2. Three meshes were considered in this verification problem: a structured grid, an unstructured mesh triangulated from the structured grid and a general triangular mesh, as shown in Fig. 5. The structured and triangulated structured grids (Fig. 5a, b contain 8,192 nodes and the unstructured grid (Fig. 5c) contains 7,390 nodes. The average resolution is 2.4 m for these three meshes.
Velocity and concentration distributions indicate that the results obtained with the unstructured grid methods compare well with those from the structured grid. Although this problem is very sensitive to mesh quality, and other possible stable solutions have been obtained (Simmons and Elder 2017;Simpson and Clement 2003), the various unstructured grid implementations are able to reproduce the dimensions and temporal evolution of the convection cells in a consistent fashion.
To further verify the implementation, simulation results obtained with different mesh resolutions have been tested for both structured and unstructured meshes.  , d to f velocity distribution after 4 years using SG, USG1 and USG2, g to i concentration distribution after 4 years using SG, USG1 and USG2, j to l concentration distribution after 7 years using SG, USG1 and USG2 mesh are in good agreement with results obtained from the unstructured mesh. The small difference between structured and unstructured meshes is due to the sensitivity of mesh-induced anisotropy (Boufadel et al. 1999).
It is interesting to observe that the mesh refinement both improves the convergence rate and minimizes the mass balance error, while the parallel solver remains robust when the number of processors is increased. The convergence rate and relative mass balance error for these simulations are shown in Table 3.

Demonstration of Reactive Transport in a Heterogeneous Porous Medium
This 2D reactive transport case focuses on calcite dissolution in a fully saturated heterogeneous porous medium. The simulation domain contains two materials with different hydraulic conductivity, porosity and dispersivity. The components H + , CO 2− 3 and Ca 2+ , the secondary species OH − , HCO − 3 and H 2 CO 3 (aq), and the mineral calcite are included in the geochemical system. A fixed hydraulic head was specified at the lower left and upper right boundary, respectively. An acidic solution with elevated  Fig. 5, and the colored symbols represent the locations of the observation points (coordinates provided in text). a Observation points 1 to 3 using structured mesh, b observation points 1 to 3 using unstructured mesh, c observation points 4 to 6 using structured mesh, d observation points 4 to 6 using unstructured mesh, e observation points 1 to 3 using structured mesh and unstructured mesh with resolution of 0.5 m, and f observation points 4 to 6 using structured mesh and unstructured mesh with resolution of 0.5 m concentrations of carbonic acid was specified at the inflow boundary, reacting with the mineral calcite (CaCO 3 ). Reaction products then exit at the upper right boundary, as shown in Fig. 7a and 7(b). Physical and chemical input parameters are given in Table 4 and are within the range of parameters relevant for field conditions (e.g., Bain et al. 2001;Mayer et al. 2001Mayer et al. , 2002Bea et al. 2016Bea et al. , 2018. Four different mesh sizes were used for both structured and unstructured grids. The average cell/edge size for the four meshes was 0.01 m, 0.005 m, 0.0025 m and 0.00125 m. For the structured grid, the total number of nodes was 561, 2,145, 8,385 and 33,153, respectively, and for the unstructured grid, the total number of nodes was 666, 2,499, 9,448 and 37,129. The simulation was carried out for a period of 100 days. The dissolution front of calcite advances rapidly in the zone with higher hydraulic conductivity, leading to sharp gradients when it reaches the zone with lower hydraulic conductivity. At the interface of these two zones, the dissolution front can advance differently depending on the cell size. The calcite volume fractions after 100 days are shown in Fig. 7cj. The simulation results indicate that the structured grid implementation requires a fine mesh with a cell size no larger than 0.0025 m to produce accurate results, while a mesh resolution of 0.005 m is sufficient for the unstructured grid solution. This corresponds to a total of 8,385 nodes for the structured grid and 2,499 nodes for the unstructured grid. The structured grid represents the hydraulic conductivity distribution in a "jagged" shape, and in order to produce an accurate solution, a high mesh resolution is required. Compared to the structured grid, the unstructured grid can represent the irregular distribution of hydraulic conductivities much better and does not require a highly refined mesh to obtain accurate results. The simulated results illustrate that geochemical reactions tend to occur at interfaces with varying properties, implying that gradients controlling fluid and solute exchange are often large near the interfaces, requiring refined spatial discretization in these regions.

Demonstration of Reactive Transport in a Complex Nonuniform Domain
The second demonstration case focuses on calcite dissolution through a variably saturated porous medium with heterogeneous and anisotropic hydraulic conductivities, as shown in Fig. 8a. The number of nodes in the simulation domain is 73,258 and the average edge length of triangle cells is 0.625 m The flow is considered steady in this case. The simulation domain contains three materials with different hydraulic conductivity, porosity and dispersivity. The upper and lower boundaries are assumed to be zero flux. Fixed hydraulic heads are specified at the left and right boundaries. As for the first demonstration example, a solution rich in carbonic acid enters the solution domain across the inflow boundary and reacts with the mineral calcite, and reaction products exit across the upper right boundary. The physical and chemical input parameters are given in Table 5 and are within the range of parameters relevant for field conditions (e.g., Bain et al. 2001;Mayer et al. 2001Mayer et al. , 2002Bea et al. 2016Bea et al. , 2018. The simulation was carried out for a period of 500 days. As shown in Fig. 8, flow and reactive transport in the simulation domain occur in three distinct regions, including (1) slow flow and transport through the low permeability Zone 1, (2) fast flow and transport through high permeability Zone 2, and (3) diffusion-dominated transport in the very low permeability Zone 3. The results for water saturation, flow velocity, calcite volume fractions and pH after 100, 200 and 500 days are shown in Fig. 8b-Fig. 7 2D reactive transport in a heterogeneous porous medium: a structured grid, b unstructured grid, c, e, g, i calcite volume fraction after 100 days with average cell size ranging from 0.01 m to 0.00125 m for the structured grid, and d, f, h, j calcite volume fraction after 100 days with average cell size ranging from 0.01 m to 0.00125 m for the unstructured grid Porosity in Zone 2 φ 2 0.15 [-] Hydraulic conductivity in Zone 1 j. The dissolution front of calcite advances quickly through Zone 2, but is inhibited by limited access to Zones 1 and 3, leaving behind an extensive region containing residual calcite. The pH value in the simulation domain changes accordingly as the calcite dissolves. The unsaturated zone above the water table is not exposed to the acidic solution; calcite remains present throughout the simulation and pH remains elevated, as shown in Fig. 8. The simulation demonstrates the versatility of the code in simulating flow and reactive transport in a complex domain with irregular boundaries and heterogeneous and anisotropic material properties.

Parallel Performance
For parallel performance testing, both accuracy and computational time are considered. To quantitatively estimate the accuracy of parallelism for the unstructured grid code, we selected the reactive transport case with a complex nonuniform domain as the test case (Sect. 3.4). The results obtained with the sequential code were used as the reference solution for comparison with the results computed by the various parallel implementations and different numbers of processors. The relative root mean square error (RRMSE) (Despotovic et al. 2016) was calculated for the simulation results to quantify the accuracy of the parallel implementation. Two variables were selected, namely calcium concentrations and pH. The parallel accuracy was tested for both the OpenMP and MPI versions, with up to 32 processors and 160 processors, respectively. As shown in Table 6, both the OpenMP and MPI parallel versions yield small RRMSEs, indicating that the numerical results are consistent when the parallel option is used, independent of the number of processors. The RRMSEs obtained for the results from the OpenMP parallel version are smaller than the RRMSEs obtained for the MPI parallel version because the convergence criteria used in the LIS solver are stricter than those used in the PETSc solver. For both versions, the RRMSE remains small and does not increase as the number of processors increases, indicating that both parallel versions generate consistent results.
To analyze the computational time, seven scenarios have been tested, which include both solute transport and reactive transport test cases in 2D and 3D, as shown in Table  7. The 2D solute transport test cases include simulations using a coarse mesh (trans-2da), a medium mesh (trans-2d-b) and a fine mesh (trans-2d-c). The 3D solute transport test cases include simulations using a coarse mesh (trans-3d-a) and a medium mesh (trans-3d-b). The reactive transport test cases include 2D simulations using a medium mesh (reactran-2d) and 3D simulation using a medium mesh (reactran-3d). For the solute transport test cases, the initial and boundary conditions are the same as those shown in Table 1, except that the 2D solute transport test cases use a Z-plane domain cutting through the source center (Z = 198.12 m). The reactive transport cases use the same domain as the solute transport cases except that the initial and boundary conditions of components are changed to those listed in Table 4. The total degrees of freedom (tdof) for the 2D solute transport problem ranges from 4,250 for the coarse mesh to 33,601 for the fine mesh, and tdof for the 3D solute transport problem ranges from 49,321 for the coarse mesh to 664,671 for the fine mesh. The tdof for the reactive transport problem is 43,395 for the 2D problem and 1,994,013 for the 3D problem, respectively. All scenarios have been tested using the MPI and OpenMP versions. Scenarios trans-2d-b, trans-2d-c and reactran-2d have also been tested for the hybrid MPI-OpenMP version. The parallel performance of the structured grid version was comprehensively analyzed in a previous contribution (Su et al. 2017). In this paper, we focus on the parallel performance analysis based on a scaling test for unstructured meshes. The performance analysis evaluates how the total runtime of each simulation scales with the number of processors, without considering the computational cost for input and output. Each case investigated uses the unstructured grid option of the code and is performed for the same mesh and resolution. For the OpenMP version, the performance was tested for up to 32 processors for the 2D case and 48 processors for the 3D case. For the MPI version, the performance was tested for up to 32 processors for the 2D case and 192 processors for the 3D case. The hybrid version was tested for up to 192 processors.

Parallel Performance for Solute Transport Case
The total runtime and speedup related to the number of processors (nprocs) are shown in Fig. 9. Both OpenMP and MPI versions show significant runtime decreases and speedup increases as the number of processors increases. Generally, the performance  of the OpenMP version is better than that of the MPI version when the tdof is small, as shown in Fig. 9a-c. This is mainly due to the communication overhead in the MPI parallelization for cases with small tdof. In the 3D test case with the fine mesh and thus a large number of unknowns, the MPI parallelization provides a better performance than the OpenMP version. As shown in Fig. 9d, the total runtime for the MPI version is generally less than that of the OpenMP version. This is caused by the overhead incurred in creating the threads for the OpenMP version. For the small simulation case, the communication overhead for the MPI version is significant compared to the overhead in creating threads for the OpenMP version. For the large simulation case, the overhead for the OpenMP version can be more expensive than that for the MPI version. It can also be observed that the OpenMP version can gain super-linear speedup by taking advantage of better usage of cache of the CPU, as shown in Fig. 9c,  Fig. 9c, d that parallel performance yields very good speedup when the tdof per processor is larger than 2,000.

Parallel Performance for Reactive Transport Case
To test the performance for reactive transport simulations, the 2D and 3D solute transport cases were modified by including calcite precipitation/dissolution reactions and aqueous complexation of calcium and carbonate species. The performance of reactive transport simulations is shown in Fig. 10. Compared to the solute transport simulation, reactive transport simulations add additional computational burden to the system caused by local geochemical reactions and their coupling to the transport equations. Since this extra burden does not require additional communication cost, it does not Fig. 9 Parallel runtime and speedup: a 2D solute transport with original mesh, b 2D solute transport with fine mesh, c 3D solute transport with original mesh, and d 3D solute transport with fine mesh negatively affect the parallel performance. However, this extra burden may change cache efficiency, depending on the problem size. As shown in Fig. 10a, b, for a 2D reactive transport simulation that does not require large memory, the simulation is "cache-friendly" and produces better performance than the solute transport simulation. For large 3D reactive transport simulations, in which the memory demand is far more than the cache size, the parallel performance declines relative to the solute transport simulation with the same grid resolution, as shown in Fig. 10c, d. Comparing Fig. 10a-d, we can also observe that the OpenMP parallel version generally provides better speedup than the MPI parallel version.

Parallel Performance of Hybrid Version
Both MPI parallelization and OpenMP parallelization have their own advantages and limitations. In the MIN3P-HPC code, a hybrid MPI-OpenMP parallelization is implemented by taking advantage of both parallelization methods. For evaluating the parallel performance of the hybrid version, three cases were selected, as shown in Table 7. In this context, computational costs are reported for runtimes associated with computing matrix entries, solving linear equations and for the total runtime. All local geochemical reaction calculations were included in computing the matrix entries. Case trans-2d-b and reactran-2d were tested with the hybrid version using two threads for each MPI process, and the code performance was compared to OpenMP and MPI versions. Case trans-2d-c was tested with the hybrid version using from two to nine threads for each MPI process. Here, nprocs refers to the total number of processors and threads used. As shown in Fig. 11, performance of the hybrid version depends on the case size and computational cost. For small cases (trans-2d-b) that use low computational resources, as shown in Fig. 11a, the hybrid version does not produce benefits compared to the OpenMP and MPI versions, indicating that the overhead in the hybrid version cannot be further reduced for this case and that the OpenMP version has much better performance than the other two versions. With geochemical reactions added into the case, the hybrid version generally yields better performance than the MPI version when nprocs is small (e.g., less than 10), as shown in Fig. 11b. Performance for matrix entry computations is good and of similar efficiency for all parallel versions, but the hybrid version generally yields better performance as nprocs increases. However, the performance of the linear solver investigated here may deteriorate as nprocs increases, thus slowing the total performance. In general, it appears that OpenMP is marginally faster for the solute transport case due to the overhead of the overlapping domain decomposition, while the hybrid approach performs better for the reaction case due to the lower overhead of thread setup. For both cases, the matrix setup scales ideally because there is no communication involved, whereas the solution of the linear system of equations is more problematic, due to communication requirements.
To test the role of threads on the performance of the hybrid version, case trans-2d-c was tested for up to 192 processors and threads. As shown in Fig. 11c, using more threads gives better performance as nprocs increases. In this case, when nprocs is less than 10, using two threads gives the best performance. As nprocs increases, using four threads provides the best performance when nprocs is around 100, and Fig. 10 Parallel runtime and speedup for solute transport and reactive transport: a 2D case using MPI parallel version, b 2D case using OpenMP parallel version, c 3D case using MPI parallel version and d 3D case using OpenMP parallel version Fig. 11 Parallel runtime of hybrid version: a case trans-2d-b, b case reactran-2d and c case trans-2d-c finally, using six threads yields the best performance when nprocs is larger than 100. It can be deduced that the performance of the hybrid parallel version depends on the problem size, number of threads per processor and total number of processors and threads used. For the case of limited computer resources, the hybrid version gives better performance when the number of threads per processor is small. In contrast, if more substantial computational resources are available, a larger number of threads per processor provides better performance when using the hybrid version.
It should be noted that load balancing is a substantial concern in parallel performance. The code has been implemented with different scheduling methods for OpenMP parallelization of loops, including static, dynamic, guided and automatic. The performance of these scheduling methods depends on the balance of the simulation case. Without loss of generality, here we assume that the simulation cases are not well balanced. Static scheduling with different chunk sizes has been tried in the performance test for OpenMP parallelization. The non-weighted domain decomposition method is used for the MPI and hybrid MPI-OpenMP versions.

Conclusions
This paper introduced the implementation of unstructured grid capabilities into the existing reactive transport codes MIN3P-THCm and ParMIN3P-THCm. The new code, MIN3P-HPC, provides solutions of comparable accuracy relative to the results obtained with the structured grid version as well as with PFLOTRAN. For meshes that lack orthogonality, e.g., a tetrahedral mesh, MIN3P-HPC can avoid oscillation and generates accurate results without loss of quality. In addition, the new code provides a high degree of versatility and is suitable for reactive transport problems involving complex geometries and irregular internal and external boundaries by using the cell types that can best represent the simulation domain. These enhanced capabilities increase the solution accuracy without significantly increasing computational cost, especially for cases with a high degree of heterogeneity and anisotropy.
The performance of different parallelization schemes was analyzed based on a series of test cases. The implemented parallelization scheme provides flexibility to optimize performance by specifying the use of computational resources for different computer platforms, ranging from desktop PCs to distributed memory supercomputers. Both OpenMP parallelization and MPI parallelization provide good acceleration. The hybrid MPI-OpenMP version, if configured properly, can give additional speedup compared to conventional OpenMP and MPI parallelization. The parallel performance test indicates that for small problems, OpenMP parallelization generally performs better than MPI parallelization, while for large problems that require more substantial computational resources, the MPI and hybrid MPI-OpenMP parallelizations are preferable. By using the high-performance parallelization scheme, the solution of a problem with a given size can be obtained faster than with the sequential version. As a result, the revised code shows a high level of versatility and is suitable for simulating large-scale longterm reactive transport problems with complex geometries, irregular boundaries and heterogeneous material distributions within practical computing times.