Computationally Efficient Multiscale Neural Networks Applied to Fluid Flow in Complex 3D Porous Media

The permeability of complex porous materials is of interest to many engineering disciplines. This quantity can be obtained via direct flow simulation, which provides the most accurate results, but is very computationally expensive. In particular, the simulation convergence time scales poorly as the simulation domains become less porous or more heterogeneous. Semi-analytical models that rely on averaged structural properties (i.e., porosity and tortuosity) have been proposed, but these features only partly summarize the domain, resulting in limited applicability. On the other hand, data-driven machine learning approaches have shown great promise for building more general models by virtue of accounting for the spatial arrangement of the domains’ solid boundaries. However, prior approaches building on the convolutional neural network (ConvNet) literature concerning 2D image recognition problems do not scale well to the large 3D domains required to obtain a representative elementary volume (REV). As such, most prior work focused on homogeneous samples, where a small REV entails that the global nature of fluid flow could be mostly neglected, and accordingly, the memory bottleneck of addressing 3D domains with ConvNets was side-stepped. Therefore, important geometries such as fractures and vuggy domains could not be modeled properly. In this work, we address this limitation with a general multiscale deep learning model that is able to learn from porous media simulation data. By using a coupled set of neural networks that view the domain on different scales, we enable the evaluation of large (>512³) images in approximately one second on a single graphics processing unit.
This model architecture opens up the possibility of modeling domain sizes that would not be feasible using traditional direct simulation tools on a desktop computer. We validate our method with a laminar fluid flow case using vuggy samples and fractures. As a result of viewing the entire domain at once, our model is able to perform accurate prediction on domains exhibiting a large degree of heterogeneity. We expect the methodology to be applicable to many other transport problems where complex geometries play a central role.


Introduction
In the last few decades, micro-tomographic imaging in conjunction with direct numerical simulations (digital rock technologies) has been developed extensively to act as a complementary tool for laboratory measurements of porous materials (Schepp et al. 2020). Many of these breakthroughs are partly thanks to advances in data storing and sharing (Prodanovic et al.), wider availability of imaging facilities (Cnudde and Boone 2013), and better technologies (hardware and software) to visualize fine-scale features of porous media (Wildenschild and Sheppard 2013). Nevertheless, characterization based on standalone images does not provide enough insight into how the small-scale structures affect the macroscopic behavior for a given phenomenon (e.g., fluid flow). A more robust way of understanding these (and potentially being able to upscale them) is through simulating the underlying physics of fluid flow.
The increase in speed and availability of computational resources (graphics processing units, supercomputer clusters, and cloud computing) has made it possible to develop direct simulation methods to obtain petrophysical properties based on 3D images (Pan et al. 2004; Tartakovsky and Meakin 2005; White et al. 2006; Jenny et al. 2003). However, solving these problems in time frames that could allow their industrial applicability requires thousands of computing cores. Furthermore, the most insight could be gained from repeated simulation with dynamically changing conditions: for example, assessing the influence of diagenetic processes such as cementation and compaction, modifying surface properties like roughness, or tuning the segmentation of a sample to match experimental measurements. These scenarios entail solving a forward physics model several times (in similar domains), which is prohibitively expensive in many cases. A model that could give fast and accurate approximations of a given petrophysical property is therefore of great interest.
A particular physical framework of interest in digital rocks physics is to describe how a fluid flows through a given material driven by a pressure difference. This is relevant to characterize how easy it is for a fluid to travel through a specific sample, and it can also reveal the presence of preferential fluid paths and potential bottlenecks for flow. By understanding how a fluid behaves in a representative sample, it is possible to use this data to inform larger scale processes about the effect of the microstructure. For example, a strong understanding of representative flow can provide macroscopic simulations with constitutive relations. The simplest and most important way to summarize the microstructural effects on flow is with the permeability, which is a volume-averaged property derived from the fluid velocity that describes how well a fluid can advance through its connected voidspace. Knowing the permeability is of interest not only for petroleum engineering, but also for carbon capture/sequestration (Bond et al. 2017), aquifer exploitation (Cunningham and Sukop 2011), geothermal engineering (Molina et al. 2020), membrane design, and fuel cell applications (Holley and Faghri 2006).
Despite the fact that there are many published analytical solutions and computational algorithms to obtain the permeability in a faster manner, they do not work well in the presence of strong heterogeneities associated with important geometries such as fractures (Tokan-Lawal et al. 2015). This is partly due to the fact that most of these proposed solutions are computed based on averaged properties of the solid structure, such as the porosity and the tortuosity of the sample (Carman 1939, 1997; Kozeny 1927; Bear 1972). The main issue is that samples with very similar average structural values could have widely different volumetric flux behaviors, for example, when fractures or vugs are present. For instance, a certain porous structure could have permeability values spanning three orders of magnitude depending on whether the domain is not fractured, or whether it hosts a fracture parallel or perpendicular to the direction of flow. While these situations significantly affect permeability, the porosity remains relatively unchanged; there is no known route for characterizing the relationship between flow and complex microstructure in terms of a small number of variables. While there are theoretical approaches based on correlation functions (Torquato 2020), they are mostly applied to statistically homogeneous microstructures, and it remains unclear if such an approach can apply to the complex multiscale porous media found in many natural samples.
To obtain a measure of the permeability of a sample taking into account the 3D microstructure, a fluid flow simulation can be carried out with a wide variety of iterative numerical methods that approximate the solution of the Navier-Stokes equation (Saxena et al. 2017). One of the most prominent is the lattice-Boltzmann method (LBM). Although these simulations are performed at a much smaller scale relative to a natural reservoir, they provide the critical parameters to enable the upscaling of hard data (such as cores coming from wells or outcrops) into field scale simulators. Although it would be desirable to simulate bigger computational volumes that contain more information about the reservoir of interest (since imaging technology can provide volumes that are 2000³ voxels or larger), it is computationally expensive, making it very difficult to perform routinely or repeatedly.
A representative elementary volume (REV) has to be ensured to reliably utilize these properties in large scale (field) simulations (and thus upscale). An REV is defined as the size of a window where measurements are scale-independent, and that accurately represents the system (Bachmat and Bear 1987). However, having an REV for a structural property such as porosity (which is easily determined from a segmented image) does not guarantee that this window size would have a representative response in a flow property like permeability. As shown in Costanza-Robinson et al. (2011), for fairly homogeneous samples, the side length of the window to obtain an REV in a dynamic property is at least five times what it is for the structure (porosity). This is one of the reasons why porosity alone is a poor estimate for permeability: even when the pore structures are similar, the flow patterns that they host could be very different due to the global nature of the steady-state solution. In the context of single fractures, it is still unclear if an REV exists (Santos et al. 2018a; Guiltinan et al. 2020b, a). This highlights the need for more advanced methods that can provide accurate solutions on large samples that take into account all the complexities of the domain.
In the last decade, convolutional neural networks (ConvNets) (Fukushima 1980) have become a prominent tool in the field of image analysis. These have taken over traditional tools for computer vision tasks such as image recognition and semantic segmentation, as a result of being easily trained to create spatially aware relationships between inputs and outputs. This is accomplished with learnable kernels, which can be identified with small windows that have the same dimensionality as the input data (i.e., 2D for images, 3D for volumes). They have been successfully applied in many tasks regarding earth science disciplines (Alakeely and Horne 2020; Jo et al. 2020; Pan et al. 2020), and particularly in the field of digital rocks (Guiltinan et al. 2020a; Mosser et al. 2017a, b; Chung et al. 2020; Bihani et al. 2021; Santos et al. 2018b). These data-driven models have also been useful for solving flow (Santos et al. 2020b; Wang et al. 2021a, b; Alqahtani et al. 2021), successfully modeling the relationship between 3D microstructure and flow response much more accurately than empirical formulas that depend only on averaged properties.
However, ConvNets are expensive to scale to 3D volumes. This is due to the fact that these structures are memory intensive, so traditional networks used for computer vision tasks (e.g., the UNet (Jha et al.) or the ResNet (He et al. 2016)) limit the input sizes to around 100³ (Santos et al. 2020b; Wang et al. 2021b). As shown in Santos et al. (2020b), one could subsample the domain into smaller pieces to use these architectures, where the subsample does not need to be an REV but has to be accompanied by features that inform the model about the original location of this subdomain (i.e., tortuosity, pore-space connectivity, distance to the no-slip boundaries). This method provides accurate results; nevertheless, predictions stop being reliable in domains with large heterogeneities, such as fractures or vugs. Alternatively, one can downsample the image into a more tractable size (Alqahtani et al. 2021), sacrificing detail. Wang et al. (2021a) state that one major point of focus for deep learning of porous media is the computational cost of working with 3D arrays, and the development of architectures that are better suited for porous materials instead of the usual architectures proposed for 2D classification/segmentation of standardized datasets (Deng et al. 2009). Here, we provide a strong approach to directly address these outstanding issues.
A multiscale approach that is able to capture large and small scale aspects of the microstructure simultaneously is an attractive proposal to overcome this limitation. Multiscale approaches have precedent in the ConvNet literature. Super-resolution ConvNets (Shi et al. 2016; Huang et al. 2017; Yang et al. 2018; Yamanaka et al. 2017) address images at two different scales, although the problem is not one of regression, but of interpolation; fine-scale input is not available to the model. Karras et al. (2017) provide an explicitly multiscale training pipeline, which progressively grows a generative network to build a high-resolution model of image datasets. This is accomplished by starting with coarser, lower-resolution networks and gradually adding details using networks that work with finer scales. Perhaps the most relevant to our work is SinGAN (Shaham et al. 2019), another generative model that uses a linked set of networks to describe images at different scales; finer scales build upon the models for coarser scales.
We invoke similar principles to build a MultiScale Network for hierarchical regression (MS-Net), which performs regression based on a hierarchical principle: coarse inputs provide broad information about the data, and progressively finer-scale inputs can be used to refine this information. A schematic of the workflow is shown in Fig. 1. In this paper, we use MS-Net to learn relationships between pore structure and flow fields of steady-state solutions from LBM. Our model starts by analyzing a coarsened version of a porous domain (where the main heterogeneities affecting flow are present), and then proceeds to make a partial prediction of the velocity field. This is then passed subsequently to finer-scale networks to refine this coarse prediction until the entire flow field is recovered. This paradigm exhibits strong advantages over other ConvNet approaches such as Poreflow-Net (Santos et al. 2020b) with regard to the ability to learn on heterogeneous domains and in terms of the computational expense of the model. While applied here to fluid flow, we believe this hierarchical regression paradigm could be applied to many disciplines dealing with 3D volumes, not limited to the problems studied here.
The rest of this manuscript is organized as follows. In Sect. 2, we describe our methods, and in Sect. 3, we describe the data we have applied our methods to. In Sect. 4, we describe the results of training using two different data sources. We show the results on test data comprised of a variety of samples presenting a wide range of heterogeneities at different scales. In Sect. 5, we provide discussion, including comments on the memory-efficiency of the approach, and we conclude in Sect. 6.

Multiscale Neural Network Description
Our end goal is to build a model which can estimate flow characteristics based on the microstructure found in a variety of complex samples, with constant fluid properties and a time-independent driving force. We aim to capture the steady-state fluid flow and associated statistics thereof, such as the permeability, but emphasize that other field quantities at steady-state could be addressed with the same framework. The main requirement for our approach is to have a domain (constituting a 3D binary array) and a matching response (simulation or experiment) across that domain. Additional information would be needed to capture more complex situations such as time-dependent (unsteady) flow. We choose to model this relationship with a system of deep neural networks.
The task of learning the physics of transport in porous media requires a model that can learn complex relationships (like the one between structure and fluid velocity), and that has the capacity to generalize to many possible domains (outside of the ones used for training) for its broad applicability. One approach is to obtain massive amounts of data. This is expensive for multiscale porous media, which need to be modeled at very high resolutions. Moreover, model training time scales with the evaluation speed of the model. The common deep learning approach to building complex, well-generalizing models is to use networks that are (1) very deep, having a large number of layers, or (2) very wide, having a large number of neurons at each layer. Although these strategies typically result in higher accuracy, they always require more computation to train and use the model. The memory cost is proportional not only to the width and depth of the network, but also to the volume of the sample that needs to be analyzed. In practice, this has limited the volume with which 3D data can be evaluated on a single GPU to sizes on the order of 80³ (Santos et al. 2020b).

Fig. 1 Overview of our multiscale network approach. Starting from a 3D CT image of a fractured carbonate (a gray-scale cross-section of the domain is shown in the back, where unconnected microporosity can be observed, and the main fracture of the domain is shown in dashed red lines), we predict the single-phase flow field of the image by combining predictions made over multiple resolutions of the binary input (shown on the left, where only the surface between the pore-space and the solid is shown in gray). Each neural network (depicted as a purple box) scans the image with a different window size; these sizes increase exponentially (depicted with gray outlined cubes in the domain). The different sizes of the neural networks depict higher model capacity. The predictions of the networks are all linked together (via the arrow connections) to provide an approximation of the Navier-Stokes solution (shown on the right, where higher velocity values get brighter colors). All these elements are explained in detail in Sect. 2 and Fig. 3.
One approach to address this limitation is to break large domains into sub-samples, and augment the feature set so that it contains hand-crafted information pertaining to the relative location of each sub-sample (Santos et al. 2020b). This can add information about the local and global boundaries surrounding the subsample. However, a clear limitation of this approach is its applicability for domains containing large-scale heterogeneities. Figure 2 shows the variation of properties as a function of window size for various data considered in this paper, and it is clear that in some cases the REV may be much larger than 80 voxels. If this data is split into sub-domains, the large-scale information about features is limited to the hand-crafted information fed to the model, and the power of machine learning to analyze the correlations in the 3D input space is limited to window sizes that are orders of magnitude smaller than the REV.
Instead of increasing the model size, one can design topologies for a network (architectures) which aim to capture aspects of human cognition or of domain knowledge. For example, self-attention units (Vaswani et al. 2017;Jaegle et al. 2021;Ho et al. 2019;Dosovitskiy et al. 2020) capture a notion of dynamic relative importance among features.
We propose an architecture designed to overcome the hurdles of applying convolutional neural networks to large 3D volumes, the multiscale network (MS-Net), and show that MS-Net is a suitable system to model physics in complex porous materials. The MS-Net is a coupled system of convolutional neural networks that can be trained on entire samples to learn the relationship between pore structure and single-phase flow physics, including large-scale effects. This makes it possible to provide accurate flow field estimations in large domains, including large-scale heterogeneities, without complex feature engineering.
In the following sections, we first provide an overview of how convolutional neural networks work, and explain our proposed model, the MS-Net, a system of single-scale networks that work collectively to construct a prediction for a given sample. We then explain our loss function which couples these networks together. Finally, we explain the coarsening and refining operations used to move inputs and outputs, respectively, between different networks.

Overview of Convolutional Networks
Our model is composed of individual, memory-inexpensive neural networks, which are described in Appendix A. All of the individual networks (one per scale) of our system are identical fully convolutional neural networks (which means that the size of the 3D inputs is not modified along the way). The most important component of a convolutional network is the convolutional layer. This layer contains kernels (or filters) of size (k_size)³ that slide across the input to create feature maps via the convolution operation:

x̂_j = f(W_j * x + b_j),  j = 1, …, F,

where F denotes the total number of kernels (or channels) of that layer, W_j the j-th kernel, * the convolution operation, b_j a bias term, and f a nonlinear activation function. A detailed explanation of these elements is provided in Goodfellow et al. (2016). The elements of these kernels are trainable parameters, and they operate on all the voxels of the domain. These parameters are optimized (or learned) during training. By stacking these convolutional layers, a convolutional network builds a model which is naturally translationally covariant; that is, a shift of the input image volume produces a shift in the output image volume (LeCun et al. 2015). In this work, we use k_size = 3, since it is the most efficient size for current hardware (Ding et al. 2021).
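As a minimal illustration of such a layer (a sketch, not the exact architecture from Appendix A: the channel count and the choice of PReLU activation are placeholders), a 3D convolutional layer with "same" padding preserves the spatial size of the input volume:

```python
import torch
import torch.nn as nn

# F kernels of size k_size^3 slide over the input volume; each output
# channel gets a bias and passes through a nonlinear activation f.
F, k_size = 8, 3
conv = nn.Conv3d(in_channels=1, out_channels=F, kernel_size=k_size,
                 padding=k_size // 2)   # padding keeps the 3D size unchanged
activation = nn.PReLU()

x = torch.rand(1, 1, 32, 32, 32)        # one single-channel 32^3 volume
feature_maps = activation(conv(x))
print(feature_maps.shape)               # torch.Size([1, 8, 32, 32, 32])
```

Because the network is fully convolutional, the same layer can be applied to a volume of any size without modification.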
An important concept in convolutional neural networks is the field of vision. The field of vision (FoV) dictates to what extent parts of the input might affect sections of the output. The FoV of a single convolutional layer with k_size = 3 is 3 voxels. For the case of a fully convolutional neural network with no coarsening (pooling) inside its layers (the input size is equal to the output size), the FoV is given by the following relation:

FoV = L(k_size − 1) + 1,

where L is the number of convolutional layers of the network, and k_size the size of the kernel. For the case of the single scale network architecture used here (see Appendix A for details), the FoV is 11 voxels. This is much smaller than the REV of any of our samples (Fig. 2). It is worth noting that it is not possible to add enough layers to increase the FoV substantially and still train with samples that are 256³ or larger on current GPUs. We now explain MS-Net, which addresses this limitation using a coupled set of convolutional networks.
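The relation above can be checked with a one-line helper (a sketch; the 5-layer count for the single-scale network is inferred from the stated FoV of 11 voxels with k_size = 3):

```python
def field_of_vision(num_layers: int, k_size: int = 3) -> int:
    """FoV of a fully convolutional network with no pooling:
    each layer grows the receptive field by (k_size - 1) voxels."""
    return num_layers * (k_size - 1) + 1

# A single layer with k_size = 3 sees 3 voxels; five such layers see 11.
assert field_of_vision(1) == 3
assert field_of_vision(5) == 11
```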

Hierarchical Network Structure
The MS-Net is a system of interacting convolutional neural networks that operate at different scales. Each neural network takes as input the same domain at a different resolution through the coarsening and refining operations, ℂ and ℝ_m, which will be explained in detail in Sect. 2.4. Each network is responsible for capturing the fluid response at a certain resolution and passing it to the next network. This process is visualized in Fig. 3.
What changes between the individual networks is the number of inputs that they receive. The coarsest network receives only the domain representation at the coarsest scale (Eq. 4), while the subsequent ones receive two: the domain representation at the appropriate scale, and the prediction from the previous scale (Fig. 3 and Eq. 5). As mentioned above, the input's linear size is reduced by a factor of two between every scale. Mathematically, the system of networks can be described as:

ŷ_n = NN_n(x_n), (4)
ŷ_i = NN_i(x_i, ℝ_m(ŷ_{i+1})) + ℝ_m(ŷ_{i+1}),  i = n − 1, …, 0, (5)
ŷ = ŷ_0, (6)

where n indicates the coarsest scale, x_i = ℂ^i(x_0) the domain coarsened i times, NN_i the individual neural networks, and ℂ() and ℝ_m() the coarsening and refinement operations, respectively. In this system of equations, the input is first coarsened as many times as there are networks. The neural network that works with the coarsest input (NN_n) has the largest FoV with respect to the original image, and processes the largest-scale aspects of the problem. The results of this network are used both as a component of the output of the system and are made available to the finer-scale networks, so that finer-scale, more local processing that can resolve details of flow has access to the information processed at larger scales. The process of coarsening an image doubles the field of vision per scale, yielding the following formula for the FoV at scale i in MS-Net:

FoV_i = 2^i (L(k_size − 1) + 1). (7)

As we stated in the previous section, the finest (first) network has a FoV of 11 voxels. With our proposed system, the second network can see with a window of 22 voxels, and so on, with the FoV increasing exponentially with the number of scales. Using these principles, the model is able to analyze large portions of the image that can contain large-scale heterogeneities affecting flow (Fig. 1). The sizes of the windows do not strictly need to be REVs, since the network still processes the entire image at once. Nevertheless, the bigger the FoV, the easier it is to learn the long-range interactions of big heterogeneities affecting flow. Computationally, it would be possible to add enough scales to have FoVs that are close to the entire computational size of the sample (200³-500³). Early experiments with a very large number of scales revealed that this limits the performance of the model when applied to small samples.

Fig. 3 Our model consists of a system of fully convolutional neural networks where the feed-forward pass is done from coarse to fine (left to right). Each scale learns the relationship of solid structure and velocity response at the particular image resolution. The number of scales (n) is set by the user, and all these scales are trained simultaneously. In this figure we show the original (finest) scale, the coarsest (n), and the second coarsest (n − 1). The masked refinement step is explained in Sect. 2.4. X_0 represents the original structure and ŷ the final prediction of the model.
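The coarse-to-fine pass can be sketched as follows. This is a hedged sketch, not the authors' implementation: the per-scale networks, the `refine` operator (a stand-in for the masked refinement ℝ_m), and the additive coupling between scales are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ms_net_forward(x0, networks, refine):
    """Sketch of the coarse-to-fine pass. `networks` is a list
    [NN_0, ..., NN_n] of per-scale models and `refine(y, x)` maps a
    coarse prediction y up to the resolution of domain x (hypothetical
    stand-ins for the components described in the text)."""
    n = len(networks) - 1
    # Build the image pyramid x_0, ..., x_n by repeated 2x average pooling.
    pyramid = [x0]
    for _ in range(n):
        pyramid.append(F.avg_pool3d(pyramid[-1], kernel_size=2))
    # The coarsest network sees only the coarsest domain ...
    y = networks[n](pyramid[n])
    # ... and each finer network receives its scale's domain plus the
    # refined running prediction, which it corrects.
    for i in range(n - 1, -1, -1):
        y_up = refine(y, pyramid[i])
        y = networks[i](torch.cat([pyramid[i], y_up], dim=1)) + y_up
    return y  # finest-scale prediction
```

A dummy run with three scales and single-convolution "networks" confirms that the output has the same spatial size as the input domain.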

Images at Different Scales
Our workflow relies on a multiscale modeling approach. We use the scale number to denote how many times the original image has been coarsened. Hence, scale 0 refers to the original image, scale 1 is the original image after being coarsened one time, and so on. This process is visualized in Fig. 4. There are two main benefits to this approach, which constructs a series of domain representations with varying levels of detail. The first is that coarser networks, which analyze fewer voxels, can be assigned more trainable parameters (larger network widths or more neurons, as depicted in Fig. 1) without incurring memory costs associated with the full sample size. The second is the exponential increase in the FoV rendered by this approach (Eq. 7).
For this particular problem of fluid flow, as a proxy for pore structure, we used the distance transform (also known as the Euclidean distance transform), which labels each void voxel with the distance to the closest solid wall (seen in Fig. 4). We selected this feature because it is very simple and inexpensive to compute, and it provides more information than the binary image alone. The fact that no additional inputs are needed makes the MS-Net less dependent on expert knowledge for feature engineering. Note that we treat only the connected void space; the steady-state flow in an isolated void is zero, so as a cheap preprocessing step we can safely fill with solid nodes any disconnected voids that do not connect to the inlet/outlet system.
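For illustration, the distance-transform feature can be computed in a few lines with SciPy (a sketch on a toy domain; the array layout and pore/solid convention are assumptions):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Binary domain: 1 = pore (void), 0 = solid. The Euclidean distance
# transform labels every pore voxel with its distance to the nearest
# solid voxel, and is zero on the solid itself.
domain = np.ones((8, 8, 8), dtype=np.uint8)
domain[0, :, :] = 0                       # a solid wall at one face
feature = distance_transform_edt(domain)
print(feature[1, 4, 4], feature[4, 4, 4]) # distance grows away from the wall
```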
This distance is related to the velocity field in a non-linear way, which must be learned by the network. Nonetheless, it is possible to visualize how coarser images provide more straightforward information about fluid flow. In Fig. 5, we show input domains at different scales (top row) and corresponding 2D histograms (bottom row) relating the feature value to the magnitude of the velocity. At scale zero, the distance value is not strongly related to the velocity; for a given distance value, the velocity may still range over more than three orders of magnitude. This deviation from the parabolic velocity profile arises due to the complexity of the porous media. At scale 3, the feature and velocity have been coarsened several times, and a clearer relationship between the distance and velocity emerges. It is then the job of the nth neural network to determine how the 3D pattern of features is non-linearly related to the velocity at this scale, and to pass this information on to networks that operate at a finer scale, as shown in Fig. 3.

Coarsening and Refinement Operations
We use the term coarsening ( ℂ ) to describe the operation of reducing the size of an image by grouping a section of neighboring voxels into one voxel (in this case, by averaging). We use the term refinement ( ℝ ) to denote the operation of increasing the computational size of an image, but not the amount of information (this is also known as image upscaling, but we use the term refinement to avoid potential confusion with upscaling in reservoir engineering or other surrogate modeling).
The coarsening and refining operations should have the following properties when applied to data z (i.e., input or output volumes):

⟨ℂ(z)⟩ = ⟨z⟩, (8)
ℂ(ℝ_m(z)) = z, (9)

where the angle brackets ⟨⟩ represent the volumetric average over space, and the masked refinement ℝ_m() projects solutions from a coarse space back into the finer-resolution space while assigning zero predictions to regions that are occupied by the solid. The first equation indicates that coarsening should preserve the average prediction, and the second says that refinement should be a pseudo-inverse for coarsening; that is, if we take an image, refine it, and then subsequently coarsen it, we should arrive back at the original image. Note that the opposite operation, coarsening followed by refinement, cannot be invertible, as the coarser-scale image manifestly contains less information than the fine-scale one.

The coarsening operation is simple. As mentioned in Sect. 2.3, we first coarsen our input domain n times. Coarsening is applied via a simple nearest-neighbor average; every 2³ region of the image is coarsened to a single voxel by averaging. This operation is known as average pooling in image processing.

Fig. 5 Histograms of the normalized distance transform vs. velocity after coarsening steps. As the system is coarsened, the correlation between the distance transform and the velocity becomes more apparent.
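The mean-preservation property of Eq. 8 can be checked directly with average pooling (a sketch using PyTorch's built-in 3D average pooling on a random volume):

```python
import torch
import torch.nn.functional as F

# Coarsening C: every 2^3 block of voxels is averaged into one voxel.
# Since the global mean is the mean of the block means, <C(z)> = <z>.
z = torch.rand(1, 1, 16, 16, 16)
z_coarse = F.avg_pool3d(z, kernel_size=2)
print(z_coarse.shape)                     # torch.Size([1, 1, 8, 8, 8])
assert torch.isclose(z.mean(), z_coarse.mean())
```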
The refinement operation is more subtle. There exists a naive nearest-neighbors refinement algorithm, wherein the voxel value is replicated across each 2³ region in the refined image. However, this presents difficulties for prediction in porous media; namely, that if this operation is used for refinement, flow properties from coarser networks will be predicted on solid voxels, where they are by definition zero (refer to Fig. 13), and the fine-scale networks will be forced to learn how to cancel these predictions exactly. Early experiments with this naive refinement operation confirmed that this behavior is problematic for learning. Instead, we base our refinement operation on a refinement mask derived from the input solid/fluid nodes. This is performed such that, when refined back to the finest scale, the prediction will be exactly zero on the solid nodes and constant on the fluid nodes, while conserving the average. We refer to this masked refinement as ℝ_m. This requires computing refinement masks that re-weight regions in the refined field based on the percentage of solid nodes in each sub-region. Refinement masks for a particular example are visualized in Fig. 6. An example calculation and pseudo-code for this operation are given in Appendix A.3. The masked refinement operation can then simply be computed as naive refinement, followed by multiplication by the mask. Figure 14 demonstrates the difference between the operations by comparing naive refinement with masked refinement.
The masked refinement operation is cheap and parameter-free; nothing needs to be learned by the neural network, unlike, for example, the transposed convolution (Goodfellow et al. 2016). We thus find it an apt way to account for the physical constraints posed by fields defined only within the pore. The masked refinement operation is also the unique refinement operation such that, when applied to the input binary domain, coarsening and refinement are true inverses; the masked refinement operation recovers the original input domain from the coarsened input domain.

Fig. 6 Schematic of the coarsening and masking process. Top: Starting from the original domain, a series of coarsening steps is performed; every 2³ neighboring voxels are averaged to produce one voxel at the following scale. Structural information is lost along the way. Bottom: Masks for each scale, which re-weight a naive refinement operator. These masks have larger weights in regions where the prediction must be redistributed, near the boundaries with solid nodes, and are zero in regions that correspond entirely to solid nodes. For example, if the image at scale 3 is refined using nearest neighbors and multiplied by mask 2, the image at scale 2 would be fully recovered.
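One way to realize such an operator is sketched below (a hedged reconstruction, not the authors' code: here the mask is derived on the fly from the fluid fraction of each 2³ block, so that solids stay exactly zero and coarsening the refined field recovers the coarse prediction wherever fluid is present):

```python
import torch
import torch.nn.functional as F

def masked_refine(y_coarse, fine_binary):
    """Hypothetical sketch of the masked refinement R_m: naive
    nearest-neighbor upsampling followed by a mask that redistributes
    each coarse value over the fluid voxels of its 2^3 block."""
    # Fluid fraction of each 2^3 block (1 = pore, 0 = solid).
    frac = F.avg_pool3d(fine_binary, kernel_size=2)
    frac_up = F.interpolate(frac, scale_factor=2, mode='nearest')
    # Mask is zero on solids; on fluids it rescales by 1/fraction so
    # that block averages (i.e., coarsening) are preserved.
    mask = fine_binary / frac_up.clamp(min=1e-8)
    return F.interpolate(y_coarse, scale_factor=2, mode='nearest') * mask
```

With this construction, `masked_refine` followed by `avg_pool3d` acts as the pseudo-inverse required by Eq. 9 on blocks that contain fluid.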

Loss Function
To train the MS-Net, the weights of the convolutional kernels and the biases of the network (Eq. 14) are optimized to minimize a mean-squared-error loss computed for every sample s at every scale i between the prediction at that scale, ŷ_{i,s}, and the true value coarsened to that scale, y_{i,s}. The full loss L can be broken into contributions for each sample, L_s, and each scale, L_i^s:

L = (1/S) Σ_s L_s   (10)

L_s = Σ_{i=0}^{n−1} L_i^s   (11)

L_i^s = ⟨(ŷ_{i,s} − y_{i,s})²⟩ / σ²_{y_s}   (12)

where n is the total number of scales, S the number of samples, ⟨...⟩ denotes the spatial averaging operation, and σ²_{y_s} is the variance of the true velocity of sample s. Normalizing by this variance weights samples with very different overall velocity scales (related to permeability) more evenly. Since the coarsest scale is implicitly present in the solution at every scale (Eq. 6), the coarsest network is encouraged to output most of the magnitude of the velocity (Fig. 7). This loss function also makes it possible to train with samples of varying structural heterogeneity (and fluid response), since the mean squared error is normalized by the sample variance to obtain a dimensionless quantity that is consistent for every sample. The velocity variance does not scale all the samples to exactly the same distribution, but it is a good first-order approximation that allows training models with very different velocity distributions.

Fig. 7 Velocity prediction per scale and sample loss L_s for the training samples, organized by column. Top row: normalized mean velocity (that is, v̄_i/v̄) at each scale; coarser scales are shown with lighter lines. Note that the average velocity is predicted mostly by the coarsest scale and refined by further scales. Bottom row: corresponding values of the loss function for each sample during training. Although the permeability (k ∝ v̄) of the samples spans orders of magnitude, the network assigns roughly the same importance to each of them during training thanks to the well-normalized loss function.
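In code, the variance-normalized loss of Eqs. (10)-(12) might look like the following NumPy sketch (the lists-of-scales data layout and function names are our assumptions):

```python
import numpy as np

def sample_loss(preds, trues, var_y):
    # L_s: sum over scales of the MSE between the prediction and the
    # coarsened truth, normalized by the sample's velocity variance
    return sum(np.mean((p - t) ** 2) / var_y for p, t in zip(preds, trues))

def total_loss(samples):
    # L: average of the per-sample losses L_s over the S training samples;
    # samples is a list of (preds, trues, var_y) tuples
    return sum(sample_loss(*s) for s in samples) / len(samples)
```

Because every term is divided by the same per-sample variance, a high-permeability and a low-permeability sample contribute losses of comparable magnitude, which is the behavior shown in the bottom row of Fig. 7.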

Data Description: Flow Simulation
To train and test our proposed model, we carried out single-phase simulations using our in-house multi-relaxation-time D3Q19 (three dimensions and 19 discrete velocities) lattice-Boltzmann code (D'Humieres et al. 2002). Our computational domains are periodic in the z-direction, where an external force is applied to drive the fluid forward, simulating a pressure drop. The remaining domain faces are treated as impermeable. The simulation is said to achieve convergence when the coefficient of variation of the velocity field is smaller than 1×10⁻⁶ between 1000 consecutive iterations. We run each simulation on 96 cores at the Texas Advanced Computing Center. The output of the LB solver is the velocity field in the direction of flow (here, the z-direction). To calculate the permeability of our samples, we use the following equation:

k = (v̄ μ / (dP/dl)) (dx/dl)²

where v̄ is the mean of the velocity field in the direction of flow, μ is the viscosity of the fluid, dP/dl is the pressure differential applied to the sample, and dx/dl is the resolution of the sample (in meters per voxel). Although we used the LBM to carry out our simulations, the following workflow is method-agnostic: it only relies on having a voxelized 3D domain with its corresponding voxelized response.
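The permeability computation amounts to one application of Darcy's law. A minimal sketch, with our own variable names, and assuming the squared resolution factor converts the lattice-unit permeability to m² as in the reconstructed equation above:

```python
import numpy as np

def permeability(v_field, mu, dp_dl, dx):
    # v_field: simulated z-velocity over the whole domain (zeros in solids)
    # mu: fluid viscosity; dp_dl: applied pressure gradient (body force)
    # dx: resolution of the sample in meters per voxel
    v_bar = v_field.mean()                # mean over the entire domain
    return v_bar * mu / dp_dl * dx ** 2   # Darcy's law, scaled to m^2
```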

Results
Below we present two computational experiments (porous media and fractures) that were carried out to show how the MS-Net is able to learn from 3D domains with heterogeneities at different scales. In the first subsection, we show to what extent the MS-Net is able to learn from very simple sphere-pack geometries in order to accurately predict flow in a wide range of realistic samples from the Digital Rock Portal (Prodanovic et al.). It is worth noting that simulating the training set took less than one hour per sample and training the model took seven hours, while some samples in the test set took over a day to achieve convergence in the numerical LBM solver. In the second experiment, we show that training on two fracture samples of different aperture sizes and roughness parameters is enough to estimate permeability for a wider family of aperture sizes and roughness values.

Training the MS-Net with Sphere Packs
To explore the ability of the MS-Net to learn the main features of flow through porous media, we utilize a series of five 256³ numerically dilated sphere packs (starting from the original sphere pack imaged by Finney and Prodanovic (2016)) to train the network (Fig. 15). The porosity of the training samples ranges from 10 to 29%, and their permeability from 1 to 37 Darcy. For reference, the individual simulation times to create the training samples range from 10 to under 50 minutes to achieve convergence.
Our model consists of four scales, using 2^(2n+1) filters at scale n (2 in the finest network and 128 in the coarsest). During training, each sample is passed through the model (as shown in Fig. 3), and the model then optimizes its parameters (the numbers in the 3D filters and the biases) to obtain a functional relationship between the 3D image and the velocity field by minimizing the loss function (Eq. 12). The MS-Net can be trained end-to-end since all included operations (including the coarsening and masked refinement) are differentiable with respect to the model's weights.
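One plausible minimal sketch of the multiscale forward pass, coarsening the input, predicting at the coarsest scale, then repeatedly applying masked refinement and adding each finer network's masked correction, is shown below. The `nets` callables stand in for the per-scale ConvNets, and the exact wiring is our illustrative assumption rather than the authors' implementation:

```python
import numpy as np

def coarsen(x):
    # average every 2x2x2 block
    s = x.shape[0] // 2
    return x.reshape(s, 2, s, 2, s, 2).mean(axis=(1, 3, 5))

def up(x):
    # nearest-neighbors refinement
    return x.repeat(2, 0).repeat(2, 1).repeat(2, 2)

def refine_masked(coarse, fluid):
    # masked refinement: zero on solids, conserves the block average
    frac = up(coarsen(fluid))
    mask = np.divide(fluid, frac, out=np.zeros_like(fluid), where=frac > 0)
    return up(coarse) * mask

def ms_forward(domain, nets):
    # domain: binary pore image (1 = fluid); nets[0] is the finest network,
    # nets[-1] the coarsest (hypothetical callables)
    scales = [domain.astype(float)]
    for _ in nets[1:]:
        scales.append(coarsen(scales[-1]))
    y = nets[-1](scales[-1])                  # coarsest prediction
    for i in range(len(nets) - 2, -1, -1):    # walk back toward the finest scale
        y = refine_masked(y, scales[i]) + nets[i](scales[i]) * (scales[i] > 0)
    return y
```

By construction the final prediction is exactly zero on solid voxels, and the coarsest network's output is implicitly present at every finer scale, consistent with the behavior described around Eq. 6 and Fig. 7.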
In short, we are looking to obtain a relation of the form v_z = f(X), with the velocity v_z a function of the binary input domain X. The network was trained for 2500 epochs, with a batch size of all five samples (i.e., full-batch training). This process took approximately seven hours. To highlight the order of magnitude of the loss function during training, the first 1000 epochs are shown in Fig. 7: the magnitude of the loss for each sample is similar, despite the very different velocity scales associated with each sample. We also tried augmenting the training dataset with 90-degree rotations about the flow axis, but observed no benefit. The loss value per sample and the mean velocity per scale of the network are plotted in Fig. 7. As seen in the top row of the figure, the coarsest scale is responsible for most of the velocity magnitude, while finer-scale networks make comparatively small adjustments. The bottom row shows the loss for each training sample; the normalization of our loss (Eq. 12) puts the loss for each sample on a similar scale, despite the considerable variation in porosity and permeability between samples.
Once trained, the network produces very accurate predictions on the training samples. Figure 8a shows predicted and true velocity fields, and the relative error, for a cross section of one of the training examples. Figure 9a shows a histogram of the true vs. predicted velocity across all pore-space voxels, along with a plot of the relative change in fluid flux across x-y slices of the sample. However, performance on training data does not necessarily translate to useful performance on the vast variety of real-world microstructures, so we assess the trained MS-Net on many test microstructures compiled from various sources. Figure 8, rows b through e, and Fig. 9, rows b through e, show velocity field performance on various samples from the test sets, which we examine in the following sections.

Fontainebleau Sandstone Test Set
Training on the sphere-pack dataset reveals that the model can learn the main factors contributing to flow through permeable media. To assess the performance of the trained MS-Net, we used Fontainebleau sandstones (Berg 2016) at different computational sizes (256³ and 480³). Cross sections of these structures are visualized in the right panel of Fig. 15. Overall results are presented in Table 1, and flow field plots for a single sample are presented in Figs. 8b and 9b. The relative percent error of the permeability is calculated as e_r = |1 − k_pred/k|. The typical permeability error is approximately within 10%. One remarkable fact is that the model retains approximately the same accuracy on 480³ samples as on 256³ samples.
It is worth noting that the simulation of the sample with a porosity of 9.8% took 13 hours to converge; this single sample required more computational effort than constructing the entire training dataset and training the model combined.

Additional Test Predictions on more Heterogeneous Domains
To assess the ability of the model trained with sphere packs to predict flow in more heterogeneous materials, we tested samples of different computational sizes and complexities. We split the data into three groups according to their type:

• Group I: Artificially created samples: a sphere pack with an additional grain dilation (which lowered the porosity) relative to the lowest-porosity training sample; a vuggy core created by removing 10% of the matrix grains from the original sphere pack (Finney and Prodanovic 2016) to open up pore space and create disconnected vugs (Khan et al. 2019, 2020), after which the grains were numerically dilated twice to simulate cement growth; a sample made of spheres of different sizes, in which the porosity starts at 35% at the inlet and decreases to 2% at the outlet; and this last sample mirrored in the direction of flow.
• Group II: Realistic samples: Bentheimer sandstone (Muljadi 2015), an artificial multiscale sample (microsand) (Mohammadmoradi 2017), and Castlegate sandstone (Mohammadmoradi 2017). The sizes were selected such that each constitutes an REV.
• Group III: Fractured domains: a segmented micro-CT image of a fractured carbonate (Prodanovic and Bryant 2009), a layered bidisperse packing recreating a propped fracture (Prodanovic et al. 2010), a sphere pack in which the spheres intersecting a plane in the middle of the sample were shifted to create a preferential conduit, and this same structure rotated 90 degrees with respect to the direction of flow (so that the fracture lies in the middle plane of the flow axis and deters flow) (Prodanovic et al. 2010).
The lowest-porosity sample (porosity of 7.5%) took 26 hours running on 100 cores to achieve convergence. Besides an accurate permeability estimate, another measure of precision is the loss function value at the finest scale (from Eq. 12). The two are related, but not simple transformations of each other: the loss provides a volumetric average of the flow field error. We normalize this value by the sample's porosity to obtain a comparable quantity, which turns out to be roughly the same for all samples. Visualizations of some of these samples are shown in Fig. 16, and the prediction results are shown in Table 2. Table 2 reveals remarkable performance across the variety of geometries considered. Samples from all groups are predicted very well, with permeability errors for the most part within about 25% of the true value, across samples spanning three orders of magnitude in permeability. Flow field errors for one sample from each group are shown in Figs. 8c-e and 9c-e; each row of Fig. 9 corresponds to the same samples as shown in Fig. 8. The percentage change of flux shows how much the fluid flux varies from slice to slice across the sample; in our results, we see changes of up to 10%. Two notable failure cases emerged. In the first, the Castlegate sandstone, the flow field prediction is still somewhat reasonable, as visualized in Figs. 8d and 9d. The largest failure case (highlighted in bold in Table 2) is the fractured sphere pack with a fracture parallel to the fluid flow. In this case, the model is unable to provide an accurate flow field because a large preferential channel (such as this synthetic fracture) imposes flow behavior very different from that of the training data (sphere packs); as a result, the predicted permeability is off by a factor of 5. This sample likewise has the highest loss value.
However, since no example of a similar nature is present in the training set, we investigate in the following section the model's ability to predict flow through parallel fractures when parallel fractures are included during training. We additionally report the ratio between the loss (Eq. 12) and the porosity, which is another measure of accuracy over the active cells of the prediction. In bold: the worst-performing sample using the model trained on sphere packs.

Training the MS-Net with Fractures
Since the MS-Net is able to see the entire domain at each iteration, we carried out an additional numerical experiment with domains hosting single rough fractures from a study of the relationship between roughness and multiphase flow properties (Santos and Jo 2019; Santos et al. 2018a). Alqahtani et al. (2021) identified flow through fractures as an outstanding problem in deep learning of porous materials.

Table 3 Results of training and testing on different fractures. The first column indicates whether the sample was part of the training set, followed by the mean aperture Ap, fractal exponent D_f, and permeability k. In bold, the worst-performing sample of Table 2.

The domains were created synthetically using the model proposed by Ogilvie et al. (2006), in which the roughness of the fracture surfaces is controlled by a fractal dimension D_f. Cross sections of the domains and their histograms are shown in Fig. 10. We utilize two sets of five fractures with increasing roughness D_f. Within each group, every fracture has the same mean aperture Ap, and therefore the porosity of every sample in a group is identical. The total computational size of these domains is 256×256×60. We trained our model using two of these synthetic fractures, one from each group, and tested it on the other eight. The results are shown in Table 3.
We selected two samples with different mean apertures and roughness exponents so that the network might learn how these factors affect flow. From the results of Table 3, we conclude that the network is able to extrapolate accurately and construct solutions of the flow field for a wide variety of fractal exponents. The training fractures have permeabilities of 224 and 301 Darcy, whereas accurate test results are found ranging from 91 to 953 Darcy. This gives strong evidence that the model is able to distill the underlying geometrical factors affecting flow, even when the coarse-scale view of the pore is quite similar.
We contrast the machine learning approach with classical methods such as the cubic law (Tokan-Lawal et al. 2015). For these synthetically created fractures, the cubic law returns a permeability value that depends only on the aperture size, whereas the LBM data reveal that roughness can influence the permeability by a factor of 3. A number of papers have attempted to modify the cubic law for more accurate permeability prediction; however, there is evidence that such predictions can be off by six orders of magnitude (Tokan-Lawal et al. 2015). There are also other approaches in line with our hypothesis that a fracture aperture field should be analyzed with local moving windows (Oron and Berkowitz 1998). This model could be used to estimate the petrophysical properties of fractures more accurately in hydraulic fracturing workflows (Xue et al. 2019; Park et al. 2021).

Discussion
We have shown the MS-Net performing inference on volumes up to 512³, a size chosen so that the reference LBM solutions could be obtained in a reasonable time frame. The MS-Net can be scaled to larger systems on a single GPU: Table 4 reports the largest system size for which a forward pass of the finest-scale network succeeded on various recent GPU architectures, without modifying our workflow. Additional strategies, such as distributing the computation across multiple GPUs or pruning the trained model (Tanaka 2020; Li et al. 2017), might push this scheme to even larger computational domains. For all architectures tested, the prediction time was on the order of one second, whereas an LBM simulation on a tight material may take several days to converge, even when running on hundreds of cores.
The number of scales used could be varied. For our experiments, we chose to train a model with four scales, since we did not see an increase in accuracy with more scales; this parameter could be explored in further research. The field of view (FoV) of the coarsest network is 88 voxels wide, and the model operates on the entire domain simultaneously rather than on subdomains. For comparison, the FoV of the PoreFlow-Net (Santos et al. 2020b) is 20 voxels, and it operated on subdomains of size 80³ due to memory limitations. We utilize 2^(2n+1) filters per scale (2 in the finest network and 128 in the coarsest). We also believe that our work could be utilized for compressing simulation data, since, as shown in Fig. 7, a single model is able to learn several simulations with a high degree of fidelity, making them more easily portable (this is also called representation learning in deep learning). A training example is on the order of 500 MB of data in single-precision floating point, whereas the trained model is approximately 25 kB. Thus, when trained on a single 512³ example, the neural network encodes the solution using approximately 2×10⁻⁴ bytes per voxel, vastly more compact than any floating-point representation. One would also need to keep the input domain to recover the solution, but this is itself a binary array that compresses far more readily than the fluid flow field. For example, applying standard compression methods to the binary input array of the Castlegate sandstone fit it into 2.4 MB of memory.

Conclusions and Future Work
It is well established that semi-analytical formulas or correlations derived from experiments can fail to predict permeability by orders of magnitude. Porosity values alone can be misleading because they do not account for how particular structures affect flow in a given direction, nor for the presence of heterogeneities. However, going beyond simple approximations is often expensive. We have presented MS-Net, a multiscale convolutional network approach, in order to better utilize imaging technologies for identifying pore structures and associating them with flow fields. When training on sphere packs and fractures, MS-Net learns complex relationships efficiently, and a trained model can make new predictions in seconds, whereas new LBM computations can take hours to days to evaluate.
We believe that it would be possible to train the MS-Net using more and more diverse data to create a predictive model that could be able to generalize to more domains simultaneously (for example, unsaturated soils, membranes, mudrocks, and kerogen). This could be done using active learning principles, carrying out simulations where the model has a low degree of confidence in its prediction, such as in Santos et al. (2020a), to examine a vast variety of possible training samples and only compute physics-based flow simulations on a targeted set.
The MS-Net architecture is an efficient way of training with large 3D arrays compared to standard neural network architectures from computer vision. Although this model is shown to return accurate predictions, local errors can still be high. A plausible remedy is to use the model prediction as the initial input for a full-physics simulation, as shown in Wang et al. (2021b). This workflow can be especially efficient if the simulator can take advantage of GPU computing (McClure et al. 2014, 2021). On the other hand, there are desirable physical properties that might be realized by a future variant, such as mass or momentum continuity. One avenue of future work could be to design conservation modules for the MS-Net using hard constraints for ConvNets (Mohan et al. 2020). An important hurdle in applying these techniques to porous media problems is that the bounded domains make the implementation more challenging.
Another important area of future work would be to address data from different scientific domains. This includes similar endeavors such as steady-state multiphase flow, wave propagation through a solid matrix, and component transport in porous media. The model could also be applied to other 3D problems, such as seismic inversion (Cho et al. 2018; Cho and Jun 2021), astronomical flows (van Oort et al. 2019), or the flow of blood through highly branched vessel structures.
Lastly, we believe that an important endeavor is to create more realistic domains, with multiscale features such as fractal statistics. One avenue to pursue such methods is the Generative Adversarial Network (GAN), another ML technique in which a generator model learns to create new data by fooling a discriminator model (the adversary) trained to distinguish real data from the generator's outputs. The multiscale technique has been applied to many real-world datasets, such as human faces, but has not, to our knowledge, been used to construct synthetic porous media.

Appendix A Single neural network description
Each network of our system (one per scale) is a fully convolutional network (meaning that the spatial dimensions of the 3D inputs are not modified along the way). Each is composed of stacks of the following layers:

• 3D convolution with a 3³ kernel: this layer contains kernels (or filters) of size 3³ that are slid across the input to create feature maps via the convolution operation:

y_f = W_f * x + b_f,  f = 1, ..., F

where F denotes the number of kernels (or output channels) of the layer, * is the convolution operation, and b_f is a bias term.
• Instance normalization (Ulyanov et al. 2016): this layer normalizes its inputs to have a mean of zero and a standard deviation of one, which facilitates training a model with samples that exhibit strong velocity contrasts (different orders of magnitude). This is done for every sample using its individual statistics, with no trainable parameters:
x̂ = (x − x̄) / √(σ² + ε)

where x̄ is the sample mean, σ its standard deviation, and ε a small constant to avoid division by zero. This layer allows better flow of information (by constraining the mean and standard deviation of the outputs) and reduces the risk of training divergence.
• Continuously Differentiable Exponential Linear Unit (CeLU) (Barron 2017): this layer helps build nonlinear relationships (like the one between pore structure and velocity field shown in Fig. 5). All data passing through this layer is transformed using the following equation:

CeLU(x) = max(0, x) + min(0, α(exp(x/α) − 1))   (14)

where α is set to 2. We utilize this function because it speeds up and improves training by virtue of not having vanishing gradients and by having mean activations near zero. It is also able to output negative values (unlike the ReLU); the outputs of this layer are bounded between minus two and infinity.

The last layer of each network is a 3D convolution with a 1³ kernel. The 1³ kernel acts as a linear regressor, which reduces the dimensionality of the output to one single 3D image (in our case, the velocity field). It weights all the feature maps from the previous layer (Fig. 11) and outputs a single 3D matrix (the velocity at that particular scale). The three components of the velocity (v_x, v_y, and v_z) could be predicted by giving this layer three filters.
The fourth block of our network does not include an activation function, because we would like to give the network the expressive power to output negative velocities. Every convolutional layer includes a bias term. The full system is visualized explicitly in Fig. 11.

Normalization of the Data and Initialization of the Network Parameters
We start the training workflow by coarsening the initial inputs n times (depending on the number of scales desired, Sect. 2.3). Then, we center the velocity of the entire training set to be near one by dividing the LB simulation results by a constant (this procedure is shown in Fig. 12). This has the advantage of not requiring the summary statistics of the training set to be computed and stored (as opposed to default normalization approaches), and it preserves the solid values as exactly zero.
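A sketch of this normalization step. The specific choice of the constant (the mean of the per-sample pore velocities) is our assumption; the text only states that a single constant brings the mean velocity near one while leaving solids at zero:

```python
import numpy as np

def normalize_velocities(fields):
    # fields: list of simulated velocity arrays (zeros on solid voxels)
    # one global constant for the whole training set; solids stay zero
    c = np.mean([f[f != 0].mean() for f in fields])
    return [f / c for f in fields], c
```

Because every field is divided by the same constant, the relative velocity contrasts between samples (and therefore their permeability ratios) are preserved.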
We have also observed that if we scale the weights of the last layer of the coarsest network so that its initial outputs are close to one (the mean velocity of our normalized data), training speeds up by several hours, since the initial prediction is a closer approximation to the solution than under the default initialization scheme (He et al. 2015).

Coarsening (Pooling) Operation
The coarsening (or pooling) operation is defined as

𝒞(x) = (1/2^d) (J *₂ x)

where J is a 2^d array of all ones, d is the number of dimensions of the problem, and *₂ denotes the convolution operation with a stride of 2.
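The stride-2 all-ones convolution is equivalent to a plain 2³ block average, and can be written as repeated pairwise averaging along each axis. The following sketch (our implementation, not the paper's) makes the equivalence explicit:

```python
import numpy as np

def coarsen_conv(x, d=3):
    # (1 / 2^d) * (J *_2 x): stride-2 convolution with an all-ones 2^d
    # kernel, realized as a pairwise average along each of the d axes
    for ax in range(d):
        x = np.swapaxes(x, 0, ax)
        x = (x[0::2] + x[1::2]) / 2.0
        x = np.swapaxes(x, 0, ax)
    return x
```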

Mask Calculation
We use the following Python code to generate the masks of each sample at the beginning of the training cycle.

Fig. 13 Schematic of the coarsening and refining process. As an example, starting from a 2×2-pixel region of a 2D image in which 3 of the 4 pixels are solid, coarsening averages these pixels into one value. When this coarsened image is refined using the masked approach, the only void pixel receives the nearest-neighbors value multiplied by 4.
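The original Python listing did not survive transcription; the following is a hedged reconstruction based on the description in this appendix (names and structure are ours):

```python
import numpy as np

def compute_masks(domain, n_scales):
    # domain: binary array at the finest scale, 1 = fluid, 0 = solid
    up = lambda a: a.repeat(2, 0).repeat(2, 1).repeat(2, 2)

    def coarsen(a):
        s = a.shape[0] // 2
        return a.reshape(s, 2, s, 2, s, 2).mean(axis=(1, 3, 5))

    masks, fluid = [], domain.astype(float)
    for _ in range(n_scales - 1):
        frac = up(coarsen(fluid))  # fluid fraction of each parent voxel
        # inverse fluid fraction on pore voxels, zero on solids
        masks.append(np.divide(fluid, frac,
                               out=np.zeros_like(fluid), where=frac > 0))
        fluid = coarsen(fluid)
    return masks  # masks[0] re-weights refinement from scale 1 to scale 0
```

In the 3D analog of the figure's example, a 2³ block with a single fluid voxel yields a mask value of 8 at that voxel, so the coarse average is redistributed entirely into the pore.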

End-to-End Model Training
1. Initialize the MS-Net: in this paper we used a model with n = 4 scales (four networks, as described in Appendix A). Each of these networks has 2^(2s+1) convolutional kernels in each of its 5 layers; the network that works with the finest scale has 2 kernels, while the one that receives the coarsest input has 128.

Fig. 15 The five binary samples per set, superimposed for visualization purposes. Since the geometries are binary (zeros in the void space and ones in the solids), it is possible to sum all five to obtain a new array (im = sand_1 + sand_2 + sand_3 + sand_4 + sand_5). The highs of the color bar correspond to solids present in every domain, while the lows are sections present only in the lower-porosity samples.