BioFVM-X: An MPI+OpenMP 3-D Simulator for Biological Systems

Multi-scale simulations require parallelization to address large-scale problems, such as real-sized tumor simulations. BioFVM is a software package that solves diffusive transport Partial Differential Equations for 3-D biological simulations, successfully applied to tissue and cancer biology problems. Currently, BioFVM is only shared-memory parallelized using OpenMP, greatly limiting the execution of large-scale jobs on HPC clusters. We present BioFVM-X: an enhanced version of BioFVM capable of running on multiple nodes. BioFVM-X uses MPI+OpenMP to parallelize the generic core kernels of BioFVM and shows promising scalability in large 3-D problems with several hundred diffusible substrates and ≈0.5 billion voxels. The BioFVM-X source code, examples and documentation are available under the BSD 3-Clause license at https://gitlab.bsc.es/gsaxena/biofvmx.


Introduction
Advances in understanding complex biological systems such as tumors require multi-scale simulations that integrate intracellular processes, cellular dynamics, and their interaction with the environment. Computational biologists use a wide range of approaches to simulate how single cells affect multi-cellular systems' dynamics [17,24]. Nevertheless, large-scale multi-scale modeling still needs tools to simulate the environment accurately and efficiently.
BioFVM [8] is a Finite Volume Method (FVM) [20] based simulation software for solving the Partial Differential Equations (PDEs) [29] that model complex processes like the uptake, release and diffusion of substrates in multi-cellular systems such as tissues, tumors or microbial communities. Apart from being a self-contained callable library that can be used to implement and simulate biological models, BioFVM forms the core component of PhysiCell [9], a flexible, lattice-free, agent-based multi-cellular framework capable of simulating cell mechanics, such as cell movement, cell-cell interaction and different cell phenotypes, as well as the micro-environment consisting of diffusing substrates, signaling factors, drugs, etc. BioFVM is capable of handling multiple substrates and can simulate chemical and biological processes using both cell and bulk sources. The following diffusive PDE on a computational domain Ω (with boundary ∂Ω) is solved for a substrate density vector ρ:

∂ρ/∂t = ∇ · (D ∘ ∇ρ) − λ ∘ ρ + f in Ω,    (1)

with the boundary condition (D ∘ ∇ρ) · n = 0 on ∂Ω and the initial condition ρ(x, t₀) = g in Ω. In (1) above, D is the matrix of (constant) diffusion coefficients, λ is the decay rate, f is the net source term and ∘ is the termwise product of vectors [8]. Without loss of generality, the substrate density ρ can represent any kind of molecule, such as a nutrient, a by-product, a signaling molecule or a drug. As a consequence, modeling complex environments requires simulating many densities, posing a challenging scaling problem. Simulating the environment requires the numerical solution of the linear systems obtained from a Finite Volume discretization of the PDE given by Eq. (1), which BioFVM solves using the Thomas algorithm [31], a fast, direct solver for tridiagonal systems. BioFVM's biggest scalability limitation is that it cannot execute on multiple nodes of an HPC cluster to solve a single, coherent problem, and thus the problem must fit into the memory of a single node. We present BioFVM-X: an enhanced distributed version that uses MPI (Message-Passing Interface [21]) to parallelize the core kernels of BioFVM, enabling the solution of very large problems that were not previously solvable using the shared-memory-only version. This contribution represents the first and most critical step on the road to a distributed implementation of PhysiCell.
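To make the role of the Thomas algorithm concrete, the sketch below shows the generic form of a one-dimensional implicit (backward Euler) diffusion-decay step for a single substrate; this is an illustrative simplification, not BioFVM's exact coefficient layout.

```latex
% Illustrative 1-D implicit diffusion-decay step for one substrate along an
% x-line of voxels (simplified; see [8] for BioFVM's exact scheme).
% With r = D \Delta t / (\Delta x)^2, each voxel i contributes one row:
-r\,\rho_{i-1}^{n+1} + \left(1 + 2r + \lambda \Delta t\right)\rho_i^{n+1} - r\,\rho_{i+1}^{n+1} = \rho_i^{n}
% Every line of voxels therefore yields an independent tridiagonal system,
% which the Thomas algorithm solves in O(n) operations.
```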

Related Work
Different agent-based approaches have been proposed to model and simulate multi-cellular systems, including on-lattice cellular automata, the Cellular Potts model [10] and overlapping spheres, among others [23]. BioFVM [8,9] was created with the goal of achieving simplicity of usage, flexibility in expressing cell models, and optimizing execution speed while minimising dependencies on external libraries, but it is only shared-memory parallelized using OpenMP [22].
For realistic, complex simulations, the need is to simulate billions of cells and dynamic, complex 3-D environments, which is only achievable through optimal, full-scale utilization of parallel systems [12,14]. Biocellion [14] is a flexible, discrete agent-based simulation framework that uses MPI for inter-node communication, along with other dependencies such as PNNL Global Arrays [25], CHOMBO [3], Intel TBB [11] and an iterative Multigrid solver [2,32]. Nevertheless, Biocellion has fixed routines to describe system behaviors, is dependent on external libraries and is closed source, which might deter potential users. Chaste is an open-source, general-purpose simulation package for modeling soft tissues and discrete cell populations [18] that can be used with MPI through PETSc [1], but it too suffers from multiple dependencies. Timothy [4,5] is another open-source, MPI-based tool, but with several dependencies, such as Zoltan [6], Hypre [7] and SPRNG [19].

Internal Design and Domain Partitioning
The simplicity, flexibility, minimal dependence on external libraries, execution speed and openness of BioFVM make it an ideal candidate for distributed parallelization. In BioFVM, the 3-D simulation domain is divided into voxels (volumetric pixels). The principal classes depicting the internal architecture of BioFVM, and their relationships, are shown in Fig. 1.
The top-level biological entities, along with their related classes (see Fig. 1), are: (1) the biological environment (Microenvironment and Microenvironment Options), (2) the physical domain, represented as a 2-D/3-D mesh (General Mesh, Cartesian Mesh and Voxel), and (3) the cells (Basic Agent and Agent Container). The data members of some classes are either objects or pointers of another class type (see dashed arrows in Fig. 1). The Microenvironment class sets the micro-environment name and the diffusion/decay rates of substrates, defines constants for the Thomas algorithm, contains an object of Cartesian Mesh and a pointer to the Agent Container class, and performs I/O. A group of resizing functions that determine the global/local voxels are members of the Cartesian Mesh class. The Microenvironment Options class helps set oxygen as the first default substrate and the default dimensions of the domain/voxel. The Cartesian Mesh class is publicly derived from General Mesh (thick arrow in Fig. 1). The Basic Agent class forms an abstraction of a cell. An object of the Basic Agent class can act either as a source or as a sink of substrates. Each agent has a unique ID, a type, and maintains the local/global index of its current voxel.
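The relationships in Fig. 1 can be summarized in a schematic C++ sketch; the member names below are abridged and approximate, not BioFVM's exact declarations.

```cpp
// Schematic sketch of the Fig. 1 class relationships. Member names are
// abridged/approximate; they do not reproduce BioFVM's exact declarations.
#include <string>
#include <vector>

class Voxel {
public:
    int mesh_index;        // local (and, in BioFVM-X, global) voxel index
    double volume;
    double center[3];
};

class General_Mesh {
public:
    std::vector<Voxel> voxels;
};

class Cartesian_Mesh : public General_Mesh {   // public inheritance (thick arrow)
public:
    void resize(int x_nodes, int y_nodes, int z_nodes) { /* resizing functions */ }
};

class Basic_Agent {                            // abstraction of a cell
public:
    int ID;
    int type;
    int current_voxel_index;                   // local/global index of its voxel
};

class Agent_Container {
public:
    std::vector<Basic_Agent*> agents;
};

class Microenvironment {                       // dashed arrows in Fig. 1
public:
    std::string name;
    std::vector<double> diffusion_coefficients;
    std::vector<double> decay_rates;
    Cartesian_Mesh mesh;                       // object data member
    Agent_Container* agent_container;          // pointer data member
};
```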
We initialize MPI with the MPI_THREAD_FUNNELED thread-support level and, after domain partitioning [27,28], assign the sub-domains to individual MPI processes. Our implementation currently supports only a 1-D x-decomposition (see Appendix A). The randomly generated positions of basic agents are mapped to their respective processes (see Appendix B), after which the agents are created individually and in parallel on the MPI processes. Each MPI process initializes an object of the Microenvironment class and maintains the local and global number of voxels, the local (mesh index) and global (global mesh index) voxel indices, and the global coordinates of each local voxel's center. A 1-D x-decomposition permits us to employ the optimal serial Thomas algorithm [30,31] in the undivided y and z dimensions. This enables all threads within a node to simultaneously act on elements belonging to different linear systems.
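A minimal sketch of this initialization, assuming a 1-D MPI Cartesian topology along x (variable names are illustrative and do not mirror BioFVM-X's exact code):

```cpp
// Minimal sketch: funneled MPI init + 1-D x-decomposition of the voxel grid.
// Illustrative only; variable names do not mirror BioFVM-X's exact code.
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[])
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int P, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // 1-D Cartesian topology: all processes laid out along the x-dimension.
    int dims[1] = { P }, periods[1] = { 0 }, coords[1];
    MPI_Comm cart_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart_comm);
    MPI_Cart_coords(cart_comm, rank, 1, coords);

    // Assume g_x_nodes is divisible by P (as in the experiments).
    const int g_x_nodes = 1000;                        // global voxels in x
    const int l_x_nodes = g_x_nodes / P;               // local voxels in x
    const int first_x   = coords[0] * l_x_nodes;       // first global x-index

    std::printf("rank %d owns global x-indices [%d, %d)\n",
                rank, first_x, first_x + l_x_nodes);

    MPI_Finalize();
    return 0;
}
```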
The Thomas algorithm solves a tridiagonal system of linear equations in serial and consists of two steps: a Forward Elimination (FE) step followed by a Backward Substitution (BS) step. Unfortunately, both steps involve serial, dependent operations; the solver is therefore inherently sequential and cannot be trivially parallelized. Although we decompose data in the x-direction, the solver still runs serially in that direction, i.e. MPI process rank i must finish the FE step before it can begin on MPI process rank i+1. Thus, the performance of this multi-node but serial Thomas solver is expected to be worse than that of a single-node Thomas solver, due to the overhead of communication. The performance penalty is smallest in the x-direction, where the data is contiguous in memory, as compared to the y and z directions, where the data in the voxels' vector is non-contiguous. We therefore decompose data only in the x-direction and avoid decomposition in the other directions. We expect to replace this non-optimized implementation with an MPI+OpenMP version of the modified Thomas algorithm [15] in future versions.
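For reference, a self-contained serial Thomas solver is sketched below; BioFVM additionally pre-factorizes the constant coefficients once and reuses them across all voxel-lines, which this sketch omits.

```cpp
// Serial Thomas algorithm for a tridiagonal system A*x = d, where A has
// sub-diagonal a[i], diagonal b[i] and super-diagonal c[i] (a[0] and c[n-1]
// are unused). Sketch only; BioFVM pre-factorizes the constant coefficients.
#include <vector>
#include <cstdio>

void thomas_solve(const std::vector<double>& a, std::vector<double> b,
                  const std::vector<double>& c, std::vector<double>& d)
{
    const int n = static_cast<int>(d.size());
    // Forward Elimination (FE): each row depends on the previous one.
    for (int i = 1; i < n; ++i) {
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    // Backward Substitution (BS): also strictly sequential.
    d[n - 1] /= b[n - 1];
    for (int i = n - 2; i >= 0; --i)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    // On exit, d holds the solution x.
}

int main()
{
    // Small 4x4 diffusion-like example system.
    std::vector<double> a = {0, -1, -1, -1}, b = {3, 3, 3, 3},
                        c = {-1, -1, -1, 0}, d = {1, 2, 2, 1};
    thomas_solve(a, b, c, d);
    for (double x : d) std::printf("%g ", x);
    std::printf("\n");
    return 0;
}
```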

Experiments
We used the MareNostrum 4 (MN4) supercomputer at the Barcelona Supercomputing Center (BSC) for all our experiments. Each node has two 24-core Intel Xeon Platinum 8160 processors and a total memory of 96 GB. BioFVM-X only requires a C++ compiler and an MPI implementation for compilation. We used GCC 8.1 and OpenMPI 3.1.1 running atop the SUSE Linux Enterprise Server 12 SP2 OS. The parallel file system is the IBM General Parallel File System, and the compute nodes are interconnected with Intel Omni-Path technology with a bandwidth of 100 Gbit/s. We pinned the threads to individual cores and bound each MPI process to a single processor (socket). We set the OpenMP environment variables OMP_PROC_BIND=spread and OMP_PLACES=threads [26], and used the --map-by ppr:1:socket:pe=24 notation to allocate resources (see https://gitlab.bsc.es/gsaxena/biofvmx).
We used a cubic physical domain and cubic voxels for all our tests. Our implementation assumes that the total number of voxels in BioFVM's x-direction is evenly divisible by the total number of MPI processes. The example we used to demonstrate the benefits of hybrid parallelism is tutorial1 in the BioFVM/examples directory. This example: (1) initializes and resizes the micro-environment (μ-environment, MC kernel); (2) creates a Gaussian profile (GPG kernel) of the substrate concentration; (3) writes the initial and final concentrations to a .mat file (I/O kernel); (4) creates Basic Agents (sources and sinks, BAG kernel); and (5) simulates sources/sinks and diffusion (Solver kernel).
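In code, the skeleton of this workflow looks roughly as follows; this is a sketch based on the public BioFVM API, and the exact call names and signatures should be checked against the tutorial1 source.

```cpp
// Rough skeleton of the tutorial1 workflow (MC, GPG, I/O, BAG, Solver kernels).
// Based on the public BioFVM API; check exact names/signatures against
// BioFVM/examples/tutorial1.
#include "BioFVM.h"
using namespace BioFVM;

int main()
{
    // (1) MC kernel: initialize and resize the micro-environment.
    Microenvironment M;
    M.set_density(0, "oxygen", "mmHg");
    M.resize_space(0, 1000, 0, 1000, 0, 1000, 10, 10, 10); // cubic 10^3 voxels

    // (2) GPG kernel: seed a Gaussian profile of the substrate concentration.
    for (unsigned int i = 0; i < M.number_of_voxels(); ++i) {
        // ... set M.density_vector(i)[0] from a Gaussian of the voxel center
    }

    // (3) I/O kernel: write the initial concentrations to a .mat file.
    M.write_to_matlab("initial_concentrations.mat");

    // (4) BAG kernel: create Basic Agents acting as sources and sinks.
    Basic_Agent* agent = create_basic_agent();
    // ... register the agent with M, position it, set secretion/uptake rates

    // (5) Solver kernel: simulate sources/sinks and diffusion over time.
    double t = 0.0, t_end = 1.0, dt = 0.01;
    while (t < t_end) {
        M.simulate_cell_sources_and_sinks(dt);
        M.simulate_diffusion_decay(dt);
        t += dt;
    }
    M.write_to_matlab("final_concentrations.mat");
    return 0;
}
```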
Figure 2 presents timing results for the MC, GPG, BAG, I/O and Solver kernels on physical domains of sizes 1000³, 1920³ and 3840³. Cubic voxels had a volume of 10³, with 5 × 10² sources and 5 × 10² sinks in this example. We denote the hybrid implementation as "Hyb (n = a)", where "a" denotes the total number of nodes. For example, with Hyb (n = 2) we obtain a total of 2 (nodes) × 2 (MPI processes) × 24 (OpenMP threads) = 96 OpenMP threads, as we always run 2 MPI processes per node and 24 OpenMP threads per MPI process. Instead of 8 MPI processes for the domain of size 1000³, we used 10 MPI processes due to a divisibility constraint. Figure 3 shows the initial and final concentration of the diffusing substrate (oxygen) for a domain of size 1000³. The simulation plots were obtained with Hyb (n = 1) by executing the cross_section_surface.m Matlab script bundled with BioFVM.
In summary, for Hyb (n = 1), both the MC and BAG kernels took advantage of the multiple MPI processes, as the initialization of the Microenvironment and Basic Agent class objects was carried out simultaneously on separate processes in BioFVM-X, as opposed to a single thread in BioFVM. The (MPI) I/O kernel showed significant performance gains over serial I/O for the tests considered (Fig. 2). Nevertheless, the Solver kernel run-times did not reflect a significant gain in the hybrid version. An extended analysis of these results can be found in Appendix C. Note that it is generally very difficult for an MPI+OpenMP implementation to outperform a pure OpenMP implementation on a single node, as is the case in Fig. 2, due to the additional memory footprint of MPI and the cost of message-passing/synchronization. Our aim in the current work was to tackle very large problems that cannot fit into the memory of a single node and to reduce their time to solution in a multi-node scenario.
After testing with increased numbers of voxels and basic agents, we ran a performance test to evaluate scalability in the number of substrates. We found that the pure OpenMP BioFVM version is incapable of executing a simulation of 400 substrates on a domain of 1500³ due to memory limitations. Nonetheless, we successfully ran a hybrid simulation using 400, and even 800, substrates on a domain of 1500³ by distributing the computation between 2 nodes. To further showcase BioFVM-X's capabilities, we ran a parallelized version of the BioFVM model of tumor growth in a heterogeneous micro-environment [8]. We verified that the BioFVM-X distributed-memory 3-D tumor example yielded exactly the same results as the shared-memory one (see Appendix D and Fig. 8). This is further proof that BioFVM-X correctly distributes the original BioFVM models, with a boost in performance due to the load distribution and the potential of scaling simulations to a cluster of nodes, thus enabling researchers to address bigger, more complex problems. In addition, with a problem of size 7680³, the memory consumption of the pure OpenMP version reaches ≈97% of the total memory of the node (96 GB) and the simulation terminates with a bus error. For the same problem size, the hybrid code executes successfully on 4 nodes (with 192 threads) and on 8 nodes (with 384 threads) (Table 1).

Conclusion and Future Work
Multi-scale modeling has already proven its usefulness in a diversity of large-scale biological projects [9,16,24], but these efforts have been hampered by a scarcity of parallelization examples [4,12,14]. We present BioFVM-X, an enhanced MPI+OpenMP hybrid-parallel version of BioFVM capable of running on multiple nodes of an HPC cluster. We demonstrate that BioFVM-X solves very large problems that are infeasible in BioFVM, as the latter's execution is limited to a single node. This allows BioFVM-X to simulate bigger, more realistic in-silico experiments. Further, despite the fact that our solver is only partially parallelized, we see performance gains in multiple execution kernels. In the future, we aim to replace the solver in the x-direction with a parallel modified Thomas algorithm [15]. BioFVM-X is open source under the BSD 3-Clause license and freely available at https://gitlab.bsc.es/gsaxena/biofvmx. Even though it can be used to easily implement and simulate biological models in a self-contained manner, BioFVM-X also forms the lower layer of our ongoing effort towards a parallel large-scale and multi-scale modeling framework termed PhysiCell-X, based on PhysiCell [9], a framework that is under active development and has multiple stable releases.

Appendix A 1-D Pure x-Domain Decomposition

Figure 5 shows the algorithm for domain partitioning, whereby voxels are assigned to each MPI process. First, the domain dimensions (e.g. xmin, xmax) and the voxel dimensions (Δx) are used to determine the total number of global voxels (g_x_nodes). Given the total number of MPI processes (P), the number of voxels per MPI process in the x-direction (l_x_nodes) is computed next. This is followed by the computation of the global coordinates of the voxel centers (for brevity, lines 1-6 in Fig. 5 show this for the x-direction only; the treatment of the remaining directions is analogous). Further, since each MPI process must maintain the local and corresponding global voxel index, the global mesh index of the first voxel (l_strt_g_index) is computed on each process and used subsequently to assign the global mesh index to each voxel on that process (see the triply nested loop in Fig. 5). In addition to the assignment of a local/global voxel index on each process, a list of the immediate directional neighbours of each voxel is also maintained (not shown in Fig. 5). In parallel, such a scheme must accommodate the cases where there is no local x, y or z neighbour but a global neighbour exists on the neighbouring process, or where the process is aligned to the physical boundary of the domain. In BioFVM, a list for the Moore neighbourhood is also built for each voxel; the Moore neighbourhood equates to a 9-point stencil in 2-D and a 27-point stencil in 3-D [13].
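The core of the Fig. 5 computation can be sketched in C++ as follows; the variable names follow Fig. 5, but the actual BioFVM-X loop structure may differ.

```cpp
// Sketch of the Fig. 5 partitioning for one MPI process. Names follow Fig. 5
// (c0 = this process's x-coordinate in the topology, d0 = number of processes
// along x); the actual BioFVM-X loop structure may differ.
#include <cstddef>
#include <vector>

struct Vxl { long g_index; double cntr[3]; };

void partition_x(double xmin, double dx,
                 int g_x_nodes, int g_y_nodes, int g_z_nodes,
                 int c0, int d0, std::vector<Vxl>& vxls)
{
    const int  l_x_nodes      = g_x_nodes / d0;        // voxels per process in x
    const long l_strt_g_index = (long)c0 * l_x_nodes;  // global mesh index of
                                                       // the first local voxel
    vxls.resize((std::size_t)l_x_nodes * g_y_nodes * g_z_nodes);
    std::size_t n = 0;
    for (int k = 0; k < g_z_nodes; ++k)                // z: undivided
        for (int j = 0; j < g_y_nodes; ++j)            // y: undivided
            for (int i = 0; i < l_x_nodes; ++i, ++n) { // x: local range only
                const long gi = l_strt_g_index + i;    // global x index
                vxls[n].g_index = gi + (long)j * g_x_nodes
                                     + (long)k * g_x_nodes * g_y_nodes;
                vxls[n].cntr[0] = xmin + (gi + 0.5) * dx; // global x center
                // cntr[1], cntr[2] analogous with Δy, Δz
            }
}
```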

Appendix B Mapping Basic Agents to a Voxel
A mapping that relates the position coordinates of a Basic Agent to the local index of a process-specific voxel is illustrated by the algorithm in Fig. 6. Given the position vector of a Basic Agent (denoted by p[] in Fig. 6), the MPI Cartesian coordinates of the MPI process that contains the Basic Agent are computed first (denoted by x_p, y_p and z_p). This is followed by the computation of the global x, y and z indices (denoted by first_x, first_y and first_z) of the first voxel of the MPI process that contains the Basic Agent. After calculating the directional (i.e. x, y and z) local indices, these are combined to obtain the process-local index of the voxel containing the agent (l_index in Fig. 6).
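A sketch of this mapping in C++ (illustrative; it assumes the domain origin at (xmin, ymin, zmin) and uniform voxels, and does not mirror BioFVM-X's exact code):

```cpp
// Sketch of the Fig. 6 mapping from an agent position p[] to the owning MPI
// process (x_p, y_p, z_p) and the process-local voxel index l_index.
// Illustrative only: assumes the domain origin at (xmin, ymin, zmin) and
// uniform voxels; with the current 1-D x-decomposition, y_p = z_p = 0.
#include <cmath>

struct Owner { int x_p, y_p, z_p; long l_index; };

Owner map_agent(const double p[3],
                double xmin, double ymin, double zmin,
                double dx, double dy, double dz,
                int l_x_nodes, int l_y_nodes, int l_z_nodes)
{
    Owner o;
    // Global voxel indices of the position in each direction.
    const long gx = (long)std::floor((p[0] - xmin) / dx);
    const long gy = (long)std::floor((p[1] - ymin) / dy);
    const long gz = (long)std::floor((p[2] - zmin) / dz);

    // MPI Cartesian coordinates of the process owning this position.
    o.x_p = (int)(gx / l_x_nodes);
    o.y_p = (int)(gy / l_y_nodes);
    o.z_p = (int)(gz / l_z_nodes);

    // Global indices of the first voxel on that process (first_x/y/z) ...
    const long first_x = (long)o.x_p * l_x_nodes;
    const long first_y = (long)o.y_p * l_y_nodes;
    const long first_z = (long)o.z_p * l_z_nodes;

    // ... give the directional local indices, flattened to l_index.
    const long lx = gx - first_x, ly = gy - first_y, lz = gz - first_z;
    o.l_index = lx + ly * l_x_nodes + lz * (long)l_x_nodes * l_y_nodes;
    return o;
}
```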

Appendix C Extended Results
For an 8x increase in the number of voxels, the OpenMP MC, GPG, BAG and I/O kernels show a 7.86-8.67x, 3.29-7.05x, 1.15-1.3x and 6.78-8.51x increase in run-time, respectively (Fig. 2). The increases in the corresponding kernels for the best overall hybrid version are 8.7-9.4x, 3-7.78x, 0.77-1.14x and 3.14-6.68x, respectively (Fig. 2). Both the MC and BAG kernels can take advantage of the multiple MPI processes, as the initialization of the Microenvironment and Basic Agent class objects is carried out simultaneously on separate processes in BioFVM-X, as opposed to a single thread in BioFVM. The (MPI) I/O kernel shows significant performance gains over serial I/O for the tests considered. For an 8x increase in the mesh resolution, the 6.78-8.11x increase for the hybrid version in the Solver kernel looks promising compared to the 9.24-15.93x increase for pure OpenMP, but the hybrid version's absolute run-times do not reflect a significant gain. To address this, future versions of BioFVM-X will use the parallel modified Thomas solver [15] in the x-direction.

Appendix D Correctness Checking
To verify the correctness of the simulation, we ran a simulation on a domain of size 1000³ but increased the number of Basic Agents to 2 × 10⁶ (Fig. 7 and Table 2). To further underline the correctness of BioFVM-X, we compared the results of a tumor growth model in a heterogeneous environment from BioFVM [8], available at this link. In this model, 2-D tumor growth is driven by a substrate supplied by a continuum vascular system, and cells die when the supply is insufficient. Additionally, the tumor cells are motile and can degrade the vascular system. We first expanded this example to 3-D (instead of the original 2-D) and specified the domain as 80 × 80 × 80 voxels, for a total of 512,000 voxels. We chose two different configurations:
- a shared-memory configuration (OpenMP) of 48 threads;
- a hybrid shared- and distributed-memory configuration (MPI+OpenMP) of 2 MPI processes running 24 threads each on a single node.
The comparison of the shared-memory and distributed-memory simulations yields identical results, as shown in Fig. 8, further confirming that BioFVM-X produces the same results as BioFVM. The code to reproduce the figure is available in the BioFVM-X code repository.

Fig. 1. Key classes in BioFVM, along with their member data and functions. Functions are distinguishable by a leading parenthesis, i.e. "( )". Names are arbitrary but convey semantic information. A solid, thick arrow with an unshaded triangle represents inheritance; dashed arrows denote a pointer or class relationship, where the class (or its pointer) being pointed to by the arrow is a data member of the class from which the arrow originates.

Fig. 2. Pure OpenMP vs. hybrid MPI execution times for increasing problem sizes. Hyb (n = 1) represents the time when a single node with 2 MPI processes and 24 threads is used. "Best Hybrid" represents the least time for that kernel over any number of experimental nodes considered.

Fig. 3. Initial and final concentration of the diffusing substrate (oxygen) for a domain of size 1000³.

Fig. 4. A 1-D x-direction domain partition of a 3-D domain (the x-direction is the unit-stride dimension).

Fig. 5. Assignment of voxels to MPI processes in the 1-D x-domain decomposition. Only the partitioning of the x-dimension is shown (the y and z directions are handled analogously). Prefixes l and g stand for "local" and "global", respectively. Array d[] contains the topology dimensions and array c[] contains the MPI process coordinates [21]. The triply nested loop sets the global voxel (vxl) centers (cntr) and the global voxel index (indx).

Fig. 6. Mapping the position of a Basic Agent (in array p[]) to the process-local index (l_index) of the voxel that contains p[]. Prefix l stands for "local". Array d[] contains the MPI Cartesian topology dimensions. l_x/y/z_nodes give the number of process-local voxels, and Δx, Δy, Δz denote the voxel dimensions (generally Δx = Δy = Δz).

Fig. 7. 2-D cross-section of the final concentration density of a given substrate with 2 × 10⁶ agents on a domain of size 1000³ using (a) pure OpenMP and (b) hybrid MPI+OpenMP.

Fig. 8. 2-D cross-sections of the (a) initial and (b) final concentration densities of three substrates from the 3-D tumor growth model on a domain of size 80 × 80 × 80 voxels using shared memory (OpenMP d*) and distributed memory with MPI+OpenMP (Hybrid d*).

Table 1. Execution time (in seconds) of the pure OpenMP and hybrid versions for a problem of size 7680 × 7680 × 7680 (≈0.5 billion voxels). The pure OpenMP version terminates with an out-of-memory error.

Table 2. Execution time (in seconds) of the OpenMP and hybrid versions on a domain of size 1000 × 1000 × 1000 with 2 × 10⁶ Basic Agents.