1 Introduction

Advances in understanding complex biological systems such as tumors require multi-scale simulations that integrate intracellular processes, cellular dynamics, and their interaction with the environment. Computational biologists use a wide range of approaches to simulate how single cells affect multi-cellular systems’ dynamics [17, 24]. Nevertheless, large-scale multi-scale modeling still needs tools to accurately simulate the environment in an efficient manner.

BioFVM [8] is a Finite Volume Method (FVM) [20] based simulation software for solving Partial Differential Equations (PDEs) [29] that model complex processes like the uptake, release and diffusion of substrates for multi-cellular systems such as tissues, tumors or microbial communities. Apart from being a self-contained callable library that can be used to implement and simulate biological models, BioFVM forms the core component of PhysiCell [9] - a flexible, lattice-free, agent-based multi-cellular framework capable of simulating cell mechanics, such as cell movement, cell-cell interaction and different cell phenotypes, as well as the micro-environment consisting of diffusing substrates, signaling factors, drugs, etc. BioFVM is capable of handling multiple substrates and can simulate chemical and biological processes using both cell and bulk sources. The following diffusive PDE on a computational domain \(\varOmega \) (and boundary \(\partial \varOmega \)) is solved for a substrate density vector \(\rho \):

$$\begin{aligned} \frac{\partial \rho }{\partial t} = \nabla \cdot (\mathbf {D} \circ \nabla \rho ) - \lambda \circ \rho + \mathbf {f}, \end{aligned}$$
(1)

with the boundary condition \((\mathbf{D} \circ \nabla \rho ) \cdot n = 0\) on \(\partial \varOmega \) and the initial condition \(\rho (\mathbf {x},t_{0}) = g\) in \(\varOmega \). In (1) above, \(\mathbf {D}\) is the matrix of (constant) diffusion coefficients, \(\lambda \) is the decay rate, \(\mathbf {f}\) is the net source term and \(\circ \) is the term-wise product of vectors [8]. Without loss of generality, the substrate density \(\rho \) can represent any kind of molecule such as a nutrient, a by-product, a signal molecule or a drug. As a consequence, modeling complex environments requires simulating many densities, posing a challenging scaling problem. Simulating the environment requires the numerical solution of the linear system obtained by a Finite Volume Discretization of the PDE given by Eq. (1), which BioFVM solves using the Thomas algorithm [31] - a fast, direct solver for tridiagonal systems. BioFVM’s biggest scalability limitation is that it cannot execute on multiple nodes of an HPC cluster to solve a single, coherent problem and thus the problem must fit into the memory of a single node.

We present BioFVM-XFootnote 1: an enhanced distributed version that uses MPI (Message-Passing Interface [21]) to parallelize the core kernels of BioFVM - enabling one to solve very large problems which were not previously solvable using the shared-memory only version. This contribution represents the first and the most critical step on the road to a distributed implementation of PhysiCell.

2 Related Work

Different agent-based approaches have been proposed to model and simulate multi-cellular systems, including on-lattice cellular automata, the Cellular-Potts model [10] and overlapping spheres, among others [23]. BioFVM [8, 9] was created with the goal of achieving simplicity of usage, flexibility in expressing cell models, and optimizing execution speed while minimising dependencies on external libraries but is only shared-memory parallelized using OpenMP [22].

For realistic, complex simulations, the need is to simulate billions of cells and dynamic, complex 3-D environments, only achievable by optimal, full scale utilization of parallel systems [12, 14]. Biocellion [14] is a flexible, discrete agent-based simulation framework that uses MPI for inter-node communication, as well as other dependencies, such as PNNL Global Arrays [25], CHOMBO [3], the Intel TBB [11] and the iterative Multigrid solver [2, 32]. Nevertheless, Biocellion has fixed routines to describe system behaviors, is dependent on external libraries and is closed source, which might deter potential users. Chaste is an open-source, general purpose simulation package for modeling soft tissues and discrete cell populations [18] that can be used with MPI using PETSc [1] but which itself suffers from multiple dependencies. Timothy [4, 5] is another open-source, MPI based tool but with several dependencies, such as Zoltan [6], Hypre [7] and SPRNG [19].

3 Internal Design and Domain Partitioning

The simplicity, flexibility, minimal dependence on external libraries, execution speed and openness of BioFVM make it an ideal experimental candidate for distributed parallelization. In BioFVM, the 3-D simulation domain is divided into Voxels (Volumetric pixels). The principal classes depicting the internal architecture and their relationship in BioFVM is shown in Fig. 1.

Fig. 1.
figure 1

Key classes in BioFVM, along with their member data and functions. Functions are distinguishable by a leading parenthesis i.e. ( ). Names are arbitrary but convey semantic information. Solid, thick arrow with an un-shaded triangle represents inheritance and dashed arrows denote a pointer or class relationship - the class (or its pointer) being pointed to by the arrow is a data member of the class from which the arrow originates.

The top-level biological entities along with related classes (see Fig. 1) are: (1) Biological Environment (Microenvironment and Microenvironment_Options), (2) Physical Domain represented as 2-D/3-D Mesh (General_Mesh, Cartesian_Mesh and Voxel), and (3) Cells (Basic_Agent and Agent_Container). The data members of some classes are either the objects or the pointers of another class type (see dashed arrows in Fig. 1). The Microenvironment class sets the micro-environment name, the diffusion/decay rates of substrates, defines constants for the Thomas algorithm, contains an object of Cartesian_Mesh, a pointer to the Agent_Container class and performs I/O. A group of resizing functions that determine the global/local voxels are members of the Cartesian_Mesh class. The Microenvironment_Options class helps to set oxygen as the first default substrate and the default dimensions of the domain/voxel. The Cartesian_Mesh class is publicly derived from General_Mesh (thick arrow in Fig. 1). The Basic_Agent class forms an abstraction of a cell. An object of the Basic_Agent class can either act as a source or sink of/for substrates. Each agent has a unique ID, a type, and maintains the local/global index of its current voxel.

We initialize MPI with the MPI_THREAD_FUNNELED thread support level and after domain partitioning [27, 28], assign the sub-domains to individual MPI processes. Our implementation as of now supports only a 1-D x-decomposition (see Appendix A). The randomly generated positions of basic agents are mapped to respective processes (see Appendix B) after which they are created individually and in parallel on the MPI processes. Each MPI process initializes an object of the Microenvironment class, maintains the local and global number of voxels, local (mesh_index) and global voxel indices (global_mesh_index) and the center of each local voxel’s global coordinates. A 1-D x-decomposition permits us to employ the optimal serial Thomas algorithm [30, 31] in the undivided y and z dimensions. This enables all threads within a node to simultaneously act on elements belonging to different linear systems.

The Thomas algorithm is used to solve a tridiagonal system of linear equations in serial and consists of two steps, namely, Forward Elimination (FE) step followed by a Backward Substitution (BS) step. Unfortunately, both the steps involve serial and dependent operations and thus, the solver is inherently serial and cannot be fully (trivially) parallelized. Although we decompose data in the \(x-\)direction, the solver still runs serially i.e. MPI process rank i must finish the FE before this step can begin on MPI process rank \(i+1\). Thus, the performance of this multi-node but serial Thomas solver is expected to be worse than a single-node Thomas solver due to the overhead of communication. The performance penalty is least in the \(x-\)direction as the data is contiguous in the memory as compared to the y and z direction where the data in the voxels’ vector is non-contiguous. Thus, we decompose data only in the \(x-\)direction and avoid decomposition in the other directions. We expect to replace this non-optimized implementation by a modified, MPI+OpenMP version of the modified Thomas algorithm [15] in future versions.

4 Experiments

We used the MareNostrum 4 (MN4) supercomputer at the Barcelona Supercomputing Center (BSC) for all our experiments. Each node has two 24-core Intel Xeon Platinum 8160 processors and a total memory of 96 GB. BioFVM-X only requires a C++ compiler and an MPI implementation for compilation. We used GCC 8.1 and OpenMPI 3.1.1 running atop the SUSE Linux Enterprise Server 12 SP2 OS. The parallel file system is the IBM General Parallel File System and the compute nodes are interconnected with the Intel Omni-Path technology with a bandwidth of 100 Gbits/s. We pinned the threads to individual cores and bind each MPI process to a single processor (socket). We set the OpenMP environment variables OMP_PROC_BIND=spread, OMP_PLACES=threads [26] and used the –map-by ppr:1:socket:pe=24 notation to allocate resources (see https://gitlab.bsc.es/gsaxena/biofvm_x).

We used a cubic physical domain and cubic voxels for all our tests. Our implementation assumed that the total number of voxels in the BioFVM’s x-direction are completely divisible by the total number of MPI processes. The example that we used to demonstrate the benefits of Hybrid parallelism is tutorial1 in the BioFVM/examples directory. This example: (1) Initializes and resizes the micro-environment (\(\mu \)-environment, MC kernel) (2) Creates a Gaussian profile (GPG kernel) of the substrate concentration (3) Writes the initial and final concentrations to a .mat file (I/O kernel) (4) Creates Basic Agents (Sources and Sinks, BAG kernel) and (5) Simulates Sources/Sinks and Diffusion (Solver kernel).

Figure 2 presents timing results for the MC, GPG, BAG, I/O and Solver kernels on physical domains of sizes \(1000^{3}, 1920^{3}\) and \(3840^{3}\). Cubic voxels had a volume of \(10^{3}\) with \(5 \times 10^{2}\) sources and \(5 \times 10^{2}\) sinks in this example. We denote the Hybrid implementation as “Hyb (n = a)”, where “a” denotes the total number of nodes. For example, with Hyb (n = 2), we obtain a total of 2 (nodes) \(\times \) 2 (MPI processes) \(\times \) 24 (OpenMP threads) = 96 OpenMP threads, as we always run 2 MPI processes per node and 24 OpenMP threads per MPI process. Instead of 8 MPI processes for the domain of size \(1000^{3}\), we used 10 MPI processes due to a divisibility problem. Figure 3 shows the initial and final concentration of the diffusing substrate (oxygen) for a domain of size \(1000^{3}\). The simulation plots were obtained with Hyb (n = 1) by executing the cross_section_surface.m Matlab script bundled with BioFVM.

In summary for Hyb (n = 1), both MC and BAG kernels took advantage of the multiple MPI processes as initialization of the Microenvironment and Basic_Agent class objects were simultaneously carried out on separate processes in BioFVM-X as opposed to a single thread in BioFVM. The (MPI) I/O kernel showed significant performance gains over serial I/O for the tests considered (Fig. 2). Nevertheless, the Solver kernel execution run-times did not reflect a significant gain in the Hybrid version. An extended analysis of these results can be found in Appendix C. Note that it is generally very difficult for an MPI+OpenMP implementation to outperform the pure OpenMP implementation on a single node, as is the case of Fig. 2, due to the additional memory footprint of MPI and the cost of message-passing/synchronization. Our aim in the current work was to tackle very large problems that cannot fit into the memory a single node and to reduce their time to solution in a multi-node scenario.

Fig. 2.
figure 2

Pure OpenMP Vs Hybrid MPI execution times for increasing problem sizes. Hyb (n = 1) represents the time when a single node with 2 MPI processes and 24 threads is used. The Best Hybrid represents the least time for that kernel for any number of experimental nodes considered.

Fig. 3.
figure 3

3-D concentration density of oxygen simulated using Hyb (n = 1) for a domain of size \(1000^{3}\) and 1000 Basic Agents.

After testing with increased voxels and basic agents, we run a performance test to evaluate the scalability in the number of substrates. We found that the pure OpenMP BioFVM version is incapable of executing a simulation of 400 substrates on a domain of \(1500^{3}\) due to memory limitations. Nonetheless, we successfully run a Hybrid simulation using 400, and even 800 substrates, on a domain of \(1500^{3}\) by distributing the computation between 2 nodes.

To further showcase BioFVM-X capabilities, we run a parallelized version of the model of tumor growth in a heterogeneous micro-environment from BioFVM [8]. We verified that the BioFVM-X distributed-memory 3-D tumor example yielded the exact same results as the shared-memory one (see Appendix D and Fig. 8). This is further proof that BioFVM-X correctly distributes the original BioFVM models with a boost in performance due to the load distribution and the potential of scaling simulations to a cluster of nodes, thus enabling researchers to address bigger, more complex problems. In addition, with a problem of size \(7680^{3}\), the memory consumption of the pure OpenMP version reaches \(\approx \)97% of the total memory of the node (96 GB) and the simulation terminates with a bus error. For the same problem size, the Hybrid code on 4 (with 192 threads) and 8 nodes (with 384 threads) executes successfully (Table 1).

Table 1. Time (in seconds) of execution for the pure OpenMP and the Hybrid version for a problem of size \(7680 \times 7680 \times 7680\) (\(\approx \)0.5 billion voxels). The pure OpenMP version terminates while throwing Out Of Memory error.

5 Conclusion and Future Work

Multi-scale modeling has already proven its usefulness in a diversity of large-scale biological projects [9, 16, 24], but these efforts have been hampered by a scarcity of parallelization examples [4, 12, 14]. We present BioFVM-X - an enhanced MPI+OpenMP Hybrid parallel version of BioFVM capable of running on multiple nodes of an HPC cluster. We demonstrate that BioFVM-X solves very large problems that are infeasible using BioFVM as the latter’s execution is limited to a single node. This allows BioFVM-X to simulate bigger, more realistic in-silico experiments. Further, despite the fact that our solver is only partially parallelized, we see performance gains in multiple execution kernels. In the future, we aim to replace the solver in the x-direction with a parallel modified Thomas algorithm [15].

BioFVM-X is open source under the BSD 3-Clause license and freely available at https://gitlab.bsc.es/gsaxena/biofvm_x. Even though it can be used to easily implement and simulate biological models in a self-contained manner, BioFVM-X also forms the lower layer of our ongoing efforts to have a parallel large-scale and multi-scale modeling framework termed PhysiCell-X, based on PhysiCell [9] - a framework that is under active development and has multiple stable releases.