1 Introduction

Blood flow simulation remains an area of active research. Many interesting properties have been identified with the help of simulations [4, 6, 9, 10, 11, 13]. There is an increasing interest in blood flow simulations in which the blood cells (red blood cells, platelets, white blood cells) are fully resolved [3, 8, 15, 16]. Such simulations can be used to understand the mechanisms underlying the complex behaviour of blood flows, including but not limited to platelet margination [9], the formation of the cell-free layer [6], the Fåhræus–Lindqvist effect [2], the behaviour in microfluidic devices, and the behaviour around micromedical implants [1, 5]. Simulations that model blood as a pure fluid are not able to recover these intricate properties of blood flow.

One of the challenges of suspension simulation codes is to parallelize them such that interesting systems with a sufficient number of cells (>1000 cells) can be simulated for an extended duration (>0.1 s) in a reasonable time span (<5 days). Only a few open-source solutions exist for suspension simulations that implement a complex mechanical model for the simulated cells; HemoCell [16] and Palabos-LAMMPS [14] are examples of available open-source codes that can be used to simulate blood flow. Other codes exist but are not (yet) available as open source.

HemoCell is a software package developed at the University of Amsterdam that is able to simulate blood flow at high shear rates (>1000 \(\text {s}^{-1}\)) and with a high number of cells (>1000 cells). In this paper we present HemoCell as a highly efficient parallel code for blood flow suspension simulations.

HemoCell is built on top of Palabos [7] and offers support for complex suspension simulations. Palabos is a general purpose lattice Boltzmann solver with high performance computing capabilities. We will briefly introduce HemoCell and its underlying models, followed by a discussion of the challenges and solutions for efficient parallel simulations. These include the boundary communication between processors for the suspension part, efficiently storing relevant information while avoiding global communication, and efficiently computing the complex material model associated with the cells within HemoCell. Next, we discuss the theoretical and practical implications of the methods we used to implement the suspension simulation software within HemoCell and provide performance measurements.

1.1 HemoCell

HemoCell [16] is an open source parallel code for simulating blood flows with fully resolved cells that is built as a library on top of Palabos [7]. Palabos is a versatile library which can be used to solve pure fluid flow problems with the lattice Boltzmann method (LBM). Palabos offers the relevant multi-processing capabilities. HemoCell implements the cell mechanics simulations and their coupling to the fluid using the immersed boundary method (IBM); see also Fig. 1.

Fig. 1. Overview of the Palabos and HemoCell libraries

HemoCell uses data parallelism to distribute the workload over many cores. With the help of Palabos, HemoCell divides the flow domain into multiple rectangular blocks, each of which is assigned to a processor. These domains are called atomic blocks (ABs). ABs are abstracted away from the user through the use of functionals, which can be used to perform operations on a domain without knowing about the underlying distributed structure. Furthermore, each simulation can have multiple fields, which span the whole domain and represent a specific part of the simulation. In HemoCell two fields are used, a fluid field and a cell field. Palabos takes care of the boundary communication between processors for the fluid field. HemoCell takes care of the cell field and of the communication between the two fields as required by the immersed boundary method [12].
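As a conceptual illustration of this abstraction (not the actual Palabos or HemoCell API; all names here are hypothetical), a functional-style interface could look as follows:

```cpp
#include <vector>

// Illustrative only: hypothetical types that mimic the idea of atomic blocks
// and functionals, not the actual Palabos/HemoCell classes.
struct Domain3D { int x0, x1, y0, y1, z0, z1; };

struct AtomicBlock {
  Domain3D bulk;   // the part of the global domain owned by this processor
  // ... fluid or cell data for this block would live here ...
};

// A functional encapsulates an operation on a single block; the user writes
// the operation without knowing how the global domain is distributed.
struct BlockFunctional {
  virtual void process(const Domain3D& domain, AtomicBlock& block) = 0;
  virtual ~BlockFunctional() = default;
};

// The framework applies the functional to every block owned by this process;
// boundary communication between blocks is handled separately.
void applyFunctional(BlockFunctional& f, const Domain3D& domain,
                     std::vector<AtomicBlock>& localBlocks) {
  for (AtomicBlock& block : localBlocks) {
    // A real implementation would first clip 'domain' to 'block.bulk'.
    f.process(domain, block);
  }
}
```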

A cell is represented in the fluid by a boundary made up of vertices connected through links. An RBC in HemoCell has 1280 vertices. A complex mechanical model is used to calculate the forces [16]. This mechanical model requires a cell to be present on both processors whenever it crosses a boundary between their domains. This results in the two main bottlenecks, and thus challenges, for HemoCell:

  1. The material model of the cells needs to be calculated efficiently.

  2. Distributing the cell field over multiple processors is complex, because the material model requires the duplication of cells over the boundaries between them.

2 Calculating the Mechanical Model of a Cell

The cells within HemoCell are implemented as vertices and connections that form a triangulated mesh. The forces acting on the vertices of a cell are computed through a mechanical model [16]. Figure 2 shows a mesh used to represent a red blood cell.

Fig. 2. Mesh representing red blood cell in HemoCell
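The triangulated representation can be summarized by a small data structure. The following C++ sketch is illustrative only (the field names are hypothetical, not HemoCell's actual types); it lists the connectivity and equilibrium quantities that the mechanical model below refers to.

```cpp
#include <array>
#include <vector>

// Hypothetical layout of a triangulated cell (e.g. an RBC with 1280 vertices);
// not HemoCell's actual data structures.
struct CellMesh {
  std::vector<std::array<double, 3>> vertices;      // current vertex positions
  std::vector<std::array<int, 3>> triangles;        // vertex indices per face
  std::vector<std::vector<int>> vertexNeighbours;   // direct neighbours of each vertex
  std::vector<std::vector<int>> triangleNeighbours; // triangles adjacent to each vertex
  // Equilibrium quantities, fixed when the cell is created:
  std::vector<double> eqEdgeLength;                 // one per edge
  std::vector<double> eqTriangleArea;               // one per triangle
  std::vector<double> eqPatchDistance;              // one per vertex (bending force)
  double eqVolume;                                  // one per cell
};
```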

Závodszky et al. [16] model the forces acting on the vertices of a cell as follows:

$$\begin{aligned} F_{\text {total}}=F_{\text {link}}+F_{\text {bend}}+F_{\text {volume}}+F_{\text {area}}+F_{\text {visc}} \end{aligned}$$
(1)

Below we list all five forces and explain in detail what information is needed to compute them.

  1. The link force \(F_{\text {link}}\) acts along the edges between neighbouring vertices. The force on a single vertex i (\(F_{\text {link}}^i\)) can be described as follows:

    $$\begin{aligned} F_{\text {link}}^i = \sum _{n=1}^{m} C_{\text {link}} \frac{E_{i,i_n} - |i_n^x-i^x|}{E_{i,i_n}} \end{aligned}$$
    (2)

    Where \(E_{i,i_n}\) is the equilibrium length of the edge between vertex i and its n’th neighbour, \(i^x\) is the location of vertex i, m is the number of direct neighbours of vertex i and \(i_n\) is the n’th direct neighbour of vertex i. \(C_{\text {link}}\) consists of all the constant terms that do not change during a simulation, as explained by Závodszky et al. [16]. A code sketch of this force, together with the viscous force, is given after this list.

  2. The bending force \(F_{\text {bend}}\) uses patches, which are defined as a plane that goes through the average location of all direct neighbours of vertex i. The normal direction of this plane is defined as the average normal of all neighbouring triangles that include vertex i. The distance along the normal direction of this plane towards vertex i is used to calculate the bending force on vertex i; a negative term is added to the neighbours of i to make the force zero-sum.

    $$\begin{aligned} F_{\text {bend}}^i = C_{\text {bend}} \left( E^{\text {patch}}_i - \left( \frac{\sum _{n=1}^{m} i_n^x }{ m} - i^x\right) \cdot \left( \frac{\sum _{n=1}^{m} \mathbf{normal } \left( t_i^n \right) }{ L}\right) \right) \nonumber \\ - \sum _{n=1}^m \frac{1}{ N_i^m } F^{i_n}_{\text {bend}} \end{aligned}$$
    (3)

    Where \(E^{\text {patch}}_i\) is the equilibrium distance between the patch and the vertex i along the patch normal. \(t^n_i\) is the n’th triangle that is a direct neighbour of vertex i. \(\mathbf{normal }()\) returns the normal pointing outward from a triangle. L is the length of the sum of the normal vectors of all the triangles that are part of the patch; thus the division results in a unit vector along the average normal direction. The dot product results in a length term along the patch normal. \(N_i^m\) is the number of direct vertex neighbours of \(i_n\). \(C_{\text {bend}}\) again consists of all the constant terms that do not change during a simulation.

  3. The area force \(F_{\text {area}}\) acts on all the triangles that are part of the mesh. Therefore the force on a single vertex is a sum over all neighbouring triangles:

    $$\begin{aligned} F_{\text {area}}^i = \sum _{n=1}^{m} \mathbf{C }_{\text {area}} \left( \frac{E_{i_t^n}^{\text {area}} - \mathbf{area }\left( i_t^n \right) }{ E_{i_t^n}^{\text {area}}}\right) \left( i^x - \mathbf{middle }\left( i_t^n \right) \right) \end{aligned}$$
    (4)

    Where \(\mathbf{area }()\) calculates the area of a triangle and \(E_{i_t^n}^{\text {area}}\) is the equilibrium value for the area of triangle \(i_t^n\). \(\mathbf{middle }()\) calculates the average position of the three vertices of triangle \(i_t^n\). \(\mathbf{C }_{\text {area}}()\) is a function that takes the area ratio as input and returns a force coefficient.

  4. The volume force \(F_{\text {volume}}\) results from the total volume of the cell; thus information about all vertices is needed. The force is distributed over the vertices proportionally to the area of the direct neighbouring triangles of each vertex.

    $$\begin{aligned} F_{\text {volume}}^i = \frac{ \mathbf{volume }\left( \text {cell}^i \right) - E^{\text {volume}}_{\text {cell}^i} }{ E^{\text {volume}}_{\text {cell}^i} } \sum _{n = 1}^{m} C_{\text {volume}} \frac{ \mathbf{area }\left( t_i^n \right) }{ E_{t_i^n}^{\text {area}}} \mathbf{normal }\left( t_i^n \right) \end{aligned}$$
    (5)

    Where \(\mathbf{volume }()\) calculates the volume of a complete cell; this function needs every vertex of the cell as input. \(E_{\text {cell}^i}^{\text {volume}}\) is the equilibrium volume of \(\text {cell}^i\) and \(\mathbf{normal }()\) is the normal direction of triangle \(t_i^n\).

  5. The viscous force \(F_{\text {visc}}\) limits the relative velocity of neighbouring vertices connected with an edge.

    $$\begin{aligned} F_{\text {visc}}^i = \sum _{n=1}^m C_{\text {visc}} \cdot \left( \left( v^i - v_n^i \right) \cdot \left( \frac{i^x_n - i^x}{|i^x_n - i^x|}\right) \right) \cdot \left( \frac{i^x_n - i^x}{|i^x_n - i^x|}\right) \end{aligned}$$
    (6)

    Where \(v^i\) and \(v^i_n\) are the velocities of vertex i and \(i_n\), respectively. \(\sum ^{m}_{n=1}\) sums over all direct vertex neighbours of i.
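As referenced in the list above, the following minimal C++ sketch computes the link (Eq. 2) and viscous (Eq. 6) contributions for a single vertex. It is only an illustration under stated assumptions: the data layout and names are hypothetical, the contribution of Eq. (2) is applied along the edge direction (an assumption, since the equation as written only fixes the magnitude), and any sign convention is absorbed into the constants.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal vector type for this sketch (hypothetical, not HemoCell's).
struct Vec3 { double x, y, z; };
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(double s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
Vec3& operator+=(Vec3& a, Vec3 b) { a.x += b.x; a.y += b.y; a.z += b.z; return a; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
double norm(Vec3 a) { return std::sqrt(dot(a, a)); }

// Link (Eq. 2) and viscous (Eq. 6) contributions for a single vertex i.
// 'neighbours[n]' is the index of the n-th direct neighbour of i and
// 'eqLength[n]' the equilibrium length of the corresponding edge. Applying
// each contribution along the edge direction is an assumption; the sign
// convention is absorbed into the constants C_link and C_visc.
Vec3 linkAndViscousForce(const std::vector<Vec3>& pos,
                         const std::vector<Vec3>& vel,
                         const std::vector<int>& neighbours,
                         const std::vector<double>& eqLength,
                         int i, double C_link, double C_visc) {
  Vec3 f{0.0, 0.0, 0.0};
  for (std::size_t n = 0; n < neighbours.size(); ++n) {
    int j = neighbours[n];
    Vec3 edge = pos[j] - pos[i];
    double len = norm(edge);
    Vec3 dir = (1.0 / len) * edge;
    f += (C_link * (eqLength[n] - len) / eqLength[n]) * dir;  // Eq. (2)
    double vrel = dot(vel[i] - vel[j], dir);                  // Eq. (6)
    f += (C_visc * vrel) * dir;
  }
  return f;
}
```

Both forces only require the positions and velocities of a vertex and its direct neighbours, which is why they can be evaluated locally within an atomic block.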

2.1 Implementation of the Mechanical Model

The formulas for calculating the force on each individual vertex are explained above. There is some overlap between the calculations of the separate forces; for example, the area of a triangle is used for both the volume and the area force (Eqs. 4 and 5). This leaves room for optimization when implementing the calculations. Figure 3 shows pseudocode of the implementation. In this implementation we have tried to calculate each necessary value only once. Furthermore, we try to minimize the number of loops. Most notably, the first loop, which calculates \(F_{\text {area}}\), stores all the values needed for \(F_{\text {volume}}\) in the second loop. In addition, \(F_{\text {link}}\) and \(F_{\text {visc}}\) are calculated in the same loop as well.

Fig. 3. Pseudocode explaining how we optimized the calculation of the mechanical model within HemoCell.
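The pseudocode of Fig. 3 is not reproduced here, but the fused-loop idea described above can be sketched in C++ as follows. This is a simplified illustration rather than HemoCell's actual implementation: the names are hypothetical, the coefficient functions are reduced to constant prefactors, and only the area and volume forces are shown.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Vec3 and its helpers are repeated here so the sketch is self-contained.
struct Vec3 { double x, y, z; };
Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(double s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
Vec3& operator+=(Vec3& a, Vec3 b) { a.x += b.x; a.y += b.y; a.z += b.z; return a; }
Vec3 cross(Vec3 a, Vec3 b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
double norm(Vec3 a) { return std::sqrt(dot(a, a)); }

void areaAndVolumeForces(const std::vector<Vec3>& pos,
                         const std::vector<std::array<int, 3>>& tri,
                         const std::vector<double>& eqArea, double eqVolume,
                         double C_area, double C_volume,
                         std::vector<Vec3>& force) {
  std::vector<double> area(tri.size());
  std::vector<Vec3> normal(tri.size());
  double volume = 0.0;

  // Loop 1: compute each triangle's area and normal exactly once, accumulate
  // the area force (Eq. 4) and the quantities needed for the volume force.
  for (std::size_t t = 0; t < tri.size(); ++t) {
    Vec3 a = pos[tri[t][0]], b = pos[tri[t][1]], c = pos[tri[t][2]];
    Vec3 n = cross(b - a, c - a);                 // length equals twice the area
    area[t] = 0.5 * norm(n);
    normal[t] = (1.0 / norm(n)) * n;
    volume += dot(a, cross(b, c)) / 6.0;          // assumes consistent outward winding
    Vec3 centre = (1.0 / 3.0) * (a + b + c);
    double ratio = (eqArea[t] - area[t]) / eqArea[t];
    for (int k = 0; k < 3; ++k)
      force[tri[t][k]] += (C_area * ratio) * (pos[tri[t][k]] - centre);      // Eq. (4)
  }

  // Loop 2: distribute the volume force using the cached areas and normals.
  double volRatio = (volume - eqVolume) / eqVolume;
  for (std::size_t t = 0; t < tri.size(); ++t)
    for (int k = 0; k < 3; ++k)
      force[tri[t][k]] += (volRatio * C_volume * area[t] / eqArea[t]) * normal[t];  // Eq. (5)
}
```

In the actual implementation \(F_{\text {link}}\) and \(F_{\text {visc}}\) are folded into the same loops as well, as described above.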

3 Implementation of the Cell Field Communication Structure

When the cell field is divided up into multiple atomic blocks, it becomes necessary to implement a communication structure. For a regular fluid field this simply amounts to communicating the values of the fluid cells in the boundary layer to the corresponding neighbours. However, it is not so simple for the cell field. The number of vertices in a communication boundary can change over time, and therefore the communication size is not static but dynamic. Furthermore, at every communication step it has to be determined which vertices are present within a communication boundary and which are not.

Cells need information from all their vertices to calculate the mechanical forces. Almost all forces (\(F_{\text {area}},F_{\text {link}},F_{\text {bend}},F_{\text {visc}}\)) acting on a vertex only need information from its direct neighbours. However, the volume force \(F_{\text {volume}}\) needs information from all the vertices of the cell. Therefore, whenever a single vertex of a cell is present in an atomic block, the boundary must include every other vertex of the corresponding cell as well. This means that the size of the boundary must be larger than the largest possible diameter of a cell. Figure 4 shows that with such a large boundary the number of neighbours, and thus the amount of communication, increases if the atomic blocks become too small.

Fig. 4. Visualization of the boundary size needed for the cell field.

Fig. 5. The top block shows in pseudocode a naïve implementation of the boundary communication. The bottom block shows our optimized implementation of the boundary communication algorithm.

There is a simple way to implement this boundary, namely by communicating all vertices in the boundary. We will use this communication pattern as the base upon which we propose improvements, see Fig. 5. In the naïve implementation, first all neighbours are determined that overlap with the boundary of the atomic block. Within HemoCell an RBC (the largest cell) can stretch up to \(12~{\upmu }\text {m}\). Thus all neighbours within a \(12~{\upmu }\text {m}\) range send the vertices corresponding to the overlap they have with the boundary. This method has two drawbacks: first, a lot of unnecessary data is communicated, and second, when the boundary size is larger than an atomic block the number of neighbours with which communication is necessary grows, usually as \((2N+1)^3-1\), where N is the number of neighbouring blocks that must be reached in a single direction. So going from \(N=1\) to \(N=2\) creates \(124-26=98\) extra neighbours.
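As a small illustration of this growth, the following helper computes the neighbour count under the assumption that the required boundary width directly determines how many blocks must be reached in each direction (the function is hypothetical, not part of HemoCell):

```cpp
#include <cmath>

// Number of neighbouring atomic blocks a block must exchange boundary data
// with, if the boundary width requires reaching N blocks in each direction.
// Assumes a cubic decomposition; names are illustrative.
int boundaryNeighbours(double boundaryWidth, double blockEdge) {
  int N = static_cast<int>(std::ceil(boundaryWidth / blockEdge));  // blocks per direction
  return (2 * N + 1) * (2 * N + 1) * (2 * N + 1) - 1;
}
// e.g. boundaryNeighbours(12.0, 16.0) == 26, boundaryNeighbours(12.0, 8.0) == 124
```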

We implemented an improved and consequently faster method to communicate vertices in boundaries. The main idea is to communicate only the vertices of cells that are actually needed. For this, an extra communication step is introduced: an atomic block first sends its neighbours a list with the IDs of the cells whose vertices it needs. In the next communication step only those vertices are communicated. It is not possible to get rid of the plain boundary communication entirely, as vertices very close to the domain are needed for non-local force calculations (e.g. inter-cellular forces). However, this communication is much cheaper because only a very small boundary needs to be exchanged.
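A minimal sketch of this two-step exchange is given below, interpreting the extra step as a request for the IDs of the cells a block needs. The Channel interface and all names are hypothetical stand-ins for the actual (MPI-based) communication layer; the sketch only illustrates the order of the messages.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Sketch of the two-step boundary exchange (illustrative; not HemoCell's API).
// Step 1: each block tells its neighbours which cell IDs it needs.
// Step 2: neighbours send back only the vertices of those cells.

struct Vertex { std::uint64_t cellId; double x, y, z; };

// Abstract point-to-point channel so the sketch stays self-contained; a real
// implementation would map this onto (non-blocking) MPI messages.
struct Channel {
  virtual void sendIds(int neighbour, const std::vector<std::uint64_t>& ids) = 0;
  virtual std::vector<std::uint64_t> recvIds(int neighbour) = 0;
  virtual void sendVertices(int neighbour, const std::vector<Vertex>& v) = 0;
  virtual std::vector<Vertex> recvVertices(int neighbour) = 0;
  virtual ~Channel() = default;
};

void exchangeBoundary(Channel& comm, const std::vector<int>& neighbours,
                      const std::set<std::uint64_t>& cellsNeededHere,
                      const std::map<std::uint64_t, std::vector<Vertex>>& localCells,
                      std::vector<Vertex>& received) {
  // Step 1: request only the cells whose vertices this block actually needs
  // (cells that already have at least one vertex inside this block).
  std::vector<std::uint64_t> request(cellsNeededHere.begin(), cellsNeededHere.end());
  for (int nb : neighbours) comm.sendIds(nb, request);

  // Step 2: answer the neighbours' requests with the matching local vertices;
  // nothing else crosses the boundary.
  for (int nb : neighbours) {
    std::vector<Vertex> reply;
    for (std::uint64_t id : comm.recvIds(nb)) {
      auto it = localCells.find(id);
      if (it != localCells.end())
        reply.insert(reply.end(), it->second.begin(), it->second.end());
    }
    comm.sendVertices(nb, reply);
  }
  for (int nb : neighbours) {
    std::vector<Vertex> in = comm.recvVertices(nb);
    received.insert(received.end(), in.begin(), in.end());
  }
}
```

A production implementation would overlap the per-neighbour sends and receives with non-blocking communication rather than performing them strictly in sequence.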

3.1 Comparison Between Naïve and Optimized Implementation of the Boundary Communication Algorithm

To test the performance gain of our optimized boundary communication algorithm we have set up a simulation which is executed both with the naïve and with the optimized implementation. The simulation consists of a cubic volume of \((128~\upmu \text {m})^3\) that is periodic in all directions. Within this volume 7736 red blood cells are present. Figure 6 shows the simulated domain. An external body force is applied to drive the cell suspension inside the volume. The volume is simulated for 0.1 s and statistics are collected over the whole duration. The results are plotted in Fig. 7.

Fig. 6. The domain on which the simulations are performed with a differing number of processors

Fig. 7. Statistics for each of the simulations. The fluid part is handled by Palabos. The dotted line shows perfect linear scaling. (a) shows the statistics for the naïve implementation of the communication. (b) shows the statistics for our improved implementation.

The results show a significant improvement of HemoCell in two ways. Firstly, the base performance has improved by \(\approx \)36%; this can be deduced from the difference in wall-clock time per iteration at 8 cores in Fig. 7. Secondly, the strong scaling properties (dividing the same domain into more, smaller atomic blocks) are better. In the worst case (\(512~\upmu {\text {m}}^3\) per atomic block) the edges of an atomic block are only \(8~\upmu {\text {m}}\) long. This means that the boundary of each block overlaps with 124 neighbours. In this case we see a performance improvement of \(\approx \)4 times over the naïve version. Over the whole range our improved communication performs better.

4 Conclusions

The improvements presented in this paper allow us to perform fully resolved blood flow simulations up to 4 times faster. For a simulation of \(1~{\text {s}}\), a total of 10 million timesteps is required. This means that the improved version of HemoCell needs only one day to complete such a simulation with ABs of \(512~\upmu {\text {m}}^3\), whereas the naïve version would need four days.

We have shown that it is possible to merge the calculation of the forces of the mechanical model in such a way that less computation is needed than when all the forces are computed separately. This is achieved by re-using intermediate values and combining loops where possible.

By improving the communication structure, better strong scaling is achieved for HemoCell. Furthermore, the base performance with large ABs is improved as well.