1 Introduction

The authors' goal is to build a heterogeneous multicore-manycore computing system and to implement a dynamic distribution algorithm for double precision calculations. To this end, an HPC system consisting of 9 PS3 systems (one of them being the master node) connected through a Gigabit switch was assembled and set up. After installing the operating system on each node, the OpenMPI library and the Cell SDK 3.0 development tools were installed. An algorithm was developed to calculate the approximate value of π using the formula \(\pi= \int_{0}^{1} \frac{4}{1+x^{2}}\,dx\). The MPI library was initially used to distribute the calculation area over 18 threads with the command mpirun -np 18 pi value, where value is the computing resolution and was given the following values: 10^5, 10^6, 10^7, 10^8, 10^9, 10^10, 10^11, 10^12. For each resolution, we obtained different values of the error and of the computing time. After these tests, the intent was to bring the SPU cores on each node (9×6 = 54 SPUs) into the calculation, distributing the integral calculation area over the 72 processes available on the 9×PS3 cluster. Because the SPU cores do not have a double precision unit, these calculations are performed with great delay. There are therefore two possibilities to solve this dilemma: the first is to return to distributing the calculations only over the 18 PPU processes, and the second is to also take the SPU units into account, but to give them a smaller amount of calculation than the PPU units. An adaptive algorithm that distributes the total amount of computation between the PPU and SPU units is implemented. The distribution ratio is given by a coefficient k that is determined empirically, using a mathematical formula, for a small number of integration intervals. After the k factor is determined, it is applied automatically to all values in the range 10^5–10^12. The coefficient k is determined for the minimum time, when the PPU and SPU units complete their calculations in an equal time (within a margin of 5 %); the tests show that the coefficient k takes different values depending on the number of mathematical operations performed. The calculation of the coefficient k is presented for different methods of calculating the integral, the trapezoid method and the rectangle method, together with the experimental results obtained with the different distribution models: MPI (PPU threads only) and MPI-SDK (PPU and SPU threads).

2 Experimental resources

2.1 PS3 cluster architecture and setting

We chose to use PS3 systems in the cluster construction due to their outstanding characteristics, which make them suitable for scientific calculation. Some of these features are:

  • The PS3 is an open platform, meaning it can run different operating systems (e.g., Fedora Core 8 for PPC).

  • The PS3 system contains a Cell B.E. processor slightly different from the original (1×PPU and 6×SPUs).

  • Its very low price (approximately $300) makes it very attractive as a computing node in a cluster system.

The architecture of the communication system is of the tree type shown in Fig. 1.

Fig. 1 PS3 cluster architecture

The cluster configuration involved the following steps:

  • Formatting the 9 PS3 nodes.

  • Installing the operating system Fedora Core 8 for PPC 64.

  • Installing the SSH service and NFS on each station, used for MPI communication and file sharing.

  • Setting up NFS on the master node station and on the other eight client stations.

  • Installing and configuring the OpenMPI library on all stations.

  • Installing Cell SDK 3.0 on each station.

The PS3 nodes are connected to a 1000BASE-T Gigabit Ethernet switch. For this algorithm, the Gigabit switch (24 ports) is not a bottleneck because there is no heavy data traffic through it; the total time needed to set up the threads on every node and to perform the calculation is greater than the communication time. The total program execution time is \(\mathrm{Total}\ \mathrm{Time} = t_{\mathrm{setup}}(n)+\max(t_{\mathrm{com}}(i))+t_{\mathrm{SPU}}(6)+ \max(t_{\mathrm{calculation}}(1/n))+t_{\mathrm{PPU},\pi }+\max(t_{\mathrm{com}R}(i))+t_{\mathrm{master\_node},\pi}\), where:

  • \(t_{\mathrm{setup}}(n)\): time to set up the nodes and initiate the PPU and SPU threads.

  • \(\max(t_{\mathrm{com}}(i))\): maximum time to transmit the initial parameters to the PPU threads (MPI library).

  • \(t_{\mathrm{SPU}}(6)\): maximum time to transmit the initial parameters to the 6 SPU threads (Cell SDK library).

  • \(\max(t_{\mathrm{calculation}}(1/n))\): maximum time to compute the π fractions.

  • \(t_{\mathrm{PPU},\pi}\): time to add the π fractions from the SPUs on the PPU.

  • \(\max(t_{\mathrm{com}R}(i))\): maximum time to send the partial results from the nodes to the master node (MPI library).

  • \(t_{\mathrm{master\_node},\pi}\): time to add the π fractions from the nodes on the master node.

\(\max(t_{\mathrm{com}}(i))+t_{\mathrm{SPU}}(6)+t_{\mathrm{PPU},\pi }+\max(t_{\mathrm{com}R}(i))+t_{\mathrm{master\_node},\pi}\) is negligible in relation to \(\max(t_{\mathrm{calculation}}(1/n))\), and \(\max(t_{\mathrm{calculation}}(1/n)) \gg t_{\mathrm{setup}}(n)\).

Even if the number of nodes is increased (e.g., to 23×23 = 529 nodes in a two-level tree architecture), the total calculation time decreases with \(t_{\mathrm{calculation}}(1/n)\) but increases with the time needed to set up the 1058 threads on the 529 nodes.

2.2 Cell architecture

The Cell processor is a heterogeneous multicore architecture developed by the IBM-Sony-Toshiba consortium. It is built around a 64-bit PowerPC processor element (PPE), eight SIMD computing cores (SPEs), a memory controller, and an I/O controller interface. The communication between the PPU and SPU computing elements is done through a high speed bus, the Element Interconnect Bus (EIB). At a clock frequency of 3.2 GHz, the maximum theoretical single precision (SP) performance is 25.6 GFlops per SPE, resulting in an overall performance of 204.8 GFlops for 8 SPUs. For double precision (DP), the theoretical maximum performance is 12.8 GFlops for a single SPE and 102.4 GFlops for 8 SPUs [8].

The EIB provides a maximum bandwidth of 204.8 GB/s for on-chip data transfers between the PPU, the SPUs, the memory interface, and the I/O controller. The memory interface controller provides a bandwidth of 25.6 GB/s to main memory. The PPU runs the operating system and drives the SPUs. The PPU memory hierarchy is similar to that of conventional processors, with a 32 KB L1 cache and a 512 KB L2 cache.

The SPEs are designed for high performance processing of massive data and intensive calculation. The SPE memory hierarchy consists of a set of 128 × 128-bit SIMD registers, 256 KB of local memory (LS, Local Store), and the off-chip main memory shared through the PPU. SIMD operations can run with four different granularities: 16 × 8-bit integers, 8 × 16-bit integers, 4 × 32-bit integers or single precision floating point numbers, and 2 × 64-bit double precision floating point numbers. The local store of an SPE holds both code and data, and each SPE can directly access only the code and data in its own local store.

Data transfers between the LS memories and the main memory, as well as transfers between LS memories, are achieved by DMA. DMA transfers are asynchronous and allow the SPEs to overlap computation with data transfer. LS memory management is done exclusively by software. Mailbox mechanisms are provided for communication between the PPE and the SPEs; each mailbox can hold up to four 32-bit data elements at any time [4].
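
As an illustration of the mailbox mechanism, the fragment below is a minimal SPU-side sketch; the three-parameter protocol and the two-word return of a double are assumptions made for illustration and are not the authors' spu.c.

```c
/* Minimal SPU-side mailbox sketch (hypothetical protocol, not the authors' spu.c):
 * receive three 32-bit parameters, return a 64-bit result as two 32-bit words. */
#include <string.h>
#include <spu_mfcio.h>

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    /* spu_read_in_mbox() blocks until the PPU writes a word (inbound depth: 4 entries) */
    unsigned int start = spu_read_in_mbox();   /* first subinterval index (assumed) */
    unsigned int stop  = spu_read_in_mbox();   /* last subinterval index (assumed)  */
    unsigned int n     = spu_read_in_mbox();   /* total resolution (assumed)        */

    /* placeholder for the numerical work on subintervals [start, stop) of width 1/n */
    double fraction = (double)(stop - start) / (double)n;

    /* the outbound mailbox holds one 32-bit entry, so the double is sent as two
     * words; spu_write_out_mbox() blocks until the PPU drains each entry */
    unsigned int words[2];
    memcpy(words, &fraction, sizeof fraction);
    spu_write_out_mbox(words[0]);
    spu_write_out_mbox(words[1]);
    return 0;
}
```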

Block diagram of the Cell Broadband Engine processor is shown in Fig. 2.

Fig. 2 Block diagram of the Cell/B.E.

3 Experimental work

The algorithm described below can be ported to other AMP (asymmetric multicore platform) architectures that include two performance levels of processors, Fast Cores and Slow Cores, i.e., dual-ISA heterogeneous multicore architectures [2]. The k coefficient is used for an unbalanced distribution of tasks over the PPUs and SPUs so that their working time and energy consumption are balanced. This applies only to systems with performance asymmetry (SPU as Slow Core and PPU as Fast Core) [7]. In fact, the SPUs are much faster, but they are designed for image processing; in double precision mathematical calculation they are slower because they do not have a double precision unit (this is the case only for the Cell version in PS3 systems).

In [6], the authors present an implementation of the core features of MPI for the Cell. That implementation views each SPU as the node for an MPI process. In our implementation, we use not only the SPUs but also the PPU units. Two Linux threads are started on every PPU unit. The application is started on the master node with the mpirun -np 18 pi value command. The -np 18 argument means that the master node will initiate 2 processes on every node using the MPI library. At the same time, the master node sends to the other nodes the k coefficient and an equally distributed part of the total amount of calculation, using the MPI library. The even thread on every PPU unit (i.e., on every node) starts one thread on every SPU processor and sends to the SPU threads their quantity of calculation, using the SDK mailbox facility. The quantity of calculation for the SPU threads is determined using the k coefficient. After that, the even PPU thread performs its own calculation, then waits to receive the fraction of the π calculation from every SPU thread and adds them up (using the same SDK mailbox facility). The odd PPU thread performs only its own π fraction calculation. Finally, the master node receives all the π fractions from all nodes (18 threads) and adds them (using the MPI library). A condensed sketch of this organization is given below.
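
The listing below is a condensed, hypothetical sketch of this organization for the even PPU thread of one node, written against the libspe2 and MPI interfaces; the embedded SPU program name pi_spu, the mailbox protocol (three 32-bit parameters in, one double returned as two 32-bit words), and the helper run_spu are illustrative assumptions, not the authors' ppu.c.

```c
/* Hypothetical sketch of one node's "even" PPU thread: start 6 SPU contexts,
 * pass work via mailboxes, compute the PPU share, then reduce with MPI. */
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <libspe2.h>
#include <mpi.h>

#define NUM_SPU 6
extern spe_program_handle_t pi_spu;        /* embedded SPU program (assumed name) */

static void *run_spu(void *arg)            /* spe_context_run() blocks until the SPE */
{                                          /* exits, so each context gets its own thread */
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    return NULL;
}

int main(int argc, char *argv[])
{
    int me, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double k = 7.3;                        /* distribution coefficient (Sect. 3.2) */
    MPI_Bcast(&k, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local_pi = 0.0;
    if (me % 2 == 0) {                     /* "even" thread also drives the SPUs */
        spe_context_ptr_t ctx[NUM_SPU];
        pthread_t         thr[NUM_SPU];
        for (int i = 0; i < NUM_SPU; i++) {
            ctx[i] = spe_context_create(0, NULL);
            spe_program_load(ctx[i], &pi_spu);
            pthread_create(&thr[i], NULL, run_spu, ctx[i]);
            /* start, stop, resolution for SPU i (values derived from k elsewhere) */
            unsigned int params[3] = { 0, 0, 0 };
            spe_in_mbox_write(ctx[i], params, 3, SPE_MBOX_ALL_BLOCKING);
        }
        /* ... this thread's own PPU share of pi would be computed here ... */
        for (int i = 0; i < NUM_SPU; i++) {
            unsigned int words[2];         /* SPU returns its double as two words */
            for (int w = 0; w < 2; w++) {
                while (spe_out_mbox_status(ctx[i]) == 0)
                    ;                      /* outbound mailbox holds a single entry */
                spe_out_mbox_read(ctx[i], &words[w], 1);
            }
            double frac;
            memcpy(&frac, words, sizeof frac);  /* same word order assumed on both sides */
            local_pi += frac;
            pthread_join(thr[i], NULL);
            spe_context_destroy(ctx[i]);
        }
    } else {
        /* "odd" thread: computes only its own PPU share of pi */
    }

    double pi = 0.0;                       /* master gathers and adds all fractions */
    MPI_Reduce(&local_pi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (me == 0)
        printf("pi ~ %.15f\n", pi);
    MPI_Finalize();
    return 0;
}
```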

The k coefficient depends on the ratio of computing power between the Fast Cores (PPU) and the Slow Cores (SPU) and on their number. Even if we increase the number of nodes, the optimal k stays the same, because the nodes have the same hardware architecture (k does not change as the system scales). In [9], the authors use a variety of algorithms for several collective communication operations. In our situation, we do not need these types of communications, because we use only mailbox communication to send to each SPU the start, stop, and resolution values for the interval of the integral assigned to it.

3.1 Trapezoid area calculation method using the MPI distribution

Initially, we implemented an algorithm that approximates the value of π by calculating the integral area with the trapezoid method (Fig. 3) [5]. Using the MPI library, two threads are distributed to each node (the PPU is dual-threaded); the command mpirun -np 18 pi value starts 18 threads on the 9 stations. Each of these threads calculates an equally distributed part of the integral, defined in the C code below [3]:

Fig. 3 Formula for π calculation using trapezoid method

figure a (C code for the per-thread trapezoid calculation)
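
The original listing is not reproduced above, so the following is a hedged reconstruction of the per-thread trapezoid loop; the strided assignment of subintervals to ranks is an assumption, and the variables me, n, and h are described below.

```c
/* Hedged reconstruction of the per-thread trapezoid loop ("figure a");
 * the strided assignment of subintervals to ranks is an assumption. */
static double f(double x) { return 4.0 / (1.0 + x * x); }

double partial_pi_trapezoid(int me, int nprocs, long long n)
{
    double h   = 1.0 / (double)n;          /* integration step on [0,1]             */
    double sum = 0.0;
    for (long long i = me; i <= n; i += nprocs) {   /* rank me takes every nprocs-th */
        double x = h * (double)(i - 1);    /* left edge of subinterval i             */
        sum += 0.5 * h * (f(x) + f(x + h));
    }
    return sum;                            /* partial sums are combined with MPI_Reduce */
}
```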

The me variable is the rank of the process (between 1 and 18). The n variable defines the computing precision and receives the value of the value parameter. The h variable is the step by which the x variable is incremented in the calculation of the integral area over 0→1.

After running the test program, the following results were obtained (trapezoid method) for n between 10^5 and 10^12 (Fig. 4, marked with circles). The graph in Fig. 4a shows that increasing the resolution increases the computing time. The error (Fig. 4b) decreases down to the value of 10^11 and then grows again because of the use of double precision variables. We used the notation SQRT(Time) to obtain a legible scale; similarly, for the error, we used (20+LOG(Error)) to obtain positive values on the vertical axis.

Fig. 4 Comparison between values of time (a) and error (b) obtained by two methods: classical MPI with 18 PPU threads and the optimal unbalanced MPI-SDK method with 18 PPU threads and 54 SPU threads (trapezoid method)

3.2 Trapezoid area calculation method using MPI-SDK distribution

We want to use the SPU cores in calculating the approximate value of π. Thus, the even process on each PPU starts the six SPU processes, for a total of 6×9 = 54 processes. After initialization, the mailbox is used to send the values required for the calculation of a certain segment of the integral area. With a total of 18 PPU + 54 SPU = 72 processing units, a yield of approximately 400 % was hoped for. We first designed an algorithm for a balanced distribution of the computation fractions over the processing units: the MPI library was used to send the computing values to every node and the Cell SDK 3.0 libraries were used to distribute the fractions to the SPUs within each node. The final result was three times worse. The reason is that the SPU units do not have a double precision computing unit, so these calculations are performed by using the single precision unit several times. Thus, we designed an architecture for the dynamic unbalanced distribution of the computing fractions between the PPU and SPU units. Below we show the algorithm that divides the 0→1 interval between the PPU and SPU units depending on the coefficient k. The value of n′ is recalculated to a value as close as possible to the received argument (between 10^5 and 10^12) but equally distributable among all the PPUs and SPUs. The mathematical formulas for calculating the number of intervals for each process are listed below:

$$ \mathit{nppu}= \biggl[\frac{n-\frac{n}{k}}{\mathit{nprocs}} \biggr]\cdot\mathit{nprocs} $$
(1)
$$ n'= \biggl( \biggl[\frac{\mathit{nppu}}{\mathit{nprocs}} \biggr]+ \biggl[ \frac {n-\mathit{nppu}}{\mathit{nprocs}} \biggr] \biggr)\cdot\mathit{nprocs} $$
(2)
$$ \mathit{nspu}=n'-\mathit{nppu} $$
(3)
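
Read literally, with [·] taken as the integer part (floor), Eqs. (1)–(3) translate into the following sketch; the function name split_work is introduced here for illustration only.

```c
#include <math.h>

/* Sketch of Eqs. (1)-(3), reading [.] as the integer part (floor). */
void split_work(long long n, double k, long long nprocs,
                long long *nppu, long long *nprime, long long *nspu)
{
    /* Eq. (1): points assigned to the PPU threads, rounded down to a multiple of nprocs */
    *nppu   = (long long)floor((n - (double)n / k) / (double)nprocs) * nprocs;
    /* Eq. (2): recomputed total n', also a multiple of nprocs */
    *nprime = (*nppu / nprocs + (n - *nppu) / nprocs) * nprocs;
    /* Eq. (3): the remaining points go to the SPU threads */
    *nspu   = *nprime - *nppu;
}
```

For example, n=10^6, k=7.3, and nprocs=18 give nppu=863010, n′=999990, and nspu=136980, so both shares are exact multiples of the 18 processes.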

A graphical representation of distributions is given in Fig. 5.

Fig. 5 Distribution of segments dedicated to each node

The coefficient k is determined by the algorithm through successive tests on small values of n, selecting the value that achieves the optimal time over a series of count steps (Fig. 6). The k calculation has two phases: first, the determination of k, and second, the calculation of π using the k coefficient for the optimal distribution. The first phase starts with n=10^3; the algorithm then determines the mean calculation times of the PPU threads and of the SPU threads. If these two mean times are significantly different, the k coefficient is increased or decreased until the means are approximately equal. Once k has been determined for n=10^3 with count=0, the algorithm sets n=10^4 and count=20 and restarts a finer calculation of k.
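
A hedged sketch of this calibration loop (the flowchart of Fig. 6) is given below. The 5 % margin and the n=10^3 to n=10^4 refinement come from the text; the adjustment step sizes and the timing helpers mean_ppu_time/mean_spu_time are assumptions introduced for illustration.

```c
#include <math.h>

/* Hedged sketch of the k-calibration loop of Fig. 6; step sizes and the
 * timing helpers below are assumptions, not the authors' implementation. */
extern double mean_ppu_time(long long n, double k);   /* hypothetical measurements */
extern double mean_spu_time(long long n, double k);

double calibrate_k(double k_initial)
{
    double k = k_initial;
    for (long long n = 1000; n <= 10000; n *= 10) {   /* coarse pass, then a finer one */
        double step = (n == 1000) ? 0.5 : 0.1;        /* assumed adjustment steps      */
        for (;;) {
            double t_ppu = mean_ppu_time(n, k);
            double t_spu = mean_spu_time(n, k);
            double diff  = (t_ppu - t_spu) / t_ppu;
            if (fabs(diff) <= 0.05)                   /* PPU and SPU finish within 5 % */
                break;
            /* a larger k gives the SPUs less work, so raise k when they lag behind */
            k += (diff < 0.0) ? step : -step;
        }
    }
    return k;
}
```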

Fig. 6 The chart for determining the coefficient k and calculating π

After k has been determined, the value of n is gradually increased from 10^5 to 10^12 to obtain a better approximation of π. The optimal value of k was determined to be approximately 6<k<10; for values of k greater than 10, the effectiveness of using the SPU units decreases. The chart used to determine the coefficient k is shown in Fig. 6. The nppu variable represents the number of points calculated by the PPU units, the nspu variable represents the number of points calculated by the SPU units, and the nprocs variable is the number of processes launched on the cluster system (between 1 and 18).

Figure 4 (marked with plus signs) shows the results from running the test programs ppu.c and spu.c on 18 threads (9×2 PPU) and 54 threads (9×6 SPU) with the optimal unbalance ratio determined for small values of n, k=7.3, and n′ between 10^5 and 10^12. It can be seen that the computing time grows with the same shape as the curve marked with circles. It can be concluded that the coefficient k, determined for the optimal time at small values of n, remains optimal for large values of n, i.e., it scales linearly with increasing resolution.

Figure 4 also shows the results obtained from comparing the two ways of calculating π with the trapezoid method (classical MPI with 18 PPU threads and optimized unbalanced MPI-SDK with 18 PPU threads and 54 SPU threads). Several values are presented:

  • Pi→20+LOG(Error) obtained for the classical MPI version.

  • Err 7.3→20+LOG(Error) obtained for the optimized unbalanced MPI-SDK version with coefficient k=7.3.

An improvement of the calculation accuracy can be seen for the optimized unbalanced MPI-SDK version with coefficient k=7.3.

  • Pi→SQRT(Time(s)) obtained for the classical MPI version.

  • Time 7.3→SQRT(Time(s)) obtained for the optimized unbalanced MPI-SDK version with coefficient k=7.3.

Comparing the two sets of values, we can observe an improvement of approximately 20 % in the calculation time for the optimized unbalanced MPI-SDK version with coefficient k=7.3. We can draw the following conclusions:

  • Using the SPUs for double precision calculation greatly delays their performance.

  • Making an unbalanced PPU-SPU distribution brings an improvement in computing time in comparison with the use of PPU units only.

  • The distribution coefficient must be determined for a small number of calculation points (a few tenths of a second) and then used for the high resolution calculation (10^12, several hours).

  • The distribution coefficient scales linearly with increasing resolution.

3.3 Rectangle area calculation method using MPI distribution

To check whether the value determined for k depends on the double precision computing power of the PPU and SPU units and on the number of double precision operations performed on these units, we continued with the implementation of an algorithm that calculates the integral area by the rectangle method (Fig. 7).

Fig. 7 Formula for π calculation using rectangle method

Two processes are distributed to each node (PPU). Each of these processes calculates an equally distributed part of the integral, defined by the following block:

figure b (C code for the per-thread rectangle calculation)
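
As above, the listing is not reproduced, so the following is a hedged reconstruction of the per-thread rectangle (midpoint) loop under the same strided-index assumption; the half-step variable w is explained below.

```c
/* Hedged reconstruction of the per-thread rectangle (midpoint) loop ("figure b"). */
double partial_pi_rectangle(int me, int nprocs, long long n)
{
    double h   = 1.0 / (double)n;          /* integration step                   */
    double w   = 0.5 * h;                  /* half step: offset to the midpoint  */
    double sum = 0.0;
    for (long long i = me; i <= n; i += nprocs) {
        double x = h * (double)(i - 1) + w;       /* midpoint of subinterval i    */
        sum += h * (4.0 / (1.0 + x * x));         /* one evaluation per interval  */
    }
    return sum;                            /* partial sums are combined with MPI_Reduce */
}
```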

The variable w is the half-step increment. The following results were obtained after running the test program (rectangle method) for n between 10^5 and 10^12 (Fig. 8, marked with circles). We notice an improvement in execution time and calculation error with respect to the MPI trapezoid method, due to fewer calculations. As in the trapezoid method, we want to use the SPU cores in the approximate calculation of π; the aim is to determine the distribution coefficient k and to check whether it depends on the complexity of the mathematical calculation.

Fig. 8 Comparison between values of time (a) and error (b) obtained by two methods: classical MPI with 18 PPU threads and the optimal unbalanced MPI-SDK method with 18 PPU threads and 54 SPU threads (rectangle method)

3.4 Rectangle area calculation method using MPI-SDK distribution

After running the program, we determined the optimal ratio k=9.3 for n between 10^5 and 10^12 (Fig. 8, marked with minus signs). Figure 8 shows the results obtained by comparing the two ways of calculating the value of π with the rectangle method (classical MPI with 18 PPU threads and optimized unbalanced MPI-SDK with 18 PPU threads and 54 SPU threads).

As with the trapezoid method, we found an improvement of about 10 % in the calculation time between the MPI distribution on the PPUs only and the unbalanced MPI-SDK distribution between the PPUs and SPUs.

A comparison between the two MPI calculation variants, the trapezoid method and the rectangle method, is shown in Fig. 9.

Fig. 9 Comparison between values of time (a) and error (b) obtained by the two MPI methods: the trapezoid method and the rectangle method

As can be seen from Fig. 9, there is a significant improvement in computing time for the same distribution (MPI) between the two methods (trapezoid vs. rectangle) because fewer mathematical computations are executed in the second method.

A comparison between the unbalanced MPI-SDK trapezoid method (k=7.3) and the unbalanced MPI-SDK rectangle method (k=9.3) is shown in Fig. 10.

Fig. 10 Comparison between values of time (a) and error (b) obtained by the MPI-SDK unbalanced trapezoid method k=7.3 and rectangle method k=9.3

As above, Fig. 10 reveals a 25 % improvement in computing time for the same distribution (unbalanced MPI-SDK) between the two methods (trapezoid vs. rectangle). This is because, on the one hand, fewer mathematical calculations are executed in the second method and, on the other hand, the modified coefficient k makes the PPU units take on more of the calculation in the second method, minimizing the calculation time. To prove this, Fig. 11 shows a graph of the calculation time for the rectangle method in the unbalanced MPI-SDK distribution for k=7.3 and k=9.3.

Fig. 11 Comparison of computing time for the rectangle method in MPI-SDK unbalanced distribution for k=7.3 and k=9.3

4 Conclusions

By introducing the SPU units into the calculation of the value of π and using an unbalanced load distribution between the PPU and SPU units, we obtain the values of calculation time and error shown in Fig. 12. The following conclusions are derived from the analysis of the two graphs:

  • The introduction of the SPU units into the double precision (DP) calculation improves the calculation time by about 50 %, but this is well below the potential of the SPU units if they had a DP unit.

  • We also obtain a lower error in the calculation of the value of π.

  • To obtain maximum efficiency in computing time, an unbalanced distribution of the calculations between the SPU and PPU units must be used.

  • The distribution coefficient k takes different values depending on the number and complexity of the calculations.

  • The SPU units can still be used in double precision calculations, but their share of the load should be kept at a lower level.

This algorithm can be used to increase the speed of double precision calculation in scientific computing.

Fig. 12 The gain in time (a) and error (b) obtained by MPI-SDK unbalanced k=9.3 versus the classical MPI

5 Future work

In [1], the authors present a comparison of two phylogenetic tree inference codes, RAxML and PBPI. They present experimental data comparing the performance of a PS3 cluster with that of QS20 dual-Cell/B.E. blades with the same total number of SPEs, and they observed that the performance of the PS3 cluster is reasonably close to the performance of the QS20 cluster. We made the same comparison with our algorithm and observed that k is very dependent on the hardware architecture: for example, with k=1.1, Time PS3 > Time QS22, while with k=7.3, Time PS3 < Time QS22. We also want to use SIMD-ization and loop unrolling in the calculation performed by the SPU units, which is expected to increase the calculation speed by about 4 times. This will amend the PPU-SPU load balance so that the SPU units will process a greater number of computing intervals (Fig. 13).

Fig. 13 Comparison between values of time (a) and error (b) obtained by MPI-SDK unbalanced rectangle method k=9.3 and SIMD rectangle method k=1.1

As can be observed, the k coefficient was determined to be 1.1 for the SIMD version, which means that the PPU and SPU units receive approximately the same load.