Dynamic, unbalanced distribution of tasks on a PS3 cluster system for double precision calculation
Abstract
This article describes an algorithm for the dynamic distribution of tasks in a parallel computing system consisting of a network of 9 PlayStation3 (PS3) stations. Because the SPU cores do not have double precision units, double precision calculations take longer on them. An unbalanced distribution of computing tasks is therefore necessary on a heterogeneous cluster such as the 9×PS3 system. The algorithm estimates the value of π by calculating the integral \(\pi= \int_{0}^{1} \frac{4}{1+x^{2}}\,dx\) and first makes an empirical PPU-SPU distribution of tasks on a small number of intervals. After the optimum load split between the 18 PPU (PowerPC Processor Unit) threads and the 54 SPUs (Synergistic Processor Units) has been determined, the integral calculation moves on to a larger number of intervals.
Keywords
Cell B.E. Cluster PS3 π calculation Unbalanced distribution Double precision
1 Introduction
The authors set out to build a heterogeneous multicore/manycore computing system and to implement a dynamic distribution algorithm for double precision calculations. To that end, an HPC system consisting of 9 PS3 systems (one of them the master node) connected through a gigabit router was assembled and set up. After the operating system was installed on each node, the OpenMPI library and the Cell SDK 3.0 development tools were installed. The algorithm calculates an approximate value of π using the formula \(\pi= \int_{0}^{1} \frac{4}{1+x^{2}}\,dx\). The MPI library was initially used to distribute the calculation area over 18 threads with the command mpirun -np 18 pi value, where value is the computing resolution, chosen from the following values: 10^{5}, 10^{6}, 10^{7}, 10^{8}, 10^{9}, 10^{10}, 10^{11}, 10^{12}. For each resolution, we obtained different values of error and computing time. After these tests, the SPU cores on each node (9×6 SPUs) were brought into the calculation, so that the integration area is distributed over the 72 processes available on the 9×PS3 cluster. Because the SPU cores do not have a double precision unit, these calculations are performed with great delay. This leaves two options: the first is to go back to distributing the calculations over the 18 PPU processes only; the second is to include the SPUs but give them a smaller amount of calculation than the PPU units. We implemented an adaptive algorithm that distributes the total amount of computation between the PPU and SPU units. The distribution ratio is given by a coefficient k that is determined empirically, from a mathematical formula, for a small number of integration intervals. Once determined, k is applied automatically to all resolutions in the range 10^{5}–10^{12}.
The coefficient k is determined for the minimum time, that is, when the PPU and SPU units complete their calculations in equal time (within a margin of 5 %). The tests show that k takes different values depending on the number of mathematical operations performed. The calculation of k is presented for two methods of calculating the integral, the trapezoid method and the rectangle method, together with the experimental results obtained with two distribution models: MPI (PPU threads only) and MPI-SDK (PPU and SPU threads).
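For reference, the two quadrature rules named above can be written as follows. This is the standard textbook form with n equal subintervals; the symbols f, h, and x_i are notation assumed here rather than taken from the authors' listings, and the rectangle rule is shown in its midpoint variant, which is also an assumption.

\[ f(x) = \frac{4}{1+x^{2}}, \qquad h = \frac{1}{n}, \qquad x_{i} = i\,h \]

\[ \text{Rectangle (midpoint):} \quad \pi \approx h \sum_{i=0}^{n-1} f\left(x_{i} + \frac{h}{2}\right) \]

\[ \text{Trapezoid:} \quad \pi \approx \frac{h}{2} \sum_{i=0}^{n-1} \left[ f(x_{i}) + f(x_{i+1}) \right] \]

Written this way, the rectangle rule evaluates f once per subinterval, while the trapezoid rule in this naive form evaluates it twice.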
2 Experimental resources
2.1 PS3 cluster architecture and setting

The PS3 is an open platform, meaning that it can run different operating systems (e.g., Fedora Core 8 for PPC).

The PS3 system contains a Cell B.E. processor that differs slightly from the original (1×PPU and 6×SPU).

Its very low price (approximately $300) makes it very attractive as a computing node in a cluster system.

Formatting the 9 PS3 nodes.

Installing the operating system Fedora Core 8 for PPC 64.

Installing the SSH service and NFS on each station, used for MPI communication and file sharing.

Setting up NFS with the master node as server and the other eight stations as clients.

Installing and configuring the OpenMPI library on all stations.

Installing Cell SDK 3.0 on each station.

\(t_{\mathrm{setup}}(n)\): time to set up the nodes and initiate the PPU and SPU threads.

\(\max(t_{\mathrm{com}}(i))\): maximum of the times to transmit the initial parameters to the PPU threads (MPI library).

\(t_{\mathrm{SPU}}(6)\): maximum of the times to transmit the initial parameters to the SPU threads (Cell SDK library).

\(\max(t_{\mathrm{calculation}}(1/n))\): maximum time to calculate the π fractions.

\(t_{\mathrm{PPU}\pi}\): time to add the π fractions from the SPUs on the PPU.

\(\max(t_{\mathrm{comR}}(i))\): maximum time to send the partial result from the nodes to the master node using the MPI library.

\(t_{\mathrm{master\_node}\,\pi}\): time to add the π fractions from the nodes on the master node.
Even if the number of nodes is increased (e.g., to 23×23 nodes in a two-level tree architecture), the total calculation time decreases through \(t_{\mathrm{calculation}}(1/n)\) but increases through the time needed to set up the 1058 threads on 529 nodes.
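Read together, the terms listed above suggest an additive model for the total run time. The sum below is an assumption made for illustration (the text does not state how the terms combine); it treats the phases as strictly sequential, with no overlap between communication and computation.

\[ T_{\mathrm{total}} \approx t_{\mathrm{setup}}(n) + \max_{i} t_{\mathrm{com}}(i) + t_{\mathrm{SPU}}(6) + \max t_{\mathrm{calculation}}(1/n) + t_{\mathrm{PPU}\pi} + \max_{i} t_{\mathrm{comR}}(i) + t_{\mathrm{master\_node}\,\pi} \]

In the 23×23 example, 529 nodes with two PPU processes each give 2 × 529 = 1058 threads, so under this model the setup term grows while the calculation term shrinks.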
2.2 Cell architecture
The Cell processor is a heterogeneous multicore architecture developed by an IBM-Sony-Toshiba consortium. It is built around a 64-bit PowerPC processor (PPE), eight SIMD computing cores (SPEs), a memory controller, and an I/O controller interface. Communication between the PPU and SPU computing elements goes through a high-speed bus, the Element Interconnect Bus (EIB). At a clock frequency of 3.2 GHz, the maximum theoretical single precision (SP) performance of one SPE is 25.6 GFlops, which gives 204.8 GFlops overall for 8 SPUs. In double precision (DP), the theoretical maximum performance is 12.8 GFlops for a single SPE and 102.4 GFlops for 8 SPUs [8].
The EIB provides a maximum bandwidth of 204.8 GB/s for on-chip data transfer between the PPU, the SPUs, the memory interface, and the I/O controller. The memory controller interface provides a bandwidth of 25.6 GB/s to main memory. The PPU runs the operating system and drives the SPUs. The PPU memory hierarchy is similar to that of conventional processors, with a 32 KB L1 cache and a 512 KB L2 cache.
SPEs are designed for high-performance processing of massive data and for intensive calculation. The SPE memory hierarchy consists of a set of 128 × 128-bit SIMD registers, 256 KB of local memory (LS, Local Store), and off-chip main memory shared through the PPU. SIMD operations can run at four different granularities: 16 × 8-bit integers, 8 × 16-bit integers, 4 × 32-bit integers or single precision floating point numbers, and 2 × 64-bit double precision floating point numbers. The local store (LS) of an SPE holds both code and data, and an SPE can directly access only the code and data in its own local store.
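The four SIMD granularities follow directly from dividing the 128-bit register width by the element width. The short Python sketch below only illustrates that arithmetic; the function name simd_lanes is hypothetical and is not part of the Cell SDK.

```python
REGISTER_BITS = 128  # width of one SPE SIMD register, as stated in the text


def simd_lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must be a positive divisor of 128")
    return REGISTER_BITS // element_bits


# Element widths named in the text: 8-, 16-, 32- and 64-bit.
lanes = {bits: simd_lanes(bits) for bits in (8, 16, 32, 64)}
print(lanes)
```

The 64-bit row is the one that matters for this article: a double precision operation can process at most two values per register.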
Data transfers between the LS memories and main memory, as well as transfers between LS memories, are carried out by DMA. DMA transfers are asynchronous and allow SPEs to overlap computation with data transfer. LS memory management is done exclusively in software. Mailbox mechanisms are provided for communication between the PPE and the SPEs. Each mailbox can hold up to four 32-bit data elements at any time [4].
3 Experimental work
The algorithm described here can be ported to other AMP (asymmetric multicore platform) architectures that include processors at two performance levels, fast cores and slow cores, such as dual-ISA heterogeneous multicore architectures [2]. The k coefficient is used for the unbalanced distribution of tasks between PPUs and SPUs so that their working time and energy consumption are balanced. This applies only to systems with performance asymmetry (SPU as slow core and PPU as fast core) [7]. In fact, the SPUs are much faster, but they are designed for image processing. They are slower in mathematical calculation because they do not have a double precision unit (only in PS3 systems).
In [6], the authors present an implementation of core features of MPI for Cell. That implementation views each SPU as a node for an MPI process. In our implementation, we use not only the SPUs but also the PPU units. Two Linux threads are started on every PPU. The application is started on the master node with the command mpirun -np 18 pi value. The -np 18 argument means that the master node initiates 2 processes on every node using the MPI library. At the same time, the master node sends to the other nodes, using the MPI library, the k coefficient and an equal share of the total amount of calculation. The even thread on every PPU (every node) starts one thread on every SPU and sends each SPU thread its amount of calculation through the SDK mailbox facility. The amount of calculation for the SPU threads is determined using the k coefficient. After that, the even PPU thread performs its own calculation, waits to receive the fraction of π from every SPU thread, and adds them up (using the same SDK mailbox facility). The odd PPU thread performs only its own π fraction calculation. Finally, the master node receives the fractions of π from all nodes (18 threads) and adds them up (using the MPI library).
The k coefficient depends on the ratio of computing power between the fast cores (PPU) and the slow cores (SPU) and on the number of each. Even if we increase the number of nodes, the optimal k stays the same, because all nodes have the same hardware architecture (k does not change as the system scales). In [9], the authors use a variety of algorithms for a few communication operations. In our case, we do not need these types of communication, because we use only mailbox communication to send each SPU the start, stop, and resolution values for the interval of the integral assigned to it.
3.1 Trapezoid area calculation method using the MPI distribution
The me variable is the rank of the process (between 1 and 18). The n variable defines the computing precision and takes the value of the value parameter. The h variable is the step by which the x variable is incremented when calculating the area under the integrand over 0→1.
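A minimal Python sketch of this distribution is given below; it is not the authors' MPI program. It reuses the names me, n, and h from the description, but simulates the 18 processes with a loop instead of MPI, uses 0-based ranks, and assumes that each process receives a contiguous block of subintervals. The helper names f, trapezoid_block, and estimate_pi are hypothetical.

```python
import math


def f(x: float) -> float:
    """Integrand 4 / (1 + x^2); its integral over [0, 1] is pi."""
    return 4.0 / (1.0 + x * x)


def trapezoid_block(me: int, nprocs: int, n: int) -> float:
    """Partial trapezoid sum for process `me` (0-based) out of `nprocs`."""
    h = 1.0 / n  # step by which x is incremented
    # Assumption: contiguous block of subintervals [start, stop) per process.
    start = me * n // nprocs
    stop = (me + 1) * n // nprocs
    total = 0.0
    for i in range(start, stop):
        x = i * h
        total += 0.5 * h * (f(x) + f(x + h))
    return total


def estimate_pi(n: int, nprocs: int = 18) -> float:
    """Sum the partial results, as the master node would after receiving them."""
    return math.fsum(trapezoid_block(me, nprocs, n) for me in range(nprocs))


pi_trap = estimate_pi(100_000)
print(pi_trap, abs(pi_trap - math.pi))
```

In an MPI program the loop over me would disappear: each process would compute only its own block and the final sum would be a reduction on the master node.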
3.2 Trapezoid area calculation method using MPISDK distribution
After the k coefficient has been determined, the value of n is gradually increased from 10^{5} to 10^{12} to obtain a better approximation of π. The optimal value of k was found to lie approximately in the range 6<k<10. For values of k greater than 10, the effectiveness of using the SPU units decreases. The chart used to determine the coefficient k is shown in Fig. 6. The nppu variable is the number of points calculated by a PPU unit. The nspu variable is the number of points calculated by an SPU unit. The nprocs variable is the number of processes launched on the cluster system (between 1 and 18).
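One way to picture how k turns into nppu and nspu is sketched below in Python. The authors' formula is not given in the text, so this rests on two assumptions: k is read as the ratio nppu / nspu, and the split is made per node over 2 PPU threads and 6 SPU threads. The function name split_node_work and the parameter n_node (points assigned to one node) are hypothetical.

```python
def split_node_work(n_node: int, k: float,
                    ppu_threads: int = 2, spu_threads: int = 6):
    """Split n_node points so each PPU thread gets about k times an SPU's share.

    Returns (ppu_counts, spu_counts), two lists of per-thread point counts.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Assumption: nppu = k * nspu and ppu_threads*nppu + spu_threads*nspu = n_node.
    nspu = int(n_node / (ppu_threads * k + spu_threads))
    # Whatever the SPUs do not take goes to the PPU threads, so no point is lost.
    remaining = n_node - spu_threads * nspu
    nppu, extra = divmod(remaining, ppu_threads)
    ppu_counts = [nppu + (1 if t < extra else 0) for t in range(ppu_threads)]
    spu_counts = [nspu] * spu_threads
    return ppu_counts, spu_counts


ppu_counts, spu_counts = split_node_work(1_000_000, k=7.3)
print(ppu_counts, spu_counts)
```

With k close to 1 the eight threads of a node receive nearly equal shares; as k grows, the SPU share shrinks toward zero, which matches the observation that large k makes the SPUs less useful.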
Figure 4 (marked with plus) shows the results of running the test programs ppu.c and spu.c on 18 PPU threads (9×2) and 54 SPU threads (9×6) with the optimal unbalance ratio k=7.3, determined on small values of n, for n between 10^{5} and 10^{12}. The computing time grows along a curve of the same shape as the one marked with circles. We conclude that the coefficient k, determined for the optimal time at small values of n, remains optimal for large values of n, since the distribution scales linearly with increasing resolution.

Pi→20+LOG(Error) obtained for the classical version of MPI.

Err 7.3→20+LOG(Error) obtained for the optimized, unbalanced MPI-SDK version with a coefficient k=7.3.

Pi→SQRT(Time(s)) obtained for the classical version of MPI.

Time 7.3→SQRT(Time(s)) obtained for the optimized, unbalanced MPI-SDK version with a coefficient k=7.3.

Using SPUs in double precision calculations greatly degrades their performance.

An unbalanced PPU-SPU distribution improves computing time in comparison with the use of PPU units only.

The distribution coefficient must be determined for a small number of calculation points (a few tenths of a second) and then used for a high-resolution calculation (10^{12}, several hours).

The distribution defined by the coefficient scales linearly with increasing resolution.
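The calibration step described above (find k on a small problem, then reuse it) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: real timings are replaced by a hypothetical linear cost model with made-up per-point costs c_ppu and c_spu, k is read as the ratio of points per PPU thread to points per SPU thread, and the names node_time and calibrate_k are hypothetical.

```python
def node_time(k, n_node, c_ppu, c_spu, ppu_threads=2, spu_threads=6):
    """Model finish times (t_ppu, t_spu) of one node for a given k.

    Assumption: time is proportional to the number of points per thread.
    """
    nspu = n_node / (ppu_threads * k + spu_threads)
    nppu = k * nspu
    return nppu * c_ppu, nspu * c_spu


def calibrate_k(n_node, c_ppu, c_spu, candidates, margin=0.05):
    """Pick the candidate k with the smallest node finish time.

    Also reports whether PPU and SPU times agree within `margin` (5 % here).
    """
    best_k = min(candidates,
                 key=lambda k: max(node_time(k, n_node, c_ppu, c_spu)))
    t_ppu, t_spu = node_time(best_k, n_node, c_ppu, c_spu)
    balanced = abs(t_ppu - t_spu) <= margin * max(t_ppu, t_spu)
    return best_k, balanced


# Hypothetical costs in arbitrary units: an SPU point costs 8x a PPU point.
candidates = [x / 10 for x in range(10, 151)]  # k = 1.0, 1.1, ..., 15.0
best_k, balanced = calibrate_k(100_000, c_ppu=1.0, c_spu=8.0,
                               candidates=candidates)
print(best_k, balanced)
```

Under this model the best k equals the cost ratio c_spu / c_ppu and does not depend on n_node, which is one way to read the claim that a k found at low resolution can be reused at high resolution. On real hardware the sweep would time actual runs instead of calling node_time.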
3.3 Calculation of the rectangle area method using MPI distribution
Two processes are distributed on each node (PPU). Each of these processes calculates an equally sized block of the integral.
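The block assignment can be sketched in Python as follows. This is a hedged illustration rather than the authors' listing: the 18 processes are simulated with a list comprehension instead of MPI, ranks are 0-based, the rectangle rule is taken in its midpoint variant, and the names f and rectangle_block are hypothetical.

```python
import math


def f(x: float) -> float:
    """Integrand 4 / (1 + x^2); its integral over [0, 1] is pi."""
    return 4.0 / (1.0 + x * x)


def rectangle_block(me: int, nprocs: int, n: int) -> float:
    """Midpoint-rectangle sum over the block of subintervals owned by `me`."""
    h = 1.0 / n
    # Assumption: process `me` owns the contiguous subintervals [start, stop).
    start = me * n // nprocs
    stop = (me + 1) * n // nprocs
    # One evaluation of f per subinterval, taken at its midpoint.
    return h * math.fsum(f((i + 0.5) * h) for i in range(start, stop))


# Two processes on each of the 9 nodes give 18 partial results.
partials = [rectangle_block(me, 18, 100_000) for me in range(18)]
pi_rect = math.fsum(partials)
print(pi_rect, abs(pi_rect - math.pi))
```

Each subinterval costs a single evaluation of f here, which is fewer operations per point than a trapezoid rule that evaluates both endpoints.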
3.4 Rectangle area calculation method using MPISDK distribution
After running the program, we determined the optimal ratio k=9.3 for n between 10^{5} and 10^{12} (Fig. 8, marked with minus). Figure 8 shows the results obtained by comparing the two distributions used in calculating the π value with the rectangle method: classical MPI (18 PPU threads) and optimized, unbalanced MPI-SDK (18 PPU threads and 54 SPU threads).
As with the trapezoid method, we found an improvement in calculation time of about 10 % between the MPI distribution on PPUs only and the unbalanced MPI-SDK distribution between PPUs and SPUs.
As can be seen from Fig. 9, there is a significant improvement in computing time for the same distribution (MPI) between the two methods (trapezoid and rectangle), because the second method performs fewer mathematical operations.
4 Conclusions

The introduction of the SPU units into double precision (DP) calculation improves calculation time by about 50 %, but this is well below the potential the SPU units would have if they had a DP unit.

We also obtain a lower error in calculating the π value.

To obtain maximum efficiency in computing time, the calculations must be distributed between SPU and PPU units according to an unbalanced formula.

The distribution coefficient k takes different values depending on the number and complexity of the calculations.

It is preferable to use the SPU units in double precision calculations, but with a smaller share of the work.
5 Future works
As we can observe, the k coefficient was determined as 1.1 for the SIMD version. This means that the PPU and SPU units receive approximately the same load.
Notes
Acknowledgements
This paper was supported by the project “Progress and development through postdoctoral research and innovation in engineering and applied sciences—PriDE—Contract no. POSDRU/89/1.5/S/57083”, project cofunded from European Social Fund through Sectoral Operational Program for Human Resources 2007–2013.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
References
 1. Blagojevic F, Curtis-Maury M, Yeom JS, Schneider S, Nikolopoulos DS (2008) Scheduling asymmetric parallelism on a PlayStation3 cluster. In: CCGRID ’08 proceedings of the 2008 eighth IEEE international symposium on cluster computing and the grid, pp 146–153
 2. Calandrino JM, Baumberger D, Li T, Hahn S, Anderson JH (2007) Soft real-time scheduling on performance asymmetric multicore platforms. In: Real time and embedded technology and applications symposium, pp 101–112
 3.
 4. IBM, Sony, Toshiba Corp. (2006–2008) Cell broadband engine programming handbook
 5. Koranne S (2009) Practical computing on cell broadband engine. Springer Science, Dordrecht
 6. Krishna M, Kumar A, Jayam N, Senthilkumar G, Baruah PK, Sharma R, Kapoor S, Srinivasan A (2007) A synchronous mode MPI implementation on the cell BE architecture. In: Parallel and distributed processing and applications, vol 4742. Springer, Berlin, pp 982–991
 7. Li T, Brett P, Hohlt B, Knauerhase R, McElderry SD, Hahn S (2010) Operating system support for shared-ISA asymmetric multicore architecture. In: Proceedings of the 16th IEEE international symposium on high-performance computer architecture, January 2010, pp 19–26
 8. Scarpino M (2008) Programming the cell processor: for games, graphics, and computation, vol 744. Springer Science, Upper Saddle River
 9. Velamati MK, Kumar A, Jayam N, Senthilkumar G, Baruah PK, Sharma R, Kapoor S, Srinivasan A (2007) Optimization of collective communication in intra-Cell MPI. In: High performance computing—HiPC 2007, vol 4873. Springer, Berlin, pp 488–499