Performance and Numerical Accuracy Evaluation of Heterogeneous Multicore Systems for Krylov Orthogonal Basis Computation
We study the numerical behavior of heterogeneous systems, such as CPUs combined with GPUs or IBM Cell processors, for some orthogonalization processes. We focus on the influence of the different floating-point arithmetic handling of these accelerators on Gram-Schmidt orthogonalization in single and double precision. For dense matrices we observe a loss of at worst 1 digit for CUDA-enabled GPUs together with a 20x speed-up, and a loss of 2 digits for the Cell processor with a 7x speed-up. For sparse matrices, the CPU and GPU results are very close and the speed-up is 10x. We conclude that the Cell processor is a good accelerator for double precision because of its full IEEE compliance in that precision, but is not sufficient for single precision applications. The GPU speed-up is better than the Cell's, and its decent IEEE support delivers results close to the CPU's in both precisions.
Keywords: parallel and distributed computing, numerical algorithms for CS&E, performance analysis
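The abstract refers to the loss of decimal digits when a Gram-Schmidt basis is computed in single versus double precision. As an illustration only (not the paper's code), the following NumPy sketch runs classical Gram-Schmidt on a random matrix and estimates the digits lost from the departure from orthogonality ||I - QᵀQ||; the matrix size, the choice of the classical (rather than modified) variant, and the digit-loss heuristic are assumptions.

```python
import numpy as np

def classical_gram_schmidt(A):
    """Orthonormalize the columns of A with classical Gram-Schmidt."""
    m = A.shape[1]
    Q = np.zeros_like(A)
    for j in range(m):
        v = A[:, j].copy()
        # Subtract the projections onto all previously computed basis vectors.
        for i in range(j):
            v -= (Q[:, i] @ A[:, j]) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q

def digits_lost(Q):
    """Estimate lost decimal digits from the orthogonality error ||I - Q^T Q||.

    The heuristic log10(error / machine epsilon) is illustrative only,
    not the metric used in the paper.
    """
    err = float(np.linalg.norm(np.eye(Q.shape[1], dtype=Q.dtype) - Q.T @ Q))
    eps = float(np.finfo(Q.dtype).eps)
    return np.log10(max(err, eps) / eps)

rng = np.random.default_rng(0)
A64 = rng.standard_normal((1000, 50))   # hypothetical test size, double precision
A32 = A64.astype(np.float32)            # the same data in single precision

print("float64 digits lost ~", digits_lost(classical_gram_schmidt(A64)))
print("float32 digits lost ~", digits_lost(classical_gram_schmidt(A32)))
```

Running both precisions on the same data makes the effect described in the abstract visible: the single-precision basis departs from orthogonality by several more digits than the double-precision one.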