Abstract
To meet the needs of high-performance computing, the Cell Broadband Engine has many features that differ from those of traditional processors, such as a large number of synergistic processor elements, large register files, and the ability to hide main-storage latency by running computation concurrently with DMA transfers. Exploiting those features requires the programmer to carefully tailor programs and simultaneously deal with various performance factors, including locality, load balance, communication overhead, and multi-level parallelism. These factors, unfortunately, depend on each other; an optimization that enhances one factor may degrade another. This paper presents our experience in optimizing LU decomposition, one of the commonly used linear algebra kernels in scientific computing, on the Cell Broadband Engine. The optimizations exploit task-level, data-level, and communication-level parallelism. We study the effects of different task distribution strategies, prefetch, and software cache, and explore the tradeoffs among different performance factors, stressing the interactions between different optimizations. This work offers some insights into optimization on heterogeneous multi-core processors, including the selection of programming models, considerations in task distribution, and the holistic perspective required in optimization.
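For readers unfamiliar with the kernel, the following is a minimal sketch of a blocked, right-looking LU decomposition without pivoting. It is an illustration only and is not taken from the paper: the function name `blocked_lu`, the block-size parameter `bs`, and the choice to omit pivoting are all assumptions made for brevity. The sketch shows why the kernel lends itself to task-level parallelism: the panel solves and the trailing-matrix update operate on independent blocks.

```python
def blocked_lu(a, bs):
    """Factor the square matrix `a` (list of lists of floats) in place.

    Hypothetical illustration: no pivoting, so every leading pivot must be
    nonzero. On return, the strict lower triangle holds L (unit diagonal
    implied) and the upper triangle holds U.
    """
    n = len(a)
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)

        # Step 1: factor the diagonal block A11 = L11 * U11.
        for k in range(k0, k1):
            if a[k][k] == 0:
                raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
            for i in range(k + 1, k1):
                a[i][k] /= a[k][k]
                f = a[i][k]
                for j in range(k + 1, k1):
                    a[i][j] -= f * a[k][j]

        # Step 2: row panel, U12 = L11^{-1} * A12 (forward substitution).
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                f = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= f * a[k][j]

        # Step 3: column panel, L21 = A21 * U11^{-1}.
        for i in range(k1, n):
            for k in range(k0, k1):
                a[i][k] /= a[k][k]
                f = a[i][k]
                for j in range(k + 1, k1):
                    a[i][j] -= f * a[k][j]

        # Step 4: trailing update, A22 -= L21 * U12. Each row (or tile) of
        # A22 is updated independently, so this is the step a parallel
        # implementation would typically distribute across workers.
        for i in range(k1, n):
            for k in range(k0, k1):
                f = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= f * a[k][j]
    return a


if __name__ == "__main__":
    m = [[4.0, 3.0, 2.0], [8.0, 8.0, 5.0], [4.0, 7.0, 9.0]]
    print(blocked_lu(m, 2))
```

The block size controls how much data a step touches at once; on a machine with small local memories, it would be chosen so that the blocks in flight fit in that memory, which is one of the locality tradeoffs the abstract refers to.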
Copyright information
© 2010 IFIP International Federation for Information Processing
About this paper
Cite this paper
Mao, F., Shen, X. (2010). LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors. In: Ding, C., Shao, Z., Zheng, R. (eds) Network and Parallel Computing. NPC 2010. Lecture Notes in Computer Science, vol 6289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15672-4_7
DOI: https://doi.org/10.1007/978-3-642-15672-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15671-7
Online ISBN: 978-3-642-15672-4
eBook Packages: Computer Science (R0)