# MulticoreBSP for C: A High-Performance Library for Shared-Memory Parallel Programming

## Abstract

The bulk synchronous parallel (BSP) model, and the parallel programming interfaces based on it, classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed the MulticoreBSP for Java library specifically for shared-memory architectures. In the present article, we investigate this concept further and introduce the new high-performance MulticoreBSP for C library. Among other features, this library supports nested BSP runs. We show that existing BSP software performs well regardless of whether it runs on distributed-memory or shared-memory architectures, and that applications written in MulticoreBSP can attain high performance. The paper details the implementation of the fast Fourier transform and of sparse matrix–vector multiplication in BSP, both of which outperform state-of-the-art implementations written in other shared-memory parallel programming interfaces. We furthermore study the applicability of BSP on highly non-uniform memory access architectures.

## Keywords

- High-performance computing
- Bulk synchronous parallel
- Shared-memory parallel programming
- Software library
- Fast Fourier transform
- Sparse matrix–vector multiplication

## Notes

### Acknowledgments

Part of this work is funded by Intel and by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).
