International Journal of Parallel Programming, Volume 42, Issue 4, pp 619–642

MulticoreBSP for C: A High-Performance Library for Shared-Memory Parallel Programming

  • A. N. Yzelman
  • R. H. Bisseling
  • D. Roose
  • K. Meerbergen


The bulk synchronous parallel (BSP) model and the parallel programming interfaces based on it classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed a MulticoreBSP for Java library specifically for shared-memory architectures. In the present article, we further investigate this concept and introduce the new high-performance MulticoreBSP for C library. Among other features, this library supports nested BSP runs. We show that existing BSP software performs well regardless of whether it runs on distributed-memory or shared-memory architectures, and that applications in MulticoreBSP can attain high performance. The paper details the implementation of the fast Fourier transform and of sparse matrix–vector multiplication in BSP, both of which outperform state-of-the-art implementations written in other shared-memory parallel programming interfaces. We furthermore study the applicability of BSP on highly non-uniform memory access architectures.


Keywords: High-performance computing; Bulk synchronous parallel; Shared-memory parallel programming; Software library; Fast Fourier transform; Sparse matrix–vector multiplication



Part of this work is funded by Intel and by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • A. N. Yzelman (1, 2), corresponding author
  • R. H. Bisseling (3)
  • D. Roose (2)
  • K. Meerbergen (2)

  1. Flanders ExaScience Lab (Intel Labs Europe), Heverlee, Belgium
  2. Department of Computer Science, KU Leuven, Heverlee, Belgium
  3. Department of Mathematics, Utrecht University, Utrecht, The Netherlands
