
A 2D algorithm with asymmetric workload for the UPC conjugate gradient method

Published in: The Journal of Supercomputing


Abstract

This paper examines four strategies, each with its own data distribution, for implementing the parallel conjugate gradient (CG) method, and analyzes how each affects communication and overall performance. First, typical 1D and 2D distributions of the matrix involved in the CG computations are considered. Then, a new 2D version of the CG method with asymmetric workload is proposed, which leaves some threads idle during part of the computation in order to reduce communication. The four strategies are independent of the sparse storage scheme and are implemented in Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language. They are evaluated on two different platforms using a set of matrices with distinct sparsity patterns, showing that the asymmetric proposal outperforms the others in all cases except for one matrix on one platform.





  1. For practical purposes, the algorithm is often used with preconditioners, but preconditioning is beyond the scope of this paper.





Acknowledgements

This work was funded by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P), by the Galician Government (Consolidation Program of Competitive Reference Groups GRC2013/055) and by the U.S. Department of Energy (Contract No. DE-AC03-76SF00098).

Author information



Corresponding author

Correspondence to Jorge González-Domínguez.


About this article


Cite this article

González-Domínguez, J., Marques, O.A., Martín, M.J. et al. A 2D algorithm with asymmetric workload for the UPC conjugate gradient method. J Supercomput 70, 816–829 (2014).
