The Journal of Supercomputing, Volume 70, Issue 2, pp 816–829

A 2D algorithm with asymmetric workload for the UPC conjugate gradient method

  • Jorge González-Domínguez
  • Osni A. Marques
  • María J. Martín
  • Juan Touriño


This paper examines four different strategies, each one with its own data distribution, for implementing the parallel conjugate gradient (CG) method and how they impact communication and overall performance. Firstly, typical 1D and 2D distributions of the matrix involved in CG computations are considered. Then, a new 2D version of the CG method with asymmetric workload, based on leaving some threads idle during part of the computation to reduce communication, is proposed. The four strategies are independent of sparse storage schemes and are implemented using Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language. The strategies are evaluated on two different platforms through a set of matrices that exhibit distinct sparse patterns, demonstrating that our asymmetric proposal outperforms the others except for one matrix on one platform.
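The difference between the typical 1D and 2D matrix distributions mentioned above can be illustrated with a small serial simulation. This is only a sketch under stated assumptions, not the paper's UPC implementation: the function names (`matvec_1d`, `matvec_2d`, `conjugate_gradient`), the dense list-of-lists matrix and the block sizes are hypothetical choices made for illustration, and the asymmetric-workload variant is not reproduced because the abstract does not describe its details. The comments mark where a real parallel version would typically need communication.

```python
def matvec_1d(A, x, threads):
    """Row-block (1D) distribution: simulated thread t owns a block of rows."""
    n = len(A)
    rows_per = (n + threads - 1) // threads  # ceiling division
    y = [0.0] * n
    for t in range(threads):
        lo, hi = t * rows_per, min(n, (t + 1) * rows_per)
        for i in range(lo, hi):
            # Each thread reads the whole of x, so a parallel version
            # would typically gather x from all threads first.
            y[i] = sum(A[i][j] * x[j] for j in range(n))
    return y


def matvec_2d(A, x, grid_rows, grid_cols):
    """Block (2D) distribution: simulated thread (p, q) owns one sub-block."""
    n = len(A)
    rb = (n + grid_rows - 1) // grid_rows
    cb = (n + grid_cols - 1) // grid_cols
    y = [0.0] * n
    for p in range(grid_rows):
        for q in range(grid_cols):
            r0, r1 = p * rb, min(n, (p + 1) * rb)
            c0, c1 = q * cb, min(n, (q + 1) * cb)
            for i in range(r0, r1):
                # Each thread reads only a slice of x ...
                partial = sum(A[i][j] * x[j] for j in range(c0, c1))
                # ... but partial results must be summed across the grid
                # row; this += stands in for that reduction.
                y[i] += partial
    return y


def dot(u, v):
    # In a parallel version this would be a global reduction.
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, matvec, tol=1e-10, max_iter=100):
    """Textbook CG for a symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A*x with x = 0
    p = list(r)          # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Both simulated distributions compute the same product, so the solver can be called with either, for example `conjugate_gradient(A, b, lambda M, v: matvec_2d(M, v, 2, 2))`. What changes between them in a real parallel run is which data each thread must fetch and which partial results must be combined.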


Keywords: Conjugate gradient, PGAS, UPC, Performance optimization, Data distribution



This work was funded by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P), by the Galician Government (Consolidation Program of Competitive Reference Groups GRC2013/055) and by the U.S. Department of Energy (Contract No. DE-AC03-76SF00098).



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Jorge González-Domínguez (1)
  • Osni A. Marques (2)
  • María J. Martín (3)
  • Juan Touriño (3)

  1. Parallel and Distributed Architectures Group, Johannes Gutenberg University-Mainz, Mainz, Germany
  2. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, USA
  3. Computer Architecture Group, University of A Coruña, A Coruña, Spain
