The Journal of Supercomputing

, Volume 70, Issue 2, pp 786–798 | Cite as

Optimizing an APSP implementation for NVIDIA GPUs using kernel characterization criteria

  • Hector Ortega-Arranz
  • Yuri Torres
  • Arturo Gonzalez-Escribano
  • Diego R. Llanos


During the last years, GPU manycore devices have demonstrated their usefulness to accelerate computationally intensive problems. Although arriving at a parallelization of a highly parallel algorithm is an affordable task, the optimization of GPU codes is a challenging activity. The main reason for this is the number of parameters, programming choices, and tuning techniques available, many of them related with complex and sometimes hidden architecture details. A useful strategy to systematically attack these optimization problems is to characterize the different kernels of the application, and use this knowledge to select appropriate configuration parameters. The All-Pair Shortest-Path (APSP) problem is a well-known problem in graph theory whose objective is to find the shortest paths between any pairs of nodes in a graph. This problem can be solved by highly parallel and computational intensive tasks, being a good candidate to be exploited by manycore devices. In this paper, we use kernel characterization criteria to optimize an APSP algorithm implementation for NVIDIA GPUs. Our experimental results show that the combined use of proper configuration policies, and the concurrent kernels capability of new CUDA architectures, leads to a performance improvement of up to 62 % with respect to one of the possible configurations recommended by CUDA, considered as baseline.


APSP Cache configuration Concurrent kernel GPU  Kernel characterization Threadblock size 



This research has been partially supported by Ministerio de Economía y Competitividad (Spain) and ERDF program of the European Union: CAPAP-H4 network (TIN2011-15734-E), MOGECOPP project (TIN2011-25639); and Junta de Castilla y León (Spain) ATLAS project (VA172A12-2).


  1. 1.
    Barceló J, Codina E, Casas J, Ferrer JL, García D (2005) Microscopic traffic simulation: a tool for the design, analysis and evaluation of intelligent transport systems. J Intell Robot Syst 41:173–203CrossRefGoogle Scholar
  2. 2.
    Cormen TH, Stein C, Rivest RL, Leiserson CE (2001) Introduction to algorithms, 2nd edn. McGraw-Hill Higher Education, Burr Ridge, Il 60521Google Scholar
  3. 3.
    Crauser A, Mehlhorn K, Meyer U, Sanders P (1998) A parallelization of Dijkstra’s shortest path algorithm. In: Brim L, Gruska J, Zlatuška J (eds) Mathematical foundations of computer science 1998, LNCS, vol 1450. Springer, Berlin, pp 722–731CrossRefGoogle Scholar
  4. 4.
    Dasgupta A (2011) CUDA performance analyzer. Ph.D. thesis, School of Electrical and Computer Engineering, Georgia Institute of TechnologyGoogle Scholar
  5. 5.
    Dijkstra EW (1959) A note on two problems in connexion with graphs. Numer Math 1:269–271MathSciNetCrossRefMATHGoogle Scholar
  6. 6.
    Farooqui N, Kerr A, Diamos G, Yalamanchili S, Schwan K (2011) A framework for dynamically instrumenting GPU compute applications within GPU Ocelot. In: Proceedings of 4th workshop on GPGPU, GPGPU-4, x. ACM, New York, NY, pp 9:1–9:9Google Scholar
  7. 7.
    Grauer-Gray S, Xu L, Searles R, Ayalasomayajula S, Cavazos J (2012) Auto-tuning a high-level language targeted to GPU codes. InPar 2012:1–10Google Scholar
  8. 8.
    Harris M (2008) Optimizing parallel reduction in CUDA. NVIDIAGoogle Scholar
  9. 9.
    Kirk DB, Hwu WW (2010) Programming massively parallel processors: a hands-on approach. Morgan Kaufmann, San Francisco, CA, USA, p 258Google Scholar
  10. 10.
    Martín P, Torres R, Gavilanes A (2009) CUDA solutions for the SSSP problem. In: Allen G, Nabrzyski J, Seidel E, van Albada G, Dongarra J, Sloot P (eds) Computational science—ICCS 2009, LNCS, vol 5544. Springer, Berlin, pp 904–913Google Scholar
  11. 11.
    Nobari S, Lu X, Karras P, Bressan S (2011) Fast random graph generation. In: Proceedings of 14th international Conference on EDBT/ICDT ’11. ACM, NY, pp 331–342Google Scholar
  12. 12.
    Ortega-Arranz H, Torres Y, Llanos DR., Gonzalez-Escribano A (2013) A new GPU-based approach to the shortest path problem. In: High performance computing and simulation (HPCS), 2013 international Conference on, pp 505–512Google Scholar
  13. 13.
    Rétvári G, Bíró JJ, Cinkler T (2007) On shortest path representation. IEEE ACM Trans Netw 15:1293–1306CrossRefGoogle Scholar
  14. 14.
    Torres Y, González-Escribano A, Llanos DR (2012) uBench: performance impact of CUDA block geometry. In: Techniocal report IT-DI-2012-0001, Universidad de ValladolidGoogle Scholar
  15. 15.
    Torres Y, Gonzalez-Escribano A, Llanos DR (2013) uBench: exposing the impact of CUDA block geometry in terms of performance. J Supercomput 65:1–14Google Scholar
  16. 16.
    Williams S, Waterman A, Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun ACM 52(4):65–76CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Hector Ortega-Arranz
    • 1
  • Yuri Torres
    • 1
  • Arturo Gonzalez-Escribano
    • 1
  • Diego R. Llanos
    • 1
  1. 1.Dpto. InformáticaUniversidad de ValladolidValladolidSpain

Personalised recommendations