
VGL: a high-performance graph processing framework for the NEC SX-Aurora TSUBASA vector architecture


Developing efficient implementations of graph algorithms is an important problem in modern computer science, since graphs are used in many real-world applications. Graph algorithms are typically data-intensive, so architectures with high-bandwidth memory can potentially solve many graph problems significantly faster than modern multicore CPUs. Among supercomputer architectures, vector systems such as the SX family of NEC vector supercomputers are equipped with high-bandwidth memory. However, the highly irregular structure of many real-world graphs makes it extremely challenging to implement graph algorithms on vector systems: such implementations are usually bulky and complicated, and they require a deep understanding of the hardware features of vector architectures. This paper presents the world's first attempt to develop a graph processing framework for modern vector systems that is both efficient and simple. Our vector graph library (VGL) framework targets the NEC SX-Aurora TSUBASA as its primary vector architecture and provides relatively simple computational and data abstractions. These abstractions incorporate many vector-oriented optimization strategies into a high-level programming model, allowing new graph algorithms to be implemented quickly, with a small amount of code and minimal knowledge of the features of vector systems. In this paper, we evaluate VGL performance on four widely used graph processing problems: breadth-first search, single-source shortest paths, connected components, and PageRank. The comparative performance analysis demonstrates that VGL-based implementations achieve significant acceleration over existing high-performance frameworks and libraries: up to a 14-fold speedup over multicore CPU implementations (Ligra, Galois, GAPBS) and up to a 3-fold speedup over NVIDIA GPU implementations (Gunrock, NVGRAPH).






The results described in Section 5 were obtained at Lomonosov Moscow State University with the financial support of the Russian Science Foundation (Agreement No. 20-11-20194). The reported study was funded by RFBR, Project Number 19-37-90002.


Corresponding author

Correspondence to Ilya V. Afanasyev.


Cite this article

Afanasyev, I.V., Voevodin, V.V., Komatsu, K. et al. VGL: a high-performance graph processing framework for the NEC SX-Aurora TSUBASA vector architecture. J Supercomput 77, 8694–8715 (2021).



  • Graph algorithms
  • Graph frameworks
  • VGL
  • Vector processing
  • High-performance computing