Efficient Breadth-First Search on Massively Parallel and Distributed-Memory Machines
Abstract
There are many large-scale graphs in the real world, such as Web graphs and social graphs, and interest in large-scale graph analysis has been growing in recent years. Breadth-First Search (BFS) is one of the most fundamental graph algorithms and is used as a component of many other graph algorithms. Our new method for distributed parallel BFS can compute BFS for a graph with one trillion vertices within half a second, using large supercomputers such as the K-Computer. Using our proposed algorithm, the K-Computer was ranked 1st in the Graph500 using all of its 82,944 nodes in June and November 2015 and June 2016, with 38,621.4 GTEPS. Based on the hybrid BFS algorithm by Beamer (Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW ’13, IEEE Computer Society, Washington, 2013), we devise sets of optimizations for scaling to an extreme number of nodes, including a new efficient graph data structure and several optimization techniques such as vertex reordering and load balancing. Our performance evaluation on the K-Computer shows that our new BFS is 3.19 times faster on 30,720 nodes than the base version using the previously known best techniques.
Keywords
Distributed-memory  Breadth-First Search  Graph500
1 Introduction
Graphs have quickly become one of the most important data structures in modern IT, such as in social media where the massive number of users is modeled as vertices and their social connections as edges, and collectively analyzed to implement various advanced services. Another example is to model biophysical structures and phenomena, such as brain’s synaptic connections, or interaction network between proteins and enzymes, thereby being able to diagnose diseases in the future. The common properties among such modern applications of graphs are their massive size and complexity, reaching up to billions of edges and trillions of vertices, resulting in not only tremendous storage requirements but also compute power to conduct their analysis.
With such high interest in analytics of large graphs, a new benchmark called the Graph500 [8, 11] was proposed in 2010. Since the predominant use of supercomputers had been for numerical computing, most of the HPC benchmarks such as the Top500 Linpack had been compute centric. The Graph500 benchmark instead measures the data analytics performance of supercomputers, in particular those for graphs, with the metric called traversed edges per second or TEPS. More specifically, the benchmark measures the performance of Breadth-First Search (BFS), which is utilized as a kernel for important and more complex algorithms such as connected components analysis and centrality analysis. Also, the target graph used in the benchmark is a scale-free, small-diameter graph called the Kronecker graph, which is known to model realistic graphs arising out of practical applications, such as Web and social networks, as well as those that arise from life science applications. As such, attaining high performance on the Graph500 represents the important abilities of a machine to process real-life, large-scale graphs arising from big-data applications.
We have conducted a series of work [11, 12, 13] to accelerate BFS in a distributed-memory environment. Our new work extends the data structures and algorithm called hybrid BFS [2], which is known to be effective for small-diameter graphs, so that it scales to top-tier supercomputers with tens of thousands of nodes, million-scale CPU cores, and multi-gigabyte/s interconnects. In particular, we apply our algorithm to Riken’s K-Computer [15] with 82,944 compute nodes and 663,552 CPU cores, once the fastest supercomputer in the world on the Top500 in 2011 with over 10 Petaflops. The result obtained is currently No. 1 on the Graph500 for two consecutive editions in 2016, with a significant TEPS performance advantage over the result obtained on the Sequoia supercomputer hosted by Lawrence Livermore National Laboratory in the USA, a machine with twice the size and performance of the K-Computer, with over 20 Petaflops and approximately 1.6 million cores. This demonstrates that top supercomputers compete for the top ranks on the Graph500, but the Top500 ranking does not necessarily translate directly in this regard; rather, architectural properties other than the number of FPUs, as well as algorithmic advances, play a major role in attaining top performance, indicating the importance of co-design of future top-level machines, including those for exascale, with graph-centric applications in mind.
In fact, the top ranks of the Graph500 have historically been dominated by large-scale supercomputers, with other competing infrastructures such as Clouds being notably absent; performance measurements from various studies, including ours, reveal that this is fundamental, in that interconnect performance plays a significant role in the overall performance of large-scale BFS, and this is one of the biggest differentiators between supercomputers and Clouds.
2 Background: Hybrid BFS
2.1 The Base Hybrid BFS Algorithm
For large but small-diameter graphs such as the Kronecker graph used in the Graph500, the hybrid BFS algorithm [2] (Fig. 3), which heuristically minimizes the number of edges to be scanned by switching between top-down and bottom-up search, has been identified as very effective in significantly increasing the performance of BFS.
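The switching idea can be illustrated with a minimal sequential sketch in Python. It is not the authors' implementation: the function name `hybrid_bfs`, the dictionary-based adjacency lists, and the thresholds `alpha` and `beta` are assumptions introduced here for illustration, and the switching rule is a simplified form of the heuristic described in [2]. The graph is assumed to be undirected, so each adjacency list can be read both as out-edges (top-down) and as in-edges (bottom-up).

```python
from typing import Dict, List


def hybrid_bfs(adj: Dict[int, List[int]], root: int,
               alpha: float = 14.0, beta: float = 24.0) -> Dict[int, int]:
    """Return a parent map (-1 = unreached) for an undirected graph.

    alpha and beta are hypothetical tuning parameters of the switching rule.
    """
    n = len(adj)
    parent = {v: -1 for v in adj}
    parent[root] = root
    frontier = {root}
    bottom_up = False
    while frontier:
        # Edges incident to the frontier vs. edges incident to unvisited vertices.
        m_f = sum(len(adj[v]) for v in frontier)
        m_u = sum(len(adj[v]) for v in adj if parent[v] == -1)
        if not bottom_up and m_f > m_u / alpha:
            bottom_up = True          # frontier is large: scan from the children
        elif bottom_up and len(frontier) < n / beta:
            bottom_up = False         # frontier has shrunk: scan from the parents
        nxt = set()
        if bottom_up:
            # Bottom-up: each unvisited vertex looks for any parent in the frontier
            # and stops at the first one found.
            for v in adj:
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            nxt.add(v)
                            break
        else:
            # Top-down: each frontier vertex claims its unvisited neighbours.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        nxt.add(v)
        frontier = nxt
    return parent
```

The early `break` in the bottom-up branch is where the edge savings come from: once the frontier covers a large share of the graph, most unvisited vertices find a parent after inspecting only a few edges.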
2.2 Parallel and Distributed BFS Algorithm
In Figs. 5 and 6, f, n, and \(\pi \) correspond to frontier, next, and parent in the base sequential algorithms, respectively. Allgatherv() and alltoallv() are standard MPI collectives. Beamer’s proposal [3] encodes f, c, n, and w as 1 bit per vertex for optimization. Parallel-distributed hybrid BFS is similar to the sequential algorithm in Fig. 4, heuristically switching between top-down and bottom-up at each iteration step, being essentially a hybrid of the algorithms in Figs. 5 and 6.
In the parallel 2D bottom-up BFS algorithm in Fig. 6, each search step is broken down into C substeps, assuming that the adjacency matrix is partitioned into \(R \times C\) submatrices in a two-dimensional manner, and during each substep, a given vertex’s edges are examined by only one processor. During each substep, a processor processes 1/C of the vertices assigned to its processor row. After each substep, it passes on the responsibility for those vertices to the processor to its right and accepts new vertices from the processor to its left. This pairwise communication conveys which vertices have been completed (i.e., have found parents), so that the next processor knows to skip them. The effect is that the processor responsible for a vertex rotates right along the row at each substep. When a vertex finds a valid parent and becomes visited, its index, along with its discovered parent, is queued up and sent to the processor responsible for the corresponding segment of the parent array to update it. Each step of the algorithm in Fig. 6 has four major operations [3]:

Each processor is given the segment of the frontier corresponding to their assigned submatrix.

Search for parents with the information available locally.

Send updates of children that found parents and process updates for own segment of parents.

Send completed to the right neighbor and receive completed for the next substep from the left neighbor.
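The rotation of responsibility along a processor row can be sketched as a single-process simulation in Python. This is only an illustration under simplifying assumptions: the name `bottom_up_row_step` and its data layout are hypothetical, real MPI communication is replaced by shared Python sets, and the parent update is applied directly instead of being sent to an owner process.

```python
from typing import Dict, List, Sequence, Set


def bottom_up_row_step(local_in_edges: Sequence[Dict[int, List[int]]],
                       chunks: Sequence[List[int]],
                       frontier: Set[int],
                       parent: Dict[int, int]) -> Set[int]:
    """Simulate one bottom-up step on a processor row with C processors.

    local_in_edges[j][v] lists the candidate parents of v that only
    processor j stores (its column block of the adjacency matrix).
    chunks[k] is the k-th 1/C share of the row's vertices.
    """
    C = len(chunks)
    # "completed" travels with each chunk: vertices that already have a parent.
    completed = [{v for v in chunk if parent[v] != -1} for chunk in chunks]
    next_frontier: Set[int] = set()
    for s in range(C):                      # C substeps
        for j in range(C):                  # conceptually runs in parallel
            k = (j - s) % C                 # chunk k has moved s hops to the right
            for v in chunks[k]:
                if v in completed[k]:
                    continue                # skip vertices that found parents
                for u in local_in_edges[j].get(v, ()):
                    if u in frontier:
                        parent[v] = u       # stands in for the parent update
                        completed[k].add(v)
                        next_frontier.add(v)
                        break
        # Rotate Completed: implicit here, since chunk k is read by
        # processor (k + s + 1) % C in the next substep.
    return next_frontier
```

After C substeps every chunk has visited every processor in the row, so each vertex has been checked against all of its candidate parents exactly once, while in any single substep its edges were examined by only one processor.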
3 Problems of Hybrid BFS in Extreme-Scale Supercomputers
Although the algorithm in Sect. 2 would work efficiently on a small-scale machine, on extremely large supercomputers, up to and beyond million-core scale toward exascale, various problems manifest themselves and severely limit the performance and scalability of BFS. We describe the problems in this section and present our solutions in Sect. 4.
3.1 Problems with the Data Structure of the Adjacency Matrix
The data structure describing the adjacency matrix is of significant importance as it directly affects the computational complexity of graph traversal. For small machines, the typical strategy is to employ the Compressed Sparse Row (CSR) format, commonly employed in numerical computing to express sparse matrices. However, we first show that direct use of CSR is impractical due to its memory requirements on a large machine; we then show that the existing proposed solutions, DCSR [4] and Coarse index + Skip list [6] that intend to reduce the footprint at the cost of increased computational complexity, are still insufficient for large graphs with significant computational requirement.
3.1.1 Compressed Sparse Row (CSR)
This indicates that, for large machines, as C gets larger, the memory requirement per node increases, as the memory requirement of rowstarts is \(V'C\). In fact, for very large graphs on machines with thousands of nodes, rowstarts can become significantly larger than dst, making its straightforward implementation impractical.
Several studies propose to compress rowstarts, such as DCSR [4] and Coarse index + Skip list [6], but they involve non-negligible performance overhead, as we describe below:
3.1.2 DCSR
DCSR [4] was proposed to improve the efficiency of matrix-matrix multiplication in a distributed-memory environment. The key idea is to eliminate the rowstarts value for rows that have no nonzero values, thereby compressing rowstarts. Instead, two supplemental data structures called the JC and AUX arrays are employed to calculate the appropriate offset in the dst array. The drawback is that one needs to iterate over the JC array, starting from the position given by the AUX array, resulting in significant overhead for repeated access to sparse structures, which is a common operation in BFS.
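To make the lookup cost concrete, the following Python sketch performs a DCSR row access on the small example used in this paper's representation table (edges 0→4, 0→5, 6→3, 7→1; AUX = 0 1 1 3, JC = 0 6 7, Rowstarts = 0 2 3 4, DST = 4 5 3 1). The function name `dcsr_neighbors` is hypothetical, and the chunk size of 3 rows per AUX entry is an assumption inferred from those example values rather than something stated in the text.

```python
from typing import List


def dcsr_neighbors(aux: List[int], jc: List[int], rowstarts: List[int],
                   dst: List[int], chunk_size: int, v: int) -> List[int]:
    """Return the destinations of row v in a DCSR structure.

    aux[c]..aux[c+1] delimits the JC entries whose row index falls into
    chunk c; chunk_size is an assumed parameter of this sketch.
    """
    c = v // chunk_size
    # The scan over JC is the iteration the text identifies as overhead.
    for i in range(aux[c], aux[c + 1]):
        if jc[i] == v:
            return dst[rowstarts[i]:rowstarts[i + 1]]
    return []  # row v has no nonzero entries


AUX = [0, 1, 1, 3]
JC = [0, 6, 7]
ROWSTARTS = [0, 2, 3, 4]
DST = [4, 5, 3, 1]
```

With these arrays, `dcsr_neighbors(AUX, JC, ROWSTARTS, DST, 3, 7)` scans two JC entries before returning `[1]`; in a real hypersparse matrix the scanned range can be much longer.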
3.1.3 Coarse Index + Skip List
Another proposal [6] was made in order to efficiently implement Breadth-First Search with 1D partitioning in a distributed-memory environment. Sixty-four rows of nonzero elements are batched into a skip list, and by having rowstarts hold the pointer to the skip list, this method compresses the overall size of rowstarts to 1/64th of the original size. Since each skip list embodies 64 rows of data, we can traverse all 64 rows contiguously, making algorithms with batched row access efficient in addition to providing data compression. However, for sparse accesses, on average one would have to traverse and skip over 31 elements to access the designated matrix element, potentially introducing significant overhead.
3.1.4 Other Sparse Matrix Formats
There are other known sparse matrix formats that do not utilize rowstarts [9], significantly saving memory; however, although such formats would be useful for algorithms that systematically iterate over all elements of a matrix, they perform badly for BFS where individual accesses to the edges of a given vertex need to be efficient.
3.2 Problems with Communication Overhead
Hybrid BFS with 2D partitioning scales for a small number of nodes, but its scalability is known to quickly saturate when the number of nodes grows beyond thousands [3].
Table 1 Communication cost of bottom-up search [3]
Operation  Comm type  Comm complexity per step  Data transfer per each search (64 bit word) 

Transpose  P2P  O(1)  \(s_b V / 64\) 
Frontier Gather  Allgather  O(1)  \(s_b VR / 64\) 
Parent Updates  P2P  O(C)  2V 
Rotate Completed  P2P  O(C)  \(s_b VC / 64\) 
As we can see in Table 1, the communication cost of Frontier Gather and Rotate Completed is proportional to R and C in the submatrix partitioning, making it one of the primary sources of overhead when the number of nodes is in the thousands or more. Moreover, lines 21 and 26 involve synchronous communication with other nodes, and the number of communications is proportional to C, again becoming significant overhead. Finally, it is very difficult to achieve perfect load balancing, as a small number of vertices tend to have a number of edges that can be orders of magnitude larger than the average; this can result in severe load imbalance in simple algorithms that assume an even distribution of vertices and edges.
Such difficulties have been the primary reasons why one could not obtain near-linear speedups, even in weak scaling, as the number of compute nodes and the associated graph sizes increased to thousands or more on a very large machine. We next introduce our extremely scalable hybrid BFS that alleviates these problems, to achieve utmost scalability for Graph500 execution on the K-Computer.
4 Our Extremely Scalable Hybrid BFS
The problems associated with previous algorithms are largely the storage and communication overheads of analyzing extremely large graphs over thousands of nodes or more. These are fundamental to the fact that we are handling irregular, large-scale “big” data structures and not floating point numerical values. In order to alleviate the problems, we propose several solutions that are unique to graph algorithms.
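4.1 Bitmap-Based Sparse Matrix Representation
Table 2 Examples of bitmap-based sparse matrix representation
Format  Array  Values
Edges list  SRC  0 0 6 7
Edges list  DST  4 5 3 1
CSR  Rowstarts  0 2 2 2 2 2 2 3 4
CSR  DST  4 5 3 1
Bitmap-based sparse matrix representation  Offset  0 1 3
Bitmap-based sparse matrix representation  Bitmap  1 0 0 0 0 0 1 1
Bitmap-based sparse matrix representation  Rowstarts  0 2 3 4
Bitmap-based sparse matrix representation  DST  4 5 3 1
DCSR  AUX  0 1 1 3
DCSR  JC  0 6 7
DCSR  Rowstarts  0 2 3 4
DCSR  DST  4 5 3 1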
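As a reading aid for the bitmap-based rows of this table, the Python sketch below rebuilds the Offset array from the Bitmap and looks up the edges of a vertex with a bit test and a population count. The helper names `pack_bitmap` and `bitmap_csr_neighbors` are hypothetical. The 4-bit word size is an assumption chosen because it reproduces the Offset values 0 1 3 of the example; a real implementation would presumably use 64-bit words, in line with the \(V'C/64\) size of the Offset array.

```python
from typing import List, Tuple


def pack_bitmap(bits: List[int], word_bits: int) -> Tuple[List[int], List[int]]:
    """Pack a 0/1 list into words (bit b of word w = bits[w*word_bits + b])
    and build the per-word running count of set bits (the Offset array)."""
    words, offset = [], [0]
    for i in range(0, len(bits), word_bits):
        word = 0
        for b, bit in enumerate(bits[i:i + word_bits]):
            word |= bit << b
        words.append(word)
        offset.append(offset[-1] + bin(word).count("1"))
    return words, offset


def bitmap_csr_neighbors(offset: List[int], words: List[int],
                         rowstarts: List[int], dst: List[int],
                         word_bits: int, v: int) -> List[int]:
    """Return the destinations of vertex v without scanning other rows."""
    w, b = divmod(v, word_bits)
    word = words[w]
    if not (word >> b) & 1:
        return []                      # vertex v has no edges
    # Rank of v among non-empty rows: set bits before it in the bitmap.
    rank = offset[w] + bin(word & ((1 << b) - 1)).count("1")
    return dst[rowstarts[rank]:rowstarts[rank + 1]]


BITMAP = [1, 0, 0, 0, 0, 0, 1, 1]
ROWSTARTS = [0, 2, 3, 4]
DST = [4, 5, 3, 1]
WORDS, OFFSET = pack_bitmap(BITMAP, 4)   # OFFSET == [0, 1, 3] as in the table
```

Unlike the DCSR lookup, the cost here is constant per access: one word read, one mask, and one population count, which many processors provide as a single instruction.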
Table 3 Theoretical order and the actual per-node measured memory consumption of bitmap-based CSR compared to previous proposals
Data structure  CSR (order)  CSR (actual)  Bitmap-based CSR (order)  Bitmap-based CSR (actual)
Offset  –  –  \(V'C/64\)  32 MB
Bitmap  –  –  \(V'C/64\)  32 MB
Rowstarts  \(V'C\)  2048 MB  \(V'p\)  190 MB
DST  \(V' \hat{d}\)  1020 MB  \(V' \hat{d}\)  1020 MB
Total  \(V'(C+ \hat{d})\)  3068 MB  \(V'(C/32+p+\hat{d})\)  1274 MB

Data structure  DCSR (order)  DCSR (actual)  Coarse index + Skip list (order)  Coarse index + Skip list (actual)
AUX  \(V'p\)  190 MB  –  –
JC  \(V'p\)  190 MB  –  –
Rowstarts  \(V'p\)  190 MB  \(V'C/64\)  32 MB
DST or skip list  \(V' \hat{d}\)  1020 MB  \(V' \hat{d} + V'p\)  1210 MB
Total  \(V'(3p+ \hat{d})\)  1590 MB  \(V'(C/64+p+\hat{d})\)  1242 MB
4.2 Reordering of the Vertex IDs
Another associated problem with BFS is the randomness of memory accesses of graph data, in contrast to traditional numerical computing using CSR such as the Conjugate Gradient method, where the access to the row elements of a matrix can become contiguous. Here, we attempt to exploit similar locality properties.
The basic idea is as follows: As described in Sect. 2.2, much of the information regarding hybrid BFS is held in bitmaps that represent the vertices, each bit corresponding to a vertex. When we execute BFS over a graph, higher-degree vertices are typically accessed more often; as such, by clustering access to such vertices by reordering them according to their degrees (i.e., # of edges), we can expect to achieve higher locality. This is similar to switching rows in a matrix in a sparse numerical algorithm to achieve higher locality. In [12], they proposed such reordering for top-down BFS, where they only utilize the reordered vertices where needed, while maintaining the original BFS tree with original vertex IDs for overall efficiency. Unfortunately, this method cannot be used for hybrid BFS; instead, we propose the following algorithm.
Reordered IDs of the vertices are computed by sorting them top-down according to their degrees on a per-node basis and then reassigning the new IDs according to their order. We do not conduct any inter-node reordering. A sub-adjacency matrix on each node stores reordered IDs of the vertices. The mapping information between original vertex ID and its reordered vertex ID is maintained by an owner node where the vertex is located. When constructing an adjacency matrix of the graph, the original vertex ID is converted to the reordered ID by (a) firstly performing all-to-all communication once over all the nodes in a row of processor grid in 2D partitioning to compute the degree information of each vertex, and then (b) secondly computing the reordered IDs by sorting all the vertices according to their degrees and then (c) thirdly performing all-to-all communication again over all the nodes in a column and a row of processor grid in order to convert the vertex IDs in the sub-adjacency matrix on each node to the reordered IDs.
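Step (b) above, the local sort, can be sketched in a few lines of Python. This covers only the per-node computation and leaves out the all-to-all exchanges of steps (a) and (c). The function names are hypothetical, the vertices are assumed to be those owned by a single node, and "top-down" is read here as descending degree, which is an interpretation rather than something the text states explicitly.

```python
from typing import Dict, List, Tuple


def reorder_by_degree(degree: Dict[int, int]) -> Tuple[Dict[int, int], List[int]]:
    """Assign new local IDs so that higher-degree vertices get smaller IDs.

    Returns (new_id, orig_of): new_id maps original ID -> reordered ID and
    orig_of[new] gives back the original ID, i.e. the owner's mapping table.
    """
    # Ties are broken by original ID so the result is deterministic.
    orig_of = sorted(degree, key=lambda v: (-degree[v], v))
    new_id = {orig: new for new, orig in enumerate(orig_of)}
    return new_id, orig_of


def relabel_edges(edges: List[Tuple[int, int]],
                  new_id: Dict[int, int]) -> List[Tuple[int, int]]:
    """Rewrite an edge list in terms of the reordered IDs."""
    return [(new_id[s], new_id[d]) for s, d in edges]
```

Because frequently accessed high-degree vertices end up with adjacent small IDs, their bits fall into the same few bitmap words, which is the locality effect the reordering aims for.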
Table 4 Adding the original IDs for both the source and the destination
Offset  0 1 3 
Bitmap  1 0 0 0 0 0 1 1 
SRC(Orig)  2 0 1 
Rowstarts  0 2 3 4 
DST  2 3 0 1 
DST(Orig)  4 5 3 1 
The solution to this problem is to add two arrays, SRC(Orig) and DST(Orig), as shown in Table 4. Both arrays hold the original indices of the reordered vertices. When the algorithm writes to the resulting BFS tree, the original ID is referenced from either of the arrays instead of the reordered ID, avoiding all-to-all communication. Also, a favorable by-product of vertex reordering is the removal of vertices with no edges, allowing further compaction of the data structure, since such vertices will never show up in the resulting BFS tree.
4.3 Optimizing Inter-Node Communications for Bottom-Up BFS
The original bottom-up BFS algorithm shown in Fig. 6 conducts communication at each substep (C times per iteration, assuming that we have 2D partitioning of \(R \times C\) for the adjacency matrix). For large systems, such frequent communication presents significant overhead and is thus subject to the following optimizations:
4.3.1 Optimizing Parent Updates Communication
4.3.2 Overlapping Computation and Communication in Rotate Completed Operation
We also attempt to overlap the computation in lines 12–20 (Fig. 8) with the communication in line 21 in the “Rotate Completed” operation mentioned in Sect. 2. If the number of substeps is set to C as in the original method, we cannot overlap computation and communication, since the computational result depends on the result of the previous substep. Thus, we increase the number of substeps from C to a multiple such as 2C or 4C. For the K-Computer described in Sect. 5, we increase this to 4C. If the number of substeps is set to 4C, the computational result depends on the one from 4 substeps before, and we can overlap computation and communication across 4 substeps. In this case, while the computation is performed for 2 substeps, the communication for the other 2 substeps is executed simultaneously. The communication is accelerated by allocating these 2 substeps to 2 different communication channels in the 6D torus network of the K-Computer, which supports multiple-channel communication using rDMA.
4.4 Reducing Communication with Better Partitioning
Table 5 Bitmap-based CSR data communication volume (difference from Table 1 is in italics)
Operation  Comm type  Comm complexity per step  Data transfer per each search (64 bit word) 

Frontier Gather  Allgather  O(1)  \(s_b VR / 64\) 
Parent Updates  Alltoall  \(\textit{O(1)}\)  2V 
Rotate Completed  P2P  O(C)  \(s_b VC / 64\) 
4.5 Load Balancing the Top-Down Algorithm
We resolve the following load-balancing problem for the top-down algorithm. As shown in Fig. 5, lines 10–14, we need to create \(t_{i,j}\) from the edges of each vertex in the frontier; this is implemented so that each vertex pair of the edges is placed in a temporary buffer and then copied to the communication buffer just prior to alltoallv(). Here, as we see in Fig. 10, thread parallelism is utilized so that each thread is assigned an equal number of frontier vertices. However, since the distribution of edges per vertex is quite uneven, this causes significant load imbalance among the threads.
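One common remedy for this kind of imbalance is to divide the work by edges rather than by vertices, using a prefix sum over the degrees of the frontier vertices. The Python sketch below illustrates that general idea only; it is not claimed to be the scheme used in this paper, and the name `split_edges_evenly` and the `(vertex index, first edge, end edge)` piece format are assumptions made for the example.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

Piece = Tuple[int, int, int]  # (frontier index, first edge, one past last edge)


def split_edges_evenly(degrees: List[int], num_threads: int) -> List[List[Piece]]:
    """Give each thread a contiguous, near-equal share of the frontier's edges.

    degrees[i] is the edge count of the i-th frontier vertex. A vertex with
    many edges may be split across several threads.
    """
    prefix = [0] + list(accumulate(degrees))   # prefix[i] = edges before vertex i
    total = prefix[-1]
    plan: List[List[Piece]] = []
    for t in range(num_threads):
        lo = total * t // num_threads          # first global edge index of thread t
        hi = total * (t + 1) // num_threads    # one past its last edge
        pieces: List[Piece] = []
        if lo < hi:
            i = bisect_right(prefix, lo) - 1   # vertex that contains edge lo
            while lo < hi:
                end = min(hi, prefix[i + 1])
                if end > lo:                   # skip zero-degree vertices
                    pieces.append((i, lo - prefix[i], end - prefix[i]))
                lo = end
                i += 1
        plan.append(pieces)
    return plan
```

For frontier degrees `[1, 9, 1, 1]` and three threads, a per-vertex split would hand one thread at least nine edges, whereas this split gives every thread exactly four by cutting the nine-edge vertex into three pieces.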
5 Machine Architecture-Specific Communication Optimizations for the K-Computer
The optimizations we have proposed so far are applicable to any large supercomputer that supports MPI+OpenMP hybrid parallelism. We now present further optimizations specific to the K-Computer, exploiting its unique architectural capabilities. In particular, the node-to-node interconnect employed in the K-Computer is a proprietary “Tofu” network that implements a six-dimensional torus topology, with high injection bandwidth and multi-directional DMA to achieve extremely high performance in communication-intensive HPC applications. We exploit the features of the Tofu network to achieve high performance on BFS as well.
5.1 Mapping to the Six-Dimensional Torus “Tofu” Network
Since our bitmap-based hybrid BFS employs two-dimensional \(R \times C\) partitioning, there is a choice of how to map this onto the six-dimensional Tofu network, whose dimensions are named “x, y, z, a, b, c.” One obvious choice is to assign three dimensions each to R and C (say \(R=x,y,z\) and \(C=a,b,c\)), allowing physically proximal communications for adjacent nodes in the \(R \times C\) partitioning. Another interesting option is to assign \(R=y,z\) and \(C=x,a,b,c\), where we achieve square \(288 \times 288\) partitioning when we use the entire K-Computer. We test both cases in the benchmark for comparison.
5.2 Bidirectional Simultaneous Communication for Bottom-Up BFS
Each node on the K-Computer has six 5 Gigabyte/s bidirectional links that form a six-dimensional torus and allows simultaneous DMA on four of the six links. BlueGene/Q has a similar mechanism. By exploiting such simultaneous communication capabilities over multiple links, we can significantly speed up the communication for bottom-up BFS. In particular, we have optimized Rotate Completed communication by communicating simultaneously in both directions, as shown in Fig. 12. Here, \(c_{i,j}\) is the data to be communicated, and s is the number of steps, up to 2C or 4C steps. We case-analyze s into even and odd to communicate in different directions simultaneously.
6 Performance Evaluation
We now present the results of the Graph500 benchmark using our hybrid BFS on the entire K-Computer. The Graph500 benchmark measures the performance of each machine by the traversed edges per second (TEPS) value of the BFS algorithm on a synthetically generated Kronecker graph, with parameters A=0.57, B=0.19, C=0.19, D=0.05. The size of the graph is expressed by the scale parameter, where \(\#\hbox {vertices}\, = 2^{Scale}\) and \(\#\,\hbox {edges}\, = \#\,\hbox {vertices}\, \times 16.\)
The K-Computer is located at the Riken AICS facility in Japan, with each node embodying an 8-core Fujitsu SPARC64 VIIIfx processor and 16 GB of memory. The Tofu network composes a six-dimensional torus as mentioned, with each link being bidirectional at 5 GB/s. The total number of nodes is 82,944, embodying 663,552 CPU cores and approximately 1.3 Petabytes of memory.
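The size formulas can be checked with a short Python calculation. The helper names are hypothetical, and the time estimate is a rough approximation that assumes every generated edge counts toward TEPS, which need not match the benchmark's exact counting rules.

```python
from typing import Tuple


def graph500_size(scale: int, edge_factor: int = 16) -> Tuple[int, int]:
    """Return (#vertices, #edges) for a given Graph500 scale."""
    vertices = 1 << scale            # 2^Scale
    return vertices, vertices * edge_factor


def approx_bfs_seconds(scale: int, gteps: float, edge_factor: int = 16) -> float:
    """Rough time for one BFS if all generated edges are traversed at `gteps`."""
    _, edges = graph500_size(scale, edge_factor)
    return edges / (gteps * 1e9)
```

For Scale 29 this gives 536,870,912 vertices and 8,589,934,592 edges, matching the "approximately 537 million vertices and 8.59 billion edges" quoted for the smallest run. For Scale 40 (about 1.1 trillion vertices) at the reported 38,621.4 GTEPS, the estimate is roughly 0.46 s per search, consistent with the claim of BFS on a trillion-vertex graph within half a second.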
6.1 Effectiveness of the Proposed Methods
We measure the effectiveness of the proposed methods using up to 15,360 nodes of the K-Computer. We increased the number of nodes in increments of 60, with the minimum being Scale 29 (approximately 537 million vertices and 8.59 billion edges), up to Scale 37. We picked a random vertex as the root of BFS and executed each benchmark 300 times. The reported value is the median of the 300 runs.
We first compared our bitmap-based sparse matrix representation to previous approaches, namely DCSR [4] and Coarse index + Skip list [6]. Figure 13 shows the weak scaling result of the execution performance in GTEPS, and Figs. 14, 15, and 16 show various execution metrics: the number of instructions, time, and memory consumed. The processing of “Reading Graph” in Figs. 14 and 15 corresponds to lines 10–14 of Fig. 5 and lines 12–20 of Fig. 6. “Synchronization” is the inter-thread barrier synchronization over all computation. Since the barrier is implemented with “spin wait,” the number of executed instructions for this barrier is large compared with the others.
6.2 Using the Entire K-Computer
By using the entire K-Computer, we were able to obtain 38,621.4 GTEPS using 82,944 nodes and 663,552 cores with a Scale 40 problem in June 2015. This bested the previous record of 23,751 GTEPS recorded by LLNL’s Sequoia BlueGene/Q supercomputer, with 98,304 nodes and 1,572,864 cores with a Scale 41 problem.
It is by no means clear whether we have hit the ultimate limit of the machine, i.e., whether or not we can tune the efficiency any further just by algorithmic changes. We know that the BFS algorithm used for Sequoia is quite different from our proposed one, and it would be interesting to separate the effects of algorithm and machine by cross-execution of the two (our algorithm on Sequoia and LLNL’s algorithm on the K-Computer) and by conducting a detailed analysis of both to investigate further optimization opportunities.
7 Related Work
As we mentioned, Yoo [16] proposed an effective method for 2D graph partitioning for BFS in a large-scale distributed-memory computing environment; the base algorithm itself was a simple top-down BFS and was evaluated in a large-scale environment, a 32,768-node BlueGene/L.
Buluc et al. [5] conducted extensive performance studies of partitioning schemes for BFS on large-scale machines at LBNL, Hopper (6,392 nodes) and Franklin (9,660 nodes), comparing 1D and 2D partitioning strategies. Satish et al. [10] proposed an efficient BFS algorithm on commodity supercomputing clusters consisting of Intel CPUs and the InfiniBand network. Checconi et al. [7] proposed an efficient parallel-distributed BFS on BlueGene using a communication method called “wave” that proceeds independently along the rows of the virtual processor grids. All the efforts here, however, use only a top-down approach as the underlying algorithm and are fundamentally at a disadvantage for graphs such as the Graph500 Kronecker graph, whose diameter is relatively small compared to its size, as is the case for many real-world graphs.
Hybrid BFS by Beamer [2] is the seminal work that solves this problem, on which our work is based. Efficient parallelization in a distributed-memory environment on a supercomputer is much more difficult and includes the early work by Beamer [3] and the work by Checconi [6], which uses a 1D partitioning approach. The latter is very different from ours, not only in that its partitioning is 1D compared to our 2D, but also in taking advantage of this simplicity by ingeniously replicating the vertices with a large number of edges among all the nodes, achieving very good overall load balancing. Its performance evaluation on 65,536 nodes of BlueGene/Q achieved 16,599 GTEPS, and it would be interesting to consider utilizing some of these strategies in our work.
8 Conclusion
For many graphs we see in the real world, with relatively small diameter compared to their size, hybrid BFS is known to be very efficient. The problem has been that, although various algorithms have been proposed to parallelize the algorithm in a distributed-memory environment, such as the work by Beamer [3] using 2D partitioning, the algorithms failed to scale or be efficient for modern machines with tens of thousands of nodes and million-scale cores, due to the increase in memory and communication requirements overwhelming even the best machines. Our proposed hybrid BFS algorithm overcomes such problems by a combination of various new techniques, such as bitmap-based sparse matrix representation and reordering of vertex IDs, as well as new methods for communication optimization and load balancing. Detailed performance evaluation on the K-Computer revealed the effectiveness of each of our approaches, with the combined effect of all achieving over \(3{\times }\) speedup over previous approaches and scaling effectively to the entire 82,944 nodes of the machine. The resulting performance of 38,621.4 GTEPS allowed the K-Computer to be ranked No. 1 on the Graph500 in June 2015 by a significant margin, and it has retained this rank as of June 2016. We hope to extend the optimizations to other graph algorithms, such as SSSP, on large-scale machines.
Notes
Acknowledgements
This research was supported by the Japan Science and Technology Agency’s CREST project titled “Development of System Software Technologies for post-Peta Scale High Performance Computing.”
References
1. Ajima Y, Takagi Y, Inoue T, Hiramoto S, Shimizu T (2011) The tofu interconnect. In: 2011 IEEE 19th Annual Symposium on High Performance Interconnects, pp 87–94. doi: 10.1109/HOTI.2011.21
2. Beamer S, Asanović K, Patterson D (2012) Direction-optimizing breadth-first search. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pp 12:1–12:10. IEEE Computer Society Press, Los Alamitos, CA, USA. http://dl.acm.org/citation.cfm?id=2388996.2389013
3. Beamer S, Buluc A, Asanovic K, Patterson D (2013) Distributed-memory breadth-first search revisited: Enabling bottom-up search. In: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW ’13, pp 1618–1627. IEEE Computer Society, Washington, DC, USA. doi: 10.1109/IPDPSW.2013.159
4. Buluc A, Gilbert JR (2008) On the representation and multiplication of hypersparse matrices. In: IEEE International Symposium on Parallel and Distributed Processing, 2008. IPDPS 2008, pp 1–11. doi: 10.1109/IPDPS.2008.4536313
5. Buluç A, Madduri K (2011) Parallel breadth-first search on distributed-memory systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pp 65:1–65:12. ACM, New York, NY, USA. doi: 10.1145/2063384.2063471
6. Checconi F, Petrini F (2014) Traversing trillions of edges in real time: graph exploration on large-scale parallel machines. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp 425–434. doi: 10.1109/IPDPS.2014.52
7. Checconi F, Petrini F, Willcock J, Lumsdaine A, Choudhury AR, Sabharwal Y (2012) Breaking the speed and scalability barriers for graph exploration on distributed-memory machines. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pp 13:1–13:12. IEEE Computer Society Press, Los Alamitos, CA, USA. http://dl.acm.org/citation.cfm?id=2388996.2389014
8. Graph500: http://www.graph500.org/
9. Montagne E, Ekambaram A (2004) An optimal storage format for sparse matrices. Inf Process Lett 90(2):87–92. doi: 10.1016/j.ipl.2004.01.014
10. Satish N, Kim C, Chhugani J, Dubey P (2012) Large-scale energy-efficient graph traversal: a path to efficient data-intensive supercomputing. In: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp 1–11. doi: 10.1109/SC.2012.70
11. Suzumura T, Ueno K, Sato H, Fujisawa K, Matsuoka S (2011) Performance characteristics of graph500 on large-scale distributed environment. In: Proceedings of the 2011 IEEE International Symposium on Workload Characterization, IISWC ’11, pp 149–158. IEEE Computer Society, Washington, DC, USA. doi: 10.1109/IISWC.2011.6114175
12. Ueno K, Suzumura T (2012) Highly scalable graph search for the graph500 benchmark. In: Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’12, pp 149–160. ACM, New York, NY, USA. doi: 10.1145/2287076.2287104
13. Ueno K, Suzumura T (2013) Parallel distributed breadth first search on GPU. In: 20th Annual International Conference on High Performance Computing, pp 314–323. doi: 10.1109/HiPC.2013.6799136
14. Yasui Y, Fujisawa K (2014) Fast and energy-efficient Breadth-First Search on a single NUMA system. Springer International Publishing, Cham. doi: 10.1007/978-3-319-07518-1_23
15. Yokokawa M, Shoji F, Uno A, Kurokawa M, Watanabe T (2011) The k computer: Japanese next-generation supercomputer development project. In: Proceedings of the 17th IEEE/ACM International Symposium on Low-power Electronics and Design, ISLPED ’11, pp 371–372. IEEE Press, Piscataway, NJ, USA. http://dl.acm.org/citation.cfm?id=2016802.2016889
16. Yoo A, Chow E, Henderson K, McLendon W, Hendrickson B, Catalyurek U (2005) A scalable distributed parallel breadth-first search algorithm on bluegene/l. In: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC ’05, pp 25. IEEE Computer Society, Washington, DC, USA. doi: 10.1109/SC.2005.4
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.