WolfPath: Accelerating Iterative Traversing-Based Graph Processing Algorithms on GPU

There is the significant interest nowadays in developing the frameworks of parallelizing the processing for the large graphs such as social networks, Web graphs, etc. Most parallel graph processing frameworks employ iterative processing model. However, by benchmarking the state-of-art GPU-based graph processing frameworks, we observed that the performance of iterative traversing-based graph algorithms (such as Bread First Search, Single Source Shortest Path and so on) on GPU is limited by the frequent data exchange between host and GPU. In order to tackle the problem, we develop a GPU-based graph framework called WolfPath to accelerate the processing of iterative traversing-based graph processing algorithms. In WolfPath, the iterative process is guided by the graph diameter to eliminate the frequent data exchange between host and GPU. To accomplish this goal, WolfPath proposes a data structure called Layered Edge list to represent the graph, from which the graph diameter is known before the start of graph processing. In order to enhance the applicability of our WolfPath framework, a graph preprocessing algorithm is also developed in this work to convert any graph into the format of the Layered Edge list. We conducted extensive experiments to verify the effectiveness of WolfPath. The experimental results show that WolfPath achieves significant speedup over the state-of-art GPU-based in-memory and out-of-memory graph processing frameworks.


Introduction
The demand for efficiently processing large scale graphs has been growing fast nowadays.This is because graphs can be used to describe a wide range of objects and computations and graph-based data structures are the basis of many applications [19,[22][23][24]26,30,36,46,48].Traditionally, motivated by the need to process very large graphs, many frameworks have been developed for processing large graphs on distributed systems.Examples of such frameworks include Pregel [28], GraphLab [25], PowerGraph [9] and GraphX [10].However, since developing distributed graph algorithm is challenging, some researchers divert their attention to design the graph processing system that handle large scale graphs on a single PC.The research endeavours in this direction have delivered the systems such as GraphChi [17], PathGraph [45], GraphQ [39], LLAMA [27] and GridGraph [51].However, these systems suffer from the limited degree of parallelism in conventional processors.GPU is renowned for its potential to offer the massive degree of parallelism in many areas [3,4,11,21,32,[42][43][44]49,52].Therefore, much research now resort to use GPU to accelerate the graph processing process.The exemplar GPU-based graph processing systems include Medusa [47], Gunrock [40], CuSha [16], Frog [35] and MapGraph [6].
Many of these graph processing frameworks employ iterative processing techniques [38,41,50].Namely, graph processing involves many iterations.Some iterative graph processing algorithms use the threshold value (e.g., in the PageRank algorithm) or the number of vertices/edges (e.g., in the Minimum-cut algorithm) to determine the termination of the algorithms.In these algorithms, the iteration count is known beforehand.
However, in iterative traversing-based graph processing, the algorithm is driven by the graph structure and the termination of the algorithm is determined by the states of vertices/edges.Therefore, these algorithms need to check the state of vertices/edges at the end of every iteration to determine whether to run the next iteration.In each iteration, either synchronous or asynchronous methods can be used to compute and update the values of vertices or edges in the graph.The processing terminates when all vertices meet the termination criteria.
The aforementioned termination criteria is application-specific (e.g, the newly computed results of all vertices remain unchanged from the previous iteration).The number of iterations is unknown before the algorithm starts.Such a termination method does not cause any problem on CPU-based graph processing systems, because checking the termination condition on CPU does not incur much overhead.On the other hand, termination checking is a time consuming process in GPU-accelerated systems.This is because all the threads in different thread blocks need to synchronize their decision at this point.However, current GPU devices and frameworks (CUDA [5] and OpenCL [37]) do not support synchronization among different thread blocks during the execution of the kernel.Therefore, to synchronize between different thread blocks, the program has to exit the kernel, and copy the data back to the host and use the CPU to determine whether the computation process is complete.This frequent data exchange between host and GPU introduces considerable overhead.
To address this problem, we present WolfPath, a framework that is designed to improve the iterative traversing-based graph processing algorithms (such as BFS and SSSP) on GPU.Our development is motived by the facts that: (1) the iterative graph processing algorithm converges when the computations on all vertices have been completed; (2) the iterative traversing-based graph processing on GPU requires frequent data exchange between GPU and host memory to determine the convergence point; (3) the maximum number of iterations needed to complete a traversing-based graph processing algorithm is determined by the graph's diameter (longest shortest path).
WolfPath has following features.First, a graph in WolfPath is represented by a tree structure.In doing so, we manage to obtain very useful information, such as the graph diameter, the degree of each vertex and the traversal order of the graph, which will be used by WolfPath in graph computations.Second, we design a layered graph structure, which is used by WolfPath to optimize GPU computations.More concretely, for all vertices in the same depth of the graph tree, we group all the edges that use these vertices as source vertex into a layer.All the edges in the same layer can be processed in parallel, and coalesced memory access can be guaranteed.Last but not least, based on the information we gain from the graph model, we design a computation model that does not require frequent data exchange between host and GPU.
The rest of the paper is organised as follows.Section 2 overviews the limitation of iterative graph processing technique used by the state-of-art graph processing frameworks.Section 3 presents the graph modelling and in-memory data structure proposed in this paper.Section 4 presents the details of the WolfPath framework, including the iterative processing model and the graph partition method when the graph can not fit into GPU.Experimental evaluation is presented in Sect. 5. Section 6 discusses related work.Finally, Sect.7 concludes this paper.

Motivation: the Limitation of Current Approach
Many parallel graph processing frameworks are based on the iterative processing model.In this model, the computation goes through many iterations.In each iteration, either synchronous (such as Bulk Synchronous parallel used by Pregel) or asynchronous (Parallel sliding window used by GraphChi) methods can be used to compute and update the vertex or edge values.The computation terminates when all vertices meet the application specific termination criterion (e.g, the results of all vertices remain unchanged).
In order to determine if all processes/threads meet the termination condition, the processes/threads need to communicate with each other.On CPU based parallel graph processing systems, the processes/threads can communicate through messaging pass-ing or shared memory.On GPU, the threads are organised in thread blocks and the communication can be divided into two types: intra-and inter-block communication.
Intra-communication refers to the communications between the threads within a block, which is achieved via shared memory or global memory in GPU.On the contrary, the inter-block communication is the communications across different thread blocks.However, there is no explicit support for data communication across different thread blocks.Currently, inter-block data communication is realized through the GPU global memory followed by a barrier synchronization in CPU [5,37].The barrier is implemented by terminating the execution of the current kernel and re-launching the kernel.
The reason for the lack of support for inter-block communications on GPU is as follows.On GPU, the number of thread blocks launched by an application is normally much larger than the number of Streaming Multiprocessors (SM).However, When a large number of threads try to communicate between different blocks, it can cause the deadlock.An example is given as follows to illustrate the deadlock issue.Suppose that there are 5 thread blocks and only 4 SMs and that each thread block will occupy all resources on a SM.Assume blocks 1-4 execute on 4 SMs.When synchronization occurs, blocks 1-4 will wait until block 5 finishes.However, block 5 will never be executed on any SM since all SMs are busy and there are no resources available.Consequently, the deadlock occurs.
Due to the lack of support for inter-block communications, implementing iterative graph computation algorithm on GPU is much more challenging than on CPU.To demonstrate this, let us first consider how the iterative computation is implemented on CPU.Algorithm 1 shows the high level structure of the iterative computation on CPU.The loop is controlled by the flag variable, which is set to be true at the beginning.Next, all processes/threads execute a user-defined function, compute(), and then invoke the update_condition function to check if the user-specified termination condition is met.Each process/thread has its own f lag variable, which is updated by update_condition function.The update_condition function returns false if the program reaches the termination condition and returns true if otherwise.The f lag variable is synchronised between all the processes/threads.If its value is f alse, the iteration terminates.

Algorithm 1 Iterative Computation on CPU
When executing the above code in parallel on a CPU-based system, the synchronization of the f lag variable can be easily achieved through shared memory or message passing.However, due to the fact that the current GPUs do not support synchronisation between different thread blocks, it takes more efforts to achieve the synchronization on GPU.The only solution that can ensure that the shared variable is properly synchronized across all thread blocks is to exit the kernel.To implement the iterative processing on GPU, many state-of-art graph processing frameworks use the following approach.The f lag variable is stored in the global memory of GPU.Each thread also has a local d_ f lag variable.If a thread meets the termination condition in the current iteration, it sets its own d_ f lag to false.Then d_ f lag is synchronised between all the threads within a thread block.One thread in each thread block updates the global f lag variable if the value of d_ f lag in this thread block is f alse.Next, the program exits the kernel and copies the value of f lag back to the host, which is used by the host program to determine whether another kernel should be launched (i.e., continuing to perform the computation).This technique is outlined Algorithms 2 and 3. Clearly, in order to decide whether to launch the next iteration, the value of f lag needs to be exchanged between host and GPU frequently.These operations incur a significant overhead.If the number of iterations needed to complete the graph computation is known beforehand, the exchange of f lag between host and GPU can be eliminated, which can potentially improve the performance.

Algorithm 2 Iterative processing kernel
We conduct the experiments to investigate the extra overhead caused by exchanging the f lag variable.Four real world graphs, which are listed in Table 1, are used in this experiment.We determine the diameter of the graph using the BFS algorithm implementated on CuSha framework [5].We record the computation time and the number of iterations it takes to complete the algorithm.We can prove that the number of iterations the iterative graph computation has to perform must be less than or equal to the graph diameter.Instead of using the f lag value to determine the termination condition of the graph computation, we terminate the computation when the number of iterations equals to the graph diameter.We re-run the modified program and record the computation time.The results are listed in Table 2.
As can be seen from Table 2, the computation time of modified program is much faster than the original version.We also noticed that the original program takes a fewer number of iterations than the modified version in some cases.This is because the computation converges fast in those cases.Therefore, using the graph diameter as the termination condition in those cases causes the extra overhead of performing unnecessary computations.In order to compare these two types of overhead, we mea- sured the average time for performing data copying and the average computation time per iteration in the experiments.The results are listed in Table 3: The results from Table 3 show that the data copy time is greater than the computation time by 90-1500 times.These results indicate that it is worth performing the excessive iterations, comparing with copying data.We observe another interesting point in Table 3.No matter which graph the algorithm is processing, only one integer (i.e., the value of the f lag variable) is copied between GPU and host.However, the average memory copying time per iteration is different for different graphs.This is because the synchronisation cost between thread blocks is different for different graphs.More threads are involved in synchronisation, the longer the synchronization takes.
All these results support our proposed strategy.Namely, it can improve performance to use the maximum number of iterations as the termination condition so as to eliminate the need of data copying between GPU and CPU.However, a question arises from the strategy: how to know the number of iterations needed for different graphs before the graph processing algorithm starts?This question motivates us to design a new graph representation that help determine the graph diameter and further develop the novel and GPU-friendly graph processing methods.

Graph Representation in WolfPath
In this section, we present the graph model to help determine the number of iterations when processing a graph on GPU iteratively.We also present the detailed data structure of our in-memory graph representation, which is designed in the way that can improve the coalescence in-memory access so as to achieve the higher performance.

Graph Modelling in WolfPath
The experimental results shown in last section suggest that using the number of iterations can improve the performance in graph processing.However, because graph processing algorithms are data driven, it is difficult to determine the number of iterations for different graph inputs before the program starts.Much research [14,15,18,28] have been conducted to tackle this problem.The research shows that when processing graph algorithms iteratively, the upper bound of the number of iterations is the diameter of a graph, i.e., the number of the nodes on the path corresponding to the graph diameter [31].
In WolfPath, we model the graph as a layered tree structure.That is, we first represent the graph as a tree-like structure, then group the vertices that in the same depth into one layer.By modelling the graph this way, the diameter of the graph is the distance from the root vertex to the deepest leaf vertex.
If some vertices in the same layer are connected, we duplicate these vertices in the next level.By duplicating these vertices, the vertex value updated in the current level can be sent to the next level.The reason for this design is as follow.The vertices in the current level and the next level form a group of edges.If a vertex is both source and destination of different edges, the calculated values of their neighbouring vertices may not settle (i.e., the calculated values are not final values of the vertices) after one iteration.Therefore, by duplicating these vertices in the next level, the updated value of their neighbouring vertices will be recomputed.
Based on the above description, given a graph G = (V, E), a layered tree T = (V t , E t ) is defined as follows.V t ⊆ V and E t ⊆ E. The root vertex of the tree, denoted by v rt ∈ V t , is the vertex which does not have in-edges.degree in (v) denotes the in-degree of vertex v. Then degree in (v rt ) = 0. 1b gives the tree structure of the graph shown in Fig. 1a.

Graph Data Structure in WolfPath
As shown in the previous research [16], one of the main factors that limits the performance of graph processing on GPU is the non-coalesced memory access.This is because most graphs have highly irregular structures.Hence, it is important to design a data structure for the graph storing in the GPU global memory so as to achieve the coalesced global memory access.Based on the layered tree structure, we design a layered edge list structure to store the graph in the memory.In this design, each level in the layered tree is represented by two arrays, source array and destination array, which are used to store the source and destination vertexes of each edge in that level respectively.The i-th entry in the source array and the i-th entry in the destination array form an edge in the level.We also create an array for each layer to store the updated values of the destination vertices in that layer.An example is shown in Fig. 2.
It provides the following benefits to use this structure to store a graph in memory.First, the consecutive threads can read/write the consecutive elements from/to the source and destination arrays in the global memory.Therefore the coalesced global memory access can be guaranteed.Second, because all the edges in each layer are independent to each other, we can process them in parallel.In addition, if one layer is too big to fit in the GPU global memory, we can split it into multiple smaller edge lists.Third, multiple layers can be combined together to fully utilise the computation power of GPU.We will discuss the last two advantages in more detail in a later section.Fig. 2 An example of layered edge list structure for graph in Fig. 1a memory and converts it into the CSR format.Next, it uses Algorithm 4 to build the layered tree, and then writes the layered tree back to a file stored in the disk.It is worth noting that this program is only run once for a graph.When processing a graph, WolfPath will first check if there exists a corresponding layered tree file.If the file exists, WolfPath will use it as the input.Otherwise, it will convert the graph into this new format.Algorithm 4 is used to build the layered tree.
Algorithm 4 Building layered tree for all neighbor u of v do 10:

16:
Q.enqueue(u) 17: Algorithm 4 is based on the breadth-first algorithm (BFS).The algorithm constructs an layered tree T for graph G (Line 2).It also creates Queue Q (Line 3).The algorithm starts with adding the vertex v rt into Q (Line 4).In order to quickly retrieve the level information of each node, the algorithm also maintains an array called node_level (Line 5), which is used to store the level information of each node.The size of this array is equal to the number of vertices (denoted by V ) in the graph G.This array is indexed by vertex id.The algorithm initialises all vertices in this array to 0 (Line 5).Then the algorithm performs the following steps iteratively (Line 6): it pops out the first vertex v from the queue (Line 7), and reads its level information from node_level (Line 8).∀e : v → u ∈ E (Line 9), if u / ∈ T (Line 10), or u has already been added into T but is in the same level as v (Line 14), the algorithm adds edge v, u in T (Line 11, 15).Next, the algorithm puts u in the queue by performing enqueue(u) (Line 12, 16), sets the level of u to the current level plus 1 (Line 13 and 17).This process repeats until Q becomes empty.

The Computation Model of WolfPath
The computation process of WolfPath is as follows.In WolfPath, the graph is processed layer by layer.For each layer, three operations are performed by GPU in parallel: the read, compute and write operations.For i-th level in the graph, the read operation reads the updated vertex value from the global vertex array.The compute operation acts on each edge and uses the data gathered from the read operation to compute the value for its edge/vertices.The write operation writes the updated value to the global vertex value array.So the updated values can be used in next iteration by the read operation.Hence, The computation model in WolfPath is synchronous and guarantees that all updates from a previous compute phase are seen only after the write operation is completed and before the next read operation starts.The whole process terminates when all the levels have been processed, that is, the number of iterations is equal to the number of levels of the graph.This process is outlined in Algorithm 5. for all vertexes in parallel in level i do 11: write vertexes value to update value array

Combined Edge List
By representing the graph in the layered tree format, we can gain the knowledge about how many layers there are in the graph and use it to determine the number of iterations needed for most graph algorithms.However, because WolfPath processes the graph layer by layer and the graphs have the nature of irregular data structure, such representation may cause the under-utilisation of GPU.
Each layer in the graph may have different numbers of edges.This number varies dramatically between levels.For instance, consider the graph shown in Fig. 1.The first layer has 2 edges only.on the other hand, the second and third layer have 4 and 6 edges respectively.Hence, the number of threads required to process the first level is far less than the computing power (i.e., the number of processing cores) of GPU.
To overcome this problem, we propose the combined edge list, which is a large edge list that is constructed from multiple layers of edge lists.The combined edge list is constructed in the following way.We first define a number M E, which is the minimum amount of edges to be processed by GPU.Then we add the number of edges level by level starting from the first level of the layered tree.Once the total number of edges is greater or equal to M E, we group these levels together and then re-count the edges from the next level.This process repeats until all levels have been processed.
The way of building the combined edge list ensures that each combined edge list is formed by consecutive edge lists from the layered tree.Hence, each combined edge list can be treated as a sub-graph of the original graph.Therefore, the number of iterations required to process a combined edge list is equal to the number of levels used to form this tree.So Algorithm 5 is re-designed as Algorithm 6.

8: parallel process all edges in C E L[i]
It is very important that we group the consecutive levels together to form a combined edge list.Because the result of the vertices in level i depends on those in level i − 1, the results from the previous iteration need to be passed to the next iteration, which can be easily achieved by grouping the consecutive levels together.If we group the non-consecutive levels into a list, passing results between different levels requires lot of data transferring between host and GPU memory, which will harm the performance.
There remains one question: how do we choose the value for M E? If the number of M E is too small, the resulting edge list may not fully occupy the GPU.On the contrary, if it is too large, the size of the edge list may exceed the size of the GPU memory.Since it is desired that the GPU is fully occupied, the maximum active threads can be used to determine the minimum number of edges per combined edge list.The maximum number of active threads is the number of threads that a GPU can run simultaneously in theory, which can be found by Eq. 1, where N sm is the number of multiprocessors (SMs), M W psm is the maximum number of resident warps per SM and the T pw is the threads per warp.

Out of GPU Memory Processing
Comparing with the host platform, the GPU has a limited memory space.The size of real world graphs may be from few gigabytes to terabytes, which are too large to fit in the GPU global memory.Therefore, in this section, we develop an out-of-GPUmemory engine that can process such large scale graphs.The general process of developing an out-of-GPU-memory engine is to first partition the graph into sub-graphs that can fit into GPU memory, and then process these subgraphs in GPU one at a time.Therefore, our first objective in designing such an engine is to achieve good performance in graph partitioning.Ideally, the performance of this process should be as close as possible to the performance of loading the graph into memory.The second objective is that after partitioning the graph, the graph framework has to deal with the frequent data exchange between GPU and host memory.Otherwise, the performance will take hit.
Based on the design of Layered Edge List, these two goals can be achieved.The process of partitioning the graph is similar to building the combined edge list.We start with the first layer and accumulate the vertices in the layers until the size of accumulated vertices becomes larger than the GPU globa memory.We group all the accumulated vertices as a sub-graph and start the accumulating process again from the current layer.The partitioning process is complete when we have processed all layers in the graph.
The complexity of such partitioning method is O N , where N is the number of layers in the graph.Given the fact that most real world graphs, especially the social network graphs, do not have a big diameter, the number of N will not be very large.
After partitioning the graph, each sub-graph is processed in order based on their positions in the graph.That is, the processing starts with the sub-graph that contains the first layer.The next sub-graph to be processed is the one that follows the first sub-graph.In addition, the updated values of the vertices in the last layer of the current sub-graph need to be passed to the next sub-graph.This can be achieved by retaining the updated vertex values in the global memory after the computation is finished.Therefore, when next sub-graph is loaded into the GPU global memory, the data needed by the subgraph is in the global memory.To process each sub-graph, we use the same method as that for in-GPU-memory processing.Combining multiple layers into a sub-graph enables us to fully utilise the GPU.
It is possible that the size of one layer is larger than the GPU memory. in this case, we can split this layer into multiple parts and compute one part at a time.This works because all the edges in a layer are independent to each other and it is therefore safe to partition a layer into multiple small chunks and process them separately.

Optimisation for GPU
When implementing WolfPath, we take advantage of the hardware architecture of GPU to optimise the performance.Specifically, WolfPath uses the shared memory to improve the performance of random access to the memory and performs computation and communication asynchronously.

Shared Memory
Writing the updated value into the updated value array will cause the random access to the global memory.To address this issue, we use the shared memory in GPU to store the vertex array, which can enhance the performance of random access, since the access to the shared memory is much faster than global memory.
The following method is used in WolfPath to use the shared memory effectively.The shared memory in GPU is shared between all the threads in a thread block.Therefore, the input edge list is split into small blocks in WolfPath, and a vertex array is constructed for each block.Each edge block is processed by one thread block.Multiple thread blocks are processed in parallel.In each thread block, edges are processed in parallel by the threads.
During the execution, the threads first fetch the updated vertex values into the shared memory of the block.The consecutive threads of the block read consecutive elements of the local vertex value array.Hence, reading requests are coalesced to the minimum number of memory transactions.After the computation, the threads first write the newly computed values into the shared memory, and then perform synchronisation between different thread blocks by writing the data from the shared memory to the global memory.

Asynchronous Execution
WolfPath asynchronously performs computation and communication.Specifically, it leverages the CUDA Streams and the hardware support such as Hyper-Qs provided by NVIDIAs Kepler to enable data streaming and computation in parallel.WolfPath creates multiple CUDA Streams to launch multiple kernels and overlap memory copying and computation in order to transfer data asynchronously.This is motivated by the fact that an edge list in WolfPath can be split into many sub-arrays.Each of these sub-arrays is independent to each other.WolfPath exploits this fact and does not move the entire edge list in a single copy performed by a CUDA stream.Instead, WolfPath creates multiple CUDA Streams to move these sub-arrays to the GPU.As the result, many hardware queues in GPU are used concurrently, which improves the overall throughput.

Experimental Evaluation
In this section, we evaluate the performance of WolfPath using two types of graph dataset: small sized graphs that can fit into the GPU global memory (called in-memory graphs in the experiments) and large scale graphs that do not fit (out-of-memory graphs).The size of a graph is defined as the amount of memory required to store the edges, vertices, and edge/vertices values in user-defined datatypes.
Small graphs are used to evaluate WolfPath's performance in in-memory graph processing against other state-of-art in-memory graph processing systems (e.g., CuSha [16] and Virtual-Warp-Centric [13], and the large graphs are used to compare WolfPath with other out-of-core frameworks that can process large graphs on a single PC (e.g., GraphChi and X-Stream).
The eight graphs listed in Tables 1 and 4 are publicly available.They cover a broad range of sizes and sparsity and come from different real-world origins.For example, Live-Journal is directed social networks graph.RoadNetCA is the California road network.orkut is an undirected social network graph.uk-2002 is a large crawl of the .ukdomains, in which vertices are the pages and edges are links.
We choose two widely used searching algorithms to evaluate the performance, namely Breadth First Search (BFS) and Single Source Shortest Paths (SSSP).
The experiments were conducted on a system with a Nvidia GeForce GTX 780Ti graphic card, which has 12 SMX multiprocessors and 3 GB GDDR5 RAM.On the host side, we use the Intel Core i5-3570 CPU operating at 3.4 GHZ with 32 GB DDR3 RAM.The benchmarks were evaluated using CUDA 6.5 on Fedora 21.All the programs were compiled with the highest optimisation level (-O3).

Comparison with Existing In-Memory Graph Processing Frameworks
In this section, we compare WolfPath with the state-of-art in-memory processing solutions such as CuSha [16] and Virtual Warp Centric [13].In the experiments, we use the CuSha-CW method, because this strategy provides the best performance in all CuSha strategies.Both CuSha and Virtual Warp Centric apply multi-level optimisations to the in-memory workloads.
We first compare the computation times among WolfPath, CuSha and VWC.We also list the breakdown performances in Fig. 5.In these experiments, the Data Transfer is time taken to move data from the host to GPU, the computation refers to the time taken for actual execution of the algorithm.We also recorded the number of iterations that each algorithm takes in different systems, these results are listed in Table 5.
As can be seen from these results, WolfPath outperforms CuSha and Virtual Warp Centric.Although WolfPath requires more iteration to complete the algorithm, the average speedup of WolfPath over CuSha is more than 100×, and 400× over VWC.This is due to the elimination of memory copy operations.Also, the performance of VWC is the worst among 3 systems, because VWC does not guarantee the coa- lesced memory access.On the other hand, with carefully designed data structures, both WolfPath and CuSha can access graph edge in a sequential manner, hence memory performance is much better.We also noticed that WolfPath takes more iterations than other implementations.This is because in iterative computation model, the computation can be converged very fast, but WolfPath is bounded by the graph diameter, which is the upper bound of the iteration count.However, as discussed in the previous section, compare to memory copy operation, computation is much faster.Therefore, WolfPath still outperforms other systems.

Comparison with GPU Out-of-Memory Frameworks
The results shown in the last section demonstrate WolfPath's performance in processing in-memory graphs.However, many real-world graphs are too large to fit in GPU memory.In this section, we examine the WolfPath's ability to process large graphs which cannot fit into GPU memory.To the best of our knowledge, the state-of-art GPU-based graph processing frameworks [6,16,47] assume that the input graphs can fit in the GPU memory.Therefore, in this work, we compare WolfPath (WP) with two CPU-based, out-of-memory graph processing framework: GraphChi (GC) [17] and X-Stream (XS) [33].To avoid disk I/O overhead in systems such as GraphChi and X-Stream, the dataset selected in the experiments can fit in host memory but not in GPU memory.
As shown in Figs. 6 and 7, WolfPath achieves an average speedup of 3000× and 4000× over GraphChi and X-Stream (running with 4 threads), respectively, despite its need to move the data between GPU and CPU.We also list the computation time and iteration counts of three systems in  As can be seen from the Table 6, although WolfPath performs more iterations than GraphChi and X-stream, it still outperforms them.This performance improvement is due to the massive parallel processing power provided by GPU, while GraphChi and X-Stream are CPU-based and their degrees of parallelism are limited. 123

Memory Occupied by Different Graph Representation
From Fig. 5, we can see that VWC has the shortest data transfer time.This is because it represents the graph in the CSR format, which is memory efficient.However, in order to have sequential access to the edges, both WolfPath and CuSha represent graphs with edges, which consume more memory space than CSR.In this section, we evaluate the cost of using the Layered Edge list representation in terms of required memory space against CSR and CuSha's CW representation.Figure 8 shows the memory consumed by WolfPath's Layered Edge List, CuSha-CW and CSR.The Layered Edge List and CuSha-CW need 1.37× and 2.81× more space on average than CSR.CuSha uses 2.05× more memory than WolfPath, because it represents each edge with 4 arrays.

Preprocessing Time
Table 7 shows the preprocessing time of WolfPath, CuSha, and GraphChi.The preprocessing time refers to the time taken to convert the graph from the raw data to the framework specified format (e.g, layered tree in WolfPath or Shard in GraphChi).It consists of the graph traversing time and the time to write the data into the storage.Because CuSha is unable to process the graph larger than GPU memory, the corresponding cells in the table are marked as NA.
The first observation from the table is that (1) for in-memory graphs, CuSha preprocesses the data faster than other two systems, and (2) WolfPath is the slowest system.This is because CuSha does not write the processed data back to the disk.GraphChi OP-WP WP Fig. 9 Performance comparison with and without optimisation will only write a copy of data into shard.In contrast, WolfPath traverses the data using the BFS-based algorithm and then writes the data into a temporary buffer before it writes the data to the hard disk.Therefore, the workload of WolfPath is heavier than two other systems.For graphs larger than GPU memory, WolfPath performs better than GraphChi when processing uk2002 and arabic2005.This is because GraphChi generates many shard files for these two graphs, and hence it takes longer to write to the disk.
From this experiment, we argue that in WolfPath, although the preprocessing is time-consuming, the pre-processing is worthwhile because of the following reasons: First, for each graph, WolfPath only needs to convert it once.Second, the resultant format provides better locality and performance for iterative graph computations.

Effect of Optimisation Techniques
As shown in previous experimental results, the memory accessing operations are the dominant factor that affect the performance, and therefore are our primary target of optimization.Figure 9 shows that the performance improves significantly thanks to the 2 optimization techniques discussed in Sect.4.5, including asynchronous execution and shared memory synchronisation.For example, without these optimization techniques, the execution time of the BFS algorithm is 71ms with the Livejournal graph.With the optimisation the execution time drops to 48ms.

Related Work
Using GPUs for graph processing was first introduced by Harish and Narayanan [12].Since then, the CSR format has become the mainstream representation to store graphs on GPU.Merrill et al. [29] present a work efficient BFS algorithm.They use different approaches to minimize the workload imbalance.Virtual Warp Centric was proposed in [13] to tackle the workload imbalance problem and reduce the intra-warp divergence.
Medusa [47] is a GPU-based graph processing framework that focuses on abstractions for easy programming.MapGraph [6] implements the runtime-based optimisation to deliver good performance.Based on the size of the frontier and the size of the adjacency lists for the vertices in the frontier, MapGraph can choose from different scheduling strategies.
The graph processing solutions described above use the CSR format to represent the graph, hence suffering from the random access to the graph data.CuSha [16] addresses the inefficient memory access by introducing the G-Shard and Concatenated Windows.However, as shown in this paper, CuSha requires frequent data exchange between host and GPU, which leads to long overall execution time.
All of the approaches above make the fundamental assumption that the input graphs fit into GPU memory, which limits the usage of these solutions.However, WolfPath does not suffer from such restriction.
Most existing Out-of-Memory graph processing frameworks are CPU based, these frameworks are aiming to process graphs that do not fit into host memory.For instance, GraphChi [17] is the first disk-based graph processing framework and designed to run on a single machine with limited memory.X-Stream graph processing framework [33] uses the edge-centric processing model that takes as input a binary formatted edge-list.It does not require preprocessing, but requires frequent disk I/O to fetch data.
Totem [7,8] is a hybrid platform that uses both GPU and CPU.It statically partitions a graph into the parts of residing in GPU and host memories based on the degree of vertices.However, as the size of the graph increases, only a fixed portion of the graph is able to fit in the GPU memory, resulting in GPU underutilization.GraphReduce [34] aims to process the graphs that exceed the capacity of the GPU memory.It partitions the graph into shards and loads one or more shards into the GPU memory at a time.In GraphReduce, each shard contains a disjoint subset of vertices.The edges in each shard are sorted in a specific order.

Conclusion
In this paper, we develop WolfPath: a GPU-based graph processing framework for processing iterative traversing-based graph processing algorithms efficiently.A new data structure called Layered Edge List is introduced to represent the graph.This structure helps eliminate the frequent data exchange between host and GPU.We also propose a graph preprocessing algorithm that can convert an arbitrary graph into the layered structure.The experimental results show that WolfPath achieves the significant speedup over the state-of-art in-GPU-memory and out-of-memory graph processing frameworks.

Fig. 1
Fig. 1 An example graph and its layered tree representation.a Example graph and b Layered tree representation

1 : 6 : 8 :
function processing(Layered Tree T , root vertex v rt ) 2: i ← 0 3: v rt = DE F AU LT _V ALU E 4: while i < T.level do 5: if i! = 0 then for all vertexes in parallel in level i do 7: read vertex value from level i − 1 for all Edges in parallel in level i do

Fig- ures 3
and 4 show the speedup of WolfPath over CuSha and VWC.

Fig. 5
Fig. 5 Execution time breakdown of WolfPath, CuSha and VWC on different algorithms and graphs.The time unit is milliseconds.a BFS and b SSSP

Table 1
Real world graphs used in the experiments

Table 2
The computation time comparison of Original CuSha and Modified CuSha

Table 6 .
Since X-Stream does not require any

Table 5
Number of iterations executed by different systems and total execution times (ms)

Table 6
Execution times of WolfPath, GraphChi and X-Stream on different algorithms and graphs.The time is in seconds CSR Wolf CuSha-CWFig.8 Memory occupied by each graph using CSR, CuSha-CW, WolfPath