1 Introduction

Networks are prevalent in many complex systems such as circuits, chemical compounds, protein structures, biological networks, social networks, the Web, and XML documents. Recently, there has been substantial interest in the study of a variety of random networks to serve as mathematical models of complex systems. Various network theories, metrics, topologies, and mathematical models have been proposed to understand the underlying properties and relationships of these systems. Among the proposed network models, the first and most studied is the Erdős–Rényi model [14]. However, the Erdős–Rényi model does not exhibit the characteristics observed in many real-world complex systems [8]. Barabási and Albert [8] discovered a class of inhomogeneous networks, called scale-free networks, characterized by a power-law degree distribution \(P(k)\propto k^{-\gamma }\), where k represents the degree of a vertex and \(\gamma\) is a constant. While high-degree vertices are improbable in Erdős–Rényi networks, they do occur with statistically significant probability in scale-free networks. Furthermore, the work of Albert et al. [6] suggests that these high-degree vertices play an important role in the behavior of scale-free systems, particularly with respect to their resilience [11]. For example, the Barabási–Albert model has been used to evaluate the reliability of the North American electric grid [11].

As the complex systems of today grow larger, the ability to generate progressively larger random networks becomes all the more important. It is well known that the structure of large networks is fundamentally different from that of small networks, and many patterns such as communities emerge only in massive networks [20]. Although various random network models have been used and studied over the last several decades, even efficient sequential algorithms for generating such networks were nonexistent until recently. The efficient sequential algorithms can generate networks with millions of edges in a reasonable amount of time; however, generating networks with billions of edges can take a prohibitively long time. This motivates the need for efficient parallel algorithms for generating such networks. Naïve parallelization of the sequential algorithms may not work due to the dependencies among the edges and the possibility of creating duplicate (parallel) edges.

One of the earliest known parallel algorithms for the preferential attachment model is given by Yoo and Henderson [26]. Although useful, the algorithm has two weaknesses: (1) to ease the handling of dependencies and avoid complex synchronization, they adopted an approximation algorithm rather than an exact one; and (2) to generate the network correctly, the algorithm requires manual adjustment of several control parameters. An exact distributed-memory parallel algorithm was presented in [4]. A literature review of the recent developments is presented in Sect. 6.

Graphics processors (GPUs) are a cost-effective, energy-efficient, and widely available parallel processing platform. GPUs are highly parallel, multi-threaded, many-core processors that have greatly expanded beyond graphics operations and are now widely used for general purpose computing. The use of GPUs is prevalent in many areas such as scientific computation, complex simulations, big data analytics, machine learning, and data mining. However, there is a lack of GPU-based graph/network generators, especially for scale-free networks such as those based on the preferential attachment model. There exist GPU-based network generators for Erdős–Rényi networks [23] and the small-world model [18]. However, until recently we found no GPU-based algorithm to generate scale-free networks [3]. In this paper, we present cuPPA, a novel GPU-based algorithm for generating networks conforming to the preferential attachment model. The algorithm adopts the copy model [17] and employs a simpler synchronization technique suitable for GPUs. With cuPPA, one can generate a network with two billion edges using a modern NVidia GPU in less than 3 s. To the best of our knowledge, this is the first GPU-based algorithm to generate networks using the exact preferential attachment model. Although cuPPA works well on a single GPU, generating larger networks with multiple GPUs is challenging due to the complex synchronization and message communication required among the GPUs. We present another algorithm, called cuPPA-Hash, to generate networks using the preferential attachment model on multiple GPUs. The algorithm uses hashing instead of pseudorandom number generators and does not require any communication among the GPUs. With cuPPA-Hash, we generated a network of 16 billion edges in less than 7 s using a machine consisting of 4 NVidia Tesla P100 GPUs.

The rest of the paper is organized as follows: In Sect. 2, background material is provided in terms of preliminary information, notations, an outline of the network generation problem, and two leading sequential algorithms. In Sect. 3, our parallel cuPPA algorithm for the GPU is presented. In Sect. 4, we present a multi-GPU algorithm called cuPPA-Hash. The experimental study and performance results using cuPPA are described in Sect. 5. We present a review of related works in Sect. 6. Finally, Sect. 7 concludes with a summary and an outline of future directions.

2 Background

2.1 Preliminaries and Notations

In the rest of this paper, we use the following notations. We denote a network by G(V, E), where V and E are the sets of vertices and edges, respectively, with \(m = |E|\) edges and \(n = |V|\) vertices labeled \(0, 1, 2, \dots , n-1\). For any \((u,v)\in E\), we say u and v are neighbors of each other. The set of all neighbors of \(v \in V\) is denoted by N(v), i.e., \(N(v)=\{u \in V | (u,v) \in E\}\). The degree of v is \(d_v = |N(v)|\). If u and v are neighbors, we sometimes say that u is connected to v and vice versa.

We develop parallel algorithms using the CUDA (Compute Unified Device Architecture) framework on the GPU. A GPU contains multiple streaming multiprocessors (SMs). An SM is a group of core processors. Each core processor executes only one thread at a time, and all core processors can execute their corresponding threads simultaneously. Threads that perform operations requiring high-latency data fetches are put into a waiting state while other pending threads are executed. Therefore, GPUs increase throughput by keeping the processors busy. All thread management, including the creation and scheduling of threads, is performed entirely in hardware with virtually zero overhead and requires negligible time for launching work on the GPU. Because of these advantages, modern supercomputers such as Summit and Titan, two of the largest supercomputers in the USA, are built using GPUs in addition to conventional central processing units (CPUs).

We use K, M, and B to denote thousand, million, and billion, respectively, e.g., 2 B stands for two billion.

2.2 Preferential Attachment-Based Models

The preferential attachment model generates randomly evolving scale-free networks using a preferential attachment mechanism: A new vertex is added to the network and connected to some existing vertices that are chosen preferentially based on some property of the vertices. In the most common method, preference is given to vertices with larger degrees: The higher the degree of a vertex, the higher the probability of choosing it. In this paper, we study only degree-based preferential attachment, and in the rest of the paper, by preferential attachment (PA) we mean degree-based preferential attachment.

Before presenting our parallel algorithms for generating PA networks, we briefly discuss the sequential algorithms for the same. Many preferential attachment-based models have been proposed in the literature. Two of the most prominent models are the Barabási–Albert model [8] and the copy model [17] as discussed below.

2.3 Sequential Algorithm: Barabási–Albert Model

One way to generate a random PA network is to use the Barabási–Albert (BA) model. Many real-world networks have two important characteristics: (1) They are evolving in nature and (2) the network tends to be scale-free [8]. In the BA model, a new vertex is connected to an existing vertex that is chosen with probability directly proportional to the current degree of the existing vertex.

The BA model works as follows: Starting with a small clique of \(\hat{d}\) vertices, in every time step, a new vertex t is added to the network and connected to \(d \le \hat{d}\) randomly chosen existing vertices: \(F_\ell (t)\) for \(1 \le \ell \le d\) with \(F_\ell (t) < t\); that is, \(F_\ell (t)\) denotes the \(\ell\)th vertex to which t is connected. Thus, each time step adds d new edges \((t, F_1(t)), (t, F_2(t)), \ldots , (t, F_d(t))\) to the network, which exhibits the evolving nature of the model. Let \(\mathbb {F}(t)=\left\{ F_1(t), F_2(t), \ldots , F_d(t) \right\}\) be the set of outgoing vertices from t. Each of the d end points in the set \(\mathbb {F}(t)\) is randomly selected based on the degrees of the vertices in the current network. In particular, the probability \(P_i(t)\) that an outgoing edge from vertex t is connected to vertex \(i < t\) is given by \(P_i(t)=\frac{d_{i}}{\sum _{j}d_{j}}\), where \(d_j\) represents the degree of vertex j.

The networks generated by the BA model are called BA networks, which bear the aforementioned two characteristics of a real-world network. BA networks have a power-law degree distribution. A degree distribution is called power law if the probability that a vertex has degree d is given by \(\Pr \left[ d\right] \propto d^{-\gamma }\), where \(\gamma \ge 1\) is a constant. Barabási and Albert showed that the preferential attachment method of selecting vertices results in a power-law degree distribution [8].

A naïve implementation of network generation based on the BA model takes \(\Omega (n^2)\) time where n is the number of vertices. Batagelj and Brandes give an efficient algorithm with a running time of \(\mathcal {O}(m)\) where m is the number of edges [9]. This algorithm maintains a list of vertices such that each vertex i appears in this list exactly \(d_i\) times. The list can easily be updated dynamically by simply appending u and v to the list whenever a new edge (uv) is added to the network. Now, to find F(t), a vertex is chosen from the list uniformly at random. Since each vertex i occurs exactly \(d_i\) times in the list, we have the probability \(\Pr \left[ F(t) = i\right] = \frac{d_{i}}{\sum _{j }d_{j}}\).
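The repeated-vertex-list technique of Batagelj and Brandes can be sketched as follows. This is an illustrative Python sketch under our own naming, not the authors' implementation; duplicate-edge checks are omitted for brevity, and a clique of \(d \ge 2\) initial vertices is assumed.

```python
import random

def ba_network(n, d, seed=None):
    """BA-style generation via the Batagelj-Brandes repeated list.

    Each vertex i appears in `targets` exactly d_i times, so a uniform
    sample from the list picks vertex i with probability d_i / sum_j d_j.
    Duplicate-edge checks are omitted for brevity; assumes d >= 2.
    """
    rng = random.Random(seed)
    edges = []
    targets = []                       # vertex i appears here d_i times
    for u in range(d):                 # initial clique of d vertices
        for v in range(u):
            edges.append((u, v))
            targets += [u, v]
    for t in range(d, n):
        # sample all d end points before publishing t's own copies,
        # so the new vertex cannot link to itself
        chosen = [rng.choice(targets) for _ in range(d)]
        for f in chosen:
            edges.append((t, f))       # new edge (t, F(t))
        targets += chosen + [t] * d    # O(1) amortized degree update
    return edges
```

Appending to the list on every edge insertion is what makes the whole generation run in \(\mathcal {O}(m)\) time.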

2.4 Sequential Algorithm: Copy Model

As it turns out, the BA model does not easily lend itself to an efficient parallelization [4]. Another algorithm called the copy model [16, 17] preserves preferential attachment and power-law degree distribution. The copy model works as follows: Similar to the BA model, it starts with a small clique of \(\hat{d}\) vertices and in every time step, a new vertex t is added to the network to create \(d \le \hat{d}\) connections to existing vertices \(F_\ell (t)\) for \(1 \le \ell \le d\) with \(F_\ell (t) < t\). For each connection \((t,F_\ell (t))\) from vertex t, the following steps are executed:

Step 1: First, a random vertex \(k \in [0, t-1]\) is chosen with uniform probability.

Step 2: Then, \(F_\ell (t)\) is determined as follows:

$$\begin{aligned} F_\ell (t)&= k&\text {with prob. } p \quad&(\text {Direct edge}) \\ &= F_l(k)&\text {with prob. } (1-p)&(\text {Copy edge}) \end{aligned}$$

where \(l \in [1, d]\) is chosen uniformly at random; that is, \(F_l(k)\) is a random outgoing connection of vertex k.

We also denote \(\mathbb {F}(t)=\left\{ F_1(t), F_2(t), \ldots , F_d(t) \right\}\) to be the set of outgoing vertices from vertex t.

It can be easily shown that a connection from vertex t to vertex i is made with probability \(\Pr \left[ i \in \mathbb {F}(t) \right] = \frac{d_{i}}{\sum _{j }d_{j}}\) when \(p=\frac{1}{2}\). Thus, when \(p=\frac{1}{2}\), this algorithm follows the Barabási–Albert model as shown in [2, 4].

Thus, the copy model is more general than the BA model. It has been previously shown [17] that the copy model produces networks with degree distribution that follows a power-law \(d^{-\gamma }\), where the value of the exponent \(\gamma\) depends on the choice of p. Further, it is easy to see the running time of the copy model is \(\mathcal {O}(m)\). The copy model has been used to develop efficient parallel algorithms for generating preferential attachment networks in distributed-memory and shared-memory machines [4, 7]. In our work presented in this paper, we adopt the copy model as a starting point to design and develop our GPU-based parallel algorithm.
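The sequential copy model described above can be sketched as follows. This is an illustrative Python sketch (our own naming, not the paper's implementation); copying from one of the d initial vertices simply connects to that vertex, and duplicate-edge checks are omitted for brevity.

```python
import random

def copy_model(n, d, p, seed=None):
    """Sequential copy model; F[v] lists the d out-neighbors of vertex v.

    Step 1 picks a uniform random earlier vertex k; Step 2 keeps k with
    probability p (direct edge) or copies a random out-edge of k with
    probability 1 - p (copy edge).
    """
    rng = random.Random(seed)
    F = {}
    for v in range(d, n):
        F[v] = []
        for _ in range(d):
            k = rng.randrange(v)            # Step 1: uniform k in [0, v-1]
            if k < d or rng.random() < p:   # initial vertex, or direct edge
                F[v].append(k)
            else:                           # copy edge with prob. 1 - p
                l = rng.randrange(d)        # random out-edge of k
                F[v].append(F[k][l])        # F[k] is complete since k < v
    return F
```

Note how each vertex does \(\mathcal {O}(d)\) work, giving the \(\mathcal {O}(m)\) total running time mentioned above; with \(p=0\) every edge resolves, by repeated copying, to one of the d initial vertices.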

3 GPU-based Parallel Algorithm: cuPPA

The PA model imposes a critical dependency: Every new vertex needs the state of the previous network to compute its edges. This poses a major challenge in parallelizing preferential attachment algorithms. In phase v, determining \(\mathbb {F}(v)\) requires that \(\mathbb {F}(i)\) be known for each \(i < v\). As a result, any algorithm for preferential attachment appears to be highly sequential in nature: Phase v cannot be executed until all previous phases are completed.

In [4], a distributed-memory parallel algorithm was proposed that exploits the copy model to break this sequentiality and run in parallel. Building on that approach, we designed cuPPA, an efficient parallel algorithm for generating preferential attachment-based networks on a single GPU, as described next. Here, we assume that the entire network can be stored in the GPU memory.

Let T be the number of threads in the GPU. The set of vertices V is partitioned into T disjoint subsets of vertices \(V_0, V_1, \ldots ,V_{T-1}\); that is, \(V_i \subset V\), such that for any i and j, \(V_i \cap V_j = \emptyset\) and \(\bigcup _{i} V_i = V\). Thread \({\mathcal {T}}_i\) is responsible for computing and updating F(v) for all \(v\in V_i\). The algorithm starts with an initial network, which is a clique of the first d vertices labeled \(0,1,2,\ldots ,d-1\). For each vertex v, the algorithm computes d edges \((v,F_1(v)), (v,F_2(v)), \ldots , (v,F_d(v))\) and ensures that these edges are distinct, without any parallel edges. We denote the set of vertices \(\{F_1(v), F_2(v), \ldots , F_d(v)\}\) by \(\mathbb {F}(v)\). The algorithm works in two phases. In the first phase of the algorithm (called execute copy model), we execute the copy model for all vertices in parallel (using all threads). This phase creates all the direct edges and some of the “copy” edges (Eq. 2). However, many copy edges might not be fully processed due to the dependencies. The incomplete copy edges are put in a waiting queue called \(\mathcal {Q}\). In the second phase of the algorithm (called resolve incomplete edges), we resolve the incomplete edges from the waiting queue \(\mathcal {Q}\) and finalize the copy edges. The pseudocode of cuPPA is given in Algorithm 1. A list of symbols used in the paper is presented in Table 1.

Table 1 Symbols used in this paper

In the first phase (lines 3–21), the algorithm executes the copy model for all of its vertices. The edges that could not be completed are stored in a queue \(\mathcal {Q}\), called the waiting queue, to be processed later. Each of the vertices from d to \(n-1\) generates d new edges. Two fundamental issues need to be handled: (1) how to select \(F_{\ell }(v)\) for vertex v, where \(1 \le \ell \le d\), and (2) how to avoid creating duplicate edges. Multiple edges for a vertex v are created by repeating the same procedure d times (line 4), and duplicate edges are avoided by simply checking whether such an edge already exists; such a check is done whenever a new edge is created.

For the \(\ell\)th edge of a vertex v, another vertex u is chosen from \([0, v-1]\) uniformly at random (lines 5–6). Edge (v, u) is created with probability p (line 7). However, before such an edge (v, u) is created in line 8, its existence is checked in line 9. If the edge already exists, it is discarded and the process is repeated (line 5). With the remaining probability \(1-p\), v is connected to some vertex in \(\mathbb {F}(u)\); that is, we make an edge \((v, F_{l}(u))\), where l is chosen from [1, d] uniformly at random.

After the first phase is completed, the algorithm resolves the incomplete edges by processing the waiting queue (lines 22–29). If an item in the current queue \(\mathcal {Q}\) cannot be resolved during this step, it is placed in another queue \(\mathcal {Q}'\). After all incomplete edges in the queue \(\mathcal {Q}\) are processed, the queues \(\mathcal {Q}\) and \(\mathcal {Q}'\) are swapped and \(\mathcal {Q}'\) is cleared. We repeat this process until both queues are empty. Note that there is no circular dependency among the copy edges: For two vertices \(u > v\), copy edges from u may depend on edges from v but not vice versa. Therefore, there is no circular waiting, and processing the waiting queue cannot deadlock.
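The two-phase structure can be emulated sequentially as follows. This is an illustrative Python sketch under our own assumptions: the initial network is simplified to a \((d+1)\)-clique so that every initial vertex has exactly d out-edges, vertices are processed in a shuffled order to mimic arbitrary thread interleaving, and duplicate-edge checks are omitted.

```python
import random

def cuppa_two_phase(n, d, p, seed=None):
    """Sequential emulation of cuPPA's two phases (illustrative sketch).

    Phase 1 runs the copy model for all vertices in arbitrary order,
    queueing copy edges whose target entry is not yet known.
    Phase 2 repeatedly scans the waiting queue Q, moving still-unresolved
    edges to Q', then swapping, until both queues are empty.
    """
    rng = random.Random(seed)
    F = [[None] * d for _ in range(n)]
    for v in range(d + 1):                      # simplified initial clique
        F[v] = [u for u in range(d + 1) if u != v]
    order = list(range(d + 1, n))
    rng.shuffle(order)                          # mimic arbitrary thread order
    Q = []
    for v in order:                             # Phase 1: execute copy model
        for i in range(d):
            k = rng.randrange(v)
            if rng.random() < p:
                F[v][i] = k                     # direct edge: done
            else:
                l = rng.randrange(d)
                if F[k][l] is not None:
                    F[v][i] = F[k][l]           # copy edge resolved now
                else:
                    Q.append((v, i, k, l))      # incomplete: wait
    while Q:                                    # Phase 2: resolve queue
        Q2 = []
        for (v, i, k, l) in Q:
            if F[k][l] is not None:
                F[v][i] = F[k][l]
            else:
                Q2.append((v, i, k, l))
        Q, Q2 = Q2, []                          # swap queues and clear
    return F
```

Because every dependency points to a strictly smaller vertex, each scan of the queue resolves at least the entry with the smallest source vertex, so the loop always terminates.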

3.1 Graph Representation

We use one array G of nd elements to represent and store the entire graph. Each vertex u connects to d existing vertices, and the neighbors of u are stored in the indices from ud to \((u+1)d -1\) (inclusive), which hold the other end-point vertices. We call these indices the outgoing vertex list of vertex u. The initial network occupies the first \(d^2\) entries of the array. For any edge (u, v) where \(u>v\) and \(u,v>d\), the edge is represented by storing v in one of the d entries of the outgoing vertex list of u. Note that the graph G contains exactly nd edges as defined by the Barabási–Albert or the copy model. Any index \(0 \le i < nd\) of the array G holds the \((i \bmod d)\)th end point of vertex \(\left\lfloor \frac{i}{d} \right\rfloor\).
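The index arithmetic of this flat representation is simple enough to state directly (a small sketch; the function names are ours):

```python
def slot_index(u, l, d):
    """Position in the flat array G of the l-th out-neighbor of vertex u."""
    return u * d + l

def owner_of(i, d):
    """(vertex, slot) owning array position i: position i holds the
    (i mod d)-th end point of vertex floor(i / d)."""
    return i // d, i % d
```

With this layout, the d out-neighbors of vertex u are simply the slice `G[u*d : (u+1)*d]`, so no per-vertex pointers or adjacency lists are needed on the GPU.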

3.2 Partitioning and Load Balancing

Recall that we distribute the computation among the threads by partitioning the set of vertices \(V = \{0, 1, \ldots , n-1\}\) into T subsets \(V_0, V_1, \ldots , V_{T-1}\) as described at the beginning of Sect. 3, where T is the number of available threads. Although several partitioning schemes are possible, our study suggests that the round-robin partitioning (RRP) scheme best suits our algorithm. In this scheme, vertices are distributed in a round-robin fashion among all threads. Partition \(V_i\) contains the vertices \(\langle i, i+T, i+2T,\ldots ,i+kT \rangle\) such that \(i+kT \le n-1 < i+(k+1)T\); that is, \(V_i = \{j \mid j \bmod T = i\}\). In other words, vertex j is assigned to set \(V_{j \bmod T}\). Therefore, the numbers of vertices in the sets are almost equal; i.e., the number of vertices in a set is either \(\left\lceil \frac{n}{T} \right\rceil\) or \(\left\lfloor \frac{n}{T} \right\rfloor\). The round-robin partitioning scheme is illustrated in Fig. 1.

Fig. 1 Distributing 21 vertices among 3 threads using round-robin partitioning
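The round-robin partition reduces to a one-line rule (an illustrative sketch; the function name is ours):

```python
def round_robin_partition(n, T):
    """Partition vertices 0..n-1 into T sets with V_i = { j : j mod T == i }.

    Set sizes differ by at most one: ceil(n/T) or floor(n/T).
    """
    return [list(range(i, n, T)) for i in range(T)]
```

For the configuration of Fig. 1 (21 vertices, 3 threads), thread 0 receives vertices 0, 3, 6, ..., 18, and every thread receives exactly 7 vertices.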

3.3 Segmented Round-Robin Partitioning

However, the naïve round-robin scheme discussed above has some technical issues. As described in Sect. 3, the first phase of Algorithm 1 executes the copy model for every vertex assigned to a thread and stores any unresolved copy edge in the waiting queue. In the second phase, the algorithm takes each unresolved edge from the waiting queue and tries to resolve it. To reduce the memory latency of accessing the waiting queue, we store the waiting queue \(\mathcal {Q}\) in the GPU shared memory, which offers many-fold faster access than the global GPU memory. Note that this memory is limited in capacity and is shared among all threads running within the same block. Modern GPUs such as the NVidia GeForce 1080 have 48 KB of ultra-fast shared memory per block. Since the amount of shared memory is very limited, it can only store a limited number of unresolved items in the queue. Let \(\mathcal {C}\) denote the total capacity of the waiting queue. For example, with 48 KB of shared memory, we have a total capacity of \(\mathcal {C}=\frac{48\times 1024}{8}=6144\) items in the waiting queue, where each item takes 8 bytes of memory. If we use \(\tau\) threads per block, each thread has a capacity of \(\frac{\mathcal {C}}{\tau }\) items in the waiting queue. Therefore, if the number of vertices assigned to a thread is too large, it may generate a large number of unresolved copy edges, essentially forcing the algorithm to use the large but slow global GPU memory instead of the available shared memory.

Fig. 2 Distributing 21 vertices among 3 threads using segmented round-robin partitioning with 2 rounds

In order to exploit the faster shared memory without overflowing the waiting queue, we use a modified round-robin partitioning scheme called segmented round-robin partitioning (SRRP). In this scheme, the entire set of vertices V is first partitioned into k consecutive subsets \(S_1,S_2,S_3 \ldots S_k\) called segments. From the definition of the copy model, it is clear that vertices in a segment \(S_i\) may only depend on vertices in a segment \(S_j\) where \(i \ge j\), but not vice versa. Therefore, the segments have to be processed consecutively. Let \(B_i=|S_i|\) denote the number of elements (the segment size) in segment \(S_i\), where \(1 \le i \le k\). The parallel algorithm is then executed in k consecutive rounds, where round i executes the parallel algorithm for all the vertices in segment \(S_i\). In round i, the \(B_i\) vertices in segment \(S_i\) are further partitioned into T subsets \(V_0(S_i), V_1(S_i), \ldots V_{T-1}(S_i)\) using the round-robin scheme discussed above and processed in parallel using the T threads. The technique is illustrated in Fig. 2.

Next, we need to determine the best segment size to avoid overflow while using the shared memory. From the copy model, it is easy to see that the lower the probability p, the more likely an edge is to be placed in the waiting queue. In the worst case, when \(p=0\), all generated edges are copy edges; therefore, up to d unresolved copy edges per vertex could be placed in the waiting queue. Additionally, as the value of d gets larger, the number of copy edges increases and, hence, so does the waiting queue size. Therefore, p and d both have a significant impact on the required size of the waiting queue. With that in mind, we use two approaches for choosing the segment size:

  • Fixed Segment Size: The simplest way is to use a fixed segment size in each round. From the previous discussion, it is clear that in the worst case we need to place d items per vertex in the waiting queue. Therefore, we can use up to \(\tau = \min \left( \frac{\mathcal {C}}{d}, \theta \right)\) threads per block, where \(\mathcal {C}\) is the total queue capacity and \(\theta\) is the maximum number of threads per block. The segment size is then \(\frac{\mathcal {C}}{d\tau }\) vertices per thread. Note that we can exploit the shared memory for \(d \le \mathcal {C}\); otherwise, we need to use the global memory. However, in almost all practical scenarios we have \(d \ll \mathcal {C}\); hence, we can take advantage of the shared memory.

  • Dynamic Segment Size: Although the fixed segment size scheme ensures that the queue will not overflow in any round, it may not be the most efficient choice. We use another scheme where the segment size is determined dynamically between rounds based on the current state of the algorithm. In this scheme, we start with \(\tau\) threads per block and a segment size of \(\frac{\mathcal {C}}{d\tau }\) vertices per thread, as in the fixed segment size scheme. At the end of each round, we determine the maximum number of items that were placed in the waiting queue by any thread. If this number is less than a fraction \(\frac{1}{f}\) of the per-thread queue capacity \(\frac{\mathcal {C}}{\tau }\), we increase the total capacity \(\mathcal {C}\) by a factor of f (typically, we set \(f=2\)). Before the next round, we recompute the required number of threads per block and update the segment size accordingly.
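The fixed-segment-size arithmetic can be worked through in a few lines. This sketch uses the 48 KB / 8-byte-item figures from the text; the value \(\theta = 1024\) used in the test is our assumption of a typical CUDA maximum block size, not a figure from the paper.

```python
def fixed_segment_params(shared_bytes, item_bytes, d, theta):
    """Fixed-segment-size parameters (illustrative, from Sect. 3.3).

    C   : total waiting-queue capacity in shared memory, in items
    tau : threads per block, min(C/d, theta)
    vpt : vertices handled per thread per round, C/(d*tau)
    In the worst case (p = 0) each vertex queues d items, so vpt * d
    must never exceed the per-thread queue capacity C / tau.
    """
    C = shared_bytes // item_bytes
    tau = min(C // d, theta)
    vpt = C // (d * tau)
    return C, tau, vpt
```

For example, 48 KB of shared memory and 8-byte items give \(\mathcal {C}=6144\); with \(d=4\) the worst-case load of \(d\) items per vertex still fits inside each thread's \(\frac{\mathcal {C}}{\tau }\) queue slots.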

3.4 CUDA-Specific Deadlock Scenario

In the round-robin scheme, completion of a copy edge of a vertex in a thread \({\mathcal {T}}_i\) may depend on some other thread \({\mathcal {T}}_j\) where \(i \ne j\). Due to the nature of the dependency, \({\mathcal {T}}_j\) may also have a copy edge that depends on a vertex belonging to \({\mathcal {T}}_i\). Therefore, if any of these threads is not running on the GPU at the same time as the other, the other thread will not be able to complete, and a deadlock may arise. To avoid such a situation, we must ensure that either all the GPU threads are running concurrently or the dependent threads are put to sleep for a while. In the current CUDA framework, the runtime engine schedules each kernel block to a streaming multiprocessor, and the blocks of running threads are non-preemptible. Therefore, to ensure that threads run concurrently and avoid deadlock, we cannot use more blocks than the number of available streaming multiprocessors. Note that the upcoming CUDA runtime supports cooperative groups; on such future systems, the deadlock situation could be avoided while using more blocks than the number of streaming multiprocessors.

4 Generating Networks Using Multiple GPUs

So far we have presented an algorithm to generate the preferential attachment-based model using a single GPU. Our algorithm works well if the entire network can be stored in the GPU memory. However, the size of the generated network is limited by the amount of GPU memory. Nowadays, it is very common to have multiple GPUs in a computing cluster, even on commodity machines. Therefore, we could potentially use multiple GPUs to generate even larger networks. In this section, we discuss how cuPPA can be extended for multiple GPUs.

Similar to distributing the work of generating edges among multiple threads, as discussed at the beginning of Sect. 3, we need to distribute the vertices among multiple GPUs. Let the vertices V be partitioned into g subsets \(\mathbb {V}_1,\mathbb {V}_2,\ldots ,\mathbb {V}_g\), where g is the number of available GPUs. GPU i processes the edges generated by the vertices in \(\mathbb {V}_i\). We then execute the cuPPA algorithm on each of the GPUs with its set of vertices. We can immediately see that phase 1 of the algorithm can be computed independently on all GPUs. However, resolving the incomplete edges in phase 2 requires careful attention. Due to the nature of the dependency, an incomplete edge on a GPU may require information resident in the memory of another GPU. Therefore, synchronization among the GPUs is required. Such a synchronization technique between CPUs with distributed memory was presented in [4]. Although the technique can be adapted for synchronization between GPUs, it requires complex and intricate communication between the GPUs. NVidia CUDA offers another scheme for accessing memory across multiple GPUs called “unified memory addressing.” In this case, a single memory address space accessible from any processor in a system is available through the CUDA runtime application programming interface; therefore, any GPU can access the memory of other GPUs. However, due to the random and sparse nature of the memory accesses, this approach would not yield the desired benefit. In the next section, we present cuPPA-Hash, an alternative algorithm that generates networks using the preferential attachment model with hash functions instead of pseudorandom number generators.

4.1 cuPPA-Hash: A Hash Function-Based Implementation

Notice that the dependency of generating an edge on other vertices arises only while creating a copy edge, i.e., when a vertex u tries to connect to a random end point of another vertex v. We adapt an idea previously used for communication-free parallel generation of BA graphs [24] to a GPU setting. Consider a vertex u copying an end point from a vertex v. Rather than looking up this value from a memory cell that is filled when vertex v is generated, the end point is recomputed independently. This is possible by using a hash function to generate the random numbers instead of pseudorandom numbers. This approach has the additional benefit that the exact same graph can be reproduced using the same hash functions. We also extend [24] by developing a more general preferential attachment-based algorithm using the copy model, called cuPPA-Hash.


A simplified pseudocode of cuPPA-Hash is presented in Algorithm 2. The initial set of vertices is first divided into g mutually exclusive subsets \(\mathbb {V}_1,\ldots ,\mathbb {V}_g\), where g is the number of GPUs. Next, each GPU k processes the vertices in set \(\mathbb {V}_k\) using the procedure cuPPA-Hash (\(\mathbb {V}_k\)). We further partition the set of vertices \(\mathbb {V}_k\) into T subsets \(V_0, V_1, \ldots , V_{T-1}\), where T is the number of threads on GPU k. Thread i executes the copy model on the vertices in set \(V_i\) (line 3). For each of the d outgoing edges of a vertex v, the function Calculate-Edge calculates the end-point vertex using the copy model. The \(\ell\)th outgoing edge of the vertex v is uniquely denoted by the index \(e = vd + \ell\) (line 7). The edge index e is used to generate a hash value r using a hash function; we used CRC64 as our 64-bit hash function. Note that r denotes the index of the lth outgoing edge of a vertex u, calculated in line 9. If \(u<d\), then u denotes an initial vertex and we connect \(F_\ell (v)\) to u (line 11). Otherwise, we compute a floating-point number r using a floating-point version of the hash function (line 14). If \(r<p\) (i.e., with probability p), we connect \(F_\ell (v)\) to u. Otherwise, we calculate the outgoing edge \(F_l(u)\) (line 17) recursively. In the actual implementation, we use an iterative function instead of the recursive one.
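The hash-based edge recomputation can be sketched as follows. This is an illustrative Python sketch in the iterative form mentioned above; it substitutes a truncated BLAKE2b digest for the paper's CRC64 (any fixed 64-bit hash serves the purpose), and the xor constant used to derive the floating-point hash is our own choice.

```python
import hashlib

def h64(e):
    """64-bit hash of an edge index (stand-in for the paper's CRC64)."""
    dgst = hashlib.blake2b(e.to_bytes(8, 'little'), digest_size=8).digest()
    return int.from_bytes(dgst, 'little')

def hfloat(e):
    """Hash of an edge index mapped to [0, 1) (floating-point variant)."""
    return h64(e ^ 0x9E3779B97F4A7C15) / 2.0**64   # xor decorrelates from h64

def calculate_edge(v, l, d, p):
    """End point F_l(v), recomputed from hashes alone (iterative form).

    e = v*d + l uniquely indexes the l-th out-edge of v; hashing e yields
    an earlier edge index whose owner u either becomes the end point
    (u < d, or a direct edge with prob. p) or the vertex whose edge is
    copied next.
    """
    while True:
        e = v * d + l
        r = h64(e) % (v * d)      # a pseudorandom earlier edge index
        u, l = r // d, r % d      # its owner vertex u < v, and its slot
        if u < d or hfloat(e) < p:
            return u              # initial vertex, or direct edge
        v = u                     # copy edge: recompute u's l-th end point
```

Since the owner vertex strictly decreases on every iteration, the loop always terminates, and because the result depends only on the hash of the edge index, any GPU (or thread) can recompute any edge without communication, which is exactly what makes the multi-GPU version communication-free.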

Note that the algorithm does not require access to any memory residing on other GPUs; instead, all copy edges are recomputed as needed. Therefore, this approach requires more computation than the original cuPPA algorithm. However, due to the independent computations, the algorithm scales very well to multiple GPUs, as shown in the experimental section.

5 Experimental Results

In this section, we evaluate our algorithms and their performance by experimental analysis. In the following sections, we denote our first algorithm, which uses pseudorandom number generators, as cuPPA-Pure, and the second algorithm, which uses a hash function, as cuPPA-Hash. We demonstrate the accuracy of our algorithms by showing that they produce networks with the desired power-law degree distribution. We also compare the runtime of our algorithms with that of other sequential and parallel algorithms.

5.1 Hardware and Software

We used a computer consisting of 6 AMD Phenom(tm) II 6174 processors with a 3.3 GHz clock speed and 64 GB of system memory. The machine also incorporates an NVidia 1080 GPU with 8 GB of memory. The operating system is Ubuntu 16.04 LTS, and all software on this machine was compiled with GNU gcc 4.6.3 with the -O3 optimization flag. The CUDA compilation tools V8 and the nvcc compiler were used for the GPU code. In additional experiments, we used another system consisting of 4 NVidia Tesla P100 GPUs with 16 GB of memory each.

5.2 Degree Distribution

To demonstrate the accuracy of cuPPA-Pure and cuPPA-Hash, we compared them with the sequential Barabási–Albert (SBA) [9] and the sequential copy model (SCM) algorithms. The degree distributions of the networks generated by SBA, SCM, cuPPA-Pure, and cuPPA-Hash are shown in Fig. 3 in \(\log\)–\(\log\) scale. We used \(n=500M\) vertices, each generating \(d=4\) new edges, for a total of two billion (\(2\times 10^9\)) edges. The distribution is heavy-tailed, which is a distinct feature of power-law networks. The exponent \(\gamma\) of the power-law degree distribution is measured to be 2.7. This is consistent with the fact that for a scale-free network with a finite average degree, the exponent should satisfy \(2< \gamma < \infty\) [12]. Also notice that the degree distributions of SBA and SCM are nearly identical, and the degree distributions of both cuPPA algorithms are similar to those of SBA and SCM.

Fig. 3

The degree distributions of the PA networks (\(n=500M\), \(d=4\)). In \(\log\)–\(\log\) scale the degree distribution is a straight line, validating the scale-free property. Further, all four models produce almost identical degree distributions, showing that both versions of cuPPA produce networks with accurate degree distributions

Fig. 4

Visualization of networks generated by cuPPA using \(n=10{,}000\), \(p=0.5\) and \(d=1\)

Fig. 5

Visualization of networks generated by cuPPA using \(n=10{,}000\), \(p=0.5\) and \(d=2\)

Fig. 6

Visualization of networks generated by cuPPA using \(n=10{,}000\), \(p=0.5\) and \(d=4\)

5.3 Visualization of Generated Graphs

To gain insight into the structure and degree distributions of the generated networks, we visualized some of the networks produced by our algorithm using Gephi, a popular network visualization tool. For the sake of aesthetics and to minimize clutter, we focused on a few small networks, choosing \(n=10{,}000\), \(p=0.5\), and \(d=1,2,4\). The visualizations are shown in Figs. 4, 5, and 6.

5.4 Effect of Edge Probability on Degree Distribution

As mentioned earlier, a strength of the copy model is its ability to generate other preferential attachment networks by varying a single parameter, namely, the probability p. In Fig. 7, we show the degree distributions of networks generated with varying p using both cuPPA-Pure and cuPPA-Hash. When \(p=0\), all edges are copy edges, and the network becomes a star network in which all additional vertices connect to the d initial vertices. With a small value of p (\(p=0.01\)), we can generate a network with a very long tail. When we set \(p=0.5\), we obtain Barabási–Albert networks, whose degree distribution is a straight line in log–log scale. When we increase p to 1, we obtain a network consisting entirely of direct edges, whose degree distribution has no tail.
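The role of p can be seen in a minimal sequential sketch of the copy model. It is simplified to \(d=1\) (each new vertex creates one edge), and the function and variable names are ours; cuPPA itself implements the general d-edge, GPU-parallel version.

```python
import random

def copy_model(n: int, p: float, seed: int = 0):
    """Sequential copy model with d=1: returns target[v], the endpoint of v's edge."""
    rng = random.Random(seed)
    target = [0]  # vertex 0 is the initial vertex, looped to itself
    for v in range(1, n):
        u = rng.randrange(v)          # pick a uniformly random earlier vertex
        if rng.random() < p:
            target.append(u)          # direct edge: connect to u itself
        else:
            target.append(target[u])  # copy edge: connect to u's endpoint
    return target

# p = 0: every edge is copied, so all vertices attach to vertex 0 (a star).
assert all(t == 0 for t in copy_model(1000, p=0.0))
# p = 1: every edge is direct, i.e., a uniformly random earlier vertex (no tail).
assert all(t < v for v, t in enumerate(copy_model(1000, p=1.0)) if v > 0)
```

Copying the endpoint of a uniformly chosen earlier vertex in effect favors vertices in proportion to their degree, which is why intermediate values of p yield preferential attachment behavior.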

Fig. 7

The degree distributions of the networks by cuPPA (\(n=500M\), \(d=4\)) with varying p

5.5 Waiting Queue Size of cuPPA-Pure

As mentioned in Sect. 3.3, the waiting queue size depends on p and d. To evaluate their impact, we ran simulations using 1280 CUDA threads (20 blocks and 64 threads per block), where each thread processed exactly one vertex. We varied p from 0 to 1 over several probability values and varied d from 1 to 4096 in increasing powers of 2. In Fig. 8, we show the number of items placed in the waiting queue per vertex for different combinations of p and d, along with the worst-case value as a line in the plot. As seen from the figure, in the worst case (\(p=0\)), the maximum size of the waiting queue increases linearly with d for smaller values of d (up to 64); beyond that, it grows much more slowly than d. Therefore, for smaller values of d, we need to provision for at least d items per vertex in the waiting queue.
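The qualitative dependence on p and d can be reproduced with a toy simulation, under simplifying assumptions of ours: one segment of vertices is processed concurrently, every vertex draws d edges, and any copy choice whose source lies inside the still-unresolved segment counts as a queued item.

```python
import random

def max_queue_items(seg_size: int, d: int, p: float, seed: int = 0) -> int:
    """Largest number of queued items produced by any single vertex of one segment.

    Vertices 0..seg_size-1 are assumed already finalized; vertices
    seg_size..2*seg_size-1 form the segment being processed in parallel.
    """
    rng = random.Random(seed)
    worst = 0
    for v in range(seg_size, 2 * seg_size):
        queued = 0
        for _ in range(d):
            u = rng.randrange(v)         # uniformly random earlier vertex
            is_copy = rng.random() >= p  # with probability 1 - p, copy u's choice
            if is_copy and u >= seg_size:
                queued += 1              # u is unresolved: the item must wait
        worst = max(worst, queued)
    return worst

# p = 1 needs no queue at all; p = 0 approaches the worst case of d items per vertex.
assert max_queue_items(128, 8, p=1.0) == 0
assert max_queue_items(128, 8, p=0.0) <= 8
```

This mirrors the trend in Fig. 8: smaller p and larger d both push queue occupancy toward its worst case of d items per vertex.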

Fig. 8

Maximum size of the waiting queue per thread for different values of p and d (both axes in \(\log\) scale). In the worst case (\(p=0\)), the maximum size increases linearly with d for smaller values (\(d\le 64\)). For larger d, the actual maximum size of waiting queue is comparatively smaller than the worst case

Fig. 9

Size of the waiting queue decreases significantly with rounds in SRRP scheme

However, as the rounds progress, the maximum size of the waiting queue decreases significantly, as shown in Fig. 9. For this figure, we again ran cuPPA using 1280 CUDA threads (20 blocks and 64 threads per block) to generate networks with \(d=512, 256, 128, 64\) and \(p=0.5\), with each CUDA thread processing exactly one vertex per round. Only the first 100 rounds are shown for brevity. From Fig. 9, we can see that as the rounds progress, the size of the waiting queue per round decreases dramatically for all values of d. This indicates that we could process more vertices in later rounds using the same amount of queue memory; therefore, we can dynamically change the size of the segments between two consecutive rounds to increase parallelism. Based on these observations, we designed an adaptive version of cuPPA-Pure that monitors the maximum size of the waiting queue and adjusts the segment size accordingly, as discussed in Sect. 3.3. We call this version cuPPA-Dynamic and use it for all remaining experiments.
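The adaptive policy of cuPPA-Dynamic can be sketched as a simple feedback rule on observed queue occupancy; the thresholds and scaling factors below are illustrative assumptions, not the tuned values of our implementation.

```python
def next_segment_size(seg_size: int, max_queue_used: int, queue_capacity: int,
                      grow: float = 2.0, shrink: float = 0.5,
                      high: float = 0.9, low: float = 0.25) -> int:
    """Pick the next round's segment size from the last round's peak queue usage."""
    load = max_queue_used / queue_capacity
    if load > high:
        return max(1, int(seg_size * shrink))  # near overflow: process fewer vertices
    if load < low:
        return int(seg_size * grow)            # plenty of headroom: more parallelism
    return seg_size                            # occupancy acceptable: keep the size

# As the queue empties in later rounds, the segment (and hence parallelism) grows.
assert next_segment_size(1024, 950, 1000) == 512
assert next_segment_size(1024, 100, 1000) == 2048
assert next_segment_size(1024, 500, 1000) == 1024
```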

5.6 Runtime Performance

In this section, we analyze the runtime and performance of cuPPA-Pure and cuPPA-Hash relative to other algorithms and show how their performance varies with various parameters.

5.6.1 Runtime Comparison with Existing Algorithms (Single GPU)

To the best of our knowledge, our algorithm is the first GPU-based parallel algorithm for generating preferential attachment networks; a runtime comparison with other GPU-based algorithms is therefore not possible. Instead, we compare with existing non-GPU algorithms. The total runtimes of both versions of cuPPA and the existing algorithms for generating two billion edges (\(n=500M\), \(d=4\)) are shown in Fig. 10. In this experiment, we used a single NVidia GeForce 1080 GPU with 8 GB of memory.

  • Sequential Algorithms: We compare cuPPA with two efficient sequential algorithms: SBA [9] and SCM [17]. For SCM, we used two implementations: one with pseudorandom number generators (called SCM-Pure) and the other with hash functions (called SCM-Hash). We also compared our algorithm with a reference sequential graph generation library from the Graph500 [1] reference code, which uses the stochastic Kronecker graph (SKG) model to generate networks.

    As shown in Fig. 10, SKG from Graph500 takes the longest time, 25.39 min, to generate the two-billion-edge scale-free network. In comparison, our GPU-based algorithm is \(650\,\times\) faster.

    We also found that SCM-Pure is slightly faster than the SBA algorithm. The hash-based SCM-Hash essentially recomputes all copy edges and therefore takes approximately 70% more time than SCM-Pure. However, the hashing technique has been shown to scale to a large number of processors, making it a viable candidate for large network generation on many processors [24]. In contrast, the GPU-based cuPPA-Pure generates the network in just 2.32 s on the NVidia 1080 GPU, a \(78\,\times\) to \(94\,\times\) speedup. Also note that cuPPA-Hash is slightly slower than cuPPA-Pure on a single GPU because it performs more computation.

  • Parallel Algorithms: We also compared cuPPA with two parallel algorithms: a distributed-memory algorithm (PPA-DM) [4] and a shared-memory algorithm (PPA-SM) [7]. As shown in Fig. 10, both cuPPA algorithms outperform PPA-DM running on a system with 24 processors. The main reason is that, unlike PPA-DM, the cuPPA algorithms do not require complex synchronization and message communication.

    Because the PPA-SM code is unavailable, we compared the runtime to generate the largest graph (\(n=10^7, d=10\)) reported in [7] with the corresponding runtime of cuPPA. PPA-SM generates the network using 16 cores of an Intel Xeon E5-2698 CPU at 2.30 GHz in approximately 7.5 s, whereas cuPPA-Pure generates the same network in just 0.3 s.

Fig. 10

Runtime of Graph500 generator, SBA, SCM, PPA-DM, and cuPPA for generating two billion edges (\(n=500M\), \(d=4\)). Both of our cuPPA algorithms can generate the network in less than 3 s

5.6.2 Runtime Versus Number of Vertices (Single GPU)

First, we examine the runtime performance of cuPPA-Pure (the faster of the two algorithms on a single GPU) with an increasing number of vertices n. We examine two cases. In the first case, we set \(d=4\), vary \(p\in \{0, 0.001, 0.25, 0.5, 0.75, 1\}\), and vary \(n\in \{1.9M, 3.9M, 7.8M, 15.6M, 31.25M, 62.5M, 125M, 250M, 500M\}\) to see how the runtime changes with an increasing number of vertices for different p. The corresponding runtimes are shown in Fig. 11. In the second case, we set \(p=0.5\), vary \(d\in \{1,2,4,8,16,32,64,128\}\), and vary \(n\in \{60K, 120K, 240K, 480K, 960K, 1.92M, 3.84M, 7.68M\}\) to see how the runtime changes with an increasing number of vertices for different d. The corresponding runtimes are shown in Fig. 12.

Fig. 11

Runtime versus number of edges suggests that cuPPA is very scalable with increasing n for different values of p with a fixed value of \(d=4\)

Fig. 12

Runtime versus number of vertices suggests that cuPPA is very scalable with increasing n for different values of d with a fixed value of \(p=0.5\)

From Figs. 11 and 12, we observe that for any fixed set of values of p and d, the runtime increases linearly with n, indicating that the algorithm scales very well with an increasing number of vertices.

Fig. 13

Runtime versus d for generating networks with \(n=7{,}812{,}500\) and varying \(d=1, 2, 4, 8, 16, 32, 64, 128\) for different values of p. The runtime increases almost linearly

5.6.3 Runtime Versus Degree of Preferential Attachment (Single GPU)

Next, we examine the runtime performance of cuPPA with increasing d. The runtime is shown in Fig. 13. Here, we set \(n=7{,}812{,}500\), vary \(p\in \{0, 0.00001, 0.001, 0.25, 0.5, 0.75, 1\}\), and vary \(d\in \{1, 2, 4, 8, 16, 32, 64, 128\}\) to see how the runtime changes with increasing d for different p. As seen from the figure, the runtime increases almost linearly with d; therefore, the algorithm scales well with increasing d. Note that such high values of d are atypical in practice; we included them for performance measurement purposes. Also notice that the runtime is largest for \(p=0\). With a small value of \(p=0.00001\), the runtime drops significantly and does not change much for higher values of p. Since typical values of p are much larger than 0, this observation suggests that cuPPA performs well in real-world scenarios.

5.6.4 Runtime Versus Probability of Copy Edge (Single GPU)

Next, we examine the runtime performance of cuPPA with increasing p. The runtime is shown in Fig. 14. Here, we used three different sets of values for n and d (\(\langle n=500M, d=4 \rangle\), \(\langle n=125M, d=16 \rangle\), and \(\langle n=31.25M, d=64 \rangle\)) and varied \(p\in \{0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1.00\}\). As seen from the figure, in all three cases the runtime drops dramatically as p increases slightly from 0 to 0.00001 and continues to fall up to \(p=0.1\). The runtime then decreases almost linearly up to \(p=0.9\) before dropping sharply toward \(p=1\). With lower values of p, most edges are copy edges; the size of the waiting queue therefore grows, increasing the runtime. As p approaches 1, most edges are direct edges, and fewer items are stored in the waiting queue.

Fig. 14

Runtime versus p for three sets of values for n and d (x-axis in log scale). The runtime is largest at \(p=0\) and drops significantly with even a slight increase in p; as p increases further, the runtime continues to decrease

5.6.5 Runtime Varied with the Number of Threads (Single GPU)

To understand how the performance of cuPPA depends on the number of threads, we set \(p=0.5\) and used four different sets of n and d to generate networks. We varied the number of CUDA threads per block over 64, 128, 256, 512, and 1024. The runtime (solid lines) and relative speedup (dashed lines) are shown in Fig. 15. Let \(t_{T}\) be the runtime of cuPPA using T threads per block. Then, the relative speedup is defined as \(\frac{t_{64}}{t_T}\), i.e., the speedup gained relative to the runtime of cuPPA using 64 threads. Figure 15 has two y-axes: the left and right axes correspond to runtime and relative speedup, respectively. The best performance is observed with 512 threads per block in all cases; therefore, in our final algorithm, we use up to 512 threads per block.

Fig. 15

Runtime (solid lines) and relative speedup (dashed lines) versus number of threads. Best performance is observed with 512 threads per block

5.7 Runtime Performance of cuPPA-Hash (Multiple GPUs)

Next, we evaluate the performance of cuPPA-Hash for multiple GPUs. In this experiment, we used a machine consisting of 4 NVidia Tesla P100 GPUs with 16 GB memory each.

5.7.1 Strong Scaling

To study the strong scaling of the algorithm, we generated a network of 4B edges using \(n=1B\) and \(d=4\), running on 1 to 4 GPUs. The strong scaling results are presented in Fig. 16. From the figure, we can clearly see that cuPPA-Hash achieves perfect linear speedup by virtue of being an embarrassingly parallel algorithm.

Fig. 16

Strong scaling of cuPPA-Hash for 4B edges (\(n=1B\) and \(d=4\)). cuPPA-Hash exhibits perfect linear speedup

5.7.2 Generating Large Networks

Using cuPPA-Hash with 4 GPUs, we are able to generate a network of 16B edges (\(n=2B\) and \(d=8\)) in just 7 s, a rate of about 2.29 billion edges per second, which is unprecedented in this domain.

6 Related Work

Although the concepts of random networks have been used and studied extensively over the last several decades, efficient algorithms to generate such networks were not available until recently. The first efficient sequential algorithm to generate Erdős–Rényi and Barabási–Albert networks was proposed in [9]. A distributed-memory algorithm to generate preferential attachment networks was proposed in [26]; however, this algorithm was approximate rather than exact and required manual adjustment of several control parameters. The first exact distributed-memory parallel algorithm, based on the copy model, was proposed in [4]. Another distributed-memory parallel algorithm, based on the Barabási–Albert model, was proposed in [22, 24]; instead of pseudorandom number generators, it uses hash functions to generate the networks. A shared-memory parallel algorithm based on the copy model was proposed in [7].

Several other theoretical studies have examined preferential attachment-based models. Machta and Machta [21] described how an evolving network can be generated in parallel. Dorogovtsev et al. [13] proposed a model that can generate graphs with fat-tailed degree distributions; starting from some random graph, edges are randomly rewired according to preferential choices. Other popular models also generate networks with power-law degree distributions: the R-MAT [10] and stochastic Kronecker graph (SKG) [19] models do so using matrix multiplication. Due to its simpler parallel implementation, the Graph500 group [1] chose the SKG model for their supercomputer benchmark. Highly scalable generators for Erdős–Rényi, 2D/3D random geometric, 2D/3D Delaunay, and hyperbolic random graphs are described in [15]; the corresponding software library release also includes an implementation of the algorithm described in [24]. An efficient and scalable algorithmic method to generate Chung–Lu, block two-level Erdős–Rényi (BTER), and stochastic blockmodel graphs was presented in [5].

GPU-based network generators are scarce in the literature. A GPU-based algorithm for generating Erdős–Rényi networks was presented in [23], and another GPU-based algorithm for generating networks using the small-world model [25] was presented in [18]. However, until recently no GPU-based algorithm existed for the preferential attachment model; we introduced cuPPA, based on the copy model, as the first such algorithm [3].

7 Conclusion

We have presented a novel GPU-based algorithm, named cuPPA, together with a detailed performance study; its combination of scale and speed enables the generation of networks with up to two billion edges in under 3 s of wall clock time. The algorithm is customizable with respect to the structure of the network by varying a single parameter, namely, a probability measure that captures the preference style of new edges in the preferential attachment model. A high degree of concurrency in the generator's workload per thread or processor is observed when that probability is a very small fraction greater than zero. In future work, we intend to exploit code profiling tools to further optimize runtime and memory usage on the GPU. The algorithm could also be extended to exploit multiple GPUs colocated within the same node; this would require periodic data synchronization across GPUs, which can be achieved efficiently using the NVidia Collective Communication Library (NCCL). Additional future work involves porting to GPUs spanning multiple nodes, as well as hybrid CPU-GPU scenarios that utilize unused cores of multi-core CPUs. Methods to incorporate other network generator models can also be explored with cuPPA as a starting point. Finally, future work is needed to convert our internal GPU-based graph representation to other popular network formats for usability.