Design of a CUDA-based PageRank algorithm for link analysis.
All seventeen graphs used in the experiments below are
stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. These include: web-Stanford, web-BerkStan,
web-Google, web-NotreDame, soc-Slashdot0811, soc-Slashdot0902,
soc-Epinions1, coAuthorsDBLP, coAuthorsCiteseer, soc-LiveJournal1,
coPapersCiteseer, coPapersDBLP, indochina-2004, italy_osm,
great-britain_osm, germany_osm, asia_osm. The experiments are implemented
in C++, and compiled using GCC 9 with optimization level 3 (-O3).
The number of iterations taken by each test case is measured, with a maximum
of 500 iterations allowed. Statistics for each test case are printed to
standard output (stdout) and redirected to a log file, which is then processed
with a script to generate a CSV file, with each row representing the details
of a single test case.
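For reference, below is a minimal sketch of the pull-style PageRank iteration these experiments build on. The damping factor of 0.85 and the L1-norm tolerance of 1e-6 are assumed values (only the 500-iteration cap comes from the setup above), and dangling vertices are assumed to have been handled separately.

```cuda
#include <vector>
#include <cmath>
#include <utility>
using std::vector;

// offsets/targets: CSR of in-edges (transpose graph); outDegree: out-degree of
// each vertex. Returns the number of iterations taken (capped at maxIter).
// Dangling vertices (out-degree 0) are assumed to have been handled already.
int pagerankSeq(vector<double>& rank, const vector<int>& offsets,
                const vector<int>& targets, const vector<int>& outDegree,
                double damping = 0.85, double tol = 1e-6, int maxIter = 500) {
  int N = rank.size(), it = 0;
  vector<double> prev(N);
  while (it < maxIter) {
    ++it;
    std::swap(prev, rank);
    double err = 0;
    for (int v = 0; v < N; ++v) {
      double sum = 0;
      for (int i = offsets[v]; i < offsets[v+1]; ++i) {
        int u = targets[i];                 // in-neighbor of v
        sum += prev[u] / outDegree[u];      // contribution of u
      }
      rank[v] = (1 - damping) / N + damping * sum;
      err += std::fabs(rank[v] - prev[v]);  // L1-norm of rank change
    }
    if (err < tol) break;                   // converged
  }
  return it;
}
```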
This experiment (block-adjust-launch) was for finding a suitable launch
config for CUDA block-per-vertex. For the launch config, the block-size
(threads) was adjusted from 32 to 1024, and the grid-limit (max grid-size)
was adjusted from 1024 to 32768. Each config was run 5 times per graph to
get a good time measure.
MAXx64 appears to be a good config for most graphs. Here MAX is the
grid-limit, and 64 is the block-size. This launch config is for the entire
graph, and could be slightly different for a subset of graphs. Also note that
this applies to a Tesla V100 PCIe 16GB, and could be different for other GPUs.
In order to measure error, nvGraph PageRank is taken as a reference.
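Below is a sketch of what a block-per-vertex kernel with the MAXx64 launch could look like. The CSR layout (transpose graph), the precomputed per-source contribution c[u] = damping * rank[u] / out-degree[u], the common term c0, the kernel name, and the interpretation of MAX as the largest grid-limit tested (32768) are all illustrative assumptions, not the exact code used here.

```cuda
#include <cuda_runtime.h>

template <int BLOCK>
__global__ void pagerankBlockKernel(double *a, const double *c,
                                    const int *offsets, const int *targets,
                                    int N, double c0) {
  __shared__ double cache[BLOCK];
  // one block per vertex; grid-stride over vertices
  for (int v = blockIdx.x; v < N; v += gridDim.x) {
    double sum = 0;
    for (int i = offsets[v] + threadIdx.x; i < offsets[v+1]; i += BLOCK)
      sum += c[targets[i]];                   // threads cooperate on v's in-edges
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int s = BLOCK/2; s > 0; s /= 2) {    // block-wide reduction
      if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
      __syncthreads();
    }
    if (threadIdx.x == 0) a[v] = c0 + cache[0];
    __syncthreads();
  }
}

// Example launch with block-size 64 and grid-limit "MAX" (taken as 32768):
//   int B = 64, G = std::min(N, 32768);
//   pagerankBlockKernel<64><<<G, B>>>(a, c, offsets, targets, N, c0);
```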
This experiment (thread-adjust-launch) was for finding a suitable launch
config for CUDA thread-per-vertex. For the launch config, the block-size
(threads) was adjusted from 32 to 1024, and the grid-limit (max grid-size)
was adjusted from 1024 to 32768. Each config was run 5 times per graph to
get a good time measure.
On average, the launch config does not seem to have a significant impact on
performance. However, 8192x128 appears to be a good config. Here 8192 is the
grid-limit, and 128 is the block-size. Comparing with [graph properties], it
seems it would be better to use 8192x512 for graphs with high average density,
and 8192x32 for graphs with high average degree. Sorting the vertices by
degree might also help (due to reduced warp divergence). Note that this
applies to a Tesla V100 PCIe 16GB, and would be different for other GPUs. In
order to measure error, nvGraph PageRank is taken as a reference.
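For contrast, below is a sketch of a thread-per-vertex kernel with the 8192x128 launch. As before, the CSR transpose layout, the precomputed contributions c[], the term c0, and the kernel name are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

__global__ void pagerankThreadKernel(double *a, const double *c,
                                     const int *offsets, const int *targets,
                                     int N, double c0) {
  // one thread per vertex; grid-stride over vertices
  for (int v = blockIdx.x*blockDim.x + threadIdx.x; v < N;
       v += gridDim.x*blockDim.x) {
    double sum = 0;
    for (int i = offsets[v]; i < offsets[v+1]; ++i)
      sum += c[targets[i]];                  // each thread scans v's in-edges alone
    a[v] = c0 + sum;
  }
}

// Example launch with block-size 128 and grid-limit 8192:
//   int B = 128, G = std::min((N + B - 1) / B, 8192);
//   pagerankThreadKernel<<<G, B>>>(a, c, offsets, targets, N, c0);
```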
This experiment (block-sort-by-indegree) was for finding the effect of sorting
vertices and/or edges by in-degree for CUDA block-per-vertex based PageRank.
For this experiment, sorting of vertices and/or edges was either NO, ASC, or
DESC. This gives a total of 3 * 3 = 9 cases. Each case is run on multiple
graphs, running each 5 times per graph for a good time measure.
Results show that sorting is not faster in most cases. In fact, in a number of cases, sorting actually slows down performance. This may be because a sorted arrangement tends to flood certain memory chunks with too many requests. In order to measure error, nvGraph PageRank is taken as a reference.
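Below is a sketch of how the vertex order could be obtained for the ASC/DESC cases. The function name is hypothetical, and the remapping of the CSR arrays (and, when edges are also sorted, of each adjacency list) is only outlined in the trailing comment.

```cuda
#include <vector>
#include <numeric>
#include <algorithm>

// Returns a permutation of vertex ids ordered by in-degree.
std::vector<int> sortVerticesByInDegree(const std::vector<int>& inDegree,
                                        bool ascending) {
  int N = inDegree.size();
  std::vector<int> order(N);
  std::iota(order.begin(), order.end(), 0);      // 0, 1, ..., N-1
  std::sort(order.begin(), order.end(), [&](int u, int v) {
    return ascending ? inDegree[u] < inDegree[v]
                     : inDegree[u] > inDegree[v];
  });
  // order[i] is the original id of the vertex placed at position i; the CSR
  // offsets/targets are then rebuilt with this permutation (and each adjacency
  // list re-sorted, if edges are sorted too) before copying to the GPU.
  return order;
}
```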
This experiment (thread-sort-by-indegree) was for finding the effect of
sorting vertices and/or edges by in-degree for CUDA thread-per-vertex based
PageRank. For this experiment, sorting of vertices and/or edges was either NO,
ASC, or DESC. This gives a total of 3 * 3 = 9 cases. Each case is run on
multiple graphs, running each 5 times per graph for a good time measure.
Results show that sorting is slower in most cases. This may again be because a sorted arrangement tends to flood certain memory chunks with too many requests. In order to measure error, nvGraph PageRank is taken as a reference.
This experiment (switched-sort-by-indegree) was for finding the effect of
sorting vertices and/or edges by in-degree for CUDA switched-per-vertex based
PageRank. For this experiment, sorting of vertices and/or edges was either NO,
ASC, or DESC. This gives a total of 3 * 3 = 9 cases. NO here means that
vertices are partitioned by in-degree (edges remain unchanged). Each case is
run on multiple graphs, running each 5 times per graph for a good time measure.
Results show that sorting is not faster in most cases. It is better to simply partition vertices by in-degree. In order to measure error, nvGraph PageRank is taken as a reference.
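Below is a sketch of such a partition step, assuming vertices with in-degree below a threshold (here named switchDegree, which a later experiment tunes) go to the thread-per-vertex kernel and the rest to the block-per-vertex kernel; the function and parameter names are illustrative.

```cuda
#include <vector>
#include <numeric>
#include <algorithm>
#include <utility>

// Returns the reordered vertex ids and the index where the high in-degree
// (block-per-vertex) range begins.
std::pair<std::vector<int>, int> partitionByInDegree(
    const std::vector<int>& inDegree, int switchDegree) {
  int N = inDegree.size();
  std::vector<int> order(N);
  std::iota(order.begin(), order.end(), 0);
  auto mid = std::stable_partition(order.begin(), order.end(),
      [&](int v) { return inDegree[v] < switchDegree; });  // low-degree first
  return {order, int(mid - order.begin())};
}
```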
This experiment (switched-adjust-block-launch) was for finding a suitable
launch config for CUDA switched-per-vertex, for the block approach. For the
launch config, the block-size (threads) was adjusted from 32 to 1024, and the
grid-limit (max grid-size) was adjusted from 1024 to 32768. Each config was
run 5 times per graph to get a good time measure.
MAXx256 appears to be a good config for most graphs. Here MAX is the
grid-limit, and 256 is the block-size. Note that this applies to a Tesla V100
PCIe 16GB, and would be different for other GPUs. In order to measure error,
nvGraph PageRank is taken as a reference.
This experiment (switched-adjust-thread-launch) was for finding a suitable
launch config for CUDA switched-per-vertex, for the thread approach. For the
launch config, the block-size (threads) was adjusted from 32 to 1024, and the
grid-limit (max grid-size) was adjusted from 1024 to 32768. Each config was
run 5 times per graph to get a good time measure.
MAXx512 appears to be a good config for most graphs. Here MAX is the
grid-limit, and 512 is the block-size. Note that this applies to a Tesla V100
PCIe 16GB, and would be different for other GPUs. In order to measure error,
nvGraph PageRank is taken as a reference.
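Putting the two results together, below is a sketch of how the switched-per-vertex launch could combine the MAXx256 (block) and MAXx512 (thread) configs over the partitioned vertex ranges. It reuses the hypothetical kernels sketched earlier, and the GRID_LIMIT value standing in for MAX is an assumption.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// kernels as sketched in the earlier block-per-vertex / thread-per-vertex examples
template <int BLOCK>
__global__ void pagerankBlockKernel(double *a, const double *c, const int *offsets,
                                    const int *targets, int N, double c0);
__global__ void pagerankThreadKernel(double *a, const double *c, const int *offsets,
                                     const int *targets, int N, double c0);

void pagerankSwitchedLaunch(double *a, const double *c, const int *offsets,
                            const int *targets, int N, int switchPoint, double c0) {
  const int GRID_LIMIT = 32768;        // "MAX" grid-limit, taken as the largest tested
  int nThread = switchPoint;           // vertices [0, switchPoint): low in-degree
  int nBlock  = N - switchPoint;       // vertices [switchPoint, N): high in-degree
  if (nThread > 0) {                   // thread-per-vertex part: MAXx512
    int B = 512, G = std::min((nThread + B - 1) / B, GRID_LIMIT);
    pagerankThreadKernel<<<G, B>>>(a, c, offsets, targets, nThread, c0);
  }
  if (nBlock > 0) {                    // block-per-vertex part: MAXx256
    int B = 256, G = std::min(nBlock, GRID_LIMIT);
    pagerankBlockKernel<256><<<G, B>>>(a + switchPoint, c, offsets + switchPoint,
                                       targets, nBlock, c0);
  }
}
```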
For this experiment (switched-adjust-switch-point), switch_degree was varied
from 2 to 1024, and switch_limit was varied from 1 to 1024. switch_degree
defines the in-degree at which the PageRank kernel switches from the
thread-per-vertex approach to the block-per-vertex approach. switch_limit
defines the minimum size of a vertex group for the thread-per-vertex /
block-per-vertex approach (if a group is too small, it is merged with the
group handled by the other approach). Each case is run on multiple graphs,
running each 5 times per graph for a good time measure. It seems a
switch_degree of 64 and a switch_limit of 32 would be a good choice.
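Below is a sketch of how the switch point could be derived from these two parameters, assuming vertices have already been partitioned by in-degree as in the earlier experiments; the function and parameter names are illustrative.

```cuda
#include <vector>
#include <algorithm>

// partitionedInDegree: in-degree of the vertex at each position, after
// partitioning vertices so that those with in-degree < switchDegree come first.
int chooseSwitchPoint(const std::vector<int>& partitionedInDegree,
                      int switchDegree = 64, int switchLimit = 32) {
  int N = partitionedInDegree.size();
  // first position whose in-degree reaches switchDegree
  int sp = int(std::partition_point(partitionedInDegree.begin(),
                                    partitionedInDegree.end(),
               [&](int d) { return d < switchDegree; })
               - partitionedInDegree.begin());
  if (sp < switchLimit)     sp = 0;  // too few thread-per-vertex vertices: merge into block side
  if (N - sp < switchLimit) sp = N;  // too few block-per-vertex vertices: merge into thread side
  return sp;  // [0, sp) -> thread-per-vertex, [sp, N) -> block-per-vertex
}
```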
nvGraph PageRank appears to use L2-norm per-iteration scaling. This is
(probably) required for finding a solution to the eigenvalue problem. However,
as the eigenvalue for PageRank is 1, this is not necessary. This experiment
was for observing if this was indeed true, and that any such per-iteration
scaling does not affect the number of iterations needed to converge.
In this experiment (adjust-iteration-scaling), PageRank was computed with an L1, L2, or L∞-norm, and the effect of L1 or L2-norm scaling of ranks was compared with the baseline (L0). Results match the above assumptions, and indeed no performance benefit is observed (except a reduction of a single iteration for the soc-Slashdot0811, soc-Slashdot0902, soc-LiveJournal1, and italy_osm graphs).
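For clarity, below is a sketch of the three error measures and the optional per-iteration scaling being compared; function names are illustrative.

```cuda
#include <vector>
#include <cmath>
#include <algorithm>

double errorL1(const std::vector<double>& x, const std::vector<double>& y) {
  double e = 0;
  for (size_t i = 0; i < x.size(); ++i) e += std::fabs(x[i] - y[i]);
  return e;
}
double errorL2(const std::vector<double>& x, const std::vector<double>& y) {
  double e = 0;
  for (size_t i = 0; i < x.size(); ++i) e += (x[i] - y[i]) * (x[i] - y[i]);
  return std::sqrt(e);
}
double errorLinf(const std::vector<double>& x, const std::vector<double>& y) {
  double e = 0;
  for (size_t i = 0; i < x.size(); ++i) e = std::max(e, std::fabs(x[i] - y[i]));
  return e;
}

// Optional per-iteration scaling: divide ranks by their L1 or L2 norm. Since
// the PageRank eigenvalue is 1, the rank vector's L1 norm stays (close to) 1
// anyway, so this is not expected to change the result or the iteration count.
void scaleByNorm(std::vector<double>& rank, double norm) {
  for (auto& r : rank) r /= norm;
}
```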
This experiment (compare-nvgraph) was for comparing the performance of finding PageRank using nvGraph, using CUDA, and using a single thread (sequential). Each technique was attempted on different types of graphs, running each technique 5 times per graph to get a good time measure. CUDA here is the switched-per-vertex approach running on the GPU. CUDA-based PageRank is indeed much faster than the sequential (CPU) approach. In order to measure error, nvGraph PageRank is taken as a reference.
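Below is a sketch of the kind of timing harness assumed for this comparison: each technique is run 5 times per graph and the average runtime reported. runPagerank stands in for a call to nvGraph, the CUDA switched-per-vertex implementation, or the sequential one, and is expected to include any device synchronization and copy-back.

```cuda
#include <chrono>

// Runs one PageRank technique `repeat` times and returns the average time in ms.
template <class F>
float measureAverageMs(F runPagerank, int repeat = 5) {
  using namespace std::chrono;
  float total = 0;
  for (int i = 0; i < repeat; ++i) {
    auto t0 = high_resolution_clock::now();
    runPagerank();                     // one full PageRank computation
    auto t1 = high_resolution_clock::now();
    total += duration<float, std::milli>(t1 - t0).count();
  }
  return total / repeat;
}
```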
- PageRank Algorithm, Mining Massive Datasets (CS246), Stanford University
- CUDA by Example, Jason Sanders and Edward Kandrot
- Managed memory vs cudaHostAlloc - TK1
- SuiteSparse Matrix Collection