Release Notes for xGPU
----------------------
Overview:
xGPU is a library for performing the cross-multiplication step of the
FX correlator algorithm, which is popular for radio astronomy signal
processing.
Precision:
By default xGPU accepts signed 8-bit integer input, which is then
converted to and computed in 32-bit floating point, with the final
result output in 32-bit floating point. However, on architectures
that support the dp4a instruction (sm_61, sm_70 and sm_72 at the time
of writing) a pure integer correlator is supported (8-bit integer
multiply, 32-bit integer accumulation). This can provide up to a 4x
speedup versus using a floating-point correlator. Note that if dp4a is
requested on an unsupported architecture, the computation will be
emulated, producing the correct answer but running much slower than
native dp4a.
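As a loose illustration of what the dp4a operation does (a sketch, not
xGPU's actual kernel), consider the following CUDA fragment; the kernel
name "dot8" is made up for this example, and the __dp4a intrinsic
requires compilation for sm_61 or later:

    // Each int packs four signed 8-bit samples; __dp4a multiplies the
    // four byte pairs and adds the products into a 32-bit accumulator.
    __global__ void dot8(const int *a, const int *b, int *out, int n)
    {
        int acc = 0;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            acc = __dp4a(a[i], b[i], acc);
        atomicAdd(out, acc);  // combine per-thread partial sums
    }

This is the 8-bit multiply / 32-bit accumulate pattern that lets the
integer correlator skip the conversion to floating point entirely.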
Software Compatibility:
The library has been tested under Linux (Ubuntu 16.04 and CentOS 7)
using release 9.2 of the CUDA toolkit. However, the expectation is
that all versions of CUDA should be compatible.
Hardware Compatibility:
For a list of supported devices, see:
https://developer.nvidia.com/cuda-gpus
xGPU has been tested and known to work on all GPUs from Fermi onwards.
While this library will run on pre-Fermi GPUs with appropriate changes
to the Makefile, the kernels include Fermi-specific optimizations and
so will likely deliver sub-standard performance on sm_1x CUDA
architectures. Note that as of CUDA 9.0, Fermi support has been removed
from the CUDA toolkit, with Kepler being the minimum supported
architecture.
Building the Library:
The library, the library query tool "xgpuinfo", and the sample program
"cuda_correlator" can be built by changing into the src subdirectory
and running "make".
$ cd src
$ make
The default target architecture is Kepler (sm_35), though this can be
overridden with the CUDA_ARCH variable. E.g.,
$ make CUDA_ARCH=sm_60
would build xGPU for Pascal GP100. The possible options are:

arch    example chips                 example products                dp4a
sm_20   GF100, GF110                  Tesla C2050, Tesla C2070
sm_21   GF114, GF116                  GTX 460
sm_30   GK104, GK106, GK107           Tesla K10, GTX 680
sm_35   GK110, GK180                  Tesla K20, Tesla K40
sm_50   GM107, GM108                  GTX 750
sm_52   GM200, GM204, GM206, GM207    Tesla M40, GTX 980
sm_53   GM20B                         Jetson TX1
sm_60   GP100                         Tesla P100
sm_61   GP102, GP104, GP106, GP107    Tesla P40, Tesla P4, GTX 1080   x
sm_62   GP10B                         Jetson TX2
sm_70   GV100                         Tesla V100, Titan V             x
sm_72   GV10B                         Xavier                          x
In general, you should target the architecture of the GPU you are
running on. While code compiled for an older architecture will
generally run on a more recent one, it may deliver sub-optimal
performance; for example, native dp4a instructions are only generated
for the architectures marked above.
Other options include:
CUDA_DIR     # path to CUDA toolkit installation (default /usr/local/cuda)
DEBUG        # set any specific compiler flags (default is -O3)
TEXTURE_DIM  # whether to use 1-d or 2-d textures (default is 1)
DP4A         # whether to use the dp4a instruction (default is no)
For the full list of environment variables that control compilation
and installation options, see the top of src/Makefile.
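As an illustrative sketch, a dp4a build for a GTX 1080 might combine
these variables as follows (the exact value accepted by the DP4A
variable is defined in src/Makefile; "yes" is assumed here):

$ make CUDA_ARCH=sm_61 DP4A=yes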
Currently, a number of sizing parameters must be specified when
building the library. Default values of these parameters are
specified near the top of src/xgpu_info.h. The default values can be
overridden on the make command line to suit your instrument's needs.
The options that can be given on the make command line are shown here
with their default values:
NPOL=2
NSTATION=256
NFREQUENCY=10
NTIME=1024
NTIME_PIPE=128
Note that NTIME_PIPE must be a multiple of 4 (16 when dp4a is used)
and NTIME must be a multiple of NTIME_PIPE; the preprocessor will
error out if either condition is not met. The defaults satisfy both:
NTIME_PIPE=128 is a multiple of 16, and NTIME=1024 is 8*128.
For example, to compile with NSTATION set to 128 and all other
parameters at their default values:
$ make NSTATION=128
Installing the Library:
The library can be installed by changing into the src subdirectory and
running "make install". By default, this will install xgpuinfo into
/usr/local/bin, xgpu.h into /usr/local/include, and libxgpu.so (or
libxgpu.dll on Cygwin) to /usr/local/lib. Specifying
"prefix=/some/path" on the "make install" command line will install
these files into /some/path/bin, /some/path/include, and
/some/path/lib instead.
$ cd src
$ make install # install under /usr/local
$ make install prefix=$HOME/local # install under $HOME/local
Using the Library:
The library can be called from C or C++ code. To use the library,
your source files need to #include <xgpu.h> and your executable needs
to be linked with libxgpu.so (or libxgpu.dll on Cygwin). On UNIX
systems, this usually means adding "-L/path/to/lib/dir" and "-lxgpu"
to the link command line.
Please see the comments in xgpu.h as well as the usage in the sample
program cuda_correlator.cu for more details on how to use the library.
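As a rough sketch of a host program, assuming the interface declared in
xgpu.h (the names XGPUInfo, XGPUContext, xgpuInfo, xgpuInit,
xgpuCudaXengine, xgpuFree, XGPU_OK and SYNCOP_DUMP reflect one reading
of that header; treat cuda_correlator.cu as the authoritative
reference):

    /* Minimal calling sketch; see cuda_correlator.cu for full usage. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xgpu.h>

    int main(void)
    {
        /* Query the sizing parameters the library was compiled with. */
        XGPUInfo info;
        xgpuInfo(&info);
        printf("nstation=%u nfrequency=%u ntime=%u\n",
               info.nstation, info.nfrequency, info.ntime);

        /* NULL host pointers ask xgpuInit to allocate pinned buffers. */
        XGPUContext ctx;
        ctx.array_h  = NULL;
        ctx.matrix_h = NULL;
        if (xgpuInit(&ctx, 0) != XGPU_OK) {
            fprintf(stderr, "xgpuInit failed\n");
            return EXIT_FAILURE;
        }

        /* ... fill ctx.array_h with input samples here ... */
        xgpuCudaXengine(&ctx, SYNCOP_DUMP);  /* correlate and dump */
        /* ... accumulated visibilities are now in ctx.matrix_h ... */

        xgpuFree(&ctx);
        return EXIT_SUCCESS;
    }

Against an installation under /usr/local, this would build with
something like:

$ gcc -I/usr/local/include -o myapp myapp.c -L/usr/local/lib -lxgpu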
This library has been designed to be interfaced with other parts of an
FX correlator pipeline, and so not much can be achieved in isolation.
A simple test program, "cuda_correlator.cu", is included; it performs
cross-multiplication on the host and the device and verifies that the
device obtained the correct answer. The many options regarding the
number of stations, frequency channels, etc. are set at the top of
this file.
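Assuming "make" leaves the test binary alongside the sources, a run is
simply:

$ cd src
$ ./cuda_correlator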
Benchmarking Performance:
xGPU includes an additional benchmarking utility: CUBE - CUDA
BEnchmarking. This uses C-preprocessor directives to obtain arithmetic
throughput and device memory bandwidth performance. To invoke a
benchmarking run, simply execute the "bench" script. This will perform
four runs of the test: the first two are concerned with counting all
flops and transfers performed by the kernels and measuring the time
taken for each of these steps; the latter two are concerned with
measuring the asynchronous performance of the device<->host transfers.
By default the results are printed to stdout, and they are also written
to file (cube_benchmark.log and cube_benchmark.csv).
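Assuming the "bench" script sits in the src directory alongside the
Makefile:

$ cd src
$ ./bench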
Acknowledging xGPU:
If you find this code useful in your work, please cite:
M. A. Clark, P. C. La Plante, and L. J. Greenhill, "Accelerating Radio
Astronomy Cross-Correlation with Graphics Processing Units",
[arXiv:1107.4264 [astro-ph]].
Authors:
Kate Clark (NVIDIA)
Paul La Plante (Loyola University Maryland)
Lincoln Greenhill (Harvard-Smithsonian Center for Astrophysics)
David MacMahon (University of California, Berkeley)
Ben Barsdell (NVIDIA)