Release Notes for xGPU
----------------------
Overview:
xGPU is a library for performing the cross-multiplication step of the
FX correlator algorithm, which is popular for radio astronomy signal
processing.
Precision:
By default xGPU accepts signed 8-bit integer input, which is then
converted to and computed in 32-bit floating point, with the final
result output in 32-bit floating point. However, on architectures
that support the dp4a instruction (sm_61, sm_70 and sm_72 at the time
of writing) a pure integer correlator is supported (8-bit integer
multiply, 32-bit integer accumulation). This can provide up to a 4x
speedup versus using a floating-point correlator. Note that if dp4a is
requested on an unsupported architecture, the computation will be
emulated, producing the correct answer but running much slower than
native dp4a.
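As a loose illustration of what the dp4a operation does (a sketch, not
xGPU's actual kernel), consider the following CUDA fragment; the kernel
name "dot8" is made up for this example, and the __dp4a intrinsic
requires compilation for sm_61 or later:

    // Each int packs four signed 8-bit samples; __dp4a multiplies the
    // four byte pairs and adds the products into a 32-bit accumulator.
    __global__ void dot8(const int *a, const int *b, int *out, int n)
    {
        int acc = 0;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            acc = __dp4a(a[i], b[i], acc);
        atomicAdd(out, acc);  // combine per-thread partial sums
    }

This is the 8-bit multiply / 32-bit accumulate pattern that lets the
integer correlator skip the conversion to floating point entirely.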
Software Compatibility:
The library has been tested under Linux (Ubuntu 16.04 and CentOS 7)
using release 9.2 of the CUDA toolkit. However, the expectation is
that all versions of CUDA should be compatible.
Hardware Compatibility:
For a list of supported devices, see:
https://developer.nvidia.com/cuda-gpus
xGPU has been tested and known to work on all GPUs from Fermi onwards.
While this library will run on pre-Fermi GPUs with appropriate changes
to the Makefile, the kernels include Fermi-specific optimizations and
so will likely deliver sub-standard performance on sm_1x CUDA
architectures. Note that as of CUDA 9.0, Fermi support has been removed
from the CUDA toolkit, with Kepler being the minimum supported
architecture.
Building the Library:
The library, the library query tool "xgpuinfo", and the sample program
"cuda_correlator" can be built by changing into the src subdirectory
and running "make".
$ cd src
$ make
The default target architecture is Kepler (sm_35), though this can be
overridden with the CUDA_ARCH variable. E.g.,
$ make CUDA_ARCH=sm_60
would build xGPU for Pascal GP100. The possible options are:

arch    example chips                 example products                dp4a
sm_20   GF100, GF110                  Tesla C2050, Tesla C2070
sm_21   GF114, GF116                  GTX 460
sm_30   GK104, GK106, GK107           Tesla K10, GTX 680
sm_35   GK110, GK180                  Tesla K20, Tesla K40
sm_50   GM107, GM108                  GTX 750
sm_52   GM200, GM204, GM206, GM207    Tesla M40, GTX 980
sm_53   GM20B                         Jetson TX1
sm_60   GP100                         Tesla P100
sm_61   GP102, GP104, GP106, GP107    Tesla P40, Tesla P4, GTX 1080   x
sm_62   GP10B                         Jetson TX2
sm_70   GV100                         Tesla V100, Titan V             x
sm_72   GV10B                         Xavier                          x
In general, you should target the architecture of the GPU you are
running on. While code compiled for an older architecture will
generally run on a more recent one, it may deliver sub-optimal
performance; for example, native dp4a instructions are only generated
for the architectures marked above.
Other options include:
CUDA_DIR     # path to CUDA toolkit installation (default /usr/local/cuda)
DEBUG        # set any specific compiler flags (default is -O3)
TEXTURE_DIM  # whether to use 1-d or 2-d textures (default is 1)
DP4A         # whether to use the dp4a instruction (default is no)
For the full list of environment variables that control compilation
and installation options, see the top of src/Makefile.
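As an illustrative sketch, a dp4a build for a GTX 1080 might combine
these variables as follows (the exact value accepted by the DP4A
variable is defined in src/Makefile; "yes" is assumed here):

$ make CUDA_ARCH=sm_61 DP4A=yes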
Currently, a number of sizing parameters must be specified when
building the library. Default values of these parameters are
specified near the top of src/xgpu_info.h. The default values can be
overridden on the make command line to suit your instrument's needs.
The options that can be given on the make command line are shown here
with their default values:
NPOL=2
NSTATION=256
NFREQUENCY=10
NTIME=1024
NTIME_PIPE=128
Note that NTIME_PIPE must be a multiple of 4 (16 when dp4a is used)
and NTIME must be a multiple of NTIME_PIPE; the preprocessor will
error out if either condition is not met. The defaults satisfy both:
NTIME_PIPE=128 is a multiple of 16, and NTIME=1024 is 8*128.
For example, to compile with NSTATION set to 128 and all other
parameters at their default values:
$ make NSTATION=128
Installing the Library:
The library can be installed by changing into the src subdirectory and
running "make install". By default, this will install xgpuinfo into
/usr/local/bin, xgpu.h into /usr/local/include, and libxgpu.so (or
libxgpu.dll on Cygwin) to /usr/local/lib. Specifying
"prefix=/some/path" on the "make install" command line will install
these files into /some/path/bin, /some/path/include, and
/some/path/lib instead.
$ cd src
$ make install # install under /usr/local
$ make install prefix=$HOME/local # install under $HOME/local
Using the Library:
The library can be called from C or C++ code. To use the library,
your source files need to #include <xgpu.h> and your executable needs
to be linked with libxgpu.so (or libxgpu.dll on Cygwin). On UNIX
systems, this usually means adding "-L/path/to/lib/dir" and "-lxgpu"
to the link command line.
Please see the comments in xgpu.h as well as the usage in the sample
program cuda_correlator.cu for more details on how to use the library.
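As a rough sketch of a host program, assuming the interface declared in
xgpu.h (the names XGPUInfo, XGPUContext, xgpuInfo, xgpuInit,
xgpuCudaXengine, xgpuFree, XGPU_OK and SYNCOP_DUMP reflect one reading
of that header; treat cuda_correlator.cu as the authoritative
reference):

    /* Minimal calling sketch; see cuda_correlator.cu for full usage. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xgpu.h>

    int main(void)
    {
        /* Query the sizing parameters the library was compiled with. */
        XGPUInfo info;
        xgpuInfo(&info);
        printf("nstation=%u nfrequency=%u ntime=%u\n",
               info.nstation, info.nfrequency, info.ntime);

        /* NULL host pointers ask xgpuInit to allocate pinned buffers. */
        XGPUContext ctx;
        ctx.array_h  = NULL;
        ctx.matrix_h = NULL;
        if (xgpuInit(&ctx, 0) != XGPU_OK) {
            fprintf(stderr, "xgpuInit failed\n");
            return EXIT_FAILURE;
        }

        /* ... fill ctx.array_h with input samples here ... */
        xgpuCudaXengine(&ctx, SYNCOP_DUMP);  /* correlate and dump */
        /* ... accumulated visibilities are now in ctx.matrix_h ... */

        xgpuFree(&ctx);
        return EXIT_SUCCESS;
    }

Against an installation under /usr/local, this would build with
something like:

$ gcc -I/usr/local/include -o myapp myapp.c -L/usr/local/lib -lxgpu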
This library has been designed to be interfaced with other parts of an
FX correlator pipeline, and so not much can be achieved in isolation.
A simple test program, "cuda_correlator.cu", is included; it performs
cross-multiplication on the host and the device and verifies that the
device obtained the correct answer. The many options regarding the
number of stations, frequency channels, etc. are set at the top of
this file.
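Assuming "make" leaves the test binary alongside the sources, a run is
simply:

$ cd src
$ ./cuda_correlator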
Benchmarking Performance:
xGPU includes an additional benchmarking utility: CUBE - CUDA
BEnchmarking. This uses C-preprocessor directives to obtain arithmetic
throughput and device memory bandwidth performance. To invoke a
benchmarking run, simply execute the "bench" script. This will perform
four runs of the test: the first two are concerned with counting all
flops and transfers performed by the kernels and measuring the time
taken for each of these steps; the latter two are concerned with
measuring the asynchronous performance of the device<->host transfers.
By default the results are printed to stdout, and they are also written
to file (cube_benchmark.log and cube_benchmark.csv).
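Assuming the "bench" script sits in the src directory alongside the
Makefile:

$ cd src
$ ./bench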
Acknowledging xGPU:
If you find this code useful in your work, please cite:
M. A. Clark, P. C. La Plante, and L. J. Greenhill, "Accelerating Radio
Astronomy Cross-Correlation with Graphics Processing Units",
[arXiv:1107.4264 [astro-ph]].
Authors:
Kate Clark (NVIDIA)
Paul La Plante (Loyola University Maryland)
Lincoln Greenhill (Harvard-Smithsonian Center for Astrophysics)
David MacMahon (University of California, Berkeley)
Ben Barsdell (NVIDIA)