Rust CUDA 0.2
This release marks the start of fixing many of the fundamental issues in the codegen, as well as implementing some of the most needed features for writing performant kernels.
This release mostly covers quality of life changes, bug fixes, and some performance improvements.
Nightly
The required nightly has been updated to 12/4/21. This fixes rust-analyzer sometimes not working.
PTX Backend
DCE (Dead Code Elimination)
DCE has been implemented. We switched to an alternative way of linking dependencies together, which drastically reduces the amount of work libnvvm has to do and removes any globals or functions not directly or indirectly used by kernels. This reduced the PTX size of the path tracer example from about 20kloc to 2.3kloc.
Address Spaces
CUDA address spaces have been mostly implemented. Any user-defined static that does not rely on interior mutability will be placed in the constant address space (__constant__); otherwise it will be placed in the generic address space (which is global for globals). This also allowed us to implement basic static shared memory support.
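As a rough illustration of the placement rule above, the following sketch is plain Rust that also runs on the host. The names WEIGHTS, HIT_COUNT, and weighted_sum are hypothetical and not part of any Rust CUDA crate. The address space mentioned in each comment is what the rule above would imply when the code is built with the codegen, and is an assumption rather than something this snippet verifies.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// No interior mutability: under the rule described above, a static like this
// would be a candidate for the constant address space (assumption).
static WEIGHTS: [f32; 4] = [0.1, 0.2, 0.3, 0.4];

// Atomics rely on interior mutability, so a static like this would instead
// stay in the generic address space (assumption).
static HIT_COUNT: AtomicU32 = AtomicU32::new(0);

// Hypothetical helper: reads the read-only table and bumps the counter.
fn weighted_sum(values: &[f32; 4]) -> f32 {
    HIT_COUNT.fetch_add(1, Ordering::Relaxed);
    values.iter().zip(WEIGHTS.iter()).map(|(v, w)| v * w).sum()
}

fn main() {
    let total = weighted_sum(&[1.0, 1.0, 1.0, 1.0]);
    println!("sum = {total}, calls = {}", HIT_COUNT.load(Ordering::Relaxed));
}
```

The distinction that matters is whether the static's type allows mutation through a shared reference: a plain array does not, while an atomic (or anything containing an UnsafeCell) does.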
Libm override
The codegen automatically overrides calls to libm with calls to libdevice. This is to allow existing no_std crates to take advantage of architecture-optimized math intrinsics. This can be disabled from cuda_builder if you need strict determinism. This also reduces PTX size a good amount in math-heavy kernels (3.8kloc to 2.3kloc in our path tracer). It also reduces register usage by a little bit, which can yield performance gains.
cuda_std
- Added address space query and conversion functions in cuda_std::ptr.
- Added #[externally_visible] for making sure the codegen does not eliminate a function if not used by a kernel.
- Added #[address_space(...)] for making the codegen put a static in a specific address space, mostly internal and unsafe.
- Added basic static shared memory support with cuda_std::shared_array!.
Cust
Cust 0.2 was actually released some time ago, but these were the changes in 0.2 and 0.2.1:
- Added Device::as_raw.
- Added MemoryAdvise for unified memory advising.
- Added MemoryAdvise::prefetch_host and MemoryAdvise::prefetch_device for telling CUDA to explicitly fetch unified memory somewhere.
- Added MemoryAdvise::advise_read_mostly.
- Added MemoryAdvise::preferred_location and MemoryAdvise::unset_preferred_location.
Note that advising APIs are only present on high end GPUs such as V100s.
- StreamFlags::NON_BLOCKING has been temporarily disabled because of soundness concerns.
- Changed GpuBox::as_device_ptr and GpuBuffer::as_device_ptr to take &self instead of &mut self.
- Renamed DBuffer -> DeviceBuffer. This is how it was in rustacuda; it was changed at some point, but we have reconsidered and now think that change may have been the wrong choice.
- Renamed DBox -> DeviceBox.
- Renamed DSlice -> DeviceSlice.
- Removed GpuBox::as_device_ptr_mut and GpuBuffer::as_device_ptr_mut.
- Removed the accidentally added vek default feature.
- The vek feature now uses default-features = false, which also means Rgb and Rgba no longer implement DeviceCopy.