Rust CUDA 0.2

RDambrosio016 released this 05 Dec 17:57

This release marks the start of fixing many of the fundamental issues in the codegen, as well as implementing some of the most needed features for writing performant kernels.

This release mostly covers quality of life changes, bug fixes, and some performance improvements.

Nightly

The required nightly has been updated to 12/4/21. This fixes rust-analyzer sometimes not working.

PTX Backend

DCE (Dead Code Elimination)

DCE has been implemented. We switched to an alternative way of linking dependencies together, which drastically reduces the amount of work libnvvm has to do and removes any globals or functions not directly or indirectly used by kernels. This reduced the PTX size of the path tracer example from about 20 kloc to 2.3 kloc.
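The idea of keeping only what kernels use "directly or indirectly" can be pictured as a reachability walk over a call graph. The following is a minimal host-side sketch of that general technique, not the codegen's actual implementation; the function names in the graph (`render_kernel`, `trace_ray`, `unused_helper`, and so on) are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Returns every function reachable from `roots` in `call_graph`.
/// Anything outside the returned set is dead and could be dropped.
fn reachable<'a>(
    call_graph: &HashMap<&'a str, Vec<&'a str>>,
    roots: &[&'a str],
) -> HashSet<&'a str> {
    let mut live: HashSet<&'a str> = HashSet::new();
    let mut worklist: Vec<&'a str> = roots.to_vec();
    while let Some(func) = worklist.pop() {
        // `insert` returns false if the function was already visited.
        if live.insert(func) {
            if let Some(callees) = call_graph.get(func) {
                worklist.extend(callees.iter().copied());
            }
        }
    }
    live
}

fn main() {
    // Hypothetical call graph: caller -> callees.
    let mut call_graph = HashMap::new();
    call_graph.insert("render_kernel", vec!["trace_ray", "shade"]);
    call_graph.insert("trace_ray", vec!["intersect"]);
    call_graph.insert("shade", vec![]);
    call_graph.insert("intersect", vec![]);
    call_graph.insert("unused_helper", vec!["intersect"]);

    // Kernels act as the roots of the walk.
    let live = reachable(&call_graph, &["render_kernel"]);
    let mut dead: Vec<_> = call_graph.keys().filter(|f| !live.contains(*f)).collect();
    dead.sort();
    println!("live: {}, dead: {:?}", live.len(), dead);
}
```

In this sketch `unused_helper` is never reached from the kernel, so it is the only candidate for removal, even though it calls a function that is itself live.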

Address Spaces

CUDA address spaces have been mostly implemented. Any user-defined static that does not rely on interior mutability will be placed in the constant address space (__constant__); otherwise it will be placed in the generic address space (which is global for globals). This also allowed us to implement basic static shared memory support.
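To make the interior mutability distinction concrete, here is a small sketch in plain Rust that runs on the host. It only illustrates which kind of static falls on which side of the rule described above; the names `WEIGHTS`, `COUNTER`, and `weighted_sum` are hypothetical, and the comments about placement restate the rule rather than anything verified against generated PTX.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// No interior mutability: the value cannot change after initialization,
// so under the rule above it would be eligible for the constant address space.
static WEIGHTS: [f32; 4] = [0.1, 0.2, 0.3, 0.4];

// Interior mutability (an atomic): the value can be mutated through a shared
// reference, so under the rule above it would stay in the generic address space.
static COUNTER: AtomicU32 = AtomicU32::new(0);

/// Multiplies each value by its weight, sums the results,
/// and records that a call happened.
fn weighted_sum(values: &[f32; 4]) -> f32 {
    COUNTER.fetch_add(1, Ordering::Relaxed);
    values.iter().zip(WEIGHTS.iter()).map(|(v, w)| v * w).sum()
}

fn main() {
    let total = weighted_sum(&[1.0, 1.0, 1.0, 1.0]);
    println!("sum = {total}, calls = {}", COUNTER.load(Ordering::Relaxed));
}
```

The read-only table is the typical candidate for constant memory, while anything built on atomics, `Cell`, or `UnsafeCell` has to remain writable.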

Libm override

The codegen automatically overrides calls to libm with calls to libdevice. This allows existing no_std crates to take advantage of architecture-optimized math intrinsics. It can be disabled from cuda_builder if you need strict determinism. The override also noticeably reduces PTX size in math-heavy kernels (3.8 kloc to 2.3 kloc in our path tracer) and slightly reduces register usage, which can yield performance gains.

cuda_std

  • Added address space query and conversion functions in cuda_std::ptr.
  • Added #[externally_visible] for making sure the codegen does not eliminate a function that is not used by a kernel.
  • Added #[address_space(...)] for making the codegen put a static in a specific address space. This is mostly internal and unsafe.
  • Added basic static shared memory support with cuda_std::shared_array!.

Cust

Cust 0.2 was actually released some time ago, but these were the changes in 0.2 and 0.2.1:

  • Added Device::as_raw.
  • Added MemoryAdvise for unified memory advising.
  • Added MemoryAdvise::prefetch_host and MemoryAdvise::prefetch_device for telling CUDA to explicitly fetch unified memory somewhere.
  • Added MemoryAdvise::advise_read_mostly.
  • Added MemoryAdvise::preferred_location and MemoryAdvise::unset_preferred_location.
    Note that advising APIs are only present on high-end GPUs such as V100s.
  • StreamFlags::NON_BLOCKING has been temporarily disabled because of soundness concerns.
  • Changed GpuBox::as_device_ptr and GpuBuffer::as_device_ptr to take &self instead of &mut self.
  • Renamed DBuffer -> DeviceBuffer. This is the name rustacuda used; it was changed
    at some point, but we have since reconsidered and decided the change may have been the wrong choice.
  • Renamed DBox -> DeviceBox.
  • Renamed DSlice -> DeviceSlice.
  • Removed GpuBox::as_device_ptr_mut and GpuBuffer::as_device_ptr_mut.
  • Removed the accidentally added vek default feature.
  • The vek feature now uses default-features = false. This also means Rgb and Rgba no longer implement DeviceCopy.