
Directed rounding #2576

Open · wants to merge 16 commits into base: master

Conversation


@orkolorko commented Dec 3, 2024

In this pull request we add calls to the intrinsics for directed add, sub, mul, div, and fma, as well as a small tutorial on how to expose intrinsics. The generated PTX code is as expected.
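
As a rough illustration of the pattern the tutorial covers, a minimal sketch of exposing one such libdevice intrinsic is shown below (the wrapper is illustrative and not the exact code in this PR; __nv_dadd_rd is the libdevice double-precision add that rounds toward negative infinity):

# Sketch only: expose a libdevice directed-rounding intrinsic for device code.
# The "extern" llvmcall is resolved against libdevice when CUDA.jl compiles the
# kernel, so this definition is only callable from GPU code.
add_rd(x::Float64, y::Float64) =
    ccall("extern __nv_dadd_rd", llvmcall, Cdouble, (Cdouble, Cdouble), x, y)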

@maleadt (Member) commented Dec 3, 2024

I think it would be better to use RoundingMode arguments instead of different functions. Base does this as well: https://github.com/JuliaLang/julia/blob/2590e675885b97579a7531c343a546f6f5bbcbe5/base/rounding.jl#L469-L472.

@orkolorko (Author)

What do you think if we define aliases with RoundingMode? Something like:
@device_override Base.fma(x, y, z, ::RoundingMode{:ToZero}) = CUDA.fma_rz(x, y, z).

I tried to pass a vector of RoundingMode to a CUDA kernel, but it did not work.
Another thing: at the moment I'm not confident that NVPTX is calling the right intrinsics, so I'm investigating the numerical behavior. Can you point me in the direction of where to check this?

@maleadt (Member) commented Dec 3, 2024

What do you think if we define aliases with RoundingMode? Something like:
@device_override Base.fma(x, y, z, ::RoundingMode{:ToZero}) = CUDA.fma_rz(x, y, z).

Base.fma doesn't typically take this argument, so adding this definition wouldn't make your code generic. Might be something valuable to add to Base though.

Why wouldn't you unify the CUDA.fma definitions using a RoundingMode arg?
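
Concretely, unified definitions could look roughly like the sketch below (the fma_rn/fma_rz/fma_ru/fma_rd names mirror the ones discussed in this thread; the exact signatures are an assumption, not the final PR code):

fma(x::Float64, y::Float64, z::Float64, ::RoundingMode{:Nearest}) = fma_rn(x, y, z)
fma(x::Float64, y::Float64, z::Float64, ::RoundingMode{:ToZero})  = fma_rz(x, y, z)
fma(x::Float64, y::Float64, z::Float64, ::RoundingMode{:Up})      = fma_ru(x, y, z)
fma(x::Float64, y::Float64, z::Float64, ::RoundingMode{:Down})    = fma_rd(x, y, z)

Kernel code could then call CUDA.fma(x, y, z, RoundDown), with the rounding mode resolved at compile time through dispatch.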

I'm not confident that NVPTX is calling the right intrinsics, so I'm investigating the numerical behavior. Can you point me in the direction of where to check this?

You can always inspect the SASS code using @device_code_sass.
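
For instance (my_kernel and its arguments are placeholders, not code from this PR):

# Dumps the native device code for every kernel compiled by the enclosed
# expression; @device_code_ptx does the same at the PTX level, where the
# .rn/.rz/.ru/.rd rounding suffixes are directly visible.
@device_code_sass @cuda my_kernel(args...)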

@orkolorko (Author) commented Dec 8, 2024

I implemented the calls using a RoundingMode arg; I kept the function names and aliased them, but if you prefer I can remove the function names. I was also able to expose the WMMA interface and tested it against the test kernel in #1426:

function kernel_wmma_f64_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m8n8k4_global_stride_f64(pointer(a_dev), 8)
    b_frag = WMMA.llvm_wmma_load_b_col_m8n8k4_global_stride_f64(pointer(b_dev), 4)
    c_frag = WMMA.llvm_wmma_load_c_col_m8n8k4_global_stride_f64(pointer(c_dev), 8)

    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag)
    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundToZero)
    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundUp)
    d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundDown)
    
    WMMA.llvm_wmma_store_d_col_m8n8k4_global_stride_f64(pointer(d_dev), d_frag, 8)
    return nothing
end

function call_kernel()
    m = n = 8
    k = 4
    dtype_a = dtype_b = Float64
    dtype_c = dtype_d = Float64

    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)

    CUDA.@sync @cuda kernel_wmma_f64_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end

Everything seems to work fine! I also checked the numerical results for the operations and they are correct.
As a reference, it may be worth recording here that there is an inconsistency in the PTX documentation about the fragment size: one section correctly states that the accumulator fragments are "A vector expression containing two .f64 elements from the matrix C", while another section says it is a single element, which is wrong.

@orkolorko mentioned this pull request Dec 8, 2024
@maleadt (Member) left a comment

Thanks. Keeping the _rn etc. names seems fine to me, seeing how CUDA defines them as well.

docs/src/tutorials/exposing_new_intrinsics.jl (outdated review comments, resolved)
Comment on lines 47 to 101
# The binary operations add, sub, mul, and div have been implemented through a macro

function test_add!(out, x, y)
    I = threadIdx().x
    if I == 1
        out[I] = CUDA.add(x, y, RoundNearest)
    elseif I == 2
        out[I] = CUDA.add(x, y, RoundToZero)
    elseif I == 3
        out[I] = CUDA.add(x, y, RoundUp)
    elseif I == 4
        out[I] = CUDA.add(x, y, RoundDown)
    end
    return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_add!(out_d, 1.0, 2.0^(-54))
out_h = Array(out_d)

function test_sub!(out, x, y)
    I = threadIdx().x
    if I == 1
        out[I] = CUDA.sub(x, y, RoundNearest)
    elseif I == 2
        out[I] = CUDA.sub(x, y, RoundToZero)
    elseif I == 3
        out[I] = CUDA.sub(x, y, RoundUp)
    elseif I == 4
        out[I] = CUDA.sub(x, y, RoundDown)
    end
    return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_sub!(out_d, 1.0, 2.0^(-53))
out_h = Array(out_d)

function test_mul!(out, x, y)
    I = threadIdx().x
    if I == 1
        out[I] = CUDA.mul(x, y, RoundNearest)
    elseif I == 2
        out[I] = CUDA.mul(x, y, RoundToZero)
    elseif I == 3
        out[I] = CUDA.mul(x, y, RoundUp)
    elseif I == 4
        out[I] = CUDA.mul(x, y, RoundDown)
    end
    return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_mul!(out_d, 1.0 - 2.0^(-52), 1.0 + 2.0^(-52))
out_h = Array(out_d)
Member

Not sure how this part is still relevant to the 'defining an intrinsic' tutorial?

Author

Left only one example

Comment on lines -393 to -394


Member

Unrelated change.

Comment on lines 18 to 19
"f32" => Float32,
"f64" => Float64
Member

Unrelated changes?

Author

I added intrinsic calls for WMMA with directed rounding modes.

Member

Can you keep that to a separate PR? We also currently don't support Float64 WMMA, see #1426.

@maleadt force-pushed the master branch 4 times, most recently from 5d585c4 to c850163 (December 20, 2024 08:18)
@maleadt added the enhancement (New feature or request), needs tests (Tests are requested), and cuda kernels (Stuff about writing CUDA kernels) labels on Dec 20, 2024