From da78ffb5ae9c5d8d1b745237839e04bd4a1fd01b Mon Sep 17 00:00:00 2001
From: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Date: Fri, 6 Dec 2024 05:58:07 +0000
Subject: [PATCH] [release/2.2] [ROCm] Correct numerical issues in layer norm
 backwards kernel (#140259) (#1767)

It was raised that the backwards layer norm on AMD was slightly less
accurate than the equivalent NVIDIA implementation.

On AMD we call into a helper kernel, `cuLoadWriteStridedInputs`, which
processes strided input and accumulates the partial gradients into
shared memory.

In this kernel (https://github.com/pytorch/pytorch/pull/87635) we
truncated `mean` and `rstd` from the T_ACC type to T, which caused
numerical issues in the warp buffers created by the kernel. This PR
uses the correct accumulator type for `mean` and `rstd`.

Note: only the AMD path calls into this helper for backwards layer
norm, so this was not an issue for NVIDIA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140259
Approved by: https://github.com/jianyuh

(cherry picked from commit 001f7366a71cd19e4b460624ab76053225c8676e)

Fixes #ISSUE_NUMBER
---
 aten/src/ATen/native/cuda/layer_norm_kernel.cu | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 13e9e0230ca8d..de67d2f65bcc0 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -840,8 +840,8 @@ void cuLoadWriteStridedInputs(
 {
   int i1 = i1_block+thr_load_row_off;
   if (i1 < i1_end) {
-    T curr_mean = mean[i1];
-    T curr_rstd = rstd[i1];
+    T_ACC curr_mean = mean[i1];
+    T_ACC curr_rstd = rstd[i1];
     for (int k = 0; k < blockDim.y; ++k) {
       int i2 = i2_off + k;
       int load_idx = i1*N+i2;
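
Illustrative note (not part of the patch): the standalone C++ sketch below
demonstrates the class of error the fix addresses, namely that storing an
accumulator-precision value in the narrower I/O type discards bits before
the value is reused. Here double stands in for T_ACC and float for T; these
are stand-in types chosen for a host-only demo, not the kernel's actual
instantiation, where the gap is typically float (T_ACC) versus half (T) and
the loss is correspondingly larger.

#include <cstdio>

int main() {
  // Value computed in the accumulator type (double stands in for T_ACC).
  double rstd_acc = 0.123456789012345;

  // The old code truncated it to the narrower I/O type (float stands in
  // for T) before using it to build the per-warp partial-gradient buffers.
  float rstd_t = static_cast<float>(rstd_acc);

  // The rounding error introduced here propagates into every partial
  // gradient accumulated from this row; keeping T_ACC avoids it.
  printf("T_ACC value    : %.15f\n", rstd_acc);
  printf("after T trip   : %.15f\n", static_cast<double>(rstd_t));
  printf("rounding error : %g\n", rstd_acc - static_cast<double>(rstd_t));
  return 0;
}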