From da78ffb5ae9c5d8d1b745237839e04bd4a1fd01b Mon Sep 17 00:00:00 2001
From: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Date: Fri, 6 Dec 2024 05:58:07 +0000
Subject: [PATCH] [release/2.2] [ROCm] Correct numerical issues in layer norm
 backwards kernel (#140259) (#1767)

It was raised that the backwards layer norm on AMD was slightly less
accurate than the equivalent NVIDIA implementation.

On AMD we call into a helper kernel, `cuLoadWriteStridedInputs`, which
processes strided input and accumulates the partial gradients into
shared memory.

In this kernel (https://github.com/pytorch/pytorch/pull/87635) we
truncated `mean` and `rstd` from the T_ACC type to T, which caused
numerical issues in the warp buffers created by the kernel. This PR
uses the correct accumulator type for `mean` and `rstd`.

Note: only the AMD path calls into this helper for backwards layer
norm, so this was not an issue for NVIDIA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140259
Approved by: https://github.com/jianyuh

(cherry picked from commit 001f7366a71cd19e4b460624ab76053225c8676e)

Fixes #ISSUE_NUMBER
---
 aten/src/ATen/native/cuda/layer_norm_kernel.cu | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 13e9e0230ca8d..de67d2f65bcc0 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -840,8 +840,8 @@ void cuLoadWriteStridedInputs(
 {
   int i1 = i1_block+thr_load_row_off;
   if (i1 < i1_end) {
-    T curr_mean = mean[i1];
-    T curr_rstd = rstd[i1];
+    T_ACC curr_mean = mean[i1];
+    T_ACC curr_rstd = rstd[i1];
     for (int k = 0; k < blockDim.y; ++k) {
       int i2 = i2_off + k;
       int load_idx = i1*N+i2;
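
Illustrative note (not part of the patch): the standalone C++ sketch below
demonstrates the class of error the fix addresses, namely that storing an
accumulator-precision value in the narrower I/O type discards bits before
the value is reused. Here double stands in for T_ACC and float for T; these
are stand-in types chosen for a host-only demo, not the kernel's actual
instantiation, where the gap is typically float (T_ACC) versus half (T) and
the loss is correspondingly larger.

#include <cstdio>

int main() {
  // Value computed in the accumulator type (double stands in for T_ACC).
  double rstd_acc = 0.123456789012345;

  // The old code truncated it to the narrower I/O type (float stands in
  // for T) before using it to build the per-warp partial-gradient buffers.
  float rstd_t = static_cast<float>(rstd_acc);

  // The rounding error introduced here propagates into every partial
  // gradient accumulated from this row; keeping T_ACC avoids it.
  printf("T_ACC value    : %.15f\n", rstd_acc);
  printf("after T trip   : %.15f\n", static_cast<double>(rstd_t));
  printf("rounding error : %g\n", rstd_acc - static_cast<double>(rstd_t));
  return 0;
}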