Loss is nan, stopping training #30

Open
JunLiangZ opened this issue Feb 27, 2024 · 11 comments

@JunLiangZ

During the training process, the problem of loss being nan occurred. Why is this?

@jasscia18

> During the training process, the problem of loss being nan occurred. Why is this?

I also ran into this problem. Have you solved it?

@radarFudan

Maybe try float32 and reduce the learning rate; BF16 can suffer from stability issues.
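For anyone who wants to try this, here is a minimal sketch of training in full float32 with a reduced learning rate and no AMP (the model and data below are placeholders, not the repo's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model and data; substitute the actual Vim model and ImageNet loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000)).cuda().float()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # reduced from e.g. 5e-4

for _ in range(10):  # stand-in for the real dataloader loop
    images = torch.randn(8, 3, 224, 224, device="cuda")   # float32 inputs, no bf16 cast
    targets = torch.randint(0, 1000, (8,), device="cuda")

    # No torch.autocast / GradScaler here: everything stays in float32.
    loss = F.cross_entropy(model(images), targets)

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping is another common guard against NaN loss.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```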

@zhenyuZ-HUST

Making sure --if_amp is False seems to solve this problem. (Try setting --if_amp to False.)
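In case it helps, this is roughly what such a flag controls in a typical PyTorch AMP setup (a sketch only, not the repo's exact code; `if_amp` here mirrors the --if_amp argument):

```python
import contextlib
import torch

if_amp = False  # the equivalent of passing --if_amp False to the training script

# With AMP disabled, use a no-op context instead of bfloat16 autocast.
amp_ctx = (
    torch.autocast(device_type="cuda", dtype=torch.bfloat16)
    if if_amp
    else contextlib.nullcontext()
)

with amp_ctx:
    x = torch.randn(2, 3, device="cuda")
    y = (x @ x.T).sum()

print(y.dtype)  # torch.float32 when if_amp is False
```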

@sailor-z

Hi,
if_amp = False doesn't work for me. I also tried using a small learning rate, but the problem still exists. Does anyone know how to handle it?

@BranStarkkk

> Hi, if_amp = False doesn't work for me. I also tried using a small learning rate, but the problem still exists. Does anyone know how to handle it?

I also have this problem. Have you ever solved it?

@sailor-z

> Hi, if_amp = False doesn't work for me. I also tried using a small learning rate, but the problem still exists. Does anyone know how to handle it?
>
> I also have this problem. Have you ever solved it?

Not really. It seems all vision mambas have the same problem.

@CacatuaAlan

Setting AMP=False may work, or just use a lower learning rate.

@BranStarkkk

> Hi, if_amp = False doesn't work for me. I also tried using a small learning rate, but the problem still exists. Does anyone know how to handle it?
>
> I also have this problem. Have you ever solved it?
>
> Not really. It seems all vision mambas have the same problem.

I just changed the backbone from Vim to another vision mamba model, and it works...
Its name is VMamba.

@sailor-z

> Hi, if_amp = False doesn't work for me. I also tried using a small learning rate, but the problem still exists. Does anyone know how to handle it?
>
> I also have this problem. Have you ever solved it?
>
> Not really. It seems all vision mambas have the same problem.
>
> I just changed the backbone from Vim to another vision mamba model, and it works... Its name is VMamba.

Thanks for the information! I'll look into it.

@mdchuc

mdchuc commented May 30, 2024

Got the same problem; fixed it by dividing the sum of the forward/backward hidden states by 2 so that the hidden states/residuals of all layers have similar magnitudes. Check out the details: #90
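For reference, a rough sketch of the kind of change described above (tensor names are illustrative, not the repo's variables; see #90 for the actual patch):

```python
import torch

# Illustrative forward/backward hidden states from a bidirectional block,
# shape (batch, seq_len, dim).
hidden_fwd = torch.randn(2, 196, 192)
hidden_bwd = torch.randn(2, 196, 192)

# A plain sum roughly doubles the magnitude of the hidden states/residuals at
# every layer, which can compound across depth and eventually overflow to NaN.
hidden_sum = hidden_fwd + hidden_bwd

# The fix described above: divide the sum by 2 so the combined hidden state
# keeps a magnitude similar to each single-direction output.
hidden_avg = (hidden_fwd + hidden_bwd) / 2

print(hidden_sum.abs().mean().item(), hidden_avg.abs().mean().item())
```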

@Karn3003

@mdchuc, do you have any idea why, in the code, they flip out_b along dim=-1? Shouldn't it be dim=1?
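For context on the question, a small illustration of plain torch.flip semantics on a (batch, seq_len, dim) token tensor (not a claim about which dim the repo should use):

```python
import torch

# Token sequence of shape (batch, seq_len, dim): 1 sample, 4 tokens, 3 channels.
x = torch.arange(12).reshape(1, 4, 3)

# dim=1 reverses the token order (what a backward scan over the sequence wants).
flip_seq = torch.flip(x, dims=[1])

# dim=-1 reverses the channels *within* each token; token order is unchanged.
flip_chan = torch.flip(x, dims=[-1])

print(x[0])          # tokens 0..3 in order
print(flip_seq[0])   # tokens 3..0
print(flip_chan[0])  # same token order, channels reversed
```

Note that if the tensor at that point in the code is laid out as (batch, dim, seq_len), then dim=-1 is the sequence dimension and flipping it would be the intended backward scan.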
