
out_scale/grads_have_scale,ZeroDivisionError: float division by zero #20

Open
chengzhen123 opened this issue May 22, 2023 · 2 comments

@chengzhen123

Traceback (most recent call last):
  File "train.py", line 202, in <module>
    lr=args.lr, device=device, img_scale=args.scale, val_percent=args.val / 100)
  File "train.py", line 95, in train_net
    scaled_loss.backward()
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\contextlib.py", line 119, in __exit__
    next(self.gen)
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\scaler.py", line 183, in unscale_with_stashed
    out_scale/grads_have_scale,
ZeroDivisionError: float division by zero

This error shows up after only 2 epochs of training. I read online that lowering the lr by an order of magnitude fixes it, so I changed lr from 0.01 to 0.001, but the error came back at epoch 26.
Is there another way to eliminate this error, and what causes it? Thanks.
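
For context on the cause: apex's dynamic loss scaler halves the loss scale every time it detects inf/NaN gradients. If the gradients keep overflowing (for example because the loss spikes), the scale can decay all the way to 0.0, at which point the `out_scale/grads_have_scale` division in scaler.py raises exactly this ZeroDivisionError. Below is a minimal sketch of two mitigations, assuming an O1 apex setup similar to train.py's; the model, criterion, tensors, and max_norm value are illustrative placeholders, not taken from this repo.

```python
import torch
import torch.nn as nn
from apex import amp

# Hypothetical stand-ins for the real UNet3+ model, loss, and batch.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

# Mitigation 1: put a floor under the dynamic loss scale so repeated
# overflows cannot drive it to zero (loss_scale=128.0 would pin it outright).
model, optimizer = amp.initialize(model, optimizer, opt_level="O1",
                                  min_loss_scale=1.0)

images = torch.randn(2, 3, 64, 64, device="cuda")
masks = torch.rand(2, 1, 64, 64, device="cuda")

optimizer.zero_grad()
loss = criterion(model(images), masks)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
# Mitigation 2: clip the (now unscaled) gradients; exploding gradients are
# what force the scaler to keep halving in the first place.
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
optimizer.step()
```

If the scale still collapses after this, the repeated overflows usually point at a genuine divergence (a bad batch, too-high lr, or an unstable loss) rather than an AMP bug.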

@Susu0812


I've run into the same problem. Have you solved it?

@lxy5513 commented Apr 23, 2024

Keep lowering the lr...
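
Lowering the lr only postpones the overflow, though. Since apex AMP is deprecated anyway, another option is PyTorch's built-in AMP (torch.cuda.amp, available since PyTorch 1.6), whose GradScaler skips the optimizer step on overflow instead of letting the scale decay to zero. A rough sketch with placeholder model and data, not the repo's actual train.py:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model, loss, and batch.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(2, 3, 64, 64, device="cuda")
masks = torch.rand(2, 1, 64, 64, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():   # mixed-precision forward pass
    loss = criterion(model(images), masks)
scaler.scale(loss).backward()     # backward on the scaled loss
scaler.step(optimizer)            # step is skipped if grads contain inf/NaN
scaler.update()                   # scale shrinks on overflow, grows otherwise
```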
