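Hello author, I am also hitting the deterministic-algorithm warning that prevents the model from running, same as the issue reported in the comments. I am training on four RTX 3090s with device set to 0,1,2,3; specifying a single GPU directly produces a CUDA error. I would appreciate your advice. #16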

Open
urban-drummer opened this issue Jul 18, 2024 · 4 comments

Comments

@urban-drummer

segment/train: weights=/media/dell/lhx/yolo/ASF-YOLO/yolov5l-seg.pt, cfg=/media/dell/lhx/yolo/ASF-YOLO/models/segment/asf-yolo.yaml, data=/media/dell/lhx/yolo/ASF-YOLO/data/bcc.yaml, hyp=/media/dell/lhx/yolo/ASF-YOLO/data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=8, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=../runs_2/train-seg, name=improve, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, mask_ratio=4, no_overlap=False
YOLOv5  2024-5-30 Python-3.8.0 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:1 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:2 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:3 (NVIDIA GeForce RTX 3090, 24260MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir ../runs_2/train-seg', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=1

             from  n    params  module                                  arguments                     

0 -1 1 7040 models.common.Conv [3, 64, 6, 2, 2]
1 -1 1 73984 models.common.Conv [64, 128, 3, 2]
2 -1 3 156928 models.common.C3 [128, 128, 3]
3 -1 1 295424 models.common.Conv [128, 256, 3, 2]
4 -1 6 1118208 models.common.C3 [256, 256, 6]
5 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
6 -1 9 6433792 models.common.C3 [512, 512, 9]
7 -1 1 4720640 models.common.Conv [512, 1024, 3, 2]
8 -1 3 9971712 models.common.C3 [1024, 1024, 3]
9 -1 1 2624512 models.common.SPPF [1024, 1024, 5]
10 -1 1 525312 models.common.Conv [1024, 512, 1, 1]
11 4 1 132096 models.common.Conv [256, 512, 1, 1]
12 [-1, 6, -2] 1 0 models.common.Zoom_cat [512]
13 -1 3 3019776 models.common.C3 [1536, 512, 3, False]
14 -1 1 131584 models.common.Conv [512, 256, 1, 1]
15 2 1 33280 models.common.Conv [128, 256, 1, 1]
16 [-1, 4, -2] 1 0 models.common.Zoom_cat [256]
17 -1 3 756224 models.common.C3 [768, 256, 3, False]
18 -1 1 590336 models.common.Conv [256, 256, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 3 2495488 models.common.C3 [512, 512, 3, False]
21 -1 1 2360320 models.common.Conv [512, 512, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 3 9971712 models.common.C3 [1024, 1024, 3, False]
24 [4, 6, 8] 1 460544 models.common.ScalSeq [256]
25 [17, -1] 1 12325 models.common.attention_model [256]
26 [-1, 20, 23] 1 1393558 models.yolo.Segment [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], 32, 256, [256, 512, 1024]]
asf-yolo summary: 407 layers, 48465467 parameters, 48465467 gradients, 155.4 GFLOPs

Transferred 602/671 items from /media/dell/lhx/yolo/ASF-YOLO/yolov5l-seg.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 110 weight(decay=0.0), 116 weight(decay=0.0005), 114 bias
WARNING ⚠️ DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at ultralytics/yolov5#475 to get started.
train: Scanning /media/dell/lhx/yolo/ASF-YOLO/datasets/BCC/labels/train.cache... 128 images, 0 backgrounds, 0 corrupt: 100%|██████████| 128/128 00:00
val: Scanning /media/dell/lhx/yolo/ASF-YOLO/datasets/BCC/labels/val.cache... 32 images, 0 backgrounds, 0 corrupt: 100%|██████████| 32/32 00:00

AutoAnchor: 4.36 anchors/target, 0.970 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING ⚠️ Extremely small objects found: 47 of 1235 labels are <3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 1235 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.7403: 100%|██████████| 1000/1000 00:00
AutoAnchor: thr=0.25: 0.9571 best possible recall, 6.31 anchors past thr
AutoAnchor: n=9, img_size=640, metric_all=0.391/0.743-mean/best, past_thr=0.495-mean: 25,43, 88,52, 51,155, 92,121, 163,129, 116,183, 236,232, 160,418, 350,452
AutoAnchor: Done ⚠️ (original anchors better than new anchors, proceeding with original anchors)
Plotting labels to ../runs_2/train-seg/improve3/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to ../runs_2/train-seg/improve3
Starting training for 100 epochs...

  Epoch    GPU_mem   box_loss   seg_loss   obj_loss   cls_loss  Instances       Size

0%| | 0/16 00:03
Traceback (most recent call last):
File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 658, in <module>
main(opt)
File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 554, in main
train(opt.hyp, opt, device, callbacks)
File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 317, in train
scaler.scale(loss).backward()
File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: max_pool3d_with_indices_backward_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

Process finished with exit code 1

@mkang315
Owner

Did you look at and follow YOLOv5 Multi-GPU Tutorial?

@urban-drummer
Author

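"Did you look at and follow the YOLOv5 Multi-GPU Tutorial?"
Hello author, I strictly followed the multi-GPU training procedure and set a breakpoint to confirm that DDP mode was enabled. I also tried training on a single specified GPU, and the error was the same as above: max_pool3d_with_indices_backward_cuda.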

@urban-drummer
Author

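Thank you for your help. For now, I added torch.use_deterministic_algorithms(True, warn_only=True) before scaler.scale(loss).backward(), and the model runs, though with warnings. I am still investigating the error; it is most likely a version compatibility issue.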
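For anyone landing here with the same error, the workaround can be sketched roughly as below. This is a minimal illustration, not the ASF-YOLO training code: `relaxed_determinism` is a hypothetical helper name, and the tiny `max_pool3d` tensor only stands in for the real loss. It assumes a PyTorch version that supports the `warn_only` argument, and it scopes the relaxed setting to the backward pass instead of changing it globally.

```python
import contextlib

import torch


@contextlib.contextmanager
def relaxed_determinism():
    """Hypothetical helper: downgrade determinism errors to warnings, then restore."""
    was_enabled = torch.are_deterministic_algorithms_enabled()
    was_warn_only = torch.is_deterministic_algorithms_warn_only_enabled()
    torch.use_deterministic_algorithms(was_enabled, warn_only=True)
    try:
        yield
    finally:
        # Put the previous global settings back, even if backward() raised.
        torch.use_deterministic_algorithms(was_enabled, warn_only=was_warn_only)


# Strict mode, as in the failing run reported above.
torch.use_deterministic_algorithms(True)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in for the real loss: a 3D max pool, whose CUDA backward kernel
# has no deterministic implementation.
x = torch.randn(1, 1, 4, 4, 4, device=device, requires_grad=True)
loss = torch.nn.functional.max_pool3d(x, kernel_size=2).sum()

with relaxed_determinism():
    loss.backward()  # on CUDA this emits a UserWarning instead of a RuntimeError
```

In a training loop the `with` block would wrap `scaler.scale(loss).backward()`. Results are then no longer guaranteed to be bit-for-bit reproducible across runs, which is the trade-off the error message describes.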

@mkang315
Owner

Sorry for the inconvenience. If the error persists, you may try copying our 'models' files into the 'models' folder of YOLOv5 and adding the required import dependencies from our code. Our code is based on that of YOLOv5. Thanks!
