Hello, I can train on a single GPU, but multi-GPU training errors out. Have you tried multi-GPU training? #9
Comments
Single-machine multi-GPU training has been tested on my end without problems: two RTX 2070 cards can train. From your error report, my guess is that GPU memory might be running out. Do you have a more detailed log?
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: operator(): block: [41,0,0], thread: [0,0,0] Assertion failed
@wuzhihao7788
The log looks like an out-of-bounds indexing problem. Check whether your dataset was built correctly; you can test on a single GPU first to see whether it still errors.
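Since the device-side assert points at an indexing kernel, one quick check is to scan the label files for out-of-range class indices or coordinates. A minimal sketch, assuming YOLO-format .txt labels with one `class cx cy w h` line per object; the helper name and the label format are assumptions, not this repo's actual API:

```python
from pathlib import Path

def check_labels(label_dir: str, num_classes: int) -> list:
    """Return (filename, line number, line) for every label line whose
    class index is outside [0, num_classes) or whose normalized box
    coordinates fall outside [0, 1]. Assumes YOLO txt format (assumption)."""
    bad = []
    for txt in Path(label_dir).glob("*.txt"):
        for lineno, line in enumerate(txt.read_text().splitlines(), 1):
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            cls = int(float(parts[0]))
            coords = [float(x) for x in parts[1:5]]
            if not (0 <= cls < num_classes) or any(c < 0 or c > 1 for c in coords):
                bad.append((txt.name, lineno, line))
    return bad
```

Any hit from a check like this would explain an index-out-of-bounds assert during loss computation.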
@wuzhihao7788
Yes.
Since this error isn't very obvious, I'll find time to rerun the code and check whether it is consistent with the code on GitHub (my own experiments ran without problems). I'll try to give you an answer tomorrow.
@wuzhihao7788 Could you send me a copy of your COCO label files (train.txt, val.txt, and label.names)? I'd like to verify first whether my data preparation has any problems. My email is [email protected]
@wuzhihao7788
@wuzhihao7788 The problem is likely that during multi-GPU training the images in a batch are scattered across the cards, but the loss is computed against the labels of the entire batch size.
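A minimal sketch of the kind of fix this diagnosis calls for. In YOLO-style pipelines the targets tensor usually carries each box's image index within the batch in column 0; the function name and that column convention are assumptions here, not this repo's actual API. When images are scattered across replicas, each replica must see only the targets belonging to its slice of the batch, re-indexed to the local batch:

```python
import torch

def split_targets_for_replica(targets: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Keep only the targets whose image index falls in [start, end),
    and re-base that index so it matches the replica's local batch.
    targets[:, 0] is assumed to hold the image index within the full batch."""
    mask = (targets[:, 0] >= start) & (targets[:, 0] < end)
    local = targets[mask].clone()
    local[:, 0] -= start  # re-index into the replica's slice
    return local
```

Without a step like this, a replica holding images 0..N/2 would look up labels indexed against the full batch, which is exactly the out-of-bounds pattern in the assert above.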
Thanks for the feedback. I'm debugging this code now to find the exact cause; once the fix is finished and uploaded, I'll let you know.
The latest code has been updated. Please pull it again and try multi-GPU training to see whether the error is gone.
@wuzhihao7788 Hi, it works now.
@wuzhihao7788 I ran into an error during testing.
OK, I'll look at the validation code tomorrow.
The latest code has been updated; with multi-GPU training, the validation set no longer has problems.
@wuzhihao7788
More or less. My training runs have all been intermittent. In my single-GPU experience, whether on the COCO dataset or a small one, it takes around 24 epochs before you can see meaningful results.
Could you contribute your trained model later? My hardware is limited, and I haven't yet trained a sufficiently converged model.
@wuzhihao7788 I'll keep training on my end and will contribute a good model when I get one.
Thanks.
@wuzhihao7788 yolov5l, with total epochs set to 100; trained for 92 epochs.
Thanks for running this training. The result still lags the official one, since the official training runs 300 epochs on a V100. Could you keep training? The model should continue to converge.
Official result: mAP@0.5:0.95: 47.7, mAP@0.5: 66.5
{"mode": "val", "epoch": 92, "iter": 7329, "lr": 0.00022, "P": 0.33703, "R": 0.62344, "mAP@0.5": 0.51406, "mAP@0.5:0.95": 0.35654}
@wuzhihao7788 Training yolov5l is too slow; I'm going to try yolov5s, which trains faster.
OK. You can set the maximum epoch directly to 300 for training; the other parameters don't need adjusting.
@wuzhihao7788 yolov5s trained for 300 epochs; AP only reaches 30, which is still a gap from the official 37.7.
Traceback (most recent call last):
  File "tools/train.py", line 144, in <module>
    main()
  File "tools/train.py", line 140, in main
    train_detector(model, datasets, cfg, validate=args.validate, timestamp=timestamp, meta=meta)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/apis/train.py", line 161, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/apis/runner.py", line 331, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/apis/runner.py", line 220, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/apis/train.py", line 116, in batch_processor
    losses = model(**data)
  File "/data4/xieyangyang/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data4/xieyangyang/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data4/xieyangyang/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data4/xieyangyang/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/data4/xieyangyang/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data4/xieyangyang/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/data4/xieyangyang/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/models/detectors/base.py", line 55, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/models/detectors/YOLOv5Detector.py", line 100, in forward_train
    head_loss = self.head.loss(*head_loss_inputs)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/models/heads/yolo.py", line 99, in loss
    bbox_loss, confidence_loss, class_loss = multi_apply(self.loss_single, pred, indices, tbox, tcls, ancher, self.conf_balances, ignore_mask)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/utils/util.py", line 32, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/models/heads/yolo.py", line 91, in loss_single
    return self.yolov5_loss_single(pred, indices, tbox, tcls, anchors, conf_balances, ignore_mask)
  File "/data1/xieyangyang/yolodet-pytorch/yolodet/models/heads/yolo.py", line 210, in yolov5_loss_single
    pwh = (ps[:, 2:4].sigmoid() * 2) ** 2 * torch.from_numpy(anchors).to(device)  # 0-4x scaling, model.hyp['anchor_t'] = 4
RuntimeError: CUDA error: device-side assert triggered
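For context on the last frame: that line is the standard YOLOv5 width/height decoding, wh = (sigmoid(p) * 2)^2 * anchor, which bounds each predicted side to at most 4x its anchor (matching `model.hyp['anchor_t'] = 4`). A standalone sketch (the function name is illustrative, not the repo's API). Note that CUDA executes asynchronously, so a device-side assert is reported at whatever call next synchronizes; the frame shown is not necessarily where the out-of-bounds access actually happened.

```python
import torch

def decode_wh(pred_wh: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """YOLOv5-style width/height decoding: sigmoid bounds the raw prediction
    to (0, 1), so the output is (0, 4) times the anchor dimensions."""
    return (pred_wh.sigmoid() * 2) ** 2 * anchors
```

At a raw prediction of 0 this returns exactly the anchor size (sigmoid(0) = 0.5, so the factor is (0.5 * 2)^2 = 1), and the factor saturates at 4 for large inputs.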