fair1m_1_5 baseline cannot run #49

Open
Pull256 opened this issue Aug 9, 2022 · 6 comments

Pull256 commented Aug 9, 2022

Test environment

Windows 10 21H2
WSL2 Ubuntu 22.04 LTS 4.19.128-microsoft-standard
miniconda3 + Python 3.7
CUDA 11.7, GPU model: 1060

Errors

With CUDA

Loading config from: configs/s2anet/s2anet_r50_fpn_1x_fair1m_1_5.py
[w 0809 11:21:49.947748 32 __init__.py:1344] load parameter fc.weight failed ...
[w 0809 11:21:49.947908 32 __init__.py:1344] load parameter fc.bias failed ...
[w 0809 11:21:50.017176 32 __init__.py:1363] load total 267 params, 2 failed
Tue Aug 9 11:21:50 2022 Start running
Traceback (most recent call last):
  File "tools/run_net.py", line 56, in <module>
    main()
  File "tools/run_net.py", line 47, in main
    runner.run()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 84, in run
    self.train()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 126, in train
    losses = self.model(images,targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/networks/s2anet.py", line 35, in execute
    outputs = self.bbox_head(features, targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 627, in execute
    return self.loss(*outs,*self.parse_targets(targets))
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 360, in loss
    sampling=self.sampling)
  File "/home/la/JT/JDet/python/jdet/models/boxes/anchor_target.py", line 74, in anchor_target
    unmap_outputs=unmap_outputs)
  File "/home/la/JT/JDet/python/jdet/utils/general.py", line 53, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/la/JT/JDet/python/jdet/models/boxes/anchor_target.py", line 127, in anchor_target_single
    if not inside_flags.any(0):
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1735, in to_bool
    return ori_bool(v.item())
RuntimeError: [f 0809 11:21:58.030555 32 executor.cc:665]
Execute fused operator(26/2009) failed.
[JIT Source]: /home/la/.cache/jittor/jt1.3.5/g++11.2.0/py3.7.13/Linux-4.19.128xc4/IntelRCoreTMi5xbf/default/cu11.7.99/jit/__opkey0_broadcast_to__Tx_float32__DIM_7__BCAST_19__opkey1_reindex__Tx_float32__XDIM_4__YD___hash_8f91e55bdd99985a_op.cc
[OP TYPE]: fused_op:( broadcast_to, reindex, binary.multiply, reduce.add,)
[Input]: float32[64,64,3,3,]backbone.layer1.0.conv2.weight, float32[2,64,256,256,],
[Output]: float32[2,64,256,256,],
[Async Backtrace]: ---
tools/run_net.py:56 <>
tools/run_net.py:47
/home/la/JT/JDet/python/jdet/runner/runner.py:84
/home/la/JT/JDet/python/jdet/runner/runner.py:126
/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
/home/la/JT/JDet/python/jdet/models/networks/s2anet.py:30
/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
/home/la/JT/JDet/python/jdet/models/backbones/resnet.py:166
/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/nn.py:2054
/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
/home/la/JT/JDet/python/jdet/models/backbones/resnet.py:84
/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/nn.py:847
[Reason]: [f 0809 11:21:58.030132 32 helper_cuda.h:128] CUDA error at /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/src/mem/allocator/cuda_managed_allocator.cc:23 code=2( cudaErrorMemoryAllocation ) cudaMallocManaged(&ptr, size)

With the --no_cuda argument added

Tue Aug 9 11:15:30 2022 Start running
Traceback (most recent call last):
  File "tools/run_net.py", line 56, in <module>
    main()
  File "tools/run_net.py", line 47, in main
    runner.run()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 84, in run
    self.train()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 126, in train
    losses = self.model(images,targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/networks/s2anet.py", line 35, in execute
    outputs = self.bbox_head(features, targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 625, in execute
    outs = multi_apply(self.forward_single, feats, self.anchor_strides)
  File "/home/la/JT/JDet/python/jdet/utils/general.py", line 53, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 236, in forward_single
    align_feat = self.align_conv(x, refine_anchor.clone(), stride)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 722, in execute
    x = self.relu(self.deform_conv(x, offset_tensor))
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/ops/dcn_v1.py", line 696, in execute
    self.dilation, self.groups, self.deformable_groups)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1603, in apply
    return func(*args, **kw)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1559, in __call__
    ori_res = self.execute(*args)
  File "/home/la/JT/JDet/python/jdet/ops/dcn_v1.py", line 589, in execute
    raise NotImplementedError
NotImplementedError

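What I have tried

After some searching, code=2( cudaErrorMemoryAllocation ) appears to be memory-related: it is raised when the program cannot allocate the memory it needs. Searching this repo, I found an issue with a similar error code, but I don't know how to fix it.
As for the error after adding the no-CUDA argument, I don't really understand it either; all I know is that a subclass has not implemented an interface that its parent class requires.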
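
For context on the second traceback: a bare `raise NotImplementedError` can come from an interface that a subclass never overrode, or from a deliberate guard on a code path that has no implementation for the selected backend. The traceback alone does not show which of the two applies at dcn_v1.py line 589. The sketch below only illustrates the two patterns; every class and function name in it is hypothetical and none of it is taken from JDet.

```python
class Op:
    """Hypothetical base class: subclasses are expected to override execute."""

    def execute(self, x):
        # Placeholder for an interface that has not been implemented.
        raise NotImplementedError


class DoubleOp(Op):
    """Hypothetical subclass that does implement the interface."""

    def execute(self, x):
        return 2 * x


def run_op(x, use_cuda):
    """Hypothetical backend guard: only the CUDA path exists in this sketch."""
    if not use_cuda:
        raise NotImplementedError("no CPU implementation in this sketch")
    return 2 * x
```

In the first pattern the fix is to implement the method; in the second, the operator simply cannot run on that backend until someone writes that code path.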

Collaborator

514flowey commented Aug 9, 2022

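This may be caused by insufficient GPU memory; the run needs about 12 GB of GPU memory, so a GPU with 12 GB or more is recommended.
You could try changing batch_size to 1.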
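
To put the batch_size suggestion in perspective, the size of a single activation tensor can be estimated from its shape. The sketch below uses the float32[2,64,256,256] output shape reported for the failing fused operator in the CUDA log of the first post. The helper name `tensor_mib` is hypothetical, and the estimate covers one buffer only: peak usage during training is the sum of many such buffers plus weights and gradients, which this does not model.

```python
from functools import reduce
from operator import mul


def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor; 4 bytes per element assumes float32."""
    elements = reduce(mul, shape, 1)
    return elements * bytes_per_element / 2**20


# Shape taken from the [Output] line of the failing operator in the log.
batch_2 = tensor_mib((2, 64, 256, 256))
# The same layer with batch_size=1 (first dimension halved).
batch_1 = tensor_mib((1, 64, 256, 256))
print(batch_2, batch_1)
```

Each activation scales linearly with the batch dimension, which is why halving batch_size is a common first step; memory taken by the weights does not shrink with it.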

Author

Pull256 commented Aug 9, 2022

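> This may be caused by insufficient GPU memory; the run needs about 12 GB of GPU memory, so a GPU with 12 GB or more is recommended. You could try changing batch_size to 1.

Thank you for the reply. After setting batch_size=1 in configs/s2anet/s2anet_r50_fpn_1x_fair1m_1_5.py the error is unchanged. Following the forum, I also ran a few tests:
python -m jittor.test.test_cuda: passes
python -m jittor.test.test_array: fails with code=700( cudaErrorIllegalAddress ) , code=4( CUDNN_STATUS_INTERNAL_ERROR )
python -m jittor.test.test_resnet: trains normally
After setting the environment variable use_cuda_managed_allocator=0, the error is unchanged.
It does look like a problem with my machine, since python -m jittor.test.test_resnet runs fine but things break as soon as a different dataset is used. I am not sure whether this guess is right.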
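
The checks listed above can be repeated in one go with a small script. This is a sketch under the assumption, taken from the comment itself, that Jittor picks up `use_cuda_managed_allocator` from the process environment; the function names `build_runs` and `run_diagnostics` are hypothetical.

```python
import os
import subprocess
import sys

# Module names are the ones quoted in the comment above.
DIAGNOSTIC_MODULES = [
    "jittor.test.test_cuda",
    "jittor.test.test_array",
    "jittor.test.test_resnet",
]


def build_runs(managed_allocator="0"):
    """Return (command, environment) pairs without executing anything."""
    env = dict(os.environ)
    env["use_cuda_managed_allocator"] = managed_allocator
    return [([sys.executable, "-m", module], env) for module in DIAGNOSTIC_MODULES]


def run_diagnostics():
    """Run each module in turn and map its name to the process exit code."""
    results = {}
    for command, env in build_runs():
        completed = subprocess.run(command, env=env)
        results[command[-1]] = completed.returncode
    return results
```

Calling `run_diagnostics()` inside the same conda environment gives one exit code per module, which makes it easier to report which checks pass with the flag set.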

@514flowey
Collaborator

Roughly how much GPU memory does your machine have?
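
One way to answer this from a terminal is to query `nvidia-smi`. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` and `--format=csv,noheader,nounits` options; the function names are hypothetical, and the values are reported in MiB.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(text):
    """Parse lines of the form 'name, total, used' into dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name is harmless.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return one dictionary per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Comparing `total_mib` with the roughly 12 GB mentioned earlier in the thread shows quickly whether the configuration is likely to fit.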

Author

Pull256 commented Aug 9, 2022

Mine is a laptop with a 1060 6 GB, a long way from the 12 GB you mentioned 😂

@cxjyxxme
Collaborator

In that case you could try a lighter model, or rent a server for training.


vicdxxx commented Sep 1, 2022

It does not run on my 2080 Ti with 11 GB of GPU memory either; same problem.
