Training was interrupted after the first epoch #499
Hello, thank you for your interest in our work! This is an automated response. Please note that most technical problems are due to changes made to the default repository or to custom data and environments, so we recommend first verifying against a fresh clone of the latest code:
sudo rm -rf yolov3 # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py # verify detection
python3 train.py # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
@TOMLEUNGKS BTW, do not use K80s to train; they are the least price-efficient GPU.
Thank you @glenn-jocher for the reminder!
Same problem here with a custom dataset in my environment. @TOMLEUNGKS did you solve this problem?
@TOMLEUNGKS
@uefall I see. I labeled images using LabelImg, which gives you standard YOLO annotations. My training went smoothly after switching to another AWS EC2 instance. Not sure what caused the problem, though.
@TOMLEUNGKS @uefall Great, glad to hear everything is working. I'll close the issue now.
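For reference, the standard YOLO annotations that LabelImg produces are plain-text .txt label files, one per image, with one object per line: a class index followed by the box centre x, centre y, width, and height, all normalized to the image size. A made-up two-object example (class indices and coordinates are hypothetical):
0 0.512 0.430 0.286 0.354
3 0.125 0.790 0.060 0.118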
I am using AWS EC2 to train a custom model on my own dataset (6 classes, 12594 images in total). After the first epoch, training was interrupted. Here is the error log:
[ec2-user@ip-172-31-4-237 yolo_retrain_v2]$ python3 train.py --cfg cfg/yolov3.cfg --weights weights/yolov3.weights --epochs 500
Fontconfig warning: ignoring UTF-8: not a valid region tag
Namespace(accumulate=2, adam=False, arc='defaultpw', batch_size=32, bucket='', cache_images=False, cfg='cfg/yolov3.cfg', data='data/coco.data', device='', epochs=500, evolve=False, img_size=416, img_weights=False, multi_scale=False, name='', nosave=False, notest=False, prebias=False, rect=False, resume=False, transfer=False, var=None, weights='weights/yolov3.weights')
Using CUDA device0 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device1 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device2 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device3 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device4 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device5 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device6 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device7 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
Reading labels (12594 found, 0 missing, 0 empty for 12594 images): 100%|##########################################################| 12594/12594 [00:00<00:00, 16385.53it/s]
Model Summary: 222 layers, 6.15507e+07 parameters, 6.15507e+07 gradients
Starting training for 500 epochs...
0%| | 0/394 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 415, in
train() # train normally
File "train.py", line 261, in train
pred = model(imgs)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 459, in forward
self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:518)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f08ce370273 in /home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x734 (0x7f0918c3c9e4 in /home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: + 0x691a4c (0x7f0918c2ba4c in /home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x1d3ef4 (0x7f091876def4 in /home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #33: __libc_start_main + 0xf5 (0x7f092c5af445 in /lib64/libc.so.6)
Exception in thread Thread-3172:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ec2-user/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 21, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 284, in rebuild_storage_fd
fd = df.detach()
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
May I know how this error can be fixed? I tried --nosave, but the error was the same. Thank you so much for your generous help!
Regards,
Tom
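For context, the RuntimeError above is raised by PyTorch's DistributedDataParallel wrapper, and the message itself points to the workaround of enabling unused-parameter detection. Below is a minimal, hypothetical Python sketch of where that flag fits; it is not this repository's actual training code, and the wrap_model helper and local_rank argument are assumptions made only for illustration:

import torch
import torch.distributed as dist
import torch.nn as nn

def wrap_model(model: nn.Module, local_rank: int) -> nn.parallel.DistributedDataParallel:
    # Hypothetical helper, not the repo's code: it only shows where
    # find_unused_parameters goes. The default process group must already be
    # initialised, e.g. via dist.init_process_group(backend="nccl").
    assert dist.is_initialized(), "call dist.init_process_group() first"
    model = model.to(torch.device("cuda", local_rank))
    return nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        # Tolerate parameters that do not contribute to the loss in a given
        # forward pass, as the error message recommends.
        find_unused_parameters=True,
    )

Alternatively, restricting training to a single GPU avoids the DistributedDataParallel code path entirely; given the --device argument visible in the Namespace printout above, something like python3 train.py --cfg cfg/yolov3.cfg --weights weights/yolov3.weights --device 0 should do that (assuming --device accepts a single GPU index).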