Training was interrupted after the first epoch #499
Hello, thank you for your interest in our work! This is an automated response. Please note that most technical problems are due to changes made to the default repository or to custom data and environments, so we recommend first verifying against a fresh clone of the latest code:
sudo rm -rf yolov3 # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py # verify detection
python3 train.py # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
@TOMLEUNGKS BTW, do not use K80s to train; they are the least price-efficient GPU.
Thank you @glenn-jocher for the reminder!
Same problem here with a custom dataset in my environment. @TOMLEUNGKS did you solve this problem?
@TOMLEUNGKS
@uefall I see. I labeled images using LabelImg, which gives you standard YOLO annotations. My training went smoothly after switching to another AWS EC2 instance. Not sure what caused the problem, though.
@TOMLEUNGKS @uefall Great, glad to hear everything is working. I'll close the issue now.
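For reference, the standard YOLO annotations that LabelImg produces are plain-text .txt label files, one per image, with one object per line: a class index followed by the box centre x, centre y, width, and height, all normalized to the image size. A made-up two-object example (class indices and coordinates are hypothetical):
0 0.512 0.430 0.286 0.354
3 0.125 0.790 0.060 0.118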
I am using AWS EC2 to train a custom model on my own dataset (6 classes, 12594 images in total). After the first epoch, training was interrupted. Here is the error log:
[ec2-user@ip-172-31-4-237 yolo_retrain_v2]$ python3 train.py --cfg cfg/yolov3.cfg --weights weights/yolov3.weights --epochs 500
Fontconfig warning: ignoring UTF-8: not a valid region tag
Namespace(accumulate=2, adam=False, arc='defaultpw', batch_size=32, bucket='', cache_images=False, cfg='cfg/yolov3.cfg', data='data/coco.data', device='', epochs=500, evolve=False, img_size=416, img_weights=False, multi_scale=False, name='', nosave=False, notest=False, prebias=False, rect=False, resume=False, transfer=False, var=None, weights='weights/yolov3.weights')
Using CUDA device0 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device1 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device2 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device3 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device4 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device5 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device6 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
device7 _CudaDeviceProperties(name='Tesla K80', total_memory=11441MB)
Reading labels (12594 found, 0 missing, 0 empty for 12594 images): 100%|##########################################################| 12594/12594 [00:00<00:00, 16385.53it/s]
Model Summary: 222 layers, 6.15507e+07 parameters, 6.15507e+07 gradients
Starting training for 500 epochs...
0%| | 0/394 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 415, in
train() # train normally
File "train.py", line 261, in train
pred = model(imgs)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 459, in forward
self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:518)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f08ce370273 in /home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x734 (0x7f0918c3c9e4 in /home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: + 0x691a4c (0x7f0918c2ba4c in /home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x1d3ef4 (0x7f091876def4 in /home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #33: __libc_start_main + 0xf5 (0x7f092c5af445 in /lib64/libc.so.6)
Exception in thread Thread-3172:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ec2-user/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 21, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 284, in rebuild_storage_fd
fd = df.detach()
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
May I know how this error can be fixed? I tried --nosave, but the error was the same. Thank you so much for your generous help!
Regards,
Tom
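For context, the RuntimeError above is raised by PyTorch's DistributedDataParallel wrapper, and the message itself points to the workaround of enabling unused-parameter detection. Below is a minimal, hypothetical Python sketch of where that flag fits; it is not this repository's actual training code, and the wrap_model helper and local_rank argument are assumptions made only for illustration:

import torch
import torch.distributed as dist
import torch.nn as nn

def wrap_model(model: nn.Module, local_rank: int) -> nn.parallel.DistributedDataParallel:
    # Hypothetical helper, not the repo's code: it only shows where
    # find_unused_parameters goes. The default process group must already be
    # initialised, e.g. via dist.init_process_group(backend="nccl").
    assert dist.is_initialized(), "call dist.init_process_group() first"
    model = model.to(torch.device("cuda", local_rank))
    return nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        # Tolerate parameters that do not contribute to the loss in a given
        # forward pass, as the error message recommends.
        find_unused_parameters=True,
    )

Alternatively, restricting training to a single GPU avoids the DistributedDataParallel code path entirely; given the --device argument visible in the Namespace printout above, something like python3 train.py --cfg cfg/yolov3.cfg --weights weights/yolov3.weights --device 0 should do that (assuming --device accepts a single GPU index).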