what can I do for RuntimeError: Trying to create tensor with negative dimension -1592267047: [-1592267047] #1688

Closed
Jackyinuo opened this issue Dec 14, 2020 · 7 comments
Labels: question (Further information is requested); Stale (Stale and schedule for closing soon)


@Jackyinuo

❔Question

Starting training for 300 epochs...

 Epoch   gpu_mem       box       obj       cls     total   targets  img_size
 0/299     5.73G   0.09355   0.08561   0.08285     0.262       154       640: 100%|██████████| 3697/3697 [37:33<00:00,  1.64it/s] 
           Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 157/157 [01:31<00:00,  1.71it/s]

 Epoch   gpu_mem       box       obj       cls     total   targets  img_size
 1/299     6.98G   0.09522    0.1221   0.08584    0.3032       247       640: 100%|██████████| 3697/3697 [35:22<00:00,  1.74it/s]
           Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   0%|          | 0/157 [00:00<?, ?it/s]

Analyzing anchors... anchors/target = 4.45, Best Possible Recall (BPR) = 0.9949
all 5e+03 3.63e+04 0.0145 0.00296 0.00248 0.000805
Traceback (most recent call last):
  File "train.py", line 503, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 336, in train
    results, maps, times = test.test(opt.data,
  File "/disk1/huihui/yolov5/test.py", line 120, in test
    output = non_max_suppression(inf_out, conf_thres=conf_thres, iou_thres=iou_thres, labels=lb)
  File "/disk1/huihui/yolov5/utils/general.py", line 332, in non_max_suppression
    i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
  File "/home/phzhou/anaconda3/envs/pt1/lib/python3.8/site-packages/torchvision/ops/boxes.py", line 42, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
RuntimeError: Trying to create tensor with negative dimension -1592267047: [-1592267047]

Additional context

@Jackyinuo Jackyinuo added the question Further information is requested label Dec 14, 2020
@Jackyinuo Jackyinuo changed the title RuntimeError: Trying to create tensor with negative dimension -1592267047: [-1592267047] what can I do for RuntimeError: Trying to create tensor with negative dimension -1592267047: [-1592267047] Dec 14, 2020
@glenn-jocher
Member

glenn-jocher commented Dec 14, 2020

@Jackyinuo that's very strange. You may have an environment problem. I would try to reproduce your error in a verified working environment like Google Colab or our Docker image, and if the error appears there, please raise a full bug report here. I'll post our default reply below.

Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

  • Your modified or out-of-date code. If your issue is not reproducible in a fresh git clone of this repo, we cannot debug it. Before going further, run this code and verify that your issue persists:
$ git clone https://github.com/ultralytics/yolov5 yolov5_new  # clone latest
$ cd yolov5_new
$ python detect.py  # verify detection

# CODE TO REPRODUCE YOUR ISSUE HERE
  • Your custom data. If your issue is not reproducible in one of our 3 common datasets (COCO, COCO128, or VOC), we cannot debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of your labels and images.

  • Your environment. If your issue is not reproducible in one of the verified environments below, we cannot debug it. If you are running YOLOv5 locally, verify that your environment meets all of the requirements.txt dependencies specified below. If in doubt, download Python 3.8.0 from https://www.python.org/, create a new venv, and then install requirements.

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

@zhiqwang
Contributor

I think this is a bug in nms; see pytorch/vision#1705.
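
One way a size like -1592267047 can appear is signed 32-bit integer wraparound. The sketch below only illustrates that arithmetic. It assumes, without confirmation from this thread, that an NMS kernel sizes a mask buffer as n * ceil(n / 64) using a 32-bit int; the function names and the block width of 64 are hypothetical and chosen for the example.

```python
BLOCK = 64  # assumed number of boxes per bitmask block (illustrative only)


def wrap_int32(value: int) -> int:
    """Reinterpret an arbitrary Python int as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def nms_mask_size_int32(num_boxes: int) -> int:
    """Hypothetical mask size n * ceil(n / BLOCK), computed with int32 wraparound."""
    col_blocks = (num_boxes + BLOCK - 1) // BLOCK
    return wrap_int32(num_boxes * col_blocks)


if __name__ == "__main__":
    # A moderate box count stays positive; a very large one wraps negative.
    for n in (30_000, 400_000):
        print(n, nms_mask_size_int32(n))
```

Under that assumption, a very large number of candidate boxes reaching NMS in a single call would be enough to produce a negative size, which would match the shape of the error above.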

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tihhanovski

btw, I have a similar problem with a custom dataset on YOLOv8.
I got this error every time I used the Apple M1 GPU: yolo detect train data=tools_v2.yaml model=yolov8n.pt epochs=100 imgsz=640 device=mps
Training was successful when I did not use the GPU: yolo detect train data=tools_v2.yaml model=yolov8n.pt epochs=100 imgsz=640

@glenn-jocher
Member

@tihhanovski it seems like you're encountering an issue that might be related to the interaction between PyTorch's NMS implementation and the MPS backend on Apple's M1 GPU. The original error was reported with YOLOv5 on a different setup, and your similar error with YOLOv8 on MPS suggests that this could be a broader compatibility problem rather than something specific to one model.

Given the reference to pytorch/vision#1705 above, it's possible that the problem lies within the underlying library rather than YOLOv5 or YOLOv8 directly. However, using the latest versions of PyTorch and torchvision that support the MPS backend may resolve this issue. Apple's M1 GPUs have specific requirements, and compatibility is continually improving.

For now, as a workaround, training without the GPU on the M1 (as you've done successfully) is a valid approach, albeit slower. You might also consider running your training on a different machine with a more widely supported GPU architecture (e.g., NVIDIA's CUDA) if that's an option for you.
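
As a rough illustration of that workaround, the sketch below tries the MPS command first and, if the process exits with an error, reruns the same command without a device argument, as in the working command reported above. The helper names `build_train_command` and `train_with_fallback` are hypothetical and not part of any Ultralytics API; only the `yolo detect train` arguments are taken from this thread.

```python
import subprocess
from typing import List, Optional


def build_train_command(data: str, model: str, epochs: int, imgsz: int,
                        device: Optional[str] = None) -> List[str]:
    """Build the `yolo detect train` command; omit device to use the default."""
    cmd = ["yolo", "detect", "train", f"data={data}", f"model={model}",
           f"epochs={epochs}", f"imgsz={imgsz}"]
    if device is not None:
        cmd.append(f"device={device}")
    return cmd


def train_with_fallback(data: str, model: str, epochs: int, imgsz: int,
                        runner=subprocess.run) -> str:
    """Try MPS first; if the process exits non-zero, retry without a device."""
    for device in ("mps", None):
        result = runner(build_train_command(data, model, epochs, imgsz, device))
        if result.returncode == 0:
            return device or "default"
    raise RuntimeError("training failed on both MPS and the default device")
```

Calling `train_with_fallback("tools_v2.yaml", "yolov8n.pt", 100, 640)` would launch the real CLI. Note that the retry starts training from scratch, so this only saves a manual rerun, not compute time.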

We appreciate your patience and understanding as these compatibility issues are worked out. The rapid development of machine learning frameworks and hardware often leads to these kinds of challenges, but they are usually resolved with time as updates are released. Keep an eye on updates from PyTorch and torchvision that might address this issue more directly.

If you haven't already, please ensure your environment is up to date with the latest versions of all relevant libraries. If the problem persists, consider raising an issue on the PyTorch GitHub to bring more attention to MPS backend compatibility problems. Your detailed feedback can help the developers prioritize and address these issues more effectively.

Thank you for your contribution to the community by highlighting this issue. Your efforts help improve the tool for everyone. 🙏

@Chase-Nicholas

I had a similar issue while training on a single image. After I added more than one image, training on my M1 GPU worked.

@glenn-jocher
Member

Thank you for sharing your experience! It's interesting to hear that adding more images resolved the issue on your M1 GPU. This suggests that the problem might be related to how the MPS backend handles certain operations with very small datasets.

For anyone encountering similar issues, here are a few additional tips that might help:

  1. Update Your Environment: Ensure you are using the latest versions of PyTorch and torchvision, as updates often include bug fixes and improvements for compatibility with different hardware, including Apple's M1 GPUs.

  2. Batch Size and Dataset Size: As you've noted, increasing the number of images in your dataset can sometimes resolve unexpected issues. This might be due to how certain operations are optimized for larger batches or datasets.

  3. Alternative Backends: If you continue to experience issues with the MPS backend, consider using CPU for training, as it appears to work without issues. Alternatively, if you have access to a machine with an NVIDIA GPU, using CUDA is another reliable option.

  4. Community and Documentation: Keep an eye on the PyTorch GitHub issues and discussions for updates and potential fixes related to the MPS backend.
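
For the dataset-size tip, a quick pre-flight check can catch the single-image case reported above before training starts. This is a minimal sketch using only the standard library; the extension set and the default minimum of 2 images are assumptions based on that one report, not documented limits.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images(image_dir: str) -> int:
    """Count files with a known image extension under image_dir, recursively."""
    root = Path(image_dir)
    return sum(
        1 for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def check_min_images(image_dir: str, minimum: int = 2) -> bool:
    """Return True when the directory holds at least `minimum` images."""
    return count_images(image_dir) >= minimum
```

Running `check_min_images` on the training image folder before launching a run gives an early warning for very small datasets.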

Here's the command to upgrade to the latest versions of PyTorch and torchvision:

$ pip install --upgrade torch torchvision
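
To confirm what is actually installed after upgrading, something like the following can be run in the same environment. It is a small sketch that relies on the standard-library `importlib.metadata` module (Python 3.8+); the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(
    packages: Sequence[str] = ("torch", "torchvision"),
) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes it easier to tell whether an error comes from an outdated build.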

We appreciate your patience and contributions to improving the YOLOv5 experience for everyone. If you encounter further issues or have more insights to share, please feel free to continue the discussion here. Your feedback is invaluable to the community! 😊

Thank you again, and happy training! 🚀

5 participants