This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

nvidia-caffe and nvidia-digits docker support for cuda8.0? #209

Closed
kertansul opened this issue Sep 29, 2016 · 7 comments

Comments

@kertansul

Hi, I'm using a GTX 1080 with nvidia-docker/digits and get the following error when running AlexNet:

relu2 needs backward computation.
conv2 needs backward computation.
pool1 needs backward computation.
norm1 needs backward computation.
relu1 needs backward computation.
conv1 needs backward computation.
label_val-data_1_split does not need backward computation.
val-data does not need backward computation.
This network produces output accuracy
This network produces output loss
Network initialization done.
Solver scaffolding done.
Starting Optimization
Solving
Learning Rate Policy: step
Iteration 0, Testing net (#0)
Ignoring source layer train-data
Test net output #0: accuracy = 0.0999041
Test net output #1: loss = 2.30515 (* 1 = 2.30515 loss)
Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE

I checked the NVIDIA/DIGITS GitHub issues, and the error seems to be related to CUDA 7.5:
NVIDIA/DIGITS#925
However, I want to keep using containers for my deep learning frameworks.

Will NVIDIA update the Docker images for CUDA 8.0?
Or how can I build the nvidia-caffe and nvidia-digits Dockerfiles for CUDA 8.0?

digits

@3XX0
Member

3XX0 commented Sep 29, 2016

We will provide new CUDA 8.0 images eventually. In the meantime, see this comment

@kertansul
Author

@3XX0 Thanks! I missed that thread when searching.

So once I've built nvidia/caffe with CUDA 8.0, how should I adapt nvidia/digits?

@3XX0
Member

3XX0 commented Sep 29, 2016

Once you have the caffe image, the only thing you need to do is rebuild the digits one. You can change the FROM directive to point to your local caffe image.

If you already tagged it with the same name (i.e. caffe:0.15), then you can rebuild digits directly with make -C ubuntu-14.04/digits 4.0
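
A minimal sketch of the FROM change, assuming the self-built image carries a distinct tag so a later rebuild of caffe:0.15 cannot replace it. The tag caffe:0.15-cuda8.0 is hypothetical, and the two-line Dockerfile is only a stand-in so the snippet runs anywhere.

```shell
#!/bin/sh
set -eu

# Hypothetical tag for the self-built CUDA 8.0 caffe image.
local_caffe_tag="caffe:0.15-cuda8.0"

# Stand-in for ubuntu-14.04/digits/4.0/Dockerfile (contents are an assumption).
workdir=$(mktemp -d)
printf 'FROM caffe:0.15\nENV DIGITS_PKG_VERSION 4.0.0-1\n' > "$workdir/Dockerfile"

# Rewrite only the FROM line; sed keeps the original as Dockerfile.bak.
sed -i.bak "s|^FROM .*|FROM ${local_caffe_tag}|" "$workdir/Dockerfile"

# Show the resulting base image line.
from_line=$(grep '^FROM ' "$workdir/Dockerfile")
echo "$from_line"
```

Against a real checkout, the same sed call would target ubuntu-14.04/digits/4.0/Dockerfile, the path mentioned later in this thread.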

@kertansul
Author

@3XX0 I'm stuck on an error while building nvidia-docker/ubuntu-14.04/digits/4.0/Dockerfile:

Step 6 : RUN apt-get update && apt-get install -y --no-install-recommends --force-yes torch7-nv=0.9.99-1+cuda8.0 graphviz gcc libhdf5-dev digits=$DIGITS_PKG_VERSION && rm -rf /var/lib/apt/lists/*

A few lines later:

Fetched 22.2 MB in 29s (756 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package torch7-nv
E: Unable to locate package digits

I've tried:

  1. Rebuilding caffe as described in issue #208, tagging it caffe:0.15, then rebuilding digits with make -C ubuntu-14.04/digits 4.0
    => Success, but the rebuild seems to replace my self-built caffe:0.15 (CUDA 8.0) with the original caffe:0.15 (CUDA 7.5). I tested AlexNet on CIFAR-10 and still hit the same error.

  2. To prevent the image from being replaced, I modified nvidia-docker/mk/caffe.mk:
    under 0.15: changed 7.5-cudnn5-runtime to 8.0-cudnn5-runtime
    under 0.15: commented out $(NV_DOCKER) build -t caffe:$@ $(CURDIR)/$@
    and then ran make -C ubuntu-14.04/digits 4.0. This generated cuda images tagged 8.0-runtime and 8.0-cudnn5-runtime, but the build got stuck at "Unable to locate package".

I also tried the original "torch7-nv=0.9.99-1+cuda7.5", but nothing changed.
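
The prerequisite change from step 2 can be sketched as a one-line sed edit. The stand-in file below holds only the two caffe.mk lines quoted in this thread; the real file has more content and may differ. The sketch covers the makefile edit only, not any matching change to the caffe Dockerfile.

```shell
#!/bin/sh
set -eu

# Stand-in for nvidia-docker/mk/caffe.mk with just the two quoted lines.
# Single quotes keep $(NV_DOCKER) and $@ literal; printf expands \t and \n.
workdir=$(mktemp -d)
printf '0.15: 7.5-cudnn5-runtime\n\t$(NV_DOCKER) build -t caffe:$@ $(CURDIR)/$@\n' \
  > "$workdir/caffe.mk"

# Point the 0.15 target at the CUDA 8.0 base image; the recipe line is untouched.
sed -i.bak 's/^0\.15: 7\.5-cudnn5-runtime$/0.15: 8.0-cudnn5-runtime/' "$workdir/caffe.mk"

# Show the edited target line.
target_line=$(grep '^0\.15:' "$workdir/caffe.mk")
echo "$target_line"
```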

@flx42
Member

flx42 commented Oct 3, 2016

@kertansul you need to add this line: https://github.com/NVIDIA/nvidia-docker/blob/master/ubuntu-14.04/cuda/7.5/runtime/cudnn5/Dockerfile#L4
This is the repository containing the torch7-nv and digits packages.

Be careful, though: if you install torch7-nv from this repo, you will get the CUDA 7.5 version.
For DIGITS it doesn't matter.

@kertansul
Author

@flx42 Hi, I added the line before ENV DIGITS_PKG_VERSION 4.0.0-1 and ran into two errors.

Error 1: NO_PUBKEY F60F4B3D7FA2AF80
Solved by adding RUN wget -qO - http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/7fa2af80.pub | sudo apt-key add - based on this reference

Error 2:
The following packages have unmet dependencies:
 digits : Depends: python-caffe-nv (>= 0.13) but it is not going to be installed
          Depends: caffe-nv (>= 0.13) but it is not going to be installed
 torch7-nv : Depends: cuda-cudart-7-5 but it is not installable
             Depends: cuda-curand-7-5 but it is not installable
             Depends: cuda-cublas-7-5 but it is not installable
             Depends: cuda-ld-conf-7-5 but it is not going to be installed
             Depends: cuda-license-7-5 but it is not installable
             Depends: libnccl1 (>= 1.1.1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

I'm guessing this happens because I'm mixing CUDA 8.0 and CUDA 7.5.
I tried adding apt-get install cuda, but that results in "Unable to locate package".
What am I missing?
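
One rough way to check that guess is to pull the package names out of the "Depends:" lines and keep those carrying the CUDA 7.5 suffix. The sketch below works on the dependency lines copied from the apt-get output in this comment; the variable names are illustrative only.

```shell
#!/bin/sh
set -eu

# Dependency lines copied from the apt-get output in this comment.
apt_output='digits : Depends: python-caffe-nv (>= 0.13) but it is not going to be installed
Depends: caffe-nv (>= 0.13) but it is not going to be installed
torch7-nv : Depends: cuda-cudart-7-5 but it is not installable
Depends: cuda-curand-7-5 but it is not installable
Depends: cuda-cublas-7-5 but it is not installable
Depends: cuda-ld-conf-7-5 but it is not going to be installed
Depends: cuda-license-7-5 but it is not installable
Depends: libnccl1 (>= 1.1.1) but it is not going to be installed'

# Extract the package name after each "Depends:" and keep only the names
# that end in the CUDA 7.5 suffix "-7-5".
cuda75_deps=$(printf '%s\n' "$apt_output" \
  | sed -n 's/.*Depends: \([^ ]*\).*/\1/p' \
  | grep -- '-7-5$')

printf '%s\n' "$cuda75_deps"
```

Every package this prints is tied to CUDA 7.5, which is consistent with the earlier warning that torch7-nv from this repo is the CUDA 7.5 build.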

@flx42
Member

flx42 commented Feb 6, 2017

Please test with the new images; they support CUDA 8.0 now. However, we don't have DIGITS 5.0 yet.

@flx42 flx42 closed this as completed Feb 6, 2017