nvidia-caffe and nvidia-digits docker support for cuda8.0? #209

kertansul · 2016-09-29T08:11:40Z

Hi, I'm using a GTX 1080 with nvidia-docker/digits and get the following error when running AlexNet:

relu2 needs backward computation.
conv2 needs backward computation.
pool1 needs backward computation.
norm1 needs backward computation.
relu1 needs backward computation.
conv1 needs backward computation.
label_val-data_1_split does not need backward computation.
val-data does not need backward computation.
This network produces output accuracy
This network produces output loss
Network initialization done.
Solver scaffolding done.
Starting Optimization
Solving
Learning Rate Policy: step
Iteration 0, Testing net (#0)
Ignoring source layer train-data
Test net output #0: accuracy = 0.0999041
Test net output #1: loss = 2.30515 (* 1 = 2.30515 loss)
Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE

I checked the NVIDIA/DIGITS GitHub, and the error seems to be related to CUDA 7.5:
NVIDIA/DIGITS#925
However, I would still like to run the deep learning frameworks in containers.

Will NVIDIA update the Docker images for CUDA 8.0?
Or, how can I build the nvidia-caffe and nvidia-digits Dockerfiles for CUDA 8.0 myself?

3XX0 · 2016-09-29T08:58:07Z

We will provide new CUDA 8.0 images eventually. In the meantime, see this comment

kertansul · 2016-09-29T09:23:12Z

@3XX0 Thanks! I missed that thread when searching.

Once I have built nvidia/caffe with CUDA 8.0, how should I adapt nvidia/digits?

3XX0 · 2016-09-29T17:09:30Z

Once you have the caffe image, the only thing you need to do is rebuild the digits one. You can change the FROM directive to point to your local caffe image.

If you already tagged it with the same name (i.e. caffe:0.15), then you can rebuild digits directly with make -C ubuntu-14.04/digits 4.0
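The FROM change described above can be scripted. The following is a minimal sketch, not the project's own tooling: `retarget_from` is a hypothetical helper, and the tags `caffe:0.15-cuda8` and `digits:4.0-cuda8` in the usage comment are made-up names chosen to differ from the stock ones.

```sh
# Hypothetical helper: point a Dockerfile's first FROM directive at another base image.
# Usage: retarget_from <dockerfile> <new-base-image>
retarget_from() {
    dockerfile=$1
    new_base=$2
    tmp=$(mktemp) || return 1
    # Rewrite only the first line whose first field is FROM; copy all other lines as-is.
    awk -v base="$new_base" '
        !seen && $1 == "FROM" { print "FROM " base; seen = 1; next }
        { print }
    ' "$dockerfile" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place so the original file keeps its permissions.
    cat "$tmp" > "$dockerfile"
    rm -f "$tmp"
}

# Assumed usage (requires docker; the image tags are hypothetical):
#   retarget_from ubuntu-14.04/digits/4.0/Dockerfile caffe:0.15-cuda8
#   docker build -t digits:4.0-cuda8 ubuntu-14.04/digits/4.0
```

Building the digits directory directly with docker build, rather than through make, is a design choice of this sketch: it keeps the caffe make target out of the picture.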

kertansul · 2016-09-30T09:10:42Z

@3XX0 I'm stuck on an error when building nvidia-docker/ubuntu-14.04/digits/4.0/Dockerfile:

Step 6 : RUN apt-get update && apt-get install -y --no-install-recommends --force-yes torch7-nv=0.9.99-1+cuda8.0 graphviz gcc libhdf5-dev digits=$DIGITS_PKG_VERSION && rm -rf /var/lib/apt/lists/*

After a few more lines of output:

Fetched 22.2 MB in 29s (756 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package torch7-nv
E: Unable to locate package digits

I've tried:

1. Rebuilding caffe as described in issue #208, tagging it caffe:0.15, then rebuilding digits with make -C ubuntu-14.04/digits 4.0.
   => The build succeeds, but it seems to replace my self-built caffe:0.15 (CUDA 8.0) with the original caffe:0.15 (CUDA 7.5). Testing AlexNet on CIFAR-10 still hits the same error.
2. To prevent the image replacement, modifying nvidia-docker/mk/caffe.mk:
   - under 0.15: changing 7.5-cudnn5-runtime to 8.0-cudnn5-runtime
   - under 0.15: commenting out $(NV_DOCKER) build -t caffe:$@ $(CURDIR)/$@
   and then running make -C ubuntu-14.04/digits 4.0.
   => This generates cuda images tagged 8.0-runtime and 8.0-cudnn5-runtime, but the digits build then stops at "Unable to locate package".

I also tried the original version string "torch7-nv=0.9.99-1+cuda7.5", but nothing changed.
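One way to confirm which CUDA version a given caffe:0.15 image actually carries, before and after the digits rebuild, is to look at its environment. This is a rough sketch under an assumption: that the image inherits a CUDA_VERSION environment variable from its cuda base image. `cuda_version_from_env` is a hypothetical helper, not part of nvidia-docker.

```sh
# Hypothetical helper: read KEY=VALUE lines on stdin and print the value of
# CUDA_VERSION (prints nothing if the variable is absent).
cuda_version_from_env() {
    sed -n 's/^CUDA_VERSION=//p' | head -n 1
}

# Assumed usage (requires docker and a local image tagged caffe:0.15):
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' caffe:0.15 \
#       | cuda_version_from_env
```

If the value reported after running make starts with 7.5, that would support the suspicion that the self-built image was replaced.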

flx42 · 2016-10-03T22:28:47Z

@kertansul you need to add this line: https://github.com/NVIDIA/nvidia-docker/blob/master/ubuntu-14.04/cuda/7.5/runtime/cudnn5/Dockerfile#L4
This is the package containing torch7-nv and digits.

But be careful that if you install torch7-nv through this repo, you will get the CUDA 7.5 version.
For DIGITS it doesn't matter.

kertansul · 2016-10-05T07:16:17Z

@flx42 Hi, I added the line before ENV DIGITS_PKG_VERSION 4.0.0-1 and ran into two errors.

Error 1: NO_PUBKEY F60F4B3D7FA2AF80
Solved by adding RUN wget -qO - http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/7fa2af80.pub | sudo apt-key add - (based on this reference)

Error 2:
The following packages have unmet dependencies:
 digits : Depends: python-caffe-nv (>= 0.13) but it is not going to be installed
          Depends: caffe-nv (>= 0.13) but it is not going to be installed
 torch7-nv : Depends: cuda-cudart-7-5 but it is not installable
             Depends: cuda-curand-7-5 but it is not installable
             Depends: cuda-cublas-7-5 but it is not installable
             Depends: cuda-ld-conf-7-5 but it is not going to be installed
             Depends: cuda-license-7-5 but it is not installable
             Depends: libnccl1 (>= 1.1.1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

I'm guessing this happens because I'm mixing CUDA 8.0 and CUDA 7.5.
I tried adding apt-get install cuda, but that results in "Unable to locate package".
What am I missing?
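The apt output above already names the CUDA version the packages expect: the cuda-*-7-5 dependencies end in a version suffix. A small sketch for pulling that suffix out of a saved build log follows; `cuda_versions_in_deps` is a hypothetical helper and `build.log` in the usage comment is an assumed file name.

```sh
# Hypothetical helper: read apt's "unmet dependencies" text on stdin and print
# each distinct CUDA version suffix (such as 7-5) found in the cuda-* dependencies.
cuda_versions_in_deps() {
    sed -n 's/.*Depends: cuda-[a-z-]*-\([0-9][0-9]*-[0-9][0-9]*\) .*/\1/p' | sort -u
}

# Assumed usage, with the failed build output saved to build.log:
#   cuda_versions_in_deps < build.log
```

A suffix of 7-5 on an image built from a CUDA 8.0 base would be consistent with the earlier warning that torch7-nv from this repository is the CUDA 7.5 build.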

flx42 · 2017-02-06T19:07:22Z

Please test with the new images; they support CUDA 8.0 now. However, we don't have DIGITS 5.0 yet.

flx42 added the dockerfile label Oct 31, 2016

lukeyeager mentioned this issue Nov 18, 2016

Pascal boards: CURAND_STATUS_LAUNCH_FAILURE NVIDIA/caffe#270

Closed

flx42 closed this as completed Feb 6, 2017
