
beeva-poc-distributed-tensorflow

Proof of Concept with TensorFlow & multi-GPUs at BEEVA Research Lab

Experiment 2: multi-GPU

  • Training with TensorFlow on a single machine with multiple GPUs

  • Dataset: MNIST. 60,000 training samples, 10,000 test samples

  • Model: Simple Convnet (5 layers) inspired by LeNet

  • Based on Transparent multi-GPU training on TensorFlow with Keras; a custom fork implements the example and fixes TF 1.0 compatibility (see the sketch after this list)

  • Note 1: a first (failed) attempt used tf-slim. More info

  • Infrastructure 1: AWS p2.8xlarge (8 GPUs, NVIDIA Tesla K80), Deep Learning 2.0 AMI, Amazon Linux (CentOS), Keras==1.2.2, libcudnn.so.5

  • Infrastructure 2: Google n1-standard-16 with 2 GPUs (NVIDIA Tesla K80), Ubuntu 16.04 LTS, tensorflow-gpu==1.0.1, Keras==2.0.2 and 1.2.2, NVIDIA driver 375.39, libcudnn.so.5 (CuDNN 5.1)

  • Infrastructure 3: Google n1-highmem-32 with 8 GPUs (NVIDIA Tesla K80), Ubuntu 16.04 LTS, tensorflow-gpu==1.2.0, Keras==2.0.5, NVIDIA driver 375.39, libcudnn.so.5 (CuDNN 5.1)

  • Infrastructure 4: Azure NC24 with 4 GPUs (NVIDIA Tesla K80), mxnet==0.9.5 (mxnet-0.9.5-py3.5) and mxnet release 0.10.0.post2, NVIDIA driver 367.48, CUDA 8.0 (V8.0.61), libcudnn.so.5.1.10 (CuDNN 5.1)

  • Infrastructure 3* (3/08/2017): Google n1-highmem-32 with 8 GPUs (NVIDIA Tesla K80), Ubuntu 16.04 LTS, tensorflow-gpu==1.2.0, Keras==2.0.5, NVIDIA driver 375.66, libcudnn.so.5.1.10 (CuDNN 5.1)

  • Note 2: our goal was to compare AWS p2.8xlarge against 8 GPUs on GCE in terms of performance and price. In the end we only tested 2 GPUs on GCE, because the performance we measured fell well short of expectations. See detailed issues

  • Note 3: tests repeated on 21/06/2017 with Infrastructure 2, tensorflow-gpu==1.2.0 and keras==2.0.5, with very similar results
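
The keras-extras approach is plain data parallelism: each batch is split into equal slices, every slice runs through a replica of the model pinned to one GPU, and the per-GPU outputs are concatenated back into a single batch on the CPU. The following is an illustrative sketch of that idea in Keras 2 style, not the exact keras-extras code; the names make_parallel and slice_batch are ours:

# Illustrative sketch of transparent data-parallel training (Keras 2 on
# TF 1.x), not the exact keras-extras implementation.
import tensorflow as tf
from keras.layers import Lambda, concatenate
from keras.models import Model

def make_parallel(model, gpu_count):
    """Replicate a single-input/single-output model across gpu_count GPUs."""
    if gpu_count < 2:
        return model  # nothing to parallelize

    def slice_batch(x, n_gpus, part):
        # Keep the part-th of n_gpus equal slices along the batch axis,
        # so the global batch size must be divisible by n_gpus.
        size = tf.shape(x)[0] // n_gpus
        return x[part * size:(part + 1) * size]

    towers = []
    for g in range(gpu_count):
        with tf.device('/gpu:%d' % g):  # pin one replica per GPU
            sliced = Lambda(slice_batch,
                            arguments={'n_gpus': gpu_count, 'part': g})(
                model.inputs[0])
            towers.append(model(sliced))

    with tf.device('/cpu:0'):  # gather the per-GPU predictions on the CPU
        merged = concatenate(towers, axis=0)
    return Model(inputs=model.inputs, outputs=merged)

Because the weights are shared between towers, a normal compile/fit call trains all replicas at once; the batch size just has to be divisible by the GPU count.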

Deploy

[Only Google] Install CUDA, CuDNN, TensorFlow and Keras

# Install CUDA. Source: https://cloud.google.com/compute/docs/gpus/add-gpus#install-driver-script
sudo su
# Execute the following as root (the stock script from the Google docs)
echo "Checking for CUDA and installing."
# Check for CUDA and try to install.
if ! dpkg-query -W cuda; then
  # The 16.04 installer works with 16.10.
  curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  apt-get update
  apt-get install cuda -y
fi
exit
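
After the script finishes, a quick sanity check that the driver and toolkit came up (on these instances: Tesla K80s and CUDA 8.0):

nvidia-smi                            # should list one row per Tesla K80
/usr/local/cuda/bin/nvcc --version    # should report "release 8.0"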

# Install CuDNN. Source: https://askubuntu.com/questions/767269/how-can-i-install-cudnn-on-ubuntu-16-04
# Download cudnn-8.0-linux-x64-v5.1.tgz from https://developer.nvidia.com/rdp/cudnn-download,
# upload it to a Google Cloud Storage bucket, then fetch it on the instance:
gsutil cp -r gs://poc-tensorflow-gpus .
tar xvzf poc-tensorflow-gpus/cudnn-8.0-linux-x64-v5.1.tgz
cd cuda
sudo cp -P include/cudnn.h /usr/include
sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*
cd
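
Before installing TensorFlow, it is worth confirming that the dynamic loader can now resolve the library the experiments depend on (libcudnn.so.5):

sudo ldconfig                  # refresh the loader cache after the copies
ldconfig -p | grep libcudnn    # should list libcudnn.so.5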

# Install Tensorflow-gpu
sudo apt-get install python-pip python-dev python-virtualenv
wget https://pypi.python.org/packages/04/c4/ffb89dbea9e43e82665ff088fd08aa25aa93301aa8c480de278c8f576ea1/tensorflow_gpu-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl#md5=c06b11dee765a99b1814ca393aaf558a
pip install tensorflow_gpu-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl

# Install Keras
pip install keras==2.0.2
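
To confirm that this TensorFlow build actually loads CUDA/CuDNN and enumerates every GPU, a quick check from a Python shell is enough (device count depends on the instance):

# GPUs show up as /gpu:0, /gpu:1, ... next to the CPU device.
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    print('%s (%s)' % (d.name, d.device_type))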

[Optional] Modify the printed time format in Keras' generic_utils.py

sudo nano /usr/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/utils/generic_utils.py

Clone keras-extras

git clone https://github.com/beeva-enriqueotero/keras-extras

Run

Launch the multi-GPU experiment:

time python keras-extras/examples/mnist_cnn_multi.py --extras `pwd`/keras-extras/ --gpus 2
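
The --gpus flag only tells the script how many towers to build; which physical devices TensorFlow may claim is controlled with CUDA_VISIBLE_DEVICES as usual. The following is our assumption of how the 1-GPU and CPU-only (0 GPUs) rows of the results table can be reproduced; the exact commands were not recorded:

# Single-GPU baseline: expose only the first K80 to TensorFlow.
time CUDA_VISIBLE_DEVICES=0 python keras-extras/examples/mnist_cnn_multi.py --extras `pwd`/keras-extras/ --gpus 1

# CPU-only run: hide all GPUs; TensorFlow falls back to /cpu:0
# (soft placement may be needed depending on how the script pins devices).
time CUDA_VISIBLE_DEVICES= python keras-extras/examples/mnist_cnn_multi.py --extras `pwd`/keras-extras/ --gpus 1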

Results:

| Infrastructure | Batch size | GPUs | Accuracy (validation) | Epochs | Training time (s/epoch) |
| -------------- | ---------- | ---- | --------------------- | ------ | ----------------------- |
| 1  | 128  | 1 | 0.9884 | 12    | 6.8      |
| 1  | 128  | 2 | 0.9898 | 12    | 5.2      |
| 1  | 128  | 3 | error  | error | error    |
| 1  | 128  | 4 | 0.9891 | 12    | 4.9      |
| 1  | 128  | 8 | 0.9899 | 12    | 6.4      |
| 2  | 128  | 1 | 0.9892 | 12    | 7.1      |
| 2  | 128  | 2 | 0.9891 | 12    | 50.0±1.0 |
| 3  | 128  | 1 | 0.9901 | 12    | 56.5±0.5 |
| 3  | 128  | 2 | 0.9901 | 12    | 56.5±0.5 |
| 3  | 128  | 8 | 0.9895 | 12    | 75.0±8.0 |
| 4  | 128  | 1 | 0.9896 | 12    | 8.5±0.1  |
| 4  | 128  | 2 | 0.9899 | 12    | 12.0±1.0 |
| 4  | 128  | 4 | 0.9905 | 12    | 15.1±0.5 |
| 4  | 512  | 4 | 0.9872 | 12    | 4.9±0.5  |
| 4  | 1024 | 4 | 0.9852 | 12    | 3.6±0.3  |
| 4  | 2048 | 4 | 0.9788 | 12    | 3.0±0.3  |
| 3* | 1024 | 0 | 0.9851 | 12    | 26.9±0.1 |
| 3* | 1024 | 1 | 0.9844 | 12    | 9.2±0.2  |
| 3* | 1024 | 2 | 0.9851 | 12    | 8.5±0.2  |
| 3* | 1024 | 4 | 0.9851 | 12    | 8.1±0.2  |
| 3* | 1024 | 8 | 0.9848 | 12    | 8.2±0.2  |
| 3* | 128  | 0 | 0.9896 | 12    | 39.0±0.1 |
| 3* | 128  | 1 | 0.9896 | 12    | 62.9±0.5 |
| 3* | 128  | 2 | 0.9896 | 12    | 62.0±0.5 |
| 3* | 2048 | 1 | 0.9789 | 12    | 6.6±0.1  |
| 3* | 2048 | 2 | 0.9789 | 12    | 5.8±0.5  |
| 3* | 2048 | 4 | 0.9790 | 12    | 4.5±0.2  |
| 3* | 2048 | 8 | 0.9765 | 12    | 4.9±0.2  |
| 3* | 4096 | 8 | 0.9633 | 12    | 3.1±0.2  |

GPUs = 0 denotes a CPU-only run.
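
Reading the table as throughput makes the scaling problem explicit; for example, on infrastructure 4 at batch size 128, adding GPUs lowers images/second:

# Throughput derived from the table: 60000 training images per epoch
# divided by the measured s/epoch (infrastructure 4, batch size 128).
samples = 60000.0
t_1gpu, t_4gpu = 8.5, 15.1                         # s/epoch from the table

print('1 GPU : %.0f img/s' % (samples / t_1gpu))   # ~7059
print('4 GPUs: %.0f img/s' % (samples / t_4gpu))   # ~3974
print('speedup: %.2fx' % (t_1gpu / t_4gpu))        # ~0.56x, i.e. a slowdown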

Conclusions:

  • Due to technical implementation details, only even numbers of GPUs are allowed (the 3-GPU run in the table errors out)
  • CuDNN is mandatory to run the experiments
  • The Google Compute Engine documentation about attaching GPUs to instances doesn't mention CuDNN
  • Both AWS and Google GPUs show a PCIe host bridge (PHB) topology (see the topology check after this list)
  • Azure NC24 GPUs show socket-level (SOC) connections
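
The PHB and SOC labels in the last two points come from NVIDIA's GPU topology matrix; they can be reproduced on each machine with:

# Prints the pairwise interconnect type between all GPUs. In nvidia-smi's
# legend, PHB = path traverses a PCIe host bridge, SOC = path also crosses
# the socket-level link between CPUs.
nvidia-smi topo -m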

Issues:

  • GCE instances with attached GPUs didn't support GPUDirect peer-to-peer memory access. Now they do! Tested on 3/08/2017. Output of the CUDA simpleP2P sample on a 2-GPU instance before the fix:

    labs@instance-1:/usr/local/cuda-8.0/samples/0_Simple/simpleP2P$ ./simpleP2P
    [./simpleP2P] - Starting...
    Checking for multiple GPUs...
    CUDA-capable device count: 2
    > GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
    > GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
    Checking GPU(s) for support of peer to peer memory access...
    > Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : No
    > Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : No
    Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
    Peer to Peer access is not available amongst GPUs in the system, waiving test.
  • Google GPUs report plain PCI capabilities instead of PCI Express:
sudo lshw -C "display" | grep capabilities
# GCE output:
# capabilities: msi bus_master cap_list
# AWS EC2 p2.8x output:
# capabilities: pm msi pciexpress bus_master cap_list
  • TensorFlow on Google GPUs can't use DMA between devices (see the log check below)

DMA issue
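
This is visible in TensorFlow's startup log: TF 1.x prints a "Device peer to peer matrix" followed by a DMA table when the first session is created, and an N between two GPUs means every GPU-to-GPU copy is staged through host memory. A minimal way to surface that log:

# Creating the first session makes TF 1.x log its device setup, including
# the peer-to-peer/DMA matrix, to stderr.
import tensorflow as tf

sess = tf.Session()   # the DMA matrix appears in the log emitted here
sess.close()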