hma02@SOE-IBM2 (master *) examples $ python3 test_bsp.py
Theano-MPI started 2 workers for
 1.updating ResNet50 params through iterations and
 2.exchange the params with BSP(cdd,nccl32)
See output log.
Using cuDNN version 5110 on context None
Using cuDNN version 5110 on context None
Mapped name None to device cuda0: Tesla P100-SXM2-16GB (0002:01:00.0)
Mapped name None to device cuda1: Tesla P100-SXM2-16GB (0003:01:00.0)
rank0: bad list is [], extended to 10008
rank0: bad list is [], extended to 390
Total number of layers: 176
Total number of layers: 176
[64 3 7 7] [64] [64] [64]
[256 64 1 1] [256] [256]
[64 64 1 1] [64] [64]
[64 64 3 3] [64] [64]
[256 64 1 1] [256] [256]
[ 64 256 1 1] [64] [64]
[64 64 3 3] [64] [64]
[256 64 1 1] [256] [256]
[ 64 256 1 1] [64] [64]
[64 64 3 3] [64] [64]
[256 64 1 1] [256] [256]
[512 256 1 1] [512] [512]
[128 256 1 1] [128] [128]
[128 128 3 3] [128] [128]
[512 128 1 1] [512] [512]
[128 512 1 1] [128] [128]
[128 128 3 3] [128] [128]
[512 128 1 1] [512] [512]
[128 512 1 1] [128] [128]
[128 128 3 3] [128] [128]
[512 128 1 1] [512] [512]
[128 512 1 1] [128] [128]
[128 128 3 3] [128] [128]
[512 128 1 1] [512] [512]
[1024 512 1 1] [1024] [1024]
[256 512 1 1] [256] [256]
[256 256 3 3] [256] [256]
[1024 256 1 1] [1024] [1024]
[ 256 1024 1 1] [256] [256]
[256 256 3 3] [256] [256]
[1024 256 1 1] [1024] [1024]
[ 256 1024 1 1] [256] [256]
[256 256 3 3] [256] [256]
[1024 256 1 1] [1024] [1024]
[ 256 1024 1 1] [256] [256]
[256 256 3 3] [256] [256]
[1024 256 1 1] [1024] [1024]
[ 256 1024 1 1] [256] [256]
[256 256 3 3] [256] [256]
[1024 256 1 1] [1024] [1024]
[ 256 1024 1 1] [256] [256]
[256 256 3 3] [256] [256]
[1024 256 1 1] [1024] [1024]
[2048 1024 1 1] [2048] [2048]
[ 512 1024 1 1] [512] [512]
[512 512 3 3] [512] [512]
[2048 512 1 1] [2048] [2048]
[ 512 2048 1 1] [512] [512]
[512 512 3 3] [512] [512]
[2048 512 1 1] [2048] [2048]
[ 512 2048 1 1] [512] [512]
[512 512 3 3] [512] [512]
[2048 512 1 1] [2048] [2048]
[2048 1000] [1000]
model size 24.373 M floats
loading 94931 started
loading 95301 started
compiling training function...
INFO (theano.gof.compilelock): Waiting for existing lock by process '90598' (I am process '90597')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-4.4--generic-ppc64le-with-Ubuntu-16.04-xenial-ppc64le-3.5.2-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '90598' (I am process '90597')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-4.4--generic-ppc64le-with-Ubuntu-16.04-xenial-ppc64le-3.5.2-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '90598' (I am process '90597')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-4.4--generic-ppc64le-with-Ubuntu-16.04-xenial-ppc64le-3.5.2-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '90598' (I am process '90597')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-4.4--generic-ppc64le-with-Ubuntu-16.04-xenial-ppc64le-3.5.2-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '90598' (I am process '90597')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-4.4--generic-ppc64le-with-Ubuntu-16.04-xenial-ppc64le-3.5.2-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '90597' (I am process '90598')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-4.4--generic-ppc64le-with-Ubuntu-16.04-xenial-ppc64le-3.5.2-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '90598' (I am process '90597')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-4.4--generic-ppc64le-with-Ubuntu-16.04-xenial-ppc64le-3.5.2-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '90598' (I am process '90597')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-4.4--generic-ppc64le-with-Ubuntu-16.04-xenial-ppc64le-3.5.2-64/lock_dir
compiling validation function...
Compile time: 354.818 s
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:127: RuntimeWarning: overflow encountered in add
  a = a + a
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:129: RuntimeWarning: invalid value encountered in subtract
  temp1 = temp - a
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:138: RuntimeWarning: invalid value encountered in subtract
  itemp = int_conv(temp-a)
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:162: RuntimeWarning: overflow encountered in add
  a = a + a
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:164: RuntimeWarning: invalid value encountered in subtract
  temp1 = temp - a
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:171: RuntimeWarning: invalid value encountered in subtract
  if any(temp-a != zero):
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:127: RuntimeWarning: overflow encountered in add
  a = a + a
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:129: RuntimeWarning: invalid value encountered in subtract
  temp1 = temp - a
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:138: RuntimeWarning: invalid value encountered in subtract
  itemp = int_conv(temp-a)
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:162: RuntimeWarning: overflow encountered in add
  a = a + a
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:164: RuntimeWarning: invalid value encountered in subtract
  temp1 = temp - a
/usr/local/lib/python3.5/dist-packages/numpy/core/machar.py:171: RuntimeWarning: invalid value encountered in subtract
  if any(temp-a != zero):
Traceback (most recent call last):
  File "/export/mlrg/hma02/.local_minsky/lib/python3.5/site-packages/hickle-3.0.0-py3.5.egg/hickle/hickle.py", line 483, in load
    assert 'CLASS' in h5f.attrs.keys()
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/mlrg/hma02/.local_minsky/lib/python3.5/site-packages/Theano_MPI-1.0-py3.5.egg/theanompi/models/data/proc_load_mpi.py", line 96, in
    arr = hkl.load(str(filename)).astype('float32')
  File "/export/mlrg/hma02/.local_minsky/lib/python3.5/site-packages/hickle-3.0.0-py3.5.egg/hickle/hickle.py", line 490, in load
    import hickle_legacy
  File "/export/mlrg/hma02/.local/lib/python3.5/site-packages/hickle_legacy.py", line 39
    print "Error: cannot open file. Please pass either a filename string, a file object, or a h5py.File"
    ^
SyntaxError: Missing parentheses in call to 'print'
Traceback (most recent call last):
  File "/export/mlrg/hma02/.local_minsky/lib/python3.5/site-packages/hickle-3.0.0-py3.5.egg/hickle/hickle.py", line 483, in load
    assert 'CLASS' in h5f.attrs.keys()
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/mlrg/hma02/.local_minsky/lib/python3.5/site-packages/Theano_MPI-1.0-py3.5.egg/theanompi/models/data/proc_load_mpi.py", line 96, in
    arr = hkl.load(str(filename)).astype('float32')
  File "/export/mlrg/hma02/.local_minsky/lib/python3.5/site-packages/hickle-3.0.0-py3.5.egg/hickle/hickle.py", line 490, in load
    import hickle_legacy
  File "/export/mlrg/hma02/.local/lib/python3.5/site-packages/hickle_legacy.py", line 39
    print "Error: cannot open file. Please pass either a filename string, a file object, or a h5py.File"
    ^
SyntaxError: Missing parentheses in call to 'print'
-------------------------------------------------------
Child job 3 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[35953,3],0]
  Exit code: 1
--------------------------------------------------------------------------
Rule session 90353 terminated with return code: 1.