[BUG] OOM on memory pool expansion #496

Closed
jakirkham opened this issue Aug 13, 2020 · 2 comments · Fixed by #498
Labels: ? - Needs Triage, bug

Comments

@jakirkham
Member

Describe the bug

When the new memory pool expands, it appears that it may be expanding by too much. As a result, one can run into an OOM error. In this case the GPU has 32510MiB available, yet making two 8 GiB allocations results in an OOM error on the second allocation.
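
The figures in this report can be sanity-checked with a little arithmetic. The sketch below is not RMM code and says nothing about how RMM actually sizes a pool expansion. It assumes the nvidia-smi snapshot in the environment details (16611MiB used of 32510MiB on GPU 0) comes from the session that hit the error. The helper name `growth_fits` and the variable names are illustrative only.

```python
MIB = 2**20

# Figures copied from this report (GPU 0 in the nvidia-smi output).
total_mib = 32510
used_mib = 16611
request_bytes = 8 * 2**30            # size passed to each rmm.DeviceBuffer

request_mib = request_bytes // MIB   # one allocation, in MiB
free_mib = total_mib - used_mib      # device memory not yet in use


def growth_fits(growth_mib, free_mib):
    """Return True if the device could still supply growth_mib."""
    return growth_mib <= free_mib


print(request_mib, free_mib)
# Growing by exactly the size of the failed request would fit ...
print(growth_fits(request_mib, free_mib))
# ... while a (hypothetical) growth of twice that size would not.
print(growth_fits(2 * request_mib, free_mib))
```

Under these assumptions the second 8 GiB request is smaller than the memory still free on the device, which is consistent with the suspicion that the expansion asked for more than the request needed.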

Steps/Code to reproduce bug

In [1]: import rmm

In [2]: rmm.reinitialize(pool_allocator=True)

In [3]: db1 = rmm.DeviceBuffer(size=(8 * 2**30))

In [4]: db2 = rmm.DeviceBuffer(size=(8 * 2**30))
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-4-25bed4057da3> in <module>
----> 1 db2 = rmm.DeviceBuffer(size=(8 * 2**30))

rmm/_lib/device_buffer.pyx in rmm._lib.device_buffer.DeviceBuffer.__cinit__()

MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp68: cudaErrorMemoryAllocation out of memory

Expected behavior

The pool would expand such that the next allocation would fit without causing an OOM error.
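
One way to picture this expectation is a growth step that is never smaller than the triggering request and never larger than what the device can still provide. The function below is a minimal sketch of that idea in plain Python, not RMM's implementation and not the change made in the fix. The name `clamped_growth`, its parameters, and the "desired" growth value of 16611 MiB are all hypothetical; the request size and free memory are taken from this report (8 * 2**30 bytes, and 32510MiB - 16611MiB).

```python
def clamped_growth(desired_mib, request_mib, device_free_mib):
    """Return a pool growth step in MiB, or raise MemoryError.

    desired_mib:      growth a pool heuristic would like (hypothetical input)
    request_mib:      size of the allocation that triggered the growth
    device_free_mib:  memory the device can still provide
    """
    if request_mib > device_free_mib:
        # Nothing can help here: the request itself does not fit.
        raise MemoryError("request exceeds free device memory")
    # Grow by at least the request, but never by more than the device has.
    return min(max(desired_mib, request_mib), device_free_mib)


# An 8192 MiB request with 15899 MiB free; the desired growth is illustrative.
step = clamped_growth(desired_mib=16611, request_mib=8192, device_free_mib=15899)
print(step)
```

With these inputs the step is capped at the free device memory, so the request can be served instead of failing outright.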

Environment details (please complete the following information):

  • Environment location: Bare-metal (DGX-1)
  • Method of RMM install: Conda
  • Please run and attach the output of the rmm/print_env.sh script to gather relevant environment details
$ bash print_env.sh 
***git***
commit cfbb2975f96bced32ad9cd2d8e6cfb7bb00701f1 (HEAD -> branch-0.15, rapidsai/branch-0.15)
Merge: b88ad8b f623d2c
Author: Keith Kraus <[email protected]>
Date:   Thu Aug 13 13:25:37 2020 -0400

    Merge pull request #493 from kkraus14/remove_cuda_init_hack
    
    [REVIEW] Remove cuda init hack

***OS Information***
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2020-03-04"
DGX_SWBUILD_VERSION="4.4.0"
DGX_COMMIT_ID="ee09ebc"
DGX_PLATFORM="DGX Server for DGX-1"
DGX_SERIAL_NUMBER="QTFCOU822000C"
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Linux dgx15 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
Thu Aug 13 16:06:37 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    56W / 300W |  16611MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   30C    P0    42W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   31C    P0    43W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   29C    P0    41W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     26554      C   ...m/miniconda/envs/rapids15dev/bin/python 16599MiB |
+-----------------------------------------------------------------------------+

***CPU***
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             3235.326
CPU max MHz:         3600.0000
CPU min MHz:         1200.0000
BogoMIPS:            4390.49
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            51200K
NUMA node0 CPU(s):   0-19,40-59
NUMA node1 CPU(s):   20-39,60-79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

***CMake***
/datasets/jkirkham/miniconda/envs/rapids15dev/bin/cmake
cmake version 3.18.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

***g++***
/datasets/jkirkham/miniconda/envs/rapids15dev/bin/g++
g++ (crosstool-NG 1.24.0.123_1667d2b) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

***Python***
/datasets/jkirkham/miniconda/envs/rapids15dev/bin/python
Python 3.8.5

***Environment Variables***
PATH                            : /datasets/jkirkham/miniconda/envs/rapids15dev/bin:/datasets/jkirkham/miniconda/condabin:/usr/local/cuda/bin:/opt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin
LD_LIBRARY_PATH                 : 
NUMBAPRO_NVVM                   : 
NUMBAPRO_LIBDEVICE              : 
CONDA_PREFIX                    : /datasets/jkirkham/miniconda/envs/rapids15dev
PYTHON_PATH                     : 

***conda packages***
/datasets/jkirkham/miniconda/condabin/conda
# packages in environment at /datasets/jkirkham/miniconda/envs/rapids15dev:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                      1_llvm    conda-forge
abseil-cpp                20200225.2           he1b5a44_2    conda-forge
aiohttp                   3.6.2            py38h516909a_0    conda-forge
appdirs                   1.4.3                      py_1    conda-forge
argon2-cffi               20.1.0           py38h1e0a361_1    conda-forge
arrow-cpp                 0.17.1          py38h1234567_11_cuda    conda-forge
arrow-cpp-proc            1.0.0                      cuda    conda-forge
asciitree                 0.3.3                      py_2    conda-forge
async-timeout             3.0.1                   py_1000    conda-forge
attrs                     19.3.0                     py_0    conda-forge
autoconf                  2.69            pl526h14c3975_9    conda-forge
automake                  1.16.2                  pl526_1    conda-forge
aws-sdk-cpp               1.7.164              hba45d7a_2    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.1                      py_0    conda-forge
binutils                  2.34                 h2122c62_9    conda-forge
binutils_impl_linux-64    2.34                 h53a641e_7    conda-forge
binutils_linux-64         2.34                hc952b39_20    conda-forge
black                     19.10b0                    py_4    conda-forge
blas                      2.16                        mkl    conda-forge
bleach                    3.1.5              pyh9f0ad1d_0    conda-forge
blosc                     1.20.0               he1b5a44_0    conda-forge
bokeh                     2.1.1            py38h32f6830_0    conda-forge
boost                     1.72.0           py38h9de70de_0    conda-forge
boost-cpp                 1.72.0               h8e57a91_0    conda-forge
brotli                    1.0.7             he1b5a44_1004    conda-forge
brotlipy                  0.7.0           py38h1e0a361_1000    conda-forge
brunsli                   0.1                  he1b5a44_0    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
c-ares                    1.16.1               h516909a_0    conda-forge
c-compiler                1.1.1                h516909a_0    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
cairo                     1.16.0            hcf35c78_1003    conda-forge
certifi                   2020.6.20        py38h32f6830_0    conda-forge
cffi                      1.14.1           py38h5bae8af_0    conda-forge
cfitsio                   3.470                hce51eda_6    conda-forge
chardet                   3.0.4           py38h32f6830_1006    conda-forge
charls                    2.1.0                he1b5a44_2    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
click-plugins             1.1.1                      py_0    conda-forge
cligj                     0.5.0                      py_0    conda-forge
cloudpickle               1.5.0                      py_0    conda-forge
cmake                     3.18.0               h5c55442_0    conda-forge
colorcet                  2.0.1                      py_0    conda-forge
compilers                 1.1.1                         0    conda-forge
cryptography              3.0              py38h766eaa4_0    conda-forge
cudatoolkit               10.2.89              hfd86e86_1    defaults
cudf                      0.15.0a200813   py38_gf836f8ff9_4521    rapidsai-nightly
cudf_kafka                0.15.0a200813   py38_gf836f8ff9_4521    rapidsai-nightly
cudnn                     7.6.5                cuda10.2_0    defaults
cugraph                   0.15.0a200813   py38_g01132a48_895    rapidsai-nightly
cuml                      0.15.0a200812   cuda10.2_py38_g141d7c981_1907    rapidsai-nightly
cupy                      7.7.0            py38hb1193b0_0    conda-forge
curl                      7.71.1               he644dc0_4    conda-forge
cusignal                  0.15.0a200813   py38_g84cef35_416    rapidsai-nightly
cuspatial                 0.15.0a200813   py38_g1992b3c_260    rapidsai-nightly
custreamz                 0.15.0a200813   py38_gf836f8ff9_4521    rapidsai-nightly
cuxfilter                 0.15.0a200813   py38_g55fc7cb_201    rapidsai-nightly
cxx-compiler              1.1.1                hc9558a2_0    conda-forge
cycler                    0.10.0                     py_2    conda-forge
cyrus-sasl                2.1.27               h063b49f_1    conda-forge
cython                    0.29.21          py38h950e882_0    conda-forge
cytoolz                   0.10.1           py38h516909a_0    conda-forge
dask                      2.22.0                     py_0    conda-forge
dask-core                 2.22.0                     py_0    conda-forge
dask-cuda                 0.15.0a200813          py38_105    rapidsai-nightly
dask-cudf                 0.15.0a200813   py38_gf836f8ff9_4521    rapidsai-nightly
dask-image                0.3.0              pyh9f0ad1d_0    conda-forge
dask-labextension         3.0.0                      py_0    conda-forge
dask-xgboost              0.2.0.dev28      cuda10.2py38_0    rapidsai-nightly
datashader                0.10.0                     py_0    conda-forge
datashape                 0.5.4                      py_1    conda-forge
decorator                 4.4.2                      py_0    conda-forge
defusedxml                0.6.0                      py_0    conda-forge
distributed               2.22.0           py38h32f6830_0    conda-forge
dlpack                    0.3                  he1b5a44_1    conda-forge
double-conversion         3.1.5                he1b5a44_2    conda-forge
entrypoints               0.3             py38h32f6830_1001    conda-forge
expat                     2.2.9                he1b5a44_2    conda-forge
faiss-proc                1.0.0                      cuda    rapidsai-nightly
fastavro                  0.24.0           py38h1e0a361_0    conda-forge
fasteners                 0.14.1                     py_3    conda-forge
fastrlock                 0.5              py38h950e882_0    conda-forge
fiona                     1.8.13           py38h033e0f6_1    conda-forge
flake8                    3.8.3                      py_1    conda-forge
fontconfig                2.13.1            h86ecdb6_1001    conda-forge
fortran-compiler          1.1.1                he991be0_0    conda-forge
freetype                  2.10.2               he06d7ca_0    conda-forge
freexl                    1.0.5             h516909a_1002    conda-forge
fsspec                    0.8.0                      py_0    conda-forge
gcc_impl_linux-64         7.5.0                hd420e75_6    conda-forge
gcc_linux-64              7.5.0               h09487f9_20    conda-forge
gdal                      3.0.4           py38h172510d_10    conda-forge
geopandas                 0.8.1                      py_0    conda-forge
geos                      3.8.1                he1b5a44_0    conda-forge
geotiff                   1.6.0                h05acad5_0    conda-forge
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
gflags                    2.2.2             he1b5a44_1004    conda-forge
gfortran_impl_linux-64    7.5.0                hdf63c60_6    conda-forge
gfortran_linux-64         7.5.0               h09487f9_20    conda-forge
giflib                    5.2.1                h516909a_2    conda-forge
glib                      2.65.0               h6f030ca_0    conda-forge
glog                      0.4.0                h49b9bf7_3    conda-forge
grpc-cpp                  1.30.2               heedbac9_0    conda-forge
gxx_impl_linux-64         7.5.0                hdf63c60_6    conda-forge
gxx_linux-64              7.5.0               h09487f9_20    conda-forge
hdf4                      4.2.13            hf30be14_1003    conda-forge
hdf5                      1.10.6          nompi_h3c11f04_101    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
icu                       64.2                 he1b5a44_1    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
imagecodecs               2020.5.30        py38h36e1e94_2    conda-forge
imageio                   2.9.0                      py_0    conda-forge
importlib-metadata        1.7.0            py38h32f6830_0    conda-forge
importlib_metadata        1.7.0                         0    conda-forge
iniconfig                 1.0.1              pyh9f0ad1d_0    conda-forge
ipykernel                 5.3.4            py38h23f93f0_0    conda-forge
ipympl                    0.5.7              pyh9f0ad1d_1    conda-forge
ipython                   7.17.0           py38h1cdfbd6_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
ipytree                   0.1.8                      py_0    conda-forge
ipywidgets                7.5.1                      py_0    conda-forge
isort                     5.4.1            py38h32f6830_0    conda-forge
jedi                      0.17.2           py38h32f6830_0    conda-forge
jinja2                    2.11.2             pyh9f0ad1d_0    conda-forge
joblib                    0.16.0                     py_0    conda-forge
jpeg                      9d                   h516909a_0    conda-forge
json-c                    0.13.1            hbfbb72e_1002    conda-forge
json5                     0.9.4              pyh9f0ad1d_0    conda-forge
jsonschema                3.2.0            py38h32f6830_1    conda-forge
jupyter-server-proxy      1.5.0                      py_0    conda-forge
jupyter_client            6.1.6                      py_0    conda-forge
jupyter_core              4.6.3            py38h32f6830_1    conda-forge
jupyterlab                2.2.4                      py_0    conda-forge
jupyterlab_server         1.2.0                      py_0    conda-forge
jxrlib                    1.1                  h516909a_2    conda-forge
kealib                    1.4.13               h33137a7_1    conda-forge
kiwisolver                1.2.0            py38hbf85e49_0    conda-forge
krb5                      1.17.1               hfafb76e_2    conda-forge
lcms2                     2.11                 hbd6801e_0    conda-forge
ld_impl_linux-64          2.34                 h53a641e_7    conda-forge
lerc                      2.2                  he1b5a44_0    conda-forge
libaec                    1.0.4                he1b5a44_1    conda-forge
libblas                   3.8.0                    16_mkl    conda-forge
libcblas                  3.8.0                    16_mkl    conda-forge
libcudf                   0.15.0a200813   cuda10.2_gf836f8ff9_4521    rapidsai-nightly
libcudf_kafka             0.15.0a200813   gf836f8ff9_4521    rapidsai-nightly
libcugraph                0.15.0a200813   cuda10.2_g01132a48_895    rapidsai-nightly
libcuml                   0.15.0a200812   cuda10.2_g141d7c981_1907    rapidsai-nightly
libcumlprims              0.15.0a200720       cuda10.2_57    rapidsai-nightly
libcurl                   7.71.1               hcdd3856_4    conda-forge
libcuspatial              0.15.0a200813   cuda10.2_g1992b3c_260    rapidsai-nightly
libdap4                   3.20.6               h1d1bd15_1    conda-forge
libedit                   3.1.20191231         h46ee950_1    conda-forge
libev                     4.33                 h516909a_0    conda-forge
libevent                  2.1.10               hcdb4288_1    conda-forge
libfaiss                  1.6.3           he61ee18_1_cuda    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.3.0               h24d8f2e_14    conda-forge
libgcrypt                 1.8.4             hf484d3e_1000    conda-forge
libgdal                   3.0.4               he6a97d6_10    conda-forge
libgfortran-ng            7.5.0               hdf63c60_14    conda-forge
libgomp                   9.3.0               h24d8f2e_14    conda-forge
libgpg-error              1.36                 he1b5a44_0    conda-forge
libgsasl                  1.8.0                         2    conda-forge
libhwloc                  2.1.0                h3c4fd83_0    conda-forge
libiconv                  1.15              h516909a_1006    conda-forge
libkml                    1.3.0             hb574062_1011    conda-forge
liblapack                 3.8.0                    16_mkl    conda-forge
liblapacke                3.8.0                    16_mkl    conda-forge
libllvm9                  9.0.1                he513fc3_1    conda-forge
libnetcdf                 4.7.4           nompi_h84807e1_105    conda-forge
libnghttp2                1.41.0               hab1572f_1    conda-forge
libntlm                   1.4               h516909a_1002    conda-forge
libpng                    1.6.37               hed695b0_2    conda-forge
libpq                     12.3                 h5513abc_0    conda-forge
libprotobuf               3.12.4               h8b12597_0    conda-forge
librdkafka                1.4.0                h40bdf00_0    conda-forge
librmm                    0.15.0a200813   cuda10.2_gcfbb297_659    rapidsai-nightly
libsodium                 1.0.18               h516909a_0    conda-forge
libspatialindex           1.9.3                he1b5a44_3    conda-forge
libspatialite             4.3.0a            h2482549_1038    conda-forge
libssh2                   1.9.0                hab1572f_5    conda-forge
libstdcxx-ng              9.3.0               hdf63c60_14    conda-forge
libtiff                   4.1.0                hc7e4089_6    conda-forge
libtool                   2.4.6             h516909a_1003    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libuv                     1.34.0               h516909a_0    conda-forge
libwebp                   1.1.0                h56121f0_4    conda-forge
libwebp-base              1.1.0                h516909a_3    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxgboost                1.1.0dev.rapidsai0.15      cuda10.2_1    rapidsai-nightly
libxml2                   2.9.10               hee79883_0    conda-forge
libzopfli                 1.0.3                he1b5a44_0    conda-forge
line_profiler             3.0.2            py38hc9558a2_0    conda-forge
llvm-openmp               10.0.1               hc9558a2_0    conda-forge
llvmlite                  0.33.0           py38h4f45e52_1    conda-forge
locket                    0.2.0                      py_2    conda-forge
lz4                       3.1.0            py38h66f7c9e_0    conda-forge
lz4-c                     1.9.2                he1b5a44_1    conda-forge
m4                        1.4.18            h14c3975_1001    conda-forge
make                      4.3                  h516909a_0    conda-forge
markdown                  3.2.2                      py_0    conda-forge
markupsafe                1.1.1            py38h1e0a361_1    conda-forge
matplotlib-base           3.3.0            py38h91b0d89_1    conda-forge
mccabe                    0.6.1                      py_1    conda-forge
mistune                   0.8.4           py38h1e0a361_1001    conda-forge
mkl                       2020.2                      256    conda-forge
monotonic                 1.5                        py_0    conda-forge
more-itertools            8.4.0                      py_0    conda-forge
msgpack-python            1.0.0            py38hbf85e49_1    conda-forge
multidict                 4.7.5            py38h1e0a361_1    conda-forge
multipledispatch          0.6.0                      py_0    conda-forge
munch                     2.5.0                      py_0    conda-forge
nbconvert                 5.6.1            py38h32f6830_1    conda-forge
nbformat                  5.0.7                      py_0    conda-forge
nccl                      2.5.7.1              hc6a2c23_0    conda-forge
ncurses                   6.2                  he1b5a44_1    conda-forge
networkx                  2.4                        py_1    conda-forge
ninja                     1.10.0               hc9558a2_0    conda-forge
nodejs                    13.13.0              hf5d1a2b_0    conda-forge
notebook                  6.1.3            py38h32f6830_0    conda-forge
numba                     0.50.1           py38hcb8c335_1    conda-forge
numcodecs                 0.6.4            py38he1b5a44_0    conda-forge
numpy                     1.19.1           py38h8854b6b_0    conda-forge
olefile                   0.46                       py_0    conda-forge
openjpeg                  2.3.1                h981e76c_3    conda-forge
openssl                   1.1.1g               h516909a_1    conda-forge
packaging                 20.4               pyh9f0ad1d_0    conda-forge
pandas                    1.0.5            py38hcb8c335_0    conda-forge
pandoc                    2.10.1               h516909a_0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
panel                     0.9.7                      py_0    conda-forge
param                     1.9.3                      py_0    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.7.1              pyh9f0ad1d_0    conda-forge
partd                     1.1.0                      py_0    conda-forge
pathspec                  0.8.0              pyh9f0ad1d_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
pexpect                   4.8.0            py38h32f6830_1    conda-forge
pickleshare               0.7.5           py38h32f6830_1001    conda-forge
pillow                    7.2.0            py38h9776b28_1    conda-forge
pims                      0.5                pyh9f0ad1d_1    conda-forge
pip                       20.2.2                     py_0    conda-forge
pixman                    0.38.0            h516909a_1003    conda-forge
pkg-config                0.29.2            h516909a_1006    conda-forge
pluggy                    0.13.1           py38h32f6830_2    conda-forge
poppler                   0.87.0               h4190859_1    conda-forge
poppler-data              0.4.9                         1    conda-forge
postgresql                12.3                 h8573dbc_0    conda-forge
proj                      7.0.0                h966b41f_5    conda-forge
prometheus_client         0.8.0              pyh9f0ad1d_0    conda-forge
prompt-toolkit            3.0.6                      py_0    conda-forge
psutil                    5.7.2            py38h1e0a361_0    conda-forge
pthread-stubs             0.4               h14c3975_1001    conda-forge
ptyprocess                0.6.0                   py_1001    conda-forge
py                        1.9.0              pyh9f0ad1d_0    conda-forge
py-xgboost                1.1.0dev.rapidsai0.15  cuda10.2py38_1    rapidsai-nightly
pyarrow                   0.17.1          py38h1234567_11_cuda    conda-forge
pycodestyle               2.6.0              pyh9f0ad1d_0    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pyct                      0.4.6                      py_0    conda-forge
pyct-core                 0.4.6                      py_0    conda-forge
pydeck                    0.4.1              pyh9f0ad1d_0    conda-forge
pyee                      7.0.2              pyh9f0ad1d_0    conda-forge
pyflakes                  2.2.0              pyh9f0ad1d_0    conda-forge
pygments                  2.6.1                      py_0    conda-forge
pynvml                    8.0.4                      py_1    conda-forge
pyopenssl                 19.1.0                     py_1    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyppeteer                 0.0.25                     py_1    conda-forge
pyproj                    2.6.1.post1      py38h7521cb9_0    conda-forge
pyrsistent                0.16.0           py38h1e0a361_0    conda-forge
pysocks                   1.7.1            py38h32f6830_1    conda-forge
pytest                    6.0.1            py38h32f6830_0    conda-forge
pytest-asyncio            0.12.0           py38h32f6830_1    conda-forge
python                    3.8.5           h4d41432_2_cpython    conda-forge
python-blosc              1.9.1            py38hcb8c335_0    conda-forge
python-confluent-kafka    1.3.0            py38h1e0a361_1    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python-snappy             0.5.4            py38h7cfaab3_1    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytorch                   1.6.0           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
pytz                      2020.1             pyh9f0ad1d_0    conda-forge
pyviz_comms               0.7.6              pyh9f0ad1d_0    conda-forge
pywavelets                1.1.1            py38h8790de6_1    conda-forge
pyyaml                    5.3.1            py38h1e0a361_0    conda-forge
pyzmq                     19.0.2           py38ha71036d_0    conda-forge
rapids                    0.15.0a200813   cuda10.2_py38_g54d1e7e_196    rapidsai-nightly
rapids-xgboost            0.15            cuda10.2_py38_g98f436b_172    rapidsai-nightly
re2                       2020.07.06           he1b5a44_1    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
regex                     2020.7.14        py38h1e0a361_0    conda-forge
requests                  2.24.0             pyh9f0ad1d_0    conda-forge
rhash                     1.3.6             h14c3975_1001    conda-forge
rmm                       0.15.0a200813   py38_gcfbb297_659    rapidsai-nightly
rtree                     0.9.4            py38h08f867b_1    conda-forge
scikit-image              0.17.2           py38hcb8c335_1    conda-forge
scikit-learn              0.23.2           py38hee58b96_0    conda-forge
scipy                     1.5.2            py38h8c5af15_0    conda-forge
send2trash                1.5.0                      py_0    conda-forge
setuptools                49.4.0           py38h32f6830_0    conda-forge
shapely                   1.7.0            py38hd168ffb_3    conda-forge
simpervisor               0.3                        py_1    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
slicerator                1.0.0                      py_0    conda-forge
snappy                    1.1.8                he1b5a44_3    conda-forge
sortedcontainers          2.2.2              pyh9f0ad1d_0    conda-forge
sparse                    0.10.0                     py_0    conda-forge
spdlog                    1.7.0                hc9558a2_2    conda-forge
sqlite                    3.32.3               h4cf870e_1    conda-forge
streamz                   0.5.5              pyh9f0ad1d_0    conda-forge
tbb                       2020.1               hc9558a2_0    conda-forge
tblib                     1.6.0                      py_0    conda-forge
terminado                 0.8.3            py38h32f6830_1    conda-forge
testpath                  0.4.4                      py_0    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
thrift-cpp                0.13.0               h62aa4f2_2    conda-forge
tifffile                  2020.7.24                  py_0    conda-forge
tiledb                    1.7.7                h8efa9f0_3    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
toml                      0.10.1             pyh9f0ad1d_0    conda-forge
toolz                     0.10.0                     py_0    conda-forge
torchvision               0.7.0                py38_cu102    pytorch
tornado                   6.0.4            py38h1e0a361_1    conda-forge
tqdm                      4.48.2             pyh9f0ad1d_0    conda-forge
traitlets                 4.3.3            py38h32f6830_1    conda-forge
treelite                  0.92             py38h4e709cc_2    conda-forge
treelite-runtime          0.92                     pypi_0    pypi
typed-ast                 1.4.1            py38h516909a_0    conda-forge
typing_extensions         3.7.4.2                    py_0    conda-forge
tzcode                    2020a                h516909a_0    conda-forge
ucx                       1.8.1+g6b29558       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.15.0a200813+g6b29558        py38_201    rapidsai-nightly
urllib3                   1.25.10                    py_0    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_1    conda-forge
webencodings              0.5.1                      py_1    conda-forge
websockets                8.1              py38h1e0a361_1    conda-forge
wheel                     0.34.2                     py_1    conda-forge
widgetsnbextension        3.5.1            py38h32f6830_1    conda-forge
xarray                    0.16.0                     py_0    conda-forge
xerces-c                  3.2.2             h8412b87_1004    conda-forge
xgboost                   1.1.0dev.rapidsai0.15  cuda10.2py38_1    rapidsai-nightly
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.10               h516909a_0    conda-forge
xorg-libsm                1.2.3             h84519dc_1000    conda-forge
xorg-libx11               1.6.11               h516909a_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xorg-libxext              1.3.4                h516909a_0    conda-forge
xorg-libxrender           0.9.10            h516909a_1002    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
yarl                      1.4.2            py38h516909a_0    conda-forge
zarr                      2.4.0                      py_0    conda-forge
zeromq                    4.3.2                he1b5a44_3    conda-forge
zfp                       0.5.5                he1b5a44_1    conda-forge
zict                      2.0.0                      py_0    conda-forge
zipp                      3.1.0                      py_0    conda-forge
zlib                      1.2.11            h516909a_1007    conda-forge
zstandard                 0.14.0           py38h950e882_1    conda-forge
zstd                      1.4.5                h6597ccf_2    conda-forge

Additional context

We seem to be running into some form of this in UCX-Py's benchmarking tests of late. This cropped up after PR ( #466 ), which resulted in Python now using the new pool memory allocator instead of CNMeM.

@jakirkham jakirkham added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 13, 2020
@harrism harrism self-assigned this Aug 13, 2020
@harrism

harrism commented Aug 13, 2020

Let me take a look.

@harrism

harrism commented Aug 14, 2020

So, there were a few problems here, but I learned something: cudaMemGetInfo, which returns the "free" and "total" global memory available on the device, does not actually report 32 GiB in total on a GPU with 32 GiB. It reports a couple hundred MB less than that. Therefore, when we initialize the pool to 1/2 of device memory by default, on a V100 that is less than 16 GiB. So when the repro program above tries to allocate a second 8 GiB buffer, it doesn't fit in the initial pool, and the pool has to grow. That's unfortunate because it increases fragmentation.
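To make the arithmetic concrete, here is a rough sketch in Python. It assumes the total reported by the driver matches the 32510 MiB figure quoted in the report; the exact value cudaMemGetInfo returns is not given in this thread and may differ. Alignment is ignored for simplicity.

```python
MIB = 2**20
GIB = 2**30

# Assumption: total device memory, taken from the 32510 MiB figure in the
# report. The exact value returned by cudaMemGetInfo may differ.
total_bytes = 32510 * MIB

# Default initial pool size: half of the reported device memory.
initial_pool = total_bytes // 2

# Each DeviceBuffer in the repro requests 8 GiB.
request = 8 * GIB

# Space left in the initial pool after the first allocation.
free_after_first = initial_pool - request

# The second allocation only avoids pool growth if it fits in that space.
fits_second = request <= free_after_first

print(f"initial pool:            {initial_pool / GIB:.3f} GiB")
print(f"free after first buffer: {free_after_first / GIB:.3f} GiB")
print(f"second buffer fits without growth: {fits_second}")
```

Under that assumption the initial pool is a little short of 16 GiB, so the second 8 GiB request cannot be served from it and the pool must grow.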

The reason for the OOM on that pool growth is still unknown. The logic for growing the pool was like this:

size_t size_to_grow(size_t size) const
{
  auto const remaining =
    rmm::detail::align_up(maximum_pool_size_ - pool_size(), allocation_alignment);
  auto const aligned_size = rmm::detail::align_up(size, allocation_alignment);
  if (aligned_size <= remaining / 2) {
    return remaining / 2;
  } else if (aligned_size <= remaining) {
    return remaining;
  } else {
    return 0;
  }
}

This was falling into the else if branch. Even though a cudaMalloc of remaining bytes should have succeeded, it didn't, for whatever reason. In any case, I decided that this growth heuristic is too greedy. Instead, in that branch we should return aligned_size:

size_t size_to_grow(size_t size) const
{
  auto const remaining =
    rmm::detail::align_up(maximum_pool_size_ - pool_size(), allocation_alignment);
  auto const aligned_size = rmm::detail::align_up(size, allocation_alignment);
  if (aligned_size <= remaining / 2) {
    return remaining / 2;
  } else if (aligned_size <= remaining) {
    return aligned_size;
  } else {
    return 0;
  }
}

This makes a difference. For example, if the current pool is 16 GiB and you try to allocate 9 GiB and it doesn't fit, the pool will grow by 9 GiB rather than by ~16 GiB. That 7 GiB could be the difference between another program on the machine (or a library in your app that doesn't use RMM) being able to run or not, so I think being less greedy here is better.
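The 16 GiB / 9 GiB example can be checked with a small Python model of the two heuristics. This is only a sketch: the function names size_to_grow_old and size_to_grow_new, the 256-byte ALIGNMENT, the 32 GiB maximum pool size, and the Python align_up helper are all assumptions made for illustration, not RMM's actual values or API.

```python
GIB = 2**30
ALIGNMENT = 256  # assumed stand-in for allocation_alignment


def align_up(value, alignment):
    """Round value up to the nearest multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def size_to_grow_old(size, pool_size, maximum_pool_size):
    """Model of the original heuristic: grab all remaining space if needed."""
    remaining = align_up(maximum_pool_size - pool_size, ALIGNMENT)
    aligned_size = align_up(size, ALIGNMENT)
    if aligned_size <= remaining // 2:
        return remaining // 2
    if aligned_size <= remaining:
        return remaining
    return 0


def size_to_grow_new(size, pool_size, maximum_pool_size):
    """Model of the revised heuristic: grow only by the requested size."""
    remaining = align_up(maximum_pool_size - pool_size, ALIGNMENT)
    aligned_size = align_up(size, ALIGNMENT)
    if aligned_size <= remaining // 2:
        return remaining // 2
    if aligned_size <= remaining:
        return aligned_size
    return 0


# Hypothetical scenario from the comment: 16 GiB pool, 9 GiB request,
# with an assumed 32 GiB maximum pool size.
pool = 16 * GIB
maximum = 32 * GIB
request = 9 * GIB

old_growth = size_to_grow_old(request, pool, maximum)
new_growth = size_to_grow_new(request, pool, maximum)

print(f"old heuristic grows by {old_growth // GIB} GiB")
print(f"new heuristic grows by {new_growth // GIB} GiB")
```

In this model the two versions differ only in the middle branch, so requests that fit in half of the remaining space still grow the pool by that half, and requests larger than the remaining space still return 0.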

PR on the way.

There were also some issues with passing default parameters from Python (and there still are).
