JIT-unspill: warn when spill to disk triggers #705
Conversation
LGTM, thanks @madsbk.
# It is a bit hacky to forcefully capture the "distributed.worker" logger;
# eventually it would be better to have a different logger. For now this
# is OK, allowing users to read logs with client.get_worker_logs(); a
# proper solution would require changes to Distributed.
self.logger = logging.getLogger("distributed.worker")
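
As the comment notes, messages emitted on the "distributed.worker" logger are visible through the client. A minimal sketch of reading them (the scheduler address below is a placeholder):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address

# client.get_worker_logs() maps each worker address to a list of
# (level, message) tuples; spill warnings emitted on the
# "distributed.worker" logger show up here.
for worker, records in client.get_worker_logs().items():
    for level, message in records:
        if "spill" in message.lower():
            print(worker, level, message)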
I'm not suggesting this needs to be done now, but we have LoggedBuffer in DeviceHostFile, which gives us another lever for controlling spill logging.
Admittedly I don't understand why, but the failure here is due to dask/dask#8029. It seems that either the […]. @galipremsagar, could you take a look and see if you understand what exactly is happening? The error is also below for simplicity:

13:05:00 distributed.worker - WARNING - Compute Failed
13:05:00 Function: percentiles_summary
13:05:00 args: (0 5
13:05:00 1 9
13:05:00 2 8
13:05:00 3 6
13:05:00 Name: key, dtype: int64, 2, 2, 1.0, array([3578036347, 3540470653, 3867988763, 4131519251, 3858203389,
13:05:00 2739875098, 1467161710, 2658682441, 64082476, 4118976548,
13:05:00 2773552643, 3754134243, 3198209345, 1857617921, 2568210797,
13:05:00 4106179159, 2285093661, 4289432073, 400221423, 2250820505,
13:05:00 1200833651, 583842503, 4103527463, 821022306, 213520005,
13:05:00 2533953639, 1007066952, 2392461660, 35926750, 3264511067,
13:05:00 4256487459, 2301083606, 132645779, 3646629135, 3293590294,
13:05:00 1375165179, 2136909933, 2362263536, 1154292738, 3484627981,
13:05:00 802512930, 2268364294, 3625019549, 314127436, 1131057533,
13:05:00 1362096678, 2384320514, 2674545900, 3227957034, 2164428858,
13:05:00 1453714696, 2150101184, 1369638522, 3455134164, 455700831,
13:05:00 2409685947, 3294707776, 1513167404, 1113875085, 1624681803,
13:05:00 2233018894, 3215077675, 1693934420, 3920719795, 1752710669,
13:05:00 2837861634, 477517417, 3534208232, 573994510, 215599225
13:05:00 kwargs: {}
13:05:00 Exception: TypeError("No dispatch for <class 'cudf.core.series.Series'>")
13:05:00
13:05:00 Process SpawnProcess-20:
13:05:00 Traceback (most recent call last):
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
13:05:00 self.run()
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 99, in run
13:05:00 self._target(*self._args, **self._kwargs)
13:05:00 File "/workspace/dask_cuda/tests/test_explicit_comms.py", line 245, in _test_dataframe_shuffle_merge
13:05:00 got = ddf1.merge(ddf2, on="key").set_index("key").compute()
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/core.py", line 226, in set_index
13:05:00 **kwargs,
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py", line 4235, in set_index
13:05:00 **kwargs,
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/shuffle.py", line 163, in set_index
13:05:00 df, index2, repartition, npartitions, upsample, partition_size
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/shuffle.py", line 35, in _calculate_divisions
13:05:00 divisions, sizes, mins, maxes = base.compute(divisions, sizes, mins, maxes)
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/base.py", line 568, in compute
13:05:00 results = schedule(dsk, keys, **kwargs)
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 2671, in get
13:05:00 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 1954, in gather
13:05:00 asynchronous=asynchronous,
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 846, in sync
13:05:00 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 325, in sync
13:05:00 raise exc.with_traceback(tb)
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 308, in f
13:05:00 result[0] = yield future
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
13:05:00 value = future.result()
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 1813, in _gather
13:05:00 raise exception.with_traceback(traceback)
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/partitionquantiles.py", line 421, in percentiles_summary
13:05:00 vals, n = _percentile(data, qs, interpolation=interpolation)
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/dispatch.py", line 29, in _percentile
13:05:00 func = percentile_dispatch.dispatch(type(a))
13:05:00 File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/utils.py", line 568, in dispatch
13:05:00 raise TypeError("No dispatch for {0}".format(cls))
13:05:00 TypeError: No dispatch for <class 'cudf.core.series.Series'>
13:05:01 Coverage.py warning: --include is ignored because --source is set (include-ignored)
rerun tests
Found the issue. This will be resolved by commit dask/dask@bb8d469, part of dask/dask#8055.
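
For context, here is a minimal sketch of the Dispatch registry behind the TypeError above. The handlers are illustrative assumptions, not the actual fix from dask/dask#8055:

import numpy as np
from dask.utils import Dispatch

# dask.utils.Dispatch selects a handler by the type of its first argument;
# an unregistered type raises TypeError("No dispatch for <class ...>"),
# exactly as in the traceback above.
percentile_lookup = Dispatch(name="percentile")

@percentile_lookup.register(np.ndarray)
def _percentile_numpy(a, q, interpolation="linear"):
    return np.percentile(a, q, interpolation=interpolation)

# A hypothetical cuDF handler, registered lazily so that importing dask
# doesn't require cudf; registering one is what makes the error go away.
@percentile_lookup.register_lazy("cudf")
def _register_cudf():
    import cudf

    @percentile_lookup.register(cudf.Series)
    def _percentile_cudf(a, q, interpolation="linear"):
        return a.quantile([x / 100.0 for x in q], interpolation=interpolation)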
rerun tests
Codecov Report
@@              Coverage Diff               @@
##           branch-21.10     #705    +/-  ##
================================================
+ Coverage        87.63%    89.45%   +1.82%
================================================
  Files               15        15
  Lines             1658      1698      +40
================================================
+ Hits              1453      1519      +66
+ Misses             205       179      -26
Continue to review full report at Codecov.
Thanks @galipremsagar, @pentschev and @jakirkham |
@gpucibot merge
Currently, JIT-unspill doesn't support spilling to disk (#657); however, Dask might trigger a spill to disk by accessing self.data.fast directly. This PR adds a .fast attribute to prevent a crash and raise a warning instead.
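
A minimal sketch of the mechanism described above. The class name and the exact protocol Dask's memory monitor expects are assumptions; the real implementation in this PR may differ:

import logging

# Reusing the worker logger, as discussed in the review thread above.
logger = logging.getLogger("distributed.worker")

class ProxifyHostFileSketch:
    """Illustrative subset of a JIT-unspill host file."""

    @property
    def fast(self):
        # Dask's worker memory monitor reaches into `Worker.data.fast`
        # to spill host memory to disk. JIT-unspill has no disk tier
        # yet (#657), so instead of crashing with an AttributeError we
        # warn and return a falsy value, which tells the monitor there
        # is nothing to evict.
        logger.warning(
            "JIT-unspill doesn't support spilling to disk; ignoring request"
        )
        return False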