JIT-unspill: warn when spill to disk triggers #705

Merged: 1 commit merged into rapidsai:branch-21.10 on Aug 19, 2021

Conversation

@madsbk (Member) commented Aug 18, 2021

Currently, JIT-unspill doesn't support spill to disk (#657); however, Dask might trigger a spill-to-disk by accessing self.data.fast directly.

This PR adds a .fast attribute to prevent a crash and raise a warning instead.
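For readers unfamiliar with the mechanism, here is a minimal sketch of the idea, with hypothetical names that do not reproduce the actual diff: the worker asks its data buffer to evict something via .fast, so the JIT-unspill host file can expose a fast property whose evict() only logs a warning instead of crashing.

import logging


class ProxifyHostFileSketch:
    """Illustrative stand-in for the JIT-unspill host file (hypothetical name)."""

    def __init__(self):
        # Reuse the "distributed.worker" logger so the warning is visible
        # through client.get_worker_logs(), as in the snippet reviewed below.
        self.logger = logging.getLogger("distributed.worker")

    @property
    def fast(self):
        host_file = self

        class _WarnOnEvict:
            # Stand-in for what the worker expects from `self.data.fast`:
            # an object it can call `evict()` on to trigger a spill to disk.
            def evict(self):
                host_file.logger.warning(
                    "JIT-unspill does not support spilling to disk; "
                    "ignoring the eviction request"
                )
                return -1  # assumption: a negative value signals nothing was evicted

        return _WarnOnEvict()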

@madsbk requested a review from a team as a code owner, August 18, 2021 10:50
@github-actions bot added the python (python code needed) label, Aug 18, 2021
@madsbk added the 3 - Ready for Review (Ready for review by team), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels, Aug 18, 2021
@pentschev (Member) left a comment:

LGTM, thanks @madsbk .

Comment on lines +168 to +172
# It is a bit hacky to forcefully capture the "distributed.worker" logger,
# eventually it would be better to have a different logger. For now this
# is ok, allowing users to read logs with client.get_worker_logs(), a
# proper solution would require changes to Distributed.
self.logger = logging.getLogger("distributed.worker")
A reviewer (Member) commented:

I'm not suggesting this needs to be done now, but we have LoggedBuffer in DeviceHostFile, which gives us another level of spill logging control.
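As a rough illustration of why reusing the "distributed.worker" logger is convenient (a sketch assuming a local cluster, not code from this PR): Distributed attaches a handler to that logger on each worker, so anything emitted through it can be read back with client.get_worker_logs() without modifying Distributed.

import logging

from dask.distributed import Client, LocalCluster


def task():
    # Emit through the worker's own logger, the same one the PR captures,
    # so the message is recorded by the handler Distributed installs.
    logging.getLogger("distributed.worker").warning(
        "spill to disk requested, but JIT-unspill does not support it"
    )
    return 42


if __name__ == "__main__":
    with LocalCluster(n_workers=1, threads_per_worker=1) as cluster, Client(cluster) as client:
        client.submit(task).result()
        # Each worker's captured log records, including the warning above.
        for address, records in client.get_worker_logs().items():
            for level, message in records:
                print(address, level, message)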

@pentschev (Member) commented Aug 18, 2021

Admittedly I don't understand why, but the failure here is due to dask/dask#8029. It seems that either the da.percentile dispatcher for cudf.Series isn't found anymore, or there was a change in the codepath that previously would not try to call np.percentile on a cudf.Series.

@galipremsagar could you take a look if you understand what exactly is happening? The error is also below for simplicity:

13:05:00 distributed.worker - WARNING - Compute Failed
13:05:00 Function:  percentiles_summary
13:05:00 args:      (0    5
13:05:00 1    9
13:05:00 2    8
13:05:00 3    6
13:05:00 Name: key, dtype: int64, 2, 2, 1.0, array([3578036347, 3540470653, 3867988763, 4131519251, 3858203389,
13:05:00        2739875098, 1467161710, 2658682441,   64082476, 4118976548,
13:05:00        2773552643, 3754134243, 3198209345, 1857617921, 2568210797,
13:05:00        4106179159, 2285093661, 4289432073,  400221423, 2250820505,
13:05:00        1200833651,  583842503, 4103527463,  821022306,  213520005,
13:05:00        2533953639, 1007066952, 2392461660,   35926750, 3264511067,
13:05:00        4256487459, 2301083606,  132645779, 3646629135, 3293590294,
13:05:00        1375165179, 2136909933, 2362263536, 1154292738, 3484627981,
13:05:00         802512930, 2268364294, 3625019549,  314127436, 1131057533,
13:05:00        1362096678, 2384320514, 2674545900, 3227957034, 2164428858,
13:05:00        1453714696, 2150101184, 1369638522, 3455134164,  455700831,
13:05:00        2409685947, 3294707776, 1513167404, 1113875085, 1624681803,
13:05:00        2233018894, 3215077675, 1693934420, 3920719795, 1752710669,
13:05:00        2837861634,  477517417, 3534208232,  573994510, 215599225
13:05:00 kwargs:    {}
13:05:00 Exception: TypeError("No dispatch for <class 'cudf.core.series.Series'>")
13:05:00 

13:05:00 Process SpawnProcess-20:
13:05:00 Traceback (most recent call last):
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
13:05:00     self.run()
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 99, in run
13:05:00     self._target(*self._args, **self._kwargs)
13:05:00   File "/workspace/dask_cuda/tests/test_explicit_comms.py", line 245, in _test_dataframe_shuffle_merge
13:05:00     got = ddf1.merge(ddf2, on="key").set_index("key").compute()
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/core.py", line 226, in set_index
13:05:00     **kwargs,
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py", line 4235, in set_index
13:05:00     **kwargs,
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/shuffle.py", line 163, in set_index
13:05:00     df, index2, repartition, npartitions, upsample, partition_size
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/shuffle.py", line 35, in _calculate_divisions
13:05:00     divisions, sizes, mins, maxes = base.compute(divisions, sizes, mins, maxes)
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/base.py", line 568, in compute
13:05:00     results = schedule(dsk, keys, **kwargs)
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 2671, in get
13:05:00     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 1954, in gather
13:05:00     asynchronous=asynchronous,
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 846, in sync
13:05:00     self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 325, in sync
13:05:00     raise exc.with_traceback(tb)
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 308, in f
13:05:00     result[0] = yield future
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
13:05:00     value = future.result()
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 1813, in _gather
13:05:00     raise exception.with_traceback(traceback)
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/partitionquantiles.py", line 421, in percentiles_summary
13:05:00     vals, n = _percentile(data, qs, interpolation=interpolation)
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/dispatch.py", line 29, in _percentile
13:05:00     func = percentile_dispatch.dispatch(type(a))
13:05:00   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/utils.py", line 568, in dispatch
13:05:00     raise TypeError("No dispatch for {0}".format(cls))
13:05:00 TypeError: No dispatch for <class 'cudf.core.series.Series'>
13:05:01 Coverage.py warning: --include is ignored because --source is set (include-ignored)
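To make the failure mode concrete, here is a hedged sketch of how dask's Dispatch mechanism produces this error: a Dispatch object maps argument types to implementations, and looking up a type with no registration raises the same "No dispatch for ..." TypeError seen above. The names below are illustrative and do not reproduce dask's actual percentile registrations.

from dask.utils import Dispatch

# Hypothetical dispatcher, analogous in spirit to dask's percentile_dispatch.
percentile_sketch = Dispatch("percentile_sketch")


@percentile_sketch.register(list)
def _percentile_list(a, q):
    # Trivial nearest-rank stand-in implementation for plain Python lists.
    a = sorted(a)
    return [a[round(p / 100 * (len(a) - 1))] for p in q]


print(percentile_sketch([1, 5, 9, 3], [50]))  # dispatches on list -> [5]

try:
    percentile_sketch((1, 5, 9, 3), [50])  # tuple has no registered implementation
except TypeError as err:
    print(err)  # "No dispatch for <class 'tuple'>"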

@galipremsagar (Contributor): rerun tests

@galipremsagar (Contributor) commented:

Found the issue. This will be resolved by commit dask/dask@bb8d469, as part of dask/dask#8055.

@jakirkham (Member): rerun tests

@codecov-commenter commented Aug 19, 2021

Codecov Report

Merging #705 (b36c2d1) into branch-21.10 (8e6ab70) will increase coverage by 1.82%.
The diff coverage is 65.21%.


@@               Coverage Diff                @@
##           branch-21.10     #705      +/-   ##
================================================
+ Coverage         87.63%   89.45%   +1.82%     
================================================
  Files                15       15              
  Lines              1658     1698      +40     
================================================
+ Hits               1453     1519      +66     
+ Misses              205      179      -26     
Impacted Files                                   Coverage Δ
dask_cuda/cuda_worker.py                         77.64% <ø> (ø)
dask_cuda/local_cuda_cluster.py                  77.88% <0.00%> (ø)
dask_cuda/utils.py                               84.25% <60.52%> (-3.03%) ⬇️
dask_cuda/initialize.py                          92.59% <100.00%> (+3.70%) ⬆️
dask_cuda/proxify_host_file.py                   99.41% <100.00%> (+0.02%) ⬆️
dask_cuda/explicit_comms/dataframe/shuffle.py    98.69% <0.00%> (+0.65%) ⬆️
dask_cuda/proxy_object.py                        90.73% <0.00%> (+1.08%) ⬆️
dask_cuda/cli/dask_cuda_worker.py                97.14% <0.00%> (+1.42%) ⬆️
dask_cuda/device_host_file.py                    71.77% <0.00%> (+1.61%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c175f4...b36c2d1. Read the comment docs.

@madsbk (Member, Author) commented Aug 19, 2021

Thanks @galipremsagar, @pentschev and @jakirkham

@madsbk (Member, Author) commented Aug 19, 2021

@gpucibot merge

@rapids-bot bot merged commit 4b9f525 into rapidsai:branch-21.10, Aug 19, 2021
@madsbk deleted the warn_spill_to_disk branch, August 19, 2021 07:06
Labels: 3 - Ready for Review (Ready for review by team), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), python (python code needed)