cub::DeviceReduce::ReduceByKey() results are non-deterministic for floats #441
It looks like we'll need to update the docs here, similar to NVIDIA/thrust#1587. Thanks for pointing this out!
I think this approach to determinism is a bit cavalier. My original ask for switching the reduce implementation to be deterministic in #108 was because I was changing TensorFlow's reduction to use CUB, which meant TensorFlow's determinism now depends on CUB's determinism. I haven't been working on TF for a long time now, but if the docs do guarantee determinism, then the process that led to an implementation change in CUB that broke the documented contract should be examined, as it has possibly broken a lot of downstream software that relied on the documented behavior.
We're discussing adding deterministic APIs, but it's a matter of prioritization and limited resources right now. Those guarantees were broken several years ago, unfortunately, and the current implementations of these algorithms cannot easily support this use case. This is definitely not ideal, I 100% agree. Hopefully we'll be able to address this better in the future.
@allisonvacanti Looking forward to seeing cub and thrust provide deterministic APIs. This will be very helpful for HPC and ML applications.
@allisonvacanti Do you know when the regression was introduced? At the end of 2017 the reduction should have been deterministic (around NVIDIA/thrust#121). There are then very few commits until sometime in 2020. Scanning the titles until now, I don't see anything obvious that switched the underlying reduction implementation algorithm. Would be interested to see where that happened.
I'm not sure where it happened. I was not aware that this was broken until this issue was filed, and I don't see any changes in the log that would have affected this behavior. No idea.
ReduceByKey is implemented as a prefix scan. With decoupled look-back it has the same issues as in NVIDIA/thrust#1587.
@fkallen correct, thrust/cub reduce_by_key is implemented as a prefix scan using @dumerrill's decoupled look-back algorithm. The prefix-scan approach is fine with me for balancing workload across SMs. At the moment, I'm using a handcrafted implementation to work around these issues. Looking forward to NVIDIA (@allisonvacanti @senior-zero @dumerrill) providing a future-proof solution from thrust/cub in the form of deterministic APIs.
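(Editorial aside, not part of the original thread: a minimal sketch of why a scheduling-dependent prefix scan cannot give bitwise-reproducible float sums. Floating-point addition is not associative, so regrouping partial sums changes the result; with decoupled look-back, the grouping of tile partials depends on run-time timing, which is the issue described in NVIDIA/thrust#1587.)

```cpp
// Illustrative only: float addition is not associative, so any change in the
// order/grouping of partial sums can change the final value.
#include <cstdio>

int main()
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left  = (a + b) + c;   // (0) + 1       = 1.0f
    float right = a + (b + c);   // c is absorbed = 0.0f
    std::printf("(a+b)+c = %g, a+(b+c) = %g\n", left, right);
    return 0;
}
```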
Filed NVIDIA/thrust#471 to summarize and describe the various issues with existing and recently removed determinism guarantees. Closing this (and other similar issues) to consolidate discussion -- your feedback is welcome on NVIDIA/thrust#471.
Hi @allisonvacanti @lilohuang and others. Has there been a resolution for this in the cub::DeviceReduce::ReduceByKey() function? Thanks in advance.
Hello @BenikaHall! Reduce by key is based on decoupled look-back (like device scan). Unfortunately, it's still not deterministic. The work is tracked by NVIDIA/cccl#886. We have planned some work on this for the 2.4 release.
@allisonvacanti @senior-zero
The cub::DeviceReduce::ReduceByKey web page describes "run-to-run" determinism for addition of floating-point types, but the result looks wrong to me. Is this expected behavior or a bug?
From my limited testing, the code below produced non-reproducible run-to-run results with the CUDA 11.6 SDK. Note that you need to run the program multiple times and will only occasionally see the error; it does not provide the "run-to-run" determinism that the documentation describes.
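(The reporter's original snippet is not reproduced above. The following is a hypothetical minimal sketch of that kind of check, assuming a single key segment of floats summed with cub::Sum; it launches the same ReduceByKey twice and compares the sums bit-for-bit. As described above, a mismatch may only show up occasionally, or only across separate program runs.)

```cpp
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

int main()
{
    const int num_items = 1 << 22;            // many items in one segment
    std::vector<int>   h_keys(num_items, 0);  // a single key: one long float sum
    std::vector<float> h_vals(num_items, 0.1f);

    int   *d_keys = nullptr, *d_unique = nullptr, *d_num_runs = nullptr;
    float *d_vals = nullptr, *d_out = nullptr;
    cudaMalloc(&d_keys,     num_items * sizeof(int));
    cudaMalloc(&d_vals,     num_items * sizeof(float));
    cudaMalloc(&d_unique,   num_items * sizeof(int));
    cudaMalloc(&d_out,      num_items * sizeof(float));
    cudaMalloc(&d_num_runs, sizeof(int));
    cudaMemcpy(d_keys, h_keys.data(), num_items * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals.data(), num_items * sizeof(float), cudaMemcpyHostToDevice);

    // First call only queries the required temporary storage size.
    void  *d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::ReduceByKey(d_temp, temp_bytes, d_keys, d_unique,
                                   d_vals, d_out, d_num_runs, cub::Sum(), num_items);
    cudaMalloc(&d_temp, temp_bytes);

    float sums[2] = {0.0f, 0.0f};
    for (int run = 0; run < 2; ++run)
    {
        cub::DeviceReduce::ReduceByKey(d_temp, temp_bytes, d_keys, d_unique,
                                       d_vals, d_out, d_num_runs, cub::Sum(), num_items);
        cudaMemcpy(&sums[run], d_out, sizeof(float), cudaMemcpyDeviceToHost);
    }

    // A run-to-run deterministic implementation would always print "match";
    // with the scan-based implementation the sums (here, or across separate
    // program runs) can occasionally differ in the last bits.
    std::printf("launch 0: %.9g  launch 1: %.9g  -> %s\n", sums[0], sums[1],
                sums[0] == sums[1] ? "match" : "MISMATCH");
    return 0;
}
```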
Note: this issue is mainly about cub; there is a similar issue for thrust (NVIDIA/cccl#794).