cub::DeviceReduce::ReduceByKey() results are non-deterministic for floats #441
It looks like we'll need to update the docs here, similar to NVIDIA/thrust#1587. Thanks for pointing this out!
I think this approach to determinism is a bit cavalier. My original ask for switching the reduce implementation to be deterministic in #108 was because I was changing TensorFlow's reduction to use CUB, which meant TensorFlow's determinism now depends on CUB's determinism. I haven't been working on TF for a long time now, but if the docs do guarantee determinism, then the process that led to an implementation change in CUB that broke the documented contract should be examined, as it has possibly broken a lot of downstream software that relied on the documented behavior.
We're discussing adding deterministic APIs, but it's a matter of prioritization and limited resources right now. Those guarantees were broken several years ago, unfortunately, and the current implementations of these algorithms cannot easily support this use case. This is definitely not ideal, I 100% agree. Hopefully we'll be able to address this better in the future.
@allisonvacanti Looking forward to seeing cub and thrust provide deterministic APIs. This will be very helpful for HPC and ML applications.
@allisonvacanti Do you know when the regression was introduced? At the end of 2017 the reduction should have been deterministic (around NVIDIA/thrust#121). There are then very few commits until sometime in 2020. Scanning the titles until now, I don't see anything obvious that switched the underlying reduction implementation algorithm. Would be interested to see where that happened.
I'm not sure where it happened. I was not aware that this was broken until this issue was filed, and I don't see any changes in the log that would have affected this behavior. No idea.
ReduceByKey is implemented as a prefix scan. With decoupled look-back it has the same issues as in NVIDIA/thrust#1587.
@fkallen correct, thrust/cub reduce_by_key is implemented as a prefix scan using @dumerrill's decoupled look-back algorithm. The prefix-scan approach is fine with me for balancing workload across SMs. At the moment, I'm using a handcrafted implementation to work around these issues. Looking forward to NVIDIA (@allisonvacanti @senior-zero @dumerrill) providing a future-proof solution from thrust/cub in the form of deterministic APIs.
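(Editorial aside, not part of the original thread: a minimal sketch of why a scheduling-dependent prefix scan cannot give bitwise-reproducible float sums. Floating-point addition is not associative, so regrouping partial sums changes the result; with decoupled look-back, the grouping of tile partials depends on run-time timing, which is the issue described in NVIDIA/thrust#1587.)

```cpp
// Illustrative only: float addition is not associative, so any change in the
// order/grouping of partial sums can change the final value.
#include <cstdio>

int main()
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left  = (a + b) + c;   // (0) + 1       = 1.0f
    float right = a + (b + c);   // c is absorbed = 0.0f
    std::printf("(a+b)+c = %g, a+(b+c) = %g\n", left, right);
    return 0;
}
```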
Filed NVIDIA/thrust#471 to summarize and describe the various issues with existing and recently removed determinism guarantees. Closing this (and other similar issues) to consolidate discussion -- your feedback is welcome on NVIDIA/thrust#471.
Hi @allisonvacanti @lilohuang and others. Has there been a resolution for this in the cub::DeviceReduce::ReduceByKey() function? Thanks in advance.
Hello @BenikaHall! Reduce by key is based on decoupled look-back (like device scan). Unfortunately, it's still not deterministic. The work is tracked by NVIDIA/cccl#886. We have planned some work on this for the 2.4 release.
@allisonvacanti @senior-zero
The cub::DeviceReduce::ReduceByKey web page describes "run-to-run" determinism for addition of floating-point types, but the result looks wrong to me. Is this expected behavior or a bug?
From my limited testing, the code below produced non-reproducible run-to-run results with the CUDA 11.6 SDK. Note that you need to run the program multiple times and will only occasionally see the error; it does not provide the "run-to-run" determinism that the documentation describes.
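(The reporter's original snippet is not reproduced above. The following is a hypothetical minimal sketch of that kind of check, assuming a single key segment of floats summed with cub::Sum; it launches the same ReduceByKey twice and compares the sums bit-for-bit. As described above, a mismatch may only show up occasionally, or only across separate program runs.)

```cpp
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

int main()
{
    const int num_items = 1 << 22;            // many items in one segment
    std::vector<int>   h_keys(num_items, 0);  // a single key: one long float sum
    std::vector<float> h_vals(num_items, 0.1f);

    int   *d_keys = nullptr, *d_unique = nullptr, *d_num_runs = nullptr;
    float *d_vals = nullptr, *d_out = nullptr;
    cudaMalloc(&d_keys,     num_items * sizeof(int));
    cudaMalloc(&d_vals,     num_items * sizeof(float));
    cudaMalloc(&d_unique,   num_items * sizeof(int));
    cudaMalloc(&d_out,      num_items * sizeof(float));
    cudaMalloc(&d_num_runs, sizeof(int));
    cudaMemcpy(d_keys, h_keys.data(), num_items * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals.data(), num_items * sizeof(float), cudaMemcpyHostToDevice);

    // First call only queries the required temporary storage size.
    void  *d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::ReduceByKey(d_temp, temp_bytes, d_keys, d_unique,
                                   d_vals, d_out, d_num_runs, cub::Sum(), num_items);
    cudaMalloc(&d_temp, temp_bytes);

    float sums[2] = {0.0f, 0.0f};
    for (int run = 0; run < 2; ++run)
    {
        cub::DeviceReduce::ReduceByKey(d_temp, temp_bytes, d_keys, d_unique,
                                       d_vals, d_out, d_num_runs, cub::Sum(), num_items);
        cudaMemcpy(&sums[run], d_out, sizeof(float), cudaMemcpyDeviceToHost);
    }

    // A run-to-run deterministic implementation would always print "match";
    // with the scan-based implementation the sums (here, or across separate
    // program runs) can occasionally differ in the last bits.
    std::printf("launch 0: %.9g  launch 1: %.9g  -> %s\n", sums[0], sums[1],
                sums[0] == sums[1] ? "match" : "MISMATCH");
    return 0;
}
```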
Note: this issue is mainly about cub; there is a similar issue for thrust (NVIDIA/cccl#794).