Add static_map::insert_or_apply aka reduce-by-key #515

Merged: 21 commits, Jul 4, 2024

srinivasyadav18
Contributor

This PR (continuation of PR #384) adds a function static_map::insert_or_apply, which either inserts a new key-value pair into the map, or, in case the key already exists, applies a reduction function over the associated value.

This effectively replaces the former cuco::static_reduction_map and thus supersedes #98.

The current implementation restricts the insert_or_apply() operation to a callable Op with one specific signature, i.e., Op must be invocable as Op(cuda::atomic_ref<T, Scope>, T).
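
To make the insert-or-apply semantics concrete, here is a minimal host-side sketch in plain C++. It is not the cuco API: `toy_reduction_map` and `sum_op` are hypothetical names, `std::unordered_map` stands in for the GPU hash table, and `std::atomic<T>&` stands in for `cuda::atomic_ref<T, Scope>`.

```cpp
#include <atomic>
#include <cstddef>
#include <unordered_map>

// Hypothetical host-side model of insert_or_apply; this is NOT the cuco API.
// std::atomic<T>& plays the role of cuda::atomic_ref<T, Scope>.
template <typename Key, typename T>
class toy_reduction_map {
 public:
  // Inserts {key, value} if key is absent; otherwise calls op(slot, value),
  // where slot is an atomic handle to the payload already stored for key.
  template <typename Op>
  void insert_or_apply(Key const& key, T value, Op op)
  {
    auto it = slots_.find(key);
    if (it == slots_.end()) {
      slots_.emplace(key, value);  // first occurrence: plain insert
    } else {
      op(it->second, value);       // repeated key: reduce into the payload
    }
  }

  T at(Key const& key) const { return slots_.at(key).load(); }
  std::size_t size() const { return slots_.size(); }

 private:
  std::unordered_map<Key, std::atomic<T>> slots_;
};

// Example Op with the shape Op(atomic handle, T): an atomic sum.
struct sum_op {
  template <typename T>
  void operator()(std::atomic<T>& slot, T value) const
  {
    slot.fetch_add(value, std::memory_order_relaxed);
  }
};
```

Calling `insert_or_apply(1, 10L, sum_op{})` and then `insert_or_apply(1, 7L, sum_op{})` on a `toy_reduction_map<int, long>` leaves a single entry for key 1 holding the reduced value, which is the reduce-by-key behaviour the PR description refers to.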

Closes #82 #384

copy-pr-bot bot commented Jun 25, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sleeepyjack
Collaborator

/ok to test

@sleeepyjack
Collaborator

One small nit coming from the Doxygen checker:

/home/runner/work/cuCollections/cuCollections/include/cuco/operator.hpp:45: warning: Member insert_or_apply (variable) of  cuco is not documented.

@srinivasyadav18
Contributor Author

srinivasyadav18 commented Jun 25, 2024

Benchmark results of reduction using cuco::op::reduce::sum

static_map_insert_or_apply_uniform_multiplicity

[0] Tesla T4

Key Value Distribution Multiplicity Samples CPU Time Noise GPU Time Noise Elem/s
I32 I32 UNIFORM 1 11x 157.325 ms 0.05% 157.318 ms 0.05% 635.657M
I32 I32 UNIFORM 2 11x 155.883 ms 0.03% 155.879 ms 0.03% 641.524M
I32 I32 UNIFORM 4 11x 154.608 ms 0.02% 154.601 ms 0.01% 646.825M
I32 I32 UNIFORM 8 11x 153.095 ms 0.02% 153.089 ms 0.02% 653.216M
I32 I32 UNIFORM 16 11x 150.484 ms 0.01% 150.477 ms 0.01% 664.553M
I64 I64 UNIFORM 1 11x 163.568 ms 0.03% 163.564 ms 0.03% 611.383M
I64 I64 UNIFORM 2 11x 160.726 ms 0.02% 160.719 ms 0.02% 622.202M
I64 I64 UNIFORM 4 11x 160.739 ms 0.02% 160.732 ms 0.01% 622.153M
I64 I64 UNIFORM 8 11x 160.102 ms 0.02% 160.097 ms 0.02% 624.621M
I64 I64 UNIFORM 16 11x 158.970 ms 0.03% 158.965 ms 0.03% 629.068M

static_set_insert_or_apply_uniform_occupancy

Key Value Distribution Occupancy Samples CPU Time Noise GPU Time Noise Elem/s
I32 I32 UNIFORM 0.1 11x 155.013 ms 0.02% 155.007 ms 0.01% 645.134M
I32 I32 UNIFORM 0.2 11x 154.275 ms 0.02% 154.270 ms 0.02% 648.216M
I32 I32 UNIFORM 0.3 11x 153.951 ms 0.02% 153.947 ms 0.02% 649.576M
I32 I32 UNIFORM 0.4 11x 153.518 ms 0.05% 153.512 ms 0.04% 651.415M
I32 I32 UNIFORM 0.5 11x 153.109 ms 0.04% 153.104 ms 0.04% 653.153M
I32 I32 UNIFORM 0.6 11x 152.931 ms 0.06% 152.926 ms 0.06% 653.911M
I32 I32 UNIFORM 0.7 11x 153.304 ms 0.04% 153.299 ms 0.04% 652.319M
I32 I32 UNIFORM 0.8 11x 155.583 ms 0.09% 155.578 ms 0.09% 642.763M
I32 I32 UNIFORM 0.9 11x 166.330 ms 0.07% 166.325 ms 0.07% 601.233M
I64 I64 UNIFORM 0.1 11x 161.143 ms 0.02% 161.138 ms 0.02% 620.586M
I64 I64 UNIFORM 0.2 11x 161.003 ms 0.06% 160.998 ms 0.06% 621.125M
I64 I64 UNIFORM 0.3 11x 160.691 ms 0.10% 160.686 ms 0.10% 622.332M
I64 I64 UNIFORM 0.4 11x 160.306 ms 0.11% 160.301 ms 0.11% 623.826M
I64 I64 UNIFORM 0.5 11x 160.245 ms 0.10% 160.239 ms 0.10% 624.067M
I64 I64 UNIFORM 0.6 11x 160.628 ms 0.12% 160.623 ms 0.12% 622.575M
I64 I64 UNIFORM 0.7 11x 161.962 ms 0.12% 161.956 ms 0.12% 617.450M
I64 I64 UNIFORM 0.8 11x 169.040 ms 0.07% 169.036 ms 0.07% 591.591M
I64 I64 UNIFORM 0.9 11x 192.509 ms 0.07% 192.505 ms 0.07% 519.466M

Member

@PointKernel PointKernel left a comment

The first pass. Very clean code!

benchmarks/hash_table/static_map/insert_or_apply_bench.cu (outdated, resolved)
include/cuco/detail/static_map/kernels.cuh (resolved)
include/cuco/detail/static_map/static_map_ref.inl (outdated, resolved)
include/cuco/detail/static_map/static_map_ref.inl (outdated, resolved)
include/cuco/detail/static_map/static_map_ref.inl (outdated, resolved)
include/cuco/detail/static_map/static_map_ref.inl (outdated, resolved)
include/cuco/detail/static_map/static_map_ref.inl (outdated, resolved)
tests/static_map/insert_or_apply_test.cu (outdated, resolved)
tests/static_map/insert_or_apply_test.cu (outdated, resolved)
tests/static_map/insert_or_apply_test.cu (outdated, resolved)
@srinivasyadav18
Contributor Author

Benchmark Results with Higher Multiplicity [1, 2, ... 128]

static_map_insert_or_apply_uniform_multiplicity

[0] Tesla T4

Key Value Distribution Multiplicity UseIdentity Samples CPU Time Noise GPU Time Noise Elem/s
I32 I32 UNIFORM 1 0 11x 144.949 ms 0.02% 144.940 ms 0.01% 689.940M
I32 I32 UNIFORM 2 0 11x 144.963 ms 0.02% 144.956 ms 0.02% 689.864M
I32 I32 UNIFORM 4 0 11x 144.360 ms 0.02% 144.352 ms 0.01% 692.752M
I32 I32 UNIFORM 8 0 11x 143.056 ms 0.02% 143.047 ms 0.01% 699.072M
I32 I32 UNIFORM 16 0 11x 140.739 ms 0.02% 140.733 ms 0.02% 710.564M
I32 I32 UNIFORM 32 0 11x 136.065 ms 0.05% 136.060 ms 0.05% 734.970M
I32 I32 UNIFORM 64 0 11x 127.536 ms 0.03% 127.528 ms 0.03% 784.139M
I32 I32 UNIFORM 128 0 11x 110.777 ms 0.04% 110.771 ms 0.04% 902.761M
I32 I32 UNIFORM 1 1 11x 145.787 ms 0.03% 145.781 ms 0.03% 685.962M
I32 I32 UNIFORM 2 1 11x 145.518 ms 0.09% 145.511 ms 0.09% 687.233M
I32 I32 UNIFORM 4 1 11x 144.732 ms 0.07% 144.726 ms 0.07% 690.963M
I32 I32 UNIFORM 8 1 11x 143.369 ms 0.03% 143.359 ms 0.03% 697.549M
I32 I32 UNIFORM 16 1 11x 140.970 ms 0.08% 140.965 ms 0.08% 709.397M
I32 I32 UNIFORM 32 1 11x 136.240 ms 0.06% 136.234 ms 0.06% 734.032M
I32 I32 UNIFORM 64 1 11x 127.631 ms 0.08% 127.626 ms 0.08% 783.541M
I32 I32 UNIFORM 128 1 11x 110.880 ms 0.06% 110.871 ms 0.05% 901.953M
I64 I64 UNIFORM 1 0 11x 151.494 ms 0.08% 151.489 ms 0.08% 660.113M
I64 I64 UNIFORM 2 0 11x 151.776 ms 0.14% 151.772 ms 0.14% 658.885M
I64 I64 UNIFORM 4 0 11x 151.188 ms 0.12% 151.182 ms 0.12% 661.453M
I64 I64 UNIFORM 8 0 11x 150.324 ms 0.11% 150.319 ms 0.11% 665.250M
I64 I64 UNIFORM 16 0 11x 149.007 ms 0.12% 149.001 ms 0.12% 671.134M
I64 I64 UNIFORM 32 0 11x 146.713 ms 0.10% 146.708 ms 0.10% 681.625M
I64 I64 UNIFORM 64 0 11x 142.196 ms 0.09% 142.191 ms 0.09% 703.281M
I64 I64 UNIFORM 128 0 11x 134.121 ms 0.07% 134.116 ms 0.07% 745.622M
I64 I64 UNIFORM 1 1 11x 151.800 ms 0.12% 151.795 ms 0.12% 658.784M
I64 I64 UNIFORM 2 1 11x 152.036 ms 0.13% 152.029 ms 0.13% 657.768M
I64 I64 UNIFORM 4 1 11x 151.485 ms 0.16% 151.480 ms 0.16% 660.154M
I64 I64 UNIFORM 8 1 11x 150.548 ms 0.12% 150.542 ms 0.12% 664.264M
I64 I64 UNIFORM 16 1 11x 149.221 ms 0.13% 149.216 ms 0.13% 670.170M
I64 I64 UNIFORM 32 1 11x 146.844 ms 0.09% 146.839 ms 0.09% 681.018M
I64 I64 UNIFORM 64 1 11x 142.461 ms 0.08% 142.456 ms 0.08% 701.973M
I64 I64 UNIFORM 128 1 11x 134.240 ms 0.05% 134.234 ms 0.05% 744.966M

static_set_insert_or_apply_uniform_occupancy

[0] Tesla T4

Key Value Distribution Occupancy UseIdentity Samples CPU Time Noise GPU Time Noise Elem/s
I32 I32 UNIFORM 0.1 0 11x 144.656 ms 0.10% 144.651 ms 0.10% 691.320M
I32 I32 UNIFORM 0.2 0 11x 144.288 ms 0.13% 144.280 ms 0.12% 693.099M
I32 I32 UNIFORM 0.3 0 11x 144.075 ms 0.11% 144.069 ms 0.11% 694.110M
I32 I32 UNIFORM 0.4 0 11x 143.949 ms 0.10% 143.943 ms 0.10% 694.721M
I32 I32 UNIFORM 0.5 0 11x 143.906 ms 0.11% 143.901 ms 0.11% 694.923M
I32 I32 UNIFORM 0.6 0 11x 144.216 ms 0.06% 144.210 ms 0.06% 693.431M
I32 I32 UNIFORM 0.7 0 11x 144.986 ms 0.05% 144.981 ms 0.05% 689.745M
I32 I32 UNIFORM 0.8 0 11x 147.738 ms 0.03% 147.733 ms 0.03% 676.895M
I32 I32 UNIFORM 0.9 0 11x 158.990 ms 0.03% 158.985 ms 0.03% 628.991M
I32 I32 UNIFORM 0.1 1 11x 145.038 ms 0.21% 145.033 ms 0.21% 689.501M
I32 I32 UNIFORM 0.2 1 11x 144.857 ms 0.15% 144.850 ms 0.15% 690.368M
I32 I32 UNIFORM 0.3 1 11x 144.537 ms 0.11% 144.530 ms 0.11% 691.897M
I32 I32 UNIFORM 0.4 1 11x 144.314 ms 0.08% 144.309 ms 0.08% 692.956M
I32 I32 UNIFORM 0.5 1 11x 144.310 ms 0.08% 144.305 ms 0.08% 692.979M
I32 I32 UNIFORM 0.6 1 11x 144.558 ms 0.05% 144.552 ms 0.05% 691.794M
I32 I32 UNIFORM 0.7 1 11x 145.342 ms 0.04% 145.336 ms 0.04% 688.059M
I32 I32 UNIFORM 0.8 1 11x 147.958 ms 0.03% 147.952 ms 0.03% 675.894M
I32 I32 UNIFORM 0.9 1 11x 159.325 ms 0.05% 159.319 ms 0.05% 627.671M
I64 I64 UNIFORM 0.1 0 11x 150.797 ms 0.12% 150.792 ms 0.12% 663.166M
I64 I64 UNIFORM 0.2 0 11x 150.744 ms 0.11% 150.739 ms 0.11% 663.398M
I64 I64 UNIFORM 0.3 0 11x 150.604 ms 0.08% 150.598 ms 0.08% 664.018M
I64 I64 UNIFORM 0.4 0 11x 150.692 ms 0.07% 150.687 ms 0.07% 663.629M
I64 I64 UNIFORM 0.5 0 11x 151.041 ms 0.05% 151.035 ms 0.05% 662.096M
I64 I64 UNIFORM 0.6 0 11x 151.906 ms 0.05% 151.899 ms 0.05% 658.333M
I64 I64 UNIFORM 0.7 0 11x 153.597 ms 0.06% 153.591 ms 0.06% 651.078M
I64 I64 UNIFORM 0.8 0 11x 161.468 ms 0.03% 161.462 ms 0.03% 619.341M
I64 I64 UNIFORM 0.9 0 11x 185.222 ms 0.14% 185.217 ms 0.14% 539.907M
I64 I64 UNIFORM 0.1 1 11x 150.990 ms 0.11% 150.984 ms 0.11% 662.323M
I64 I64 UNIFORM 0.2 1 11x 150.858 ms 0.09% 150.852 ms 0.09% 662.903M
I64 I64 UNIFORM 0.3 1 11x 150.779 ms 0.07% 150.773 ms 0.07% 663.248M
I64 I64 UNIFORM 0.4 1 11x 150.840 ms 0.06% 150.834 ms 0.06% 662.979M
I64 I64 UNIFORM 0.5 1 11x 151.153 ms 0.04% 151.147 ms 0.04% 661.608M
I64 I64 UNIFORM 0.6 1 11x 152.031 ms 0.05% 152.025 ms 0.05% 657.786M
I64 I64 UNIFORM 0.7 1 11x 153.696 ms 0.07% 153.690 ms 0.07% 650.659M
I64 I64 UNIFORM 0.8 1 11x 161.600 ms 0.04% 161.594 ms 0.04% 618.834M
I64 I64 UNIFORM 0.9 1 11x 185.493 ms 0.12% 185.487 ms 0.12% 539.121M

This reverts commit 45adafc, because adding the identity element optimization gives no performance improvement.
@sleeepyjack
Collaborator

/ok to test

include/cuco/detail/static_map/static_map_ref.inl (outdated, resolved)
include/cuco/detail/static_map/static_map_ref.inl (outdated, resolved)
ref_type& ref_ = static_cast<ref_type&>(*this);
auto const expected_key = ref_.impl_.empty_slot_sentinel().first;

auto old_key =
Member

Ah OK, so we are not using any of the CAS algorithms but simply CASing keys all the time. We had long discussions here, so I may be missing something obvious: what are the concerns with using packed CAS in this case?

Contributor Author

Oops! This is a leftover. We don't use attempt_insert_or_apply anymore, as it can only work when sentinel == identity_element. In that situation a CAS on the key alone is enough, because the atomic operations on the payload can be applied in any order, as long as the sentinel key is replaced exactly once.

Currently we can only use attempt_insert_stable, which performs a stable insert and does not rely on identity_element.
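
The condition sentinel == identity_element can be illustrated with a small single-slot model in host C++. This is only a sketch under assumptions: `slot`, `empty_key`, and `key_cas_then_add` are hypothetical names, not the PR's code, and the demo runs sequentially to show why the initial payload value matters rather than to reproduce real GPU concurrency.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical single-slot model; none of these names come from cuco.
constexpr std::int32_t empty_key = -1;

struct slot {
  std::atomic<std::int32_t> key;
  std::atomic<std::int32_t> payload;
  // payload_sentinel is the value an empty slot's payload is initialized with.
  explicit slot(std::int32_t payload_sentinel) : key{empty_key}, payload{payload_sentinel} {}
};

// Key-only CAS followed by an atomic add on the payload.
// Returns false if the slot already belongs to a different key.
inline bool key_cas_then_add(slot& s, std::int32_t key, std::int32_t value)
{
  std::int32_t expected = empty_key;
  // Try to claim the slot. On failure, `expected` holds the key stored there.
  if (!s.key.compare_exchange_strong(expected, key) && expected != key) { return false; }
  // Claimer and late-comers do the same thing: add into the payload.
  // The adds commute, so their order is irrelevant, but the result is only
  // correct if the payload started at the identity of the operation (0 for +).
  s.payload.fetch_add(value);
  return true;
}
```

With `slot a{0}`, adding 3 and then 4 for the same key yields a payload of 7. With `slot b{-1}`, where the payload sentinel is not the identity of addition, the same two calls yield 6, which is why this shortcut cannot be used for arbitrary sentinels.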

Member

Could we remove attempt_insert_or_apply if it's no longer used?

@PointKernel PointKernel added the labels topic: static_map, topic: performance, type: feature request, and Needs Review on Jul 1, 2024
@PointKernel
Member

/ok to test

Collaborator

@sleeepyjack sleeepyjack left a comment

Looks great. Most of my comments are small doc fixes.

include/cuco/detail/static_map/kernels.cuh (outdated, resolved)
include/cuco/detail/static_map/kernels.cuh (outdated, resolved)
include/cuco/detail/static_map/kernels.cuh (outdated, resolved)
include/cuco/detail/static_map/static_map.inl (outdated, resolved)
include/cuco/static_map_ref.cuh (resolved)
@srinivasyadav18
Contributor Author

Benchmark Results of cuco::insert_or_apply vs thrust::sort+reduce_by_key [BENCHMARK_CODE]

static_map_insert_or_apply_uniform_multiplicity : only insert_or_apply_async

[0] NVIDIA H100

Key Value Distribution Multiplicity NumInputs Samples CPU Time Noise GPU Time Noise Elem/s
I32 I32 UNIFORM 1 1 108080x 9.917 us 114.69% 4.626 us 1.56% 216.159K
I32 I32 UNIFORM 1 100 78567x 11.625 us 82.82% 6.364 us 1.44% 15.713M
I32 I32 UNIFORM 10 100 76124x 11.718 us 78.62% 6.568 us 2.13% 15.225M
I32 I32 UNIFORM 1 10000 53776x 14.521 us 56.27% 9.298 us 1.79% 1.076G
I32 I32 UNIFORM 10 10000 59964x 13.701 us 64.44% 8.338 us 1.96% 1.199G
I32 I32 UNIFORM 1000 10000 22427x 27.443 us 23.16% 22.295 us 0.81% 448.537M
I32 I32 UNIFORM 1 1000000 8592x 63.551 us 9.23% 58.195 us 0.32% 17.184G
I32 I32 UNIFORM 10 1000000 10644x 52.405 us 11.60% 46.977 us 0.71% 21.287G
I32 I32 UNIFORM 1000 1000000 8076x 67.190 us 8.63% 61.915 us 1.23% 16.151G
I32 I32 UNIFORM 100000 1000000 607x 829.849 us 0.69% 824.545 us 0.25% 1.213G
I32 I32 UNIFORM 1 100000000 40x 12.814 ms 0.05% 12.808 ms 0.02% 7.808G
I32 I32 UNIFORM 10 100000000 50x 10.193 ms 0.06% 10.188 ms 0.01% 9.816G
I32 I32 UNIFORM 1000 100000000 146x 3.446 ms 0.17% 3.440 ms 0.03% 29.072G
I32 I32 UNIFORM 100000 100000000 101x 4.961 ms 0.20% 4.955 ms 0.16% 20.182G
I32 I32 UNIFORM 10000000 100000000 11x 76.129 ms 0.03% 76.123 ms 0.03% 1.314G
I32 I32 UNIFORM 1 1000000000 11x 130.526 ms 0.01% 130.521 ms 0.01% 7.662G
I32 I32 UNIFORM 10 1000000000 11x 128.667 ms 0.41% 128.661 ms 0.41% 7.772G
I32 I32 UNIFORM 1000 1000000000 15x 33.494 ms 0.19% 33.488 ms 0.19% 29.861G
I32 I32 UNIFORM 100000 1000000000 17x 30.873 ms 0.34% 30.867 ms 0.34% 32.397G
I32 I32 UNIFORM 10000000 1000000000 11x 233.728 ms 0.01% 233.722 ms 0.01% 4.279G
I32 I32 UNIFORM 1000000000 1000000000 11x 766.520 ms 0.00% 766.515 ms 0.00% 1.305G
I64 I64 UNIFORM 1 1 92877x 10.672 us 98.38% 5.384 us 1.71% 185.752K
I64 I64 UNIFORM 1 100 83656x 11.158 us 86.88% 5.977 us 1.14% 16.731M
I64 I64 UNIFORM 10 100 74783x 11.954 us 78.92% 6.686 us 1.60% 14.957M
I64 I64 UNIFORM 1 10000 51819x 14.927 us 54.81% 9.649 us 2.01% 1.036G
I64 I64 UNIFORM 10 10000 52624x 14.829 us 56.19% 9.501 us 1.71% 1.052G
I64 I64 UNIFORM 1000 10000 22010x 27.877 us 22.81% 22.717 us 1.48% 440.193M
I64 I64 UNIFORM 1 1000000 6661x 80.488 us 7.25% 75.066 us 0.39% 13.322G
I64 I64 UNIFORM 10 1000000 9887x 55.774 us 10.32% 50.572 us 0.44% 19.774G
I64 I64 UNIFORM 1000 1000000 11859x 47.613 us 12.97% 42.163 us 0.84% 23.717G
I64 I64 UNIFORM 100000 1000000 728x 692.888 us 0.82% 687.483 us 0.24% 1.455G
I64 I64 UNIFORM 1 100000000 36x 14.278 ms 0.05% 14.272 ms 0.02% 7.007G
I64 I64 UNIFORM 10 100000000 39x 12.844 ms 0.05% 12.838 ms 0.02% 7.789G
I64 I64 UNIFORM 1000 100000000 138x 3.649 ms 0.16% 3.643 ms 0.02% 27.448G
I64 I64 UNIFORM 100000 100000000 164x 3.065 ms 0.21% 3.059 ms 0.10% 32.689G
I64 I64 UNIFORM 10000000 100000000 11x 64.236 ms 0.03% 64.230 ms 0.03% 1.557G
I64 I64 UNIFORM 1 1000000000 11x 143.398 ms 0.02% 143.392 ms 0.02% 6.974G
I64 I64 UNIFORM 10 1000000000 11x 143.681 ms 0.46% 143.675 ms 0.46% 6.960G
I64 I64 UNIFORM 1000 1000000000 14x 38.440 ms 0.13% 38.435 ms 0.13% 26.018G
I64 I64 UNIFORM 100000 1000000000 16x 31.523 ms 0.21% 31.517 ms 0.21% 31.729G
I64 I64 UNIFORM 10000000 1000000000 11x 150.555 ms 0.12% 150.549 ms 0.12% 6.642G
I64 I64 UNIFORM 1000000000 1000000000 11x 785.048 ms 0.01% 785.042 ms 0.01% 1.274G
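
For reading these tables, the Elem/s column appears to be NumInputs divided by GPU time (an inference from the numbers, not something stated in the thread). A tiny helper, with the hypothetical name `elements_per_second`, makes the arithmetic explicit:

```cpp
// Hypothetical helper: throughput as inputs processed per second of GPU time.
inline double elements_per_second(double num_inputs, double gpu_time_seconds)
{
  return num_inputs / gpu_time_seconds;
}
```

For example, the I32 row with 100000000 inputs and multiplicity 1 reports 12.808 ms of GPU time, and 1e8 / 12.808e-3 s is roughly 7.8e9 elements per second, in line with the 7.808G shown (small differences come from rounding of the printed time).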

static_map_insert_or_apply_sync_uniform_multiplicity: map_construction + insert_or_apply + retrieve_all

[0] NVIDIA H100

Key Value Distribution Multiplicity NumInputs Samples CPU Time Noise GPU Time Noise Elem/s
I32 I32 UNIFORM 1 1 14236x 39.764 us 16.43% 35.124 us 9.74% 28.471K
I32 I32 UNIFORM 1 100 14147x 39.965 us 15.98% 35.345 us 9.18% 2.829M
I32 I32 UNIFORM 10 100 13836x 40.725 us 14.58% 36.139 us 7.13% 2.767M
I32 I32 UNIFORM 1 10000 13470x 41.774 us 12.83% 37.121 us 2.58% 269.391M
I32 I32 UNIFORM 10 10000 13441x 41.801 us 12.72% 37.200 us 2.82% 268.817M
I32 I32 UNIFORM 1000 10000 10283x 53.245 us 12.03% 48.626 us 7.39% 205.653M
I32 I32 UNIFORM 1 1000000 2574x 199.268 us 3.29% 194.322 us 2.04% 5.146G
I32 I32 UNIFORM 10 1000000 3883x 133.326 us 4.09% 128.790 us 2.06% 7.765G
I32 I32 UNIFORM 1000 1000000 3498x 147.518 us 4.13% 142.968 us 2.57% 6.995G
I32 I32 UNIFORM 100000 1000000 564x 891.902 us 0.66% 887.355 us 0.42% 1.127G
I32 I32 UNIFORM 1 100000000 31x 16.513 ms 0.09% 16.507 ms 0.08% 6.058G
I32 I32 UNIFORM 10 100000000 47x 10.862 ms 0.08% 10.856 ms 0.06% 9.211G
I32 I32 UNIFORM 1000 100000000 142x 3.534 ms 0.15% 3.530 ms 0.08% 28.331G
I32 I32 UNIFORM 100000 100000000 99x 5.069 ms 0.18% 5.064 ms 0.15% 19.746G
I32 I32 UNIFORM 10000000 100000000 11x 76.669 ms 0.04% 76.665 ms 0.04% 1.304G
I32 I32 UNIFORM 1 1000000000 11x 238.584 ms 34.18% 238.579 ms 34.18% 4.191G
I32 I32 UNIFORM 10 1000000000 11x 133.714 ms 1.42% 133.708 ms 1.42% 7.479G
I32 I32 UNIFORM 1000 1000000000 15x 33.993 ms 2.28% 33.988 ms 2.28% 29.422G
I32 I32 UNIFORM 100000 1000000000 17x 31.042 ms 0.76% 31.038 ms 0.76% 32.219G
I32 I32 UNIFORM 10000000 1000000000 11x 234.404 ms 0.01% 234.400 ms 0.01% 4.266G
I32 I32 UNIFORM 1000000000 1000000000 11x 768.069 ms 0.01% 768.066 ms 0.01% 1.302G
I64 I64 UNIFORM 1 1 14169x 39.907 us 16.06% 35.290 us 9.30% 28.337K
I64 I64 UNIFORM 1 100 14221x 39.737 us 16.06% 35.161 us 9.39% 2.844M
I64 I64 UNIFORM 10 100 14625x 38.853 us 17.29% 34.189 us 10.59% 2.925M
I64 I64 UNIFORM 1 10000 13542x 41.516 us 12.78% 36.923 us 2.76% 270.833M
I64 I64 UNIFORM 10 10000 13426x 41.846 us 12.71% 37.243 us 2.85% 268.510M
I64 I64 UNIFORM 1000 10000 9734x 55.970 us 10.91% 51.369 us 6.25% 194.672M
I64 I64 UNIFORM 1 1000000 2235x 228.915 us 2.76% 223.715 us 1.44% 4.470G
I64 I64 UNIFORM 10 1000000 2725x 188.394 us 3.48% 183.536 us 2.17% 5.449G
I64 I64 UNIFORM 1000 1000000 4143x 125.205 us 4.43% 120.699 us 2.35% 8.285G
I64 I64 UNIFORM 100000 1000000 672x 749.144 us 0.78% 744.651 us 0.50% 1.343G
I64 I64 UNIFORM 1 100000000 24x 20.894 ms 0.14% 20.888 ms 0.13% 4.787G
I64 I64 UNIFORM 10 100000000 36x 14.156 ms 0.08% 14.150 ms 0.07% 7.067G
I64 I64 UNIFORM 1000 100000000 133x 3.785 ms 0.16% 3.780 ms 0.08% 26.456G
I64 I64 UNIFORM 100000 100000000 159x 3.151 ms 0.20% 3.146 ms 0.13% 31.782G
I64 I64 UNIFORM 10000000 100000000 11x 64.586 ms 0.03% 64.581 ms 0.03% 1.548G
I64 I64 UNIFORM 1 1000000000 11x 831.910 ms 0.45% 831.907 ms 0.45% 1.202G
I64 I64 UNIFORM 10 1000000000 11x 150.437 ms 0.95% 150.431 ms 0.95% 6.648G
I64 I64 UNIFORM 1000 1000000000 13x 39.079 ms 1.28% 39.074 ms 1.28% 25.592G
I64 I64 UNIFORM 100000 1000000000 16x 32.446 ms 0.22% 32.441 ms 0.22% 30.825G
I64 I64 UNIFORM 10000000 1000000000 11x 152.054 ms 1.00% 152.050 ms 1.00% 6.577G
I64 I64 UNIFORM 1000000000 1000000000 11x 786.932 ms 0.01% 786.929 ms 0.01% 1.271G

thrust_insert_or_apply_uniform_multiplicity: thrust sort_by_key + reduce_by_key

[0] NVIDIA H100

Key Value Distribution Multiplicity NumInputs Samples CPU Time Noise GPU Time Noise Elem/s
I32 I32 UNIFORM 1 1 8516x 63.275 us 7.92% 58.717 us 1.52% 17.031K
I32 I32 UNIFORM 1 100 8473x 63.591 us 8.07% 59.013 us 2.10% 1.695M
I32 I32 UNIFORM 10 100 8483x 63.517 us 8.06% 58.944 us 2.07% 1.697M
I32 I32 UNIFORM 1 10000 5860x 89.776 us 5.34% 85.326 us 1.14% 117.197M
I32 I32 UNIFORM 10 10000 6292x 83.924 us 5.74% 79.466 us 1.22% 125.839M
I32 I32 UNIFORM 1000 10000 6305x 83.813 us 5.80% 79.314 us 1.19% 126.082M
I32 I32 UNIFORM 1 1000000 1577x 322.285 us 1.97% 317.201 us 1.14% 3.153G
I32 I32 UNIFORM 10 1000000 1584x 320.971 us 2.50% 315.791 us 1.84% 3.167G
I32 I32 UNIFORM 1000 1000000 1623x 313.149 us 1.96% 308.100 us 1.06% 3.246G
I32 I32 UNIFORM 100000 1000000 1665x 305.556 us 2.05% 300.457 us 1.13% 3.328G
I32 I32 UNIFORM 1 100000000 77x 6.534 ms 0.25% 6.528 ms 0.23% 15.318G
I32 I32 UNIFORM 10 100000000 82x 6.106 ms 0.24% 6.100 ms 0.22% 16.393G
I32 I32 UNIFORM 1000 100000000 87x 5.792 ms 0.21% 5.786 ms 0.18% 17.284G
I32 I32 UNIFORM 100000 100000000 88x 5.744 ms 0.22% 5.738 ms 0.19% 17.427G
I32 I32 UNIFORM 10000000 100000000 88x 5.725 ms 0.18% 5.719 ms 0.15% 17.487G
I32 I32 UNIFORM 1 1000000000 11x 50.674 ms 0.97% 50.669 ms 0.97% 19.736G
I32 I32 UNIFORM 10 1000000000 11x 133.028 ms 33.99% 133.023 ms 33.99% 7.517G
I32 I32 UNIFORM 1000 1000000000 11x 132.743 ms 35.40% 132.738 ms 35.40% 7.534G
I32 I32 UNIFORM 100000 1000000000 11x 133.029 ms 35.49% 133.024 ms 35.49% 7.517G
I32 I32 UNIFORM 10000000 1000000000 11x 133.027 ms 35.79% 133.022 ms 35.79% 7.518G
I32 I32 UNIFORM 1000000000 1000000000 11x 147.244 ms 24.01% 147.239 ms 24.01% 6.792G
I64 I64 UNIFORM 1 1 10579x 51.913 us 10.08% 47.266 us 2.13% 21.157K
I64 I64 UNIFORM 1 100 9541x 56.985 us 9.06% 52.409 us 2.26% 1.908M
I64 I64 UNIFORM 10 100 9409x 57.759 us 9.03% 53.145 us 2.32% 1.882M
I64 I64 UNIFORM 1 10000 5607x 93.703 us 5.34% 89.183 us 1.60% 112.129M
I64 I64 UNIFORM 10 10000 5634x 93.304 us 5.28% 88.762 us 1.27% 112.661M
I64 I64 UNIFORM 1000 10000 5616x 93.621 us 5.31% 89.037 us 1.22% 112.313M
I64 I64 UNIFORM 1 1000000 1029x 491.459 us 1.65% 486.352 us 1.24% 2.056G
I64 I64 UNIFORM 10 1000000 1038x 486.752 us 1.29% 481.742 us 0.75% 2.076G
I64 I64 UNIFORM 1000 1000000 1033x 489.316 us 1.61% 484.149 us 1.18% 2.065G
I64 I64 UNIFORM 100000 1000000 1041x 485.532 us 1.48% 480.499 us 1.03% 2.081G
I64 I64 UNIFORM 1 100000000 15x 34.288 ms 0.05% 34.282 ms 0.05% 2.917G
I64 I64 UNIFORM 10 100000000 15x 34.320 ms 0.06% 34.314 ms 0.06% 2.914G
I64 I64 UNIFORM 1000 100000000 15x 34.293 ms 0.06% 34.287 ms 0.06% 2.917G
I64 I64 UNIFORM 100000 100000000 15x 34.312 ms 0.05% 34.307 ms 0.05% 2.915G
I64 I64 UNIFORM 10000000 100000000 15x 34.303 ms 0.12% 34.298 ms 0.12% 2.916G
I64 I64 UNIFORM 1 1000000000 11x 375.076 ms 0.05% 375.071 ms 0.05% 2.666G
I64 I64 UNIFORM 10 1000000000 11x 423.237 ms 40.13% 423.232 ms 40.13% 2.363G
I64 I64 UNIFORM 1000 1000000000 11x 371.371 ms 0.23% 371.366 ms 0.23% 2.693G
I64 I64 UNIFORM 100000 1000000000 11x 371.471 ms 0.21% 371.466 ms 0.21% 2.692G
I64 I64 UNIFORM 10000000 1000000000 11x 371.516 ms 0.26% 371.511 ms 0.26% 2.692G
I64 I64 UNIFORM 1000000000 1000000000 11x 381.316 ms 8.38% 381.311 ms 8.38% 2.623G
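
The sort-based baseline measured above (sort_by_key followed by reduce_by_key) can be sketched on the host as follows. This is an illustrative stand-in using the C++ standard library, not the benchmark code: `sort_then_reduce_by_key` is a hypothetical name and the reduction is fixed to addition.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for thrust::sort_by_key + thrust::reduce_by_key.
template <typename Key, typename T>
std::vector<std::pair<Key, T>> sort_then_reduce_by_key(std::vector<std::pair<Key, T>> pairs)
{
  // Step 1: bring equal keys next to each other.
  std::stable_sort(pairs.begin(), pairs.end(),
                   [](auto const& a, auto const& b) { return a.first < b.first; });

  // Step 2: collapse each run of equal keys into one pair by summing the values.
  std::vector<std::pair<Key, T>> out;
  for (auto const& kv : pairs) {
    if (!out.empty() && out.back().first == kv.first) {
      out.back().second += kv.second;
    } else {
      out.push_back(kv);
    }
  }
  return out;
}
```

The sort touches every input pair regardless of how many distinct keys there are, which is consistent with the thrust timings above being nearly flat across multiplicities for a given input size, whereas the hash-based timings vary strongly with multiplicity.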

Member

@PointKernel PointKernel left a comment

Some final cleanup requests. Great job! @srinivasyadav18

include/cuco/detail/static_map/static_map_ref.inl (outdated, resolved)

@PointKernel
Member

/ok to test

@PointKernel
Member

/ok to test

@PointKernel
Member

/ok to test

Member

@PointKernel PointKernel left a comment

Looks great! Thank you for addressing all my reviews. I'm really happy with the final outcome of this work.

@PointKernel
Member

> The table of results is kind of hard to read, is there a plot of comparison between this code and using a sort based approach?

The quick takeaway is that the hash-based insert_or_apply method outperforms the sort-based solution in almost all cases. However, there is one particular scenario where the sort-based solution is more efficient: when dealing with very large data with extremely low cardinality, such as inserting the same value 1 billion times.

I have run out of my daily advanced data analysis quota with free ChatGPT. However, if you copy-paste the above table into ChatGPT, it will summarize the results effectively and generate a nice Python script to visualize the data.

@sleeepyjack
Collaborator

sleeepyjack commented Jul 4, 2024

ChatGPT Pro shill here coming in hot with some AI analysis fresh outta the LLM oven. I was presented with this fancy 3D plotly graph. No clue how accurately it represents the numbers I've shown it: https://nvidia-my.sharepoint.com/:u:/p/djuenger/ETLjQTGXXZJKiDapTGruSyMBxqaMfgZXi2QMZUAfKaBoTw?e=lqabSu

(you have to download the file and open it locally since OneDrive preview can't comprehend this level of AI awesomeness)

Collaborator

@sleeepyjack sleeepyjack left a comment

Awesome work! 👏

@mfbalin

mfbalin commented Jul 4, 2024

@sleeepyjack
Collaborator

@mfbalin right, there's an organization policy in place that prevents me from generating publicly accessible links to NV-hosted files. I uploaded the file to a public gist here: https://gist.github.com/sleeepyjack/8b038de36416af38d3284dabdf99814b

@mfbalin

mfbalin commented Jul 4, 2024

> @mfbalin right, there's an organization policy in place that prevents me from generating publicly accessible links to NV-hosted files. I uploaded the file to a public gist here: https://gist.github.com/sleeepyjack/8b038de36416af38d3284dabdf99814b

3D plots are surprisingly hard to read :). Seems like the performance is great, unless the multiplicity is too large or the hash table does not fit in the cache (L2?).

@sleeepyjack
Collaborator

Yep, we already have some ideas up our sleeves to improve performance in high-multiplicity scenarios. Stay tuned for some of @srinivasyadav18's upcoming PRs ;)

@mfbalin

mfbalin commented Jul 4, 2024

> Yep, we already have some ideas up our sleeves to improve performance in high-multiplicity scenarios. Stay tuned for some of @srinivasyadav18's upcoming PRs ;)

A shared-memory hash table per block that does the reduction first and then commits to the main hash table seems like a way to improve it. Not sure if it would slow down the general case though. I wonder if you had a different idea :). I will be watching the developments closely.
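
The two-level idea described here can be modelled on the host as a rough sketch. Everything in it is an assumption for illustration: `two_level_sum` is a hypothetical name, a contiguous chunk of the input plays the role of a thread block, a `std::unordered_map` plays the role of the per-block shared-memory table, and the reduction is fixed to addition.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical host-side model of "reduce per block first, then commit".
// If global_updates is non-null, it receives the number of writes made to
// the global table, which is the contention the pre-reduction tries to cut.
template <typename Key, typename T>
std::unordered_map<Key, T> two_level_sum(std::vector<std::pair<Key, T>> const& pairs,
                                         std::size_t block_size,
                                         std::size_t* global_updates = nullptr)
{
  std::unordered_map<Key, T> global;
  std::size_t updates = 0;
  if (block_size == 0) { block_size = 1; }  // guard against an empty block

  for (std::size_t begin = 0; begin < pairs.size(); begin += block_size) {
    std::size_t const end = std::min(begin + block_size, pairs.size());

    // Level 1: reduce this block's pairs in a small local table.
    std::unordered_map<Key, T> local;
    for (std::size_t i = begin; i < end; ++i) { local[pairs[i].first] += pairs[i].second; }

    // Level 2: commit one partial result per distinct key to the global table.
    for (auto const& kv : local) {
      global[kv.first] += kv.second;
      ++updates;
    }
  }

  if (global_updates != nullptr) { *global_updates = updates; }
  return global;
}
```

With eight pairs that all share one key and a block size of 4, the global table is written twice instead of eight times. When almost every key is distinct, the local pass saves nothing and only adds work, which matches the concern about slowing down the general case.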

@sleeepyjack
Collaborator

That's exactly what we had in mind. 🎯 We're vibing ;)

@PointKernel PointKernel merged commit a7f87ac into NVIDIA:dev Jul 4, 2024
15 checks passed