Add static_map::insert_or_apply aka reduce-by-key #515
/ok to test |
One small nit coming from the Doxygen checker:
|
Benchmark results of reduction using
|
Key | Value | Distribution | Multiplicity | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
---|---|---|---|---|---|---|---|---|---|
I32 | I32 | UNIFORM | 1 | 11x | 157.325 ms | 0.05% | 157.318 ms | 0.05% | 635.657M |
I32 | I32 | UNIFORM | 2 | 11x | 155.883 ms | 0.03% | 155.879 ms | 0.03% | 641.524M |
I32 | I32 | UNIFORM | 4 | 11x | 154.608 ms | 0.02% | 154.601 ms | 0.01% | 646.825M |
I32 | I32 | UNIFORM | 8 | 11x | 153.095 ms | 0.02% | 153.089 ms | 0.02% | 653.216M |
I32 | I32 | UNIFORM | 16 | 11x | 150.484 ms | 0.01% | 150.477 ms | 0.01% | 664.553M |
I64 | I64 | UNIFORM | 1 | 11x | 163.568 ms | 0.03% | 163.564 ms | 0.03% | 611.383M |
I64 | I64 | UNIFORM | 2 | 11x | 160.726 ms | 0.02% | 160.719 ms | 0.02% | 622.202M |
I64 | I64 | UNIFORM | 4 | 11x | 160.739 ms | 0.02% | 160.732 ms | 0.01% | 622.153M |
I64 | I64 | UNIFORM | 8 | 11x | 160.102 ms | 0.02% | 160.097 ms | 0.02% | 624.621M |
I64 | I64 | UNIFORM | 16 | 11x | 158.970 ms | 0.03% | 158.965 ms | 0.03% | 629.068M |
static_set_insert_or_apply_uniform_occupancy
Key | Value | Distribution | Occupancy | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
---|---|---|---|---|---|---|---|---|---|
I32 | I32 | UNIFORM | 0.1 | 11x | 155.013 ms | 0.02% | 155.007 ms | 0.01% | 645.134M |
I32 | I32 | UNIFORM | 0.2 | 11x | 154.275 ms | 0.02% | 154.270 ms | 0.02% | 648.216M |
I32 | I32 | UNIFORM | 0.3 | 11x | 153.951 ms | 0.02% | 153.947 ms | 0.02% | 649.576M |
I32 | I32 | UNIFORM | 0.4 | 11x | 153.518 ms | 0.05% | 153.512 ms | 0.04% | 651.415M |
I32 | I32 | UNIFORM | 0.5 | 11x | 153.109 ms | 0.04% | 153.104 ms | 0.04% | 653.153M |
I32 | I32 | UNIFORM | 0.6 | 11x | 152.931 ms | 0.06% | 152.926 ms | 0.06% | 653.911M |
I32 | I32 | UNIFORM | 0.7 | 11x | 153.304 ms | 0.04% | 153.299 ms | 0.04% | 652.319M |
I32 | I32 | UNIFORM | 0.8 | 11x | 155.583 ms | 0.09% | 155.578 ms | 0.09% | 642.763M |
I32 | I32 | UNIFORM | 0.9 | 11x | 166.330 ms | 0.07% | 166.325 ms | 0.07% | 601.233M |
I64 | I64 | UNIFORM | 0.1 | 11x | 161.143 ms | 0.02% | 161.138 ms | 0.02% | 620.586M |
I64 | I64 | UNIFORM | 0.2 | 11x | 161.003 ms | 0.06% | 160.998 ms | 0.06% | 621.125M |
I64 | I64 | UNIFORM | 0.3 | 11x | 160.691 ms | 0.10% | 160.686 ms | 0.10% | 622.332M |
I64 | I64 | UNIFORM | 0.4 | 11x | 160.306 ms | 0.11% | 160.301 ms | 0.11% | 623.826M |
I64 | I64 | UNIFORM | 0.5 | 11x | 160.245 ms | 0.10% | 160.239 ms | 0.10% | 624.067M |
I64 | I64 | UNIFORM | 0.6 | 11x | 160.628 ms | 0.12% | 160.623 ms | 0.12% | 622.575M |
I64 | I64 | UNIFORM | 0.7 | 11x | 161.962 ms | 0.12% | 161.956 ms | 0.12% | 617.450M |
I64 | I64 | UNIFORM | 0.8 | 11x | 169.040 ms | 0.07% | 169.036 ms | 0.07% | 591.591M |
I64 | I64 | UNIFORM | 0.9 | 11x | 192.509 ms | 0.07% | 192.505 ms | 0.07% | 519.466M |
The first pass. Very clean code!
Benchmark Results with Higher Multiplicity [1, 2, ... 128]
static_map_insert_or_apply_uniform_multiplicity
[0] Tesla T4
static_set_insert_or_apply_uniform_occupancy
[0] Tesla T4
This reverts commit 45adafc, because adding the identity-element optimization gave no performance improvement.
/ok to test |
ref_type& ref_ = static_cast<ref_type&>(*this);
auto const expected_key = ref_.impl_.empty_slot_sentinel().first;

auto old_key =
Ah OK, so we are not using any of the CAS algorithms but simply CASing keys all the time. We had long discussions here, so I may be missing something obvious: what are the concerns with using packed CAS in this case?
Oops! This is a leftover. We don't use attempt_insert_or_apply anymore, as it only works when sentinel == identity_element. In that situation we only need to CAS the key, because the atomic operations on the payload can be applied in any order as long as the sentinel key is updated exactly once.
Currently we can only use attempt_insert_stable, which performs a stable insert and does not rely on identity_element.
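To make the sentinel == identity_element case concrete, here is a minimal host-side sketch in plain C++. It is not the cuco implementation: `Slot`, `Table`, `insert_or_add`, `kEmptyKey`, and `kIdentity` are hypothetical names, `std::atomic` stands in for device atomics, and addition is assumed as the reduction. Every payload starts at the identity, so only the key needs a CAS and the payload update can happen in any order.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side slot; stands in for a device hash-table slot.
struct Slot {
  std::atomic<int> key;
  std::atomic<int> value;
};

constexpr int kEmptyKey = -1;  // assumed key sentinel
constexpr int kIdentity = 0;   // identity of addition, also the initial payload

// Open-addressing table whose payloads are pre-filled with the identity.
struct Table {
  std::vector<Slot> slots;
  explicit Table(std::size_t capacity) : slots(capacity) {
    for (auto& s : slots) {
      s.key.store(kEmptyKey);
      s.value.store(kIdentity);
    }
  }
};

// Linear-probing insert-or-add. Only the key is CASed; the payload update is a
// plain atomic add, which is safe in any order because the slot starts at the
// identity. Returns false if no slot could be found.
bool insert_or_add(Table& t, int key, int value) {
  assert(key != kEmptyKey);
  std::size_t const n = t.slots.size();
  std::size_t idx = static_cast<std::size_t>(key) % n;
  for (std::size_t probes = 0; probes < n; ++probes) {
    Slot& s = t.slots[idx];
    int expected = kEmptyKey;
    // Try to claim an empty slot; on failure `expected` holds the resident key.
    bool const claimed = s.key.compare_exchange_strong(expected, key);
    if (claimed || expected == key) {
      s.value.fetch_add(value);
      return true;
    }
    idx = (idx + 1) % n;  // slot belongs to another key: keep probing
  }
  return false;
}
```

If the initial payload were not the identity of the reduction (for example 0 with a min reduction over positive values), the first update would be combined with a meaningless starting value, which is why this shortcut cannot serve as the general path.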
Could we remove attempt_insert_or_apply if it's no longer used?
/ok to test |
Looks great. Most of my comments are small doc fixes.
Benchmark Results of
|
Key | Value | Distribution | Multiplicity | NumInputs | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
---|---|---|---|---|---|---|---|---|---|---|
I32 | I32 | UNIFORM | 1 | 1 | 108080x | 9.917 us | 114.69% | 4.626 us | 1.56% | 216.159K |
I32 | I32 | UNIFORM | 1 | 100 | 78567x | 11.625 us | 82.82% | 6.364 us | 1.44% | 15.713M |
I32 | I32 | UNIFORM | 10 | 100 | 76124x | 11.718 us | 78.62% | 6.568 us | 2.13% | 15.225M |
I32 | I32 | UNIFORM | 1 | 10000 | 53776x | 14.521 us | 56.27% | 9.298 us | 1.79% | 1.076G |
I32 | I32 | UNIFORM | 10 | 10000 | 59964x | 13.701 us | 64.44% | 8.338 us | 1.96% | 1.199G |
I32 | I32 | UNIFORM | 1000 | 10000 | 22427x | 27.443 us | 23.16% | 22.295 us | 0.81% | 448.537M |
I32 | I32 | UNIFORM | 1 | 1000000 | 8592x | 63.551 us | 9.23% | 58.195 us | 0.32% | 17.184G |
I32 | I32 | UNIFORM | 10 | 1000000 | 10644x | 52.405 us | 11.60% | 46.977 us | 0.71% | 21.287G |
I32 | I32 | UNIFORM | 1000 | 1000000 | 8076x | 67.190 us | 8.63% | 61.915 us | 1.23% | 16.151G |
I32 | I32 | UNIFORM | 100000 | 1000000 | 607x | 829.849 us | 0.69% | 824.545 us | 0.25% | 1.213G |
I32 | I32 | UNIFORM | 1 | 100000000 | 40x | 12.814 ms | 0.05% | 12.808 ms | 0.02% | 7.808G |
I32 | I32 | UNIFORM | 10 | 100000000 | 50x | 10.193 ms | 0.06% | 10.188 ms | 0.01% | 9.816G |
I32 | I32 | UNIFORM | 1000 | 100000000 | 146x | 3.446 ms | 0.17% | 3.440 ms | 0.03% | 29.072G |
I32 | I32 | UNIFORM | 100000 | 100000000 | 101x | 4.961 ms | 0.20% | 4.955 ms | 0.16% | 20.182G |
I32 | I32 | UNIFORM | 10000000 | 100000000 | 11x | 76.129 ms | 0.03% | 76.123 ms | 0.03% | 1.314G |
I32 | I32 | UNIFORM | 1 | 1000000000 | 11x | 130.526 ms | 0.01% | 130.521 ms | 0.01% | 7.662G |
I32 | I32 | UNIFORM | 10 | 1000000000 | 11x | 128.667 ms | 0.41% | 128.661 ms | 0.41% | 7.772G |
I32 | I32 | UNIFORM | 1000 | 1000000000 | 15x | 33.494 ms | 0.19% | 33.488 ms | 0.19% | 29.861G |
I32 | I32 | UNIFORM | 100000 | 1000000000 | 17x | 30.873 ms | 0.34% | 30.867 ms | 0.34% | 32.397G |
I32 | I32 | UNIFORM | 10000000 | 1000000000 | 11x | 233.728 ms | 0.01% | 233.722 ms | 0.01% | 4.279G |
I32 | I32 | UNIFORM | 1000000000 | 1000000000 | 11x | 766.520 ms | 0.00% | 766.515 ms | 0.00% | 1.305G |
I64 | I64 | UNIFORM | 1 | 1 | 92877x | 10.672 us | 98.38% | 5.384 us | 1.71% | 185.752K |
I64 | I64 | UNIFORM | 1 | 100 | 83656x | 11.158 us | 86.88% | 5.977 us | 1.14% | 16.731M |
I64 | I64 | UNIFORM | 10 | 100 | 74783x | 11.954 us | 78.92% | 6.686 us | 1.60% | 14.957M |
I64 | I64 | UNIFORM | 1 | 10000 | 51819x | 14.927 us | 54.81% | 9.649 us | 2.01% | 1.036G |
I64 | I64 | UNIFORM | 10 | 10000 | 52624x | 14.829 us | 56.19% | 9.501 us | 1.71% | 1.052G |
I64 | I64 | UNIFORM | 1000 | 10000 | 22010x | 27.877 us | 22.81% | 22.717 us | 1.48% | 440.193M |
I64 | I64 | UNIFORM | 1 | 1000000 | 6661x | 80.488 us | 7.25% | 75.066 us | 0.39% | 13.322G |
I64 | I64 | UNIFORM | 10 | 1000000 | 9887x | 55.774 us | 10.32% | 50.572 us | 0.44% | 19.774G |
I64 | I64 | UNIFORM | 1000 | 1000000 | 11859x | 47.613 us | 12.97% | 42.163 us | 0.84% | 23.717G |
I64 | I64 | UNIFORM | 100000 | 1000000 | 728x | 692.888 us | 0.82% | 687.483 us | 0.24% | 1.455G |
I64 | I64 | UNIFORM | 1 | 100000000 | 36x | 14.278 ms | 0.05% | 14.272 ms | 0.02% | 7.007G |
I64 | I64 | UNIFORM | 10 | 100000000 | 39x | 12.844 ms | 0.05% | 12.838 ms | 0.02% | 7.789G |
I64 | I64 | UNIFORM | 1000 | 100000000 | 138x | 3.649 ms | 0.16% | 3.643 ms | 0.02% | 27.448G |
I64 | I64 | UNIFORM | 100000 | 100000000 | 164x | 3.065 ms | 0.21% | 3.059 ms | 0.10% | 32.689G |
I64 | I64 | UNIFORM | 10000000 | 100000000 | 11x | 64.236 ms | 0.03% | 64.230 ms | 0.03% | 1.557G |
I64 | I64 | UNIFORM | 1 | 1000000000 | 11x | 143.398 ms | 0.02% | 143.392 ms | 0.02% | 6.974G |
I64 | I64 | UNIFORM | 10 | 1000000000 | 11x | 143.681 ms | 0.46% | 143.675 ms | 0.46% | 6.960G |
I64 | I64 | UNIFORM | 1000 | 1000000000 | 14x | 38.440 ms | 0.13% | 38.435 ms | 0.13% | 26.018G |
I64 | I64 | UNIFORM | 100000 | 1000000000 | 16x | 31.523 ms | 0.21% | 31.517 ms | 0.21% | 31.729G |
I64 | I64 | UNIFORM | 10000000 | 1000000000 | 11x | 150.555 ms | 0.12% | 150.549 ms | 0.12% | 6.642G |
I64 | I64 | UNIFORM | 1000000000 | 1000000000 | 11x | 785.048 ms | 0.01% | 785.042 ms | 0.01% | 1.274G |
static_map_insert_or_apply_sync_uniform_multiplicity: map_construction + insert_or_apply + retrieve_all
[0] NVIDIA H100
Key | Value | Distribution | Multiplicity | NumInputs | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
---|---|---|---|---|---|---|---|---|---|---|
I32 | I32 | UNIFORM | 1 | 1 | 14236x | 39.764 us | 16.43% | 35.124 us | 9.74% | 28.471K |
I32 | I32 | UNIFORM | 1 | 100 | 14147x | 39.965 us | 15.98% | 35.345 us | 9.18% | 2.829M |
I32 | I32 | UNIFORM | 10 | 100 | 13836x | 40.725 us | 14.58% | 36.139 us | 7.13% | 2.767M |
I32 | I32 | UNIFORM | 1 | 10000 | 13470x | 41.774 us | 12.83% | 37.121 us | 2.58% | 269.391M |
I32 | I32 | UNIFORM | 10 | 10000 | 13441x | 41.801 us | 12.72% | 37.200 us | 2.82% | 268.817M |
I32 | I32 | UNIFORM | 1000 | 10000 | 10283x | 53.245 us | 12.03% | 48.626 us | 7.39% | 205.653M |
I32 | I32 | UNIFORM | 1 | 1000000 | 2574x | 199.268 us | 3.29% | 194.322 us | 2.04% | 5.146G |
I32 | I32 | UNIFORM | 10 | 1000000 | 3883x | 133.326 us | 4.09% | 128.790 us | 2.06% | 7.765G |
I32 | I32 | UNIFORM | 1000 | 1000000 | 3498x | 147.518 us | 4.13% | 142.968 us | 2.57% | 6.995G |
I32 | I32 | UNIFORM | 100000 | 1000000 | 564x | 891.902 us | 0.66% | 887.355 us | 0.42% | 1.127G |
I32 | I32 | UNIFORM | 1 | 100000000 | 31x | 16.513 ms | 0.09% | 16.507 ms | 0.08% | 6.058G |
I32 | I32 | UNIFORM | 10 | 100000000 | 47x | 10.862 ms | 0.08% | 10.856 ms | 0.06% | 9.211G |
I32 | I32 | UNIFORM | 1000 | 100000000 | 142x | 3.534 ms | 0.15% | 3.530 ms | 0.08% | 28.331G |
I32 | I32 | UNIFORM | 100000 | 100000000 | 99x | 5.069 ms | 0.18% | 5.064 ms | 0.15% | 19.746G |
I32 | I32 | UNIFORM | 10000000 | 100000000 | 11x | 76.669 ms | 0.04% | 76.665 ms | 0.04% | 1.304G |
I32 | I32 | UNIFORM | 1 | 1000000000 | 11x | 238.584 ms | 34.18% | 238.579 ms | 34.18% | 4.191G |
I32 | I32 | UNIFORM | 10 | 1000000000 | 11x | 133.714 ms | 1.42% | 133.708 ms | 1.42% | 7.479G |
I32 | I32 | UNIFORM | 1000 | 1000000000 | 15x | 33.993 ms | 2.28% | 33.988 ms | 2.28% | 29.422G |
I32 | I32 | UNIFORM | 100000 | 1000000000 | 17x | 31.042 ms | 0.76% | 31.038 ms | 0.76% | 32.219G |
I32 | I32 | UNIFORM | 10000000 | 1000000000 | 11x | 234.404 ms | 0.01% | 234.400 ms | 0.01% | 4.266G |
I32 | I32 | UNIFORM | 1000000000 | 1000000000 | 11x | 768.069 ms | 0.01% | 768.066 ms | 0.01% | 1.302G |
I64 | I64 | UNIFORM | 1 | 1 | 14169x | 39.907 us | 16.06% | 35.290 us | 9.30% | 28.337K |
I64 | I64 | UNIFORM | 1 | 100 | 14221x | 39.737 us | 16.06% | 35.161 us | 9.39% | 2.844M |
I64 | I64 | UNIFORM | 10 | 100 | 14625x | 38.853 us | 17.29% | 34.189 us | 10.59% | 2.925M |
I64 | I64 | UNIFORM | 1 | 10000 | 13542x | 41.516 us | 12.78% | 36.923 us | 2.76% | 270.833M |
I64 | I64 | UNIFORM | 10 | 10000 | 13426x | 41.846 us | 12.71% | 37.243 us | 2.85% | 268.510M |
I64 | I64 | UNIFORM | 1000 | 10000 | 9734x | 55.970 us | 10.91% | 51.369 us | 6.25% | 194.672M |
I64 | I64 | UNIFORM | 1 | 1000000 | 2235x | 228.915 us | 2.76% | 223.715 us | 1.44% | 4.470G |
I64 | I64 | UNIFORM | 10 | 1000000 | 2725x | 188.394 us | 3.48% | 183.536 us | 2.17% | 5.449G |
I64 | I64 | UNIFORM | 1000 | 1000000 | 4143x | 125.205 us | 4.43% | 120.699 us | 2.35% | 8.285G |
I64 | I64 | UNIFORM | 100000 | 1000000 | 672x | 749.144 us | 0.78% | 744.651 us | 0.50% | 1.343G |
I64 | I64 | UNIFORM | 1 | 100000000 | 24x | 20.894 ms | 0.14% | 20.888 ms | 0.13% | 4.787G |
I64 | I64 | UNIFORM | 10 | 100000000 | 36x | 14.156 ms | 0.08% | 14.150 ms | 0.07% | 7.067G |
I64 | I64 | UNIFORM | 1000 | 100000000 | 133x | 3.785 ms | 0.16% | 3.780 ms | 0.08% | 26.456G |
I64 | I64 | UNIFORM | 100000 | 100000000 | 159x | 3.151 ms | 0.20% | 3.146 ms | 0.13% | 31.782G |
I64 | I64 | UNIFORM | 10000000 | 100000000 | 11x | 64.586 ms | 0.03% | 64.581 ms | 0.03% | 1.548G |
I64 | I64 | UNIFORM | 1 | 1000000000 | 11x | 831.910 ms | 0.45% | 831.907 ms | 0.45% | 1.202G |
I64 | I64 | UNIFORM | 10 | 1000000000 | 11x | 150.437 ms | 0.95% | 150.431 ms | 0.95% | 6.648G |
I64 | I64 | UNIFORM | 1000 | 1000000000 | 13x | 39.079 ms | 1.28% | 39.074 ms | 1.28% | 25.592G |
I64 | I64 | UNIFORM | 100000 | 1000000000 | 16x | 32.446 ms | 0.22% | 32.441 ms | 0.22% | 30.825G |
I64 | I64 | UNIFORM | 10000000 | 1000000000 | 11x | 152.054 ms | 1.00% | 152.050 ms | 1.00% | 6.577G |
I64 | I64 | UNIFORM | 1000000000 | 1000000000 | 11x | 786.932 ms | 0.01% | 786.929 ms | 0.01% | 1.271G |
thrust_insert_or_apply_uniform_multiplicity: thrust sort_by_key + reduce_by_key
[0] NVIDIA H100
Key | Value | Distribution | Multiplicity | NumInputs | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
---|---|---|---|---|---|---|---|---|---|---|
I32 | I32 | UNIFORM | 1 | 1 | 8516x | 63.275 us | 7.92% | 58.717 us | 1.52% | 17.031K |
I32 | I32 | UNIFORM | 1 | 100 | 8473x | 63.591 us | 8.07% | 59.013 us | 2.10% | 1.695M |
I32 | I32 | UNIFORM | 10 | 100 | 8483x | 63.517 us | 8.06% | 58.944 us | 2.07% | 1.697M |
I32 | I32 | UNIFORM | 1 | 10000 | 5860x | 89.776 us | 5.34% | 85.326 us | 1.14% | 117.197M |
I32 | I32 | UNIFORM | 10 | 10000 | 6292x | 83.924 us | 5.74% | 79.466 us | 1.22% | 125.839M |
I32 | I32 | UNIFORM | 1000 | 10000 | 6305x | 83.813 us | 5.80% | 79.314 us | 1.19% | 126.082M |
I32 | I32 | UNIFORM | 1 | 1000000 | 1577x | 322.285 us | 1.97% | 317.201 us | 1.14% | 3.153G |
I32 | I32 | UNIFORM | 10 | 1000000 | 1584x | 320.971 us | 2.50% | 315.791 us | 1.84% | 3.167G |
I32 | I32 | UNIFORM | 1000 | 1000000 | 1623x | 313.149 us | 1.96% | 308.100 us | 1.06% | 3.246G |
I32 | I32 | UNIFORM | 100000 | 1000000 | 1665x | 305.556 us | 2.05% | 300.457 us | 1.13% | 3.328G |
I32 | I32 | UNIFORM | 1 | 100000000 | 77x | 6.534 ms | 0.25% | 6.528 ms | 0.23% | 15.318G |
I32 | I32 | UNIFORM | 10 | 100000000 | 82x | 6.106 ms | 0.24% | 6.100 ms | 0.22% | 16.393G |
I32 | I32 | UNIFORM | 1000 | 100000000 | 87x | 5.792 ms | 0.21% | 5.786 ms | 0.18% | 17.284G |
I32 | I32 | UNIFORM | 100000 | 100000000 | 88x | 5.744 ms | 0.22% | 5.738 ms | 0.19% | 17.427G |
I32 | I32 | UNIFORM | 10000000 | 100000000 | 88x | 5.725 ms | 0.18% | 5.719 ms | 0.15% | 17.487G |
I32 | I32 | UNIFORM | 1 | 1000000000 | 11x | 50.674 ms | 0.97% | 50.669 ms | 0.97% | 19.736G |
I32 | I32 | UNIFORM | 10 | 1000000000 | 11x | 133.028 ms | 33.99% | 133.023 ms | 33.99% | 7.517G |
I32 | I32 | UNIFORM | 1000 | 1000000000 | 11x | 132.743 ms | 35.40% | 132.738 ms | 35.40% | 7.534G |
I32 | I32 | UNIFORM | 100000 | 1000000000 | 11x | 133.029 ms | 35.49% | 133.024 ms | 35.49% | 7.517G |
I32 | I32 | UNIFORM | 10000000 | 1000000000 | 11x | 133.027 ms | 35.79% | 133.022 ms | 35.79% | 7.518G |
I32 | I32 | UNIFORM | 1000000000 | 1000000000 | 11x | 147.244 ms | 24.01% | 147.239 ms | 24.01% | 6.792G |
I64 | I64 | UNIFORM | 1 | 1 | 10579x | 51.913 us | 10.08% | 47.266 us | 2.13% | 21.157K |
I64 | I64 | UNIFORM | 1 | 100 | 9541x | 56.985 us | 9.06% | 52.409 us | 2.26% | 1.908M |
I64 | I64 | UNIFORM | 10 | 100 | 9409x | 57.759 us | 9.03% | 53.145 us | 2.32% | 1.882M |
I64 | I64 | UNIFORM | 1 | 10000 | 5607x | 93.703 us | 5.34% | 89.183 us | 1.60% | 112.129M |
I64 | I64 | UNIFORM | 10 | 10000 | 5634x | 93.304 us | 5.28% | 88.762 us | 1.27% | 112.661M |
I64 | I64 | UNIFORM | 1000 | 10000 | 5616x | 93.621 us | 5.31% | 89.037 us | 1.22% | 112.313M |
I64 | I64 | UNIFORM | 1 | 1000000 | 1029x | 491.459 us | 1.65% | 486.352 us | 1.24% | 2.056G |
I64 | I64 | UNIFORM | 10 | 1000000 | 1038x | 486.752 us | 1.29% | 481.742 us | 0.75% | 2.076G |
I64 | I64 | UNIFORM | 1000 | 1000000 | 1033x | 489.316 us | 1.61% | 484.149 us | 1.18% | 2.065G |
I64 | I64 | UNIFORM | 100000 | 1000000 | 1041x | 485.532 us | 1.48% | 480.499 us | 1.03% | 2.081G |
I64 | I64 | UNIFORM | 1 | 100000000 | 15x | 34.288 ms | 0.05% | 34.282 ms | 0.05% | 2.917G |
I64 | I64 | UNIFORM | 10 | 100000000 | 15x | 34.320 ms | 0.06% | 34.314 ms | 0.06% | 2.914G |
I64 | I64 | UNIFORM | 1000 | 100000000 | 15x | 34.293 ms | 0.06% | 34.287 ms | 0.06% | 2.917G |
I64 | I64 | UNIFORM | 100000 | 100000000 | 15x | 34.312 ms | 0.05% | 34.307 ms | 0.05% | 2.915G |
I64 | I64 | UNIFORM | 10000000 | 100000000 | 15x | 34.303 ms | 0.12% | 34.298 ms | 0.12% | 2.916G |
I64 | I64 | UNIFORM | 1 | 1000000000 | 11x | 375.076 ms | 0.05% | 375.071 ms | 0.05% | 2.666G |
I64 | I64 | UNIFORM | 10 | 1000000000 | 11x | 423.237 ms | 40.13% | 423.232 ms | 40.13% | 2.363G |
I64 | I64 | UNIFORM | 1000 | 1000000000 | 11x | 371.371 ms | 0.23% | 371.366 ms | 0.23% | 2.693G |
I64 | I64 | UNIFORM | 100000 | 1000000000 | 11x | 371.471 ms | 0.21% | 371.466 ms | 0.21% | 2.692G |
I64 | I64 | UNIFORM | 10000000 | 1000000000 | 11x | 371.516 ms | 0.26% | 371.511 ms | 0.26% | 2.692G |
I64 | I64 | UNIFORM | 1000000000 | 1000000000 | 11x | 381.316 ms | 8.38% | 381.311 ms | 8.38% | 2.623G |
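For readers unfamiliar with the baseline, the sort_by_key + reduce_by_key approach can be sketched on the host as follows. This is a plain C++ analogue, not the Thrust code used in the benchmark; `sort_then_reduce_by_key` is a hypothetical name. All pairs are sorted first and runs of equal keys are then folded, which fits the table above, where the timings for a fixed NumInputs change little with multiplicity in most rows.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sort the pairs by key, then fold each run of equal keys with `op`.
// The input is taken by value so the caller's data is left untouched.
template <typename K, typename V, typename Op>
std::vector<std::pair<K, V>> sort_then_reduce_by_key(std::vector<std::pair<K, V>> input, Op op) {
  std::stable_sort(input.begin(), input.end(),
                   [](std::pair<K, V> const& a, std::pair<K, V> const& b) {
                     return a.first < b.first;
                   });
  std::vector<std::pair<K, V>> out;
  for (auto const& kv : input) {
    if (!out.empty() && out.back().first == kv.first) {
      // Same key as the previous element: combine the payloads.
      out.back().second = op(out.back().second, kv.second);
    } else {
      // First element of a new run: start a new output entry.
      out.push_back(kv);
    }
  }
  return out;
}
```

Unlike the hash-based path, the output here comes back sorted by key, but the full input has to be sorted even when there are only a handful of distinct keys.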
Some final cleanup requests. Great job! @srinivasyadav18
/ok to test |
/ok to test |
/ok to test |
Looks great! Thank you for addressing all my reviews. I'm really happy with the final outcome of this work.
The quick takeaway is that the hash-based I have run out of my daily advanced data analysis quota with free ChatGPT. However, if you copy-paste the above table into ChatGPT, it will summarize the results effectively and generate a nice Python script to visualize the data. |
ChatGPT Pro shill here coming in hot with some AI analysis fresh outta the LLM oven. I was presented with this fancy 3D plotly graph. No clue how accurately it represents the numbers I've shown it: https://nvidia-my.sharepoint.com/:u:/p/djuenger/ETLjQTGXXZJKiDapTGruSyMBxqaMfgZXi2QMZUAfKaBoTw?e=lqabSu (you have to download the file and open it locally since OneDrive preview can't comprehend this level of AI awesomeness) |
Awesome work! 👏
The link is accessible only by NVIDIA employees. |
@mfbalin right, there's an organization policy in place that prevents me from generating publicly accessible links to NV-hosted files. I uploaded the file to a public gist here: https://gist.github.com/sleeepyjack/8b038de36416af38d3284dabdf99814b |
3D plots are surprisingly hard to read :). Seems like the performance is great, unless the multiplicity is too large or the hash table does not fit in the cache (L2?). |
Yep, we already have some ideas up our sleeves to improve performance in high-multiplicity scenarios. Stay tuned for some of @srinivasyadav18's upcoming PRs ;) |
A shared-memory hash table per block that does the reduction first and then commits to the main hash table seems like a way to improve it. Not sure if it would slow down the general case though. I wonder if you had a different idea :). I will be watching the developments closely. |
That's exactly what we had in mind. 🎯 We're vibing ;) |
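The two-level idea discussed above can be illustrated with a small host-side sketch. This is only an analogy under stated assumptions: chunks of the input play the role of thread blocks, a local `std::unordered_map` stands in for the per-block shared-memory table, addition is the reduction, and `reduce_with_local_tables` and its parameters are hypothetical names. The sketch counts how many updates reach the global table, which is the quantity pre-aggregation is meant to shrink.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Reduces `input` by key in two levels. Each chunk of `block_size` pairs is
// first aggregated in a local table, and only the local results are committed
// to the global table. `global_updates` receives the number of commits.
std::unordered_map<int, int> reduce_with_local_tables(std::vector<Pair> const& input,
                                                      std::size_t block_size,
                                                      std::size_t& global_updates) {
  assert(block_size > 0);
  std::unordered_map<int, int> global;
  global_updates = 0;
  for (std::size_t begin = 0; begin < input.size(); begin += block_size) {
    std::size_t const end = std::min(begin + block_size, input.size());
    std::unordered_map<int, int> local;  // stands in for a shared-memory table
    for (std::size_t i = begin; i < end; ++i) {
      local[input[i].first] += input[i].second;
    }
    for (auto const& kv : local) {
      global[kv.first] += kv.second;  // one commit per distinct key per block
      ++global_updates;
    }
  }
  return global;
}
```

With 1000 inputs over 4 distinct keys and a block size of 100, the global table sees 40 updates instead of 1000. With mostly unique keys the local pass removes almost nothing and only adds work, which is the general-case slowdown the comment above worries about.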
This PR (continuation of PR #384) adds a function static_map::insert_or_apply, which either inserts a new key-value pair into the map or, if the key already exists, applies a reduction function to the associated value. This effectively replaces the former cuco::static_reduction_map and thus supersedes #98.
The current implementation restricts the insert_or_apply() operation to accept an Op callable with one specific signature only, i.e., Op should be a callable object invoked as Op(cuda::atomic_ref<T, Scope>, T).
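The contract described above can be sketched on the host as follows. This is a sequential illustration, not the cuco API or a concurrent implementation: `MapSlot`, `kEmpty`, and this free function `insert_or_apply` are hypothetical, and `std::atomic<T>&` stands in for `cuda::atomic_ref<T, Scope>`. The point is the shape of `Op`, which receives a reference to the stored payload plus the incoming value and performs the update itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kEmpty = -1;  // assumed key sentinel

template <typename T>
struct MapSlot {
  std::atomic<int> key{kEmpty};
  std::atomic<T> value{};
};

// Inserts (key, value) if the key is absent, otherwise calls op(payload, value).
// Sequential sketch: the key check and the store are not one atomic step here.
// Returns false only if the table is full.
template <typename T, typename Op>
bool insert_or_apply(std::vector<MapSlot<T>>& slots, int key, T value, Op op) {
  assert(key != kEmpty && !slots.empty());
  std::size_t const n = slots.size();
  std::size_t idx = static_cast<std::size_t>(key) % n;
  for (std::size_t probes = 0; probes < n; ++probes, idx = (idx + 1) % n) {
    MapSlot<T>& slot = slots[idx];
    int const resident = slot.key.load();
    if (resident == kEmpty) {
      // Insert path: the payload becomes the value itself, no identity needed.
      slot.key.store(key);
      slot.value.store(value);
      return true;
    }
    if (resident == key) {
      // Apply path: Op decides how to combine, e.g. fetch_add or a CAS loop.
      op(slot.value, value);
      return true;
    }
  }
  return false;
}
```

A sum would pass `[](std::atomic<int>& ref, int v) { ref.fetch_add(v); }` as `Op`. A max can be written as a compare-exchange loop, and because the first occurrence of a key is stored as-is, it gives the right result even when all values are negative.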
Closes #82 #384