Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
The atomic-based specialization of block histogram uses device-wide atomics instead of block-wide ones:
cccl/cub/cub/block/specializations/block_histogram_atomic.cuh
Line 79 in cc7c1bb
When the histogram is in shared memory (and the compiler can see that), this inefficiency is optimized away. Nevertheless, block histogram allows the histogram to be in global memory, which leads to suboptimal codegen (using gpu instead of cta scope on atom).
Describe the solution you'd like
Scoped atomics are a Pascal+ feature (compute capability 6.0 and later), so we can consider something along the lines of:
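A minimal sketch of what a block-scoped increment could look like, not the actual CUB change. It assumes each thread block owns a private histogram slice in global memory, so cta scope is sufficient. The names block_scope_increment, per_block_histogram and run_per_block_histogram are hypothetical. Samples are assumed to lie in [0, num_bins), and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>

#include <cuda_runtime.h>

// Hypothetical helper: use the block-scoped atomic where the target supports it
// (sm_60+), and fall back to the device-scoped atomic on older architectures.
__device__ inline void block_scope_increment(unsigned int* counter)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 600)
  atomicAdd_block(counter, 1u);
#else
  atomicAdd(counter, 1u);
#endif
}

// Each block accumulates into its own histogram slice in global memory.
// Only threads of one block touch a given slice, so block scope is enough.
__global__ void per_block_histogram(const int* samples, int num_samples,
                                    unsigned int* histograms, int num_bins)
{
  unsigned int* block_histogram = histograms + blockIdx.x * num_bins;

  // Contiguous chunk of the input assigned to this block.
  const int items_per_block = (num_samples + gridDim.x - 1) / gridDim.x;
  const int begin           = blockIdx.x * items_per_block;
  const int end             = min(begin + items_per_block, num_samples);

  for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
  {
    block_scope_increment(block_histogram + samples[i]);
  }
}

// Host wrapper: returns num_blocks * num_bins counters, one slice per block.
std::vector<unsigned int> run_per_block_histogram(const std::vector<int>& samples,
                                                  int num_bins, int num_blocks,
                                                  int threads_per_block)
{
  assert(num_bins > 0 && num_blocks > 0 && threads_per_block > 0);

  const std::size_t hist_count = static_cast<std::size_t>(num_blocks) * num_bins;
  int* d_samples               = nullptr;
  unsigned int* d_histograms   = nullptr;

  cudaMalloc(&d_samples, samples.size() * sizeof(int));
  cudaMalloc(&d_histograms, hist_count * sizeof(unsigned int));
  cudaMemcpy(d_samples, samples.data(), samples.size() * sizeof(int),
             cudaMemcpyHostToDevice);
  cudaMemset(d_histograms, 0, hist_count * sizeof(unsigned int));

  per_block_histogram<<<num_blocks, threads_per_block>>>(
    d_samples, static_cast<int>(samples.size()), d_histograms, num_bins);

  // The blocking copy below also waits for the kernel to finish.
  std::vector<unsigned int> result(hist_count);
  cudaMemcpy(result.data(), d_histograms, hist_count * sizeof(unsigned int),
             cudaMemcpyDeviceToHost);

  cudaFree(d_samples);
  cudaFree(d_histograms);
  return result;
}
```

When compiled for sm_60 or newer (for example with nvcc -arch=sm_60), the increment should take the atomicAdd_block path. Whether this yields a measurable speedup for global-memory histograms would still need to be confirmed by a benchmark.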
Potential benchmark for this change:
Describe alternatives you've considered
No response
Additional context
No response