
[FEA]: Reduce scope of histogram atomics #3357

Open
gevtushenko opened this issue Jan 11, 2025 · 0 comments
Labels
feature request New feature or request.


Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

The atomic-based specialization of the block histogram uses device-wide atomics instead of block-wide ones:

atomicAdd(histogram + items[i], 1);

When the histogram is in shared memory (and the compiler can see that), this inefficiency is optimized away. Nevertheless, block histogram also allows the histogram to live in global memory, which leads to suboptimal codegen (a gpu-scope instead of a cta-scope atom instruction).
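
To make the scope difference concrete, here is a minimal sketch (illustrative, not from the issue) contrasting the two intrinsics on a global-memory counter; the comments describe the atom scope one would typically expect in the generated PTX:

// Minimal sketch: plain atomicAdd must be visible to every thread on
// the device, so it compiles to a gpu-scope atom. atomicAdd_block
// (SM 60+) only synchronizes within the CTA, allowing a cheaper
// cta-scope atom.
__global__ void scope_demo(int *counter)
{
  atomicAdd(counter, 1);        // device-wide: gpu-scope atom
#if __CUDA_ARCH__ >= 600
  atomicAdd_block(counter, 1);  // block-wide: cta-scope atom
#endif
}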

Describe the solution you'd like

Scoped atomics are a Pascal+ (SM 60+) feature, so we can consider something along the lines of:

NV_IF_TARGET(NV_PROVIDES_SM_60, 
             (atomicAdd_block(histogram + items[i], 1);), 
             (atomicAdd(histogram + items[i], 1);));
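
For context, the change would land in the update loop of the atomic specialization (cub/block/specializations/block_histogram_atomics.cuh); a rough sketch with the scoped atomic in place, signature approximated rather than quoted verbatim:

#include <nv/target> // NV_IF_TARGET / NV_PROVIDES_SM_60

// Rough sketch of the specialization's Composite loop (approximate,
// not the verbatim CUB source):
template <typename T, typename CounterT, int ITEMS_PER_THREAD, int BINS>
__device__ __forceinline__ void Composite(T (&items)[ITEMS_PER_THREAD],
                                          CounterT histogram[BINS])
{
#pragma unroll
  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
  {
    NV_IF_TARGET(NV_PROVIDES_SM_60,
                 (atomicAdd_block(histogram + items[i], 1);),
                 (atomicAdd(histogram + items[i], 1);));
  }
}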

Potential benchmark for this change:

#include <cub/block/block_histogram.cuh>
#include <cub/block/block_load.cuh>

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>

#include <nvbench/nvbench.cuh>

template <int BlockThreads, int ItemsPerThread, int Bins>
__global__ void kernel(int *data, int *histogram)
{
  using histogram_t = cub::BlockHistogram<int,
                                          BlockThreads,
                                          ItemsPerThread,
                                          Bins,
                                          cub::BlockHistogramAlgorithm::BLOCK_HISTO_ATOMIC>;
  __shared__ typename histogram_t::TempStorage temp_storage;

  int thread_data[ItemsPerThread];
  cub::LoadDirectStriped<BlockThreads>(threadIdx.x, data, thread_data);
  histogram_t(temp_storage).Histogram(thread_data, histogram + Bins * blockIdx.x);
}

template <class BlockThreads, class ItemsPerThread, class Bins>
void bench(nvbench::state &state, nvbench::type_list<BlockThreads, ItemsPerThread, Bins>)
{
  constexpr int block_threads    = BlockThreads::value;
  constexpr int items_per_thread = ItemsPerThread::value;
  constexpr int bins             = Bins::value;

  // Every block reads the same block-sized tile and writes its own
  // histogram slice, so the benchmark isolates the atomics' scope.
  int grid_size  = 800;
  int input_size = block_threads * items_per_thread;
  thrust::device_vector<int> data(input_size);
  thrust::device_vector<int> histogram(bins * grid_size);
  thrust::tabulate(data.begin(), data.end(), [] __host__ __device__(int i) { return i % bins; });

  state.exec([&](nvbench::launch &launch) {
    kernel<block_threads, items_per_thread, bins>
      <<<grid_size, block_threads, 0, launch.get_stream()>>>(thrust::raw_pointer_cast(data.data()),
                                                             thrust::raw_pointer_cast(
                                                               histogram.data()));
  });
}

using block_threads = nvbench::enum_type_list<128, 256, 512>;
using items         = nvbench::enum_type_list<1, 3, 7>;
using bins          = nvbench::enum_type_list<10, 50, 100>;

NVBENCH_BENCH_TYPES(bench, NVBENCH_TYPE_AXES(block_threads, items, bins));
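
Comparing this benchmark before and after the switch to atomicAdd_block across the block-size, items-per-thread, and bin-count axes should show whether the cta-scope atomics pay off when the histogram lives in global memory.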

Describe alternatives you've considered

No response

Additional context

No response
