
[FEA]: Reduce scope of histogram atomics #3357

Open
gevtushenko opened this issue Jan 11, 2025 · 0 comments
Labels
feature request New feature or request.


Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

The atomic-based specialization of the block histogram uses device-wide atomics instead of block-wide ones:

atomicAdd(histogram + items[i], 1);

When the histogram is in shared memory (and the compiler can see that), this inefficiency is optimized away. Nevertheless, block histogram also allows the histogram to live in global memory, which leads to suboptimal codegen (a gpu-scope instead of a cta-scope atom instruction).
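
To make the scope difference concrete, here is a minimal sketch (illustrative, not from the issue) contrasting the two intrinsics on a global-memory counter; the comments describe the atom scope one would typically expect in the generated PTX:

// Minimal sketch: plain atomicAdd must be visible to every thread on
// the device, so it compiles to a gpu-scope atom. atomicAdd_block
// (SM 60+) only synchronizes within the CTA, allowing a cheaper
// cta-scope atom.
__global__ void scope_demo(int *counter)
{
  atomicAdd(counter, 1);        // device-wide: gpu-scope atom
#if __CUDA_ARCH__ >= 600
  atomicAdd_block(counter, 1);  // block-wide: cta-scope atom
#endif
}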

Describe the solution you'd like

Scoped atomics are a Pascal+ (SM 60+) feature, so we can consider something along the lines of:

NV_IF_TARGET(NV_PROVIDES_SM_60, 
             (atomicAdd_block(histogram + items[i], 1);), 
             (atomicAdd(histogram + items[i], 1);));
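
For context, the change would land in the update loop of the atomic specialization (cub/block/specializations/block_histogram_atomics.cuh); a rough sketch with the scoped atomic in place, signature approximated rather than quoted verbatim:

#include <nv/target> // NV_IF_TARGET / NV_PROVIDES_SM_60

// Rough sketch of the specialization's Composite loop (approximate,
// not the verbatim CUB source):
template <typename T, typename CounterT, int ITEMS_PER_THREAD, int BINS>
__device__ __forceinline__ void Composite(T (&items)[ITEMS_PER_THREAD],
                                          CounterT histogram[BINS])
{
#pragma unroll
  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
  {
    NV_IF_TARGET(NV_PROVIDES_SM_60,
                 (atomicAdd_block(histogram + items[i], 1);),
                 (atomicAdd(histogram + items[i], 1);));
  }
}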

Potential benchmark for this change:

#include <cub/block/block_histogram.cuh>
#include <cub/block/block_load.cuh>

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>

#include <nvbench/nvbench.cuh>

template <int BlockThreads, int ItemsPerThread, int Bins>
__global__ void kernel(int *data, int *histogram)
{
  using histogram_t = cub::BlockHistogram<int,
                                          BlockThreads,
                                          ItemsPerThread,
                                          Bins,
                                          cub::BlockHistogramAlgorithm::BLOCK_HISTO_ATOMIC>;
  __shared__ typename histogram_t::TempStorage temp_storage;

  int thread_data[ItemsPerThread];
  cub::LoadDirectStriped<BlockThreads>(threadIdx.x, data, thread_data);
  histogram_t(temp_storage).Histogram(thread_data, histogram + Bins * blockIdx.x);
}

template <class BlockThreads, class ItemsPerThread, class Bins>
void bench(nvbench::state &state, nvbench::type_list<BlockThreads, ItemsPerThread, Bins>)
{
  constexpr int block_threads    = BlockThreads::value;
  constexpr int items_per_thread = ItemsPerThread::value;
  constexpr int bins             = Bins::value;

  // Every block reads the same block-sized tile and writes its own
  // histogram slice, so the benchmark isolates the atomics' scope.
  int grid_size  = 800;
  int input_size = block_threads * items_per_thread;
  thrust::device_vector<int> data(input_size);
  thrust::device_vector<int> histogram(bins * grid_size);
  thrust::tabulate(data.begin(), data.end(), [] __host__ __device__(int i) { return i % bins; });

  state.exec([&](nvbench::launch &launch) {
    kernel<block_threads, items_per_thread, bins>
      <<<grid_size, block_threads, 0, launch.get_stream()>>>(thrust::raw_pointer_cast(data.data()),
                                                             thrust::raw_pointer_cast(
                                                               histogram.data()));
  });
}

using block_threads = nvbench::enum_type_list<128, 256, 512>;
using items         = nvbench::enum_type_list<1, 3, 7>;
using bins          = nvbench::enum_type_list<10, 50, 100>;

NVBENCH_BENCH_TYPES(bench, NVBENCH_TYPE_AXES(block_threads, items, bins));
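
Comparing this benchmark before and after the switch to atomicAdd_block across the block-size, items-per-thread, and bin-count axes should show whether the cta-scope atomics pay off when the histogram lives in global memory.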

Describe alternatives you've considered

No response

Additional context

No response
