Synchronization for Collectives #3315
-
Is it necessary to invoke a barrier between consecutive CUB collective invocations? Say, for example, I intend to do 64 CUB block inclusive scans consecutively (in a loop). Rather than synchronizing to reuse the shared temp storage, I allocate 64 temp storage objects and use one per inclusive scan. Do I still need to synchronize prior to invoking each successive scan? What if one thread does some intermediate work before invoking the collective? Concretely, is the code below functionally correct?

```cuda
#define CAST_TO(T, p) static_cast<T*>(static_cast<void*>(p))

using BlockScan = cub::BlockScan<uint8_t, threads>;
__shared__ __align__(alignof(BlockScan::TempStorage))
    cuda::std::byte scratch[sizeof(BlockScan::TempStorage) * 64];
auto* scanTempStorage = CAST_TO(typename BlockScan::TempStorage, scratch);

for (uint i = 0; i < 64; ++i) {
    BlockScan(scanTempStorage[i]).InclusiveSum(/*args*/);
    if (threadIdx.x == 0) {
        // Do some work
    }
}
```
Replies: 1 comment 1 reply
-
You do not need to call `__syncthreads()` if the shared memory allocations passed to consecutive `Block*` algorithm invocations do not overlap. Keep in mind that excessive shared memory allocations may reduce occupancy, though. In most cases, reducing occupancy will result in worse performance than invoking a `__syncthreads()`.

It should not be a problem as long as the data that a thread passes to a `Block*` algorithm is available…
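
To make the trade-off concrete, below is a minimal sketch (not from the thread) of the pattern the reply recommends for most cases: reuse a single shared `TempStorage` allocation and place a `__syncthreads()` between consecutive invocations. The kernel name, block size (`threads`), and per-thread input layout are illustrative assumptions.

```cuda
// Minimal sketch, assuming a hypothetical kernel, block size, and input layout.
#include <cstdint>
#include <cub/cub.cuh>

constexpr int threads = 128;  // assumed block size

__global__ void repeatedScans(const uint8_t* in, uint8_t* out)
{
    using BlockScan = cub::BlockScan<uint8_t, threads>;
    // One TempStorage object, reused across all 64 scans.
    __shared__ typename BlockScan::TempStorage tempStorage;

    for (int i = 0; i < 64; ++i) {
        uint8_t threadData = in[i * threads + threadIdx.x];
        BlockScan(tempStorage).InclusiveSum(threadData, threadData);
        out[i * threads + threadIdx.x] = threadData;

        if (threadIdx.x == 0) {
            // Intermediate per-thread work, as in the question.
        }

        // Required: the next iteration reuses the same tempStorage allocation.
        __syncthreads();
    }
}
```

This keeps the shared memory footprint at a single `TempStorage` object, which typically preserves occupancy better than allocating 64 of them, at the cost of one barrier per iteration.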