Synchronization for Collectives #3315
-
Is it necessary to invoke a barrier between consecutive CUB collective invocations? Say, for example, I intend to do 64 CUB block inclusive scans consecutively (in a loop). Rather than synchronizing to reuse the shared temp storage, I allocate 64 temp storage objects and use one per inclusive scan. Do I still need to synchronize prior to invoking each successive scan? What if one thread does some intermediate work before invoking the collective? Concretely, is the code below functionally correct?

```cuda
#define CAST_TO(T, p) static_cast<T*>(static_cast<void*>(p))

using BlockScan = cub::BlockScan<uint8_t, threads>;
__shared__ __align__(alignof(BlockScan::TempStorage))
    cuda::std::byte scratch[sizeof(BlockScan::TempStorage) * 64];
auto* scanTempStorage = CAST_TO(typename BlockScan::TempStorage, scratch);

for (uint i = 0; i < 64; ++i) {
    BlockScan(scanTempStorage[i]).InclusiveSum(/*args*/);
    if (threadIdx.x == 0) {
        // Do some work
    }
}
```
Replies: 1 comment 1 reply
-
You do not need to call `__syncthreads()` if the shared memory allocations passed to consecutive `Block*` algorithm invocations do not overlap. Keep in mind that excessive shared memory allocations may reduce occupancy, though. In most cases, reducing occupancy will result in worse performance than invoking a `__syncthreads()`.

It should not be a problem as long as the data that a thread passes to a `Block*` algorithm is available…
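
To make the trade-off concrete, below is a minimal sketch (not from the thread) of the pattern the reply recommends for most cases: reuse a single shared `TempStorage` allocation and place a `__syncthreads()` between consecutive invocations. The kernel name, block size (`threads`), and per-thread input layout are illustrative assumptions.

```cuda
// Minimal sketch, assuming a hypothetical kernel, block size, and input layout.
#include <cstdint>
#include <cub/cub.cuh>

constexpr int threads = 128;  // assumed block size

__global__ void repeatedScans(const uint8_t* in, uint8_t* out)
{
    using BlockScan = cub::BlockScan<uint8_t, threads>;
    // One TempStorage object, reused across all 64 scans.
    __shared__ typename BlockScan::TempStorage tempStorage;

    for (int i = 0; i < 64; ++i) {
        uint8_t threadData = in[i * threads + threadIdx.x];
        BlockScan(tempStorage).InclusiveSum(threadData, threadData);
        out[i * threads + threadIdx.x] = threadData;

        if (threadIdx.x == 0) {
            // Intermediate per-thread work, as in the question.
        }

        // Required: the next iteration reuses the same tempStorage allocation.
        __syncthreads();
    }
}
```

This keeps the shared memory footprint at a single `TempStorage` object, which typically preserves occupancy better than allocating 64 of them, at the cost of one barrier per iteration.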