CUDA Warp Primitives and Sync Notes

October 8, 2023

Volta Independent Thread Scheduling

Before Volta, all threads in a warp share a single program counter (PC). The scheduler issues instructions on an "earliest-all-thread-convergence" principle: diverged branches are executed serially, and all threads reconverge at the earliest possible point. This has performance implications, because a diverged branch may run for a long time on only a few threads while the other threads wait, which leads to starvation.
In Volta, each thread maintains its own PC. Diverged branches may be scheduled in an interleaved fashion, and reconvergence is no longer enforced at the earliest possible point. That is, even where the code after a conditional branch could execute in a converged fashion, this is not guaranteed, and threads may execute that code in a diverged manner. This alleviates the starvation problem but makes thread convergence non-deterministic.

Independent Thread Scheduling
Volta Independent Thread Scheduling w/ Graphs

__syncwarp() and the _sync-suffixed warp-level primitives (such as __shfl_sync() and __ballot_sync()) were introduced to enforce deterministic warp-level convergence and to keep warp-level operations, including reductions, correct. The unsynchronized versions are no longer safe and have been deprecated since CUDA 9.0, because threads may execute a reduction or ballot in a diverged manner, which leads to erroneous results.
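As an illustration of the _sync primitives, here is a minimal sketch of a warp-level sum built on __shfl_down_sync(). The names warpReduceSum, sumKernel and runWarpSum are hypothetical, and the kernel assumes it is launched with exactly one 32-thread block.

```cuda
#include <cuda_runtime.h>

constexpr unsigned FULL_MASK = 0xffffffffu;

// Sum `val` across the 32 lanes of a warp; lane 0 ends up with the total.
__device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        // The _sync variant synchronizes the lanes named in FULL_MASK
        // before the register exchange takes place.
        val += __shfl_down_sync(FULL_MASK, val, offset);
    }
    return val;
}

// Assumption: launched as <<<1, 32>>>, so threadIdx.x is the lane index.
__global__ void sumKernel(const int* in, int* out) {
    int lane = threadIdx.x;
    int total = warpReduceSum(in[lane]);
    if (lane == 0) *out = total;
}

// Hypothetical host helper: sums the integers 0..31 on the GPU.
int runWarpSum() {
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i;
    int *d_in = nullptr, *d_out = nullptr, h_out = -1;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    sumKernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling runWarpSum() from a host program should return 0 + 1 + ... + 31 = 496. Only lane 0 holds the full sum; the other lanes hold partial sums.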

Why __syncwarp()

Intra-warp synchronization with __syncwarp() is probably implemented with registers rather than a dedicated barrier resource. There is effectively a single warp-level "barrier": on Volta and later, diverged threads can call __syncwarp() at different PCs and still synchronize with each other.
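Block-level synchronization with __syncthreads() is implemented with hardware barriers, of which each CTA has 16. Unlike __syncwarp(), __syncthreads() calls placed in different diverged branches should not be expected to synchronize with each other: the CUDA programming guide only allows __syncthreads() in conditional code when the condition evaluates identically across the whole thread block.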
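The point about __syncwarp() being callable from different PCs can be sketched as follows. This assumes compute capability 7.0 or later (on older architectures all lanes named in the mask must execute the same __syncwarp() in convergence) and a launch with one 32-thread block. branchSyncKernel and runBranchSync are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Even lanes publish a value, odd lanes read their left neighbour's value.
// The two __syncwarp() calls sit in different branches, i.e. at different PCs.
// Assumption: sm_70+ and a launch of <<<1, 32>>>.
__global__ void branchSyncKernel(int* out) {
    __shared__ int buf[32];
    int lane = threadIdx.x;
    if (lane % 2 == 0) {
        buf[lane] = lane * 10;
        __syncwarp();                // call site A (even lanes)
        out[lane] = buf[lane];
    } else {
        __syncwarp();                // call site B (odd lanes)
        out[lane] = buf[lane - 1];   // value written by the even neighbour
    }
}

// Hypothetical host helper: runs the kernel and returns out[lane].
int runBranchSync(int lane) {
    int h_out[32];
    int* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(h_out));
    branchSyncKernel<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out[lane];
}
```

Both lane 4 and lane 5 should end up with 40. Replacing the two __syncwarp() calls with __syncthreads() would not be a valid program, since the branch condition differs between threads of the block.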

__syncthreads()
PTX barrier.sync

The mask argument (the first parameter of each _sync primitive) specifies which threads must synchronize; it does not filter which threads take part. The threads named in the mask are implicitly synchronized before the operation executes.
A shuffle-based (shfl) reduction across diverged threads requires careful engineering: reading from a lane that is not participating yields an undefined value. It can be done correctly by forcing convergence manually and having the originally inactive threads contribute 0, the identity for a sum.
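One way to apply this, as a minimal sketch: keep the shuffle outside the divergent branch, let every lane join it, and give lanes that fail the condition a contribution of 0. The names filteredSumKernel and runFilteredSum are hypothetical, and the kernel assumes a single 32-thread block.

```cuda
#include <cuda_runtime.h>

constexpr unsigned FULL_MASK = 0xffffffffu;

// Sums only the positive inputs of one warp.
// Assumption: launched as <<<1, 32>>>.
__global__ void filteredSumKernel(const int* in, int* out) {
    int lane = threadIdx.x;
    int v = in[lane];
    // Lanes that would have been inactive inside `if (v > 0)` still take
    // part in the shuffle, but contribute 0.
    int contrib = (v > 0) ? v : 0;
    for (int offset = 16; offset > 0; offset /= 2) {
        contrib += __shfl_down_sync(FULL_MASK, contrib, offset);
    }
    if (lane == 0) *out = contrib;
}

// Hypothetical host helper: even indices hold +i, odd indices hold -i.
int runFilteredSum() {
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = (i % 2 == 0) ? i : -i;
    int *d_in = nullptr, *d_out = nullptr, h_out = -1;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    filteredSumKernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With this input the positive values are 2, 4, ..., 30, so runFilteredSum() should return 240.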

Warp-Level Primitives
stackoverflow diverged __shfl

Cooperative Groups provides a simpler interface for warp-level reductions under divergence.
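A minimal sketch of the Cooperative Groups approach, assuming CUDA 11.0 or later (for cg::reduce) and a single 32-thread block. The names cgFilteredSumKernel and runCgFilteredSum are hypothetical. Because coalesced_threads() may return fewer threads than actually took the branch, each group adds its partial sum to the result atomically.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Sums only the positive inputs, reducing inside the divergent branch.
// Assumption: *out is zeroed before the launch.
__global__ void cgFilteredSumKernel(const int* in, int* out) {
    int v = in[threadIdx.x];
    if (v > 0) {
        // Group of the threads that are converged at this point.
        cg::coalesced_group active = cg::coalesced_threads();
        int sum = cg::reduce(active, v, cg::plus<int>());
        // One atomic per group, so several partial groups still add up.
        if (active.thread_rank() == 0) atomicAdd(out, sum);
    }
}

// Hypothetical host helper: even indices hold +i, odd indices hold -i.
int runCgFilteredSum() {
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = (i % 2 == 0) ? i : -i;
    int *d_in = nullptr, *d_out = nullptr, h_out = -1;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));
    cgFilteredSumKernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

runCgFilteredSum() should return 2 + 4 + ... + 30 = 240. No explicit mask or zero-padding is needed, since the group itself describes which threads participate.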

Warp Primitives to Cooperative Groups Translation

Because warp-level reductions do not use the atomic hardware (they likely exchange values through registers), they can be faster than reductions built on global or shared memory atomics when contention is high.
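The usual illustration of this is warp-aggregated filtering, where the active lanes of a warp share one atomicAdd() instead of issuing one each. The following is a sketch of that pattern under stated assumptions: a one-dimensional thread block, and the hypothetical names atomicAggInc, filterKernel and runFilter.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reserve one output slot per active lane using a single atomicAdd.
// Assumption: 1-D blocks, so threadIdx.x % 32 is the lane index.
__device__ int atomicAggInc(int* counter) {
    unsigned active = __activemask();            // lanes converged here
    int lane = threadIdx.x % 32;
    int leader = __ffs(active) - 1;              // lowest active lane
    int rank = __popc(active & ((1u << lane) - 1));  // active lanes below me
    int base = 0;
    if (lane == leader) base = atomicAdd(counter, __popc(active));
    base = __shfl_sync(active, base, leader);    // broadcast the leader's base
    return base + rank;
}

// Copies the positive elements of `in` to `out` (order not preserved).
__global__ void filterKernel(const int* in, int* out, int* count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        out[atomicAggInc(count)] = in[i];
    }
}

// Hypothetical host helper: every third element is positive.
int runFilter() {
    const int n = 1024;
    std::vector<int> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = (i % 3 == 0) ? i + 1 : -1;
    int *d_in = nullptr, *d_out = nullptr, *d_count = nullptr, h_count = -1;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));
    filterKernel<<<n / 256, 256>>>(d_in, d_out, d_count, n);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return h_count;
}
```

With this input, 342 of the 1024 elements are positive, so runFilter() should return 342. The number of atomics issued drops from one per passing element to roughly one per warp, which is where the benefit under contention comes from.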

Performance Analysis on Warp-Aggregated Filtering

snowsr • October 8, 2023