Volta Independent Thread Scheduling
- Prior to Volta, all threads in a warp share the same PC. The scheduler schedules instructions on an “earliest-all-thread-convergence” principle: diverged branches are executed serially, and all threads reconverge at the earliest possible point. This has performance implications, since a diverged branch may run for a long time on only a few threads, forcing the remaining threads to wait for a long period, which can lead to starvation.
- In Volta, each thread maintains an individual PC. Diverged branches may be scheduled interleaved, and reconvergence is not enforced at the earliest possible point. That is, even though the code may be able to execute in a converged fashion after a conditional branch, convergence is not enforced, and threads may execute convergent code in a diverged manner. This alleviates the starvation problem but makes thread convergence non-deterministic. __syncwarp() and the _sync-suffixed warp-level primitives were introduced to assert deterministic warp-level convergence and to keep warp-level primitives, including reductions, correct. The unsynced versions become invalid (and are deprecated), as threads may be executing a reduction or ballot in a diverged manner, which leads to erroneous results.
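As a minimal illustration (the kernel and data layout below are hypothetical), a warp vote that pre-Volta code might have written with the deprecated __ballot() must use the mask-taking __ballot_sync() on Volta and later, so the hardware knows which lanes to reconverge before the primitive executes:

```cuda
__global__ void vote_kernel(const int *data, unsigned *ballots) {
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    int pred = data[blockIdx.x * blockDim.x + threadIdx.x] > 0;

    // 0xffffffff names all 32 lanes: the intrinsic implicitly synchronizes
    // those lanes before taking the vote, so the result is well defined even
    // if the warp was diverged beforehand. Older code that relied on implicit
    // convergence would have called the deprecated __ballot(pred).
    unsigned ballot = __ballot_sync(0xffffffff, pred);

    if (lane == 0)
        ballots[blockIdx.x * (blockDim.x / 32) + warp] = ballot;
}
```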
__syncwarp() vs __syncthreads()
- Intra-warp synchronization with __syncwarp() is potentially implemented with registers. There is effectively only one warp-level “barrier”; that is, diverged threads can call __syncwarp() at different PCs and still “converge” (see the sketch after this list).
- Block-level synchronization with __syncthreads() is implemented with hardware barriers. Each CTA has 16 barrier instances; calls at different PCs will be assigned different barrier instances, so __syncthreads() calls at different PCs will not converge with each other.
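A small sketch of the __syncwarp() call-site property described above (the kernel name and the one-warp-per-block launch are assumptions for brevity): the two calls sit at different PCs but use the same default full mask, so they synchronize with each other, which a pair of __syncthreads() calls placed the same way would not be guaranteed to do.

```cuda
// Assumes blockDim.x == 32 (one warp per block).
__global__ void diverged_syncwarp(float *buf) {
    __shared__ float smem[32];
    unsigned lane = threadIdx.x;

    if (lane < 16) {
        smem[lane] = buf[lane] * 2.0f;
        __syncwarp();              // call site A
    } else {
        smem[lane] = buf[lane] + 1.0f;
        __syncwarp();              // call site B
    }
    // Every lane has passed a __syncwarp() with the default full mask, so the
    // shared-memory writes above are ordered before the reads below for the
    // whole warp, even though the two call sites differ.
    buf[lane] = smem[lane ^ 16];
}
```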
__shfl_X_sync() Warp Level Reduction on Floats
- The first mask argument is for synchronization, not filtering. There is an implicit warp synchronization on the given mask prior to the reduction.
- shfl reduction on diverged threads requires careful engineering. Diverged threads may yield invalid or undefined reduction values. It can be done correctly by manually forcing convergence and setting the originally inactive threads’ contribution to 0 (the identity for a sum); see the sketches after this list.
- Cooperative Groups provide simple interfaces for warp level reductions with divergence.
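Two hedged sketches of the points above, assuming a float sum and CUDA 11+ for the Cooperative Groups reduce; kernel names and indexing are illustrative. The first forces all 32 lanes through the shuffles with non-contributing lanes supplying the identity 0; the second lets a coalesced group reduce over only the lanes that actually took the branch.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// Manual approach: turn the divergent branch into predication so every lane
// participates in the shuffles, with inactive lanes contributing 0. Using
// __activemask() as the shuffle mask instead would be fragile, because the
// set of active lanes at this point is not deterministic under ITS.
__global__ void divergent_warp_sum(const float *data, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    float contrib = (v > 0.0f) ? v : 0.0f;          // identity for the sum

    for (int offset = 16; offset > 0; offset /= 2)
        contrib += __shfl_down_sync(0xffffffff, contrib, offset);

    if (threadIdx.x % 32 == 0)
        out[i / 32] = contrib;                      // lane 0 holds the warp sum
}

// Cooperative Groups approach: reduce over exactly the lanes that reached
// this point, then have their leader publish one atomic per group.
__global__ void divergent_warp_sum_cg(const float *data, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    if (v > 0.0f) {
        cg::coalesced_group active = cg::coalesced_threads();
        float sum = cg::reduce(active, v, cg::plus<float>());
        if (active.thread_rank() == 0) atomicAdd(out, sum);
    }
}
```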
Warp-Aggregation Performance Implications
- Since warp-level reductions are performed without the atomic hardware (likely via warp registers), they can be faster than global/shared-memory reductions and atomics when there is high contention; see the warp-aggregated atomic sketch below.
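A sketch of the classic warp-aggregated atomic pattern, assuming Cooperative Groups is available; the helper warp_aggregated_inc and the filtering kernel are illustrative names, not library APIs. Lanes that reach the increment elect a leader, the leader issues a single atomicAdd for the whole group, and each lane derives its own slot from the leader's result, replacing up to 32 contended atomics with one.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// One atomic per coalesced group instead of one per thread.
__device__ int warp_aggregated_inc(int *counter) {
    cg::coalesced_group g = cg::coalesced_threads();
    int base;
    if (g.thread_rank() == 0)
        base = atomicAdd(counter, g.size());   // leader reserves g.size() slots
    base = g.shfl(base, 0);                    // broadcast the leader's base index
    return base + g.thread_rank();             // unique slot per participating lane
}

// Example use: stream-compact positive values into dst.
__global__ void filter_positive(const float *src, float *dst, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && src[i] > 0.0f)
        dst[warp_aggregated_inc(count)] = src[i];
}
```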