libSCALE  0.2.0
A modern C++ CUDA API
Synchronisation Functions

Access to the more elaborate syncthreads variants. More...

Functions

__device__ void sp::syncthreads (int barrierID, int numWarps)
 A more powerful __syncthreads() More...
 
__device__ void sp::syncthreads (int barrierID)
 Like the other sp::syncthreads(), but implicitly synchronises all non-exited warps in the block. More...
 
__device__ void sp::syncthreads_arrive (int barrierID, int numWarps)
 Functions exactly like sp::syncthreads(), but this warp does not block. More...
 
__device__ int sp::syncthreads_count (int barrierID, int numWarps, bool predicate)
 Like sp::syncthreads(), but also returns a count of how many threads passed true for predicate. More...
 
__device__ int sp::syncthreads_count (int barrierID, bool predicate)
 sp::syncthreads_count(), implicitly applied to all non-exited warps. More...
 
__device__ bool sp::syncthreads_and (int barrierID, int numWarps, bool predicate)
 Like sp::syncthreads(), but also returns true iff all participating threads passed true for predicate. More...
 
__device__ bool sp::syncthreads_and (int barrierID, bool predicate)
 sp::syncthreads_and(), implicitly applied to all non-exited warps. More...
 
__device__ bool sp::syncthreads_or (int barrierID, int numWarps, bool predicate)
 Like sp::syncthreads(), but also returns true iff any participating threads passed true for predicate. More...
 
__device__ bool sp::syncthreads_or (int barrierID, bool predicate)
 sp::syncthreads_or(), implicitly applied to all non-exited warps. More...
 

Detailed Description

Access to the more elaborate syncthreads variants.

These are preferable to using inline PTX because:

Function Documentation

◆ syncthreads() [1/2]

__device__ void sp::syncthreads ( int  barrierID)

Like the other sp::syncthreads(), but implicitly synchronises all non-exited warps in the block.

This actually maps to a different underlying instruction, so is slightly cheaper (and much less annoying) than keeping track of the proper value to pass to numWarps yourself.

◆ syncthreads() [2/2]

__device__ void sp::syncthreads ( int  barrierID,
int  numWarps 
)

A more powerful __syncthreads()

This basically exposes the ptx bar.sync instruction directly.

There are 16 counters, initially zero. When a warp reaches syncthreads(X, Y), the X'th counter is increased by warpSize (regardless of how many threads of that warp were actually active at the time), and the warp goes to sleep. When the counter reaches Y, it is reset to zero and all threads sleeping on counter X wake up.

The behaviour is undefined if:

  • Y is not a multiple of warp size
  • Different values of Y are used for the same synchronisation operation (the target value of the counter will then become undefined, and you'll probably get deadlock).
  • X is less than 0 or more than 15. The ptx assembler doesn't detect this condition for some reason...

Vanilla __syncthreads() is equivalent to sp::syncthreads(0, num_warps_in_current_block), except it doesn't incur the cost of having to compute that second argument.

This is useful for performance in cases where you discover some threads want to early-return, but ordinarily can't because doing so would make a future __syncthreads() deadlock. What you can do instead is compute how many warps are going to survive and pass that as the numWarps argument here.

This mechanism also allows you to synchronise less than an entire block at once. You can use this to make different warps in the same block carry out independent work, synchronising with each other as they go along. This effectively allows you to use one block as if it were multiple blocks (but with the benefits of sharing caches, shared memory, and so on).

Parameters
barrierIDWhich barrier counter to use.
numWarpsNumber of warps to synchronise.

◆ syncthreads_and() [1/2]

__device__ bool sp::syncthreads_and ( int  barrierID,
bool  predicate 
)

sp::syncthreads_and(), implicitly applied to all non-exited warps.

◆ syncthreads_and() [2/2]

__device__ bool sp::syncthreads_and ( int  barrierID,
int  numWarps,
bool  predicate 
)

Like sp::syncthreads(), but also returns true iff all participating threads passed true for predicate.

◆ syncthreads_arrive()

__device__ void sp::syncthreads_arrive ( int  barrierID,
int  numWarps 
)

Functions exactly like sp::syncthreads(), but this warp does not block.

This allows a thread to notify a barrier without waiting on it. This may wake up other threads that were waiting. This is useful for implementing producer-consumer patterns, among other things.

◆ syncthreads_count() [1/2]

__device__ int sp::syncthreads_count ( int  barrierID,
bool  predicate 
)

sp::syncthreads_count(), implicitly applied to all non-exited warps.

◆ syncthreads_count() [2/2]

__device__ int sp::syncthreads_count ( int  barrierID,
int  numWarps,
bool  predicate 
)

Like sp::syncthreads(), but also returns a count of how many threads passed true for predicate.

◆ syncthreads_or() [1/2]

__device__ bool sp::syncthreads_or ( int  barrierID,
bool  predicate 
)

sp::syncthreads_or(), implicitly applied to all non-exited warps.

◆ syncthreads_or() [2/2]

__device__ bool sp::syncthreads_or ( int  barrierID,
int  numWarps,
bool  predicate 
)

Like sp::syncthreads(), but also returns true iff any participating threads passed true for predicate.