Access to the more elaborate syncthreads variants. More...

Functions
__device__ void	sp::syncthreads (int barrierID, int numWarps)
	A more powerful `__syncthreads()` More...

__device__ void	sp::syncthreads (int barrierID)
	Like the other `sp::syncthreads()`, but implicitly synchronises all non-exited warps in the block. More...

__device__ void	sp::syncthreads_arrive (int barrierID, int numWarps)
	Functions exactly like `sp::syncthreads()`, but this warp does not block. More...

__device__ int	sp::syncthreads_count (int barrierID, int numWarps, bool predicate)
	Like `sp::syncthreads()`, but also returns a count of how many threads passed true for `predicate`. More...

__device__ int	sp::syncthreads_count (int barrierID, bool predicate)
	`sp::syncthreads_count()`, implicitly applied to all non-exited warps. More...

__device__ bool	sp::syncthreads_and (int barrierID, int numWarps, bool predicate)
	Like `sp::syncthreads()`, but also returns true iff all participating threads passed true for `predicate`. More...

__device__ bool	sp::syncthreads_and (int barrierID, bool predicate)
	`sp::syncthreads_and()`, implicitly applied to all non-exited warps. More...

__device__ bool	sp::syncthreads_or (int barrierID, int numWarps, bool predicate)
	Like `sp::syncthreads()`, but also returns true iff any participating threads passed true for `predicate`. More...

__device__ bool	sp::syncthreads_or (int barrierID, bool predicate)
	`sp::syncthreads_or()`, implicitly applied to all non-exited warps. More...

Detailed Description

Access to the more elaborate syncthreads variants.

These are preferable to using inline PTX because:

They inhibit instruction scheduling less, improving optimisation.
They are slightly more pleasant to use.
They can be software-emulated on platforms that don't have instructions for them, making your code portable.

Function Documentation

◆ syncthreads() [1/2]

__device__ void sp::syncthreads ( int barrierID )

Like the other sp::syncthreads(), but implicitly synchronises all non-exited warps in the block.

This actually maps to a different underlying instruction, so is slightly cheaper (and much less annoying) than keeping track of the proper value to pass to numWarps yourself.

◆ syncthreads() [2/2]

__device__ void sp::syncthreads	(	int	barrierID,
		int	numWarps
	)

A more powerful __syncthreads()

This basically exposes the ptx bar.sync instruction directly.

There are 16 counters, initially zero. When a warp reaches syncthreads(X, Y), the X'th counter is increased by warpSize (regardless of how many threads of that warp were actually active at the time), and the warp goes to sleep. When the counter reaches Y, it is reset to zero and all threads sleeping on counter X wake up.

The behaviour is undefined if:

Y is not a multiple of warp size
Different values of Y are used for the same synchronisation operation (the target value of the counter will then become undefined, and you'll probably get deadlock).
X is less than 0 or more than 15. The ptx assembler doesn't detect this condition for some reason...

Vanilla __syncthreads() is equivalent to sp::syncthreads(0, num_warps_in_current_block), except it doesn't incur the cost of having to compute that second argument.

This is useful for performance in cases where you discover some threads want to early-return, but ordinarily can't because doing so would make a future __syncthreads() deadlock. What you can do instead is compute how many warps are going to survive and pass that as the numWarps argument here.

This mechanism also allows you to synchronise less than an entire block at once. You can use this to make different warps in the same block carry out independent work, synchronising with each other as they go along. This effectively allows you to use one block as if it were multiple blocks (but with the benefits of sharing caches, shared memory, and so on).

Parameters

barrierID	Which barrier counter to use.
numWarps	Number of warps to synchronise.

◆ syncthreads_and() [1/2]

__device__ bool sp::syncthreads_and	(	int	barrierID,
		bool	predicate
	)

sp::syncthreads_and(), implicitly applied to all non-exited warps.

◆ syncthreads_and() [2/2]

__device__ bool sp::syncthreads_and	(	int	barrierID,
		int	numWarps,
		bool	predicate
	)

Like sp::syncthreads(), but also returns true iff all participating threads passed true for predicate.

◆ syncthreads_arrive()

__device__ void sp::syncthreads_arrive	(	int	barrierID,
		int	numWarps
	)

Functions exactly like sp::syncthreads(), but this warp does not block.

This allows a thread to notify a barrier without waiting on it. This may wake up other threads that were waiting. This is useful for implementing producer-consumer patterns, among other things.

◆ syncthreads_count() [1/2]

__device__ int sp::syncthreads_count	(	int	barrierID,
		bool	predicate
	)

sp::syncthreads_count(), implicitly applied to all non-exited warps.

◆ syncthreads_count() [2/2]

__device__ int sp::syncthreads_count	(	int	barrierID,
		int	numWarps,
		bool	predicate
	)

Like sp::syncthreads(), but also returns a count of how many threads passed true for predicate.

◆ syncthreads_or() [1/2]

__device__ bool sp::syncthreads_or	(	int	barrierID,
		bool	predicate
	)

sp::syncthreads_or(), implicitly applied to all non-exited warps.

◆ syncthreads_or() [2/2]

__device__ bool sp::syncthreads_or	(	int	barrierID,
		int	numWarps,
		bool	predicate
	)

Like sp::syncthreads(), but also returns true iff any participating threads passed true for predicate.