Generic warp shuffle support supporting arbitrary input types. More...

Classes
struct	sp::Shfl
	Shuffler corresponding to `shfl()`. The offset parameter behaves like `shfl()`'s `srcLane` argument. More...

struct	sp::ShflUp
	Shuffler corresponding to `shfl_up()`. The offset parameter behaves like `shfl_up()`'s `delta` argument. More...

struct	sp::ShflDown
	Shuffler corresponding to `shfl_down()`. The offset parameter behaves like `shfl_down()`'s `delta` argument. More...

struct	sp::ShflXor
	Shuffler corresponding to `shfl_xor()`. The offset parameter behaves like `shfl_xor()`'s `laneMask` argument. More...

Functions
template<typename Shuffler = Shfl, typename T >
__device__ auto	sp::shuffle (T value, int offset, int logicalWarpSize=WARP_SIZE)
	Generic shuffle. More...

template<typename Shuffler = Shfl, typename T >
__device__ std::pair< bool, T >	sp::shufflePredicated (T value, int offset, int logicalWarpSize=WARP_SIZE, int laneID=0)
	Like `shuffle()`, but also yields a boolean indicating if the value that was read is valid. More...

Detailed Description

Generic warp shuffle support supporting arbitrary input types.

See also: sp::shuffle()

Function Documentation

◆ shuffle()

template<typename Shuffler = Shfl, typename T >

__device__ auto sp::shuffle	(	T	value,
		int	offset,
		int	logicalWarpSize = `WARP_SIZE`
	)

Generic shuffle.

This is the same as the CUDA __shfl_*_sync() intrinsics, except it works for any data type. The input is decomposed into 4-byte chunks which are sent with separate shuffle instructions and reassembled at the destination. There is no runtime overhead for this abstraction: it compiles to a sequence of shuffle instructions.

Warning: Input types that aren't a multiple of 4-bytes in length aren't supported (and won't compile). This could be fixed.

Example usage

struct SomeLargeStruct {
   int32_t stuff[8];
};
 
__device__ void example() {
   // Send a SomeLargeStruct instance to another thread.
   SomeLargeStruct foo;
 
   // Equivalent to having done `__shfl_down_sync(X, 1)` for every 32-bit chunk of `foo` in sequence.
   // ... But without the objectionableness.
   foo = shuffle<ShflDown>(foo, 1);
}

Template Parameters

Shuffler	A functor type specifying the shuffle operation to use (see below). By default, uses the Shfl shuffler, which is like shfl() and for which the offset argument is like shfl()'s srcLane argument.
T	The type of the value to shuffle.

Parameters

value	The value to shuffle.
offset	The offset argument to pass to the shuffler.
logicalWarpSize	The logical warp size to use.

◆ shufflePredicated()

template<typename Shuffler = Shfl, typename T >

__device__ std::pair< bool, T > sp::shufflePredicated	(	T	value,
		int	offset,
		int	logicalWarpSize = `WARP_SIZE`,
		int	laneID = `0`
	)

Like shuffle(), but also yields a boolean indicating if the value that was read is valid.

The returned boolean is true iff the returned payload was read from a lane that is in bounds. This operation is cheaper than writing such a boundary check in your code, unless we've implemented the compiler optimisation that does this and nobody remembered to fix this documentation yet.

This effectively gives you access to the second (predicate) output of the underlying shuffle instructions on NVIDIA® hardware without the downsides of using inline PTX (out compiler has some ability to instruction schedule these shuffles). On AMD, the predicate result is separately calculated, meaning this function lets you write optimal code in a portable way.

See also: sp::shuffle().