libSCALE  0.2.0
A modern C++ CUDA API

Generic warp shuffle support supporting arbitrary input types. More...

Classes

struct  sp::Shfl
 Shuffler corresponding to shfl(). The offset parameter behaves like shfl()'s srcLane argument. More...
 
struct  sp::ShflUp
 Shuffler corresponding to shfl_up(). The offset parameter behaves like shfl_up()'s delta argument. More...
 
struct  sp::ShflDown
 Shuffler corresponding to shfl_down(). The offset parameter behaves like shfl_down()'s delta argument. More...
 
struct  sp::ShflXor
 Shuffler corresponding to shfl_xor(). The offset parameter behaves like shfl_xor()'s laneMask argument. More...
 

Functions

template<typename Shuffler = Shfl, typename T >
__device__ auto sp::shuffle (T value, int offset, int logicalWarpSize=WARP_SIZE)
 Generic shuffle. More...
 
template<typename Shuffler = Shfl, typename T >
__device__ std::pair< bool, T > sp::shufflePredicated (T value, int offset, int logicalWarpSize=WARP_SIZE, int laneID=0)
 Like shuffle(), but also yields a boolean indicating if the value that was read is valid. More...
 

Detailed Description

Generic warp shuffle support supporting arbitrary input types.

See also
sp::shuffle()

Function Documentation

◆ shuffle()

template<typename Shuffler = Shfl, typename T >
__device__ auto sp::shuffle ( value,
int  offset,
int  logicalWarpSize = WARP_SIZE 
)

Generic shuffle.

This is the same as the CUDA __shfl_*_sync() intrinsics, except it works for any data type. The input is decomposed into 4-byte chunks which are sent with separate shuffle instructions and reassembled at the destination. There is no runtime overhead for this abstraction: it compiles to a sequence of shuffle instructions.

Warning
Input types that aren't a multiple of 4-bytes in length aren't supported (and won't compile). This could be fixed.

Example usage

struct SomeLargeStruct {
int32_t stuff[8];
};
__device__ void example() {
// Send a SomeLargeStruct instance to another thread.
SomeLargeStruct foo;
// Equivalent to having done `__shfl_down_sync(X, 1)` for every 32-bit chunk of `foo` in sequence.
// ... But without the objectionableness.
foo = shuffle<ShflDown>(foo, 1);
}
Template Parameters
ShufflerA functor type specifying the shuffle operation to use (see below). By default, uses the Shfl shuffler, which is like shfl() and for which the offset argument is like shfl()'s srcLane argument.
TThe type of the value to shuffle.
Parameters
valueThe value to shuffle.
offsetThe offset argument to pass to the shuffler.
logicalWarpSizeThe logical warp size to use.

◆ shufflePredicated()

template<typename Shuffler = Shfl, typename T >
__device__ std::pair< bool, T > sp::shufflePredicated ( value,
int  offset,
int  logicalWarpSize = WARP_SIZE,
int  laneID = 0 
)

Like shuffle(), but also yields a boolean indicating if the value that was read is valid.

The returned boolean is true iff the returned payload was read from a lane that is in bounds. This operation is cheaper than writing such a boundary check in your code, unless we've implemented the compiler optimisation that does this and nobody remembered to fix this documentation yet.

This effectively gives you access to the second (predicate) output of the underlying shuffle instructions on NVIDIA® hardware without the downsides of using inline PTX (out compiler has some ability to instruction schedule these shuffles). On AMD, the predicate result is separately calculated, meaning this function lets you write optimal code in a portable way.

See also
sp::shuffle().
Template Parameters
ShufflerA functor type specifying the shuffle operation to use (see below). By default, uses the Shfl shuffler, which is like shfl() and for which the offset argument is like shfl()'s srcLane argument.
TThe type of the value to shuffle.
Parameters
valueThe value to shuffle.
offsetThe offset argument to pass to the shuffler.
logicalWarpSizeThe logical warp size to use. If omitted, the entire warp is used.
laneIDThe lane ID of this warp. Needed only if you're using a non-default logicalWarpSize.