Generic warp shuffle support supporting arbitrary input types.
More...
|
struct | sp::Shfl |
| Shuffler corresponding to shfl() . The offset parameter behaves like shfl() 's srcLane argument. More...
|
|
struct | sp::ShflUp |
| Shuffler corresponding to shfl_up() . The offset parameter behaves like shfl_up() 's delta argument. More...
|
|
struct | sp::ShflDown |
| Shuffler corresponding to shfl_down() . The offset parameter behaves like shfl_down() 's delta argument. More...
|
|
struct | sp::ShflXor |
| Shuffler corresponding to shfl_xor() . The offset parameter behaves like shfl_xor() 's laneMask argument. More...
|
|
|
template<typename Shuffler = Shfl, typename T > |
__device__ auto | sp::shuffle (T value, int offset, int logicalWarpSize=WARP_SIZE) |
| Generic shuffle. More...
|
|
template<typename Shuffler = Shfl, typename T > |
__device__ std::pair< bool, T > | sp::shufflePredicated (T value, int offset, int logicalWarpSize=WARP_SIZE, int laneID=0) |
| Like shuffle() , but also yields a boolean indicating if the value that was read is valid. More...
|
|
Generic warp shuffle support supporting arbitrary input types.
- See also
- sp::shuffle()
◆ shuffle()
template<typename Shuffler = Shfl, typename T >
__device__ auto sp::shuffle |
( |
T |
value, |
|
|
int |
offset, |
|
|
int |
logicalWarpSize = WARP_SIZE |
|
) |
| |
Generic shuffle.
This is the same as the CUDA __shfl_*_sync()
intrinsics, except it works for any data type. The input is decomposed into 4-byte chunks which are sent with separate shuffle instructions and reassembled at the destination. There is no runtime overhead for this abstraction: it compiles to a sequence of shuffle instructions.
- Warning
- Input types that aren't a multiple of 4-bytes in length aren't supported (and won't compile). This could be fixed.
Example usage
struct SomeLargeStruct {
int32_t stuff[8];
};
__device__ void example() {
SomeLargeStruct foo;
foo = shuffle<ShflDown>(foo, 1);
}
- Template Parameters
-
Shuffler | A functor type specifying the shuffle operation to use (see below). By default, uses the Shfl shuffler, which is like shfl() and for which the offset argument is like shfl()'s srcLane argument. |
T | The type of the value to shuffle. |
- Parameters
-
value | The value to shuffle. |
offset | The offset argument to pass to the shuffler. |
logicalWarpSize | The logical warp size to use. |
◆ shufflePredicated()
template<typename Shuffler = Shfl, typename T >
__device__ std::pair< bool, T > sp::shufflePredicated |
( |
T |
value, |
|
|
int |
offset, |
|
|
int |
logicalWarpSize = WARP_SIZE , |
|
|
int |
laneID = 0 |
|
) |
| |
Like shuffle()
, but also yields a boolean indicating if the value that was read is valid.
The returned boolean is true iff the returned payload was read from a lane that is in bounds. This operation is cheaper than writing such a boundary check in your code, unless we've implemented the compiler optimisation that does this and nobody remembered to fix this documentation yet.
This effectively gives you access to the second (predicate) output of the underlying shuffle instructions on NVIDIA® hardware without the downsides of using inline PTX (out compiler has some ability to instruction schedule these shuffles). On AMD, the predicate result is separately calculated, meaning this function lets you write optimal code in a portable way.
- See also
sp::shuffle()
.
- Template Parameters
-
Shuffler | A functor type specifying the shuffle operation to use (see below). By default, uses the Shfl shuffler, which is like shfl() and for which the offset argument is like shfl()'s srcLane argument. |
T | The type of the value to shuffle. |
- Parameters
-
value | The value to shuffle. |
offset | The offset argument to pass to the shuffler. |
logicalWarpSize | The logical warp size to use. If omitted, the entire warp is used. |
laneID | The lane ID of this warp. Needed only if you're using a non-default logicalWarpSize. |