Things related to sorting. More...
Classes | |
struct | sp::DefaultKeyFunctor |
The default KeyFunctor used for radix sorts. More... | |
Functions | |
template<typename T , int N> | |
__device__ sp::Vec< T, N > | sp::warpBitonicSort (sp::Vec< T, N > x, int laneID) |
Perform a bitonic sort across the warp. More... | |
template<typename T , int ThreadsPerBlock, int N, int EffectiveBlockSize = ThreadsPerBlock> | |
__device__ sp::Vec< T, N > | sp::blockBitonicSort (sp::Vec< T, N > x, int threadId, sp::StaticTensor< T, ThreadsPerBlock *N *2 > &cache) |
Perform a bitonic sort of ThreadsPerBlock * N elements across an entire block. More... | |
template<typename KeyFunctor = DefaultKeyFunctor, typename SrcTensor , typename TmpTensor > | |
void | sp::radixSort (sp::Stream &s, SrcTensor subject, TmpTensor tmp) |
Radix sort. More... | |
Things related to sorting.
__device__ sp::Vec< T, N > sp::blockBitonicSort | ( | sp::Vec< T, N > | x, |
int | threadId, | ||
sp::StaticTensor< T, ThreadsPerBlock *N *2 > & | cache | ||
) |
Perform a bitonic sort of ThreadsPerBlock * N elements across an entire block.
The input and output per thread is a vector containing N contiguous elements of the array to be sorted.
This may be used with a proper initial segment of the full thread block, by setting ThreadsPerBlock accordingly. Thread indices are assumed to range over {0, 1, ..., ThreadsPerBlock - 1}, and ThreadsPerBlock must be divisible by WARP_SIZE.
The parameters ThreadsPerBlock and N must both be powers of two.
void sp::radixSort | ( | sp::Stream & | s, |
SrcTensor | subject, | ||
TmpTensor | tmp | ||
) |
Radix sort.
Performs a radix sort on the elements of subject
, using tmp
as a temporary workspace. The result is written into subject
.
Since radix sort is not a comparison sort, it cannot use operator<
to determine the ordering for user defined types. Instead, you must provide a KeyFunctor
that describes how to map the input elements to unsigned integers (of any length). You might find __uint<...>
useful for representing "strange" integer lengths for this purpose. Shorter keys are faster to sort.
If no KeyFunctor
is specified, DefaultKeyFunctor
is used. DefaultKeyFunctor
knows how to handle all integral and floating point types, meaning if you just want to sort numbers, you can ignore this issue entirely.
An example KeyFunctor
for a custom type might look like the following:
KeyFunctor | A functor object that can be used to convert elements of the input TensorLike into integral keys for sorting. |
s | The stream on which to enqueue the work. |
subject | A 1D TensorLike to sort. |
tmp | A 1D TensorLike that must be at least as large as subject and is used for temporary storage. |
__device__ sp::Vec< T, N > sp::warpBitonicSort | ( | sp::Vec< T, N > | x, |
int | laneID | ||
) |
Perform a bitonic sort across the warp.
Each thread contributes a Vec
of N
elements to the sort. Each thread is considered to be holding N
adjacent elements of the input list (formed by concatenating the contributions from each thread in order of lane ID). After the sort, each thread gets back the corresponding elements of the sorted list, such that concatenating all of the returned Vecs in laneID order would produce the sorted list.
x | A vector of elements to contribute to the sort. |
laneID | The lane ID of the calling thread. |