Speclib  0.1.2
The library for writing better CUDA libraries
Sorting

Things related to sorting. More...

Classes

struct  sp::DefaultKeyFunctor
 The default KeyFunctor used for radix sorts. More...
 

Functions

template<typename T , int N>
__device__ sp::Vec< T, N > sp::warpBitonicSort (sp::Vec< T, N > x, int laneID)
 Perform a bitonic sort across the warp. More...
 
template<typename T , int ThreadsPerBlock, int N, int EffectiveBlockSize = ThreadsPerBlock>
__device__ sp::Vec< T, N > sp::blockBitonicSort (sp::Vec< T, N > x, int threadId, sp::StaticTensor< T, ThreadsPerBlock *N *2 > &cache)
 Perform a bitonic sort of ThreadsPerBlock * N elements across an entire block. More...
 
template<typename KeyFunctor = DefaultKeyFunctor, typename SrcTensor , typename TmpTensor >
void sp::radixSort (sp::Stream &s, SrcTensor subject, TmpTensor tmp)
 Radix sort. More...
 

Detailed Description

Things related to sorting.

Function Documentation

◆ blockBitonicSort()

template<typename T , int ThreadsPerBlock, int N, int EffectiveBlockSize = ThreadsPerBlock>
__device__ sp::Vec< T, N > sp::blockBitonicSort ( sp::Vec< T, N >  x,
int  threadId,
sp::StaticTensor< T, ThreadsPerBlock *N *2 > &  cache 
)

Perform a bitonic sort of ThreadsPerBlock * N elements across an entire block.

The input and output per thread is a vector containing N contiguous elements of the array to be sorted.

This may be used with a proper initial segment of the full thread block, by setting ThreadsPerBlock accordingly. Thread indices are assumed to range over {0, 1, ..., ThreadsPerBlock - 1}, and ThreadsPerBlock must be divisible by WARP_SIZE.

The parameters ThreadsPerBlock and N must both be powers of two.

◆ radixSort()

template<typename KeyFunctor = DefaultKeyFunctor, typename SrcTensor , typename TmpTensor >
void sp::radixSort ( sp::Stream s,
SrcTensor  subject,
TmpTensor  tmp 
)

Radix sort.

Performs a radix sort on the elements of subject, using tmp as a temporary workspace. The result is written into subject.

Since radix sort is not a comparison sort, it cannot use operator< to determine the ordering for user defined types. Instead, you must provide a KeyFunctor that describes how to map the input elements to unsigned integers (of any length). You might find __uint<...> useful for representing "strange" integer lengths for this purpose. Shorter keys are faster to sort.

If no KeyFunctor is specified, DefaultKeyFunctor is used. DefaultKeyFunctor knows how to handle all integral and floating point types, meaning if you just want to sort numbers, you can ignore this issue entirely.

An example KeyFunctor for a custom type might look like the following:

struct ThingToSort {
char c;
uint32_t i;
}
struct ExampleKeyFunctor {
_hd_ auto operator()(const ThingToSort& value) {
// Let's sort first by the char, and then by the integer.
}
}
unsigned _ExtInt(Bits) __uint
Arbitrary length fixed-width unsigned integer.
Definition: Int.hpp:58
constexpr ValueT bitfieldInsert(ValueT x, PayloadT value, int start, int len=bits< PayloadT >)
Copy the least significant len bits of value into x at position start.
Definition: Twiddles.hpp:128
Template Parameters
KeyFunctorA functor object that can be used to convert elements of the input TensorLike into integral keys for sorting.
Parameters
sThe stream on which to enqueue the work.
subjectA 1D TensorLike to sort.
tmpA 1D TensorLike that must be at least as large as subject and is used for temporary storage.

◆ warpBitonicSort()

template<typename T , int N>
__device__ sp::Vec< T, N > sp::warpBitonicSort ( sp::Vec< T, N >  x,
int  laneID 
)

Perform a bitonic sort across the warp.

Each thread contributes a Vec of N elements to the sort. Each thread is considered to be holding N adjacent elements of the input list (formed by concatenating the contributions from each thread in order of lane ID). After the sort, each thread gets back the corresponding elements of the sorted list, such that concatenating all of the returned Vecs in laneID order would produce the sorted list.

Parameters
xA vector of elements to contribute to the sort.
laneIDThe lane ID of the calling thread.