Speclib  0.1.2
The library for writing better CUDA libraries
Automatic Specialisation

The specialisation system. More...

Modules

 Specialisers
 Specialisation operators.
 

Classes

struct  sp::RunSpecialisedWrapper< Function >
 A functor to wrap the call to Function::run(). More...
 
class  sp::Specialiser
 The interface for a Specialiser. More...
 

Typedefs

using sp::DefaultSpecialisers = sp::TypeList< VariableBindingSpecialiser, ScalarSpecialiser, TensorSpecialiser, BroadcastingTensorSpecialiser, OutputExprSpecialiser, TensorExprSpecialiser, VariantScalarSpecialiser, VariantOutputPtrSpecialiser, PtrScalarSpecialiser<>, TensorDescriptorSpecialiser >
 The default list of Specialiser types to use: all the ones that are valid in all cases. More...
 

Functions

template<typename... OutArgs, typename Functor , typename... Specialisers, typename InArg , typename... InArgs>
auto sp::specialiseNextArgument (Functor &fn, OutArgs &&... outArgs, sp::TypeList< Specialisers... > specialisers, InArg &&inArg, InArgs &&... args)
 
template<typename OutArgs , typename NewArg , typename Functor , typename... Specialisers, typename... Args>
auto sp::specialiseMore (NewArg &&newArg, Functor &fn, OutArgs &&outArgs, sp::TypeList< Specialisers... > specialisers, Args &&... args)
 Called when the next argument has been finished with by the Specialisers. More...
 
template<typename InArg , typename... Specialisers>
constexpr int sp::findAcceptingSpecialiser ()
 
template<typename OutArgs , typename Functor , typename... Specialisers>
auto sp::specialiseNextArgument (Functor &fn, OutArgs outArgs, sp::TypeList< Specialisers... >)
 
template<typename OutArgs , typename Functor , typename... Specialisers, typename InArg , typename... InArgs>
auto sp::specialiseNextArgument (Functor &fn, OutArgs outArgs, sp::TypeList< Specialisers... > specialisers, InArg &&inArg, InArgs &&... args)
 
template<typename Functor , typename... Specialisers, typename... Args>
auto sp::startSpecialisation (Functor &fn, sp::TypeList< Specialisers... > specialisers, Args &&... args)
 
template<typename Functor , typename... Args>
auto sp::runSpecialised (Functor &fn, Args &&... args)
 
template<typename Functor , typename... Args>
auto sp::runSpecialisedLambda (Functor fn, Args &&... args)
 Do a specialised call to a lambda. More...
 
template<typename Function , typename... Specs, typename... Args>
auto sp::runWithSpecialisers (sp::TypeList< Specs... > specialisers, Args &&... args)
 Perform a specialising call to static function Function::run(), with the given specialisers. More...
 
template<typename Function , typename... Args>
auto sp::runSpecialised (Args &&... args)
 Perform a specialising call to static function Function::run(), with the default specialisers. More...
 

Detailed Description

The specialisation system.

CUDA kernels frequently come in multiple versions, specialised for different special cases (a constant is zero? the input is aligned? the input is small? etc.). It's common for the CPU to check these optimisable conditions itself and pick an appropriate CUDA kernel for the situation: a small amount of CPU-side branching can save a lot of GPU time.

C++ templates allow us to perform that kind of specialisation, but the resulting code gets ugly fast:

// If a is 1, we don't want to bother doing the multiply, so we use a template
// parameter to cause two versions of our kernel to be generated and pick the right one
// on the host.
template<bool AIsOne>
__global__ void axpy(sp::Vector<float> Y,
                     sp::Vector<float> X,
                     float a,
                     int length) {
    // Process one element per thread.
    int offset = blockIdx.x * blockDim.x + threadIdx.x;
    if (offset >= length) {
        return;
    }

    // Since AIsOne is a compile-time constant, the conditional will always optimise away and - in
    // the true case - multiplication by one will be constant-propagated below, removing the
    // multiply entirely.
    float multiplier;
    if (AIsOne) {
        multiplier = 1;
    } else {
        multiplier = a;
    }

    Y[offset] += multiplier * X[offset];
}

void runAxpy(sp::Vector<float> Y,
             sp::Vector<float> X,
             float a,
             int N) {
    // Spawn enough threads...
    constexpr int BLOCK_SIZE = 256;
    int neededBlocks = sp::divRoundUp(N, BLOCK_SIZE);

    if (a == 1) {
        // This kernel will be optimised to not have the multiply in it.
        axpy<true><<<neededBlocks, BLOCK_SIZE>>>(Y, X, a, N);
    } else {
        // This one is the general case.
        axpy<false><<<neededBlocks, BLOCK_SIZE>>>(Y, X, a, N);
    }
}

The above example - although a simple case - shows how quickly this gets annoying.

Speclib provides a mechanism to take care of the details, allowing the above to be rephrased as:

struct AxpyKernel : sp::SpecialisableKernel {
    // Indicate which special scalar values are of interest to this specialisation target.
    using ScalarsToSpecialise = sp::int_sequence<1, 0>;

    template<typename AKind>
    __global__ void kernel(sp::Vector<float> Y,
                           sp::Vector<float> X,
                           AKind a,
                           int length) {
        // Process one element per thread.
        int offset = blockIdx.x * blockDim.x + threadIdx.x;
        if (offset >= length) {
            return;
        }

        // Scalar converts implicitly to `float` here, allowing it to be used in arithmetic.
        // The opaque `AKind` template parameter - inferred from the scalar type provided by
        // the generated specialisation branch tree - encodes what is known at compile-time
        // about the Scalar.
        // If the Kind indicates the value is constexpr-known, the type conversion operator
        // to float is defined to be a compile-time constant, allowing the multiplication to
        // constant-propagate.
        //
        // This kernel will optimise in the same way as the earlier example, but the nastiness
        // is hidden, and you don't have to have exponentially-growing source code to cope with
        // exponentially-growing specialisation endpoints.
        Y[offset] += a * X[offset];
    }
};

void runAxpy(sp::Vector<float> Y,
             sp::Vector<float> X,
             float a,
             int N) {
    // Spawn enough threads...
    constexpr int BLOCK_SIZE = 256;
    int neededBlocks = sp::divRoundUp(N, BLOCK_SIZE);

    sp::LaunchConfiguration lc{neededBlocks, BLOCK_SIZE};
    sp::runSpecialisedKernel<AxpyKernel>(lc, Y, X, a, N);
}

The above example will optimise for the case where a == 0, too, and we can add arbitrary "values of interest" by adding them to the ScalarsToSpecialise integer sequence.

sp::runSpecialisedKernel generates a compile-time branch tree to select the optimal combination of arguments to pass to the target kernel function template.

This generates exponentially-many specialisations in the number of arguments, as it has to enumerate all combinations, so specialisation targets can provide compile-time functions that tell the specialiser which ones to consider, avoiding expensive generation of uninteresting or unused cases.

You can also define new transformations for the specialiser to perform, or run any subset of the existing specialisers on a per-call basis. The built-in set of specialisers mostly relate to optimising Tensor flags for packedness, alignment, __restrict__-ness, etc.

Typedef Documentation

◆ DefaultSpecialisers

Function Documentation

◆ runSpecialised()

template<typename Function , typename... Args>
auto sp::runSpecialised ( Args &&...  args)

Perform a specialising call to static function Function::run(), with the default specialisers.

Template Parameters
Function	The type to call run() on.
Parameters
args	Forwarding references to the arguments to pass to Function::run().
Returns
The value returned by Function::run(args...).

◆ runSpecialisedLambda()

template<typename Functor , typename... Args>
auto sp::runSpecialisedLambda ( Functor  fn,
Args &&...  args 
)

Do a specialised call to a lambda.

◆ runWithSpecialisers()

template<typename Function , typename... Specs, typename... Args>
auto sp::runWithSpecialisers ( sp::TypeList< Specs... >  specialisers,
Args &&...  args 
)

Perform a specialising call to static function Function::run(), with the given specialisers.

◆ specialiseMore()

template<typename OutArgs , typename NewArg , typename Functor , typename... Specialisers, typename... Args>
auto sp::specialiseMore ( NewArg &&  newArg,
Functor &  fn,
OutArgs &&  outArgs,
sp::TypeList< Specialisers... >  specialisers,
Args &&...  args 
)

Called when the next argument has been finished with by the Specialisers.

The continuation function to be called by a specialiser when it's finished specialising.

This makes it easier to write Specialisers: they don't have to keep track of the two argument packs being assembled by specialiseNextArgument. The trade-off is that this function receives the new argument first, followed by a blind forwarding of everything else that was passed into the Specialiser (which is just the arguments specialiseNextArgument had at that point).

Note that an argument was already removed from args when the Specialiser was called, so we need not remove it now.

Parameters
newArg	The argument, after specialisation. Append this to outArgs and call specialiseNextArgument again.
fn	The target functor.
outArgs	The same outArgs that specialiseNextArgument received right before it called the Specialiser that just called this function.
specialisers	The Specialisers being applied.
args	The not-yet-specialised arguments.

Most of the arguments are just forwarded blindly by the specialiser, with its output (the newly-specialised argument) passed as the first.