Speclib  0.1.2
The library for writing better CUDA libraries
Automatic Specialisation

The specialisation system. More...

Modules

 Specialisers
 Specialisation operators.
 

Classes

struct  sp::RunSpecialisedWrapper< Function >
 A functor to wrap the call to Function::run(). More...
 
class  sp::Specialiser
 The interface for a Specialiser. More...
 

Typedefs

using sp::DefaultSpecialisers = sp::TypeList< VariableBindingSpecialiser, ScalarSpecialiser, TensorSpecialiser, BroadcastingTensorSpecialiser, OutputExprSpecialiser, TensorExprSpecialiser, VariantScalarSpecialiser, VariantOutputPtrSpecialiser, PtrScalarSpecialiser<>, TensorDescriptorSpecialiser >
 The default list of Specialiser types to use: all the ones that are valid in all cases. More...
 

Functions

template<typename... OutArgs, typename Functor , typename... Specialisers, typename InArg , typename... InArgs>
auto sp::specialiseNextArgument (Functor &fn, OutArgs &&... outArgs, sp::TypeList< Specialisers... > specialisers, InArg &&inArg, InArgs &&... args)
 
template<typename OutArgs , typename NewArg , typename Functor , typename... Specialisers, typename... Args>
auto sp::specialiseMore (NewArg &&newArg, Functor &fn, OutArgs &&outArgs, sp::TypeList< Specialisers... > specialisers, Args &&... args)
 Called when the next argument has been finished with by the Specialisers. More...
 
template<typename InArg , typename... Specialisers>
constexpr int sp::findAcceptingSpecialiser ()
 
template<typename OutArgs , typename Functor , typename... Specialisers>
auto sp::specialiseNextArgument (Functor &fn, OutArgs outArgs, sp::TypeList< Specialisers... >)
 
template<typename OutArgs , typename Functor , typename... Specialisers, typename InArg , typename... InArgs>
auto sp::specialiseNextArgument (Functor &fn, OutArgs outArgs, sp::TypeList< Specialisers... > specialisers, InArg &&inArg, InArgs &&... args)
 
template<typename Functor , typename... Specialisers, typename... Args>
auto sp::startSpecialisation (Functor &fn, sp::TypeList< Specialisers... > specialisers, Args &&... args)
 
template<typename Functor , typename... Args>
auto sp::runSpecialised (Functor &fn, Args &&... args)
 
template<typename Functor , typename... Args>
auto sp::runSpecialisedLambda (Functor fn, Args &&... args)
 Do a specialised call to a lambda. More...
 
template<typename Function , typename... Specs, typename... Args>
auto sp::runWithSpecialisers (sp::TypeList< Specs... > specialisers, Args &&... args)
 Perform a specialising call to static function Function::run(), with the given specialisers. More...
 
template<typename Function , typename... Args>
auto sp::runSpecialised (Args &&... args)
 Perform a specialising call to static function Function::run(), with the default specialisers. More...
 

Detailed Description

The specialisation system.

CUDA kernels frequently come in multiple versions, specialised for different special cases (a constant is zero? the input is aligned? the input is small? etc.). It's common for the CPU to check these optimisable conditions itself and pick an appropriate CUDA kernel for the situation: a small amount of CPU-side branching can save a lot of GPU time.

C++ templates allow us to perform that kind of specialisation, but the resulting code gets ugly fast:

// If a is 1, we don't want to bother doing the multiply, so we use a template
// parameter to cause two versions of our kernel to be generated and pick the right one
// on the host.
template<bool AIsOne>
__global__ void axpy(sp::Vector<float> Y,
                     sp::Vector<float> X,
                     float a,
                     int length) {
    // Process one element per thread.
    int offset = blockIdx.x * blockDim.x + threadIdx.x;
    if (offset >= length) {
        return;
    }

    // Since AIsOne is a compile-time constant, the conditional will always optimise away and - in
    // the true case - multiplication by one will be constant-propagated below, removing the
    // multiply entirely.
    float multiplier;
    if (AIsOne) {
        multiplier = 1;
    } else {
        multiplier = a;
    }

    Y[offset] += multiplier * X[offset];
}

void runAxpy(sp::Vector<float> Y,
             sp::Vector<float> X,
             float a,
             int N) {
    // Spawn enough threads...
    constexpr int BLOCK_SIZE = 256;
    int neededBlocks = sp::divRoundUp(N, BLOCK_SIZE);

    if (a == 1) {
        // This kernel will be optimised to not have the multiply in it.
        axpy<true><<<neededBlocks, BLOCK_SIZE>>>(Y, X, a, N);
    } else {
        // This one is the general case.
        axpy<false><<<neededBlocks, BLOCK_SIZE>>>(Y, X, a, N);
    }
}

The above example - although a simple case - shows how quickly this gets annoying.

Speclib provides a mechanism to take care of the details, allowing the above to be rephrased as:

struct AxpyKernel : sp::SpecialisableKernel {
    // Indicate which special scalar values are of interest to this specialisation target.
    using ScalarsToSpecialise = sp::int_sequence<1, 0>;

    template<typename AKind>
    __global__ void kernel(sp::Vector<float> Y,
                           sp::Vector<float> X,
                           AKind a,
                           int length) {
        // Process one element per thread.
        int offset = blockIdx.x * blockDim.x + threadIdx.x;
        if (offset >= length) {
            return;
        }

        // Scalar converts implicitly to `float` here, allowing it to be used in arithmetic.
        // The opaque `AKind` template parameter - inferred from the scalar type provided by
        // the generated specialisation branch tree - encodes what is known at compile-time
        // about the Scalar.
        // If the Kind indicates the value is constexpr-known, the type conversion operator
        // to float is defined to be a compile-time constant, allowing the multiplication to
        // constant-propagate.
        //
        // This kernel will optimise in the same way as the earlier example, but the nastiness
        // is hidden, and you don't have to have exponentially-growing source code to cope with
        // exponentially-growing specialisation endpoints.
        Y[offset] += a * X[offset];
    }
};

void runAxpy(sp::Vector<float> Y,
             sp::Vector<float> X,
             float a,
             int N) {
    // Spawn enough threads...
    constexpr int BLOCK_SIZE = 256;
    int neededBlocks = sp::divRoundUp(N, BLOCK_SIZE);

    sp::LaunchConfiguration lc{neededBlocks, BLOCK_SIZE};
    sp::runSpecialisedKernel<AxpyKernel>(lc, Y, X, a, N);
}

The above example will optimise for the case where a == 0, too, and we can add arbitrary "values of interest" by adding them to the ScalarsToSpecialise integer sequence.

sp::runSpecialisedKernel generates a compile-time branch tree to select the optimal combination of arguments to pass to the target kernel function template.

This generates exponentially-many specialisations in the number of arguments, as it has to enumerate all combinations, so specialisation targets can provide compile-time functions that tell the specialiser which ones to consider, avoiding expensive generation of uninteresting or unused cases.

You can also define new transformations for the specialiser to perform, or run any subset of the existing specialisers on a per-call basis. The built-in set of specialisers mostly relate to optimising Tensor flags for packedness, alignment, __restrict__-ness, etc.

Typedef Documentation

◆ DefaultSpecialisers

Function Documentation

◆ runSpecialised()

template<typename Function , typename... Args>
auto sp::runSpecialised ( Args &&...  args)

Perform a specialising call to static function Function::run(), with the default specialisers.

Template Parameters
Function	The type to call run() on.
Parameters
args	Forwarding references to the arguments to pass to Function::run().
Returns
The value returned by Function::run(args...).

◆ runSpecialisedLambda()

template<typename Functor , typename... Args>
auto sp::runSpecialisedLambda ( Functor  fn,
Args &&...  args 
)

Do a specialised call to a lambda.

◆ runWithSpecialisers()

template<typename Function , typename... Specs, typename... Args>
auto sp::runWithSpecialisers ( sp::TypeList< Specs... >  specialisers,
Args &&...  args 
)

Perform a specialising call to static function Function::run(), with the given specialisers.

◆ specialiseMore()

template<typename OutArgs , typename NewArg , typename Functor , typename... Specialisers, typename... Args>
auto sp::specialiseMore ( NewArg &&  newArg,
Functor &  fn,
OutArgs &&  outArgs,
sp::TypeList< Specialisers... >  specialisers,
Args &&...  args 
)

Called when the next argument has been finished with by the Specialisers.

The continuation function to be called by a specialiser when it's finished specialising.

This makes it easier to write Specialisers: they don't have to keep track of the two argument packs being assembled by specialiseNextArgument. The trade-off is that this function receives the new argument first, followed by a blind forwarding of everything else that was passed into the Specialiser (which is just the arguments specialiseNextArgument had at that point).

Note that an argument was already removed from args when the Specialiser was called, so we need not remove it now.

Parameters
newArg	The argument, after specialisation. Append this to outArgs and call specialiseNextArgument again.
fn	The target functor.
outArgs	The same outArgs that specialiseNextArgument received right before it called the Specialiser that just called this function.
specialisers	The Specialisers being applied.
args	The not-yet-specialised arguments.

Most of the arguments are just forwarded blindly by the specialiser, with its output (the newly-specialised argument) passed as the first.