Speclib

Introduction

A library for writing high-performance, cross-platform GPU programs. It works particularly well with libSCALE.

Speclib provides a wide range of features, from low-level abstractions of mathematical objects to complete CUDA kernels.

The highest-level APIs allow you to describe the desired calculation using a DSL based on C++ templates; a specialised kernel is then generated for it via metaprogramming.
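
As a loose sketch of the underlying technique (the names below are invented for illustration and are not speclib’s actual API), the calculation can be carried entirely in a type, and instantiating a kernel template for that type yields a fused, specialised kernel:

    struct Arg {
        const float* p;
        __device__ float operator[](int i) const { return p[i]; }
    };

    template <class A, class B>
    struct Add {
        A a; B b;
        __device__ float operator[](int i) const { return a[i] + b[i]; }
    };

    template <class A>
    struct Scale {
        A a; float s;
        __device__ float operator[](int i) const { return a[i] * s; }
    };

    template <class Expr>
    __global__ void evaluate(Expr e, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = e[i];  // the whole expression runs in one pass
    }

    // Computes out = x * 2 + y with a single generated kernel:
    // evaluate<<<blocks, threads>>>(Add<Scale<Arg>, Arg>{{{x}, 2.0f}, {y}}, out, n);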

This document provides a very brief introduction to some of speclib’s headline features, but readers are encouraged to peruse the reference manual to get a more complete picture of what is available.

Many features (including all the mathematical primitives) also work on the CPU. Many features also work at compile time, including quite high-level things like Vec or StaticTensor, allowing you to use the full expressive power of the library even in metaprograms (and to reuse the same code for both runtime and compile-time execution).
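
As a minimal sketch of the compile-time side (an illustrative stand-in, not speclib’s actual Vec), the same arithmetic can execute at runtime or inside a constant expression:

    #include <array>

    template <typename T, int N>
    struct Vec {
        std::array<T, N> v{};
        constexpr T dot(const Vec& o) const {
            T s{};
            for (int i = 0; i < N; ++i) s += v[i] * o.v[i];
            return s;
        }
    };

    // Evaluated entirely at compile time:
    static_assert(Vec<int, 3>{{1, 2, 3}}.dot(Vec<int, 3>{{4, 5, 6}}) == 32, "");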


Mathematical primitives

Vec

An efficient, fluent representation for vectors of elements of the same type. It has similar goals to CUDA’s float4 and friends, but with a far more usable API (including operators), better optimisation, and constexpr compatibility.
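
To illustrate what the fluent API buys (the Vec spelling below is indicative; see the reference manual for the exact interface), compare a blend written against raw float4 with the same update written with operators:

    // Plain CUDA: float4 has no arithmetic operators, so every lane is
    // spelled out by hand.
    __device__ float4 lerp4(float4 a, float4 b, float t) {
        return make_float4(a.x + (b.x - a.x) * t,
                           a.y + (b.y - a.y) * t,
                           a.z + (b.z - a.z) * t,
                           a.w + (b.w - a.w) * t);
    }

    // With a Vec-style type, the same update is one expression
    // (indicative spelling):
    //
    //     __device__ Vec<float, 4> lerp4(Vec<float, 4> a, Vec<float, 4> b, float t) {
    //         return a + (b - a) * t;
    //     }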

Complex

A complex number representation allowing the use of half-float (or any other numeric type) complex numbers on the GPU. The API is almost identical to std::complex, making it far more usable than CUDA’s cuComplex/cuDoubleComplex. Moreover, the class is implemented in a composable way, supporting quaternions and octonions automatically.
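
The composability point can be sketched in plain C++ (an illustrative reconstruction, not speclib’s implementation): with the Cayley-Dickson construction, a complex type whose element type may itself be complex yields quaternions and octonions with no additional code:

    // Base case for the recursion: conjugating a real number is a no-op.
    inline float conj(float x) { return x; }

    template <typename T>
    struct Complex {
        T re{}, im{};
        friend Complex operator+(const Complex& a, const Complex& b) {
            return {a.re + b.re, a.im + b.im};
        }
        friend Complex operator-(const Complex& a, const Complex& b) {
            return {a.re - b.re, a.im - b.im};
        }
        // Cayley-Dickson product: (a,b)(c,d) = (ac - conj(d)b, da + b conj(c))
        friend Complex operator*(const Complex& a, const Complex& b) {
            return {a.re * b.re - conj(b.im) * a.im,
                    b.im * a.re + a.im * conj(b.re)};
        }
        friend Complex conj(const Complex& c) { return {conj(c.re), T{} - c.im}; }
    };

    using Quaternion = Complex<Complex<float>>;  // non-commutative, for free
    using Octonion   = Complex<Quaternion>;      // non-associative, for free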

Arbitrary-precision integers

Integers of any size, written as __uint<N> or __int<N>. Since N can be a template parameter, this allows you to use metaprogramming to select an integer size.
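
For example, a metaprogram can derive the width from a compile-time bound (SmallestUInt is our own illustrative helper; __uint itself is the speclib type):

    #include <bit>
    #include <cstdint>

    // The narrowest unsigned integer that can represent values up to Max.
    template <std::uint64_t Max>
    using SmallestUInt = __uint<std::bit_width(Max)>;

    // SmallestUInt<255>  is __uint<8>
    // SmallestUInt<1000> is __uint<10>  (widths need not be powers of two)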

TensorLike primitives

An abstraction for any object that can be accessed as if it were a multi-dimensional array of some type and dimensionality. Effectively a C++ Concept.

This allows the rest of speclib to be written to accept any TensorLike.

Thanks to template metaprogramming, when another speclib API is called with a new kind of TensorLike, a specialised version of the target function is generated and optimised for that type of input.

For example: a TensorLike representing “the contents of that region of memory, multiplied by six” can be constructed and passed to speclib’s prefix-sum function. The multiplication by six will be fused onto the generated prefix-sum kernel, eliminating the need for a separate launch and memory scan to perform that step.
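
A host-side sketch of that example (illustrative types and names; speclib’s real TensorLike machinery and prefix-sum entry point differ):

    #include <cstddef>
    #include <vector>

    // A TensorLike-style view: reads as "source, multiplied by a factor".
    struct ScaledView {
        const std::vector<float>& src;
        float factor;
        float operator[](std::size_t i) const { return src[i] * factor; }
    };

    // A scan templated on its input type: instantiating it for ScaledView
    // fuses the multiply into the scan's own read of each element.
    template <typename In>
    std::vector<float> inclusiveScan(const In& in, std::size_t n) {
        std::vector<float> out(n);
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i) out[i] = acc += in[i];
        return out;
    }

    // auto sums = inclusiveScan(ScaledView{data, 6.0f}, data.size());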

Scalar expression templates

Conceptually similar to TensorExpr, but for scalars (in fact, TensorExpr is built atop scalar expression templates).

These are mostly useful when you want to pass a scalar function around as a (readable) type.

These may be autodifferentiated at compile time.
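
A minimal sketch of the technique (illustrative types, not speclib’s): the expression is a type, and d/dx is a type function, so differentiation happens entirely at compile time:

    struct X { static constexpr double eval(double x) { return x; } };

    template <int C>
    struct Const { static constexpr double eval(double) { return C; } };

    template <class A, class B>
    struct Add { static constexpr double eval(double x) { return A::eval(x) + B::eval(x); } };

    template <class A, class B>
    struct Mul { static constexpr double eval(double x) { return A::eval(x) * B::eval(x); } };

    // d/dx as a type function.
    template <class E> struct D;
    template <> struct D<X> { using type = Const<1>; };
    template <int C> struct D<Const<C>> { using type = Const<0>; };
    template <class A, class B>
    struct D<Add<A, B>> { using type = Add<typename D<A>::type, typename D<B>::type>; };
    template <class A, class B>
    struct D<Mul<A, B>> {  // product rule
        using type = Add<Mul<typename D<A>::type, B>, Mul<A, typename D<B>::type>>;
    };

    static_assert(D<Mul<X, X>>::type::eval(3.0) == 6.0, "d/dx x^2 at x = 3");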

Math functions

Data structures

Debugging tools

Metaprogramming

Speclib provides groundbreaking C++ metaprogramming tools.

These facilities are sufficiently powerful that we used them to implement the regex compiler in SpecRegex. These features open the door to exciting future work, such as a constexpr compute-graph optimiser for kernels written entirely in C++.

Object-oriented kernels

Speclib provides a system for defining CUDA kernels as objects. This grants several advantages.

This style also allows kernels to inherit from one another, making it even easier to compose kernels without extra launches.

Some higher-level base classes are also provided for common iterator patterns, eliminating the need to write some boilerplate CUDA.
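
The core of the technique can be sketched in a few lines (speclib’s own base classes and launch helpers are richer than this):

    // A kernel is an ordinary object with a __device__ call operator;
    // a generic trampoline launches any such object. Parameters travel
    // as members, and kernels can inherit from one another.
    template <typename Kernel>
    __global__ void trampoline(Kernel k) { k(); }

    struct Fill {
        float* out;
        float value;
        int n;
        __device__ void operator()() const {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = value;
        }
    };

    // trampoline<<<blocks, threads>>>(Fill{devPtr, 1.0f, n});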

Explicit block/grid assumptions

A mechanism to inform the compiler about unused (or constant) elements of blockDim, gridDim, etc., so your code can be written in a generic fashion without paying for redundant index calculations.
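
The effect can be spelled out by hand in plain CUDA (speclib wraps this in a declarative mechanism; __builtin_assume is used here only to make the idea concrete):

    // Fully generic 3D index flattening.
    __device__ unsigned flatThreadIndex() {
        unsigned block  = (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x;
        unsigned thread = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
        return block * (blockDim.x * blockDim.y * blockDim.z) + thread;
    }

    __global__ void axpy(const float* x, float* y, float a, unsigned n) {
        // Promising a 1D launch lets the compiler fold every y/z term in
        // flatThreadIndex() down to nothing.
        __builtin_assume(gridDim.y == 1 && gridDim.z == 1);
        __builtin_assume(blockIdx.y == 0 && blockIdx.z == 0);
        __builtin_assume(blockDim.y == 1 && blockDim.z == 1);
        __builtin_assume(threadIdx.y == 0 && threadIdx.z == 0);
        unsigned i = flatThreadIndex();
        if (i < n) y[i] += a * x[i];
    }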

Automatic specialisation

Frequently, it is useful to have specialised versions of functions (or CUDA kernels) for degenerate cases: when a constant is zero, when an input size is a multiple of 4, or in any other situation that allows a more efficient code path.

Speclib provides a modular, reusable, perfectly general mechanism for defining “specialisable” functions and rules to select the right code path. This allows you to generate the necessary branch tree for any specialisable kernel without writing it out explicitly.
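
Written out by hand, the pattern being automated looks like this (an illustrative axpby with a beta-is-zero fast path; the names are ours):

    // A compile-time flag selects the degenerate-case code path.
    template <bool BetaIsZero>
    __global__ void axpby(const float* x, float* y, float alpha, float beta, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if constexpr (BetaIsZero)
            y[i] = alpha * x[i];                 // fast path: y is never read
        else
            y[i] = alpha * x[i] + beta * y[i];
    }

    // The hand-written branch tree that speclib would generate for you:
    void launchAxpby(const float* x, float* y, float alpha, float beta, int n) {
        int blocks = (n + 255) / 256;
        if (beta == 0.0f) axpby<true><<<blocks, 256>>>(x, y, alpha, beta, n);
        else              axpby<false><<<blocks, 256>>>(x, y, alpha, beta, n);
    }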

GPU functions

The following are provided both as standalone (auto-specialising) template kernels and as functions that carry out the operation within a block or warp.

These are function templates that consume TensorLike objects and, where applicable, operator types as inputs. This allows you to fuse other operations onto these routines simply by changing the kind of object you call them with.

For instance, this makes it much easier to implement “sort and multiply by 5”. The generated kernel would not incur an additional scan through the array (as would be the case if you simply called an element-wise kernel to follow the sort).
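
A host-side sketch of why no extra pass is needed (illustrative names; the real GPU implementation differs): the transform lives in the type of the output view, so it is applied as the sort writes its results:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // A TensorLike-style output view whose store multiplies by a factor.
    struct ScaledStore {
        std::vector<float>& out;
        float factor;
        void set(std::size_t i, float v) const { out[i] = v * factor; }
    };

    // "Sort and multiply by 5" applies the multiply on the sort's own
    // output writes; no follow-up element-wise pass over the array.
    template <typename OutView>
    void sortInto(std::vector<float> v, const OutView& dst) {
        std::sort(v.begin(), v.end());
        for (std::size_t i = 0; i < v.size(); ++i) dst.set(i, v[i]);
    }

    // sortInto(data, ScaledStore{result, 5.0f});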

Sorting

Speclib provides sorting routines that significantly outperform those of Thrust, especially when the input data is non-uniform (by approximately a factor of 2 for 64-bit sort keys):

[Benchmark plot: sort64]

Prefix scan

Speclib’s prefix scan functions are faster than Thrust’s up to input sizes of approximately 1.8 million elements:

[Benchmark plot: prefix32]

Reduction

Both value (e.g. max) and index (e.g. argmax) reduction operations are supported, allowing operations such as “find the index of the largest element” or “add all the elements together”.
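
To make the value/index distinction concrete, here is a warp-level argmax in plain CUDA (speclib packages operations like this as ready-made warp/block functions and as complete kernels):

    // After the loop, lane 0 holds the maximum value and its index.
    __device__ void warpArgmax(float v, int idx, float& maxVal, int& maxIdx) {
        for (int offset = 16; offset > 0; offset >>= 1) {
            float otherV = __shfl_down_sync(0xffffffffu, v, offset);
            int   otherI = __shfl_down_sync(0xffffffffu, idx, offset);
            if (otherV > v) { v = otherV; idx = otherI; }
        }
        maxVal = v;   // valid in lane 0
        maxIdx = idx;
    }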

TODO: Hook this up to the reference testing harness. It’s currently only tested via the BLAS referencer, preventing a direct Thrust comparison (but we will likely win)!

IDE support

Speclib adds IDE indexing support on top of CLion’s built-in support by pretending to be the Clang CUDA preprocessor and including several clang headers.