Speclib

Introduction

A library for writing high-performance, cross-platform GPU programs. It works particularly well with libSCALE.

Speclib provides a wide range of features, from low-level abstractions of mathematical objects to complete CUDA kernels.

The highest-level APIs allow you to describe the desired calculation using a DSL based on C++ templates; a specialised kernel is then generated for it via metaprogramming.
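
As a loose sketch of the underlying technique (the names below are invented for illustration and are not speclib’s actual API), the calculation can be carried entirely in a type, and instantiating a kernel template for that type yields a fused, specialised kernel:

    struct Arg {
        const float* p;
        __device__ float operator[](int i) const { return p[i]; }
    };

    template <class A, class B>
    struct Add {
        A a; B b;
        __device__ float operator[](int i) const { return a[i] + b[i]; }
    };

    template <class A>
    struct Scale {
        A a; float s;
        __device__ float operator[](int i) const { return a[i] * s; }
    };

    template <class Expr>
    __global__ void evaluate(Expr e, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = e[i];  // the whole expression runs in one pass
    }

    // Computes out = x * 2 + y with a single generated kernel:
    // evaluate<<<blocks, threads>>>(Add<Scale<Arg>, Arg>{{{x}, 2.0f}, {y}}, out, n);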

This document provides a very brief introduction to some of speclib’s headline features, but readers are encouraged to peruse the reference manual to get a more complete picture of what is available.

Many features (including all the mathematical primitives) also work on the CPU. Many features also work at compile time, including quite high-level things like Vec or StaticTensor, allowing you to use the full expressive power of the library even in metaprograms (and to reuse the same code for both runtime and compile-time execution).
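
As a minimal sketch of the compile-time side (an illustrative stand-in, not speclib’s actual Vec), the same arithmetic can execute at runtime or inside a constant expression:

    #include <array>

    template <typename T, int N>
    struct Vec {
        std::array<T, N> v{};
        constexpr T dot(const Vec& o) const {
            T s{};
            for (int i = 0; i < N; ++i) s += v[i] * o.v[i];
            return s;
        }
    };

    // Evaluated entirely at compile time:
    static_assert(Vec<int, 3>{{1, 2, 3}}.dot(Vec<int, 3>{{4, 5, 6}}) == 32, "");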


Mathematical primitives

Vec

An efficient, fluent representation for vectors of elements of the same type. It has similar goals to CUDA’s float4 and friends, but with a far more usable API (including operators), better optimisation, and constexpr compatibility.
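
To illustrate what the fluent API buys (the Vec spelling below is indicative; see the reference manual for the exact interface), compare a blend written against raw float4 with the same update written with operators:

    // Plain CUDA: float4 has no arithmetic operators, so every lane is
    // spelled out by hand.
    __device__ float4 lerp4(float4 a, float4 b, float t) {
        return make_float4(a.x + (b.x - a.x) * t,
                           a.y + (b.y - a.y) * t,
                           a.z + (b.z - a.z) * t,
                           a.w + (b.w - a.w) * t);
    }

    // With a Vec-style type, the same update is one expression
    // (indicative spelling):
    //
    //     __device__ Vec<float, 4> lerp4(Vec<float, 4> a, Vec<float, 4> b, float t) {
    //         return a + (b - a) * t;
    //     }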

Complex

A complex number representation allowing the use of half-float (or any other numeric type) complex numbers on the GPU. The API is almost identical to std::complex, making it far more usable than CUDA’s cuComplex/cuDoubleComplex. Moreover, the class is implemented in a composable way, supporting quaternions and octonions automatically.
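
The composability point can be sketched in plain C++ (an illustrative reconstruction, not speclib’s implementation): with the Cayley-Dickson construction, a complex type whose element type may itself be complex yields quaternions and octonions with no additional code:

    // Base case for the recursion: conjugating a real number is a no-op.
    inline float conj(float x) { return x; }

    template <typename T>
    struct Complex {
        T re{}, im{};
        friend Complex operator+(const Complex& a, const Complex& b) {
            return {a.re + b.re, a.im + b.im};
        }
        friend Complex operator-(const Complex& a, const Complex& b) {
            return {a.re - b.re, a.im - b.im};
        }
        // Cayley-Dickson product: (a,b)(c,d) = (ac - conj(d)b, da + b conj(c))
        friend Complex operator*(const Complex& a, const Complex& b) {
            return {a.re * b.re - conj(b.im) * a.im,
                    b.im * a.re + a.im * conj(b.re)};
        }
        friend Complex conj(const Complex& c) { return {conj(c.re), T{} - c.im}; }
    };

    using Quaternion = Complex<Complex<float>>;  // non-commutative, for free
    using Octonion   = Complex<Quaternion>;      // non-associative, for free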

Arbitrary-precision integers

Integers of any size, written as __uint<N> or __int<N>. Since N can be a template parameter, this allows you to use metaprogramming to select an integer size.
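
For example, a metaprogram can derive the width from a compile-time bound (SmallestUInt is our own illustrative helper; __uint itself is the speclib type):

    #include <bit>
    #include <cstdint>

    // The narrowest unsigned integer that can represent values up to Max.
    template <std::uint64_t Max>
    using SmallestUInt = __uint<std::bit_width(Max)>;

    // SmallestUInt<255>  is __uint<8>
    // SmallestUInt<1000> is __uint<10>  (widths need not be powers of two)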

TensorLike primitives

An abstraction for any object that can be accessed as if it were a multi-dimensional array of some type and dimensionality. Effectively a C++ Concept.

This allows the rest of speclib to be written to accept any TensorLike.

Thanks to template metaprogramming, when another speclib API is called with a new kind of TensorLike, a specialised version of the target function is generated and optimised for that type of input.

For example: a TensorLike representing “the contents of that region of memory, multiplied by six” can be constructed and passed to speclib’s prefix-sum function. The multiplication by six will be fused onto the generated prefix-sum kernel, eliminating the need for a separate launch and memory scan to perform that step.
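
A host-side sketch of that example (illustrative types and names; speclib’s real TensorLike machinery and prefix-sum entry point differ):

    #include <cstddef>
    #include <vector>

    // A TensorLike-style view: reads as "source, multiplied by a factor".
    struct ScaledView {
        const std::vector<float>& src;
        float factor;
        float operator[](std::size_t i) const { return src[i] * factor; }
    };

    // A scan templated on its input type: instantiating it for ScaledView
    // fuses the multiply into the scan's own read of each element.
    template <typename In>
    std::vector<float> inclusiveScan(const In& in, std::size_t n) {
        std::vector<float> out(n);
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i) out[i] = acc += in[i];
        return out;
    }

    // auto sums = inclusiveScan(ScaledView{data, 6.0f}, data.size());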

Scalar expression templates

Conceptually similar to TensorExpr, but for scalars (in fact, TensorExpr is built atop scalar expression templates).

These are mostly useful when you want to pass a scalar function around as a (readable) type.

These may be autodifferentiated at compile time.
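
A minimal sketch of the technique (illustrative types, not speclib’s): the expression is a type, and d/dx is a type function, so differentiation happens entirely at compile time:

    struct X { static constexpr double eval(double x) { return x; } };

    template <int C>
    struct Const { static constexpr double eval(double) { return C; } };

    template <class A, class B>
    struct Add { static constexpr double eval(double x) { return A::eval(x) + B::eval(x); } };

    template <class A, class B>
    struct Mul { static constexpr double eval(double x) { return A::eval(x) * B::eval(x); } };

    // d/dx as a type function.
    template <class E> struct D;
    template <> struct D<X> { using type = Const<1>; };
    template <int C> struct D<Const<C>> { using type = Const<0>; };
    template <class A, class B>
    struct D<Add<A, B>> { using type = Add<typename D<A>::type, typename D<B>::type>; };
    template <class A, class B>
    struct D<Mul<A, B>> {  // product rule
        using type = Add<Mul<typename D<A>::type, B>, Mul<A, typename D<B>::type>>;
    };

    static_assert(D<Mul<X, X>>::type::eval(3.0) == 6.0, "d/dx x^2 at x = 3");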

Math functions

Data structures

Debugging tools

Metaprogramming

Speclib provides groundbreaking C++ metaprogramming tools.

These facilities are sufficiently powerful that we used them to implement the regex compiler in SpecRegex. These features open the door to exciting future work, such as a constexpr compute-graph optimiser for kernels written entirely in C++.

Object-oriented kernels

Speclib provides a system for defining CUDA kernels as objects. This grants several advantages.

This style also allows kernels to inherit from one another, making it even easier to compose kernels without extra launches.

Some higher-level base classes are also provided for common iterator patterns, eliminating the need to write some boilerplate CUDA.
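
The core of the technique can be sketched in a few lines (speclib’s own base classes and launch helpers are richer than this):

    // A kernel is an ordinary object with a __device__ call operator;
    // a generic trampoline launches any such object. Parameters travel
    // as members, and kernels can inherit from one another.
    template <typename Kernel>
    __global__ void trampoline(Kernel k) { k(); }

    struct Fill {
        float* out;
        float value;
        int n;
        __device__ void operator()() const {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = value;
        }
    };

    // trampoline<<<blocks, threads>>>(Fill{devPtr, 1.0f, n});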

Explicit block/grid assumptions

A mechanism to inform the compiler about unused (or constant) elements of blockDim, gridDim, etc., so your code can be written in a generic fashion without paying for redundant index calculations.
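
The effect can be spelled out by hand in plain CUDA (speclib wraps this in a declarative mechanism; __builtin_assume is used here only to make the idea concrete):

    // Fully generic 3D index flattening.
    __device__ unsigned flatThreadIndex() {
        unsigned block  = (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x;
        unsigned thread = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
        return block * (blockDim.x * blockDim.y * blockDim.z) + thread;
    }

    __global__ void axpy(const float* x, float* y, float a, unsigned n) {
        // Promising a 1D launch lets the compiler fold every y/z term in
        // flatThreadIndex() down to nothing.
        __builtin_assume(gridDim.y == 1 && gridDim.z == 1);
        __builtin_assume(blockIdx.y == 0 && blockIdx.z == 0);
        __builtin_assume(blockDim.y == 1 && blockDim.z == 1);
        __builtin_assume(threadIdx.y == 0 && threadIdx.z == 0);
        unsigned i = flatThreadIndex();
        if (i < n) y[i] += a * x[i];
    }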

Automatic specialisation

Frequently, it is useful to have specialised versions of functions (or CUDA kernels) for degenerate cases: when a constant is zero, when an input size is a multiple of 4, or in any other situation that allows a more efficient code path.

Speclib provides a modular, reusable, perfectly general mechanism for defining “specialisable” functions and rules to select the right code path. This allows you to generate the necessary branch tree for any specialisable kernel without writing it out explicitly.
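
Written out by hand, the pattern being automated looks like this (an illustrative axpby with a beta-is-zero fast path; the names are ours):

    // A compile-time flag selects the degenerate-case code path.
    template <bool BetaIsZero>
    __global__ void axpby(const float* x, float* y, float alpha, float beta, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if constexpr (BetaIsZero)
            y[i] = alpha * x[i];                 // fast path: y is never read
        else
            y[i] = alpha * x[i] + beta * y[i];
    }

    // The hand-written branch tree that speclib would generate for you:
    void launchAxpby(const float* x, float* y, float alpha, float beta, int n) {
        int blocks = (n + 255) / 256;
        if (beta == 0.0f) axpby<true><<<blocks, 256>>>(x, y, alpha, beta, n);
        else              axpby<false><<<blocks, 256>>>(x, y, alpha, beta, n);
    }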

GPU functions

The following are provided both as standalone (auto-specialising) template kernels and as functions that carry out the operation within a block or warp.

These are function templates that consume TensorLike objects and, where applicable, operator types as inputs. This allows you to fuse other operations onto these routines simply by changing the kind of object you call them with.

For instance, this makes it much easier to implement “sort and multiply by 5”. The generated kernel would not incur an additional scan through the array (as would be the case if you simply called an element-wise kernel to follow the sort).
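
A host-side sketch of why no extra pass is needed (illustrative names; the real GPU implementation differs): the transform lives in the type of the output view, so it is applied as the sort writes its results:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // A TensorLike-style output view whose store multiplies by a factor.
    struct ScaledStore {
        std::vector<float>& out;
        float factor;
        void set(std::size_t i, float v) const { out[i] = v * factor; }
    };

    // "Sort and multiply by 5" applies the multiply on the sort's own
    // output writes; no follow-up element-wise pass over the array.
    template <typename OutView>
    void sortInto(std::vector<float> v, const OutView& dst) {
        std::sort(v.begin(), v.end());
        for (std::size_t i = 0; i < v.size(); ++i) dst.set(i, v[i]);
    }

    // sortInto(data, ScaledStore{result, 5.0f});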

Sorting

Speclib provides sorting routines that significantly outperform those of Thrust, especially when the input data is non-uniform (by approximately a factor of 2 for 64-bit sort keys):

[Benchmark plot: sort64]

Prefix scan

Speclib’s prefix scan functions are faster than Thrust’s up to input sizes of approximately 1.8 million elements:

[Benchmark plot: prefix32]

Reduction

Both value (e.g. max) and index (e.g. argmax) reduction operations are supported, allowing operations such as “find the index of the largest element” or “add all the elements together”.
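
To make the value/index distinction concrete, here is a warp-level argmax in plain CUDA (speclib packages operations like this as ready-made warp/block functions and as complete kernels):

    // After the loop, lane 0 holds the maximum value and its index.
    __device__ void warpArgmax(float v, int idx, float& maxVal, int& maxIdx) {
        for (int offset = 16; offset > 0; offset >>= 1) {
            float otherV = __shfl_down_sync(0xffffffffu, v, offset);
            int   otherI = __shfl_down_sync(0xffffffffu, idx, offset);
            if (otherV > v) { v = otherV; idx = otherI; }
        }
        maxVal = v;   // valid in lane 0
        maxIdx = idx;
    }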

TODO: Hook this up to the reference testing harness. It’s currently only tested via the BLAS referencer, preventing a direct Thrust comparison (but we will likely win)!

IDE support

Speclib adds IDE indexing support on top of CLion’s built-in support by pretending to be the Clang CUDA preprocessor and including several clang headers.