A modern C++ API for CUDA.

Key features:

If speclib is in use, some additional features are enabled, such as the ability to allocate TensorLike or Nomadic objects directly on devices.

Motivational Example

In this example program we’re going to:

Everything will be done in the most asynchronous way possible.

Vanilla CUDA

#define CUDA_CHECK(C) { \
        cudaError_t status = C; \
        if (status != cudaSuccess) { \
            std::cerr << cudaGetErrorString(status) << std::endl; \
            assert(status == cudaSuccess); \
            abort(); \
        } \
    }

void foo() {
    cudaEvent_t inCopyEvent;

    // Event to detect B's compute finishing
    cudaEvent_t bDoneEvent;

    // Event for the host to await everything being finished.
    cudaEvent_t allDoneEvent;

    CUDA_CHECK(cudaEventCreateWithFlags(&inCopyEvent, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&bDoneEvent, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&allDoneEvent, cudaEventBlockingSync | cudaEventDisableTiming));

    cudaStream_t A;
    cudaStream_t B;
    CUDA_CHECK(cudaStreamCreate(&A));
    CUDA_CHECK(cudaStreamCreate(&B));

    // Copy the input to the device.
    CUDA_CHECK(cudaMemcpyAsync(/** stuff **/, cudaMemcpyHostToDevice, A));

    // Make inCopyEvent represent "The copy being done".
    CUDA_CHECK(cudaEventRecord(inCopyEvent, A));

    // Make B wait for the copy.
    // That last argument is documented as "must be zero". (really).
    CUDA_CHECK(cudaStreamWaitEvent(B, inCopyEvent, 0));

    // Compute for A
    netKernel_1<<<1, 1, 0, A>>>(...args...);
    netKernel_2<<<1, 1, 0, A>>>(...args...);

    // Compute for B
    treeKernel<<<1, 1, 0, B>>>(...args...);

    // Make A wait for B
    CUDA_CHECK(cudaEventRecord(bDoneEvent, B));
    CUDA_CHECK(cudaStreamWaitEvent(A, bDoneEvent, 0));

    // Make A copy the output to the host
    CUDA_CHECK(cudaMemcpyAsync(/** stuff **/, cudaMemcpyDeviceToHost, A));
    CUDA_CHECK(cudaEventRecord(allDoneEvent, A));

    // Make the host wait for all to be done.
    CUDA_CHECK(cudaEventSynchronize(allDoneEvent));

    // Clean up
    CUDA_CHECK(cudaEventDestroy(inCopyEvent));
    CUDA_CHECK(cudaEventDestroy(bDoneEvent));
    CUDA_CHECK(cudaEventDestroy(allDoneEvent));
    CUDA_CHECK(cudaStreamDestroy(A));
    CUDA_CHECK(cudaStreamDestroy(B));
}

With libSCALE

void foo() {
    // Object representing GPU 0. Realistic code might pick the value a different way, or rely
    // on `getActive()` to use the default if you don't care.
    sp::Device gpu = sp::Device::get(0);

    // Create two streams on the GPU.
    sp::Stream A = gpu.createStream();
    sp::Stream B = gpu.createStream();

    // Copy the input to the device.
    A.memcpy(...stuff...);
    B.await(A.recordEvent()); // <- Wait for the copy

    // Compute for A
    netKernel_1<<<1, 1, 0, A>>>(...args...);
    netKernel_2<<<1, 1, 0, A>>>(...args...);

    // Compute for B
    treeKernel<<<1, 1, 0, B>>>(...args...);
    A.await(B.recordEvent()); // <- Make A wait for B.
    A.memcpy(...stuff...); // <- Copy output to host

    // Make host wait.

    // Cleanup is automatic due to RAII.
    // Errors are thrown as C++ exceptions, providing error handling equivalent to the previous example
    // without us writing any code. Performance will actually be _better_, because (on most platforms)
    // C++ exceptions have zero runtime cost unless they are thrown, while C-style return codes always
    // have nonzero cost because they must be checked.
}

Error handling

All CUDA error codes have a corresponding CudaException type, allowing you to use try..catch to perform error handling.

Since try-catch blocks have zero cost until an exception is actually thrown (on most 64-bit platforms), this alone is a small performance win over C-style error handling. And because exceptions are only thrown when something has already gone wrong, libSCALE can afford to do quite expensive and elaborate post-mortem analysis of an exception to determine the root cause. In many cases, this means libSCALE can provide a very clear error message.

For example, if you make the simple mistake of trying to do a device-to-host copy with a source pointer that points to host memory, and you have no exception handlers, your program will crash immediately and print the following:

terminate called after throwing an instance of 'sp::CudaInvalidValueException'
  what():  Source for device->host memcpy was not a valid device pointer!

 - Was the buffer already freed?
 - Did you use a pointer cast to add `__device__` qualification, meaning this is actually a host pointer?

This greatly simplifies debugging of common CUDA programming mistakes.

Metaprogram using GPU hardware features

Most of the values returned by cudaGetDeviceProperties are constant properties of the hardware, and many of them influence how your program should be compiled for optimal performance.

LibSCALE provides a mechanism, sp::Device::getAttribute<...>(), by which these values can be accessed in a constexpr fashion, allowing you to use them for metaprogramming. If you also use XCMake to build your project, extra features are available, such as:

Unsupported APIs

Some APIs aren’t supported yet but could sensibly be added:

The graph API

This would benefit significantly from an object-oriented approach. The NVIDIA® graph API is a C-style API which is much trickier to use fluently.

The graphics interoperability APIs

These would benefit from a similar cleanup, but they’re long and not really relevant to any of our work.

Rejected APIs

Some APIs are deliberately unsupported:

Texture/Surface references

Use texture/surface objects instead, as encouraged by the NVIDIA documentation.


Device selection via cudaSetDevice

Use Device objects instead. This is less mistake-prone than relying on implicit thread-local state when using multiple GPUs.


Stream creation flags

The answer would always be cudaStreamNonBlocking, since that’s the only flag that exists, and omitting it is akin to asking to have your program slowed down. Unless you’re relying on the deprecated-for-years-now implicit default stream synchronisation behaviour, this won’t break your program.


Cooperative kernel launch

Cooperative kernels are problematic for several reasons:

All of the “C++ API Routines” from NVIDIA libcuda

NVIDIA’s libcuda provides what it optimistically calls “C++ API Routines”.

These are all template functions that cast their input back to void* and call the underlying C function. This provides no additional type safety, so it is somewhat pointless. Use the corresponding libSCALE API instead.

Synchronous memory APIs like cudaMemcpy

Doing anything synchronously in CUDA is a bad idea for performance. LibSCALE makes asynchronous programming no more unpleasant than synchronous programming, so synchronous APIs are unnecessary.

Should you want to do something “synchronously”, use the “chainable” stream manipulation APIs, like this:

auto myGPU = sp::Device::get(0);
sp::Stream myStream = myGPU.createStream();

// Queue up a memcpy, then a memset, and then wait for it.
// Basically a synchronous memcpy-then-memset.

Note: Some of NVIDIA’s ostensibly-synchronous CUDA APIs actually behave asynchronously in cases where the input is small. LibSCALE never exhibits this insanity because it can be a source of subtle bugs.