A modern C++ API for CUDA.
Key features:
If speclib is in use then some additional features are enabled, such as the ability to directly allocate TensorLike or Nomadic objects on devices.
In this example program we’re going to copy some input to the GPU, run kernels on two concurrent streams, and copy the result back to the host. Everything will be done in the most asynchronous way possible.
#define CUDA_CHECK(C) { \
    cudaError_t status = C; \
    if (status != cudaSuccess) { \
        std::cerr << cudaGetErrorString(status) << std::endl; \
        assert(status == cudaSuccess); \
        abort(); \
    } \
}

void foo() {
    // Event to detect the input copy finishing.
    cudaEvent_t inCopyEvent;

    // Event to detect B's compute finishing.
    cudaEvent_t bDoneEvent;

    // Event for the host to await everything being finished.
    cudaEvent_t allDoneEvent;

    CUDA_CHECK(cudaEventCreateWithFlags(&inCopyEvent, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&bDoneEvent, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&allDoneEvent, cudaEventBlockingSync | cudaEventDisableTiming));

    cudaStream_t A;
    cudaStream_t B;
    CUDA_CHECK(cudaStreamCreate(&A));
    CUDA_CHECK(cudaStreamCreate(&B));

    // Copy the input to the device.
    CUDA_CHECK(cudaMemcpyAsync(/** stuff **/, cudaMemcpyHostToDevice, A));

    // Make inCopyEvent represent "the copy being done".
    CUDA_CHECK(cudaEventRecord(inCopyEvent, A));

    // Make B wait for the copy.
    // That last argument is documented as "must be zero". (Really.)
    CUDA_CHECK(cudaStreamWaitEvent(B, inCopyEvent, 0));

    // Compute for A.
    netKernel_1<<<1, 1, 0, A>>>(...args...);
    netKernel_2<<<1, 1, 0, A>>>(...args...);

    // Compute for B.
    treeKernel<<<1, 1, 0, B>>>(...args...);

    // Make A wait for B.
    CUDA_CHECK(cudaEventRecord(bDoneEvent, B));
    CUDA_CHECK(cudaStreamWaitEvent(A, bDoneEvent, 0));

    // Make A copy the output to the host.
    CUDA_CHECK(cudaMemcpyAsync(/** stuff **/, cudaMemcpyDeviceToHost, A));
    CUDA_CHECK(cudaEventRecord(allDoneEvent, A));

    // Make the host wait for all to be done.
    CUDA_CHECK(cudaEventSynchronize(allDoneEvent));

    // Clean up.
    CUDA_CHECK(cudaEventDestroy(inCopyEvent));
    CUDA_CHECK(cudaEventDestroy(bDoneEvent));
    CUDA_CHECK(cudaEventDestroy(allDoneEvent));
    CUDA_CHECK(cudaStreamDestroy(A));
    CUDA_CHECK(cudaStreamDestroy(B));
}
void foo() {
    // Object representing GPU 0. Realistic code might pick the value a different way, or rely
    // on `getActive()` to use the default if you don't care.
    sp::Device gpu = sp::Device::get(0);

    // Create two streams on the GPU.
    sp::Stream A = gpu.createStream();
    sp::Stream B = gpu.createStream();

    // Copy the input to the device.
    A.memcpy(...stuff...);
    B.await(A.recordEvent()); // <- Wait for the copy

    // Compute for A.
    netKernel_1<<<1, 1, 0, A>>>(...args...);
    netKernel_2<<<1, 1, 0, A>>>(...args...);

    // Compute for B.
    treeKernel<<<1, 1, 0, B>>>(...args...);

    A.await(B.recordEvent()); // <- Make A wait for B.
    A.memcpy(...stuff...);    // <- Copy output to host.

    // Make the host wait.
    A.recordBlockingEvent().sync();

    // Cleanup is automatic due to RAII.
    // Errors are thrown as C++ exceptions, providing equivalent error handling to the other
    // example without us writing any code. Performance will actually be _better_, because
    // (on most platforms) C++ exceptions have zero runtime cost unless they are thrown, but
    // C-style return codes always have nonzero cost because they must be checked.
}
All CUDA error codes have a corresponding CudaException type, allowing you to use try..catch to perform error handling.
Since try-catch blocks have zero cost until an exception is actually thrown (on most 64-bit platforms), this alone leads to a small performance win compared to C-style error handling. And because the exceptional path only runs once something has already gone wrong, libSCALE can afford to do quite expensive and elaborate post-mortem analysis of an exception to figure out the root cause. In many cases, this means libSCALE can provide a very clear error message.
For example, if you make the simple mistake of trying to do a device-to-host copy with a source pointer that points to host memory, and you have no exception handlers, your program will crash immediately and print the following:
terminate called after throwing an instance of 'sp::CudaInvalidValueException'
what(): Source for device->host memcpy was not a valid device pointer!
- Was the buffer already freed?
- Did you use a pointer cast to add `__device__` qualification, meaning this is actually a host pointer?
This greatly simplifies debugging of common CUDA programming mistakes.
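If you would rather recover from such a failure than crash, it can be caught like any other C++ exception. Below is a minimal sketch, not taken from the documentation: it assumes only the CudaException name mentioned above (with sp::CudaException usable as a catchable base; catching std::exception would work equally well, since what() carries the diagnosis), and it uses placeholder arguments in the same ...stuff... style as the examples above.

#include <iostream>
// (plus whatever libSCALE header declares sp::Stream and the exception types)

void copyBack(sp::Stream &A) {
    try {
        // The faulty device->host copy from the example above.
        A.memcpy(...stuff...);
        A.recordBlockingEvent().sync();
    } catch (const sp::CudaException &e) {
        // e.what() carries the same detailed diagnosis that would otherwise be printed on crash.
        std::cerr << "CUDA operation failed: " << e.what() << std::endl;
        // ...fall back, retry, or rethrow as appropriate...
    }
}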
Most of the values returned by cudaGetDeviceProperties are constant properties of the hardware, and many of them influence how your program should be compiled for optimal performance.
LibSCALE provides a mechanism by which these values (through sp::Device::getAttribute<...>()) can be accessed in a constexpr fashion, allowing you to use them for metaprogramming. If you also use XCMake to build your project, additional features are available on top of this.
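As a rough illustration of the constexpr attribute access, here is a hypothetical sketch rather than documented API: cudaDevAttrMaxThreadsPerBlock is the standard CUDA attribute enumerator, but the exact template-argument form expected by getAttribute is an assumption.

// Hypothetical: query a hardware constant at compile time...
constexpr int kMaxThreadsPerBlock =
    sp::Device::getAttribute<cudaDevAttrMaxThreadsPerBlock>();

// ...and use it for metaprogramming, e.g. a compile-time sanity check or a template parameter.
static_assert(kMaxThreadsPerBlock >= 256,
              "this kernel configuration assumes at least 256 threads per block");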
Some APIs aren’t supported yet but could sensibly be added:
The NVIDIA® graph API is a C-style API which is much trickier to use fluently; it would benefit significantly from an object-oriented approach.
These would benefit from the cleanup, but they’re long and not really relevant to any of our work.
Some APIs are deliberately unsupported:
The texture/surface reference APIs
Use texture/surface objects instead, as encouraged by the NVIDIA documentation.
cudaGetDevice/cudaSetDevice
Use Device objects instead. This is less mistake-prone than relying on implicit thread-local state when using multiple GPUs.
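For instance, multi-GPU code can hold one Device object per GPU instead of switching hidden thread-local state with cudaSetDevice. A minimal sketch, reusing only the Device::get and createStream calls shown earlier (headers and the actual work omitted):

// Each Device object is explicit about which GPU it refers to, so there is no
// "current device" to get out of sync with the code's intent.
sp::Device gpu0 = sp::Device::get(0);
sp::Device gpu1 = sp::Device::get(1);

// Work queued on each stream runs on that stream's own device.
sp::Stream s0 = gpu0.createStream();
sp::Stream s1 = gpu1.createStream();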
cudaStreamGetFlags
The answer would always be cudaStreamNonBlocking, since that’s the only flag that exists, and omitting it is akin to asking to have your program slowed down. Unless you’re relying on the deprecated-for-years-now implicit default stream synchronisation behaviour, this won’t break your program.
cudaLaunchCooperativeKernel/cudaLaunchCooperativeKernelMultiDevice
Cooperative kernels are problematic for several reasons:
libcuda
NVIDIA’s libcuda provides what it optimistically calls “C++ API Routines”. These are all template functions that cast their input back to void* and call the underlying C function. This provides no additional type-safety, so is somewhat pointless. Use the corresponding libSCALE API instead.
cudaMemcpy
Doing anything synchronously in CUDA is a bad idea for performance. LibSCALE makes asynchronous programming no more unpleasant than synchronous programming, so synchronous APIs are unnecessary.
Should you want to do something “synchronously”, use the “chainable” stream manipulation APIs, like this:
auto myGPU = sp::Device::get(0);
sp::Stream myStream = myGPU.createStream();

// Queue up a memcpy, then a memset, and then wait for it.
// Basically a synchronous memcpy-then-memset.
myStream.memcpy(...)
        .memset(...)
        .synchronise();
Note: Some of NVIDIA’s ostensibly-synchronous CUDA APIs actually behave asynchronously in cases where the input is small. LibSCALE never exhibits this insanity because it can be a source of subtle bugs.