libSCALE  0.2.0
A modern C++ CUDA API
Host API

Host-side CUDA API (use instead of libcuda). More...

Modules

 Exceptions
 Exception objects for errors from the GPU runtime API.
 

Classes

class  sp::CudaKernel
 Object representing a kernel. More...
 
class  sp::Device
 Represents a GPU. More...
 
class  sp::Event
 Represents an event in a compute stream. More...
 
class  sp::BlockingEvent
 An event that the host can synchronise with. More...
 
class  sp::Host
 Represents the host. More...
 
class  sp::Stream
 Represents a CUDA stream. More...
 

Enumerations

enum class  sp::DeviceMemoryType { sp::DeviceMemoryType::NORMAL , sp::DeviceMemoryType::MANAGED }
 Type of GPU memory allocation. More...
 
enum class  sp::HostMemoryType { sp::HostMemoryType::NORMAL , sp::HostMemoryType::STAGING , sp::HostMemoryType::PINNED , sp::HostMemoryType::MAPPED }
 Type of host memory allocation. More...
 

Detailed Description

Host-side CUDA API (use instead of libcuda).

Queue a kernel launch on this Stream

Effectively wraps cudaLaunchKernel, providing both a more convenient API and full exception handling. No more "Unspecified Launch Failure". This results in a number of ways to run a kernel: the legacy style and the Spectral style.

There are two ways to pass the arguments for the kernel itself along to this wrapper. We recommend the first option, which takes a simple parameter pack, as seen here:

Example

sp::Stream stream = gpu.createStream();
// Use a string as a host-side buffer/destination
std::string buffer = "aaaaaaaaaa";
// Allocate a device-side buffer for the kernel to fill
auto devicePointer = gpu.allocateMemory<char>(buffer.size());
// Queue our stream operations
stream.launchKernel(fill, (dim3)1, (dim3)256, 0, devicePointer.get(), 'b', buffer.size()); // "bbbbbbbbbb"
// Queue a copy from the GPU to our host buffer
stream.copyMemory(buffer.data(), devicePointer.get(), buffer.size());
stream.synchronize();
verify(buffer, "bbbbbbbbbb", 10);

However, you may also pass a (void**) argument array. Usually you will construct it as seen here:

Example

// Create a pack to use the (void**) overload
auto ptr = devicePointer.get();
char toWrite = 'c';
char toExpect = 'b';
uint64_t count = buffer.size();
void* args[4] = {&ptr, &toWrite, &toExpect, &count};
stream.launchKernel(conditionalFill, (dim3)1, (dim3)256, 0, (void**)args); // "cccccccccc"
stream.copyMemory(buffer.data(), devicePointer.get(), buffer.size());
stream.synchronize();
verify(buffer, "cccccccccc", 10);
Note
A void* pointer to the kernel function will implicitly convert to an sp::CudaKernel and be accepted here, for compatibility with NVIDIA® APIs.
sp::Vec<int, X> for X in 1-3 will implicitly convert to dim3 and be accepted by all methods. This is handy when you're using sp::Vec to compute sizes.
Parameters
kernelFunction   Pointer to any function which returns void. This should cover all kernel functions.
gridDim          Number of blocks
blockDim         Number of threads in each block
args             Pointer to a std::array<void*> containing references to the arguments needed by the kernel
dynamicSMem      Requested amount of dynamic shared memory per block, in bytes

TODO: Use C++ non-type template params to shift the block/thread/smem into the template, mirroring the <<<>>> syntax in a way

Enumeration Type Documentation

◆ DeviceMemoryType

enum class sp::DeviceMemoryType

Type of GPU memory allocation.

Enumerator
NORMAL 

Standard allocation.

Only accessible on the device, and via memory copy operations.

MANAGED 

Managed memory.

Accessible on both the host and device.

This may seem convenient, but there are serious performance implications to consider because memory accesses can require PCIe transactions - potentially many of them.

◆ HostMemoryType

enum class sp::HostMemoryType

Type of host memory allocation.

Enumerator
NORMAL 

Ordinary host memory.

You could just use new instead, but this can be useful if you're metaprogramming and want to select a memory type based on some constexpr function.

STAGING 

Write-combining page-locked host memory.

Such memory is optimised for use as a staging area for copies to GPU.

This memory can be copied to the GPU more quickly than any other type of memory, but it should be considered write-only from the host. Host reads of this memory will be extremely slow.

This is a good choice if you want a buffer that is only written to by the host and then sent to the GPU. If you want memory that is optimised for copies to GPU and may also be read by the host, use PINNED instead.

A common configuration is to use STAGING memory for input to the GPU and PINNED memory for receiving output.

PINNED 

Page-locked host memory.

This memory can be copied to/from the GPU more efficiently than memory allocated with the usual system allocation functions.

Allocating a very large amount of page-locked memory can cause OS performance issues.

MAPPED 

Page-locked and GPU-mapped host memory.

This sort of memory can be accessed from the GPU without ever copying it there. Each access generates its own PCIe transaction to do so. This is very slow, but it can occasionally be useful if you have a huge, rarely accessed buffer.

Note that there is an overhead associated with using this sort of memory. If you aren't using the mappedness, use PINNED instead.

Using this kind of allocation changes the behaviour of most APIs that implicitly copy buffers to merely do an address transformation to produce the device-side pointer instead.