Host-side CUDA API (use instead of libcuda).

Modules

- Exceptions: Exception objects for errors from the GPU runtime API.

Classes

- class sp::CudaKernel: Object representing a kernel.
- class sp::Device: Represents a GPU.
- class sp::Event: Represents an event in a compute stream.
- class sp::BlockingEvent: An event that the host can synchronise with.
- class sp::Host: Represents the host.
- class sp::Stream: Represents a CUDA stream.

Enumerations

- enum class sp::DeviceMemoryType { NORMAL, MANAGED }: Type of GPU memory allocation.
- enum class sp::HostMemoryType { NORMAL, STAGING, PINNED, MAPPED }: Type of host memory allocation.
Queue a kernel launch on this Stream.

Effectively wraps cudaLaunchKernel, providing both a more convenient API and full exception handling. No more "Unspecified Launch Failure". This results in a number of ways to run a kernel: sp::Kernel::launch() launches that sp::Kernel object on the given stream. If passed an sp::Stream, it has full exception handling; if passed a cudaStream_t, it does not.

There are two ways to pass the arguments for the kernel itself along to this wrapper. We recommend the first option, which takes a simple parameter pack, as in the sketch below.
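The snippet that originally appeared here did not survive extraction, so the following is a minimal sketch of the parameter-pack form. The header name, the exact launch() overload, and the placement of dynamicSMem before the pack are assumptions, not confirmed by this page.

```cpp
#include <cuda_runtime.h>
#include "sp.hpp" // assumed header name for the sp API

// Any ordinary kernel works: the wrapper accepts any void-returning function.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void run(sp::Stream& stream, float* devData, int n) {
    // Parameter-pack form: kernel arguments are forwarded directly, with no
    // manual void** packing. Placing dynamicSMem before the pack is an assumption.
    stream.launch(scale, dim3((n + 255) / 256), dim3(256),
                  /*dynamicSMem=*/0, devData, n, 1.5f);
}
```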
However, you may also pass a (void**)args array. Usually you will construct it along the lines of the sketch below.
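The original snippet is likewise missing; this sketch follows the parameter order given in the table below, with the same caveats as above.

```cpp
#include <cuda_runtime.h>
#include <array>
#include "sp.hpp" // assumed header name for the sp API

__global__ void scale(float* data, int n, float factor); // defined elsewhere

void runRaw(sp::Stream& stream, float* devData, int n) {
    float factor = 1.5f;
    // Each element points at one kernel argument, mirroring cudaLaunchKernel.
    // The pointed-to variables must stay alive until the launch call returns.
    std::array<void*, 3> args{&devData, &n, &factor};
    stream.launch(scale, dim3((n + 255) / 256), dim3(256),
                  args.data(), /*dynamicSMem=*/0);
}
```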
Any void* pointer to a kernel function will implicitly convert to an sp::CudaKernel and be accepted here, for compatibility with NVIDIA® APIs. Likewise, sp::Vec<int, X> for X in 1-3 will implicitly convert to dim3 and be accepted by all methods; this is handy when you're using sp::Vec to compute sizes (see the sketch after the parameter table below).

Parameters | |
---|---|
kernelFunction | Pointer to any function which returns void. This should cover all kernel functions. |
gridDim | Number of blocks. |
blockDim | Number of threads in each block. |
args | Pointer to an array of void*, each element referencing one of the arguments needed by the kernel. |
dynamicSMem | Requested amount of dynamic shared memory per block, in bytes. |
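As promised above, a small sketch of the sp::Vec convenience; the brace-initialisation of sp::Vec is an assumption.

```cpp
#include <cuda_runtime.h>
#include "sp.hpp" // assumed header name for the sp API

__global__ void fill3d(float* volume); // defined elsewhere

void runVec(sp::Stream& stream, float* devVolume) {
    // sp::Vec<int, X> for X in 1-3 converts implicitly to dim3, so computed
    // sizes can be passed straight through without manual conversion.
    sp::Vec<int, 3> grid{64, 64, 4};
    sp::Vec<int, 3> block{8, 8, 8};
    stream.launch(fill3d, grid, block, /*dynamicSMem=*/0, devVolume);
}
```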
TODO: Use C++ non-type template params to shift the block/thread/smem into the template, mirroring the <<<>>> syntax in a way
enum class sp::DeviceMemoryType

Type of GPU memory allocation.
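The enumerator table for this enum did not survive extraction. Going by the names NORMAL and MANAGED, the presumed underlying CUDA allocations are sketched below; the sp allocation API itself is not shown on this page.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Presumed CUDA equivalents of the two DeviceMemoryType values (illustrative only).
void deviceAllocSketch(std::size_t bytes) {
    void* normal = nullptr;
    void* managed = nullptr;

    // NORMAL: ordinary device memory, accessible only from the GPU.
    cudaMalloc(&normal, bytes);

    // MANAGED: unified memory, migrated on demand between host and GPU.
    cudaMallocManaged(&managed, bytes);

    cudaFree(normal);
    cudaFree(managed);
}
```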
enum class sp::HostMemoryType

Type of host memory allocation.
Enumerator | |
---|---|
NORMAL | Ordinary host memory. You could just use the normal system allocator instead. |
STAGING | Write-combining page-locked host memory. Such memory is optimised for use as a staging area for copies to the GPU: it can be copied to the GPU more quickly than any other type of memory, but it should be considered write-only from the host, since host reads of this memory are extremely slow. This is a good choice if you want a buffer that is only written to by the host and then sent to the GPU. If you want memory that is optimised for copies to the GPU and may also be read by the host, use PINNED instead. A common configuration is to use STAGING memory for input to the GPU and PINNED memory for receiving output. |
PINNED | Page-locked host memory. This memory can be copied to/from the GPU more efficiently than memory allocated with the usual system allocation functions. Allocating a very large amount of page-locked memory can cause OS performance issues. |
MAPPED | Page-locked and GPU-mapped host memory. This sort of memory can be accessed from the GPU without ever copying it there; each access generates its own PCIe transaction. This is obviously very slow, but occasionally, if you have a huge and rarely accessed buffer, it is useful. Note that there is an overhead associated with using this sort of memory: if you aren't using the mapping, use PINNED instead. Using this kind of allocation changes the behaviour of most APIs that implicitly copy buffers: instead of copying, they merely perform an address transformation to produce the device-side pointer. |
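For orientation, a sketch of the CUDA runtime allocations these four types presumably map onto; the sp allocation API itself is not shown on this page.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstddef>

// Presumed CUDA equivalents of the HostMemoryType values (illustrative only).
void hostAllocSketch(std::size_t bytes) {
    void* normal  = std::malloc(bytes);  // NORMAL: ordinary host memory
    void* staging = nullptr;
    void* pinned  = nullptr;
    void* mapped  = nullptr;

    // STAGING: write-combining, page-locked. Fast host-to-GPU copies,
    // but host reads are extremely slow.
    cudaHostAlloc(&staging, bytes, cudaHostAllocWriteCombined);

    // PINNED: plain page-locked memory, efficient copies in both directions.
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);

    // MAPPED: page-locked and mapped into the GPU address space; the GPU
    // accesses it over PCIe via a translated device-side pointer.
    cudaHostAlloc(&mapped, bytes, cudaHostAllocMapped);
    void* devView = nullptr;
    cudaHostGetDevicePointer(&devView, mapped, 0);

    cudaFreeHost(mapped);
    cudaFreeHost(pinned);
    cudaFreeHost(staging);
    std::free(normal);
}
```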