In conventional NVIDIA® CUDA®, the type system does not distinguish between pointers to GPU memory and pointers to host memory. This allows you to compile nonsensical code like:
```c++
__global__ void myKernel(float *someFloats) {
    // ...
}

int main() {
    float *someFloats = new float[1024];

    // Passing host buffer to kernel, but this will only fail at runtime.
    myKernel<<<1024, 1>>>(someFloats);

    delete[] someFloats;
}
```
When using libSCALE and the Spectral LLVM Compiler, device pointers are a different type to host pointers. Device allocations yield types such as `__device float *`, allowing such errors to be statically found:
```c++
__global__ void myKernel(__device float *someFloats) {
    // ...
}

int main() {
    float *someFloats = new float[1024];

    // Compile error: `__device float *` is not compatible with `float *`.
    myKernel<<<1024, 1>>>(someFloats);

    delete[] someFloats;
}
```
A complete version of this example using SCALE would look something like this:
```c++
__global__ void myKernel(__device float *someFloats) {
    // ...
}

int main() {
    // Object representing GPU 0.
    auto &gpu = sp::Device::get(0);

    // Object representing a cudaStream.
    sp::Stream s = gpu.createStream();

    // Allocate 1024 floats on GPU 0. Any CUDA error is thrown as a C++ exception.
    sp::UniquePtr<__device float> buffer = gpu.allocateMemory<float>(1024);

    // Launch the kernel on the stream, and await it.
    myKernel<<<1024, 1, 0, s>>>(buffer.get());
    s.synchronise();

    // The float buffer and cudaStream are automatically deallocated when their
    // objects go out of scope (RAII).
}
```
C++ and conventional CUDA provide only the “generic” address space, represented as `T *` (for example, `float *`). In conventional CUDA, the generic address space represents all possible pointers.
With explicit address space qualifiers, the following address spaces are provided:

| Address space | Example | Description |
|---|---|---|
| `__flat` | `__flat int *` | GPU or CPU memory. |
| Generic | `int *` | Semantics differ from C++ and conventional CUDA as described in the generic address space subsection. |
| `__device` | `__device int *` | GPU global memory. |
| `__shared` | `__shared int *` | GPU shared memory. |
| `__constant` | `__constant int *` | GPU constant memory. |
| `__local` | `__local int *` | GPU stack memory (rarely useful). This uses the same physical memory as `__device` unless promoted to registers, but without memory coalescing. |
As an example, `__shared float *` is a pointer to a `float` in GPU shared memory. Likewise, `__shared float &` is a reference to a `float` in GPU shared memory. Address space qualifiers are treated as qualifiers (like `const` and `volatile` qualifiers), so it is permissible to have an address space qualified type as a template parameter, for example:
```c++
template <typename A, typename B>
__device__ void callee(A *a, B *b) {
    // ...
}

__device__ void caller(__device int *a, const int *b) {
    callee(a, b); // A is `__device int` and B is `const int`.
}
```
With explicit address space qualifiers, the generic address space refers to any memory that may be dereferenced in the context in which it is used. In a `__global__` or `__device__` function, this is `__device`, `__shared`, `__constant`, or `__local`: i.e: any GPU memory. In a host function (`__host__` or unannotated), this is just host memory. A `__host__ __device__` (or `constexpr`) function is considered for this purpose to be a host function when building for the host and a device function when building for the device. It is considered for the purposes of address space rules only if it is semantically instantiated, i.e: if the function is not `inline` (either explicitly, or implicitly by being `constexpr`, templated, or defined in a class body) or if it is ODR-used or explicitly instantiated.
This means that the generic address space can be used in host functions to limit a type to accepting only host pointers, without needing to qualify pointers and references in device functions that are intended to accept pointers and references to any GPU memory (which would become cumbersome for many OOP patterns). Suitable implicit casts to the generic address space are provided as described in the address space casts section.
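For illustration, here is a small sketch (not taken from the manual, with hypothetical function names) of that division of labour: the host function can only be given host pointers, while the unqualified parameter of the device function accepts any GPU memory via the implicit casts described later.

```c++
void fillOnHost(float *dst, int n) {          // Generic in a host context: host memory only.
    for (int i = 0; i < n; ++i) dst[i] = 0.0f;
}

__device__ void scaleOnDevice(float *dst, int n, float k) {  // Generic in a device context: any GPU memory.
    for (int i = 0; i < n; ++i) dst[i] *= k;
}

__global__ void kernel(__device float *data, int n) {
    __shared__ float tile[256];
    scaleOnDevice(data, n, 2.0f); // __device converts to generic implicitly.
    scaleOnDevice(tile, 256, 2.0f); // __shared converts to generic implicitly.
}
```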
The `inline`/ODR-used rule for `__host__ __device__` functions is intended to make it possible to write code that works on both the host and the device, even if it performs address space casts in device code that would be invalid in host code. The requirement that such a function be `inline` (even if implicitly) is because otherwise the function would be code-generated, and could be used by other translation units’ host code. Template-dependence is not required because some constructs could access different address spaces, depending on whether the translation unit is being built for host or device, even if the using function is not a template (or a method of a templated class).
The address space of a variable is inferred from its context and the side (host or device) being compiled for. The following table lists declarations and the inferred address space corresponding to those declarations.
| Declaration | Context | Host address space | Device address space |
|---|---|---|---|
| `int i;` | Global variable | Generic | |
| `int i;` | Local variable | Generic | Generic |
| `__local int i;` | Local variable | | `__local` |
| `constexpr int i;` | Anywhere | Generic | Generic |
| `__device__ int i;` | Anywhere | `__device` | `__device` |
| `__shared__ int i;` | Anywhere | `__shared` | `__shared` |
| `__constant__ int i;` | Anywhere | `__constant` | `__constant` |
Local variables (i.e: those which have automatic storage duration in C++) are generic address space by default, rather than `__local`. If it is desired to make a local variable explicitly in the `__local` address space, it can be declared as such, for example:

```c++
__local int value = 0;
```
Variables that are `__device__`, `__shared__`, and `__constant__` have the obvious address spaces. Other valid variables have generic address space.
The local variable semantics are chosen because many useful constructs in C++ (for example, those including `decltype`) rely on obtaining the type of local variables and applying computations, transformations, or comparisons on them. Such constructs are likely to break if local variables are unexpectedly qualified. Consider, for example:

```c++
T value = 0;
if constexpr (std::is_same_v<T, float>) {
    // ...
} else {
    // ...
}
```
It is not normally required to provide an overload for local variables. They should be promoted to registers in efficient GPU code and no shared semantics are possible. It is also not possible for a local memory pointer to meaningfully appear in host code. The __local
address space qualifier could, however, be useful to distinguish a local variable from a __device
variable so that an implementation can be chosen that is amenable to being placed in registers (e.g: using chained ternary operators to select between multiple values rather than an array offset).
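A minimal sketch of that idea (the function is hypothetical, not from the manual): the values are held in individually named `__local` variables and selected with chained ternaries, so no dynamic indexing forces them out of registers.

```c++
__device__ float pick4(int i) {
    // Explicitly __local to distinguish these from __device data; with only constant
    // access paths the compiler is free to keep them in registers.
    __local float a = 1.0f, b = 2.0f, c = 3.0f, d = 4.0f;
    return i == 0 ? a
         : i == 1 ? b
         : i == 2 ? c
         :          d;
}
```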
The rationale for `constexpr`’s address space is much the same as for local variables: templated code depending on `constexpr` values and their types should continue to work correctly.
In host contexts, no implicit address space casts are permitted except to the `__flat` address space. The following implicit address space casts are legal in device contexts:

- Any address space other than `__flat` to the generic address space.
- Any address space to the `__flat` address space.

Additionally, the following explicit address space casts are permitted in both host and device contexts:

- From the `__flat` address space to any address space.

An explicit cast can be performed with a C-style cast or with `addrspace_cast`, the latter of which operates analogously to `const_cast` and `static_cast`. These casts can be performed on pointers or references.
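A short sketch of the syntax (the function name is hypothetical, and the spelling of `addrspace_cast` is assumed to follow the `const_cast` analogy above):

```c++
__device__ void downcasts(int *genericPtr) {
    __device int *p = addrspace_cast<__device int *>(genericPtr); // Same effect as a C-style "down" cast.
    __device int &r = addrspace_cast<__device int &>(*genericPtr); // Also works on references.
}
```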
It is undefined behaviour to indirectly perform an address space cast not listed above (such as from `__shared` to generic or `__flat`, and then from the generic or `__flat` to `__device`).
```c++
__device__ void deviceFn(int *genericPtr, __device int *devicePtr) {
    int *a = devicePtr; // Implicit "up" cast.
    __device int *b = (__device int *)genericPtr; // Explicit "down" cast.
    __device int *c = genericPtr; // BAD! Needs an explicit cast.
    __shared int *d = (__shared int *)genericPtr; // BAD! Incompatible address spaces.
}

void hostFn(int *genericPtr, __device int *devicePtr) {
    int *a = devicePtr; // BAD! Implicit cast from __device to generic not allowed in a host function.
    int *b = (int *)devicePtr; // Explicit cast from __device to generic is permitted even in a host function.
    __device int *c = (__device int *)genericPtr; // Explicit "down" cast.
}
```
Explicit casts to the generic address space in host contexts permit interoperability with other CUDA libraries by casting away address-space qualifiers at the API boundary, and casting them back on again for pointers returned by the other library.
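As a sketch of that pattern (the wrapper name is hypothetical; `cudaMemcpy` is the ordinary CUDA runtime API, which expects generic pointers):

```c++
void copyToDevice(__device float *dst, const float *src, size_t count) {
    // Cast away the __device qualifier only at the API boundary; it is then the
    // programmer's responsibility that the other library treats it as a device pointer.
    cudaMemcpy((float *)dst, src, count * sizeof(float), cudaMemcpyHostToDevice);
}
```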
The implicit casts to the generic address space mean that functions do not have to be templated to accept pointers from anywhere on the same device as their context while also not requiring address space casts. Provided the methods and functions are inlined, this should not result in a loss of the ability to use address space specific instructions, as the optimizer can infer their address space based on the address space of the input to the address space cast.
Values may be accessed (read from and written to) only when their address space can be implicitly converted to the generic address space. Thus, for example, a __device int &
may be accessed from a device function, but not a host function.
It is undefined behaviour to perform an access via a generic address space pointer or reference where an explicit cast was required to obtain it in the generic address space, except if:

- the address space cast away was `__flat` and the runtime location is accessible from the accessor’s side, or
- the address space cast away was `__device` and the runtime location is managed memory.

```c++
__device__ void deviceFn(const int &a, const __device int &b, const __shared int &c, const __flat int &d) {
    int OK = a + b + c; // All these are accessible on device.
    int BAD = d; // Flat can't be accessed because it could be a host pointer.
}
```
```c++
void hostFn(const int &a, const __device int &b, const __shared int &c, const __flat int &d) {
    int OK = a; // Only generic is accessible on the host.
    int BAD = b + c + d; // None of these can be accessed because they are or could be on the device.
}

void hostFn(const __flat int *ptr) {
    int MAYBE = *(int *)ptr; // This is OK if and only if the thing pointed to by ptr is actually on the host.
}
```
These rules forbid accesses that would (or, in the case of directly accessing `__flat`, could) involve attempting to access memory that is not accessible from the side of the accessing function, for example a `__device int *` (even if cast to the generic address space) on the host. The explicit cast from `__flat` to generic (or another address space), followed by an access through the resulting pointer or reference, is permitted because it can be valid. It is up to the programmer to ensure that it actually is valid at runtime; the explicit cast is required so that this responsibility cannot be assumed accidentally.
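A sketch of the managed-memory exception (the function name is hypothetical; `cudaMallocManaged` is the ordinary CUDA runtime allocator): the host-side access below is defined only because the allocation is managed.

```c++
float readFirst(__device float *managed) {
    // `managed` must refer to managed memory (e.g. from cudaMallocManaged);
    // otherwise this access is undefined behaviour.
    return *(float *)managed;
}
```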
This language extension also allows you to use overload resolution to change the behaviour of functions based on the address-space of the arguments.
A function may be overloaded based on the address space of its arguments. If an overload is not provided for the given address space of an argument, but there is an overload for the generic address space, the generic address space overload is used.
```c++
__device__ int fn(int &i) {
    // This function is called if the address space is not __device or __shared.
    return i + 1;
}

__device__ int fn(__device int &i) {
    return i + 2;
}

__device__ int fn(__shared int &i) {
    return i + 3;
}

__constant__ int c = 42;

__global__ void kernel(__device int *dst1, __device int *dst2, __device int *dst3) {
    __shared__ int sharedInt;
    if (threadIdx.x == 0) {
        sharedInt = *dst3;
    }

    *dst1 = fn(c); // *dst1 = 42 + 1;
    *dst2 = fn(*dst2); // *dst2 += 2;
    *dst3 = fn(sharedInt); // *dst3 += 3;
}
```
Methods can be address space qualified, just as they can be `const` qualified. The address space qualification of a method applies to the address space of the object it is called on (i.e: the address space of the thing pointed to by `this`).
Address space qualification may only be applied to `__device__` methods, and to `__host__ __device__` and `constexpr` methods compiled for the device. The qualifier may not be `__flat`. Address space qualification may not be applied to static methods or non-member functions. It is permitted to qualify (for example, by a template) a `__host__ __device__` or `constexpr` method that is built for the host with the generic address space.
```c++
struct S {
    __device__ int fn() const {
        return 1;
    }

    __device__ int fn() const __shared {
        return 2;
    }
};

__device__ S deviceObj;
__shared__ S sharedObj;

__global__ void kernel(__device int *dst1, __device int *dst2) {
    *dst1 = deviceObj.fn(); // *dst1 = 1;
    *dst2 = sharedObj.fn(); // *dst2 = 2;
}
```
The rules for what address space qualifications are allowed on methods are governed by what would enable `this` to be dereferenced. Thus address space qualification is allowed only on non-static methods built for the device, and `__flat` is not permitted. Qualification of non-static `__host__ __device__` and `constexpr` methods built for the host with the generic address space is permitted so that the same definition can be address space qualified (via a template) for the device and used for the host.
Address spaces may be template parameters, and may participate in template argument deduction.
```c++
template <typename T, sp::AddressSpace AS>
__device__ int fn(__AS(AS) T &) {
    // T will be the generic address space object. We can select directly on AS. AS should only be compared for
    // equality with the members of sp::AddressSpace though, since its numerical values and ordering are not defined.
    return AS == sp::AddressSpace::SHARED ? 1 : 0;
}

__device__ int deviceInt;
__shared__ int sharedInt;

__global__ void kernel(__device int *dst1, __device int *dst2) {
    *dst1 = fn(deviceInt); // *dst1 = 0;
    *dst2 = fn(sharedInt); // *dst2 = 1;
}
```
The address space of a method can be a template parameter. This can be done by using the `__AS` macro, taking an `sp::AddressSpace` template argument, in place of a specific address space qualifier. This is useful for methods that rely on knowing, with generality, what address space their object is in, such as if they call further methods that are overloaded by object address space.
```c++
struct S {
    template <sp::AddressSpace AS>
    __device__ int fn() const __AS(AS) {
        return AS == sp::AddressSpace::SHARED ? 1 : 0;
    }
};

__device__ S deviceObj;
__shared__ S sharedObj;

__global__ void kernel(__device int *dst1, __device int *dst2) {
    *dst1 = deviceObj.fn(); // *dst1 = 0;
    *dst2 = sharedObj.fn(); // *dst2 = 1;
}
```
The address space utilities are in the `<spec/cuda/Addrspace.hpp>` header. Whether address space qualifiers are enabled during compilation can be detected with the compiler pre-defined `__SCALE_ADDRSPACES__` macro.
### `sp::AddressSpace` enumerator

The `sp::AddressSpace` enumerator provides a way to refer to different address spaces. It is used by many of the other address space utilities. It contains the following entries:
| `enum` entry | Address space |
|---|---|
| `FLAT` | `__flat` |
| `GENERIC` | Generic |
| `DEVICE` | `__device` |
| `SHARED` | `__shared` |
| `CONSTANT` | `__constant` |
| `LOCAL` | `__local` |
The specific values of these entries are implementation defined, and subject to change. Users should therefore not rely on specific values or their ordering.
### `__AS` macro

The `__AS` macro can be used as an address space qualifier. It is a function macro that takes an entry from `sp::AddressSpace` as an argument. For example, `__AS(sp::AddressSpace::DEVICE) int *` is equivalent to `__device int *`.
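A small check of the stated equivalence (a sketch, not taken from the manual):

```c++
#include <spec/cuda/Addrspace.hpp>
#include <type_traits>

// The macro expands to the same qualifier, so the two spellings name the same type.
static_assert(std::is_same_v<__AS(sp::AddressSpace::DEVICE) int *, __device int *>);
```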
Type traits are provided that are analogous to the standard C++ type traits for CV qualification. In the following table, `T` is a type template parameter and `AS` is a non-type template parameter of type `sp::AddressSpace`.
| C++17 style | C++11 style | Description |
|---|---|---|
| `sp::get_addrspace_v<T>` | `sp::get_addrspace<T>::value` | Get the address space of `T`. Returns an `sp::AddressSpace`. This is analogous to `std::is_const_v`. |
| `sp::add_addrspace_t<T, AS>` | `sp::add_addrspace<T, AS>::type` | Add an address space to `T` (which must be in the generic address space). This is analogous to `std::add_const_t`. |
| `sp::remove_addrspace_t<T>` | `sp::remove_addrspace<T>::type` | Remove the address space from `T`, leaving a generic address space type. This is analogous to `std::remove_const_t`. |
| `sp::remove_cva_t<T>` | `sp::remove_cva<T>::type` | Remove `const`, `volatile`, and address space qualifiers. This is analogous to `std::remove_cv_t`. |
| `sp::remove_cvaref_t<T>` | `sp::remove_cvaref<T>::type` | Remove reference and `const`, `volatile`, and address space qualifiers. This is analogous to `std::remove_cvref_t`. |
```c++
#include <spec/cuda/Addrspace.hpp>

/* sp::get_addrspace_v */

// Gets the address space of a type.
static_assert(sp::get_addrspace_v<__device float> == sp::AddressSpace::DEVICE);

// Like std::is_const_v, sp::get_addrspace_v does not propagate through references.
static_assert(sp::get_addrspace_v<__shared float &> == sp::AddressSpace::GENERIC);

// This has no address space, so it's in the generic address space.
static_assert(sp::get_addrspace_v<float> == sp::AddressSpace::GENERIC);

// The pointer is in the generic address space even though the pointed-to type is not.
static_assert(sp::get_addrspace_v<__shared const float *> == sp::AddressSpace::GENERIC);
```
```c++
#include <spec/cuda/Addrspace.hpp>

/* sp::add_addrspace_t */

// Add the __device address space to a type.
static_assert(std::is_same_v<__device float, sp::add_addrspace_t<float, sp::AddressSpace::DEVICE>>);

// As with std::add_const_t, the action does not apply to the referenced type.
static_assert(std::is_same_v<float &, sp::add_addrspace_t<float &, sp::AddressSpace::DEVICE>>);

// As with std::add_const_t, the action applies to the pointer, not the pointed-to type.
static_assert(std::is_same_v<float * __device, sp::add_addrspace_t<float *, sp::AddressSpace::DEVICE>>);

// These are bad because the type already has an address space.
static_assert(std::is_same_v<..., sp::add_addrspace_t<__shared float, sp::AddressSpace::DEVICE>>); // BAD!
static_assert(std::is_same_v<..., sp::add_addrspace_t<__device float, sp::AddressSpace::DEVICE>>); // BAD!
```
```c++
#include <spec/cuda/Addrspace.hpp>

/* sp::remove_addrspace_t */

// Remove the __shared address space from a float.
static_assert(std::is_same_v<float, sp::remove_addrspace_t<__shared float>>);

// Already has no address space, so this is a no-op.
static_assert(std::is_same_v<float, sp::remove_addrspace_t<float>>);

// Like std::remove_const, this does not apply to the referenced type.
static_assert(std::is_same_v<__shared const float &, sp::remove_addrspace_t<__shared const float &>>);

// The pointer itself is in the generic address space.
static_assert(std::is_same_v<__shared float *, sp::remove_addrspace_t<__shared float *>>);
```
```c++
#include <spec/cuda/Addrspace.hpp>

/* sp::remove_cva_t */

// Strips __shared.
static_assert(std::is_same_v<float, sp::remove_cva_t<__shared float>>);

// Strips __shared const.
static_assert(std::is_same_v<float, sp::remove_cva_t<__shared const float>>);

// As with std::remove_cv_t, this does not affect the referenced type.
static_assert(std::is_same_v<__shared float &, sp::remove_cva_t<__shared float &>>);
```
```c++
#include <spec/cuda/Addrspace.hpp>

/* sp::remove_cvaref_t */

// Unlike the others, this removes the reference, and so the address space of the referenced type as well.
static_assert(std::is_same_v<float, sp::remove_cvaref_t<__shared float &>>);

// Likewise.
static_assert(std::is_same_v<float, sp::remove_cvaref_t<__shared const float &>>);

// Like std::remove_cvref_t, this acts on the pointer, not the pointed-to type.
static_assert(std::is_same_v<__shared const float *, sp::remove_cvaref_t<__shared const float *>>);
```
Address space qualifiers are enabled by default with the Spectral compiler when building CUDA, unless it is in `nvcc` mode. They can be explicitly enabled (e.g: for plain C++) with `-fscale-addrspaces` and disabled with `-fno-scale-addrspaces`.
Using libSCALE with upstream Clang (or C++ without enabling address space qualifiers) is permitted. In these cases, only the generic address space is available. This means that code can be written to be address space safe, but also work where address space qualifications are unavailable.
When address space qualifications are unavailable, stripped down implementations of the following are provided:
| Feature | C++11 style | Implementation |
|---|---|---|
| Address space qualifiers | | Empty macros. |
| `__AS` | | An empty macro. |
| `sp::AddressSpace` | | Each entry in the `enum` has the same value. |
| `sp::get_addrspace_v` | `sp::get_addrspace` | Always returns `sp::AddressSpace::GENERIC`. |
| `sp::add_addrspace_t` | `sp::add_addrspace` | A no-op. |
| `sp::remove_addrspace_t` | `sp::remove_addrspace` | A no-op. |
| `sp::remove_cva_t` | `sp::remove_cva` | Acts like `std::remove_cv_t`. |
| `sp::remove_cvaref_t` | `sp::remove_cvaref` | Acts like `std::remove_cvref_t`. |
| `addrspace_cast` | | A no-op that enforces input and output types to be the same. |
Without address space qualifications, any attempt to overload based on address spaces will fail because there is only the generic address space and so the different signatures would in fact be the same. Also, the compiler will not warn or error about inconsistent address spaces without address space qualifications.
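A sketch of qualifier-agnostic code that relies on these fallbacks (the trait name is hypothetical): because the stripped-down `sp::remove_cvaref_t` behaves like `std::remove_cvref_t`, it gives the same answer whether or not `-fscale-addrspaces` is in effect.

```c++
#include <spec/cuda/Addrspace.hpp>
#include <type_traits>

// True for float however it is qualified, with or without address spaces.
template <typename T>
inline constexpr bool isFloatLike = std::is_same_v<float, sp::remove_cvaref_t<T>>;

static_assert(isFloatLike<const float &>);
#ifdef __SCALE_ADDRSPACES__
static_assert(isFloatLike<__device const float &>); // Only meaningful when qualifiers exist.
#endif
```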
The following address spaces are defined to have pointers of the same size:

- the generic address space,
- `__flat`, and
- `__device`.

Pointers of these types are considered equivalent for type-based alias analysis, provided the pointees are equivalent for type-based alias analysis.
Note that functions that take address space qualified arguments have different mangled names from those with generic address space arguments. Thus care must be taken to ensure that externally defined functions for which ABI interoperability is required have no address space qualified arguments.
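For example, a sketch with hypothetical names: `launchSaxpy` is defined in a translation unit built by another compiler, so it is declared with generic (unqualified) pointers to keep the mangled name identical on both sides, and a thin qualified wrapper restores address space safety for callers built with qualifiers enabled.

```c++
// External definition; must be declared without address space qualifiers.
void launchSaxpy(float *y, const float *x, float a, int n);

// Qualified convenience overload; strips the qualifiers only at the ABI boundary.
inline void launchSaxpy(__device float *y, const __device float *x, float a, int n) {
    launchSaxpy((float *)y, (const float *)x, a, n);
}
```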
The `-fscale-addrspaces` flag exists so that C++ code can interoperate with address space qualifications without having to build empty device code objects or selectively be marked with `-fcuda-host-only`. The `-fno-scale-addrspaces` flag exists so that our compiler can build conventional CUDA code in the (hopefully rare) case where it is not compatible with the enhanced address space qualifications.
The stripped-down (essentially no-op) address space features are implemented for when the enhanced address space qualifiers are not available, so that the same code can be written for both the case where they are available and the case where they are not.
Consider the following that might appear in a header file:
```c++
#include <spec/cuda/Addrspace.hpp>

struct KernelArguments {
    __device float *dst;
    __device const float *lhs;
    __device const float *rhs;
    int size;
};

KernelArguments allocateKernelArguments(sp::Device &device, int size);
void launchSomeKernel(KernelArguments args);
```
This code might be expected to be included from code built with the Spectral compiler and code built with another compiler. In this case, address space safety is provided for the code built with the Spectral compiler (via the use of the __device
address space qualifications), but the header and its functions are not prohibited from use by code built with other compilers (or otherwise without address space qualifiers).
The type-based alias analysis rules defined above are chosen so that a type like `KernelArguments` can be interchanged with an equivalent type without the address space qualifiers in a well-defined way. The requirement that `__device` pointers are the same size as generic pointers exists to facilitate this. Likewise for `__flat` pointers, since a generic pointer is required (because of the semantics of conventional CUDA) to be the same size for host and device, and a `__flat` pointer is intended to be able to represent either.
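As a sketch of the interchange this enables (the struct name is hypothetical, and `KernelArguments` refers to the header example above): a translation unit built without address space qualifiers can declare an equivalent struct, and the layouts match because `__device` pointers are the same size as generic pointers.

```c++
// Seen by code built without address space qualifiers; layout-compatible with KernelArguments.
struct PlainKernelArguments {
    float *dst;
    const float *lhs;
    const float *rhs;
    int size;
};

static_assert(sizeof(PlainKernelArguments) == sizeof(KernelArguments));
```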
Other address spaces, such as `__shared`, are not guaranteed to be bitwise compatible because it is useful to allow such pointers to be shorter than generic pointers.