Address-space qualifiers

Introduction

In conventional NVIDIA® CUDA®, the type system does not distinguish between pointers to GPU memory and pointers to host memory. This allows you to compile nonsensical code like:

__global__ void myKernel(float *someFloats) {
    // ...
}

int main() {
    float *someFloats = new float[1024];

    // Passing host buffer to kernel, but this will only fail at runtime.
    myKernel<<<1024, 1>>>(someFloats);

    delete []someFloats;
}

When using libSCALE and the Spectral LLVM Compiler, device pointers are a distinct type from host pointers. Device allocations yield types such as __device float *, allowing such errors to be caught at compile time:

__global__ void myKernel(__device float *someFloats) {
    // ...
}

int main() {
    float *someFloats = new float[1024];

    // Compile error: `__device float *` is not compatible with `float *`.
    myKernel<<<1024, 1>>>(someFloats);

    delete []someFloats;
}

Example

A complete version of this example using SCALE would look something like this:

__global__ void myKernel(__device float *someFloats) {
    // ...
}

int main() {
    // Object representing GPU 0.
    auto& gpu = sp::Device::get(0);

    // Object representing a cudaStream.
    sp::Stream s = gpu.createStream();

    // Allocate 1024 floats on GPU 0. Any CUDA error is thrown as a C++ exception.
    sp::UniquePtr<__device float> buffer = gpu.allocateMemory<float>(1024);

    // Launch the kernel on the stream, and await it.
    myKernel<<<1024, 1, 0, s>>>(buffer.get());
    s.synchronise();

    // The float buffer and cudaStream are automatically deallocated when their objects go out of scope (RAII).
}

Address spaces

C++ and conventional CUDA provide only the “generic” address space, represented as T * - for example, float *. In conventional CUDA, the generic address space represents all possible pointers.

With explicit address space qualifiers, the following address-spaces are provided:

Address space Example Description
__flat __flat int * GPU or CPU memory.
Generic int * Semantics differ from C++ and conventional CUDA as described in the generic address space subsection.
__device __device int * GPU global memory.
__shared __shared int * GPU shared memory.
__constant __constant int * GPU constant memory.
__local __local int * GPU stack memory (rarely useful). This uses the same physical memory as __device unless promoted to registers, but without memory coalescing.

As an example, __shared float * is a pointer to a float in GPU shared memory. Likewise, __shared float & is a reference to a float in GPU shared memory. Address space qualifiers are treated as qualifiers (like const and volatile qualifiers), so it is permissible to have an address space qualified type as a template parameter, for example:

template <typename A, typename B>
__device__ void callee(A *a, B *b) {
    // ...
}

__device__ void caller(__device int *a, const int *b) {
    callee(a, b); // A is `__device int` and B is `const int`.
}

The generic address space

With explicit address space qualifiers, the generic address space refers to any memory that may be dereferenced in the context in which it is used. In a __global__ or __device__ function, this is __device, __shared, __constant, or __local: i.e: any GPU memory. In a host function (__host__ or unannotated), this is just host memory. A __host__ __device__ (or constexpr) function is considered for this purpose to be a host function when building for the host and a device function when building for the device. It is considered for the purposes of address space rules only if it is semantically instantiated, i.e: if the function is not inline (either explicitly, or implicitly by being constexpr, templated, or defined in a class body) or if it’s ODR-used or explicitly instantiated.

Rationale

This means that the generic address space can be used in host functions to limit a type to only accepting host pointers without the need to qualify pointers and references in device functions that are intended to accept pointers and references to any GPU memory (which would become cumbersome for many OOP patterns). Suitable implicit casts to the generic address space are provided as described in the address space casts section.

The inline/ODR-used rule for __host__ __device__ functions is intended to make it possible to write code that works on both the host and the device, even if it performs address space casts in device code that would be invalid in host code. The requirement that such a function be inline (even if implicitly) is because otherwise the function would be code-generated, and could be used by other translation units’ host code. Template-dependence is not required because some constructs could access different address spaces, depending on whether the translation unit is being built for host or device, even if the using function is not a template (or a method of a templated class).

Inferred address spaces

The address space of a variable is inferred from its declaration context and from whether host or device code is being compiled. The following table lists declarations and the inferred address space corresponding to each.

Declaration Context Host address space Device address space
int i; Global variable Generic
int i; Local variable Generic Generic
__local int i; Local variable __local
constexpr int i; Anywhere Generic Generic
__device__ int i; Anywhere __device __device
__shared__ int i; Anywhere __shared __shared
__constant__ int i; Anywhere __constant __constant

Local variables (i.e: those which have automatic storage duration in C++) are in the generic address space by default, rather than __local. To place a local variable explicitly in the __local address space, declare it as such, for example:

__local int value = 0;

Variables that are __device__, __shared__, and __constant__ have the obvious address spaces. Other valid variables have generic address space.

Rationale

The local variable semantics are chosen because many useful constructs in C++ (for example, those including decltype) rely on obtaining the type of local variables and applying computations, transformations, or comparisons on them. Such constructs are likely to break if local variables are unexpectedly qualified. Consider, for example:

T value = 0;
if constexpr (std::is_same_v<T, float>) {
    // ...
}
else {
    // ...
}

It is not normally required to provide an overload for local variables. They should be promoted to registers in efficient GPU code and no shared semantics are possible. It is also not possible for a local memory pointer to meaningfully appear in host code. The __local address space qualifier could, however, be useful to distinguish a local variable from a __device variable so that an implementation can be chosen that is amenable to being placed in registers (e.g: using chained ternary operators to select between multiple values rather than an array offset).

The rationale for constexpr’s address space is much the same as for local variables: templated code depending on constexpr values and their types should continue to work correctly.

Address space casts

In host contexts, no implicit address space casts are permitted except to the __flat address space. The following implicit address space casts are legal in device contexts:

Additionally, the following explicit address space casts are permitted in both host and device contexts:

An explicit cast can be performed with a C-style cast or with addrspace_cast, the latter of which operates analogously to const_cast and static_cast. These casts can be performed on pointers or references.

It is undefined behaviour to indirectly perform an address space cast not listed above (such as casting from __shared to generic or __flat and then from generic or __flat to __device).

Example

__device__ void deviceFn(int *genericPtr, __device int *devicePtr) {
    int *a = devicePtr; // Implicit "up" cast.
    __device int *b = (__device int *)genericPtr; // Explicit "down" cast.
    __device int *c = genericPtr; // BAD! Needs an explicit cast.
    __shared int *d = (__shared int *)genericPtr; // BAD! Incompatible address spaces.
}

void hostFn(int *genericPtr, __device int *devicePtr) {
    int *a = devicePtr; // BAD! Implicit cast from __device to generic not allowed in a host function.
    int *b = (int *)devicePtr; // Explicit cast from __device to generic is permitted even in a host function.
    __device int *c = (__device int *)genericPtr; // Explicit "down" cast.
}

Rationale

Explicit casts to the generic address space in host contexts permit interoperability with other CUDA libraries by casting away address-space qualifiers at the API boundary, and casting them back on again for pointers returned by the other library.

The implicit casts to the generic address space mean that functions do not have to be templated to accept pointers from anywhere on the same device as their context while also not requiring address space casts. Provided the methods and functions are inlined, this should not result in a loss of the ability to use address space specific instructions, as the optimizer can infer their address space based on the address space of the input to the address space cast.
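For instance, the interoperability pattern described above might look something like this (a hypothetical sketch: someLibraryScale is an illustrative external function, and this assumes the Spectral compiler with address space qualifiers enabled):

```cuda
// Hypothetical host-side wrappers around a library built without
// address space qualifiers.
__device float *allocateFloats(size_t count) {
    void *raw = nullptr;
    cudaMalloc(&raw, count * sizeof(float)); // Conventional API yields an unqualified pointer.
    return (__device float *)raw;            // Cast the qualifier back on at the boundary.
}

void scaleBuffer(__device float *data, int n) {
    // someLibraryScale is an illustrative external function taking a plain float *.
    someLibraryScale((float *)data, n);      // Explicitly cast the qualifier away in host code.
}
```

Callers of these wrappers keep full address space safety; only the boundary functions carry the explicit casts.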

Address space accessibility

Values may be accessed (read from and written to) only when their address space can be implicitly converted to the generic address space. Thus, for example, a __device int & may be accessed from a device function, but not a host function.

It is undefined behaviour to perform an access via a generic address space pointer or reference where an explicit cast was required to obtain it in the generic address space, except where the cast was from __flat and the pointed-to object is, at runtime, actually in memory accessible from the accessing context.

Example

__device__ void deviceFn(const int &a, const __device int &b, const __shared int &c, const __flat int &d) {
    int OK = a + b + c; // All these are accessible on device.
    int BAD = d; // Flat can't be accessed because it could be a host pointer.
}
void hostFn(const int &a, const __device int &b, const __shared int &c, const __flat int &d) {
    int OK = a; // Only generic is accessible on the host.
    int BAD = b + c + d; // None of these can be accessed because they are or could be on the device.
}
void hostFn(const __flat int *ptr) {
    int MAYBE = *(int *)ptr; // This is OK if and only if the thing pointed to by ptr is actually on the host.
}

Rationale

These rules forbid accesses that would (or, in the case of directly accessing __flat, could) involve attempting to access memory that is not accessible from the side of the accessing function, for example a __device int * (even if cast to the generic address space) on the host. The explicit cast from __flat to generic (or another address space), followed by an access through that generic pointer, is permitted because it can be valid. It is up to the programmer to ensure that it actually is valid at runtime; the explicit cast exists so that this responsibility is assumed deliberately rather than accidentally.

Function overloading

This language extension also allows you to use overload resolution to change the behaviour of functions based on the address-space of the arguments.

Function arguments

A function may be overloaded based on the address space of its arguments. If an overload is not provided for the given address space of an argument, but there is an overload for the generic address space, the generic address space overload is used.

Example

__device__ int fn(int &i) {
    // This function is called if the address space is not __device or __shared.
    return i + 1;
}
__device__ int fn(__device int &i) {
    return i + 2;
}
__device__ int fn(__shared int &i) {
    return i + 3;
}

__constant__ int c = 42;

__global__ void kernel(__device int *dst1, __device int *dst2, __device int *dst3) {
    __shared__ int sharedInt;
    if (threadIdx.x == 0) {
        sharedInt = *dst3;
    }
    *dst1 = fn(c); // *dst1 = 42 + 1;
    *dst2 = fn(*dst2); // *dst2 += 2;
    *dst3 = fn(sharedInt); // *dst3 += 3;
}

Methods

Methods can be address space qualified, just as they can be const qualified. The address space qualification of a method applies to the address space of the object it is called on (i.e: the address space of the thing pointed to by this).

Address space qualification may only be applied to __device__ methods, and __host__ __device__ and constexpr methods compiled for the device. The qualifier may not be __flat. Address space qualification may not be applied to static methods or non-member functions. It is permitted to qualify (for example, by a template) a __host__ __device__ or constexpr method that is built for the host with the generic address space.

Example

struct S {
    __device__ int fn() const {
        return 1;
    }
    __device__ int fn() const __shared {
        return 2;
    }
};

__device__ S deviceObj;
__shared__ S sharedObj;

__global__ void kernel(__device int *dst1, __device int *dst2) {
    *dst1 = deviceObj.fn(); // *dst1 = 1;
    *dst2 = sharedObj.fn(); // *dst2 = 2;
}

Rationale

The rules for what address space qualifications are allowed on methods are governed by what would enable this to be dereferenced. Thus address space qualification is allowed only on non-static methods built for the device and __flat is not permitted. Qualification of non-static __host__ __device__ and constexpr methods built for the host with generic address space is permitted so that the same definition can be template address space qualified for the device and used for the host.

Templated address spaces

Address spaces may be template parameters, and may participate in template argument deduction.

Address space template arguments

Example

template <typename T, sp::AddressSpace AS>
__device__ int fn(__AS(AS) T &val) {
    // T is deduced as the generic address space type and AS as its address space. AS should only be compared for
    // equality with the members of sp::AddressSpace, since their numerical values and ordering are not defined.
    return AS == sp::AddressSpace::SHARED ? 1 : 0;
}

__device__ int deviceInt;
__shared__ int sharedInt;

__global__ void kernel(__device int *dst1, __device int *dst2) {
    *dst1 = fn(deviceInt); // *dst1 = 0;
    *dst2 = fn(sharedInt); // *dst2 = 1;
}

Method templated address spaces

The address space of a method can be a template parameter. This can be done using the __AS macro taking an sp::AddressSpace template argument in place of a specific address space qualifier. This is useful for methods that need to know which address space their object is in, such as those that call further methods overloaded by object address space.

Example

struct S {
    template <sp::AddressSpace AS>
    __device__ int fn() const __AS(AS) {
        return AS == sp::AddressSpace::SHARED ? 1 : 0;
    }
};

__device__ S deviceObj;
__shared__ S sharedObj;

__global__ void kernel(__device int *dst1, __device int *dst2) {
    *dst1 = deviceObj.fn(); // *dst1 = 0;
    *dst2 = sharedObj.fn(); // *dst2 = 1;
}

Address space utilities

The address space utilities are in the <spec/cuda/Addrspace.hpp> header. Whether address space qualifiers are enabled during compilation can be detected with the compiler-predefined __SCALE_ADDRSPACES__ macro.

The sp::AddressSpace enumeration

The sp::AddressSpace enumeration provides a way to refer to different address spaces. It is used by many of the other address space utilities. It contains the following entries:

enum entry Address space
FLAT __flat
GENERIC Generic
DEVICE __device
SHARED __shared
CONSTANT __constant
LOCAL __local

The specific values of these entries are implementation defined, and subject to change. Users should therefore not rely on specific values or their ordering.

The __AS macro

The __AS macro can be used as an address space qualifier. It is a function-like macro that takes an entry from sp::AddressSpace as an argument. For example: __AS(sp::AddressSpace::DEVICE) int * is equivalent to __device int *.

Address space type traits

Type traits are provided that are analogous to the standard C++ type traits for CV qualification. In the following table, T is a type template parameter and AS is a non-type template parameter of type sp::AddressSpace.

C++17 Style C++11 style Description
sp::get_addrspace_v<T> sp::get_addrspace<T>::value Get the address space of T. Returns an sp::AddressSpace. This is analogous to std::is_const_v.
sp::add_addrspace_t<T, AS> sp::add_addrspace<T, AS>::type Add an address space to T (which must be in the generic address space). This is analogous to std::add_const_t.
sp::remove_addrspace_t<T> sp::remove_addrspace<T>::type Remove the address space from T, leaving a generic address space type. This is analogous to std::remove_const_t.
sp::remove_cva_t<T> sp::remove_cva<T>::type Remove const, volatile, and address space qualifiers. This is analogous to std::remove_cv_t.
sp::remove_cvaref_t<T> sp::remove_cvaref<T>::type Remove references and const, volatile, and address space qualifiers. This is analogous to std::remove_cvref_t.

Examples

#include <spec/cuda/Addrspace.hpp>

/* sp::get_addrspace_v */
// Gets the address space of a type.
static_assert(sp::get_addrspace_v<__device float> == sp::AddressSpace::DEVICE);

// Like std::is_const_v, sp::get_addrspace_v does not propagate through references.
static_assert(sp::get_addrspace_v<__shared float &> == sp::AddressSpace::GENERIC);

// This has no address space, so it's in the generic address space.
static_assert(sp::get_addrspace_v<float> == sp::AddressSpace::GENERIC);

// The pointer is in the generic address space even though the pointed to type is not.
static_assert(sp::get_addrspace_v<__shared const float *> == sp::AddressSpace::GENERIC);
#include <spec/cuda/Addrspace.hpp>

/* sp::add_addrspace_t */
// Add the __device address space to a type.
static_assert(std::is_same_v<__device float, sp::add_addrspace_t<float, sp::AddressSpace::DEVICE>>);

// As with std::add_const_t, the action does not apply to the referenced type.
static_assert(std::is_same_v<float &, sp::add_addrspace_t<float &, sp::AddressSpace::DEVICE>>);

// As with std::add_const_t, the action applies to the pointer, not the pointed to type.
static_assert(std::is_same_v<float * __device, sp::add_addrspace_t<float *, sp::AddressSpace::DEVICE>>);

// These are bad because the type already has an address space.
static_assert(std::is_same_v<..., sp::add_addrspace_t<__shared float, sp::AddressSpace::DEVICE>>); // BAD!
static_assert(std::is_same_v<..., sp::add_addrspace_t<__device float, sp::AddressSpace::DEVICE>>); // BAD!
#include <spec/cuda/Addrspace.hpp>

/* sp::remove_addrspace_t */
// Remove the __shared address space from a float.
static_assert(std::is_same_v<float, sp::remove_addrspace_t<__shared float>>);

// Already has no address space, so this is a no-op.
static_assert(std::is_same_v<float, sp::remove_addrspace_t<float>>);

// Like std::remove_const, this does not apply to the referenced type.
static_assert(std::is_same_v<__shared const float &, sp::remove_addrspace_t<__shared const float &>>);

// The pointer itself is in the generic address space.
static_assert(std::is_same_v<__shared float *, sp::remove_addrspace_t<__shared float *>>);
#include <spec/cuda/Addrspace.hpp>

/* sp::remove_cva_t */
// Strips __shared.
static_assert(std::is_same_v<float, sp::remove_cva_t<__shared float>>);

// Strips __shared const.
static_assert(std::is_same_v<float, sp::remove_cva_t<__shared const float>>);

// As with std::remove_cv_t, this does not affect the referenced type.
static_assert(std::is_same_v<__shared float &, sp::remove_cva_t<__shared float &>>);
#include <spec/cuda/Addrspace.hpp>

/* sp::remove_cvaref_t */
// Unlike the others, this removes the reference, and so the address space of the referenced type as well.
static_assert(std::is_same_v<float, sp::remove_cvaref_t<__shared float &>>);

// Likewise.
static_assert(std::is_same_v<float, sp::remove_cvaref_t<__shared const float &>>);

// Like std::remove_cvref_t, this acts on the pointer, not the pointed-to type.
static_assert(std::is_same_v<__shared const float *, sp::remove_cvaref_t<__shared const float *>>);

Compatibility

Address space qualifiers are enabled by default with the Spectral compiler when building CUDA unless it is in nvcc mode. The feature can be explicitly enabled (e.g: for plain C++) with -fscale-addrspaces and disabled with -fno-scale-addrspaces.

Using libSCALE with upstream Clang (or C++ without enabling address space qualifiers) is permitted. In these cases, only the generic address space is available. This means that code can be written to be address space safe, but also work where address space qualifications are unavailable.

When address space qualifications are unavailable, stripped down implementations of the following are provided:

Feature C++11 style Implementation
Address space qualifiers Empty macros.
__AS An empty macro.
sp::AddressSpace Each entry in the enum has the same value.
sp::get_addrspace_v sp::get_addrspace Always returns sp::AddressSpace::GENERIC.
sp::add_addrspace_t sp::add_addrspace A no-op.
sp::remove_addrspace_t sp::remove_addrspace A no-op.
sp::remove_cva_t sp::remove_cva Acts like std::remove_cv_t.
sp::remove_cvaref_t sp::remove_cvaref Acts like std::remove_cvref_t.
addrspace_cast A no-op that enforces input and output types to be the same.
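A minimal sketch of what these fallbacks could look like (hypothetical; the actual definitions in <spec/cuda/Addrspace.hpp> may differ):

```cpp
#include <type_traits>

// Hypothetical sketch of the stripped-down fallbacks. Namespace and names
// follow this document; the real header may differ.
namespace sp {
// Every entry has the same value, so address spaces are indistinguishable.
enum class AddressSpace { FLAT, GENERIC = FLAT, DEVICE = FLAT, SHARED = FLAT, CONSTANT = FLAT, LOCAL = FLAT };

// Always reports the generic address space.
template <typename T>
struct get_addrspace { static constexpr AddressSpace value = AddressSpace::GENERIC; };
template <typename T>
inline constexpr AddressSpace get_addrspace_v = get_addrspace<T>::value;

// No-ops: there is only one address space.
template <typename T, AddressSpace AS>
struct add_addrspace { using type = T; };
template <typename T, AddressSpace AS>
using add_addrspace_t = typename add_addrspace<T, AS>::type;

template <typename T>
struct remove_addrspace { using type = T; };
template <typename T>
using remove_addrspace_t = typename remove_addrspace<T>::type;

// Collapses to the standard CV trait.
template <typename T>
using remove_cva_t = std::remove_cv_t<T>;
} // namespace sp

// The __AS qualifier macro expands to nothing.
#define __AS(AS)
```

With definitions like these, address-space-generic code compiles unchanged, and every address space query collapses to the generic answer.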

Without address space qualifications, any attempt to overload based on address spaces will fail because there is only the generic address space and so the different signatures would in fact be the same. Also, the compiler will not warn or error about inconsistent address spaces without address space qualifications.

The following address spaces are defined to have pointers of the same size: __flat, the generic address space, and __device.

Pointers of these types are considered equivalent for type-based alias analysis provided the pointees are equivalent for type-based alias analysis.

Note that functions that take address space qualified arguments have different mangled names from those with generic address space arguments. Thus care must be taken to ensure externally defined functions for which ABI interoperability is required have no arguments with address space qualification.

Rationale

The -fscale-addrspaces flag exists so that C++ code can interoperate with address space qualifications without having to build empty device code objects or selectively be marked with -fcuda-host-only. The -fno-scale-addrspaces flag exists so that our compiler can build conventional CUDA code in the (hopefully rare) case where it is not compatible with the enhanced address space qualifications.

The stripped-down (essentially no-op) address space features are implemented for when enhanced address space qualifiers are not available so that the same code can be written for both the case where it is available, and where it is not.

Consider the following that might appear in a header file:

#include <spec/cuda/Addrspace.hpp>

struct KernelArguments {
    __device float *dst;
    __device const float *lhs;
    __device const float *rhs;
    int size;
};

KernelArguments allocateKernelArguments(sp::Device &device, int size);
void launchSomeKernel(KernelArguments args);

This code might be expected to be included from code built with the Spectral compiler and code built with another compiler. In this case, address space safety is provided for the code built with the Spectral compiler (via the use of the __device address space qualifications), but the header and its functions are not prohibited from use by code built with other compilers (or otherwise without address space qualifiers).

The type-based alias analysis rules defined above are chosen so that a type like KernelArguments can be interchanged with an equivalent type without the address space qualifiers in a well-defined way. The requirement that __device pointers are the same size as generic pointers exists to facilitate this. Likewise for __flat pointers, since a generic pointer is required (because of the semantics of conventional CUDA) to be the same size for host and device, and a __flat pointer is intended to be able to represent either.

Other address spaces, such as __shared, are not guaranteed to be bitwise compatible because it is useful to allow such pointers to be shorter than generic pointers.