This page describes some of the optimizations added to LLVM and the improvements made to its code generation.
This page makes use of the following terms:
Term | Description |
---|---|
Backend | The code generation mechanism for target-specific code. “A backend” refers to a specific target backend. “The backend” can also refer to the target-independent part of this mechanism. |
Basic block | Part of a function with only one entry point and (at most) one exit point, aside from function calls. |
Branch | An instruction that can jump to one or more basic blocks. When talking about low-level code, such as assembly, branches can sometimes refer to other jumps, such as function calls, too. |
Control flow (graph) | The structure of a function with respect to the paths through it that can be taken. This can be represented as a graph called a “control flow graph” (CFG). |
LLVM IR | LLVM’s intermediate representation. |
ptxas | The NVIDIA-supplied tool to convert PTX to GPU machine code. |
PTX | The lowest-level documented programming language for NVIDIA GPUs. It is as close to assembly as NVIDIA documents. |
SDNode | Represents an instruction in the backend. |
Stack | This is where a function’s local variables are stored. |
Uniform | The same for each thread in a warp. Uniform values and control flow can sometimes be more efficient. |
__builtin_assume

Improve LLVM’s ability to simplify control flow using __builtin_assume by eliminating branches that would violate the assumption (even if there exist branches that would not violate the assumption).
Use an existing analysis in LLVM called global value numbering to identify basic blocks that are equivalent so they can be deduplicated.
Infer the address spaces of __shared__ memory and __device__ memory. This allows better optimization of memory operations.

Inline memcpy and memset on targets where the library call is expensive.

Convert local variables to LLVM IR values (that can be represented using the processor’s registers) even if they are small arrays that are accessed using computed offsets. This is especially useful on GPUs, where local memory accesses have particularly high latency.
Simplify switch statements using __builtin_assume

Simplify switch statements (at the LLVM IR level) whose switch value is subject to a __builtin_assume, such as:
```cpp
__builtin_assume(i < 2);
switch (i) {
case 0: return a;
case 1: return b;
case 2: return c;
case 3: return d;
}
```
This can be useful, for example, where the __builtin_assume is generated from template arguments or enabled by an if constexpr.
Jump tables (such as might be generated from a switch
statement) track whether the value they switch on is uniform.
Try to convert wide integer (e.g. 256-bit) bitwise operations to SIMD instructions. This works well in conjunction with extended integer types.
This section describes the improvements made to the NVPTX backend that’s used to generate CUDA code for NVIDIA GPUs.
Alter the optimization pipeline for NVPTX to include passes, such as address space inference, earlier or at places where they’re likely to create further optimization opportunities.
In many cases, LLVM emits a better sequence of instructions than the library function (which also has call overhead). Such as: BFE.

Use set or slct where beneficial, but don’t use set or slct when the predicates are needed anyway.

Improve the handling of virtual __device__ functions for a class.

Use set instead of setp or selp.

Use mov instructions to unpack vectors of values rather than bitwise arithmetic.

We added instruction throughput information to improve instruction selection.
Evaluate isspacep in IR where the address space is known at compile time.

Optimize __shfl_sync(m, x, offset % size, size) and friends to get rid of the %.
We added patterns to match GPU-specific instructions. Such as:

- bfe, ctlz and bfind patterns.
- The prmt instruction.
- __fns.

ptxas’s patterns

Some GPU instructions are able to perform secondary operations, such as negating their arguments. We added optimizations to allow ptxas to better exploit this. Such as:
- Avoiding sequences that cause ptxas to generate poor code.
- Lowering setcc using set then add; sometimes ptxas can fuse the add into other GPU instructions.

We added patterns to use instructions that, on the GPU, are faster. Such as:

- n + n to represent n * 2 and n << 1.
- n + n + n to represent n * 3.

We refactored the NVPTX instruction description to make LLVM’s existing backend optimization infrastructure apply more widely and to make NVPTX’s backend code apply more widely too.
The backend can emit the bra.uni
and brx.idx.uni
instructions. This gives ptxas
more opportunity to generate
the more efficient non-divergent branch instructions. To support this,
the SDNode
divergence analysis is enabled for the NVPTX
backend.