This page describes some of the optimizations added to LLVM and the improvements made to its code generation.
This page makes use of the following terms:
Term | Description |
---|---|
Backend | The code generation mechanism for target-specific code. “A backend” refers to a specific target backend. “The backend” can also refer to the target-independent part of this mechanism. |
Basic block | Part of a function with only one entry point and (at most) one exit point, aside from function calls. |
Branch | An instruction that can jump to one or more basic blocks. When talking about low-level code, such as assembly, branches can sometimes refer to other jumps, such as function calls, too. |
Control flow (graph) | The structure of a function with respect to the paths through it that can be taken. This can be represented as a graph called a “control flow graph” (CFG). |
LLVM IR | LLVM’s intermediate representation. |
ptxas | The NVIDIA-supplied tool to convert PTX to GPU machine code. |
PTX | The lowest-level documented programming language for NVIDIA GPUs. It is as close to assembly as NVIDIA documents. |
SDNode | Represents an instruction in the backend. |
Stack | This is where a function’s local variables are stored. |
Uniform | The same for each thread in a warp. Uniform values and control flow can sometimes be more efficient. |
__builtin_assume

Improve LLVM’s ability to simplify control flow using __builtin_assume by eliminating branches that would violate the assumption (even if there exist branches that would not violate the assumption).
Use an existing analysis in LLVM called global value numbering to identify basic blocks that are equivalent so they can be deduplicated.
Infer the address spaces of __shared__ memory and __device__ memory. This allows better optimization of memory operations.

Inline memcpy and memset on targets where the library call is expensive.

Convert local variables to LLVM IR values (that can be represented using the processor’s registers) even if they are small arrays that are accessed using computed offsets. This is especially useful on GPUs, where local memory accesses have particularly high latency.
Simplify switch statements using __builtin_assume

Simplify switch statements (at the LLVM IR level) whose switch value is subject to a __builtin_assume, such as:
```cpp
__builtin_assume(i < 2);
switch (i) {
case 0: return a;
case 1: return b;
case 2: return c;
case 3: return d;
}
```
This can be useful, for example, where the __builtin_assume is generated from template arguments or enabled by an if constexpr.
Jump tables (such as might be generated from a switch
statement) track whether the value they switch on is uniform.
Try to convert wide integer (e.g. 256-bit) bitwise operations to SIMD instructions. This works well in conjunction with extended integer types.
This section describes the improvements made to the NVPTX backend that’s used to generate CUDA code for NVIDIA GPUs.
Alter the optimization pipeline for NVPTX to include passes, such as address space inference, earlier or at places where they’re likely to create further optimization opportunities.
In many cases, LLVM emits a better sequence of instructions than the library function (which also has call overhead). Such as: BFE.

Use set or slct where beneficial, but don’t use set or slct when the predicates are needed anyway.

Improve the handling of virtual __device__ functions for a class.

Use set instead of setp or selp.

Use mov instructions to unpack vectors of values rather than bitwise arithmetic.

We added instruction throughput information to improve instruction selection.
Evaluate isspacep in IR where the address space is known at compile time.

Optimize __shfl_sync(m, x, offset % size, size) and friends to get rid of the %.
We added patterns to match GPU-specific instructions. Such as:

- bfe, ctlz and bfind patterns.
- The prmt instruction.
- __fns.

ptxas’s patterns

Some GPU instructions are able to perform secondary operations, such as negating their arguments. We added optimizations to allow ptxas to better exploit this. Such as:
- Avoiding sequences that cause ptxas to generate poor code.
- Lowering setcc using set then add; sometimes ptxas can fuse the add into other GPU instructions.

We added patterns to use instructions that, on the GPU, are faster. Such as:

- n + n to represent n * 2 and n << 1.
- n + n + n to represent n * 3.

We refactored the NVPTX instruction description to make LLVM’s existing backend optimization infrastructure apply more widely and to make NVPTX’s backend code apply more widely too.
The backend can emit the bra.uni
and brx.idx.uni
instructions. This gives ptxas
more opportunity to generate
the more efficient non-divergent branch instructions. To support this,
the SDNode
divergence analysis is enabled for the NVPTX
backend.