Spectral LLVM optimizations

Introduction

This page describes some of the optimizations added to LLVM and the improvements made to its code generation.

Glossary

This page makes use of the following terms:

Term Description
Backend The code generation mechanism for target-specific code. “A backend” refers to a specific target backend. “The backend” can also refer to the target-independent part of this mechanism.
Basic block Part of a function with only one entry point and (at most) one exit point, aside from function calls.
Branch An instruction that can jump to one or more basic blocks. When talking about low-level code, such as assembly, “branch” can sometimes refer to other jumps, such as function calls, too.
Control flow (graph) The structure of a function with respect to the paths through it that can be taken. This can be represented as a graph called a “control flow graph” (CFG).
LLVM IR LLVM’s intermediate representation.
ptxas The NVIDIA-supplied tool to convert PTX to GPU machine code.
PTX The lowest-level documented programming language for NVIDIA GPUs. It is as close to assembly as NVIDIA documents.
SDNode Represents an instruction in the backend.
Stack This is where a function’s local variables are stored.
Uniform The same for each thread in a warp. Uniform values and control flow can sometimes be more efficient.

Target Independent

Control flow simplification using __builtin_assume

Improve LLVM’s ability to simplify control flow using __builtin_assume by eliminating branches that would violate the assumption (even if there exist branches that would not violate the assumption).
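
As a sketch of the kind of code this helps with (the function and its names are illustrative, not taken from the compiler's test cases), the first branch below contradicts the assumption and can be deleted even though the rest of the function remains:

int lookup(int i, const int *table) {
  __builtin_assume(i >= 0 && i < 16);
  if (i < 0)          // taking this branch would violate the assumption, so it can be removed
    return 0;
  return table[i];
}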

Global-Value-Numbering based basic block deduplication

Use an existing analysis in LLVM called global value numbering to identify basic blocks that are equivalent so they can be deduplicated.
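
As an illustrative sketch (hypothetical code, not taken from the pass's tests), the first and last cases below do identical work, so global value numbering can show their basic blocks are equivalent and one of them can be removed:

void apply_mode(int mode, int i, float a, float b, float *out) {
  switch (mode) {
    case 0: out[i] = a + b; break;
    case 1: out[i] = a - b; break;
    case 2: out[i] = a + b; break;  // equivalent to case 0, so the blocks can be merged
  }
}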

Memory analysis improvements

Monolithic Load-Store of Aggregates

This pass tries to convert local variables to LLVM IR values (that can be represented using the processor’s registers) even if they are small arrays that are accessed using computed offsets. This is especially useful on GPUs where local memory accesses are particularly high latency.
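
A minimal sketch of the kind of code this targets (a hypothetical example): the array is small and indexed with a computed offset, yet it can still be kept in registers rather than spilled to local memory:

float blend(const float *in, int sel) {
  float tmp[4];                  // small local array (an aggregate)
  for (int k = 0; k < 4; ++k)
    tmp[k] = in[k] * 0.25f;
  return tmp[sel & 3];           // computed offset into the array
}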

Peephole optimizations

Simplification of switch statements using __builtin_assume

Simplify switch statements (at LLVM IR level) whose switch value is subject to a __builtin_assume such as:

__builtin_assume(i < 2);
switch (i) {
  case 0: return a;
  case 1: return b;
  case 2: return c;
  case 3: return d;
}

This can be useful, for example, where the __builtin_assume is generated from template arguments or enabled by an if constexpr.
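
A sketch of the template-argument case (the function and parameter names are hypothetical): each instantiation fixes the assumption, so, for example, Limit == 2 lets the last two cases be removed:

template <int Limit>
int pick(int i, int a, int b, int c, int d) {
  __builtin_assume(i < Limit);
  switch (i) {
    case 0: return a;
    case 1: return b;
    case 2: return c;
    case 3: return d;
  }
  return 0;
}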

Speculative execution

Uniformity of jump table indices

Jump tables (such as those generated from a switch statement) now track whether the value they switch on is uniform.

Wide integer vectorization

Try to convert wide integer (e.g. 256-bit) bitwise operations to SIMD instructions. This works well in conjunction with extended integer types.
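
A sketch, assuming Clang’s _BitInt extended integer type is what is meant by extended integer types: the 256-bit bitwise operations below are candidates for lowering to SIMD instructions:

typedef unsigned _BitInt(256) u256;

u256 masked_merge(u256 a, u256 b, u256 m) {
  return (a & m) | (b & ~m);     // wide bitwise operations on a 256-bit integer
}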

NVPTX

This section describes the improvements made to the NVPTX backend, which is used when compiling CUDA code for NVIDIA GPUs.

Addition of helpful optimization passes

Alter the NVPTX optimization pipeline so that passes such as address space inference run earlier, or at points where they are likely to create further optimization opportunities.
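
As an illustrative sketch of why running address space inference early helps (the kernel is hypothetical): once the compiler proves the pointer refers to shared memory, the accesses can use the shared address space directly, and later passes see the simpler form:

__global__ void copy_reversed(const float *in, float *out) {
  __shared__ float buf[256];
  float *p = buf;                        // a generic pointer into shared memory
  p[threadIdx.x] = in[threadIdx.x];      // address space inference lets these accesses
  __syncthreads();                       // target the shared address space directly
  out[threadIdx.x] = p[255 - threadIdx.x];
}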

Avoid GPU library functions

In many cases, LLVM emits a better sequence of instructions than the library function (which also has call overhead).

Such as:
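
As a hypothetical illustration (fabsf is a stand-in here, not necessarily one of the functions affected), a call like this can be replaced by the GPU’s single absolute-value instruction instead of a library routine:

__device__ float magnitude(float a, float b) {
  return fabsf(a - b);   // emitted as a single absolute-value instruction,
                         // avoiding a library call and its overhead
}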

Avoid redundancies

Such as:

Improved divergence analysis

Instruction throughput

We added instruction throughput information to improve instruction selection.

Memory optimization

Optimization of functions and intrinsics

Patterns to match GPU-specific instructions

We added patterns to match GPU-specific instructions. Such as:
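
One illustrative sketch (a bit-field extract is used here as the example of a GPU-specific instruction; it is not necessarily from the original list):

__device__ unsigned second_byte(unsigned x) {
  return (x >> 8) & 0xffu;   // a shift-and-mask pattern that can map to a single
                             // bit-field extract instruction on the GPU
}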

Patterns to match underlying GPU instructions’ capabilities and ptxas’s patterns

Some GPU instructions are able to perform secondary operations, such as negating their arguments. We added optimizations to allow ptxas to better exploit this. Such as:
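
A sketch of the kind of code where this matters (illustrative, and assuming floating-point contraction is enabled): the multiply-subtract can become a fused multiply-add whose third operand is negated by the instruction itself, rather than by a separate negation:

__device__ float mul_sub(float a, float b, float c) {
  return a * b - c;   // candidate for a fused multiply-add with a negated operand
}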

Patterns to use faster GPU instructions

We added patterns to use instructions that, on the GPU, are faster. Such as:
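
As an illustrative sketch (the min instruction is used as a stand-in, not necessarily one of the cases covered), a compare-and-select like this can be emitted as a single instruction:

__device__ unsigned clamp_upper(unsigned x, unsigned hi) {
  return x < hi ? x : hi;   // can become a single min instruction instead of a
                            // comparison followed by a select
}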

Refactor

We refactored the NVPTX instruction description to make LLVM’s existing backend optimization infrastructure apply more widely and to make NVPTX’s backend code apply more widely too.

Uniform branches

The backend can emit the bra.uni and brx.idx.uni instructions. This gives ptxas more opportunity to generate the more efficient non-divergent branch instructions. To support this, the SDNode divergence analysis is enabled for the NVPTX backend.
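
A sketch of the kind of uniform switch this targets (the kernel is hypothetical):

__global__ void apply(int mode, const float *in, float *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // mode is a kernel parameter, so it is the same for every thread in a warp;
  // the resulting jump-table index is uniform and a uniform branch can be used.
  switch (mode) {
    case 0: out[i] = in[i];        break;
    case 1: out[i] = in[i] * 2.0f; break;
    case 2: out[i] = in[i] + 1.0f; break;
    case 3: out[i] = -in[i];       break;
  }
}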