Speclib  0.1.2
The library for writing better CUDA libraries
Integer Division

Tools for doing integer division quickly on GPUS. More...

Functions

__device__ int sp::uniformSmallDiv (int x, int divisor)
 Integer division x/y, subject to the following limitations: More...
 
__device__ int sp::uniformSmallMod (int x, int divisor)
 Integer mod, based on uniformSmallDiv(). More...
 
__device__ int sp::divPowTwo (int x, int divisor)
 Divide by a power of two in cases where the compiler doesn't have enough information to prove that strength-reduction is safe. More...
 
__device__ int sp::modPowTwo (int x, int divisor)
 Modulus based on divPowTwo(). More...
 
__device__ int sp::iFloatDiv (int x, int divisor)
 Use floating point operations to perform integer division. More...
 
__device__ int sp::iFloatMod (int x, int divisor)
 Modulus based on iFloatDiv(). More...
 

Detailed Description

Tools for doing integer division quickly on GPUS.

The naive expansion of an integer division on the GPU is very expensive (about 40 instructions). If the RHS is a constant than the compiler will do some magic to improve things a lot, but what if you only know the range of the divisor?

When the divisor is small enough, we can use precomputed fixed-point reciprocals to do an integer division in just 3 instructions.

If all threads in a warp have the same divisor, we can then use constant memory to access the lookup table of reciprocals very cheaply. A common application for this is to dynamically adjust how threads are tiled by "reflowing" threads across the available work.

When the divisor is a power of 2, but we can't prove that at compile-time, we can also do the strength reduction explicitly.

Function Documentation

◆ divPowTwo()

__device__ int sp::divPowTwo ( int  x,
int  divisor 
)

Divide by a power of two in cases where the compiler doesn't have enough information to prove that strength-reduction is safe.

This is quite rarely useful because the compiler is extremely good at deducing when things are guaranteed to be powers of two.

Note, also, that this rounds toward negative infinity, not toward zero. This technically makes it a stronger from of strength-reduction than classical optimisation of signed integer division by a power of two, but is also fine for address calculations. Usually. Not breaking it is left as an exercise for the reader.

◆ iFloatDiv()

__device__ int sp::iFloatDiv ( int  x,
int  divisor 
)

Use floating point operations to perform integer division.

This uses three int<->float conversions, one floating point multiply, one integer add, and one floating point reciprocal.

Parameters
xThe number to be divided. This is tested between (-(1 << 21), 1 << 21).
divisorThe number to divide by. This is tested between (-(1 << 21), 1 << 21).
Returns
x / divisor.

◆ iFloatMod()

__device__ int sp::iFloatMod ( int  x,
int  divisor 
)

Modulus based on iFloatDiv().

◆ modPowTwo()

__device__ int sp::modPowTwo ( int  x,
int  divisor 
)

Modulus based on divPowTwo().

◆ uniformSmallDiv()

__device__ int sp::uniformSmallDiv ( int  x,
int  divisor 
)

Integer division x/y, subject to the following limitations:

  • 1 < y <= MaxFastDivisor
  • y is the same for all threads in the warp that are concurrently executing this function (or it's very slow).

When the constraints are met, this operation costs a constant memory read and a call to __umulhi().

◆ uniformSmallMod()

__device__ int sp::uniformSmallMod ( int  x,
int  divisor 
)

Integer mod, based on uniformSmallDiv().