Tools for doing integer division quickly on GPUS. More...
Functions | |
__device__ int | sp::uniformSmallDiv (int x, int divisor) |
Integer division x/y , subject to the following limitations: More... | |
__device__ int | sp::uniformSmallMod (int x, int divisor) |
Integer mod, based on uniformSmallDiv() . More... | |
__device__ int | sp::divPowTwo (int x, int divisor) |
Divide by a power of two in cases where the compiler doesn't have enough information to prove that strength-reduction is safe. More... | |
__device__ int | sp::modPowTwo (int x, int divisor) |
Modulus based on divPowTwo() . More... | |
__device__ int | sp::iFloatDiv (int x, int divisor) |
Use floating point operations to perform integer division. More... | |
__device__ int | sp::iFloatMod (int x, int divisor) |
Modulus based on iFloatDiv() . More... | |
Tools for doing integer division quickly on GPUS.
The naive expansion of an integer division on the GPU is very expensive (about 40 instructions). If the RHS is a constant than the compiler will do some magic to improve things a lot, but what if you only know the range of the divisor?
When the divisor is small enough, we can use precomputed fixed-point reciprocals to do an integer division in just 3 instructions.
If all threads in a warp have the same divisor, we can then use constant memory to access the lookup table of reciprocals very cheaply. A common application for this is to dynamically adjust how threads are tiled by "reflowing" threads across the available work.
When the divisor is a power of 2, but we can't prove that at compile-time, we can also do the strength reduction explicitly.
__device__ int sp::divPowTwo | ( | int | x, |
int | divisor | ||
) |
Divide by a power of two in cases where the compiler doesn't have enough information to prove that strength-reduction is safe.
This is quite rarely useful because the compiler is extremely good at deducing when things are guaranteed to be powers of two.
Note, also, that this rounds toward negative infinity, not toward zero. This technically makes it a stronger from of strength-reduction than classical optimisation of signed integer division by a power of two, but is also fine for address calculations. Usually. Not breaking it is left as an exercise for the reader.
__device__ int sp::iFloatDiv | ( | int | x, |
int | divisor | ||
) |
Use floating point operations to perform integer division.
This uses three int<->float conversions, one floating point multiply, one integer add, and one floating point reciprocal.
x | The number to be divided. This is tested between (-(1 << 21), 1 << 21). |
divisor | The number to divide by. This is tested between (-(1 << 21), 1 << 21). |
__device__ int sp::iFloatMod | ( | int | x, |
int | divisor | ||
) |
Modulus based on iFloatDiv()
.
__device__ int sp::modPowTwo | ( | int | x, |
int | divisor | ||
) |
Modulus based on divPowTwo()
.
__device__ int sp::uniformSmallDiv | ( | int | x, |
int | divisor | ||
) |
Integer division x/y
, subject to the following limitations:
1 < y <= MaxFastDivisor
y
is the same for all threads in the warp that are concurrently executing this function (or it's very slow).When the constraints are met, this operation costs a constant memory read and a call to __umulhi()
.
__device__ int sp::uniformSmallMod | ( | int | x, |
int | divisor | ||
) |
Integer mod, based on uniformSmallDiv()
.