CRTP-base kernel providing software scheduling of virtual blocks, reducing loading overhead.

#include <BlockChain1DKernel.hpp>

Inherits sp::Kernel< Subclass, KGrid >.

Public Member Functions
	BlockChain1DKernel (int stepsPerBlock, int length, uint16_t remainderBlocks)

__device__ int	getStepsThisBlock ()
	Query the number of times this block will run the main loop.

__device__ void	run ()

__device__ void	init ()
	Called before any calls to `processBlock()`. Override in the subclass to run code at that point.

__device__ void	finish ()
	Called after all calls to `processBlock()`. Override in the subclass to run code at that point.

Public Member Functions inherited from sp::Kernel< Subclass, KGrid >
__device__ void	operator() ()

__device__ bool	isLastBlockHere (__device int *atomicCounter)
	A handy mechanism for determining if a block is the last one to reach a certain point.

__host__ void	launch (sp::Vec< int, Grid::GridRank > numBlocks, sp::Vec< int, Grid::BlockRank > numThreads, sp::Stream &stream, int dynamicSMem=0)
	Do the kernel launch.

__host__ void	launch (sp::Vec< int, Grid::GridRank > numBlocks, sp::Vec< int, Grid::BlockRank > numThreads, cudaStream_t stream, int dynamicSMem=0)
	Evil legacy wrapper that allows using bare streams.

Static Public Member Functions
constexpr static bool	needsBoundaryLogic ()
	Override in the subclass to turn off the bounds-checked version and reduce overhead, if appropriate.

Protected Attributes
const int	stepsPerBlock

const int	length

const uint16_t	remainderBlocks

Additional Inherited Members
Public Types inherited from sp::Kernel< Subclass, KGrid >
using	Grid = KGrid

Detailed Description

template<typename Subclass, typename KGrid, int blockStride>
class sp::BlockChain1DKernel< Subclass, KGrid, blockStride >

CRTP-base kernel providing software scheduling of virtual blocks, reducing loading overhead.

Typically used to apply an operation to a number of chunks of data. Instead of launching a number of blocks linear in the input size, this kernel can launch the best number for the hardware while presenting a very similar interface to user code.
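The idea can be illustrated with a small host-side C++ sketch of how a fixed number of launched blocks might share a larger number of virtual blocks. The names `ChainPlan`, `planChain`, and `stepsForBlock` are hypothetical. The rule that the first `remainderBlocks` blocks take one extra step is an assumption about how `stepsPerBlock` and `remainderBlocks` combine, not something this page confirms.

```cpp
#include <cstdint>

// Hypothetical plan for sharing `virtualBlocks` units of work between
// `launchedBlocks` hardware blocks. All names here are illustrative only.
struct ChainPlan {
    int stepsPerBlock;              // steps that every launched block performs
    std::uint16_t remainderBlocks;  // blocks that perform one extra step (assumed)
};

// Split the virtual blocks as evenly as possible over the launched blocks.
inline ChainPlan planChain(int virtualBlocks, int launchedBlocks) {
    return ChainPlan{virtualBlocks / launchedBlocks,
                     static_cast<std::uint16_t>(virtualBlocks % launchedBlocks)};
}

// Assumed analogue of getStepsThisBlock(): the first `remainderBlocks`
// launched blocks each pick up one of the leftover virtual blocks.
inline int stepsForBlock(const ChainPlan& plan, int blockIndex) {
    return plan.stepsPerBlock + (blockIndex < plan.remainderBlocks ? 1 : 0);
}
```

Under these assumptions, 1000 virtual blocks shared between 96 launched blocks gives `stepsPerBlock = 10` and `remainderBlocks = 40`, so blocks 0 to 39 run 11 steps each and the remaining 56 blocks run 10.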

Subclass should contain a member function template with this signature:

template<bool BoundsCheck = false>
__device__ void processBlock(int blockOffset);

Here, `blockOffset` is an abstract 1D offset to the start of the "region" to be processed by the block, replacing the typical `blockIdx`, and `BoundsCheck` is true iff the region of interest contains the end of the array. It is passed as a template parameter to allow efficient code generation for the non-bounds-checked blocks, which are typically the majority.
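A minimal host-only C++ sketch of a subclass following this contract is given below. `MockChain`, `ScaleKernel`, `runAll`, and `kStride` are hypothetical stand-ins written for illustration and are not part of `sp::BlockChain1DKernel`. The sketch assumes that `blockOffset` is an element offset, that regions are `kStride` elements wide, and that `BoundsCheck` is true only for a region that can run past the end of the array.

```cpp
// Assumed width of one region, in elements.
static constexpr int kStride = 4;

// Hypothetical stand-in for the CRTP base: walks the regions serially on the
// host and selects the bounds-checked instantiation only where it is needed.
template <typename Subclass, int blockStride>
struct MockChain {
    void runAll(int length) {
        Subclass& self = static_cast<Subclass&>(*this);
        for (int offset = 0; offset < length; offset += blockStride) {
            if (offset + blockStride > length) {
                self.template processBlock<true>(offset);   // region crosses the end
            } else {
                self.template processBlock<false>(offset);  // interior region
            }
        }
    }
};

// Example user kernel: doubles every element of `data`.
struct ScaleKernel : MockChain<ScaleKernel, kStride> {
    float* data;
    int length;

    ScaleKernel(float* d, int n) : data(d), length(n) {}

    // In real CUDA code this member would be marked __device__.
    template <bool BoundsCheck = false>
    void processBlock(int blockOffset) {
        for (int i = 0; i < kStride; ++i) {
            int idx = blockOffset + i;
            // When BoundsCheck is false this condition is a compile-time
            // constant, so the compiler can drop the test entirely.
            if (BoundsCheck && idx >= length) return;
            data[idx] *= 2.0f;
        }
    }
};
```

With a 10-element array, the regions at offsets 0 and 4 use the unchecked instantiation, and only the region at offset 8 pays for the bounds test.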