PSCF v1.2
Functions
template<int D> void Pscf::ThreadMesh::setConfig (IntVec< D > const &meshDims, bool invert, int blockSize=-1)
Given a multidimensional grid of threads, set execution configuration.
void Pscf::ThreadMesh::checkConfig ()
Check the execution configuration (grid and block dimensions).
void Pscf::ThreadMesh::setThreadsPerBlock (int blockSize)
Manually set the block size that should be used by default.
dim3 Pscf::ThreadMesh::gridDims ()
Get the multidimensional grid of blocks determined by setConfig.
dim3 Pscf::ThreadMesh::blockDims ()
Get the dimensions of a single block determined by setConfig.
dim3 Pscf::ThreadMesh::meshDims ()
Return last requested multidimensional grid of threads.
int Pscf::ThreadMesh::warpSize ()
Get the warp size.
bool Pscf::ThreadMesh::hasUnusedThreads ()
Will there be unused threads?
Management of multidimensional GPU thread execution configurations.
template<int D> void Pscf::ThreadMesh::setConfig (IntVec< D > const & meshDims, bool invert, int blockSize = -1)
Given a multidimensional grid of threads, set execution configuration.
This function accepts an IntVec<D> array containing D mesh dimensions, and constructs an execution configuration that is a D-dimensional grid of threads matching this mesh.
The function first determines an optimal number of threads per block, based on maximizing thread occupancy. It then constructs a D-dimensional thread block of that size (or less, if the requested grid of threads is smaller than the optimal block size). Finally, it determines the dimensions of the D-dimensional grid of blocks that is needed to span the entire input mesh.
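The final step described above, choosing a grid of blocks that spans the whole mesh, amounts to a ceiling division in each dimension. The following is a minimal host-side C++ sketch of that arithmetic only, not the PSCF implementation. The helper names ceilDiv and gridFor are hypothetical, and the block shape is taken as given rather than derived from occupancy.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper: smallest n such that n * block >= mesh.
inline int ceilDiv(int mesh, int block)
{
   assert(mesh > 0 && block > 0);
   return (mesh + block - 1) / block;
}

// Hypothetical helper: number of blocks needed along each of 3 dimensions.
// Unused dimensions are expected to hold 1 in both input arrays.
inline std::array<int, 3> gridFor(std::array<int, 3> const & mesh,
                                  std::array<int, 3> const & block)
{
   std::array<int, 3> grid;
   for (int i = 0; i < 3; ++i) {
      grid[i] = ceilDiv(mesh[i], block[i]);
   }
   return grid;
}
```

Under these assumptions, a 48 x 20 x 1 mesh of threads covered by 32 x 8 x 1 blocks would need a 2 x 3 x 1 grid of blocks.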
The resulting grid and block dimensions are stored internally in the ThreadMesh namespace, and can be accessed via the accessors gridDims() and blockDims(), respectively. The dimensions are stored in objects of type dim3 because that is the data type that is needed to launch a CUDA kernel with a multidimensional mesh of threads. If D = 2, the z member of these dim3 objects is set to 1, and if D = 1, the y member is also set to 1.
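The padding rule for D < 3 can be illustrated with a small host-side C++ sketch. Dim3 is a stand-in for CUDA's dim3 so that the example compiles without the CUDA toolkit, and padToDim3 is a hypothetical helper, not part of ThreadMesh. No inversion of the coordinate order is applied here.

```cpp
#include <cassert>
#include <vector>

// Stand-in for CUDA's dim3 (assumption: used only for illustration).
struct Dim3 { unsigned int x, y, z; };

// Hypothetical helper: copy D = dims.size() extents into x, y, z in order,
// leaving every unused trailing member equal to 1.
inline Dim3 padToDim3(std::vector<int> const & dims)
{
   assert(dims.size() >= 1 && dims.size() <= 3);
   Dim3 out{1u, 1u, 1u};
   out.x = static_cast<unsigned int>(dims[0]);
   if (dims.size() > 1) out.y = static_cast<unsigned int>(dims[1]);
   if (dims.size() > 2) out.z = static_cast<unsigned int>(dims[2]);
   return out;
}
```

With this helper, the 2D extents {64, 32} become (64, 32, 1) and the 1D extent {128} becomes (128, 1, 1).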
IMPORTANT: A required input for this function is the boolean "invert". If invert == true, the order of the dimensions stored in the dim3 objects will be inverted from the order in which they were provided in meshDims (xyz -> zyx). The motivation for this option is memory coalescence, as explained in the following paragraphs.
If each thread in a warp (32 threads) must access a unique element of an array, that corresponds to 32 independent memory accesses. However, if the elements being accessed by the warp are stored in consecutive locations in memory, the hardware can "coalesce" these accesses into far fewer memory transactions (as few as one), which can greatly speed up a CUDA kernel.
A multidimensional thread block must be linearized into a 1D array before warps can be assigned. This linearization is performed in column-major order, in which x is the most rapidly changing index, then y, then z. If, say, a thread block has 32 threads along its x dimension, then each warp will be at a single value of y and z, but will have x values that span from 0 to 31.
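The linearization rule can be written out explicitly. The sketch below is host-side C++ that mirrors the arithmetic a kernel would perform with threadIdx and blockDim. Dim3 is a stand-in for CUDA's dim3, the helper names are hypothetical, and a warp size of 32 is assumed.

```cpp
#include <cassert>

// Stand-in for CUDA's dim3 (assumption: used only for illustration).
struct Dim3 { unsigned int x, y, z; };

// Linear index of a thread within its block:
// x varies fastest, then y, then z.
inline unsigned int linearThreadId(Dim3 const & idx, Dim3 const & block)
{
   assert(idx.x < block.x && idx.y < block.y && idx.z < block.z);
   return idx.x + block.x * (idx.y + block.y * idx.z);
}

// Index of the warp that contains a thread (warp size assumed to be 32).
inline unsigned int warpId(Dim3 const & idx, Dim3 const & block)
{
   return linearThreadId(idx, block) / 32u;
}
```

For a 32 x 4 x 2 block, threads (0,1,0) through (31,1,0) have linear indices 32 through 63, so all of them fall in warp 1, consistent with the statement that each warp sits at a single value of y and z.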
Multidimensional arrays of data stored on the device must also be linearized, but may be linearized in row-major order, in which z is the most rapidly changing index, then y, then x. If the thread grid for the CUDA kernel has the same dimensions, in the same order, as an array that was linearized in row-major order, then consecutive threads in a warp access elements that are far apart in memory, and the memory accesses within a single warp cannot be coalesced.
To avoid this, one can simply invert the dimensions within the CUDA kernel. If the x dimension within the kernel corresponds to the z dimension within the array, then an array that was linearized in row-major order can be accessed with coalescence. The CUDA kernel must be written with the knowledge that the dimensions are inverted in such a way, but otherwise no changes are necessary.
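The effect of inversion on memory access can be checked with a small host-side C++ sketch. It computes the distance in memory between the array elements touched by two threads that are neighbors along the thread x dimension, with and without inversion. The helper names are hypothetical and the code is an illustration of the indexing argument, not code taken from PSCF.

```cpp
#include <cassert>

// Offset of element (ix, iy, iz) in an nx x ny x nz array stored in
// row-major order, in which iz varies fastest.
inline int rowMajorOffset(int ix, int iy, int iz, int nx, int ny, int nz)
{
   assert(ix >= 0 && ix < nx);
   assert(iy >= 0 && iy < ny);
   assert(iz >= 0 && iz < nz);
   return (ix * ny + iy) * nz + iz;
}

// Memory stride between threads tx and tx + 1 of the same warp.
// invert == false: thread x indexes array x (requires nx >= 2).
// invert == true : thread x indexes array z (requires nz >= 2).
inline int strideBetweenNeighbors(bool invert, int nx, int ny, int nz)
{
   int base = rowMajorOffset(0, 0, 0, nx, ny, nz);
   if (invert) {
      return rowMajorOffset(0, 0, 1, nx, ny, nz) - base;
   }
   return rowMajorOffset(1, 0, 0, nx, ny, nz) - base;
}
```

For a 64 x 64 x 64 array, neighboring threads are 4096 elements apart without inversion and 1 element apart with inversion, which is the consecutive access pattern that coalescence requires.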
Regardless of the value of "invert", the program will prioritize having a block size of at least 32 in the x dimension if possible, allowing coalescence to be maximized within a kernel.
meshDims | dimensions of the desired grid of threads (input) |
invert | should the coordinate order be inverted, xyz -> zyx? |
blockSize | desired block size (optional, must be power of 2) |
Definition at line 88 of file ThreadMesh.cu.
References Pscf::ThreadMesh::checkConfig(), Util::Log::file(), Pscf::ThreadMesh::meshDims(), UTIL_CHECK, and UTIL_THROW.
Referenced by Pscf::Rpg::ExtGenFilm< D >::generate(), Pscf::Rpg::MaskGenFilm< D >::generate(), Pscf::Rpg::ExtGenFilm< D >::stressTerm(), and Pscf::Rpg::MaskGenFilm< D >::stressTerm().
void Pscf::ThreadMesh::checkConfig ()
Check the execution configuration (grid and block dimensions).
Check for validity and optimality, based on hardware warp size and streaming multiprocessor constraints.
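The kind of check described here can be sketched in host-side C++ as follows. This is an assumption about what validity and optimality might mean, not a transcription of checkConfig: the struct, the function name, and the default limits (warp size 32, at most 1024 threads per block) are hypothetical placeholders. In real CUDA code these limits would be read from the device, for example from the warpSize and maxThreadsPerBlock fields of cudaDeviceProp.

```cpp
#include <cassert>

// Hypothetical result type for the sketch.
struct BlockReport
{
   bool valid;       // block fits within the per-block thread limit
   bool wholeWarps;  // thread count is a multiple of the warp size
};

// Hypothetical check of a block shape against assumed hardware limits.
inline BlockReport inspectBlock(int bx, int by, int bz,
                                int warpSize = 32,
                                int maxThreadsPerBlock = 1024)
{
   assert(warpSize > 0 && maxThreadsPerBlock > 0);
   BlockReport r;
   long threads = static_cast<long>(bx) * by * bz;
   r.valid = (bx > 0 && by > 0 && bz > 0 && threads <= maxThreadsPerBlock);
   r.wholeWarps = r.valid && (threads % warpSize == 0);
   return r;
}
```

Under these assumed limits, a 32 x 8 x 1 block is valid and fills whole warps, a 48 x 1 x 1 block is valid but leaves part of a warp idle, and a 64 x 32 x 1 block is invalid because it exceeds 1024 threads.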
Definition at line 273 of file ThreadMesh.cu.
References Util::Log::file(), and UTIL_THROW.
Referenced by Pscf::ThreadMesh::setConfig().
void Pscf::ThreadMesh::setThreadsPerBlock (int blockSize)
Manually set the block size that should be used by default.
blockSize | the block size to be used |
Definition at line 342 of file ThreadMesh.cu.
References Util::Log::file(), and UTIL_THROW.
Referenced by Pscf::Rpg::System< D >::setOptions().
dim3 Pscf::ThreadMesh::gridDims ()
Get the multidimensional grid of blocks determined by setConfig.
Definition at line 364 of file ThreadMesh.cu.
Referenced by Pscf::Rpg::ExtGenFilm< D >::generate(), Pscf::Rpg::MaskGenFilm< D >::generate(), Pscf::Rpg::ExtGenFilm< D >::stressTerm(), and Pscf::Rpg::MaskGenFilm< D >::stressTerm().
dim3 Pscf::ThreadMesh::blockDims ()
Get the dimensions of a single block determined by setConfig.
Definition at line 367 of file ThreadMesh.cu.
Referenced by Pscf::Rpg::ExtGenFilm< D >::generate(), Pscf::Rpg::MaskGenFilm< D >::generate(), Pscf::Rpg::ExtGenFilm< D >::stressTerm(), and Pscf::Rpg::MaskGenFilm< D >::stressTerm().
dim3 Pscf::ThreadMesh::meshDims ()
Return last requested multidimensional grid of threads.
Definition at line 370 of file ThreadMesh.cu.
Referenced by Pscf::Rpg::ExtGenFilm< D >::generate(), Pscf::Rpg::MaskGenFilm< D >::generate(), Pscf::ThreadMesh::setConfig(), Pscf::Rpg::ExtGenFilm< D >::stressTerm(), and Pscf::Rpg::MaskGenFilm< D >::stressTerm().
int Pscf::ThreadMesh::warpSize ()
Get the warp size.
Definition at line 373 of file ThreadMesh.cu.
bool Pscf::ThreadMesh::hasUnusedThreads ()
Will there be unused threads?
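One plausible reading of this query, assumed here rather than confirmed by the source, is that threads are unused whenever the grid of blocks covers more threads than the requested mesh in some dimension. The host-side C++ sketch below expresses that condition; the function name coversMoreThanMesh is hypothetical.

```cpp
#include <array>
#include <cassert>

// Hypothetical check: true if grid * block exceeds the mesh in any
// dimension, i.e. some launched threads have no mesh point to work on.
inline bool coversMoreThanMesh(std::array<int, 3> const & mesh,
                               std::array<int, 3> const & block,
                               std::array<int, 3> const & grid)
{
   for (int i = 0; i < 3; ++i) {
      int launched = grid[i] * block[i];
      assert(launched >= mesh[i]);  // the grid must span the mesh
      if (launched > mesh[i]) return true;
   }
   return false;
}
```

When such a condition holds, a kernel typically needs a bounds check so that the surplus threads return without touching memory.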
Definition at line 376 of file ThreadMesh.cu.