Organizing Threads in CUDA C
Blocks and Threads
When a kernel function is launched on the device, a collection of threads is spawned to perform the task defined by the kernel. CUDA organizes these threads in a two-level hierarchy: grids of blocks and blocks of threads. All the threads spawned by a single kernel launch are collectively called a grid, and a grid is made up of many thread blocks. A thread block is a group of threads that cooperate with each other through two block-scoped mechanisms (illustrated in the sketch after the note below):
- Block-local synchronization.
- Block-local shared memory.
Note
Threads from different blocks cannot cooperate.
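To make these two cooperation mechanisms concrete, here is a minimal sketch of a kernel in which the threads of each block add up their own tile of data through block-local shared memory and block-local synchronization. The kernel name blockSum, the block size of 4 and the all-ones input are illustrative assumptions, not taken from the text; the built-in index variables it uses (threadIdx, blockIdx, blockDim) are introduced below.
#include <cuda_runtime.h>
#include <stdio.h>

#define BLOCK_SIZE 4  // assumed block size for this sketch

__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[BLOCK_SIZE];  // block-local shared memory
    const int tid = threadIdx.x;
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[tid] = in[idx];
    __syncthreads();  // block-local synchronization: wait for every thread of this block

    // Tree reduction inside the block; threads of other blocks never take part.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}

int main() {
    const int nBlocks = 2, nElem = nBlocks * BLOCK_SIZE;
    float h_in[nBlocks * BLOCK_SIZE], h_out[nBlocks];
    for (int i = 0; i < nElem; i++) h_in[i] = 1.0f;  // eight ones: each block sums to 4

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, nElem * sizeof(float));
    cudaMalloc((void **)&d_out, nBlocks * sizeof(float));
    cudaMemcpy(d_in, h_in, nElem * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<nBlocks, BLOCK_SIZE>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, nBlocks * sizeof(float), cudaMemcpyDeviceToHost);

    printf("block sums: %.1f %.1f\n", h_out[0], h_out[1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
Threads of different blocks never touch each other's tile and never wait on each other's __syncthreads(); each block produces its partial sum independently.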
CUDA pre-initializes the built-in variables blockIdx and threadIdx to uniquely identify each thread inside the kernel function. When a kernel is launched, the CUDA runtime assigns their values: blockIdx holds the block's coordinates within the grid, and threadIdx holds the thread's coordinates within its block. Both variables are of type uint3, a CUDA built-in vector type of three unsigned integers, whose components are accessible as blockIdx.x, blockIdx.y and blockIdx.z, and similarly threadIdx.x, threadIdx.y and threadIdx.z.
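A common use of these variables is to map each thread to one element of an array. The kernel below is a minimal sketch, assuming a one-dimensional grid of one-dimensional blocks; the name fillIndex and its arguments are illustrative, not part of the text. It combines blockIdx, blockDim (the block dimensions, also set by the runtime) and threadIdx into a unique global index per thread.
__global__ void fillIndex(int *data, const int n) {
    // Unique global index of this thread in a 1D grid of 1D blocks.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the grid may spawn more threads than there are elements
        data[i] = i;
    }
}
Launched as, say, fillIndex<<<(n + 255) / 256, 256>>>(d_data, n), every element of data is written by exactly one thread.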
CUDA organizes grids and blocks in three dimensions. Their dimensions are defined by the dim3 type, a CUDA built-in type for a 3D vector of unsigned integers. A dim3 value can be initialized with 1, 2 or 3 dimensions; any dimension left unspecified defaults to 1. For example, dim3 block(16) defines a block of 16 threads in the x dimension and 1 in the y and z dimensions. Similarly, dim3 grid(4, 4) defines a grid of 4 blocks in each of the x and y dimensions and 1 in the z dimension.
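As a quick host-side illustration of those defaults (the variable names are arbitrary):
dim3 block1(16);      // (16, 1, 1): 16 threads along x
dim3 block2(8, 4);    // (8, 4, 1):  8 x 4 threads
dim3 grid3(4, 4, 2);  // (4, 4, 2):  4 x 4 x 2 blocks
Inside a kernel, the matching built-in variables blockDim and gridDim (both of type dim3) hold the block and grid dimensions chosen at launch. The complete program below prints threadIdx, blockIdx, blockDim and gridDim for a one-dimensional launch.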
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void checkIndex(void) {
    printf(
        "threadIdx: (%d, %d, %d), blockIdx: (%d, %d, %d), blockDim: (%d, %d, %d), "
        "gridDim: (%d, %d, %d)\n",
        threadIdx.x, threadIdx.y, threadIdx.z, blockIdx.x, blockIdx.y, blockIdx.z,
        blockDim.x, blockDim.y, blockDim.z, gridDim.x, gridDim.y, gridDim.z);
}

int main() {
    const int nElem = 6;
    const dim3 block(3);                               // 3 threads per block
    const dim3 grid((nElem + block.x - 1) / block.x);  // round up: 2 blocks for 6 elements

    printf("grid.x %d grid.y %d grid.z %d\n", grid.x, grid.y, grid.z);
    printf("block.x %d block.y %d block.z %d\n", block.x, block.y, block.z);

    checkIndex<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait for the kernel's printf output
    cudaDeviceReset();

    return EXIT_SUCCESS;
}
>>> nvcc CheckDimension.cu -o CheckDimension
>>> ./CheckDimension
grid.x 2 grid.y 1 grid.z 1
block.x 3 block.y 1 block.z 1
threadIdx: (0, 0, 0), blockIdx: (0, 0, 0), blockDim: (3, 1, 1), gridDim: (2, 1, 1)
threadIdx: (1, 0, 0), blockIdx: (0, 0, 0), blockDim: (3, 1, 1), gridDim: (2, 1, 1)
threadIdx: (2, 0, 0), blockIdx: (0, 0, 0), blockDim: (3, 1, 1), gridDim: (2, 1, 1)
threadIdx: (0, 0, 0), blockIdx: (1, 0, 0), blockDim: (3, 1, 1), gridDim: (2, 1, 1)
threadIdx: (1, 0, 0), blockIdx: (1, 0, 0), blockDim: (3, 1, 1), gridDim: (2, 1, 1)
threadIdx: (2, 0, 0), blockIdx: (1, 0, 0), blockDim: (3, 1, 1), gridDim: (2, 1, 1)
For a given data size, the number of blocks needed can be calculated with the following formula,
\[ n_g = \left\lceil \frac{n}{n_b} \right\rceil = \left\lfloor \frac{n + n_b - 1}{n_b} \right\rfloor \]
where \(n_g\) is the number of blocks in the grid, \(n\) is the number of elements and \(n_b\) is the number of threads per block. Adding \(n_b - 1\) before the integer division rounds the result up to the nearest whole block, so that any leftover elements that do not fill a complete block still get an extra block to handle them.
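As a minimal host-side sketch of the same calculation (the element count of 1000 and the block size of 256 are arbitrary choices):
const int n = 1000;                            // number of elements
const dim3 block(256);                         // threads per block, n_b
const dim3 grid((n + block.x - 1) / block.x);  // (1000 + 255) / 256 = 4 blocks
// 4 blocks x 256 threads = 1024 threads cover 1000 elements; the extra 24
// threads would be masked off with an index check inside the kernel.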
See more examples at Qazalbash/CUDAForge/code/2_GridBlock
Check out Qazalbash/CUDAForge for more CUDA examples.