Summation of Two Vectors in CUDA C

Summing two vectors is an embarrassingly parallel problem: every component of the result can be computed independently of the others. Let $A,B\in\mathbb{R}^{n}$, where $n\in\mathbb{N}$, and let $C=A+B$ be their sum. An arbitrary component $C_i$, for $1\leq i\leq n$, is given by $C_i=A_i+B_i$, so it depends only on the corresponding components of $A$ and $B$. The parallel algorithm therefore assigns each component of the result to its own GPU thread. A GPU kernel for this problem would be,

__global__ void ArrSumOnDevice(const float* A, const float* B, float* C, const int N) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

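Here, idx is the global thread index, computed from the block index, the block size, and the thread's index within its block. Note that idx is zero-based, so it runs from 0 to N-1 rather than from 1 to n. If idx is within range (idx < N), the thread reads the corresponding component of each input vector, adds them, and stores the result in the corresponding component of C. The range check is needed because the number of launched threads is rounded up to a whole number of blocks and can therefore exceed N. An equivalent CPU function would look like,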

void ArrSumOnHost(const float* A, const float* B, float* C, const int N) {
    for (int idx = 0; idx < N; ++idx) {
        C[idx] = A[idx] + B[idx];
    }
}
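
To show how the two functions fit together, here is a minimal sketch of host code that runs both on the same input and compares the results. The helper name CountMismatches, the block size of 256 threads, and the test data are assumptions made for illustration and are not taken from the original code. The grid size is rounded up so that every component gets a thread.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the text: one thread per output component.
__global__ void ArrSumOnDevice(const float* A, const float* B, float* C, const int N) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

// CPU reference from the text.
void ArrSumOnHost(const float* A, const float* B, float* C, const int N) {
    for (int idx = 0; idx < N; ++idx) {
        C[idx] = A[idx] + B[idx];
    }
}

// Hypothetical helper (not part of the original code): runs both versions on
// the same input and returns the number of components that differ, or -1 if a
// CUDA call fails.
int CountMismatches(const int N) {
    if (N <= 0) return 0;  // nothing to add; also avoids launching an empty grid

    const std::size_t bytes = static_cast<std::size_t>(N) * sizeof(float);
    std::vector<float> hA(N), hB(N), hRef(N), hOut(N);
    for (int i = 0; i < N; ++i) {  // arbitrary test data
        hA[i] = 0.5f * i;
        hB[i] = 2.0f * i;
    }
    ArrSumOnHost(hA.data(), hB.data(), hRef.data(), N);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) {
        cudaFree(dA); cudaFree(dB); cudaFree(dC);  // freeing a null pointer is a no-op
        return -1;
    }
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;  // assumed block size
    // Ceiling division: the last block may contain threads with idx >= N,
    // which the kernel's range check turns into no-ops.
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    ArrSumOnDevice<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);

    cudaError_t err = cudaGetLastError();  // reports launch configuration errors
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(hOut.data(), dC, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1;

    // Each component is a single float addition, so the two results are
    // expected to agree exactly.
    int mismatches = 0;
    for (int i = 0; i < N; ++i) {
        if (hOut[i] != hRef[i]) ++mismatches;
    }
    return mismatches;
}
```

Calling CountMismatches(1000) from main should return 0 on a machine with a working CUDA device. Since 1000 is not a multiple of 256, this case also exercises the range check in the kernel.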

The complete code is available at Qazalbash/CUDAForge/code/1_ArrSum/ArrSum.cu.

Check out Qazalbash/CUDAForge for more CUDA examples.