Memory Management in CUDA C

The CUDA programming model assumes a system composed of a host (CPU) and a device (GPU), each with its own memory. To get maximum performance and control, CUDA lets users manage device memory explicitly: allocating it, freeing it, copying data, and initializing it. The following are the four most important memory operations provided by CUDA.

cudaMalloc

cudaMalloc allocates linear memory on the device. It is analogous to malloc in standard C and likewise allocates memory in bytes. Its signature is cudaError_t cudaMalloc(void **devPtr, size_t size). The first argument is the address of the pointer that will receive the device memory address (cast to void **), and the second argument is the number of bytes to allocate. The function returns cudaSuccess on success, or an error code on failure.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int   *d_a;
    const size_t size = 10 * sizeof(int);

    // Allocate memory on the device
    cudaError_t err = cudaMalloc((void **)&d_a, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Error allocating memory on device: %s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Use the allocated memory...

    // Free the allocated memory
    cudaFree(d_a);

    return EXIT_SUCCESS;
}

The above code allocates memory for an array of 10 integers on the device. The cudaMalloc function is used to allocate the memory, and the pointer to the allocated memory is stored in d_a. The size of the memory to be allocated is specified in bytes. After using the allocated memory, it is important to free it using cudaFree.

cudaMemcpy

cudaMemcpy copies memory between the host and the device. Its signature is cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind). It takes four arguments: the destination pointer, the source pointer, the number of bytes to copy, and the kind (direction) of the copy, such as cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice. The function returns cudaSuccess on success, or an error code on failure.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int         *h_a, *d_a;
    const size_t size = 10 * sizeof(int);

    // Allocate memory on the host
    h_a = (int *)malloc(size);
    if (h_a == NULL) {
        fprintf(stderr, "Error allocating memory on host\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host array
    for (int i = 0; i < 10; i++) {
        h_a[i] = i;
    }

    // Allocate memory on the device
    cudaError_t err = cudaMalloc((void **)&d_a, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Error allocating memory on device: %s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy data from host to device
    err = cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Error copying data from host to device: %s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy data from device to host
    err = cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "Error copying data from device to host: %s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free the allocated memory
    free(h_a);
    cudaFree(d_a);

    return EXIT_SUCCESS;
}

The above code allocates memory for an array of 10 integers on the host and initializes it. cudaMemcpy is used to copy the data from the host to the device and then back from the device to the host. The direction of the copy is specified with the cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost values of cudaMemcpyKind. After use, the host memory is freed with free and the device memory with cudaFree.
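
cudaMemcpy is not limited to host-device transfers; the cudaMemcpyDeviceToDevice kind copies between two device buffers without staging through the host. Below is a minimal sketch of such a copy, using the hypothetical buffers d_a and d_b, with error checking omitted for brevity.

#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    int         *d_a, *d_b;
    const size_t size = 10 * sizeof(int);

    // Allocate two device buffers (error checking omitted for brevity)
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);

    // Zero the source buffer, then copy it directly on the device
    cudaMemset(d_a, 0, size);
    cudaMemcpy(d_b, d_a, size, cudaMemcpyDeviceToDevice);

    // Free both device buffers
    cudaFree(d_a);
    cudaFree(d_b);

    return EXIT_SUCCESS;
}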

cudaMemset

cudaMemset sets a specified number of bytes of device memory to a specified byte value. Its signature is cudaError_t cudaMemset(void *devPtr, int value, size_t count). It takes three arguments: the pointer to the device memory, the byte value to write, and the number of bytes to set. The function returns cudaSuccess on success, or an error code on failure.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int         *d_a;
    const size_t size = 10 * sizeof(int);

    // Allocate memory on the device
    cudaError_t err = cudaMalloc((void **)&d_a, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Error allocating memory on device: %s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Set the device memory to zero
    err = cudaMemset(d_a, 0, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Error setting device memory: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free the allocated memory
    cudaFree(d_a);

    return EXIT_SUCCESS;
}

The above code allocates memory for an array of 10 integers on the device. The cudaMemset function is used to set the memory to zero. After using the allocated memory, it is important to free it using cudaFree.
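
Note that, like memset in standard C, cudaMemset writes the value byte by byte. Setting an array of int to 1 therefore fills every byte with 0x01, producing the integer 0x01010101 rather than 1. The following is a minimal sketch illustrating this, using a hypothetical single-int buffer d_x, with error checking omitted for brevity.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int *d_x, h_x = 0;

    // Allocate a single int on the device (error checking omitted)
    cudaMalloc((void **)&d_x, sizeof(int));

    // cudaMemset works per byte: every byte of the int becomes 0x01
    cudaMemset(d_x, 1, sizeof(int));

    // Copy the result back and inspect it on the host
    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_x = 0x%08X\n", (unsigned int)h_x); // prints 0x01010101, not 0x00000001

    cudaFree(d_x);

    return EXIT_SUCCESS;
}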

cudaFree

cudaFree frees memory previously allocated on the device with cudaMalloc. Its signature is cudaError_t cudaFree(void *devPtr); it takes a single argument, the pointer to the device memory to be freed, and returns cudaSuccess on success or an error code on failure. It is important to free device memory after use to avoid memory leaks. You can see the use of cudaFree in the previous examples.
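
As a minimal sketch, the return value of cudaFree can be checked like any other runtime call; setting the pointer to NULL afterwards is a common (though optional) habit that avoids accidentally reusing a freed device pointer.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int         *d_a;
    const size_t size = 10 * sizeof(int);

    // Allocate device memory (error checking omitted for brevity)
    cudaMalloc((void **)&d_a, size);

    // Free the device memory and check the result
    cudaError_t err = cudaFree(d_a);
    if (err != cudaSuccess) {
        fprintf(stderr, "Error freeing device memory: %s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    d_a = NULL; // avoid reusing a dangling device pointer

    return EXIT_SUCCESS;
}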


For more details, refer to the NVIDIA CUDA Library Documentation.

Check out Qazalbash/CUDAForge for more CUDA examples.