CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows developers to use NVIDIA GPUs for general purpose processing (an approach known as GPGPU, General-Purpose computing on Graphics Processing Units).
This article records some key learning points about designing a simple CUDA kernel, along with the basic workflow and core concepts.
When it comes to kernels, there are a few important points to note:
A kernel function is declared with the __global__ specifier. Aside from this, its definition is similar to a regular function.
A kernel represents the work executed in parallel by each GPU thread. Inside a kernel, you can access built-in variables such as threadIdx to identify the current thread.
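As a minimal sketch (the kernel name vecAdd and the arrays a, b, c are illustrative, not taken from any particular codebase), such a kernel could look like this:

```cpp
// Declared with __global__; otherwise it reads like an ordinary C++ function.
// Each thread adds one pair of elements, using the built-in threadIdx
// variable to find out which element it is responsible for.
__global__ void vecAdd(const float* a, const float* b, float* c)
{
    int i = threadIdx.x;   // index of this thread within its block
    c[i] = a[i] + b[i];
}
```

Launched as vecAdd<<<1, N>>>(a, b, c), this would run one block of N threads, with thread i handling element i.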
Thread Hierarchy
Threads can be organized in one, two, or three dimensions. These groups of threads are called thread blocks, and their dimensions are declared with the dim3 type (or a plain int for the one-dimensional case).
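For illustration (the variable names here are made up), block dimensions might be declared like this:

```cpp
dim3 block1D(256);       // 256 threads in one dimension (a plain int would also work)
dim3 block2D(16, 16);    // 16 x 16 = 256 threads in two dimensions
dim3 block3D(8, 8, 4);   // 8 x 8 x 4 = 256 threads in three dimensions
```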
All threads of a block are expected to reside on the same streaming multiprocessor (SM), where they share that SM's limited memory resources. On modern GPUs, a single block can contain up to 1024 threads. The total number of threads is given by:
Total threads = Threads per block × Number of blocks
To cover the entire dataset, the number of blocks must be chosen so that threads per block × number of blocks is at least the number of elements; in practice this means dividing the dataset size by the block size and rounding up.
Multiple thread blocks form a grid. The number of blocks is usually determined by the dataset size and is often larger than the number of available multiprocessors.
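A common way to pick the launch configuration, sketched here with assumed names (n, threadsPerBlock, and myKernel are placeholders, not from the article), is to round the division up so the grid covers every element:

```cpp
int n = 1000000;             // number of elements to process (example value)
int threadsPerBlock = 256;   // threads per block
// Round up so that threadsPerBlock * numBlocks >= n.
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

// Total threads = threadsPerBlock * numBlocks, which may slightly exceed n.
myKernel<<<numBlocks, threadsPerBlock>>>(/* kernel arguments */);
```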
The thread ID within a grid can be calculated using the formula:
threadID = threadIdx + blockIdx * blockDim
Here, blockDim gives the number of threads per block along a given dimension. For example, blockDim.x is the block size along the X-axis.
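Putting the formula to work, a sketch of an element-wise kernel (the names scaleArray, data, factor, and n are assumptions for illustration) might look like this:

```cpp
// Each thread computes its global ID as threadIdx + blockIdx * blockDim
// (in the X dimension) and guards against indices past the end of the data,
// since rounding the block count up can create a few surplus threads.
__global__ void scaleArray(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n)
        data[i] *= factor;
}
```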