Hardware role for ML

This post continues from here: https://joyantablog.wordpress.com/2025/04/08/neural-network-dive-deeper/

Let’s continue exploring the hardware magic behind heavy machine learning tasks like training and inference. In my previous post, I discussed how tools like TensorFlow and PyTorch offload resource management and expose optimized approaches through libraries and APIs. This lets you focus on understanding the methods themselves and applying them to your specific workload.

However, the real performance acceleration comes from hardware. GPUs, for example, are designed for parallel processing, and they play a key role in speeding up matrix multiplication, something CPUs, with their handful of cores optimized for sequential work, struggle to match. In this article, I’ll dive deeper into how hardware vendors like NVIDIA have developed groundbreaking GPU architectures that frameworks like TensorFlow and PyTorch leverage to unlock their full potential.

Let’s brainstorm how we can efficiently perform a chain of matrix multiplications from a hardware perspective.

With a GPU, you have thousands of smaller cores working simultaneously. This massively parallel architecture allows the GPU to break down the matrix multiplication task into smaller sub-tasks, which can then be computed in parallel.

On the other hand, with a CPU, you’re limited to fewer cores (usually between 4 and 64), but these cores are more powerful individually. While they can’t perform as many tasks in parallel as a GPU, they can handle complex operations with high precision, executing each task more quickly.
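To make the comparison concrete, here is a minimal sketch in PyTorch (the matrix size and number of steps are arbitrary, and the GPU path assumes a CUDA-capable card is available) that times the same chain of matrix multiplications on the CPU and on the GPU:

```python
import time
import torch

def matmul_chain(device: str, size: int = 2048, steps: int = 20) -> float:
    """Time a chain of matrix multiplications on the given device."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    if device == "cuda":
        torch.cuda.synchronize()          # make sure setup work has finished
    start = time.perf_counter()
    for _ in range(steps):
        a = (a @ b) / size ** 0.5         # rescale so values don't blow up
    if device == "cuda":
        torch.cuda.synchronize()          # wait for all queued GPU work
    return time.perf_counter() - start

print(f"CPU: {matmul_chain('cpu'):.3f} s")
if torch.cuda.is_available():
    print(f"GPU: {matmul_chain('cuda'):.3f} s")
```

The synchronize() calls matter: GPU kernels are launched asynchronously, so without them you would only be timing how quickly Python can queue the work.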

But when working on a huge neural network with millions of floating-point matrix multiplications between high-dimensional vectors, precision can sometimes be sacrificed for faster operations.
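Just to hint at what that sacrifice looks like numerically (the full discussion comes in the next post), here is a tiny NumPy sketch that runs the same multiplication in FP32 and in FP16 and measures how far the results drift apart; the matrix size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((512, 512)).astype(np.float32)
b = rng.standard_normal((512, 512)).astype(np.float32)

full = a @ b                                                      # FP32 reference result
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

# The two results agree only approximately; the gap is the price of FP16 speed.
print("max abs difference:", np.abs(full - half).max())
```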

Here’s a question for you:

If you apply the same algorithm to train a machine learning model in different environments (e.g., CPU vs. GPU), do you think the model’s accuracy or precision would be the same?

People with an ML background know exactly what I’m talking about. I’ll stop here, though; the trade-off between precision and execution speed deserves its own discussion (I’ll touch on it in the next post).

CUDA (Compute Unified Device Architecture)

As mentioned earlier, GPUs contain thousands of smaller cores working simultaneously. This parallel processing capability is what makes GPUs so powerful, especially for tasks like graphics rendering. But the true potential of GPUs goes far beyond just rendering images or videos. Let’s explore how GPUs, with the help of CUDA, are used to accelerate general-purpose computation.

CUDA was developed to harness the massive parallel processing power of GPUs and make it available for tasks beyond graphics rendering. Before CUDA, GPUs were used almost exclusively for graphics (think gaming). CUDA turned NVIDIA’s GPUs into powerful, general-purpose processors that can accelerate a wide variety of computational workloads, including scientific simulations, machine learning, data analysis, and more.

Optimizing GPU Cores for Matrix Multiplication

To make full use of the GPU’s cores, we can break a large matrix multiplication into smaller, more manageable pieces, for example [2 x 2] or [4 x 4] sub-matrices. Tiling the problem like this is a common technique in CUDA programs to maximize the GPU’s parallel processing power.

CUDA enables massive parallelism, allowing tasks to be split across thousands of GPU cores. By utilizing this parallelism, along with efficient memory management (such as registers, cache, shared memory, etc.), matrix multiplication can be executed much faster compared to CPU-based execution. While breaking down larger matrices into smaller sub-matrices (like [2 x 2]) is one way to optimize GPU performance, CUDA’s true strength lies in its ability to manage parallel threads and optimize memory access for high-performance computing tasks.
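Here is a rough illustration of that tiling idea in plain NumPy. It is not how CUDA actually schedules the work, but it shows how the product decomposes into independent block-by-block sub-tasks, each of which could be handed to a different group of cores; the 2 x 2 block size is just for demonstration:

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, block: int = 2) -> np.ndarray:
    """Multiply two square matrices by decomposing the work into block products."""
    n = a.shape[0]                        # assumes n is divisible by block
    c = np.zeros((n, n), dtype=a.dtype)
    for i in range(0, n, block):          # each (i, j) tile of C is an independent sub-task
        for j in range(0, n, block):
            for k in range(0, n, block):  # accumulate partial products over the shared dimension
                c[i:i+block, j:j+block] += a[i:i+block, k:k+block] @ b[k:k+block, j:j+block]
    return c

a = np.random.rand(8, 8)
b = np.random.rand(8, 8)
assert np.allclose(blocked_matmul(a, b), a @ b)   # same answer as a direct multiply
```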

The task of CUDA is to expose the GPU’s optimization capabilities to the outside world, including deep learning frameworks like TensorFlow and PyTorch. This exposure happens via APIs or library methods. TensorFlow or PyTorch submits the matrix multiplication (or other computational tasks) to an appropriate API endpoint or method provided by CUDA. CUDA then determines the optimal strategy for executing the task on the GPU, leveraging specialized hardware like Tensor Cores (which I’ll explain in the next section) for tasks such as matrix multiplication.

In short, CUDA acts as an intermediary: a parallel computing platform and programming model that lets you leverage the GPU’s power without manually handling low-level optimizations, such as deciding when to use Tensor Cores for matrix operations. It abstracts away that complexity and provides an efficient, optimized way to interact with the GPU.

Some may argue that CUDA is just software that exposes the real power of the GPU through certain APIs and libraries. That’s a fair point, but the Tensor Cores it targets are definitely part of the hardware story. Let’s talk about them.

Tensor Cores: Optimizing Matrix Multiplications

Tensor Cores are specialized hardware units within NVIDIA GPUs designed specifically to accelerate the operations most common in deep learning: matrix multiplications, convolutions, and other linear algebra routines. Their primary goal is high throughput for neural network training and inference, delivering substantially better performance on these matrix operations than standard CUDA cores.

Let’s explore how Tensor Cores optimize matrix multiplication—crucial for neural network operations. I recommend checking out my previous article, where I demonstrate that performing matrix multiplication on two [1×1] matrices requires 23 CPU cycles. Now, let’s see how Tensor Cores perform this in just a few GPU cycles.

Precision Compromise:

Tensor Cores use mixed-precision arithmetic, combining FP16 (half-precision, 16-bit floating point) and FP32 (single-precision, 32-bit floating point): the matrix multiply operands are FP16, while the partial products are accumulated into an FP32 sum. Unlike CPUs, which typically compute in 32-bit precision throughout, Tensor Cores lean on FP16 for faster computation, though this does cost some precision. You can read more about this trade-off in the next article. For a deep dive, check out the Volta Architecture whitepaper, which shows how a [1×1] matrix multiplication that takes 23 cycles on a CPU completes in just 4 clock cycles on a Volta CUDA core.
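In PyTorch, this mixed-precision path is exposed through torch.autocast: inside the context, eligible operations like matmul run in FP16 (so Tensor Cores can be used) while the hardware still accumulates in FP32. A minimal sketch, assuming a CUDA-capable GPU and arbitrary matrix sizes:

```python
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA-capable GPU"

a = torch.randn(1024, 1024, device="cuda")        # created as FP32
b = torch.randn(1024, 1024, device="cuda")

# Inside autocast, eligible ops such as matmul run in FP16 so Tensor Cores can
# be used; the hardware accumulates the partial products in FP32 internally.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b

print(c.dtype)        # torch.float16 -> the low-precision compute path was taken
print((a @ b).dtype)  # torch.float32 -> outside autocast, the plain FP32 path
```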

Warp Concept:

A warp is a group of threads (32 on NVIDIA GPUs) that execute the same instruction in lockstep, each thread running on its own CUDA core (also called a streaming processor). The work is broken into many such threads so it can be solved in parallel. The GPU is organized into Streaming Multiprocessors (SMs), each containing many CUDA cores; threads running on the same SM cooperate through an ultra-fast on-chip memory called shared memory.
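You can peek at this hierarchy from Python. The following sketch (assuming PyTorch and a CUDA-capable GPU) asks CUDA for the device’s SM count and a few other properties:

```python
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA-capable GPU"

props = torch.cuda.get_device_properties(0)
print("GPU name                  :", props.name)
print("Streaming Multiprocessors :", props.multi_processor_count)  # number of SMs
print("Compute capability        :", f"{props.major}.{props.minor}")
print("Total global memory      :", props.total_memory // (1024 ** 2), "MiB")
# Threads are scheduled onto these SMs in warps of 32; threads in the same
# block cooperate through the SM's on-chip shared memory.
```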

Unlike CUDA cores, which handle general-purpose computations, Tensor Cores are optimized for high-speed, mixed-precision matrix multiplications, which dominate AI and machine learning workloads (read the next article on precision). They don’t rely on NUMA (Non-Uniform Memory Access) the way multi-socket CPU systems do; instead, the GPU uses a memory model built for parallel processing, with shared memory within each SM and global memory for communication across SMs.

For large matrix multiplications, the work is split into small tiles (e.g., [4 x 4]), distributed across SMs as blocks of threads, and staged through that ultra-fast shared memory. The Tensor Cores inside each SM then take over the tile multiplications, significantly boosting performance.

The Big Picture:

NVIDIA’s Tensor Cores are designed to accelerate matrix operations, especially those central to machine learning and deep learning workloads. One of the ways they achieve remarkable performance is by optimizing smaller matrix calculations, specifically 4×4 matrix multiplications. This makes them distinct from general CUDA cores, which are not optimized for such specialized operations.
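Conceptually, a single Volta-generation Tensor Core operation is a fused D = A x B + C on 4 x 4 tiles, with FP16 multiply operands and an FP32 accumulator. The sketch below is only a NumPy emulation of that idea, not how you would program the hardware:

```python
import numpy as np

def tensor_core_like_fma(a16: np.ndarray, b16: np.ndarray, c32: np.ndarray) -> np.ndarray:
    """Emulate one Tensor-Core-style op: D = A @ B + C on 4x4 tiles,
    with FP16 multiply operands and FP32 accumulation."""
    assert a16.shape == b16.shape == c32.shape == (4, 4)
    # The operands are already rounded to FP16; the products and running sum
    # are kept in FP32, mirroring how the hardware accumulates.
    return a16.astype(np.float32) @ b16.astype(np.float32) + c32

a = np.random.rand(4, 4).astype(np.float16)
b = np.random.rand(4, 4).astype(np.float16)
acc = np.zeros((4, 4), dtype=np.float32)
print(tensor_core_like_fma(a, b, acc))
```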

So here’s the stack:

TensorFlow/PyTorch (Python) → CUDA (parallel computing and programming model) → Tensor Cores (Hardware)

In short, TensorFlow invokes specific CUDA libraries that manage Tensor Cores to perform tasks like matrix multiplication. One such library, CUDA Basic Linear Algebra Subprograms (cuBLAS), specializes in matrix multiplication operations.
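From Python you never call cuBLAS or the Tensor Cores directly; you only hint at how much precision you are willing to trade. Here is a minimal sketch of those knobs in PyTorch (the TF32 settings apply to Ampere-generation GPUs and newer, and the defaults vary between PyTorch versions):

```python
import torch

# Allow FP32 matmuls to run on Tensor Cores via the TF32 format (Ampere GPUs and newer).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Higher-level switch for the same idea: "high" permits TF32, "highest" forces full FP32.
torch.set_float32_matmul_precision("high")

if torch.cuda.is_available():
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b   # PyTorch -> CUDA (cuBLAS) -> Tensor Cores, all behind this one line
    print(c.shape)
```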

Next Topic:

You might get a sense of how the warp concept works if you think of it in terms of microservices: you orchestrate your available cores much like services, to get the most out of those resources. But what about precision? Why is it traded away, and what does that mean for model accuracy and execution speed? I’ll discuss this in the next article. Find it here.
