Why precision matters, use cases

Let’s dive straight into the main topic. I hope you’ve read my previous article here.

In my last post, I talked about the trade-offs between CPUs and GPUs when it comes to matrix multiplication. A single GPU core (which, in this context, means a Tensor Core) performs a matrix multiplication in far fewer cycles than a CPU core (4 cycles vs. 23 cycles). However, this speed comes with a slight loss of precision. The reason is that GPUs typically use 16-bit (FP16) floating-point numbers, while CPUs generally work with 32-bit (FP32) floating-point numbers.

In this post, I’ll discuss why using 16-bit floating-point numbers can cause more precision loss compared to 32-bit and, more importantly, when this really matters and when it doesn’t.

Precision in Floating-Point Numbers

Let’s start with the basics. The bit size of a number plays a major role in its precision, especially when dealing with floating-point operations. For integer arithmetic, precision is pretty straightforward: if the result of a multiplication doesn’t fit in the variable, it overflows, resulting in incorrect data. This is easy to understand.
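
To see integer overflow concretely, here is a quick sketch using NumPy’s fixed-width integers (the library choice and the numbers are mine, purely for illustration):

```python
import numpy as np

# A 16-bit signed integer tops out at 32,767; doubling 30,000 does not fit.
a = np.array([30_000], dtype=np.int16)
print(a * 2)   # wraps around to a negative value instead of 60,000
```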

But floating-point arithmetic is more complex. It’s full of terms like mantissa and exponent, which can be tricky to grasp at first. For today, though, let’s focus on how bit size affects precision in a more digestible way.

Bit Size and Precision in Floating-Point Multiplication

When working with floating-point numbers, their precision behaves differently than with integers. To make things easier, let’s focus on positive numbers between 0.0 and 1.0. As you multiply small numbers within this range, the results get smaller and eventually approach zero. For example, a number like 2⁻¹⁴ becomes so tiny that it’s nearly zero. In such cases, for many calculations, we can safely ignore these very small values.
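
A quick NumPy sketch (my own illustration) shows how repeated multiplication drives a value out of FP16’s reach long before FP32 gives up:

```python
import numpy as np

x16 = np.float16(0.5)
x32 = np.float32(0.5)
for _ in range(30):
    x16 = np.float16(x16 * 0.5)   # keep the running value in FP16
    x32 = np.float32(x32 * 0.5)   # same computation in FP32

print(x16)   # 0.0        -- 2**-31 underflows to zero in FP16
print(x32)   # ~4.66e-10  -- still representable in FP32
```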

Think of this in terms of machine learning, like when training a model such as a neural network. During training, weights (values assigned to model features) are multiplied across multiple layers. Now, imagine a weight becomes so small, like 2⁻¹⁴, that it barely influences the model’s predictions. Do we really need to keep that value? The answer is no. This is where a technique called model pruning comes in: it helps remove these tiny weights that have no significant impact on the model.
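
As a rough illustration of the idea (the weight values and the cutoff below are made up, not tuned), pruning can be as simple as zeroing out weights whose magnitude falls under some threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32) * 0.1
weights[0, 0] = 2.0 ** -14          # a weight so small it is effectively noise

threshold = 1e-3                    # illustrative cutoff, not a recommended value
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

print(np.count_nonzero(weights), "->", np.count_nonzero(pruned))
```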

Now, how do we decide which values are “close enough to zero” to be ignored? This depends on the precision of the system you’re working with. For example:

  • If you’re using 32-bit floating-point numbers (FP32), you might consider values smaller than 2⁻¹²⁶ as too small to matter.
  • For 16-bit floating-point numbers (FP16), the smallest normal value that can be represented is 2⁻¹⁴, so anything below this is practically zero in this system.

The key takeaway is that FP16 can’t accurately represent the very small values between 2⁻¹²⁶ and 2⁻¹⁴ that FP32 handles without trouble. If you try to compute with values in this range, the results underflow or come out wrong. When people refer to precision loss, they’re often talking about exactly this: the inability of a format like FP16 to represent values in that small range. The same reasoning applies to integers and other data types.
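
You can check these thresholds yourself with NumPy (a small sketch of mine; the printed values come from the IEEE 754 formats themselves, not from the library):

```python
import numpy as np

# Smallest *normal* value each format can represent.
print(np.finfo(np.float16).tiny)   # ~6.10e-05  (2**-14)
print(np.finfo(np.float32).tiny)   # ~1.18e-38  (2**-126)

# A value FP32 handles comfortably simply vanishes when squeezed into FP16.
x = np.float32(2.0 ** -30)
print(x)               # ~9.31e-10
print(np.float16(x))   # 0.0
```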

Addition and Precision Loss: Understanding the Trade-offs

In matrix multiplication, both multiplication and addition are essential for processing large data sets, especially in machine learning models. However, addition presents challenges when dealing with very small values.

When two small numbers, say values not much larger than 2⁻¹⁴, are added, their sum can still sit near the bottom of the representable range. These contributions can’t simply be discarded, yet rounding them away, repeated across millions of operations, adds up to significant precision loss. This becomes particularly problematic in sensitive computations, like those in neural networks and other machine learning tasks. Hardware helps here with the fused multiply-add (FMA), which computes a × b + c with a single rounding step instead of two, and NVIDIA’s Volta architecture builds FMA into hardware that operates on small, fixed matrix tiles (e.g., 4 × 4).
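
Here is a small sketch (my own, with arbitrary numbers) of what happens when many tiny values are summed with an FP16 running total versus an FP32 one:

```python
import numpy as np

# 10,000 copies of a value near the bottom of FP16's normal range.
tiny = np.full(10_000, 1e-4, dtype=np.float16)

fp16_sum = np.float16(0.0)
for x in tiny:
    fp16_sum = np.float16(fp16_sum + x)   # every partial sum is rounded back to FP16

fp32_sum = tiny.astype(np.float32).sum()  # same values, FP32 accumulator

print(fp16_sum)   # stalls well below 1.0: once the sum grows, each +1e-4 rounds away
print(fp32_sum)   # close to the true total of ~1.0
```

The FP16 running total stops growing once the sum is large enough that each new contribution is smaller than half the gap between neighbouring FP16 values; the FP32 accumulator doesn’t hit that wall.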

To minimize precision loss, you need a wider temporary storage range, one that can hold these running sums without losing accuracy. This is where the Tensor Cores in NVIDIA’s Volta architecture come in. These cores accelerate both matrix multiplication and addition in specialized hardware, maintaining high throughput while minimizing precision loss.

Tensor Cores support mixed-precision arithmetic, combining half-precision (FP16) and single-precision (FP32) floating-point formats. FP16 reduces computation time and memory usage, while the products of FP16 inputs are accumulated into FP32, which acts as a wider temporary holder for the running sums and keeps precision loss minimal. By combining the two formats, the Volta architecture handles both multiplication and addition efficiently without sacrificing much accuracy.
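
We can emulate this behaviour on the CPU with NumPy (a sketch under my own assumptions; real Tensor Cores do the FP16 multiply / FP32 accumulate in hardware):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)

ref = A @ B                                           # full FP32 reference

# "Tensor Core style": round the inputs to FP16, accumulate the products in FP32.
mixed = A.astype(np.float16).astype(np.float32) @ B.astype(np.float16).astype(np.float32)

# Pure FP16: inputs, products and the running sums all stay in FP16.
A16, B16 = A.astype(np.float16), B.astype(np.float16)
pure = np.zeros((64, 64), dtype=np.float16)
for k in range(64):
    pure += A16[:, k:k + 1] * B16[k:k + 1, :]         # FP16 multiply, FP16 accumulate

print("mixed-precision max error:", np.abs(mixed - ref).max())
print("pure FP16 max error:      ", np.abs(pure.astype(np.float32) - ref).max())
```

In runs like this, the pure FP16 loop typically shows a noticeably larger error than the mixed-precision version; that gap is exactly what the FP32 accumulator closes.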

This specialized hardware is what balances performance and precision. By offloading matrix multiplication and addition to Tensor Cores, Volta delivers fast computation with only a small loss of accuracy, which benefits AI, deep learning, and other high-performance computing applications.

Why Choose GPU Over CPU?

When it comes to training machine learning models, there’s always a trade-off. If your model is small and requires high precision, like for linear regression or tasks where feature weights play a significant role in the predictions, you might prefer a CPU. A CPU works with 32-bit precision and provides high accuracy, but it’s slower for large-scale computations.

On the other hand, GPUs are designed for parallel processing and can train models much faster. However, they often use FP16 precision for floating-point operations, which can lead to some loss in precision. Fortunately, modern CUDA libraries, like cuBLAS, allow you to switch between FP32 and FP16 depending on the workload. This means you can get the benefits of FP16 speed without sacrificing too much accuracy when needed.
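
cuBLAS itself is a C API, but the same choice is exposed at the framework level. Here is a minimal PyTorch sketch (assuming PyTorch and a CUDA-capable GPU are available) that runs the same matrix multiply in FP32 and, via autocast, in FP16:

```python
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

full = a @ b                       # FP32 inputs, FP32 math

with torch.autocast(device_type="cuda", dtype=torch.float16):
    fast = a @ b                   # inputs cast to FP16, Tensor Cores used where available

print(fast.dtype)                                  # torch.float16
print((fast.float() - full).abs().max().item())    # small, workload-dependent difference
```

The printed difference gives a feel for how much accuracy the FP16 path gives up on your particular workload.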

But here’s the catch: GPUs require more power to run at that ultra-fast speed, which means higher energy consumption. If you’re using a cloud provider, this can add to the cost, as GPU-based services tend to be more expensive than CPU-based ones.

Wrapping Up

So, which one should you choose? Well, if you’re dealing with smaller models that need high precision and can’t afford much loss in accuracy, CPU-based training is your friend. But if you’re working on large-scale models, like those in neural networks, GPUs can speed up the training process considerably—just keep in mind that there will be a slight loss in precision, especially when working with FP16.

Remember, it’s all about the right tool for the job. Every approach has its strengths, but as with most things in tech, there’s no one-size-fits-all solution. Choose wisely based on your needs, resources, and priorities.

Thanks for reading!
