Continuation from here.
So far we have discussed how microservices help implement a serverless architecture for neural-network-based systems. In this article I will cover a few interesting topics: parallel computing, how ML library tools help implement smart resource management, and, to finish, a quick look at how hardware solutions from NVIDIA help accelerate ML progress. In the next post, I will dive deeper into the internal hardware-based accelerators, and finally cover how the microservice-based architecture solutions offered by TensorFlow and others make the best use of hardware accelerator features.
We’ll start by exploring how matrix multiplication sits at the heart of both forward and backward propagation in NNs. From there, we’ll dive into how parallelization of large matrix operations can drastically reduce both training and inference time.
I’ll also walk through how synchronizing computations across different hidden layers — especially in a serverless setup — can improve resource utilization.
We’ll look at how microservice-based architectures can scale massive training workloads efficiently, and finally, how hardware and software advancements (like CUDA and Tensor Cores) work together to accelerate the evolution of deep learning.
Matrix Multiply-Accumulate (MMA):
Think about a linear regression model. The equation to predict the output of a sample is:
Y_output = w1*x1 + w2*x2 + w3*x3 + ... + wn*xn + bias
Now, instead of writing out the weighted sum w1*x1 + w2*x2 + w3*x3 + ... + wn*xn, you can rewrite the equation as a matrix multiplication of the W and X vectors. For example,
Y_output = W * X + bias
where W = [w1 w2 w3 ... wn] is a [1 * N] matrix and X = [x1 x2 x3 ... xn] is an [N * 1] (column) matrix. The result is a [1 * 1] matrix.
So now you see why matrix multiplication is important in a prediction task. Now think about a neural network. Each neuron acts like a linear regression model (followed by an activation such as ReLU), and as data passes from layer to layer, the computation becomes a chain of matrix multiplications. This mechanism is referred to as Matrix Multiply-Accumulate (MMA), and it is the core operation in both the forward pass and the backward pass (backpropagation).
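To make this concrete, here is a minimal NumPy sketch (the layer sizes are made up) of a forward pass as a chain of multiply-accumulate operations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-layer network: 4 input features, two hidden layers, one output.
W1, b1 = rng.standard_normal((8, 4)), np.zeros((8, 1))
W2, b2 = rng.standard_normal((8, 8)), np.zeros((8, 1))
W3, b3 = rng.standard_normal((1, 8)), np.zeros((1, 1))

x = rng.standard_normal((4, 1))        # one input sample as a column vector

# Forward pass: each layer is a multiply-accumulate (W @ x + b) plus an activation.
h1 = np.maximum(W1 @ x + b1, 0)        # ReLU
h2 = np.maximum(W2 @ h1 + b2, 0)
y = W3 @ h2 + b3                       # linear output, like the regression example
print(y.shape)                         # (1, 1)
```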
Parallelization of MMA operations and its limits
Imagine you have a giant neural network (NN) with millions of neurons and hundreds of hidden layers. For every single pass over every input, the NN has to solve a chain of matrix multiplications originating from different neurons. Let’s formulate the problem. Since it is now clear that an NN can be thought of as an oracle machine that solves matrix multiplication problems, let’s focus on the multiplication part. In an oversimplified way, the Matrix Multiplication Output (MMO) below is what is expected for a single forward pass.
MMO_output = M1 * M2 * M3 * ... * Mn
Using the associative property of matrix multiplication, we can break the big chain into two smaller parts.
MMO_output = (M1 * M2 * M3 * ... * Mk) * (Mk+1 * Mk+2 * ... * Mn),
where k < n
The multiplication tasks (M1 * M2 * ... * Mk) and (Mk+1 * Mk+2 * ... * Mn) can be executed in parallel on two dedicated CPU cores to accelerate the matrix multiplication process. By further breaking these tasks into smaller pieces and distributing them across idle CPU cores, you can achieve even faster results. However, it’s important to assign matrix multiplication tasks intelligently to fully harness the potential of parallel computing. For instance, if you assign one CPU core to handle millions of matrix multiplications (for higher-dimensional matrices) while another CPU core only handles a few hundred, the assignment is suboptimal: one core finishes much earlier than the other, and the total execution time increases.
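As a rough illustration of the idea (not how any real framework schedules work), the sketch below evaluates the two halves of a hypothetical matrix chain on two separate worker processes and then recombines the partial products, relying on associativity:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import reduce
import numpy as np

def chain_product(matrices):
    # Multiply a list of matrices left to right.
    return reduce(np.matmul, matrices)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chain = [rng.standard_normal((256, 256)) for _ in range(8)]  # M1 ... Mn
    k = len(chain) // 2                                          # arbitrary split point

    with ProcessPoolExecutor(max_workers=2) as pool:
        left = pool.submit(chain_product, chain[:k])    # (M1 * ... * Mk)
        right = pool.submit(chain_product, chain[k:])   # (Mk+1 * ... * Mn)
        result = left.result() @ right.result()         # associativity lets us recombine
```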
There’s also a trade-off to consider. While there are many factors influencing parallelization, let’s focus on the task of breaking down large matrix multiplication tasks. One approach is to set up a large server or cluster with many CPU and GPU cores, enabling rapid execution of extensive training tasks.
However, there’s a catch: there is a limit to how much you can parallelize a job. I won’t dive into the details, but the maximum level of parallelization is primarily determined by the portion of the task that can be parallelized. For example, imagine you have a storage server that extracts data from different sources. Once the data is accessed, it undergoes a security assessment to check for any sensitive information leaks. The data is then sent to a powerful MapReduce server, equipped to process petabytes of data and complete a word-count task in minutes through parallelization.
But in reality, this doesn’t make much sense if the data extraction and processing take hours. The overall operation time is dominated by the time it takes to process the data, which could be several hours. This portion is often referred to as the “sequential part” in Amdahl’s Law. Amdahl’s Law perfectly illustrates the limitations of parallel processing. In short, it states that the execution time is determined by both the sequential and parallel parts of a task. As seen in the previous example, the sequential portion (data processing) limits the full potential of parallel processing. You could argue that parallelizing the data processing, as Spark does, could help overcome this limitation, and indeed, that’s one potential solution.
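For reference, Amdahl’s Law can be written as speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the job and n is the number of workers. A quick sanity check with made-up numbers:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    # Amdahl's Law: the sequential part (1 - p) caps the achievable speedup.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

# If only 50% of the pipeline (say, the word count) is parallelizable,
# even 1,000 workers cannot get past a ~2x overall speedup.
print(amdahl_speedup(0.5, 1000))    # ~2.0
print(amdahl_speedup(0.95, 1000))   # ~19.6, still far from 1000x
```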
Why balancing the tasks is the key
As mentioned earlier, without balancing the tasks appropriately, you can end up with both under-utilized and over-utilized resources. For example, a CPU core managing millions of calculations and a core managing only a few thousand matrix multiplications are an over-utilized and an under-utilized resource, respectively. One way to achieve proper balancing is to assign more CPU cores to the part that deals with millions of calculations; then the total training and inference time can be reduced significantly. Otherwise, the execution time will be dictated by the over-utilized part of the matrix multiplication chain, the one doing most of the processing. You can guess that the most optimal execution time is reached when the sub-tasks are scheduled so that they all finish around the same time. So, balanced workload distribution plays a major role in optimizing execution time, and without it you may end up creating bottlenecks regardless of how much high-performance hardware you have installed.
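One simple way to approximate this balance is to estimate the cost of each sub-task (for a matrix multiplication, roughly m*k*n multiply-accumulates) and always hand the next-largest task to the least-loaded core. A sketch of that greedy heuristic, with hypothetical task costs:

```python
def balance_tasks(task_costs, n_cores):
    """Greedy longest-processing-time heuristic: assign each task
    (largest first) to the currently least-loaded core."""
    loads = [0] * n_cores
    assignment = [[] for _ in range(n_cores)]
    for task_id, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
        core = loads.index(min(loads))
        assignment[core].append(task_id)
        loads[core] += cost
    return assignment, loads

# Hypothetical costs: a couple of huge matmul chains plus many tiny ones.
costs = [5_000_000, 4_800_000, 300, 250, 200, 150, 100]
assignment, loads = balance_tasks(costs, n_cores=2)
print(loads)  # each core's estimated total cost, roughly balanced
```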
Let’s relate the concept of the sequential part in Amdahl’s Law to matrix multiplication. Consider how a “typical” neural net works. In both the forward and backward pass, the work is done layer by layer: hidden layer Ln relies on the outputs of the neurons in layer Ln-1. Another way to put it is that Ln and Ln-1 behave like a barrier-synchronization construct, where Ln has a barrier and, for synchronization’s sake, cannot start working until Ln-1 has finished its assigned task.
Now I think you are starting to see how Amdahl’s Law applies in this particular case. The term “sequential” in Amdahl’s Law not only refers to sequential code in the source that cannot be parallelized; it also refers to a task in parallel processing that is blocked by some other task.
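Here is a toy sketch of that barrier behaviour, using plain NumPy and threads with arbitrary layer sizes: the work inside a layer can be split across workers, but no work on layer Ln can start until every worker has finished its slice of Ln-1.

```python
import threading
import numpy as np

# Toy network: three layers of arbitrary size, ReLU activations.
rng = np.random.default_rng(0)
layer_weights = [rng.standard_normal((64, 64)) for _ in range(3)]
activations = rng.standard_normal((64, 1))

def compute_rows(W, x_in, out, rows):
    # Each worker computes only its own slice of the layer's output.
    out[rows] = np.maximum(W[rows] @ x_in, 0)

n_workers = 4
for W in layer_weights:
    out = np.zeros((W.shape[0], 1))
    row_blocks = np.array_split(np.arange(W.shape[0]), n_workers)
    threads = [threading.Thread(target=compute_rows, args=(W, activations, out, rows))
               for rows in row_blocks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # the join is the barrier: layer Ln cannot start before Ln-1 is done
    activations = out
```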
Also, optimal placement of tasks on the available resources is not a straightforward job. You need a monitoring system that keeps track of the resources, their current utilization, and other factors while assigning any resource to any task. You will find a lot of published work on various mechanisms that address this optimal balancing, and it is really interesting to see how they put intelligence into scheduling tasks on available compute resources such as storage, GPUs, and CPU cores. Besides that, many open-source solutions and tools like TensorFlow, PyTorch, or JAX provide methods and APIs to apply that intelligence to your neural network. If you find a solution that works for your use case, open source it as well. That is the beauty of open source: knowledge is open, so grab it, enhance it, and share it with others.
Okay, What Are TensorFlow, PyTorch, and JAX?
As discussed before, optimally distributing your workload is not something to take lightly. Even with high-end computer resources installed on your data server or local machine, you could easily mess up your system. Tools like TensorFlow, PyTorch, and JAX help you manage this complexity. They offer libraries, methods, and APIs that allow you to focus on your neural network tasks without worrying about how to properly utilize your infrastructure. These tools follow industry standards that are both research and industry proven, so we usually don’t need to reinvent the wheel.
I won’t dive deep into the functionality of these tools, as they are well-documented and easy to explore on your own. However, when you start exploring neural networks, you’ll notice that the naming convention of tensors follows a pattern that aligns with how operations occur in the network. For example, in TensorFlow, a tensor is the concept that represents the data that moves between neurons. This data can take many forms: a simple number (scalar), a list of numbers (vector), a table of numbers (matrix), or even more complex structures. From a matrix multiplication (MMA) perspective, tensors are the results of these operations passed from one layer to another, or from one neuron to another. Each tensor represents the transformed data after an operation, helping carry information through the network.
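For example, a TensorFlow tensor can hold a scalar, a vector, or a matrix, and its shape tells you which:

```python
import tensorflow as tf

scalar = tf.constant(3.0)                      # rank 0, shape ()
vector = tf.constant([1.0, 2.0, 3.0])          # rank 1, shape (3,)
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])             # rank 2, shape (2, 2)

# The data flowing between layers is just tensors produced by matrix operations.
activation = tf.matmul(matrix, tf.reshape(vector[:2], (2, 1)))
print(scalar.shape, vector.shape, matrix.shape, activation.shape)
```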
Let me share an analogy to help clarify how these tools work. Think of these tools as the drivers in an operating system. Just like how a driver hides the complexity of interacting with a device—whether it’s an audio device (like a Bluetooth speaker) or a storage drive (like an NVMe drive)—TensorFlow or PyTorch acts as the driver for managing your computational resources. These resources might include GPUs (like NVIDIA) or high-end servers, and even virtualized environments offered by tools like libvirt or VMware.
TensorFlow or similar tools take your neural network architecture as input. You can then define the neural network with different hyperparameters, such as the number of neurons, the activation function, and the layers. The tool provides APIs to upload your training data, train the neural network, and fine-tune your model—all without you needing to worry about the underlying infrastructure. You only need to configure your TensorFlow setup to point to your endpoints, whether it’s a cluster, an in-memory database like Redis, or the GPU you’re using for training. TensorFlow handles the complexity of infrastructure management, so you can focus on the training data and the machine learning pipeline.
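As a minimal sketch (the layer sizes, activation functions, and optimizer below are placeholder choices, not recommendations), defining such a network with TensorFlow/Keras looks like this; the library decides how to place the underlying matrix operations on the available CPU or GPU:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),  # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),                      # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),                   # output layer
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# x_train / y_train below are placeholders for your own training data.
# model.fit(x_train, y_train, epochs=5, batch_size=32)
```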
Here’s an example of how tools like TensorFlow make your life easier when working with neural networks:
- You might want to leverage the power of GPUs distributed across your data center. TensorFlow allows you to monitor key metrics such as GPU utilization and task queue lengths, enabling you to allocate resources more efficiently and speed up training. By using TensorFlow Distributed, you can run training across multiple servers and reduce the overall training time (see the sketch after this list).
- If you have a Kubernetes cluster specialized for training tasks, TensorFlow can ensure your workload is handled efficiently, even when the GPUs are busy. Check TensorFlow on Kubernetes.
- After your training job is complete, you’ll need to store the trained model, possibly with proper versioning. TensorFlow helps you manage this by offering tools for version control and managing model repositories (registries).
- You’ll also need to access clean and structured data sources when training your model. TensorFlow helps streamline this process, saving you time that would otherwise be spent cleaning and processing raw, unstructured data.
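To illustrate the first point above, here is a minimal sketch of distributed training with tf.distribute; the model and dataset are placeholders:

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all visible GPUs on one machine;
# MultiWorkerMirroredStrategy extends the same idea across several servers.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Everything created inside the scope is mirrored on each device.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(train_dataset, epochs=5)   # train_dataset is a placeholder tf.data.Dataset
```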
Without proper resource allocation and management policies, having a powerful server or high-performance cluster doesn’t matter much. This is where tools like TensorFlow come in—they offload much of the heavy lifting for you. You don’t need to worry about managing the infrastructure; you just need to understand the APIs, methods, and their functionalities. TensorFlow provides ways to handle all of the points mentioned above.
If I stopped here, you might misunderstand the full value of TensorFlow, so I encourage you to dive deeper into exploring it (and similar tools). Consider contributing by sharing your own resource management strategies as open source.
What about Hardware Solutions
Imagine you have a machine with a single-core CPU and Ubuntu as your operating system. Now, you want to use TensorFlow to train a Convolutional Neural Network (CNN) model, which involves processing millions of images and feeding them through the network for training.
First, TensorFlow might not even allow you to run the model—it might simply exit. Why? Because running the training operation could take days or even weeks, and nobody wants that for this type of use case. So, TensorFlow typically works only on the hardware resources that can optimize its performance. This leads to the natural question: Why is there a hardware limitation?
Let’s walk through a calculation to show that the idea of days or weeks isn’t an exaggeration.
Understanding CPU Speed
Assume you’re working with a CPU core that runs at 2.5 GHz. This means the core can complete 2.5 billion CPU cycles per second. In a single day (which has 86,400 seconds), the CPU core can perform: 86,400 * 2.5 billion cycles = 216 trillion cycles in a day.
Now, let’s break down some operations:
- Register access: takes 1 CPU cycle.
- CPU cache access: could take anywhere from 1 to 50 CPU cycles, depending on which cache level is being accessed.
- RAM access: takes between 60 to 100 CPU cycles or more.
Let’s consider a simple operation like matrix multiplication for a single weight and activation matrix of dimension [1 * 1]. Assuming the weight value and activation value are stored in the CPU register (which is ideal and rarely happens in practice), the operations would look like this:
- Accessing register: Takes 1 cycle each.
- Matrix multiplication: Takes 20 CPU cycles to multiply two 64-bit numbers.
- Addition: Takes 1 cycle.
So, the total number of cycles for a single [1 * 1] matrix multiplication is: 20+2+1=23 CPU cycles.
The Forward Pass for a Sample Image
Next, assume your image is 256x256 pixels, which means there are 65,536 input pixels to be processed in the input layer. If each pixel costs 23 CPU cycles, the total for a single forward pass of the input layer for one image is: 65,536 × 23 = 1,507,328 cycles, which takes ~603 microseconds at 2.5 GHz.
Now, if you have 2 million images to train your model, the total time for just this slice of the forward pass would be: 2,000,000 × 603 microseconds = 1,206 seconds (about 20 minutes), which is roughly 1.4% of a 24-hour day.
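Putting the same back-of-envelope numbers into a few lines of Python (every cycle count here is an assumption from above, not a measurement):

```python
CLOCK_HZ = 2.5e9                  # 2.5 GHz core
CYCLES_PER_MAC = 23               # register reads + multiply + add (assumed above)
PIXELS = 256 * 256                # 65,536 input pixels
IMAGES = 2_000_000

cycles_per_image = PIXELS * CYCLES_PER_MAC          # 1,507,328 cycles
seconds_per_image = cycles_per_image / CLOCK_HZ     # ~603 microseconds
total_seconds = IMAGES * seconds_per_image          # ~1,206 seconds (~20 minutes)
print(seconds_per_image * 1e6,                      # microseconds per image
      total_seconds / 60,                           # minutes for all images
      total_seconds / 86_400 * 100)                 # percent of a 24-hour day
```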
This calculation assumes the best-case scenario, where all matrix variables are stored in registers—something that’s highly unlikely in real-world applications. Plus, we’re only considering the forward pass of the input layer for all images, which is just the “tip of the iceberg” in terms of computations.
In reality, you’ll also have to account for backward passes, weight updates, and other operations that make the calculation even more complex. So when we say that training a model could take days or weeks on a single-core CPU, this isn’t an exaggeration. It’s a simple way to understand the immense computational demand of training deep learning models.
GPUs play a key role in speeding up matrix multiplication tasks because they are purpose built for parallel processing, unlike CPUs that handle tasks one at a time. A GPU has thousands of smaller cores that can work on many calculations at once, which is perfect for the large matrix operations needed in deep learning.
When training models, like Convolutional Neural Networks (CNNs), GPUs can perform thousands of calculations simultaneously, making the process much faster. This allows neural networks to be trained in hours or even minutes, compared to days or weeks on a CPU. In short, GPUs make AI and machine learning tasks much quicker by handling large amounts of data at once.
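Here is a small TensorFlow sketch of that idea; it assumes a CUDA-capable GPU is visible and falls back to the CPU otherwise:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible:", gpus)

a = tf.random.normal((4096, 4096))
b = tf.random.normal((4096, 4096))

# On a GPU this single call fans out across thousands of cores;
# on a CPU-only machine it still runs, just far more slowly.
device = "/GPU:0" if gpus else "/CPU:0"
with tf.device(device):
    c = tf.matmul(a, b)
print(c.device, c.shape)
```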
Okay, the post is getting a bit long. In the next article, I’ll cover hardware-based ML accelerators that are designed to speed up machine learning tasks. These hardware solutions make it easier for tools like TensorFlow to unlock their full potential.