Data centers consume a massive amount of energy, and this demand is growing rapidly as facilities are increasingly designed to support AI and GPU-heavy workloads. Decisions made early—during design and planning—have long-term consequences. Good choices can save energy, money, and operational headaches, while poor ones often lock in inefficiencies that are difficult and expensive to fix later.
This article series explores how to design energy-efficient and cost-effective data centers and co-location facilities, with a focus on performance from day one. In high-density environments, even small inefficiencies can scale into significant energy and cost penalties, making early, performance-aware planning essential.
A well-known example comes from Google DeepMind, where smarter control techniques reduced cooling energy use by nearly 40%. This shows what’s possible—but it also highlights a key reality. Software alone isn’t enough. Data centers are purpose-built systems, and real efficiency comes only when hardware behavior, facility design, cooling strategy, and operations work together.
In the rest of this series, I’ll break down how CPUs and GPUs consume power, why compute dominates energy costs, how power is managed in practice, and why cooling and HVAC systems are critical—especially in the age of AI. Before diving into the hardware details, it’s important to first clarify what we really mean by data center performance.
What Does Data Center Performance Really Mean?
Data center performance refers to how well a facility operates over time. This includes energy use, resource utilization, cost trends, and the ability to support reliable, growing computing demand. These factors are crucial for making investment decisions and managing risks.
Some performance measures are unique to data centers. A key one is Power Usage Effectiveness (PUE), which reflects energy efficiency by comparing the total energy the facility uses to the energy used by IT equipment. For example, if the IT equipment uses 1 unit of energy and the entire facility uses 1.4 units, then roughly 71% of the energy is consumed by IT equipment, while the rest goes to cooling, power losses, and other parts of the building. While PUE provides useful information about energy use, it highlights only one aspect of overall data center performance.
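As a quick sketch (the numbers simply mirror the example above and are illustrative only), PUE and the IT share of energy can be computed like this:

```python
# PUE = total facility energy / IT equipment energy.
# Values mirror the example above and are illustrative only.

def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    return total_facility_kwh / it_equipment_kwh

it_kwh = 1.0
facility_kwh = 1.4

print(f"PUE: {pue(facility_kwh, it_kwh):.2f}")             # 1.40
print(f"IT share of energy: {it_kwh / facility_kwh:.0%}")  # ~71%
```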
To truly understand how well a data center performs, we need to consider more than just one number. We should assess its operation over time, in real-life situations, and as demand changes. Factors like energy efficiency, costs, flexibility, and ease of maintenance all influence how effectively a facility runs.
Adopting renewable energy can improve a facility's sustainability metrics, but the added infrastructure and complexity may create budget or compliance challenges. The trade-off also depends on location: a facility in a region with cheap, abundant renewable energy faces a very different calculus than one without such access. This is another reason data center performance should be evaluated as a whole rather than through a single metric.
In summary, data center performance depends on a holistic set of metrics and on how energy is consumed and managed. To identify areas for improvement and their limits, it's essential to examine how power is supplied to and used by computing hardware. This leads to a discussion of energy and power consumption in computing devices.
Energy and Power Analysis of Computational Devices
Power (measured in watts) is the rate at which energy (measured in joules) is used over time. In a data center, components such as the CPUs, GPUs, memory, and storage inside servers need electricity to work. That power is delivered through equipment such as power distribution units (PDUs) and uninterruptible power supplies (UPS), which ensure a steady, clean electrical supply.
Power problems show up in different ways: complete loss of utility power, voltage fluctuations, frequency deviations, or short interruptions that can disrupt equipment. UPS systems help by providing backup power and conditioning the electrical input, letting servers keep running or shut down safely without losing data or damaging hardware.
Reliable power is very important in data centers, where even minor disruptions can affect system availability, cooling, and overall building stability.
Peak load is central to designing and running data centers. Electrical and cooling systems are typically sized for the highest expected power demand rather than the average use, which means peak load drives costs, energy efficiency, and long-term performance.
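To make this concrete, here is a minimal sketch with made-up numbers showing how sizing for peak rather than average demand leaves a large, expensive margin of capacity that is only occasionally used:

```python
# Minimal sketch (illustrative numbers, not from any real facility):
# why sizing for peak load, not average load, drives infrastructure cost.

def required_capacity_kw(peak_load_kw: float, headroom: float = 0.2) -> float:
    """Electrical/cooling capacity sized for peak demand plus safety headroom."""
    return peak_load_kw * (1 + headroom)

average_load_kw = 600   # typical steady-state IT draw (assumed)
peak_load_kw = 1_000    # worst-case simultaneous draw (assumed)

capacity = required_capacity_kw(peak_load_kw)
utilization = average_load_kw / capacity

print(f"Provisioned capacity: {capacity:.0f} kW")
print(f"Average utilization of that capacity: {utilization:.0%}")
# The gap between average use and provisioned capacity is paid for
# up front (switchgear, UPS, chillers) even if it is rarely reached.
```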
Inside Computational Devices
We all know a bit about CPUs and GPUs. What’s really interesting is what happens inside them. Imagine that a CPU or GPU has billions of tiny parts called transistors. These transistors aren’t smart by themselves, but they work together to handle complicated tasks, like running machine learning models.
Thanks to Moore’s Law, we can make these transistors smaller and smaller, allowing billions of them to fit into a single chip. This miniaturization helps modern CPUs and GPUs meet our computing needs today.
This article mainly looks at how CPUs work, but the same ideas relate to GPUs because the same physical rules apply to all computer chips. In my previous articles, I’ve talked about how a CPU uses power, which we can divide into two types: static power and dynamic power. Here, I will focus on dynamic power, which is the most important in intense computing situations.
Who Needs the Power?
In short, the transistors in the CPU are what use power. Each transistor works as a switch for binary states—0 or 1—by being OFF or ON. When a transistor is ON, it represents a “1”; when it is OFF, it represents a “0”.
During each CPU cycle, billions of transistors turn on and off as they work in logic units, registers, caches, buffers, and control circuits. This constant turning on and off uses power to keep everything running.
For a more in-depth discussion, you can check out this paper I wrote in 2018-2019. It mainly covers energy management in networking and transportation, but it also addresses energy management for computing resources.
In a data center, all parts like computing, storage, and networking use power, but the computing resources—like CPUs and GPUs—make up about 65-70% of the total energy used. That’s why managing computing power well is crucial for reducing the overall energy use of a data center.
The dynamic power (in watts) drawn by a CPU is commonly expressed as

\( P = C \cdot V^2 \cdot f \)

Where:
- \( C \) is the effective switching capacitance, a design constant that reflects the integrated circuit's ability to store electrical charge, determined by the physical characteristics of the circuit and the materials used in its construction.
- \( V \) is the supply voltage applied for transistor operation.
- \( f \) is the CPU's operating frequency, usually listed in its specification. This largely determines how fast the CPU can perform a given task, and it is the value most users care about.
This equation highlights an important point: CPU power consumption is influenced more by how the processor is operated than by how much work it performs. Power scales linearly with frequency but quadratically with voltage. Because of this, even small reductions in voltage—along with a corresponding reduction in frequency—can significantly reduce instantaneous power draw.
This principle is applied in a technique known as dynamic voltage and frequency scaling (DVFS), which lowers CPU voltage and frequency when full performance isn't needed, reducing peak power draw.
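As a rough illustration (the capacitance and operating points below are made up, not taken from any specific CPU), here is a small sketch of how the dynamic power equation behaves when DVFS lowers voltage and frequency together:

```python
# Sketch of dynamic CPU power: P = C * V^2 * f
# The capacitance and operating points are illustrative, not from a real part.

def dynamic_power_w(c_farads: float, voltage_v: float, freq_hz: float) -> float:
    """Dynamic switching power in watts."""
    return c_farads * voltage_v ** 2 * freq_hz

C = 1.0e-9  # effective switching capacitance (assumed, 1 nF)

# A high-performance point and a DVFS-reduced point (assumed values).
p_turbo = dynamic_power_w(C, voltage_v=1.20, freq_hz=4.0e9)
p_eco   = dynamic_power_w(C, voltage_v=1.00, freq_hz=3.0e9)

print(f"High point: {p_turbo:.1f} W")
print(f"DVFS point: {p_eco:.1f} W")
print(f"Reduction:  {1 - p_eco / p_turbo:.0%}")
# Frequency drops 25% and voltage drops ~17%, but power falls ~48%
# because voltage enters the equation squared.
```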
Big Picture
At first glance, DVFS might look like a small and simple hack. However, in large data centers, it has a much bigger effect. By managing short bursts of power use in CPUs and GPUs, DVFS helps keep facilities running smoothly and lowers costs.
In co-location data centers, customers are billed based on the number of RACKs they use and the amount of power each RACK can use. Each RACK has a set power limit and holds servers and networking gear. Customers pay for this reserved power even if they don’t use it all.
The power limit of a RACK is based on the highest power needs (Peak Load) of the equipment inside, like CPUs, GPUs, and network cards. This limit is like a cap on how much power these resources can use. It’s important to set this limit to prevent systems from becoming unstable or damaging the hardware.
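As a simple illustration (the equipment list and nameplate wattages below are hypothetical), the RACK limit is typically checked against the sum of the peak draws of everything inside it:

```python
# Hypothetical RACK power budget: sum of peak (nameplate) draws vs. the RACK limit.
# Equipment list and wattages are made up for illustration.

rack_limit_w = 17_000  # contracted/provisioned RACK power limit (assumed)

equipment_peak_w = {
    "gpu_server_1": 6_500,
    "gpu_server_2": 6_500,
    "storage_node": 1_800,
    "top_of_rack_switch": 400,
}

total_peak_w = sum(equipment_peak_w.values())
headroom_w = rack_limit_w - total_peak_w

print(f"Sum of peak draws: {total_peak_w} W")
print(f"RACK limit:        {rack_limit_w} W")
print(f"Headroom:          {headroom_w} W")
# If the sum of peak draws approached or exceeded the limit, either the
# equipment mix changes or power capping (e.g., via DVFS) must be enforced.
```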
A RACK's power limit is a hard ceiling on how much power can safely be drawn. The RACK itself, however, cannot communicate with the CPUs or GPUs inside it; those components are just billions of tiny transistors turning ON and OFF.
DVFS is the software mechanism that translates that RACK-level limit into behavior at the chip level. By adjusting voltage and frequency, DVFS effectively tells the CPU or GPU: "this is how fast you can run without crossing the power boundary." Instead of allowing transistors to switch at full speed and cause power spikes, DVFS keeps their activity within the allowed range.
In this way, a facility-level power limit becomes enforceable at the silicon level, helping protect hardware, maintain stability, and keep power usage under control.
This mechanism is not new or experimental—it is well established and widely used in practice. DVFS and power limits are commonly configured and enforced through BIOS and firmware settings, where operators define voltage, frequency, and power boundaries that the CPU or GPU must obey. These controls allow hardware to operate safely within rack and facility limits while still delivering the required performance.
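To make the idea concrete, here is a minimal, hypothetical power-capping loop. The `read_package_power_w` and `set_max_frequency_hz` helpers are placeholders standing in for platform-specific interfaces (for example, RAPL energy counters or cpufreq settings); they are not real library calls, and the cap and frequency steps are assumed values.

```python
import time

# Hypothetical power-capping loop. read_package_power_w() and
# set_max_frequency_hz() are placeholders for platform-specific interfaces
# (e.g., RAPL counters, cpufreq sysfs); they are not real library calls.

POWER_CAP_W = 250.0          # per-socket cap derived from the RACK budget (assumed)
FREQ_STEPS_HZ = [3.5e9, 3.0e9, 2.5e9, 2.0e9]  # allowed frequency ceilings (assumed)

def read_package_power_w() -> float:
    """Placeholder: return the current CPU package power draw in watts."""
    raise NotImplementedError

def set_max_frequency_hz(freq_hz: float) -> None:
    """Placeholder: lower or raise the CPU's maximum allowed frequency."""
    raise NotImplementedError

def enforce_cap() -> None:
    step = 0  # start at the highest frequency ceiling
    while True:
        power = read_package_power_w()
        if power > POWER_CAP_W and step < len(FREQ_STEPS_HZ) - 1:
            step += 1                      # over budget: step frequency down
        elif power < 0.9 * POWER_CAP_W and step > 0:
            step -= 1                      # comfortably under budget: step back up
        set_max_frequency_hz(FREQ_STEPS_HZ[step])
        time.sleep(1)                      # re-evaluate every second
```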
But the story doesn’t end with power usage. The electricity consumed by CPUs and GPUs for tasks, whether a simple addition or a complex AI job, doesn’t just vanish. Most of it turns into HEAT, which needs to be removed to keep systems running safely.
Why Computational Systems Generate Heat — Why It Matters
Computational resources create heat mainly because of a process called Joule heating. Modern GPU servers use a lot of electricity to switch billions of transistors on and off during calculations. This power comes from the server’s power network and turns into heat.
Joule heating happens when electric current flows through materials that resist electricity, like transistors and wiring, with resistance \( R \). As the current \( I \) passes through these parts, some electrical energy is converted into heat, following the rule \( P = I^2 \cdot R \).
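As a quick, purely illustrative calculation (the current and resistance values are made up, not measured from real hardware):

```python
# Illustrative Joule heating calculation (values are assumed, not measured).
current_a = 50.0        # current drawn through a server's power delivery path (assumed)
resistance_ohm = 0.01   # lumped resistance of that path (assumed)

heat_w = current_a ** 2 * resistance_ohm   # P = I^2 * R
energy_per_hour_j = heat_w * 3600          # joules of heat over one hour

print(f"Dissipated as heat: {heat_w:.0f} W, or {energy_per_hour_j / 1e3:.0f} kJ per hour")
```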
Almost all electrical energy used during calculations becomes heat inside the server. This principle applies to all electrical devices, not just GPUs or CPUs. However, if a system can recover and reuse this heat for other purposes, it can improve the Power Usage Effectiveness (PUE) of a data center. This isn’t a new idea; many businesses and homes already recycle waste heat to lower their energy needs.
In data centers, the situation is different. They host critical IT infrastructure where reliability and uptime are essential. Because of this, heat reuse cannot be applied directly inside the IT area without introducing risks such as overheating or disrupting normal cooling airflow, which can affect equipment performance and reliability.
Instead, data center heat recovery is implemented in a controlled way using technologies such as liquid cooling loops, rear-door heat exchangers, or warm-water cooling systems. These approaches capture and transfer heat from IT equipment without interfering with standard cooling operation.
By decoupling heat recovery from the primary cooling path, data centers can reuse waste heat while maintaining thermal stability, improving energy efficiency, and reducing operational costs. These strategies also support sustainability by lowering overall energy demand and improving resource utilization. I will discuss these approaches in more detail in upcoming articles.
Cooling: A Core Requirement for Data Center Operation
Managing the heat produced by computing hardware is therefore essential to keeping it cool. Without effective heat removal, some areas become too hot, which stresses and can eventually damage the hardware.
One approach is air cooling, which carries heat away and prevents localized overheating. Another is liquid cooling, where a coolant circulates around the GPU or CPU components; liquids absorb and transfer heat much better than air, making cooling faster and more effective.
Now, think about a RACK with several GPU servers running at once. If we only focus on cooling each individual GPU core, it can create hot spots in different areas of the rack. That’s why it’s often better to have cooling systems at the RACK level, even though individual GPU servers are designed to remove heat safely on their own. Rack-level cooling helps manage heat better and keeps all servers working smoothly.
But let’s keep in mind that while cooling is necessary, it comes at a cost—managing cooling systems in a sensitive place like a data center is not free. Cooling infrastructure needs extra equipment, energy, space, and can complicate operations, which all add to higher costs.
I’ll wrap this up here. In the next article, I’ll continue discussing cooling strategies in data centers and co-location facilities. Cooling sensitive IT equipment and keeping the environment safe and comfortable for the people working there accounts for a large share of a facility's energy and utility costs, typically the largest overhead beyond the IT load itself. That’s why achieving even a 40% reduction in cooling energy, as Google DeepMind did, feels like a major win to me.
Stay tuned.