Mass shipments of Nvidia GB300 servers expected to begin in September; demand for GB200 servers remains strong despite repeated warnings about cooling leaks
Challenges in Implementing Liquid Cooling for Nvidia's Next-Generation AI Servers
In the world of AI servers, Nvidia's upcoming next-generation platform, Vera Rubin, is set to make waves. However, the implementation of liquid cooling in the new platform and existing servers like the GB300 is fraught with challenges.
Key issues include persistent leakage problems in liquid cooling systems, particularly with quick-connect fittings. Despite factory stress tests, these fittings have shown a tendency to leak in deployed data centers, leading to increased post-deployment servicing, labor costs, and operational disruptions.
Another challenge is the variability in plumbing setups and water pressure across different data center deployments. This variability makes it difficult to achieve standardized, leak-free liquid cooling installations and consistent system reliability.
With the Vera Rubin platform expected to be more power-hungry due to more powerful Rubin GPUs, the dependence on liquid cooling will increase, intensifying these implementation challenges.
Liquid cooling infrastructure is far more complex than traditional air cooling, demanding cabinet- and facility-level coolant piping, cooling towers, and associated components. This requirement raises costs and deployment complexity, and means data centers must be specifically designed or upgraded to be "liquid cooling ready."
The high costs associated with liquid cooling, including specialized components like cold plates and quick-disconnect couplings, contribute to margin pressures for server manufacturers and data center operators.
Despite these challenges, liquid cooling is becoming essential for maintaining thermal efficiency and scalability at very high rack densities, especially with GB300 and Vera Rubin platforms, as traditional air cooling cannot meet the thermal demands of these high-performance AI servers.
In the face of these challenges, Nvidia is shifting to a more modular approach for the GB300, supplying the B300 GPU, Grace CPU, and hardware management controller separately, with customers sourcing the remaining motherboard components themselves. Large-scale shipments of GB300-based servers are expected to begin in September 2025.
The transition to the GB300 is faster because Nvidia retained the motherboard design used in the GB200 platform. For CPU memory, the GB300 uses standard SOCAMM modules available from multiple vendors.
The Vera Rubin platform will roll out in two phases. The first phase, carrying the NVL144 name, will retain the Oberon rack while replacing Grace CPUs with Vera CPUs and Blackwell GPUs with Rubin GPUs.
Data center operators have responded to the GB200 liquid cooling issues by adopting measures like localized shutdowns and extensive leak testing. The GB300 is in the validation and early production phases with no significant hurdles reported.
By the fourth quarter of 2025, shipment volumes for the GB300 are expected to ramp up significantly. The second phase of the Vera Rubin platform will involve an all-new Kyber rack with Vera CPUs and Rubin Ultra GPUs with four compute chiplets.
While progress with the GB300 and Vera Rubin platforms is promising, it is essential to address the challenges associated with liquid cooling to ensure reliable and efficient operation in data centers.