
Identifying and Disabling Faulty Components in Tesla's Million-Core Dojo Supercomputers: A Single Glitch Can Sabotage Weeks of AI Training Operations

Tesla's Faulty Core Detection Tool Eliminates Defective Cores in Dojo Wafer-Scale Processors, Ensuring Uninterrupted AI Training for Clusters Boasting Millions of Cores.


If you're dealing with a gargantuan processor like Tesla's Dojo, you're not just playing with a bunch of tiny chips. Nope, we're talking about wafer-scale training tiles that pack a staggering 8,850 cores, draw 18,000 amps, and dissipate 15,000 W of power. And guess what? Some of these cores can inexplicably go awry, corrupting the results of extensive training runs that might take weeks to complete.
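For a rough sense of scale, and assuming those two figures describe the same tile-level supply, the implied rail voltage works out to about 15,000 W / 18,000 A ≈ 0.83 V, right in the range of typical core voltages for modern logic.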

Now, imagine having a tool that can detect these rogue cores without taking them offline, keeping your high-performance computing tasks on track. Sounds like a game-changer, right? Well, that's exactly what Tesla's Stress tool is all about.

But let's back up a bit, shall we? Given the mind-boggling complexity of the Dojo Training Tile, spotting a single defective core is like finding a needle in a haystack, even during manufacturing tests. And when a fault only shows up as silent data corruption (SDC), where a core quietly produces wrong results without ever crashing, things get even trickier. But fear not, Tesla has this covered.

They first deployed a differential fuzzing technique, which was a bit, well, let's say crude. They generated a random set of instructions, sent it to all cores, and compared the outputs to find mismatches. It sounded like a smart plan, but it turned out to be a time-waster due to the massive communication overhead between the host and the Dojo training tile.
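As a rough illustration only (this is not Tesla's code, and run_payload_on_core is a hypothetical helper standing in for whatever actually dispatches instructions to a core), the naive differential approach boils down to a majority vote over identical payloads:

    import os
    from collections import Counter

    def differential_fuzz_pass(cores, payload_size=512 * 1024):
        # Host generates one random instruction payload and pushes it to
        # every core; every result then has to travel back to the host.
        # run_payload_on_core() is a hypothetical stand-in, not a real API.
        payload = os.urandom(payload_size)
        results = {core: run_payload_on_core(core, payload) for core in cores}

        # Majority vote: the most common output is assumed to be correct,
        # and any core that disagrees is flagged as suspect.
        majority, _ = Counter(results.values()).most_common(1)[0]
        return [core for core, out in results.items() if out != majority]

The weakness is visible right in the sketch: every payload upload and every result download goes over the host link, which is exactly the overhead Tesla ran into.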

So, Tesla said, "Hold my beer," and refined their method. They assigned each core a unique payload consisting of 0.5 MB of random instructions, allowing cores to retrieve payloads from each other within the Dojo training tile and execute them in turn. This internal data exchange utilized the high-bandwidth communication of the Dojo training tile, enabling them to test approximately 4.4 GB of instructions in a much shorter timeframe.
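A minimal sketch of that refinement, again with hypothetical helpers (load_payload, fetch_from_peer, run_payload_on_core), might look like this; note that 8,850 cores times 0.5 MB is the roughly 4.4 GB of distinct test instructions mentioned above:

    import os

    PAYLOAD_SIZE = 512 * 1024  # 0.5 MB of random instructions per core

    def distributed_fuzz_round(cores):
        # The host uploads one unique payload per core, exactly once.
        # load_payload() is a hypothetical helper, not Tesla's API.
        for core in cores:
            load_payload(core, os.urandom(PAYLOAD_SIZE))

        # Cores then pass payloads around over the tile's own high-bandwidth
        # fabric, so every core eventually executes every payload without
        # any further traffic to or from the host.
        results = {core: [] for core in cores}
        for step in range(len(cores)):
            for i, core in enumerate(cores):
                peer = cores[(i + step) % len(cores)]
                payload = fetch_from_peer(core, peer)  # hypothetical helper
                results[core].append(run_payload_on_core(core, payload))
        return results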

They didn't stop there. Tesla further enhanced the method by enabling cores to run each payload multiple times without resetting their state between runs. This technique introduced additional randomness into the execution environment, enabling the exposure of subtle errors that might otherwise go undetected. And despite the increased number of executions, the slowdown was minimal compared to the gains in detection reliability.
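Continuing the same hypothetical framing, the repeat-without-reset refinement is essentially an inner loop that deliberately skips any state cleanup between runs:

    RUNS_PER_PAYLOAD = 4  # illustrative value, not a figure Tesla has published

    def run_repeated(core, payload):
        outputs = []
        for _ in range(RUNS_PER_PAYLOAD):
            # Deliberately no reset here: leftover register and memory state
            # from the previous run perturbs the execution environment and
            # helps surface marginal faults that a clean run would never hit.
            outputs.append(run_payload_on_core(core, payload))  # hypothetical
        return outputs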

But that wasn't enough for Tesla. They also folded register values into a designated SRAM area using XOR operations, boosting the probability of identifying defective computational units by a factor of 10 without substantial performance degradation.
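One plausible way to picture that step (an assumption on our part about how the XOR accumulation works, not a confirmed detail of Tesla's design): after each run, the core folds its register values into a small signature buffer parked in SRAM, and only the final signatures ever need to be compared.

    def fold_registers_into_signature(signature, registers):
        # 'signature' models a small reserved SRAM region (a list of ints);
        # 'registers' is a snapshot of the core's register values after a run.
        # XOR is cheap, and a single flipped bit in any register on any run
        # changes the final signature, so errors accumulate instead of being
        # overwritten by later, correct runs.
        for i, reg_value in enumerate(registers):
            signature[i % len(signature)] ^= reg_value
        return signature

Comparing each core's final signature against its peers' then flags the defective unit without shipping every intermediate result off the tile.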

And it doesn't end there! Tesla's method isn't confined to the Dojo training tile; it works at the Dojo Cabinet level and even the Dojo Cluster level, picking out a faulty core from among millions of active cores. In practice, the Stress monitoring system has revealed numerous defective cores across Dojo clusters, with detection times varying widely. Most defects appear after executing 1 GB to 100 GB of payload instructions per core, corresponding to seconds to minutes of runtime. Harder-to-detect defects may require 1,000+ GB of instructions, meaning several hours of execution.
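To put those numbers in perspective: with 0.5 MB payloads, 1 GB of test instructions corresponds to roughly 2,000 payload executions per core, while the stubborn 1,000+ GB cases imply on the order of two million executions before a fault finally reveals itself.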

The Stress tool runs as a set of lightweight, self-contained background tests, so it doesn't require cores to be taken offline. Only the identified faulty cores are disabled, and even then, each D1 die can tolerate a few disabled cores without affecting overall functionality.

Going beyond core detection, the Stress tool has discovered a rare design-level flaw, which engineers have managed to address through software adjustments. Several issues within low-level software layers were also found and corrected during the broader deployment of the monitoring system.

By now, the Stress tool has been fully integrated into operational Dojo clusters for in-field monitoring of hardware health during active AI training. The company states that the defect rates observed through this monitoring are comparable to those published by the likes of Google and Meta, indicating that both the monitoring tool and the hardware are on par with what others are using.

Tesla plans to use data obtained by the Stress tool to study the long-term degradation of hardware due to aging. They also intend to extend the method to pre-silicon testing phases and early validation workflows to catch these pesky faults even before production.

So there you have it. Tesla's Stress tool is the ace up their sleeve, ensuring their massive processors stay in tip-top shape, powering the future of AI and high-performance computing. And remember, if you want to stay in the loop on all the tech news, make sure to follow Tom's Hardware on Google News!

