Training frontier AI models is, at its core, a coordination problem. Thousands of chips must communicate with each other continuously, synchronizing every gradient update across the network. When one chip fails or even slows down, the entire training run can stall.
As models scale toward hundreds of billions of parameters, that fragility becomes increasingly untenable. Google DeepMind is now proposing a different model entirely: Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture that decouples compute into asynchronous, fault-isolated ‘islands’, enabling large language model pre-training across geographically distant data centers without the tight synchronization that makes conventional approaches brittle at scale.
The Problem with Traditional Distributed Training

To understand why Decoupled DiLoCo is important, it helps to understand how distributed training typically works. Standard Data-Parallel training replicates a model across many accelerators (GPUs or TPUs), each processing a different mini-batch of data. After each forward and backward pass, gradients must be averaged across every device (a process called AllReduce) before the next training step can begin.
This blocking synchronization step means every device must wait for the slowest one. Across thousands of chips spanning multiple data centers, that bottleneck is not just inconvenient; it makes global-scale training effectively impractical. Bandwidth is another hard constraint.
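The blocking nature of AllReduce can be illustrated with a toy sketch in plain NumPy. This is purely conceptual (the function name and worker setup are invented for clarity, and production collectives are implemented very differently), but it shows why the step cannot complete until every worker's gradient has arrived:

```python
import numpy as np

def all_reduce_mean(local_grads):
    """Blocking AllReduce: every worker receives the mean of all gradients.

    Nothing can proceed until every element of `local_grads` is present,
    which is why a single slow or failed device stalls the whole step.
    """
    stacked = np.stack(local_grads)           # gather from all workers
    mean_grad = stacked.mean(axis=0)          # reduce (average)
    return [mean_grad.copy() for _ in local_grads]  # broadcast back

# Toy example: 4 workers, each with the gradient from its own mini-batch.
grads = [np.full(3, fill_value=float(i)) for i in range(4)]
synced = all_reduce_mean(grads)
print(synced[0])  # every worker now holds the same averaged gradient
```

In a real cluster the gather and broadcast happen over the network, so the averaging cost above is dominated by communication latency, not arithmetic.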
Conventional Data-Parallel training requires approximately 198 Gbps of inter-datacenter bandwidth across eight data centers, far beyond what standard wide-area networking (WAN) can support between geographically distributed facilities.

How Decoupled DiLoCo Works

Decoupled DiLoCo builds on two prior systems from Google. The first is Pathways, which introduced a distributed AI system based on asynchronous data flow, allowing different compute resources to work at their own pace without blocking on one another.
The second is DiLoCo, which cut the inter-datacenter bandwidth required for distributed training by having each worker perform many local gradient steps before communicating with peers, dramatically reducing how much data needs to flow between data centers. Decoupled DiLoCo brings both ideas together. Built on top of Pathways, the system divides training across separate clusters of accelerators called learner units, the ‘islands’ of compute.
Each learner unit trains semi-independently, performing many local steps before sharing a compressed gradient signal with an outer optimizer that aggregates updates across all learner units. Because this outer synchronization step is asynchronous, a chip failure or slow learner unit in one island does not block the others from continuing to train. The bandwidth savings are dramatic.
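The communication pattern behind these savings (many local steps, then one exchange of parameter deltas) can be sketched in a few lines of NumPy. This is a toy quadratic objective with invented names (`inner_steps`, `diloco_round`), and it replaces DiLoCo's momentum-based outer optimizer with plain averaging for clarity:

```python
import numpy as np

def inner_steps(theta, shard_target, lr=0.1, H=20):
    """H local optimizer steps on one learner unit, with no communication.
    Toy objective per unit: minimize ||theta - shard_target||^2."""
    theta = theta.copy()
    for _ in range(H):
        theta -= lr * 2.0 * (theta - shard_target)   # local gradient step
    return theta

def diloco_round(global_theta, shard_targets, outer_lr=1.0):
    """One outer round: each unit trains locally for H steps, then only
    the parameter deltas are exchanged: one communication instead of H."""
    deltas = [global_theta - inner_steps(global_theta, t) for t in shard_targets]
    outer_grad = np.mean(deltas, axis=0)   # aggregate across learner units
    # (The real outer optimizer adds momentum; plain SGD is used here.)
    return global_theta - outer_lr * outer_grad

targets = [np.array([1.0, 2.0]), np.array([3.0, 2.0])]  # two learner units
theta = np.zeros(2)
for _ in range(3):
    theta = diloco_round(theta, targets)
print(theta)   # approaches the consensus optimum near [2.0, 2.0]
```

The key property is visible in `diloco_round`: communication volume scales with the number of outer rounds, not the number of gradient steps, which is where the bandwidth reduction comes from.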
Decoupled DiLoCo reduces required inter-datacenter bandwidth from 198 Gbps to just 0.84 Gbps across eight data centers, a reduction of more than two orders of magnitude, making it compatible with standard internet-scale connectivity between datacenter facilities rather than requiring custom high-speed network infrastructure.

Self-Healing Through Chaos Engineering

One of the most technically significant properties of Decoupled DiLoCo is its fault tolerance. The research team used chaos engineering, a method that deliberately introduces artificial hardware failures into a running system during training to test its robustness.
The system continued training after the loss of entire learner units, and then seamlessly reintegrated those units when they came back online. This behavior is what the research team describes as ‘self-healing’. In simulations involving 1.2 million chips under high failure rates, Decoupled DiLoCo maintained a goodput (the fraction of time the system is performing useful training) of 88%, compared to just 27% for standard Data-Parallel methods.
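The goodput gap between the two approaches follows directly from how failures compose. A toy Monte Carlo sketch (the failure model and the numbers below are illustrative, not the paper's simulation) makes the effect concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_steps, p_fail = 8, 10_000, 0.05   # per-step failure probability

# healthy[t, i] is True when unit i is up during step t.
healthy = rng.random((n_steps, n_units)) > p_fail

# Synchronous Data-Parallel: one failure stalls every unit for that step.
sync_goodput = healthy.all(axis=1).mean()

# Asynchronous islands: the surviving units keep doing useful work.
async_goodput = healthy.mean()

print(f"sync:  {sync_goodput:.2f}")   # roughly (1 - 0.05)**8, about 0.66
print(f"async: {async_goodput:.2f}")  # roughly 1 - 0.05 = 0.95
```

Because synchronous goodput decays roughly exponentially in the number of units while asynchronous goodput stays near the per-unit availability, the gap widens sharply at the scale of the paper's 1.2-million-chip simulations.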
Goodput is the practical metric that matters here: a training run with high nominal compute but low goodput wastes significant resources. Critically, these resilience gains come with minimal degradation in model quality. In real-world experiments using Gemma 4 models, Decoupled DiLoCo achieved an average ML benchmark accuracy of 64.1%, compared with 64.4% for the conventional baseline, a difference well within typical evaluation variance.
Training a 12B Model Across Four U.S. Regions

The research team validated Decoupled DiLoCo at production scale by successfully training a 12-billion-parameter model across four separate U.S. regions using just 2–5 Gbps of wide-area networking, a bandwidth level achievable with existing commercial internet infrastructure between data center facilities. The system accomplished this more than 20 times faster than conventional synchronization methods. The key reason: rather than forcing compute to pause and wait for the slowest device, each learner unit keeps training at its own pace.
