Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
arXiv:2512.17131v3
Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov’s method that unifies and generalizes recent averaging-based optimizers, such as single-worker DiLoCo and Schedule-Free, in the non-distributed setting. Whereas DiLoCo relies on a memory-intensive two-loop structure that periodically aggregates pseudo-gradients using Nesterov momentum, GPA removes this complexity by decoupling Nesterov’s interpolation constants, which enables smooth iterate averaging at every step. Structurally, GPA resembles Schedule-Free but replaces uniform averaging with an exponential moving average. Empirically, GPA […]
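
The structure described in the abstract can be illustrated with a minimal sketch. It is an assumption-laden reading of the abstract, not the authors' algorithm: the function name `gpa_sgd`, the symbols `z`, `x`, `y`, the two constants `mu_x` and `mu_y`, and the use of plain SGD as the base step are all hypothetical choices made for illustration. The sketch keeps a base iterate, evaluates the gradient at an interpolation between the base and averaged iterates, and updates the average with an exponential moving average at every step.

```python
def gpa_sgd(grad, z0, lr=0.1, mu_x=0.9, mu_y=0.5, steps=200):
    """Hypothetical GPA-style loop on a scalar problem.

    Assumptions (not taken from the abstract): plain SGD as the base
    optimizer, and two separate interpolation constants mu_x and mu_y.
    """
    z = z0  # base iterate, moved by the gradient step
    x = z0  # smoothed iterate, the one that would be evaluated
    for _ in range(steps):
        # Evaluate the gradient at an interpolation of x and z.
        y = mu_y * x + (1.0 - mu_y) * z
        # Base optimizer step (plain SGD in this sketch).
        z = z - lr * grad(y)
        # Exponential moving average, applied at every step.
        # A Schedule-Free-style uniform average would instead use
        # a step-dependent weight c = 1 / (t + 1) on z.
        x = mu_x * x + (1.0 - mu_x) * z
    return x


if __name__ == "__main__":
    # Toy objective f(w) = 0.5 * (w - 3)^2, whose gradient is w - 3.
    print(gpa_sgd(lambda w: w - 3.0, z0=0.0))
```

Because the averaging happens inside a single loop, this sketch needs no outer loop and no pseudo-gradient buffer, which is the contrast with the two-loop structure that the abstract attributes to DiLoCo. The constants used here are arbitrary and are not tuned values from the paper.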