A Proof of Learning Rate Transfer under $\mu$P
arXiv:2511.01734v3

Abstract: We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parameterization designed to "maximize" feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a \emph{non-zero constant} as width goes to infinity, providing a theoretical explanation for learning rate transfer. In contrast, we show that this property fails to hold under alternative […]
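To make the parameterization concrete, below is a minimal numerical sketch of the standard $\mu$P prescription for a linear MLP trained with SGD: input layer init variance $1/d$ with learning rate $\eta n$, hidden layers variance $1/n$ with learning rate $\eta$, output layer variance $1/n^2$ with learning rate $\eta/n$, where $n$ is the width and $\eta$ a width-independent base rate. The depth, widths, data, and squared loss are illustrative assumptions, not the paper's exact setup; the sketch only shows that with a fixed base rate the one-step change in the network output stays roughly constant as width grows, which is the mechanism behind learning rate transfer.

```python
# Sketch of muP scalings for a linear MLP with per-layer SGD learning rates.
# Assumed setup: toy data, squared loss, one SGD step; not the paper's experiments.
import numpy as np

def init_mup_linear_mlp(d_in, width, depth, rng):
    """Initialize a linear MLP (no activations) under muP."""
    Ws = [rng.normal(0.0, d_in ** -0.5, size=(width, d_in))]   # input: var 1/d_in
    for _ in range(depth - 1):
        Ws.append(rng.normal(0.0, width ** -0.5, size=(width, width)))  # hidden: var 1/n
    Ws.append(rng.normal(0.0, 1.0 / width, size=(1, width)))   # output: var 1/n^2
    return Ws

def mup_learning_rates(width, depth, eta):
    """Per-layer SGD rates under muP: Theta(n), Theta(1), Theta(1/n)."""
    return [eta * width] + [eta] * (depth - 1) + [eta / width]

def forward(Ws, x):
    h = x
    for W in Ws:
        h = W @ h
    return h.item()

def sgd_step(Ws, lrs, x, y):
    """One SGD step on the squared loss 0.5 * (f(x) - y)^2."""
    hs = [x]                                  # cache each layer's input
    for W in Ws:
        hs.append(W @ hs[-1])
    delta = np.array([hs[-1].item() - y])     # dL/df = f(x) - y
    for l in reversed(range(len(Ws))):
        grad = np.outer(delta, hs[l])         # dL/dW_l = delta_{l+1} h_l^T
        delta = Ws[l].T @ delta               # backprop through pre-update W_l
        Ws[l] -= lrs[l] * grad

rng = np.random.default_rng(0)
d_in, depth, eta = 8, 3, 0.1                  # eta: width-independent base lr
x, y = rng.normal(size=d_in), 1.0
for width in (64, 256, 1024):
    Ws = init_mup_linear_mlp(d_in, width, depth, rng)
    lrs = mup_learning_rates(width, depth, eta)
    f0 = forward(Ws, x)
    sgd_step(Ws, lrs, x, y)
    f1 = forward(Ws, x)
    print(f"width={width:5d}  f before={f0:+.4f}  after={f1:+.4f}  change={f1 - f0:+.4f}")
```

Under these scalings the per-coordinate feature updates are $\Theta(1)$ in width, so the printed one-step change in $f$ stays on the same order across widths; under a parameterization without this property, the update size would grow or vanish with $n$, forcing the optimal learning rate toward $0$ or requiring width-dependent retuning.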