Gated Attention & DeltaNets: The Missing Link for Long-Context AI
Paper-explained Series 6

Transformers didn’t take over AI because they were perfect; they took over because they were parallelizable. Attention let models look everywhere at once, unlocking massive scale. But as models grew deeper and wider, a strange pattern emerged: attention wasn’t failing to look, it was failing to decide when to change.

**Loss Spikes:** Sudden, destabilizing jumps in training error, often caused by massive activations in the residual stream, which limit how fast a model can learn.

**Attention Sinks:** A phenomenon […]
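To make the title's first ingredient concrete: a minimal sketch of a gated attention head. This is an illustrative assumption, not the exact mechanism from any one paper — it applies a sigmoid gate to the attention output, so a head can scale its output toward zero rather than being forced by the softmax to attend somewhere. The function names and the gate parameterization (`w_gate`, a gate computed from the query) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, w_gate):
    """Single-head attention with an output gate (illustrative sketch).

    The softmax rows always sum to 1, so plain attention must put its
    mass *somewhere* — one root of the attention-sink phenomenon. The
    sigmoid gate (hypothetical parameterization: computed per position
    from the query) lets the head shrink its contribution toward zero
    instead.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                # (T, T) attention logits
    attn = softmax(scores, axis=-1)              # each row sums to 1
    out = attn @ v                               # standard attention output
    gate = 1.0 / (1.0 + np.exp(-(q @ w_gate)))   # sigmoid gate in (0, 1)
    return gate * out                            # gated output, shape (T, d)

rng = np.random.default_rng(0)
T, d = 4, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
w_gate = rng.standard_normal((d, 1))
y = gated_attention(q, k, v, w_gate)
print(y.shape)
```

The design point is the failure mode named above: with the gate near zero, a head can effectively abstain, which removes one incentive to dump spurious attention mass onto a sink token.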