Hybrid by Design: Inside the Mamba-MoE Engine of Nemotron 3
## TL;DR

- **The Models:** The family includes Nano, Super, and Ultra.
- **The Architecture:** A hybrid Mamba-Transformer Mixture-of-Experts (MoE) design that replaces most attention layers with Mamba-2 layers for high throughput.
- **Key Innovations:**
  - **LatentMoE:** A new expert-routing mechanism in Super and Ultra that projects tokens into a smaller latent space to improve accuracy-per-byte.
  - **MTP (Multi-Token Prediction):** Enables faster generation via native speculative decoding.
  - **NVFP4:** Native 4-bit floating-point training for the larger models.
- **Capabilities:** Supports 1M token context […]
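To make the LatentMoE idea concrete, here is a minimal sketch of latent-space expert routing in plain NumPy. This is an illustration of the general technique, not Nemotron 3's actual implementation: all names (`W_down`, `router`), dimensions, and the softmax/top-k routing are assumptions. The point is that the router operates on a down-projected latent vector rather than the full hidden state, shrinking the routing parameters and activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
d_model, d_latent, n_experts, top_k = 64, 16, 8, 2
tokens = rng.standard_normal((4, d_model))  # 4 token hidden states

# Hypothetical parameters: a down-projection into the latent space,
# and a router that scores experts on latent vectors.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
router = rng.standard_normal((d_latent, n_experts)) / np.sqrt(d_latent)

# Project each token into the smaller latent space, then route there.
latent = tokens @ W_down                 # (4, d_latent)
logits = latent @ router                 # (4, n_experts)
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# Pick the top-k experts per token from the latent-space scores.
top_experts = np.argsort(probs, axis=-1)[:, -top_k:]
print(top_experts.shape)  # (4, 2): two expert indices per token
```

Routing in a `d_latent`-dimensional space instead of `d_model` cuts the router's compute and parameter footprint per token, which is the "accuracy-per-byte" lever the bullet above refers to.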