Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching
Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation required to balance exploration and exploitation suffers from severe discretization bias. We propose \textbf{F}low-based \textbf{L}og-likelihood-\textbf{A}ware \textbf{M}aximum \textbf{E}ntropy RL (\textbf{FLAME}), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition function estimation […]
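For reference, the intractability noted in the abstract comes from the standard MaxEnt RL optimum; a minimal sketch in the usual notation (soft Q-function $Q$, temperature $\alpha$, action space $\mathcal{A}$, none of which are defined in this excerpt):

% Energy-based optimal policy; Z(s) is the partition function over continuous actions,
% which is intractable to compute exactly and motivates bypassing its estimation.
\[
  \pi^{*}(a \mid s) \;=\; \frac{\exp\!\big(Q(s,a)/\alpha\big)}{Z(s)},
  \qquad
  Z(s) \;=\; \int_{\mathcal{A}} \exp\!\big(Q(s,a')/\alpha\big)\, \mathrm{d}a'.
\]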