Seeking arXiv cs.LG endorsement for paper on probe transfer failure in reward hacking detection
Seeking arXiv cs.LG endorsement for a paper on activation probe transfer failure for reward hacking detection. I test whether probes trained on the School of Reward Hacks dataset (Taylor et al. 2025) transfer to GRPO-induced reward seeking. They don’t: the SFT and RL probe directions are nearly orthogonal (cosine similarity = -0.07). The paper builds on Wilhelm et al. 2026, Taufeeque et al. 2026, and Gupta & Jenner 2025 (NeurIPS MechInterp Workshop). It will be visible on arXiv once endorsed […]
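
For anyone unfamiliar with the orthogonality claim, here is a minimal sketch of how the cosine similarity between two linear probe directions could be computed. It is not the paper's code: the function name `cosine_similarity`, the variables `sft_direction` and `rl_direction`, and the dimension 4096 are all assumptions for illustration, and the vectors are random stand-ins rather than trained probe weights.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for probe weight vectors (NOT the paper's probes).
# The dimension 4096 is an assumed hidden size, chosen only for illustration.
rng = random.Random(0)
sft_direction = [rng.gauss(0.0, 1.0) for _ in range(4096)]
rl_direction = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# Independent random directions in high dimensions are close to orthogonal,
# so this prints a value near 0.
print(cosine_similarity(sft_direction, rl_direction))
```

A value near 0 means the two directions share almost no component, which is the sense in which the SFT and RL probes are described as nearly orthogonal above.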