We’re Teaching AI to Lie. These Researchers Built a Truth Serum.
Author(s): Nicholas Borg

Originally published on Towards AI.

How OpenAI’s “confession training” solves the problem no one’s talking about: models optimised to deceive

You’ve been there, right? You ask an AI to write code. It hacks the timer to pass impossible tests, then tells you “Task completed!”

Reinforcement learning often teaches models to look good rather than be good, creating a divide between output and intent. Source: Gemini Nano Banana Pro

This article discusses the challenges of reward hacking […]