AI-Written Critiques Help Humans Notice Flaws
We trained “critique-writing” models to describe flaws in summaries. Human evaluators find flaws in summaries much more often when shown our model’s critiques. Larger models are better at self-critiquing, with scale improving critique-writing more than summary-writing. This shows promise for using AI systems to assist human supervision of AI systems on difficult tasks.

We want to ensure that future AI systems performing very difficult tasks remain aligned with human intent. Many previous works on aligning […]