This week introduces scalable oversight as an approach to preventing reward misspecification, and discusses one scalable oversight proposal: iterated amplification.
Scalable oversight refers to methods that enable humans to oversee AI systems solving tasks too complex for a single human to evaluate. This week begins by motivating the problem of scalable oversight; we then examine iterated amplification as one potential solution. Iterated amplification is built around task decomposition: the strategy of training agents to perform well on complex tasks by breaking them into smaller subtasks that are easier to solve and evaluate, then combining the subtask solutions into an answer to the full task. Iterated amplification applies task decomposition repeatedly to train increasingly capable agents.
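To make that training loop concrete, below is a minimal, illustrative Python sketch of one amplification-and-distillation cycle. All names here (`Agent`, `decompose`, `amplify`, `iterate`) are hypothetical simplifications introduced for this example, not part of any published implementation; in practice, decomposition and recombination would be done by humans or learned models rather than by splitting strings.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """A toy stand-in for a learned model: it memorises answers it was trained on."""
    answers: dict = field(default_factory=dict)

    def answer(self, task: str) -> str:
        # Tasks the agent has never been trained on get a placeholder answer.
        return self.answers.get(task, f"best guess for: {task!r}")

    def train(self, examples: dict) -> None:
        # Stand-in for distillation: copy the amplified system's behaviour.
        self.answers.update(examples)


def decompose(task: str) -> list[str]:
    # Hypothetical decomposition: split the task description in half.
    # In a real system, a human (or model) would propose meaningful subtasks.
    return [task[: len(task) // 2], task[len(task) // 2:]] if len(task) > 1 else []


def amplify(agent: Agent, task: str) -> str:
    """Answer a task by decomposing it and delegating the subtasks to the agent."""
    subtasks = decompose(task)
    if not subtasks:
        return agent.answer(task)
    sub_answers = [agent.answer(sub) for sub in subtasks]
    # "Combine" step: in this toy version, just join the subtask answers.
    return " + ".join(sub_answers)


def iterate(agent: Agent, tasks: list[str], rounds: int) -> Agent:
    """Repeatedly distil the amplified (agent + decomposition) system back into the agent."""
    for _ in range(rounds):
        training_data = {task: amplify(agent, task) for task in tasks}
        agent.train(training_data)
    return agent


if __name__ == "__main__":
    trained = iterate(Agent(), tasks=["summarise a long report"], rounds=3)
    print(trained.answer("summarise a long report"))
```

The property this loop is meant to illustrate is that, in each round, the agent is trained to imitate a composite system (itself plus decomposition and recombination) that is somewhat more capable than the agent was at the start of the round.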
Next week, we'll examine two other alignment techniques that could work at scale.
Core readings:
Optional readings:
The landscape of alignment research:
Iterated amplification: