Task decomposition for scalable oversight

This week introduces scalable oversight as an approach to preventing reward misspecification, and discusses one scalable oversight proposal: iterated amplification.

Scalable oversight refers to methods that enable humans to oversee AI systems solving tasks too complicated for a single human to evaluate. This week begins by motivating the problem of scalable oversight; we then examine iterated amplification as a potential solution. Iterated amplification is built around task decomposition: the strategy of training agents to perform well on complex tasks by decomposing them into smaller subtasks that can be evaluated more easily, then combining the solutions to those subtasks to produce an answer to the full task. Iterated amplification applies task decomposition repeatedly to train increasingly powerful agents.
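
To make the decompose, delegate, and combine loop concrete, here is a minimal toy sketch in Python. It is not the method from the paper in the core readings: the task (summing a tuple of integers), the names `weak_expert`, `amplify`, `distill` and `iterated_amplification`, and the use of a cache as a stand-in for training are all assumptions made for illustration. The overseer can only directly handle tasks with at most two numbers; each round lets the agent handle tasks roughly twice as long.

```python
import functools
from typing import Callable, Optional, Tuple

Task = Tuple[int, ...]
Agent = Callable[[Task], Optional[int]]


def weak_expert(task: Task) -> Optional[int]:
    """Hypothetical overseer: can only directly solve tasks with at most two numbers."""
    return sum(task) if len(task) <= 2 else None


def amplify(agent: Agent) -> Agent:
    """Overseer assisted by the current agent: decompose, delegate, combine."""
    def amplified(task: Task) -> Optional[int]:
        direct = weak_expert(task)
        if direct is not None:
            return direct
        # Decompose the task into two smaller subtasks.
        mid = len(task) // 2
        left, right = agent(task[:mid]), agent(task[mid:])
        if left is None or right is None:
            return None  # the current agent is not yet strong enough
        # Combining two sub-answers is itself a task the overseer can do.
        return weak_expert((left, right))
    return amplified


def distill(amplified: Agent) -> Agent:
    """Toy stand-in for training a new agent to imitate the amplified system.
    Assumption: imitation is perfect, so we simply cache its answers."""
    return functools.lru_cache(maxsize=None)(amplified)


def iterated_amplification(rounds: int) -> Agent:
    """Start from the weak expert and repeat amplify-then-distill."""
    agent: Agent = weak_expert
    for _ in range(rounds):
        agent = distill(amplify(agent))
    return agent


task = (1, 2, 3, 4, 5, 6, 7, 8)
print(iterated_amplification(0)(task))  # None: too long for the weak expert
print(iterated_amplification(2)(task))  # 36
```

In this toy setting, the agent after k rounds can solve tasks of up to 2^(k+1) numbers. In a real system the distillation step would be a learned model that imitates the amplified system only approximately, which is where much of the difficulty lies.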

Next week, we'll examine two other alignment techniques that could work at scale.

Core readings:

AI alignment landscape (Christiano, 2020) (only main talk, not Q&A) (30 mins)
Supervising strong learners by amplifying weak experts (Christiano et al., 2018): blog post and full paper (only up to the end of section 3.1) (35 mins)

Optional readings:

The landscape of alignment research:

Iterated amplification:

Factored cognition (Ought, 2019) (introduction and scalability section) (20 mins)
