This week introduces scalable oversight as an approach to preventing reward misspecification, and discusses one scalable oversight proposal: iterated amplification.
Scalable oversight refers to methods that enable humans to oversee AI systems solving tasks too complex for a single human to evaluate. This week begins by motivating the problem of scalable oversight; we then examine iterated amplification as one potential solution. Iterated amplification is built around task decomposition: the strategy of training agents to perform well on complex tasks by breaking them into smaller subtasks that are easier to solve and evaluate, then combining the subtask solutions into an answer to the full task. Iterated amplification applies task decomposition repeatedly to train increasingly capable agents.
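To make that training loop concrete, below is a minimal, illustrative Python sketch of one amplification-and-distillation cycle. All names here (`Agent`, `decompose`, `amplify`, `iterate`) are hypothetical simplifications introduced for this example, not part of any published implementation; in practice, decomposition and recombination would be done by humans or learned models rather than by splitting strings.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """A toy stand-in for a learned model: it memorises answers it was trained on."""
    answers: dict = field(default_factory=dict)

    def answer(self, task: str) -> str:
        # Tasks the agent has never been trained on get a placeholder answer.
        return self.answers.get(task, f"best guess for: {task!r}")

    def train(self, examples: dict) -> None:
        # Stand-in for distillation: copy the amplified system's behaviour.
        self.answers.update(examples)


def decompose(task: str) -> list[str]:
    # Hypothetical decomposition: split the task description in half.
    # In a real system, a human (or model) would propose meaningful subtasks.
    return [task[: len(task) // 2], task[len(task) // 2:]] if len(task) > 1 else []


def amplify(agent: Agent, task: str) -> str:
    """Answer a task by decomposing it and delegating the subtasks to the agent."""
    subtasks = decompose(task)
    if not subtasks:
        return agent.answer(task)
    sub_answers = [agent.answer(sub) for sub in subtasks]
    # "Combine" step: in this toy version, just join the subtask answers.
    return " + ".join(sub_answers)


def iterate(agent: Agent, tasks: list[str], rounds: int) -> Agent:
    """Repeatedly distil the amplified (agent + decomposition) system back into the agent."""
    for _ in range(rounds):
        training_data = {task: amplify(agent, task) for task in tasks}
        agent.train(training_data)
    return agent


if __name__ == "__main__":
    trained = iterate(Agent(), tasks=["summarise a long report"], rounds=3)
    print(trained.answer("summarise a long report"))
```

The property this loop is meant to illustrate is that, in each round, the agent is trained to imitate a composite system (itself plus decomposition and recombination) that is somewhat more capable than the agent was at the start of the round.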
Next week, we'll examine two other alignment techniques that could work at scale.
Core readings:
Optional readings:
The landscape of alignment research:
Iterated amplification: