Adversarial techniques for scalable oversight

ℹ️

This is based on the 2023 AGI Safety Fundamentals curriculum

This week focuses on two more potential alignment techniques proposed to work at scale: debate and training using unrestricted adversarial examples.

The initial readings focus on practical and theoretical aspects of debate. The next two readings explore how to generate inputs on which AIs misbehave. Although there is a large literature on adversarial examples (inputs which cause misbehaviour despite being very similar to training examples), we focus on the general case of inputs which cause misbehaviour without necessarily being close to training inputs (known as unrestricted adversarial examples).

Note that although these techniques don’t rely on the task decomposability assumption required for iterated amplificiation, they rely on different strong assumptions. For debate, the assumption is that truthful arguments are more persuasive. For unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs even on complex real-world tasks. The first further reading on each technique explores some problems with these assumptions and potential solutions. The first assumption can be operationalized in terms of a discriminator-critique gap and the second in terms of a generator-discriminator gap (both of which are discussed in the full version of Saunders et al.’s (2022) critiques paper).

Core readings:

🔗

AI-written critiques help humans notice flaws: blog post (Saunders et al., 2022) (10 mins)

🔗

AI safety via debate (Irving et al., 2018) (ending after section 3) (35 mins)

🔗

Red-teaming language models with language models (Perez et al., 2022) (10 mins)

🔗

For those with significant ML background: 1. Robust Feature-Level Adversaries are Interpretability Tools (Casper et al., 2021) (30 mins)

🔗

For those with less ML background: 1. High-stakes alignment via adversarial training blog posts (part one, part two) (Ziegler et al., 2022) (25 mins)

Optional readings:

Debate:

🔗

Debate update: obfuscated arguments problem (Barnes and Christiano, 2020) (excluding appendix) (15 mins)

🔗

Two-turn debate doesn’t help humans answer hard reasoning comprehension questions (Parrish et al., 2022)

🔗

WebGPT (Nakano et al., 2022) and GopherCite (Menick et al., 2022)

Unrestricted adversarial examples:

🔗

Training robust corrigibility (Christiano, 2019) (20 mins)

🔗

Discovering language model behaviors with model-written evaluations: blog post (Perez et al., 2022) (10 mins)

🔗

Adversarial Robustness as a Prior for Learned Representations (Engstrom et al., 2019) (30 mins)

🔗

Constructing unrestricted adversarial examples with generative models (Song et al., 2018) (45 mins)

🔗

Unrestricted adversarial examples via semantic manipulation (Bhattad et al., 2020) (45 mins)

🔗

Adversarial training for high-stakes reliability (Ziegler et al., 2022)

🔗

ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation (Liu et al., 2019)

Notes:

During this week’s discussion session, consider playing OpenAI’s implementation of the Debate game. The instructions on the linked page are fairly straightforward, and each game should be fairly quick. Note in particular the example GIF on the webpage, and the instructions that “the debaters should take turns, restrict themselves to short statements, and not talk too fast (otherwise, the honest player wins too easily).”
What makes AI Debate different from debates between humans? One crucial point is that in debates between humans, we prioritize the most important or impactful claims made - whereas any incorrect statement from an AI debater loses them the debate. This is a demanding standard (aimed at making debates between superhuman debaters easier to judge).

Adversarial techniques for scalable oversight

Core readings:

Optional readings:

Notes:

Next in the AGI Safety Fundamentals curriculum

Topics