This week focuses on two more potential alignment techniques proposed to work at scale: debate and training using unrestricted adversarial examples.
The initial readings focus on practical and theoretical aspects of debate. The next two readings explore how to generate inputs on which AIs misbehave. Although there is a large literature on adversarial examples (inputs which cause misbehaviour despite being very similar to training examples), we focus on the general case of inputs which cause misbehaviour without necessarily being close to training inputs (known as unrestricted adversarial examples).
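The distinction between the two kinds of adversarial example can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than something taken from the readings: "very similar to training examples" is modelled as being within `epsilon` of some training input in L-infinity distance, "misbehaviour" is modelled as disagreeing with a hypothetical `reference` labelling function, and the one-dimensional threshold classifier is invented for the example.

```python
from typing import Callable, Sequence

Input = Sequence[float]
Labeller = Callable[[Input], int]


def linf_distance(a: Input, b: Input) -> float:
    """Largest coordinate-wise difference between two inputs."""
    return max(abs(x - y) for x, y in zip(a, b))


def misbehaves(model: Labeller, reference: Labeller, x: Input) -> bool:
    """Assumption: 'misbehaviour' means disagreeing with the reference label."""
    return model(x) != reference(x)


def is_restricted_adversarial(
    model: Labeller,
    reference: Labeller,
    x: Input,
    train_inputs: Sequence[Input],
    epsilon: float,
) -> bool:
    """The model misbehaves on x AND x lies within epsilon of a training input."""
    near_training = any(linf_distance(x, t) <= epsilon for t in train_inputs)
    return near_training and misbehaves(model, reference, x)


def is_unrestricted_adversarial(
    model: Labeller, reference: Labeller, x: Input
) -> bool:
    """The model misbehaves on x; no closeness constraint is imposed."""
    return misbehaves(model, reference, x)


# Hypothetical toy setup: the model's threshold is slightly off from the reference.
model = lambda x: int(x[0] > 0.5)
reference = lambda x: int(x[0] > 0.4)
train_inputs = [(0.1,), (0.9,)]
x = (0.45,)  # far from every training input, but the model gets it wrong

print(is_restricted_adversarial(model, reference, x, train_inputs, epsilon=0.05))
print(is_unrestricted_adversarial(model, reference, x))
```

In this toy setup the input `x` counts as an unrestricted adversarial example but not as a restricted one, since it is not close to any training input. Every restricted adversarial example is also an unrestricted one, which is why the latter is the more general case.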
Note that although these techniques don’t rely on the task decomposability assumption required for iterated amplification, they do rely on other strong assumptions. For debate, the assumption is that truthful arguments are more persuasive. For unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs even on complex real-world tasks. The first further reading on each technique explores some problems with these assumptions and potential solutions. The first assumption can be operationalized in terms of a discriminator-critique gap and the second in terms of a generator-discriminator gap (both of which are discussed in the full version of Saunders et al.’s (2022) critiques paper).
Unrestricted adversarial examples:
- During this week’s discussion session, consider playing OpenAI’s implementation of the Debate game. The instructions on the linked page are straightforward, and each game should be quick. Note in particular the example GIF on the webpage, and the instructions that “the debaters should take turns, restrict themselves to short statements, and not talk too fast (otherwise, the honest player wins too easily).”
- What makes AI Debate different from debates between humans? One crucial point is that in debates between humans, we prioritize the most important or impactful claims made, whereas in AI Debate any incorrect statement from a debater loses them the debate. This is a demanding standard (aimed at making debates between superhuman debaters easier to judge).
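
The contrast in the last prompt can be sketched as two toy judging rules. This is only an illustration under stated assumptions: the `Statement` record, its `importance` weight, and the weighted scoring rule standing in for human-style judging are all hypothetical, and real debates do not come with correctness labels attached.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Statement:
    text: str
    correct: bool      # assumption: the judge can tell whether it is correct
    importance: float  # assumption: a weight a human-style judge might assign


def weighted_score(statements: Sequence[Statement]) -> float:
    """Human-style judging (hypothetical): weigh each claim by its importance."""
    return sum(s.importance if s.correct else -s.importance for s in statements)


def loses_under_strict_rule(statements: Sequence[Statement]) -> bool:
    """AI Debate standard: a single incorrect statement loses the debate."""
    return any(not s.correct for s in statements)


debater = [
    Statement("Main claim", correct=True, importance=0.9),
    Statement("Key supporting argument", correct=True, importance=0.8),
    Statement("Minor aside", correct=False, importance=0.1),
]

print(weighted_score(debater) > 0)       # comes out ahead under the weighted rule
print(loses_under_strict_rule(debater))  # but loses under the strict rule
```

The same debater does well when claims are weighed by importance but loses under the strict rule because of one minor error, which is what makes the standard so demanding.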