This week starts off by focusing on reward misspecification: the phenomenon where our default techniques for training ML models often unintentionally assign high rewards to undesirable behaviour. Behaviour which exploits reward misspecification to get a high reward is known as reward hacking.
This type of alignment failure happens due to our failure to capture our exact desires for the resulting systems' behaviour in the reward function or loss function that we use to train machine learning systems. Failure to solve this problem is an important source of danger from advanced systems if they're built using current-day techniques, which is why we're spending a week on it.
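As a rough illustration of the gap between the reward we write down and the behaviour we want, here is a minimal sketch in Python. Everything in it is a hypothetical toy invented for this example (the one-dimensional track, the tile numbers, and the names `head_to_goal`, `camp_on_bonus`, `proxy_reward` and `intended_outcome`); it is not taken from any of the readings. The designer wants the agent to reach the goal tile, but the reward as specified only counts visits to a bonus tile, so selecting behaviour by reward favours the policy that never finishes.

```python
# Hypothetical toy environment: a 1-D track with tiles 0..4.
GOAL = 4        # tile the designer actually wants the agent to reach
BONUS_TILE = 2  # tile that pays out a point on every step spent there
HORIZON = 8     # number of steps per episode


def rollout(policy):
    """Run one episode from tile 0 and return the list of visited tiles."""
    pos, trajectory = 0, [0]
    for _ in range(HORIZON):
        pos = policy(pos)
        trajectory.append(pos)
    return trajectory


def head_to_goal(pos):
    """Intended behaviour: walk right until the goal, then stay there."""
    return min(pos + 1, GOAL)


def camp_on_bonus(pos):
    """Reward-hacking behaviour: walk to the bonus tile and never leave."""
    return min(pos + 1, BONUS_TILE)


def proxy_reward(trajectory):
    """The reward as specified: +1 for each step spent on the bonus tile."""
    return sum(1 for pos in trajectory if pos == BONUS_TILE)


def intended_outcome(trajectory):
    """What the designer actually cares about: did the agent finish?"""
    return trajectory[-1] == GOAL


policies = {"head_to_goal": head_to_goal, "camp_on_bonus": camp_on_bonus}
results = {name: rollout(policy) for name, policy in policies.items()}

# Selecting by the specified reward picks the policy that never finishes.
best_by_proxy = max(results, key=lambda name: proxy_reward(results[name]))

for name, trajectory in results.items():
    print(name, proxy_reward(trajectory), intended_outcome(trajectory))
print("selected by proxy reward:", best_by_proxy)
```

The point of the sketch is only that optimising the specified reward and achieving the intended outcome can come apart; real training setups are far more complex than picking between two fixed policies.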
We start by looking at some toy examples where rewards are hard-coded or based on human feedback. We'll then look at two techniques that engineers working with foundation models use to overcome reward misspecification in advanced systems: Reinforcement Learning from Human Feedback (RLHF) and Inverse Reinforcement Learning (IRL). We’ll also examine the limitations of these techniques. Note that RLHF is much more of a focus, and the two readings on IRL should only be done once you feel you understand the rest of this week's material.
The second key topic this week is instrumental convergence: the idea that AIs pursuing a range of different rewards or goals will tend to converge on a set of similar instrumental strategies. Broadly speaking, we can summarize these strategies as aiming to gain power over the world. Instrumental convergence therefore provides a bridge between the narrow examples of reward misspecification we see today and the possibility of large-scale disempowerment from AI; a reading by Christiano provides an illustration of how this might occur.
Understand what is meant by 'instrumental convergence' and the five goals that AI systems might internalise because they are 'instrumental' to achieving their overall goals (rather than being goals we intended them to pursue).
For those with extensive ML background:
Turner et al. flesh out the arguments from Bostrom (2014) by formalizing the notion of power-seeking in the reinforcement learning context, and proving that many agents converge to power-seeking. (See also the corresponding blog post and paper.)
On threat models: