Reward misspecification and instrumental convergence

ℹ️

This is based on the 2023 AGI Safety Fundamentals curriculum

This week starts off by focusing on reward misspecification: the phenomenon where our default techniques for training ML models often unintentionally assign high rewards to undesirable behaviour. Behaviour which exploits reward misspecification to get a high reward is known as reward hacking.

This type of alignment failure happens because the reward function or loss function we use to train machine learning systems fails to capture exactly what we want from the resulting systems' behaviour. Failure to solve this problem is an important source of danger from advanced systems if they're built using current-day techniques, which is why we're spending a week on it.
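As a minimal sketch of the gap between a proxy reward and what we actually want, consider a hypothetical cleaning robot that is rewarded based on how many messes its camera can see. The scenario, policy names and numbers below are invented for illustration and do not come from the readings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    messes_remaining: int  # what we actually care about
    messes_visible: int    # what the camera-based reward can observe


# Hypothetical policies and their outcomes in a room that starts with 5 messes.
OUTCOMES = {
    "clean_room": Outcome(messes_remaining=1, messes_visible=1),
    "do_nothing": Outcome(messes_remaining=5, messes_visible=5),
    "cover_camera": Outcome(messes_remaining=5, messes_visible=0),
}


def proxy_reward(outcome: Outcome) -> int:
    """The reward we wrote down: fewer visible messes is better."""
    return -outcome.messes_visible


def intended_utility(outcome: Outcome) -> int:
    """What we meant: fewer messes actually left in the room is better."""
    return -outcome.messes_remaining


def best_policy(score) -> str:
    """Return the policy name whose outcome maximises the given score."""
    return max(OUTCOMES, key=lambda name: score(OUTCOMES[name]))


print("Optimising the proxy reward picks:", best_policy(proxy_reward))
print("Optimising what we intended picks:", best_policy(intended_utility))
```

Under these assumed numbers the proxy is maximised by covering the camera, while the intended objective is maximised by cleaning: the reward function is misspecified, and the camera-covering behaviour is the reward hack.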

We start by looking at some toy examples in which rewards are hard-coded or based on human feedback. We'll then look at two techniques that engineers working with foundation models use to overcome reward misspecification in advanced systems: Reinforcement Learning from Human Feedback (RLHF) and Inverse Reinforcement Learning (IRL). We'll also examine the limitations of these techniques. Note that RLHF is much more of a focus, and the two readings on IRL should only be done once you feel you understand the rest of this week's material.
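To make the human-feedback idea concrete, here is a minimal sketch of the pairwise comparison loss commonly used when training a reward model from human preferences: the model assigns a scalar reward to each of two candidate behaviours, and is penalised when the one the human preferred does not score higher. The function names are illustrative assumptions, and this is not the implementation from any of the readings.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Modelled probability that a human prefers A over B.

    Uses the Bradley-Terry form exp(r_a) / (exp(r_a) + exp(r_b)),
    which simplifies to sigmoid(r_a - r_b).
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's observed choice."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))


# The loss is small when the reward model agrees with the human label
# and large when it disagrees.
print(preference_loss(reward_preferred=2.0, reward_rejected=0.0))
print(preference_loss(reward_preferred=0.0, reward_rejected=2.0))
```

A real reward model would be a neural network trained by gradient descent on this loss over many comparisons, and would use a numerically stable log-sigmoid; the sketch only shows the shape of the objective.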

The second key topic this week is instrumental convergence: the idea that AIs pursuing a range of different rewards or goals will tend to converge to a set of similar instrumental strategies. Broadly speaking, we can summarize these strategies as aiming to gain power over the world. Instrumental convergence therefore provides a bridge between the narrow examples of reward misspecification we see today, and the possibility of large-scale disempowerment from AI; a reading by Christiano provides an illustration of how this might occur.
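A toy simulation can give some intuition for why different goals might favour similar strategies. Assume an agent makes a single choice between a "dead end" that leads to one final outcome and a "hub" from which it can reach any of three final outcomes, and that its reward for each outcome is drawn independently and uniformly at random. This setup is an invented illustration, not the formal definition of power used in the readings.

```python
import random


def fraction_preferring_hub(num_samples: int = 10_000,
                            hub_options: int = 3,
                            seed: int = 0) -> float:
    """Estimate how often a randomly drawn reward function makes the hub optimal.

    Each sample draws one reward for the dead-end outcome and one reward for
    each outcome reachable from the hub. An optimal agent goes to the hub
    whenever the best hub outcome beats the dead-end outcome.
    """
    rng = random.Random(seed)
    prefers_hub = 0
    for _ in range(num_samples):
        dead_end_reward = rng.random()
        hub_rewards = [rng.random() for _ in range(hub_options)]
        if max(hub_rewards) > dead_end_reward:
            prefers_hub += 1
    return prefers_hub / num_samples


print(fraction_preferring_hub())
```

With three hub outcomes, about three quarters of randomly drawn reward functions make the hub the optimal choice, even though none of them values "having options" for its own sake. Keeping options open is useful for most goals, which is the intuition behind instrumental convergence.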

Core readings:

🔗

Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (10 mins)

🔗

Learning from Human Preferences: blog post (Christiano et al., 2017) (5 mins)

🔗

Learning to summarize with human feedback: blog post (Stiennon et al., 2020) (20 mins)

🔗

The alignment problem from a deep learning perspective (Ngo, Chan and Mindermann, 2022) (only section 2, on reward hacking) (10 mins)

🔗

Instrumental convergence reading: These readings change track a little from the previous ones this week. We are examining a hypothesis about another way that ML systems could maximise reward in unexpected ways: broadly, by seeking power. In other words, systems could develop 'instrumental' goals such as resource acquisition, self-preservation and self-enhancement.

For those with little background in ML:

Superintelligence: Instrumental convergence (Bostrom, 2014)

Understand what is meant by 'instrumental convergence' and the five goals that AI systems might internalise, which are 'instrumental' to achieving their overall goals (rather than being goals we intended them to pursue).

For those with extensive ML background:

Optimal Policies Tend To Seek Power (Turner et al., 2021)

Turner et al. flesh out the arguments from Bostrom (2014) by formalizing the notion of power-seeking in the reinforcement learning context, and proving that, in many environments, optimal policies for most reward functions tend to seek power. (See also the corresponding blog post and paper.)

🔗

What failure looks like (Christiano, 2019) (only part 1) (5 mins)

🔗

Inverse reinforcement learning example (Udacity, 2016) (5 mins)

🔗

The easy goal inference problem is still hard (Christiano, 2018) (5 mins)