Goal misgeneralization

This is based on the 2023 AGI Safety Fundamentals curriculum

Even without reward misspecification, the rewards used during training don’t allow us to control how agents generalize to new situations. This week covers scenarios in which agents in new situations behave competently yet undesirably because they learned the wrong goals during training: the problem of goal misgeneralisation.

The first two readings define and characterize goal misgeneralisation, which is closely related to what the alignment field calls inner misalignment. The two concepts are roughly equivalent but are defined in slightly different ways: goal misgeneralisation is defined in terms of behaviour in novel situations, while inner misalignment is defined in terms of the representations learned during training.

The following two readings explore specific hypotheses about how agents which have learned the wrong goals will behave: deceptively gaining high reward, and seeking power on a larger scale. We then finish with two readings on how these behaviours could lead to catastrophe.

Core readings:

Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals (Shah et al., 2022): both the blog post (10 mins) and sections 1-4 of the full paper (30 mins)

Shah et al. argue that even an agent trained on the 'right' reward function might learn goals which generalize in undesirable ways, and they provide both concrete and hypothetical illustrations of the phenomenon. Read the paper to get a deeper sense of their experiments and why they did them. In particular, the paper provides an empirical demonstration that an agent's learned goals can differ from those we would expect it to have given our training setup and its initially observed behaviour in test environments. This is distinct from standard “capabilities misgeneralization” because the agent still competently pursues an objective in the test environment. It nonetheless scores low reward there because its learned objective diverges from the one we trained it on.

For those with little ML experience:

The other alignment problem: mesa-optimisers and inner alignment (Miles, 2021) (starting from 2:27) (25 mins)

This video introduces the problem of 'inner misalignment', which refers to the problem of agents learning internal representations of the wrong goals. ‘Inner misalignment’ and ‘goal misgeneralization’ can be seen as roughly equivalent concepts, except that the former is typically defined in terms of internal learned representations, while the latter is typically defined in terms of misbehavior in new environments.

Why alignment could be hard with modern deep learning (Cotra, 2021) (20 mins)

Cotra presents one broad framing for why achieving alignment might be hard, tying together the ideas from the core readings in a more accessible way.

In particular, it does a good job of describing an agent's internal representations of objectives and how they could become misaligned from the objectives we intended to communicate under default training regimes.

For those with significant ML background:

Thought experiments provide a third anchor and ML systems will have weird failure modes (Steinhardt, 2022) (20 mins)

In the first post, Steinhardt gives some reasons for expecting thought experiments to be useful for thinking about how future machine learning systems will behave.

In the second post, he gives a specific example of a thought experiment on a hypothesized phenomenon called 'deceptive alignment'. He assumes that, during the training process, a neural network develops an internally-represented objective which diverges from the training objective. He then argues that the network will be incentivized to perform in a way that leads to that misaligned internally-represented objective being preserved during training, and that this could lead to sudden misbehavior during deployment.

The alignment problem from a deep learning perspective (Ngo, Chan and Mindermann, 2022) (only sections 3 and 4) (20 mins)

Although Shah et al. define goal misgeneralisation in terms of undesirable behaviour, this reading approaches the same idea differently, by reasoning about goals in terms of agents' internal representations.

This paper emphasises the difference between capability and goal misgeneralisation. It also extends the argument to theorise how goal misgeneralisation could be a mechanism through which 'power-seeking behaviour' could manifest and persist throughout training.

What failure looks like (Christiano, 2019) (only part 2) (10 mins)

Christiano describes a scenario in which goal misgeneralization could lead to catastrophe: AI systems whose goals are misaligned with what we intended could, after a change in environment, rapidly start to behave in highly undesirable ways.

Optional readings:

Goal misgeneralization/inner alignment:

An investigation of model-free planning (Guez et al., 2019) (only abstract and introduction) (5 mins)

Goal misgeneralization in deep reinforcement learning (Langosco et al., 2022) (ending after section 3.3) (25 mins)

What is inner misalignment? (Leike, 2022) (10 mins)

Risks from Learned Optimization (Hubinger et al., 2019) (only sections 1, 3 and 4) (80 mins)

Advanced artificial agents intervene in the provision of reward (Cohen et al., 2022) (45 mins)

Threat models:

Yudkowsky Contra Christiano On AI Takeoff Speeds (2022) (40 mins)

Is power-seeking AI an existential risk? (Carlsmith, 2021) (only sections 2: Timelines and 3: Incentives) (25 mins)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (Cotra, 2022)

AGI ruin: a list of lethalities (Yudkowsky, 2022) (50 mins)

Where I agree and disagree with Eliezer (Christiano, 2022) (30 mins)

Modeling the human trajectory (Roodman, 2020) (35 mins)

Notes:

The core idea behind the goal misgeneralization problem is that in RL training, although the reward function is used to update a policy’s behavior based on how well it performs tasks during training, the policy doesn’t refer to the reward function while carrying out any given task (e.g. playing an individual game of StarCraft). So the motivations which drive a policy’s behavior when performing tasks need not precisely match the reward function it was trained on. The most helpful thought experiments for understanding this are cases where the reward function is strongly correlated with some proxy objective (like resource acquisition or survival) during training, but then diverges from it in novel situations. This is analogous to how humans evolved to care directly about some former proxies for genetic fitness which are no longer robustly correlated with it (like eating sugar or having sex).
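The point that a policy never consults its reward function at execution time can be made concrete with a toy sketch. The Python below is not taken from any of the readings: the corridor environment, the function names and the hyperparameters are all assumptions chosen for illustration. During training a coin always sits at the right end of a corridor, so "reach the coin" and "go right" are perfectly correlated. The Q-table is indexed by the agent's position only, which stands in (as a simplifying assumption) for a policy that has learned to ignore the coin's location because it never varied in training.

```python
import random

N_CELLS = 7        # corridor cells 0..6; both ends are terminal
START = 3          # the agent always starts in the middle
MOVES = (-1, +1)   # action 0 = step left, action 1 = step right


def reward(pos, coin):
    """The designer's intended objective: 1 for ending on the coin, else 0."""
    return 1.0 if pos == coin else 0.0


def is_terminal(pos):
    return pos in (0, N_CELLS - 1)


def train(coin, episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy.

    The reward function is consulted here, and only here, to update q.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_CELLS)]  # q[position][action]
    for _ in range(episodes):
        pos = START
        while not is_terminal(pos):
            action = rng.randrange(2)
            nxt = pos + MOVES[action]
            target = reward(nxt, coin)
            if not is_terminal(nxt):
                target += gamma * max(q[nxt])
            q[pos][action] += alpha * (target - q[pos][action])
            pos = nxt
    return q


def greedy_action(q, pos):
    return 0 if q[pos][0] > q[pos][1] else 1


def rollout(q, coin):
    """Run the greedy policy, which reads only q and its own position.

    reward() is called purely to score the episode from the outside.
    """
    pos, path = START, [START]
    for _ in range(2 * N_CELLS):  # cap guards against a looping policy
        if is_terminal(pos):
            break
        pos += MOVES[greedy_action(q, pos)]
        path.append(pos)
    return reward(pos, coin), path


q = train(coin=N_CELLS - 1)            # training: coin always at the right end
print(rollout(q, coin=N_CELLS - 1))    # (1.0, [3, 4, 5, 6])
print(rollout(q, coin=0))              # (0.0, [3, 4, 5, 6])
```

In both rollouts the agent walks purposefully to the right end. When the coin is moved to the left end, its behaviour is unchanged and it earns zero reward: it competently pursues the proxy ("go right") rather than the objective the reward function encoded ("reach the coin").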
Another way of seeing the distinction between the outer and inner alignment problems is provided by Ortega et al. (2018): the outer alignment problem involves producing a design specification which corresponds to the ideal specification; and the inner alignment problem involves producing a revealed specification which corresponds to the design specification. The distinction is summarized in the diagram below.

Next in the AGI Safety Fundamentals curriculum

Task decomposition for scalable oversight
