Even without reward misspecification, the rewards used during training don’t allow us to control how agents generalize to new situations. This week we cover scenarios in which agents placed in new situations behave in competent yet undesirable ways because they learned the wrong goals during earlier training: the problem of goal misgeneralisation.
The first two readings define and characterize goal misgeneralisation (also known as inner misalignment in the alignment field). The two concepts are roughly equivalent, though they are defined in slightly different ways: goal misgeneralisation is defined in terms of behaviour in novel situations, while inner misalignment is defined in terms of the representations learned during training.
The following two readings explore specific hypotheses about how agents which have learned the wrong goals will behave: deceptively gaining high reward, and seeking power on a larger scale. We then finish with two readings on how these behaviours could lead to catastrophe.
- The other alignment problem: mesa-optimisers and inner alignment (Miles, 2021) (starting from 2:27) (25 mins)
- This video introduces the problem of 'inner misalignment', which refers to the problem of agents learning internal representations of the wrong goals. ‘Inner misalignment’ and ‘goal misgeneralisation’ can be seen as roughly equivalent concepts, except that the former is typically defined in terms of internal learned representations, while the latter is typically defined in terms of misbehaviour in new environments.
- Why alignment could be hard with modern deep learning (Cotra, 2021) (20 mins)
- Cotra presents one broad framing for why achieving alignment might be hard, tying together the ideas from the core readings in a more accessible way. In particular, the post does a good job of describing an agent's internal representations of objectives and how these could become misaligned with the objectives we intended to communicate under default training regimes.
- Thought experiments provide a third anchor and ML systems will have weird failure modes (Steinhardt, 2022) (20 mins)
- In the first post, Steinhardt gives some reasons for expecting thought experiments to be useful for thinking about how future machine learning systems will behave. In the second post, he gives a specific example of a thought experiment on a hypothesized phenomenon called 'deceptive alignment'. He assumes that, during the training process, a neural network develops an internally-represented objective which diverges from the training objective. He then argues that the network will be incentivized to perform in a way that leads to that misaligned internally-represented objective being preserved during training, and that this could lead to sudden misbehaviour during deployment.
- The alignment problem from a deep learning perspective (Ngo, Chan and Mindermann, 2022) (only sections 3 and 4) (20 mins)
- Although Shah et al. define goal misgeneralisation in terms of undesirable behaviour, this reading approaches the same idea differently, by reasoning about goals in terms of agents' internal representations. The paper emphasises the difference between capability misgeneralisation and goal misgeneralisation. It also extends the argument to theorise about how goal misgeneralisation could be a mechanism through which 'power-seeking behaviour' could emerge and persist throughout training.
- Christiano describes a scenario in which goal misgeneralisation could lead to catastrophe: AI systems' goals are misaligned with what we intended, and after a change in environment those systems could rapidly start to behave in highly undesirable ways.
Goal misgeneralisation/inner alignment:
- The core idea behind the goal misgeneralisation problem is that in RL training, although the reward function is used to update a policy’s behaviour based on how well it performs tasks during training, the policy doesn’t refer to the reward function while carrying out any given task (e.g. playing an individual game of StarCraft). So the motivations which drive a policy’s behaviour when performing tasks need not precisely match the reward function it was trained on. The best thought experiments to help understand this are cases where the reward function is strongly correlated with some proxy objective (like resource acquisition or survival) during training, but then diverges from it in novel situations; this is analogous to how humans evolved to care directly about some former proxies for genetic fitness which are no longer robustly correlated with it (like eating sugar or having sex).
- Another way of seeing the distinction between the outer and inner alignment problems is provided by Ortega et al. (2018): the outer alignment problem involves producing a design specification which corresponds to the ideal specification; and the inner alignment problem involves producing a revealed specification which corresponds to the design specification. The distinction is summarized in the diagram below.
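The proxy-divergence dynamic described above can be made concrete with a toy experiment. The following sketch (an illustrative construction of ours, not taken from any of the readings) trains a tabular Q-learning agent in a 1-D gridworld where the rewarded coin always sits at the rightmost cell during training. Because the agent's state is only its own position, never the coin's, "get the coin" and "go right" are indistinguishable proxies at training time; when the coin is moved at test time, the agent competently keeps going right:

```python
import random

N = 10          # length of a toy 1-D gridworld
START = N // 2  # agent starts in the middle

def run_episode(q, coin, train=False, eps=0.3):
    """Walk the gridworld for up to 50 steps; reward 1 for reaching the coin."""
    pos, total = START, 0.0
    for _ in range(50):
        if train and random.random() < eps:
            a = random.choice([-1, 1])                # explore
        else:
            a = 1 if q[pos][1] >= q[pos][-1] else -1  # act greedily
        new = max(0, min(N - 1, pos + a))
        r = 1.0 if new == coin else 0.0
        if train:
            # Standard tabular Q-learning update
            q[pos][a] += 0.5 * (r + 0.9 * max(q[new].values()) - q[pos][a])
        pos, total = new, total + r
        if r:
            break
    return total

random.seed(0)
q = {s: {-1: 0.0, 1: 0.0} for s in range(N)}

# Training: the coin is always at the rightmost cell, so the reward is
# perfectly correlated with the proxy objective "move right".
for _ in range(500):
    run_episode(q, coin=N - 1, train=True)

in_dist = run_episode(q, coin=N - 1)  # coin where it was during training
shifted = run_episode(q, coin=0)      # coin moved to the far left
print(in_dist, shifted)
```

The evaluation with the coin moved yields zero reward even though the policy is still fully capable of navigating the gridworld: capabilities generalise while the goal does not. This mirrors the CoinRun example commonly discussed in the goal misgeneralisation literature.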