Agent foundations or AI governance, and careers in alignment


This week, participants can choose between two topics: agent foundations research and AI governance research. Both streams will then spend the second half of the session covering careers in alignment.

The first two readings cover the agent foundations research agenda, which aims to develop better theoretical frameworks for describing AIs embedded in real-world environments. This agenda is pursued primarily by the Machine Intelligence Research Institute (MIRI).

The next two readings cover AI governance. In Clarke’s (2022) taxonomy (pictured below), they focus on strategy research, tactics research and field-building, rather than on developing, advocating for, or implementing specific policies. Those interested in exploring AI governance in more detail, including evaluating individual policies, should look at the curriculum for the parallel AI governance track of this course.


We finish with a compilation of resources related to careers in alignment.

Core readings:

Read two of the following three blog posts, which give brief descriptions of work on agent foundations.
  1. Logical induction: blog post (Garrabrant et al., 2016) (10 mins)
    1. Garrabrant et al. (2016) provide an idealized algorithm for induction under logical uncertainty (e.g. uncertainty about mathematical statements).
  2. Logical decision theory (Yudkowsky, 2017) (only up to the beginning of the “Evidential versus counterfactual conditioning” section) (10 mins)
    1. Yudkowsky (2017) outlines a novel decision theory which accounts for correlations between the decisions of different agents.
  3. Progress on causal influence diagrams: blog post (Everitt et al., 2021) (15 mins)
    1. Everitt et al. (2021) formalize the concept of an RL agent having an incentive to influence different aspects of its training setup.

Optional readings:

Agent foundations:

AI governance:



  1. “Accident” risks, as discussed in Dafoe (2020), include the standard risks due to misalignment which we’ve been discussing for most of the course. I don’t usually use the term, because “accident” has connotations of being non-deliberate, whereas misalignment risks would be driven by “deliberate” misbehavior from AIs.
  2. Compared with the approaches discussed over the last few weeks, agent foundations research is less closely connected to existing systems and more focused on developing new theoretical foundations for alignment. Given this, there are many disagreements about how relevant it is to deep-learning-based systems.

Other topics to check out