Researchers have proposed several paths by which advances in AI may cause especially long-lasting harms: harms that are especially worrying if we place significant weight on the long-term impacts of our actions. This week, we’ll consider: What are some of the best arguments for and against various claims about how AI poses existential risk?
For context on the field’s current perspectives on these questions, a 2020 survey of AI safety and governance researchers (Clarke et al., 2021) found that, on average, researchers currently guess there is:
- A 10% chance of existential catastrophe from misaligned, influence-seeking AI
- A 6% chance of existential catastrophe from AI-exacerbated war or AI misuse
- A 7% chance of existential catastrophe from “other scenarios”
Note that the above survey’s results involved high levels of uncertainty and disagreement. That disagreement implies that many researchers must be wrong about important questions, which arguably makes skeptical and questioning mindsets (with skepticism toward both your own views and others’) especially valuable.
Core Readings
Additional Recommendations
Analyses of arguments that misaligned, influence-seeking AI poses existential risk:
Clarifications of / elaborations on Christiano’s “What failure looks like,” and subsequent work:
Analyses of sources of risk not focused on misaligned, influence-seeking AI:
On other risks from harmful competitive situations:
[19] I emphasize mean (not median) estimates because mean estimates weigh all surveyed opinions equally, arguably making for more useful overall risk assessments (if two doctors told you some cold medicine was very safe, and one doctor told you it would likely kill you, should you pay more attention to the average of their views or the median?). (Some work suggests the geometric mean would be better; thanks to Michael Aird for flagging this to me. I don’t emphasize that value because I don’t know it and don’t have the raw survey data.)
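As a toy illustration of how these summary statistics can come apart, here is a short sketch with made-up numbers (not the survey’s data), echoing the doctor example above:

```python
# Toy illustration (made-up numbers, not survey data): how mean, median, and
# geometric mean summaries diverge when one respondent's risk estimate is
# much higher than the others'.
from statistics import mean, median, geometric_mean

# Hypothetical probability-of-death estimates from three doctors:
# two say the medicine is very safe, one says it would likely kill you.
estimates = [0.001, 0.001, 0.9]

print(f"mean:           {mean(estimates):.3f}")            # ~0.301 (pulled up by the dissenting view)
print(f"median:         {median(estimates):.3f}")          # 0.001  (ignores the dissenting view entirely)
print(f"geometric mean: {geometric_mean(estimates):.3f}")  # ~0.010 (between the two)
```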
I also emphasize unconditional risk estimates (rather than estimates which condition on some AI existential catastrophe having occurred and just ask what form that catastrophe took); this is because the unconditional estimates seem closer to what we care about (assuming investment into mitigating various risks should depend on the absolute level of risk they pose).
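To make the distinction concrete, here is a minimal sketch with hypothetical numbers (not taken from the survey) showing how a single respondent’s conditional estimate relates to their unconditional estimate:

```python
# Hypothetical single respondent (not survey data): converting a conditional
# estimate into the corresponding unconditional estimate.
p_any_ai_catastrophe = 0.20      # P(some AI existential catastrophe occurs)
p_scenarios_1_to_3_given = 0.50  # P(scenarios 1-3 | some AI existential catastrophe)

# The unconditional probability of scenarios 1-3 is the product of the two.
p_scenarios_1_to_3 = p_any_ai_catastrophe * p_scenarios_1_to_3_given
print(p_scenarios_1_to_3)  # 0.1
```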
Still, for transparency, here are the other statistics from the survey that I was shown (in addition to the ones in the published overview), shared here with permission (see the published overview for definitions of scenarios 1-5):
- Conditional probability for scenarios 1-3 together: Mean 49%; Median 50%.
- Unconditional probability for scenarios 1-3 together: Mean 10%; Median 6%.
- Unconditional probability for scenarios 4-5 together: Mean 7%; Median 2%.
- Unconditional probability for "Other scenarios": Mean 6%; Median 3%.
[20] Some of these summary statistics (shared with permission) are from the full results rather than the public overview.
[21] This is the mean of the total risk levels that respondents assigned to scenarios 1, 2, and 3 in the survey, scenarios that all (arguably) center on misaligned, influence-seeking AI. Note that, as discussed in the linked survey overview, there is also wide disagreement over which of scenarios 1, 2, and 3 is most likely.