Technical Challenges of AI Alignment

This is a link post for the 2023 AI Governance Curriculum

With many powerful technologies, misuse and indirect effects pose large risks. AI carries additional potential problems: partly because modern AI systems are opaque and rely on automated learning, they often behave in unintended ways. This is already causing problems, and avoiding it is an unsolved technical challenge. One type of unintended behavior is brittleness: AI systems can fail to act competently when placed in new environments. Another type is misalignment: AI systems can pursue unintended objectives, for example when they are trained to meet an objective that is inadequately specified. As AI systems become more capable, the stakes of avoiding unintended, harmful behavior grow.
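As a toy illustration of how an inadequately specified objective can reward the wrong behavior, consider the sketch below. Everything in it is hypothetical and invented for illustration (the cleaning-robot scenario, the `Outcome` fields, the two reward functions, and all numbers); it is a minimal sketch under those assumptions, not a model of any real AI system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical result of one action by a cleaning robot."""
    visible_mess: int  # what the specified (proxy) objective measures
    actual_mess: int   # what the designer actually cares about
    effort: int        # cost of the action to the agent


# Hypothetical actions and their outcomes (numbers chosen for illustration).
ACTIONS = {
    "clean_room": Outcome(visible_mess=0, actual_mess=0, effort=5),
    "cover_mess_with_rug": Outcome(visible_mess=0, actual_mess=10, effort=1),
    "do_nothing": Outcome(visible_mess=10, actual_mess=10, effort=0),
}


def proxy_reward(outcome: Outcome) -> int:
    """Inadequately specified objective: penalizes only the mess that is visible."""
    return -outcome.visible_mess - outcome.effort


def intended_reward(outcome: Outcome) -> int:
    """What the designer meant: penalizes the mess that actually remains."""
    return -outcome.actual_mess - outcome.effort


def best_action(reward_fn) -> str:
    """Return the action name that maximizes the given reward function."""
    return max(ACTIONS, key=lambda name: reward_fn(ACTIONS[name]))


if __name__ == "__main__":
    print("Optimizing the proxy objective picks:", best_action(proxy_reward))
    print("Optimizing the intended objective picks:", best_action(intended_reward))
```

In this toy setup, an optimizer that is competent at maximizing the proxy hides the mess rather than cleaning it, because the proxy cannot tell the two apart and hiding costs less effort. The optimizer is working as designed; the objective is what is underspecified.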

You may be wondering: Why may we expect AI systems to have their own goals or objectives in the first place? [11] To whose values are we aligning AI? [12] What do credible authorities think about the AI alignment problem? [13] Why might we want to focus on aligning powerful, goal-directed AI, instead of just not building it? [14] Why might engineers be foolish enough to build powerful, misaligned AI? [15] And if smart AIs will be smart enough to learn what humans value, why might that not solve the AI alignment problem? [16] This week will focus on other, more technical questions (while still aiming to be accessible to non-technical readers), so if you are wondering about any of those questions, please see the corresponding footnotes.

This week, we will examine introductions to various aspects of the AI alignment problem. We take this focus because this problem, while already appearing in existing AI systems, is also at the center of many (though far from all) worries about how AI may cause large-scale, long-lasting harms, and existing publications on these risks are relatively detailed. (Next week, we will examine arguments about how much risk this problem—and others—pose.)

Core Readings

Why AI alignment could be hard with modern deep learning (Cotra, 2021) (20 minutes)

This article introduces two ways in which modern AI techniques may create misaligned AI systems.

Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (15 minutes)

DeepMind researchers elaborate on the difficulty of adequately specifying objectives for AI systems. Failure to solve this problem may lead to what the first reading refers to as “sycophants.”

The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment [video] (Miles, 2021) (23 minutes)

People often find this confusing at first; these other posts offer other introductions, and there is also a peer-reviewed technical introduction.

A brief introduction to some approaches to AI alignment (2022) (13 minutes)

Preparing for AI: risks and opportunities [video] (Dafoe, 2017) (just watch from 11:33 to the slide about “The AI Race,” which is at 21:30) (10 minutes)

Additional Recommendations

Some technical introductions to foundational issues in AI alignment, published at top ML venues:

Defining and Characterizing Reward Hacking (Skalse et al., 2022)

Goal Misgeneralization in Deep Reinforcement Learning (Langosco et al., 2022)

Optimal policies tend to seek power (Turner et al., 2021)

On current approaches to making progress on the alignment problem:

2022 AGI Safety Fundamentals curriculum (Ngo, 2021), weeks 4-6

AI alignment landscape (Christiano, 2019) (30 minutes)

How might we align transformative AI if it’s developed very soon? (Karnofsky, 2022)

(My understanding of) What Everyone in Technical Alignment is Doing and Why (Larsen and Lifland, 2022) (46 minutes)

On what problem(s) alignment is dealing with:

Alignment (Ngo, 2021)

Clarifying “AI alignment” (Christiano, 2018)

Superintelligence, Chapter 7: The superintelligent will (Bostrom, 2014)

Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger et al., 2019)

Concrete Problems in AI Safety (Amodei et al., 2016)

On “how hard is alignment?”

AGI Ruin: A List of Lethalities (Yudkowsky, 2022)

Where I agree and disagree with Eliezer (Christiano, 2022) (partly a response to the above)

Ben Garfinkel on scrutinising classic AI risk arguments (Lempel et al., 2020) (see also the linked readings)

Takeaways from safety by default interviews (Bergal, 2020)

Book-length introductions:

The Alignment Problem: Machine Learning and Human Values (Christian, 2020)

Human Compatible: Artificial Intelligence and the Problem of Control (Russell, 2019)

More in-depth materials:

Paul Christiano’s AI alignment blog

Superintelligence (Bostrom, 2014)

Alignment Forum curated sequences

Annotated Bibliography of Recommended Materials, from UC Berkeley’s Center for Human-Compatible Artificial Intelligence

An overview of 11 proposals for building safe advanced AI (Hubinger, 2020)

[11] Current AI systems are not very “agentic”; they cannot make plans to achieve goals across a wide range of complex situations. So why might future AI systems be more like that? Here are a few arguments for thinking they will be:

Some may claim computers cannot be agentic in principle, but they are mistaken. The properties that make some human thinking agentic (planning, optimizing, strategically selecting actions) are cognitive processes; they are ways of turning inputs (information about our environment) into outputs (actions). Since all (finite) cognitive processes can be implemented on computers, it is in principle possible for computer programs to be agentic.
Many tasks that it would be profitable to eventually automate—such as those of company CEOs—involve planning skills and the strategic pursuit of goals. This means there are strong economic incentives to eventually create AI systems with these agentic types of thinking.

For more in-depth discussion, see Ngo (2020) and Carlsmith (2021) (especially Sections 1-3).

[12] Steering AI is of course not just a technical challenge—there are important, hard questions to answer about what values AI should be steered toward. At the same time, according to researchers focused on this, steering AI is not just a political or ethical challenge either; engineers simply don’t have technical methods that can reliably steer increasingly advanced AI systems. In other words, they worry that there is no good steering wheel for AI, and that any decisions or compromises about where to steer AI will be impossible to implement until AI is steerable.

[13] It is not necessarily clear who the credible authorities here are; “AI researchers” might be obvious candidates, but their speciality is AI research, not AI forecasting, policy, or ethics. Still, a 2016 survey of top AI researchers (p. 13) suggests that over half of these researchers, after being shown an argument for the importance of AI alignment, agree that it is at least a moderately important problem, although over half did not see it as especially urgent yet. More broadly, the AI alignment problem has been receiving increasing interest and recognition, including from some leading computer science academics, cutting-edge AI research laboratories, and major governments (representing both left- and right-wing political parties). As a research field, however, AI alignment remains fairly small, with about 100-200 researchers currently focused on it.

[14] Some argue that, instead of trying to align powerful, goal-pursuing AI, we should never build it. Researchers skeptical of this proposal often respond that it may not be up to us; there is a good chance that someone will eventually follow their economic incentives to build such AI, so we should be prepared.

[15] One common worry is that AI development and deployment decisions may have to be made under major technical uncertainty (over whether an AI system is robustly aligned) and intense competitive pressure—an environment that might not bode well for careful and safe decision-making.

[16] Sufficiently capable, general AI seems likely to learn what humans value, but there is a difference between an AI system knowing what humans value and an AI system having those values. Even if AI systems could learn what is morally right (an assumption that is philosophically controversial, since it assumes objective morality), it is unclear why they would have motives to act accordingly (unless they were programmed or trained to have those motives, which no one currently knows how to do).

[17] In fact, researchers have not yet found any way to reliably get AI systems to pursue any particular complex, real-world objective. (After all, we cannot just slot goals into machine learning models—that isn’t how machine learning works.)

[18] Some more recent work (e.g. this) argues for a more nuanced understanding of outer alignment and inner alignment problems, but the distinction discussed above seems like a good initial understanding.

Next in the AI Governance curriculum

Potential Extreme Risks from AI
