With many powerful technologies, misuse and indirect effects pose large risks. AI carries additional potential problems: partly because modern AI systems are opaque and rely on automated learning, they often behave in unintended ways. This already causes problems, and preventing it is an unsolved technical challenge. One type of unintended behavior is brittleness: AI systems can fail to act competently in environments that differ from those they were trained in. Another is misalignment: AI systems can pursue unintended objectives, for example when the objective they are trained on is inadequately specified. As AI systems become more capable, the stakes of avoiding unintended, harmful behavior grow.
You may be wondering: Why might we expect AI systems to have their own goals or objectives in the first place? To whose values are we aligning AI? What do credible authorities think about the AI alignment problem? Why might we want to focus on aligning powerful, goal-directed AI, instead of just not building it? Why might engineers be foolish enough to build powerful, misaligned AI? And if smart AIs will be smart enough to learn what humans value, why might that not solve the AI alignment problem? This week will focus on other, more technical questions (while still aiming to be accessible to non-technical readers), so if you are wondering about any of those questions, please see the corresponding footnotes.
This week, we will examine introductions to various aspects of the AI alignment problem. We take this focus because this problem, while already appearing in existing AI systems, is also at the center of many (though far from all) worries about how AI may cause large-scale, long-lasting harms, and existing publications on these risks are relatively detailed. (Next week, we will examine arguments about how much risk this problem, among others, poses.)
This article introduces two ways in which modern AI techniques may create misaligned AI systems.
DeepMind researchers elaborate on the difficulty of adequately specifying objectives for AI systems. Failure to solve this problem may lead to what the first reading refers to as “sycophants.”
Some technical introductions to foundational issues in AI alignment, published in top ML venues:
On current approaches to making progress on the alignment problem:
On what problem(s) alignment is dealing with:
On “how hard is alignment?”:
More in-depth materials:
 Current AI systems are not very “agentic”; they cannot make plans to achieve goals across a wide range of complex situations. So why might future AI systems be more like that? Here are a few arguments for thinking they will be:
- Some may claim that computers cannot be agentic in principle, but this is mistaken. The properties that make some human thinking agentic (planning, optimizing, strategically selecting actions) are cognitive processes: ways of turning inputs (information about the environment) into outputs (actions). Any finite cognitive process can be implemented on a computer, so agentic computer programs are possible in principle.
- Many tasks that it would be profitable to eventually automate—such as those of company CEOs—involve planning skills and the strategic pursuit of goals. This means there are strong economic incentives to eventually create AI systems with these agentic types of thinking.
 Steering AI is of course not just a technical challenge—there are important, hard questions to answer about what values AI should be steered toward. At the same time, according to researchers focused on this, steering AI is not just a political or ethical challenge either; engineers simply don’t have technical methods that can reliably steer increasingly advanced AI systems. In other words, they worry that there is no good steering wheel for AI, and that any decisions or compromises about where to steer AI will be impossible to implement until AI is steerable.
 It is not necessarily clear who the credible authorities here are; “AI researchers” might be obvious candidates, but their speciality is AI research, not AI forecasting, policy, or ethics. Still, a 2016 survey of top AI researchers (p. 13) suggests that over half of them, after being shown an argument for the importance of AI alignment, agreed that it is at least a moderately important problem, though over half did not yet see it as especially urgent. More broadly, the AI alignment problem has been receiving increasing interest and recognition, including from some leading computer science academics, cutting-edge AI research laboratories, and major governments (representing both left- and right-wing political parties). As a research field, however, AI alignment remains moderately small, with about 100-200 researchers currently focused on it.
 Some argue that, instead of trying to align powerful, goal-pursuing AI, we should simply never build it. Researchers skeptical of this proposal often respond that it may not be up to “us”: there is a good chance that someone will eventually follow the economic incentives to build such AI, so we should be prepared.
 One common worry is that AI development and deployment decisions may have to be made under major technical uncertainty (over whether an AI system is robustly aligned) and intense competitive pressure—an environment that might not bode well for careful and safe decision-making.
 Sufficiently capable, general AI seems likely to learn what humans value, but there is a difference between an AI system knowing what humans value and an AI system having those values. Even if AI systems could learn what is morally right (a philosophically controversial assumption, since it presupposes objective morality), it is unclear why they would be motivated to act accordingly (unless they were programmed or trained to have those motives, which no one currently knows how to do).
 In fact, researchers have not yet found any way to reliably get AI systems to pursue any particular complex, real-world objective. (We cannot simply write goals into machine learning models; designers specify a training objective, and models learn whatever behavior scores well on it.)
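This point can be made concrete with a toy sketch: in machine learning, the designer's only lever is the training objective, and a misspecified proxy objective gets optimized just as faithfully as a correct one would be. The recommender scenario and all numbers below are hypothetical, chosen purely for illustration.

```python
# Toy illustration (hypothetical data): designers do not write goals into a
# model; they choose a training objective, and the model optimizes whatever
# that objective measures, even when it diverges from what was intended.

def train_linear(xs, ys, steps=2000, lr=0.05):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * dw
        b -= lr * db
    return w, b

# Hypothetical recommender setting: the intended objective is article quality,
# but the measurable proxy used for training is clicks. In this toy data,
# sensationalism raises clicks while lowering quality.
sensationalism = [0.0, 0.25, 0.5, 0.75, 1.0]
clicks = [0.2, 0.4, 0.6, 0.8, 1.0]    # proxy: rises with sensationalism
quality = [1.0, 0.8, 0.6, 0.4, 0.2]   # intended: falls with sensationalism

w_proxy, _ = train_linear(sensationalism, clicks)
w_intended, _ = train_linear(sensationalism, quality)

# The proxy-trained model learns that more sensationalism is better (positive
# slope); training on the intended objective yields the opposite (negative slope).
print(w_proxy > 0, w_intended < 0)  # → True True
```

The takeaway is not that the optimizer failed; it succeeded perfectly at the objective it was given. The gap between the proxy and the intended goal is where the misalignment lives.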
 Some more recent work (e.g. this) argues for a more nuanced understanding of outer alignment and inner alignment problems, but the distinction discussed above seems like a good initial understanding.