Interpretability

This is based on the 2023 AGI Safety Fundamentals curriculum.

Our current methods of training capable neural networks give us little insight into how or why they function. This week we cover the field of interpretability, which aims to change this by developing methods for understanding how neural networks think.

This week’s curriculum starts with readings on mechanistic interpretability, a subfield of interpretability that aims to understand networks at the level of individual neurons. It then moves on to concept-based interpretability, which focuses on techniques for automatically probing (and potentially modifying) human-interpretable concepts stored within neural networks. Note that this week’s readings differ significantly depending on whether readers want to cover the topics in more or less technical detail.

Core readings:

Mechanistic interpretability readings:

For those with significant ML background:

Zoom In: an introduction to circuits (Olah et al., 2020) (35 mins)
Toy models of superposition (Elhage et al., 2022) (only sections 1 and 2) (30 mins)

For those with less ML background:

Feature visualization (Olah et al., 2017) (20 mins)

Feature visualization is a set of techniques for developing a qualitative understanding of what different neurons within a network are doing.

Zoom In: an introduction to circuits (Olah et al., 2020) (35 mins)

This reading also appears in the list above for those with significant ML background.

Concept-based interpretability readings:

For those with significant ML background:

Discovering latent knowledge in language models without supervision (Burns et al., 2022) (only sections 1-3) (30 mins)

This paper explores a technique for automatically identifying whether a model believes that statements are true or false, without requiring any ground-truth labels.
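
To give a feel for what "without ground truth" can mean in practice, here is a minimal sketch of the kind of objective the Burns et al. paper describes: a probe is scored on whether it assigns complementary probabilities to a statement and its negation (consistency) while avoiding the trivial answer of 0.5 for both (confidence). The activations here are synthetic stand-ins generated with NumPy, and the names `truth_dir` and `yes_offset` are hypothetical. The sketch only evaluates the loss for two hand-picked probes; it does not perform the optimization that the paper uses to find a probe.

```python
import numpy as np

def normalize(h):
    """Standardize each feature across the dataset (zero mean, unit variance)."""
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(theta, bias, h_pos, h_neg):
    """Unsupervised loss on contrast pairs: consistency plus confidence."""
    p_pos = sigmoid(h_pos @ theta + bias)  # probe output for "statement is true"
    p_neg = sigmoid(h_neg @ theta + bias)  # probe output for "statement is false"
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # the two should sum to 1
    confidence = np.minimum(p_pos, p_neg) ** 2  # penalize answering 0.5 to both
    return float(np.mean(consistency + confidence))

# Synthetic stand-in for hidden states (assumption: truth is linearly encoded).
rng = np.random.default_rng(0)
n, d = 500, 8
truth = rng.choice([-1.0, 1.0], size=n)  # hidden labels, never passed to the loss
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
yes_offset = rng.normal(size=d)  # stands in for the "Yes" vs "No" wording difference

h_pos = 0.3 * rng.normal(size=(n, d)) + np.outer(truth, truth_dir) + yes_offset
h_neg = 0.3 * rng.normal(size=(n, d)) - np.outer(truth, truth_dir) - yes_offset

# Normalizing each set separately removes the constant wording offset.
h_pos, h_neg = normalize(h_pos), normalize(h_neg)

zero_loss = ccs_loss(np.zeros(d), 0.0, h_pos, h_neg)          # uninformative probe
aligned_loss = ccs_loss(3.0 * truth_dir, 0.0, h_pos, h_neg)   # probe along truth_dir
print(f"uninformative probe: {zero_loss:.3f}, aligned probe: {aligned_loss:.3f}")
```

In this toy setting, a probe pointing along the encoded truth direction scores a lower loss than an uninformative one, even though the loss never sees the labels. Note that the loss cannot tell "true" from "false" on its own: flipping the sign of the probe gives an equally good score.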

For those with less ML background:

Probing a deep neural network (Alain and Bengio, 2018) (only sections 1 and 3) (15 mins)

This paper introduces the technique of linear probing, a crucial tool in concept-based interpretability.

Acquisition of chess knowledge in AlphaZero (McGrath et al., 2021) (only up to the end of section 2.1) (20 mins)

This paper provides a case study of using concept-based interpretability techniques to understand AlphaZero’s development of human chess concepts. The first two sections are a useful review of the field of interpretability.
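
As a rough illustration of linear probing, the technique introduced in the Alain and Bengio reading above, the sketch below trains a logistic-regression probe on frozen activations and checks how well it predicts a concept on held-out examples. The activations are synthetic stand-ins generated with NumPy rather than taken from a real network, and the helper names (`train_linear_probe`, `probe_accuracy`, `concept_direction`) are hypothetical. It assumes the concept is encoded along a single direction, which is exactly the situation a linear probe is designed to detect.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe on frozen activations by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # per-example gradient of cross-entropy w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of examples where the probe's prediction matches the label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())

# Synthetic stand-in for one layer's activations (assumption: a binary concept
# shifts the activations along a single fixed direction).
rng = np.random.default_rng(0)
n, d = 1000, 16
labels = rng.integers(0, 2, size=n).astype(float)
concept_direction = rng.normal(size=d)
concept_direction /= np.linalg.norm(concept_direction)
acts = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, concept_direction)

# Train on the first 800 examples, evaluate on the remaining 200.
w, b = train_linear_probe(acts[:800], labels[:800])
test_acc = probe_accuracy(acts[800:], labels[800:], w, b)
print(f"held-out probe accuracy: {test_acc:.2f}")
```

High held-out accuracy suggests the concept is linearly decodable from that layer. It does not by itself show that the network uses the concept in its computation.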

For everyone: