MIST Project
MIST: Scalable Mechanistic Interpretability for Safe and Trustworthy LLM Agents
Large language models are increasingly used as agents that can plan, reason, and use tools. Yet we still lack a clear understanding of how they produce their outputs — a mist that must be cleared before these systems can be safely deployed in high-stakes applications.
MIST develops scalable methods for mechanistic interpretability to understand the inner workings of LLM agents, and translates these insights into practical safety measures.
Aims
Aim 1: Fundamentals of Mechanistic Interpretability. We develop theory-driven, scalable methods for mechanistic interpretability. Our core hypothesis is that the functional substructures of neural networks are compositional, and that exploiting this latent compositionality is key to overcoming the field's current scalability bottleneck.
Aim 2: Mechanisms of LLM Agents. We identify the causal mechanisms underlying key agent capabilities: instruction following, tool use, planning, reasoning, and multi-agent communication. We test whether these mechanisms are universal across models, languages, and modalities.
Aim 3: Steering and Safety in Multi-Agent Systems. We build controlled multi-agent environments to study how interventions on individual agents propagate through the system. Through red teaming, including the deliberate introduction of misaligned or malfunctioning agents, we stress-test system robustness. Our goal is to develop procedures for issuing safety certificates to LLM agents that demonstrate robust behavior under these controlled perturbations.
Funding
MIST is funded by the Novo Nordisk Foundation through a RECRUIT grant for international recruitment (2026–2030).