We make advanced AI systems safer and more transparent by developing methods to understand model behavior, model internals, model interaction, and model population dynamics.
To this end, we build interpretability methods that reveal how AI systems work internally, and leverage these insights to ensure safe behavior when AI agents interact with the world and each other. Our work is organized around four themes:
Model Behavior. How do AI systems behave, and what do they know? We probe and evaluate model capabilities through benchmarks, behavioral experiments, and analysis of emergent structure in learned representations.
Model Internals. How do models represent what they know? We develop mechanistic interpretability methods to localize, isolate, and edit internal representations — from culture-specific neurons to tokenization effects on encoded knowledge.
Agentic & Multi-Agent Safety. What happens when AI agents interact with the world and each other? We investigate guardrails, steering, cooperation, and failure modes in multi-agent systems.
Efficient & Sustainable AI. AI safety research requires models we can actually inspect, and the planet requires that we train them responsibly. We develop methods to reduce the computational and environmental cost of AI systems, democratizing access to models that can be opened up for transparency and safety research.
A selection of our recent publications is listed below, organized by theme.
Model Behavior
Learning and communication pressures in neural networks: Lessons from emergent communication
Lukas Galke, Limor Raviv
Language Development Research 5(1) 2025
Deep neural networks and humans both benefit from compositional language structure
Lukas Galke, Yoav Ram, Limor Raviv
Nature Communications 15:10816 2024
Morphology Matters: Probing the Cross-linguistic Morphological Generalization Abilities of Large Language Models through a Wug Test
Anh Dang, Limor Raviv, Lukas Galke
Cognitive Modeling and Computational Linguistics Workshop at ACL 2024
Model Internals
Isolating Culture Neurons in Multilingual Large Language Models
Danial Namazifard, Lukas Galke Poech
IJCNLP-AACL Findings 2025
Abstract
Language and culture are deeply intertwined, yet it is so far unclear how and where multilingual large language models encode culture. Here, we build on an established methodology for identifying language-specific neurons and extend it to localize and isolate culture-specific neurons, carefully disentangling their overlap and interaction with language-specific neurons. To facilitate our experiments, we introduce MUREL, a curated dataset of 85.2 million tokens spanning six different cultures. Our localization and intervention experiments show that LLMs encode different cultures in distinct neuron populations, predominantly in upper layers, and that these culture neurons can be modulated independently of language-specific neurons and of neurons specific to other cultures. These findings suggest that cultural knowledge and propensities in multilingual language models can be selectively isolated and edited, promoting fairness, inclusivity, and alignment.
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5
Thao Anh Dang, Limor Raviv, Lukas Galke
ICNLSP 2025
Abstract
Morphology is a crucial factor for multilingual language modeling as it poses direct challenges for tokenization. Here, we seek to understand how tokenization influences the morphological knowledge encoded in multilingual language models. Specifically, we capture the impact of tokenization by contrasting a minimal pair of multilingual language models: mT5 and ByT5. The two models share the same architecture, training objective, and training data and only differ in their tokenization strategies: subword tokenization vs. character-level tokenization. Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that the models learn the morphological systems of some languages better than others and that morphological information is encoded in the middle and late layers. Finally, we show that languages with more irregularities benefit more from having a higher share of the pre-training data.
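To make the tokenization contrast in this abstract concrete, here is a minimal sketch of the two segmentation styles. It rests on assumptions: `TOY_VOCAB` and the greedy longest-match routine are hypothetical stand-ins for a learned subword vocabulary (they are not mT5's actual tokenizer), while the byte view mirrors the general idea of a tokenizer-free model such as ByT5, which operates on UTF-8 bytes.

```python
# Hypothetical toy vocabulary; a real subword vocabulary is learned from data.
TOY_VOCAB = {"un", "believ", "able", "house", "s"}


def byte_tokens(word):
    """Return the UTF-8 byte values of a word (one token per byte)."""
    return list(word.encode("utf-8"))


def greedy_subword_tokens(word, vocab):
    """Segment a word by greedy longest match against `vocab`.

    Falls back to a single character when no vocabulary entry matches,
    so the loop always makes progress.
    """
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


if __name__ == "__main__":
    for word in ["unbelievable", "houses", "Häuser"]:
        print(word, greedy_subword_tokens(word, TOY_VOCAB), byte_tokens(word))
```

In this toy setting the subword segmentation can line up with morpheme boundaries when the vocabulary happens to contain them, whereas the byte view yields longer sequences with no built-in boundaries; non-ASCII characters such as "ä" occupy more than one byte.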
Agentic & Multi-Agent Safety
Guarded Query Routing for Large Language Models
Richard Šléher, William Brach, Tibor Sloboda, Kristián Košťál, and Lukas Galke
ECAI 2025
Super-additive Cooperation in Language Model Agents
Filippo Tonini, Lukas Galke
3rd International Conference on Frontiers of Artificial Intelligence, Ethics, and Multidisciplinary Applications (FAIEMA) 2025
Efficient & Sustainable AI
DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes
Mogens From, Jacob Nielsen, Lukas Galke, and Peter Schneider-Kamp
AAAI 2026
Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?
Jacob Nielsen, Peter Schneider-Kamp, and Lukas Galke
ACL Findings 2025
Abstract
Large language models (LLMs) require immense resources for training and inference. Quantization, a technique that reduces the precision of model parameters, offers a promising solution for improving LLM efficiency and sustainability. While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs with 1.58 bits per weight parameter from scratch can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. Here, we investigate a training strategy for quantization-aware pre-training in which models are first trained with 16-bit precision and then transition to 1.58-bit quantization-aware training. Our results on 11 downstream tasks show that this 16-to-1.58-bit training strategy is preferable to full 1.58-bit training and yields models closer to those trained with 16-bit precision. We further investigate the effects of retaining the optimizer state at the transition point and of gradually phasing in quantization strength, finding that both techniques reduce the magnitude of loss spikes, but also that these effects can be compensated for through further training.
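The transition described in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the absmean ternary quantizer follows the scheme commonly associated with 1.58-bit BitNet models, and the linear blend controlled by `strength`, along with the names `transition_step` and `phase_in_steps`, is a hypothetical way to phase in quantization, not necessarily the schedule used in the paper.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize floats to {-1, 0, +1} with one shared scale (mean absolute value)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def blended_weights(weights, strength):
    """Interpolate between full-precision weights (strength=0.0)
    and their ternary reconstruction (strength=1.0)."""
    ternary, scale = absmean_ternary(weights)
    return [
        (1.0 - strength) * w + strength * t * scale
        for w, t in zip(weights, ternary)
    ]


def quantization_strength(step, transition_step, phase_in_steps):
    """Hypothetical linear schedule: 0.0 before the transition point,
    then a linear ramp to 1.0 over `phase_in_steps` steps."""
    if step < transition_step:
        return 0.0
    if phase_in_steps <= 0:
        return 1.0  # abrupt switch to full quantization
    return min(1.0, (step - transition_step) / phase_in_steps)


if __name__ == "__main__":
    w = [0.9, -0.05, -1.2, 0.3]
    for step in (50, 100, 110, 130):
        s = quantization_strength(step, transition_step=100, phase_in_steps=20)
        print(step, s, blended_weights(w, s))
```

In actual quantization-aware training the blended weights would be used in the forward pass while gradients update the full-precision copy (typically via a straight-through estimator); that part is omitted here to keep the sketch dependency-free.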
When are 1.58 bits enough? A Bottom-up Exploration of Quantization-aware Training with Ternary Weights
Jacob Nielsen, Lukas Galke, and Peter Schneider-Kamp
17th International Conference on Agents and Artificial Intelligence (ICAART) 2025
Abstract
Contemporary machine learning models, such as language models, are powerful but come with immense resource requirements at both training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with, or sometimes even better than, standard 32/16-bit training.
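The figure of 1.58 bits per weight follows from the information content of a three-valued weight: log2(3) is approximately 1.58. The sketch below computes that number and, as an illustration only, packs five ternary values into a single byte using base-3 digits (3^5 = 243 fits into 256). The packing helpers are hypothetical and are not claimed to be the storage format of any particular implementation.

```python
import math


def bits_per_weight(num_states):
    """Information content in bits of a weight with `num_states` equally likely values."""
    return math.log2(num_states)


def pack_ternary(values):
    """Pack up to five values from {-1, 0, +1} into one integer in 0..242."""
    if len(values) > 5:
        raise ValueError("at most five ternary values fit into one byte")
    code = 0
    for v in reversed(values):
        code = code * 3 + (v + 1)  # map -1, 0, +1 to base-3 digits 0, 1, 2
    return code


def unpack_ternary(code, count):
    """Inverse of pack_ternary: recover `count` ternary values."""
    values = []
    for _ in range(count):
        code, digit = divmod(code, 3)
        values.append(digit - 1)
    return values


if __name__ == "__main__":
    print(round(bits_per_weight(3), 2))
    packed = pack_ternary([1, 0, -1, 0, 1])
    print(packed, unpack_ternary(packed, 5))
```

Packing five weights per byte costs 8 / 5 = 1.6 bits per weight, which is close to the theoretical 1.58 bits.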