Publications
Publications by Ilija Lichkovski — research papers on AI safety, mechanistic interpretability, reinforcement learning, and LLM agents.
2025
- EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law
  Ilija Lichkovski, Alexander Müller, Mariam Ibrahim, and Tiwai Mhundwa
  In NeurIPS 2025 Workshop on Regulatable ML, 2025
Large language models (LLMs) are increasingly deployed as agents in various contexts, with tools placed at their disposal. However, LLM agents can exhibit unpredictable behaviors, including taking undesirable and/or unsafe actions. To measure the latent propensity of LLM agents for taking illegal actions under an EU legislative context, we introduce EU-Agent-Bench, a verifiable, human-curated benchmark that evaluates an agent’s alignment with EU legal norms in situations where benign user inputs could lead to unlawful actions. Our benchmark spans scenarios across several categories, including data protection, bias/discrimination, and scientific integrity, with each user request allowing for both compliant and non-compliant execution of the requested actions. Comparing the model’s function calls against a rubric exhaustively supported by citations of the relevant legislation, we evaluate the legal compliance of frontier LLMs, and we further investigate the effect on compliance of providing the relevant legislative excerpts in the agent’s system prompt along with explicit instructions to comply. We release a public preview set for the research community, while holding out a private test set to prevent data contamination when evaluating upcoming models. We encourage future work extending agentic safety benchmarks to different legal jurisdictions and to multi-turn and multilingual interactions.
@inproceedings{lichkovski2025euagentbench,
  title = {EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law},
  author = {Lichkovski, Ilija and M{\"u}ller, Alexander and Ibrahim, Mariam and Mhundwa, Tiwai},
  booktitle = {NeurIPS 2025 Workshop on Regulatable ML},
  year = {2025},
}
- The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
  Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, and Clement Neo
  In NeurIPS 2025 Mechanistic Interpretability Workshop, 2025
Spotlight
Prevailing alignment methods induce opaque parameter changes, obscuring what models truly learn. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically demonstrate that this mechanism is expressive enough to approximate the behavioral shifts of post-training processes. We then apply FSRL to preference optimization and perform a causal analysis of the learned policy. Our analysis reveals a crucial insight: the model learns to reward stylistic presentation as a proxy for quality, disproportionately relying on features related to style and formatting over those tied to alignment concepts like honesty. By effectively optimizing the preference objective, FSRL serves as a transparent proxy for observing the alignment process. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
@inproceedings{ferrao2025fsrl,
  title = {The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features},
  author = {Ferrao, Jeremias and van der Lende, Matthijs and Lichkovski, Ilija and Neo, Clement},
  booktitle = {NeurIPS 2025 Mechanistic Interpretability Workshop},
  year = {2025},
}