EU-Agent-Bench

Expert-curated, verifiable benchmark for LLMs in an EU context

Actions speak louder than words: our expert-curated, verifiable benchmark compares LLMs in an EU context, in an agentic setting (where they must use tools and take actions), and in benign scenarios.

Status: Accepted at the Regulatable ML workshop at NeurIPS 2025.

Early work on EU-Agent-Bench 2 is underway.

Skills: AI Agents, Data engineering

Time period: September 23, 2025