EU-Agent-Bench
Expert-curated, verifiable benchmark for LLMs in an EU context
Actions speak louder than words: this expert-curated, verifiable benchmark compares LLMs in an EU context, in an agentic setting (where they must use tools and take actions), and in benign scenarios.
Status: Accepted at the Regulatable ML workshop at NeurIPS 2025.
Early work on EU-Agent-Bench 2 is underway.
Skills: AI Agents, Data engineering
Time period: September 23, 2025