Compositional reasoning

Benchmarking LLM generalization with tunable difficulty axes

Can LLMs generalize out of distribution? In this work, I define a formalism, and build a framework around it, for generating text-based compositionality benchmarks with tunable difficulty axes, which can be used to assess an LLM’s generalization capacity.
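To make the idea concrete, here is a minimal sketch of what such a generator could look like. Everything below is illustrative and hypothetical: the project's actual formalism is not specified in this summary. In this toy version, tasks are compositions of simple string primitives, and the composition depth serves as one tunable difficulty axis.

```python
import random

# Hypothetical toy primitives; the real benchmark's task space is not
# described here.
PRIMITIVES = {
    "reverse": lambda s: s[::-1],
    "upper": lambda s: s.upper(),
    "double": lambda s: s + s,
}

def compose(names, arg):
    """Apply the named primitives to `arg`, left to right."""
    for name in names:
        arg = PRIMITIVES[name](arg)
    return arg

def make_task(depth, rng):
    """Sample a compositional task; larger `depth` means a harder task."""
    names = [rng.choice(sorted(PRIMITIVES)) for _ in range(depth)]
    arg = "".join(rng.choice("abcde") for _ in range(4))
    prompt = f"Apply {' then '.join(names)} to '{arg}'."
    return prompt, compose(names, arg)

prompt, answer = make_task(depth=3, rng=random.Random(0))
```

A benchmark built this way can hold the training distribution to shallow depths and evaluate at greater depths, directly probing out-of-distribution generalization along that axis.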

The hope is that such a benchmark better aligns the inductive priors of humans and LLMs, allowing for a more faithful comparison between the two.

Status: In Progress

Skills: Mathematics, Python