Policy optimization over SAE features

Using RL to steer sparse autoencoder features as an alternative to RLHF

My friends at AISIG and I are researching an alternative to RLHF: using reinforcement learning (RL) to steer sparse autoencoder (SAE) features directly. So far, we've found a causal, mechanistic interpretation of how RLHF influences style more than alignment.
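The core idea can be sketched in miniature. The snippet below is a toy illustration only, not the project's actual method: it stands in an SAE decoder and a frozen layer with random tensors, picks a hypothetical feature index, and optimizes a learnable steering coefficient `alpha` that scales that feature's decoder direction added to the residual stream. For simplicity it uses direct gradient ascent on a toy reward rather than a full RL algorithm; all names (`W_dec`, `feature_idx`, `alpha`, the reward) are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

d_model, n_features = 16, 64

# Toy stand-ins (assumed shapes): an SAE decoder and a frozen "model" layer.
W_dec = torch.randn(n_features, d_model)      # SAE decoder: feature -> residual direction
layer = torch.nn.Linear(d_model, d_model)
layer.requires_grad_(False)                   # the base model stays frozen

feature_idx = 3                               # hypothetical feature to steer
alpha = torch.zeros(1, requires_grad=True)    # learnable steering strength (the "policy")
opt = torch.optim.Adam([alpha], lr=0.1)

def steered_forward(x):
    # Add the chosen SAE feature's decoder direction, scaled by alpha,
    # to the residual stream before the layer (hook-style intervention).
    return layer(x + alpha * W_dec[feature_idx])

# Toy reward: alignment of the output with an arbitrary target direction.
target = torch.randn(d_model)

for _ in range(100):
    x = torch.randn(8, d_model)
    reward = (steered_forward(x) @ target).mean()
    loss = -reward                            # ascend the reward
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the real setting, the frozen layer would be a hook point inside a language model, the reward would come from a preference or task signal, and the update would use an RL objective; only the steering coefficients on SAE features are trained, leaving the base model's weights untouched.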

Status: Paper accepted as a spotlight at the MechInterp Workshop @ NeurIPS 2025.

Skills: Bash, Evaluations, Experimental, Python

Time period: September 23, 2025