Policy optimization over SAE features
Using RL to steer sparse autoencoder features as an alternative to RLHF
With my friends at AISIG, I am researching an alternative to RLHF that uses reinforcement learning (RL) to steer sparse autoencoder (SAE) features directly. So far, we have found causal, mechanistic evidence that RLHF influences style more than alignment.
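To make the core idea concrete, here is a toy sketch of what "steering SAE features with RL" can look like: add scaled SAE decoder directions to a residual-stream activation, and optimize the steering coefficients against a scalar reward with a simple evolution-strategies-style policy gradient. Everything below (dimensions, the decoder matrix, the reward) is invented for illustration and is not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_feat = 16, 4
# Hypothetical SAE decoder directions (unit-normalized rows), not a real SAE.
W_dec = rng.standard_normal((n_feat, d))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
h = rng.standard_normal(d)      # a stand-in residual-stream activation
target = W_dec[2]               # toy reward: align the output with feature 2

def reward(c):
    # Steer by adding coefficient-weighted decoder directions to h.
    steered = h + c @ W_dec
    return float(steered @ target / np.linalg.norm(steered))

# Antithetic ES policy gradient over the steering coefficients.
mu, sigma, lr = np.zeros(n_feat), 0.1, 0.1
for _ in range(200):
    eps = rng.standard_normal(n_feat)
    r_plus = reward(mu + sigma * eps)
    r_minus = reward(mu - sigma * eps)
    mu += lr * (r_plus - r_minus) / (2 * sigma) * eps
```

After training, `reward(mu)` exceeds the unsteered baseline `reward(np.zeros(n_feat))`: the policy learns which feature directions to amplify. The real work replaces the toy reward with a learned preference signal and the toy activation with a language model's residual stream.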
Status: Paper accepted as a spotlight at the MechInterp Workshop @ NeurIPS 2025.
Skills: Bash, Evaluations, Experimental, Python
Time period: September 23, 2025