Policy optimization over SAE features
Using RL to steer sparse autoencoder features as an alternative to RLHF
With my friends at AISIG, I am researching an alternative to RLHF that uses reinforcement learning (RL) to steer sparse autoencoder (SAE) features directly. So far, we have found causal, mechanistic evidence that RLHF influences style more than alignment.
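To make the core idea concrete, here is a toy sketch of what "steering SAE features with RL" can look like: add scaled SAE decoder directions to a residual-stream activation, and optimize the steering coefficients against a scalar reward with a simple evolution-strategies-style policy gradient. Everything below (dimensions, the decoder matrix, the reward) is invented for illustration and is not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_feat = 16, 4
# Hypothetical SAE decoder directions (unit-normalized rows), not a real SAE.
W_dec = rng.standard_normal((n_feat, d))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
h = rng.standard_normal(d)      # a stand-in residual-stream activation
target = W_dec[2]               # toy reward: align the output with feature 2

def reward(c):
    # Steer by adding coefficient-weighted decoder directions to h.
    steered = h + c @ W_dec
    return float(steered @ target / np.linalg.norm(steered))

# Antithetic ES policy gradient over the steering coefficients.
mu, sigma, lr = np.zeros(n_feat), 0.1, 0.1
for _ in range(200):
    eps = rng.standard_normal(n_feat)
    r_plus = reward(mu + sigma * eps)
    r_minus = reward(mu - sigma * eps)
    mu += lr * (r_plus - r_minus) / (2 * sigma) * eps
```

After training, `reward(mu)` exceeds the unsteered baseline `reward(np.zeros(n_feat))`: the policy learns which feature directions to amplify. The real work replaces the toy reward with a learned preference signal and the toy activation with a language model's residual stream.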
Status: Paper accepted as a spotlight at the MechInterp Workshop @ NeurIPS 2025.
Skills: Bash, Evaluations, Experimental, Python
Time period: September 23, 2025