8.
Improving Activation Oracles for interpretability
The work targets vagueness and hallucination in natural-language queries over model activations, rather than only optimizing benchmark scores.
2 appearances on the backlist front page in the last 30 days.
I had a lot of fun working on this paper - we found an elegant story for why subliminal learning happens! A key intuition in interpretability is that basically every interesting phenomenon in LLMs boils down to adding a steering vector. Sub