@NeelNanda5 on Backlist

65.

This was a fascinating project - turns out that LLMs inherit a lot of traits from LLMs they're distilled from, in…

This was a fascinating project - turns out that LLMs inherit a lot of traits from LLMs they're distilled from, including in subtle ways without clear semantic meaning. This has pretty interesting implications - safety problems in a model in

by @NeelNanda5 (Neel Nanda) · backlist 2026-06-15 · rubric 73.0

33.

At the start of this project I assumed that to fix misalignment we mainly needed to intervene on the RL stage of …

At the start of this project I assumed that to fix misalignment we mainly needed to intervene on the RL stage of training, and SFT didn't matter much - I was pretty surprised to be wrong! I think these results will plausibly change over t

by @NeelNanda5 (Neel Nanda) · backlist 2026-06-13 · rubric 86.0

8.

Improving Activation Oracles for interpretability

The work targets vagueness and hallucination in natural-language queries over model activations rather than only optimizing benchmark scores.

by @NeelNanda5 (Neel Nanda) · backlist 2026-06-05 · rubric 0.0

44.

I had a lot of fun working on this paper - we found an elegant story for why subliminal learning happens!

I had a lot of fun working on this paper - we found an elegant story for why subliminal learning happens! A key intuition in interpretability is that basically every interesting phenomenon in LLMs boils down to adding a steering vector. Sub

by @NeelNanda5 (Neel Nanda) · backlist 2026-06-03 · rubric 78.0