Probably the best blog I have read in some time. Viewing SFT, RL, and OPD as different ways of reshaping a model's distribution makes their tradeoffs super intuitive.
- SFT pulls toward a fixed external target
- RL moves along the reward