59.
Targeted RL with textual feedback sounds interesting, basically self-distill from a model with hint to one withou…
Targeted RL with textual feedback sounds interesting, basically self-distill from a model with hint to one without hint, creating dense reward signal alongside the super long rollout.