43.
Using off-policy prefixes (rollouts from another model) gives the game away: the model would learn to distinguish off-policy from on-policy text even better than it already does. You would get higher eval awareness, not lower, even though it would be bette