49.
More than a year ago (x.com)
More than a year ago @TacoCohen in SPO https://arxiv.org/abs/2503.05453 already derives a critic-free value/Q parameterization from the policy-reference log-ratio under KL-regularized RL by exactly “deriving value loss through policy ra
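
For readers unfamiliar with the idea, here is a minimal sketch of the standard identity behind a "value/Q from the policy-reference log-ratio" parameterization, in the simplest one-step (bandit) setting. It is an illustration under stated assumptions and not necessarily the exact formulation used in the linked paper: the function names, the toy rewards, and the temperature `beta` are all hypothetical. With a KL penalty of strength `beta`, the optimal policy is proportional to `pi_ref(a) * exp(r(a) / beta)`, so `Q(a) - V = beta * log(pi(a) / pi_ref(a))` and no separate critic network is needed to recover Q.

```python
import math

def soft_optimal_policy(rewards, ref_probs, beta):
    """One-step KL-regularized optimum: pi*(a) proportional to pi_ref(a) * exp(r(a) / beta).

    Returns the optimal policy and the soft value V* = beta * log E_ref[exp(r / beta)].
    """
    weights = [p * math.exp(r / beta) for r, p in zip(rewards, ref_probs)]
    z = sum(weights)
    policy = [w / z for w in weights]
    value = beta * math.log(z)
    return policy, value

def q_from_log_ratio(policy, ref_probs, value, beta):
    """Recover Q(a) = V + beta * log(pi(a) / pi_ref(a)) without a separate critic."""
    return [value + beta * math.log(p / q) for p, q in zip(policy, ref_probs)]

# Toy numbers, chosen only for illustration (assumptions, not from the source).
rewards = [1.0, 0.0, -0.5]
ref_probs = [0.2, 0.5, 0.3]
beta = 0.7

policy, value = soft_optimal_policy(rewards, ref_probs, beta)
recovered_q = q_from_log_ratio(policy, ref_probs, value, beta)
```

In this one-step case Q(a) is just the reward r(a), so `recovered_q` should match `rewards` up to floating-point error. The multi-step sequence setting discussed in the post replaces r(a) with a soft Q-function, which this sketch does not cover.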