9.
Variance Reduction for Long-Horizon LLM RL Without a Critic (t.co)
Token-level credit assignment without a value network is a useful step toward cheaper long-horizon training
1 appearance on the backlist front page in the last 30 days.