SDPO++ for Continual Learning
Day 5 of Trajectory: we modify Self-Distillation Policy Optimization (SDPO) for long-horizon agentic tasks. SDPO is a promising route. It learns from a single trajectory, with no group required and failures still p