41.
On-policy Distillation (OPD) can suffer from mode-seeking behavior due to its reverse KL objective. In our recent work, we address this by augmenting OPD with a forward KL term. Please check out @wg_jin02's post for more details!
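
To make the mode-seeking point concrete, here is a toy sketch on a four-token distribution. It is not the method from the work mentioned above: the teacher and student distributions, the helper names, and the mixing weight `alpha` are all assumptions chosen for illustration. It compares a student that collapses onto one teacher mode with a student that spreads its mass, under reverse KL, forward KL, and an equally weighted sum of the two.

```python
import math

def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists of probabilities."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def reverse_kl(student, teacher):
    # KL(student || teacher): the expectation is taken under the student.
    return kl(student, teacher)

def forward_kl(student, teacher):
    # KL(teacher || student): the expectation is taken under the teacher.
    return kl(teacher, student)

def combined_loss(student, teacher, alpha=0.5):
    # Hypothetical mixing weight alpha; the actual weighting in the
    # referenced work is not stated in the post.
    return (1 - alpha) * reverse_kl(student, teacher) + alpha * forward_kl(student, teacher)

# Toy bimodal teacher over four tokens (illustrative values only).
teacher = [0.48, 0.02, 0.02, 0.48]
# Student that collapses onto a single teacher mode.
q_mode = [0.94, 0.02, 0.02, 0.02]
# Student that spreads its mass over all tokens.
q_cover = [0.25, 0.25, 0.25, 0.25]

for name, q in [("mode-collapsed", q_mode), ("mass-covering", q_cover)]:
    print(
        f"{name:15s} reverse={reverse_kl(q, teacher):.3f} "
        f"forward={forward_kl(q, teacher):.3f} "
        f"combined={combined_loss(q, teacher):.3f}"
    )
```

In this toy setting, reverse KL alone scores the collapsed student better, forward KL alone scores the spread-out student better, and with `alpha = 0.5` the combined objective also favors the spread-out student. Dropping an entire teacher mode is penalized once the forward term is present.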