45.
A quick repro on this: https://github.com/shuminghu/next lat …
2-layer transformer trained at seq_len 12 or 36 fails at seq_len 36 at test.
1-layer dynamics model (RNN) co-trained with transformer (1-step next hidden prediction) at seq_l