https://arxiv.org/abs/2605.28079 A long-context benchmark suite that aggregates previous benchmarks.
Meta’s report describes multi-datacenter training techniques, including a pipeline-parallel schedule designed to work with ZeRO-2/3-style optimization.
https://arxiv.org/abs/2605.22769 Could it be better to pretrain on temporally ordered data? It could bias the model towards recent information. I have wondered whether, when information is updated or changed over time, the model is able to
https://arxiv.org/abs/2605.21486 On the importance of the embedding learning rate for hyperparameter transfer and training stability. It aligns with previous work (https://arxiv.org/abs/2407.05872), and maybe with older work (https://arxiv.org/a
https://arxiv.org/abs/2605.20798 A validation of transformer modifications, similar to https://arxiv.org/abs/2102.11972 but with modern modifications. It is nice to see Bonferroni correction used here. Maybe the setup that has been used here c
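For reference, Bonferroni correction tests each of m comparisons at alpha / m so that the family-wise error rate stays at or below alpha. A minimal sketch, with made-up p-values and a hypothetical helper name (this is not the paper's code or data):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one reject/keep decision per hypothesis under Bonferroni correction."""
    if not p_values:
        raise ValueError("need at least one p-value")
    # Each individual test is held to the stricter threshold alpha / m.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Illustrative p-values for four hypothetical architecture variants.
p_values = [0.001, 0.02, 0.04, 0.2]
decisions = bonferroni_reject(p_values)
print(decisions)
```

With four comparisons the per-test threshold drops from 0.05 to 0.0125, so only the first variant would still count as significant.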
https://arxiv.org/abs/2605.15220 Using LoRAs to determine the dataset mixture. In a continual training setup, when new datasets are introduced, it is possible to train LoRAs on them and combine them with a LoRA trained on the previous datasets.
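A rough sketch of what combining LoRAs could look like, assuming the combination is a weighted sum of the low-rank updates B·A added to the base weight. The weighting scheme, the helper names, and the toy matrices are all assumptions for illustration; the paper's actual procedure may differ.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_loras(base, adapters, weights):
    """Return base + sum_i weights[i] * (B_i @ A_i) without modifying base."""
    merged = [row[:] for row in base]
    for (B, A), w in zip(adapters, weights):
        delta = matmul(B, A)  # low-rank update with the same shape as base
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]
# Rank-1 adapters: B is 2x1 and A is 1x2 (toy values).
old_data_adapter = ([[1.0], [0.0]], [[0.0, 2.0]])
new_data_adapter = ([[0.0], [1.0]], [[4.0, 0.0]])
merged = merge_loras(base, [old_data_adapter, new_data_adapter], [0.5, 0.25])
print(merged)
```

Under this reading, the mixture weights (0.5 and 0.25 here) are the knobs one would sweep to probe different dataset mixtures without retraining.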
https://arxiv.org/abs/2605.15422 Kernel-level implementation of prefix grouping for group-based RL.
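The motivation can be seen with simple bookkeeping: in group-based RL several responses are sampled for the same prompt, so the prompt tokens only need to be processed once per group. The sketch below just counts tokens with and without prefix sharing; it is not the kernel from the paper, and the function names and toy rollouts are hypothetical.

```python
from collections import defaultdict


def group_by_prefix(rollouts):
    """Map each distinct prompt (as a tuple of token ids) to its list of responses."""
    groups = defaultdict(list)
    for prompt, response in rollouts:
        groups[tuple(prompt)].append(response)
    return dict(groups)


def token_counts(rollouts):
    """Return (tokens without sharing, tokens when each prompt is processed once)."""
    naive = sum(len(prompt) + len(response) for prompt, response in rollouts)
    groups = group_by_prefix(rollouts)
    shared = sum(
        len(prompt) + sum(len(response) for response in responses)
        for prompt, responses in groups.items()
    )
    return naive, shared


prompt = [11, 12, 13, 14]  # one 4-token prompt
rollouts = [(prompt, [1, 2]), (prompt, [3, 4, 5]), (prompt, [6])]  # group of 3 samples
naive, shared = token_counts(rollouts)
print(naive, shared)
```

For this toy group the count drops from 18 to 10 tokens, and the gap grows with longer prompts and larger group sizes.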
https://arxiv.org/abs/2605.16147 It's interesting that DiT does not have outlier tokens (maybe because the noise makes it hard to anchor on specific tokens?), yet register tokens are still beneficial, especially for pixel-level models.
A pretraining evaluation for predicting post-training performance. It is rubric-based: it evaluates whether the model can discriminate responses that follow the rubrics from those that do not.
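One way such a discrimination metric could be computed, assuming each item pairs a rubric-following response with a rubric-violating one and the model assigns a scalar score (for example a log-likelihood) to each. The scoring interface and the toy scores below are assumptions, not the benchmark's actual protocol.

```python
def discrimination_accuracy(pairs, score):
    """Fraction of pairs where the rubric-following response gets the higher score."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(1 for follows, violates in pairs if score(follows) > score(violates))
    return hits / len(pairs)


# Hypothetical per-response scores; "+" follows the rubric, "-" violates it.
toy_scores = {
    "a+": -1.0, "a-": -2.5,
    "b+": -3.0, "b-": -1.5,
    "c+": -0.5, "c-": -0.7,
    "d+": -2.0, "d-": -4.0,
}
pairs = [("a+", "a-"), ("b+", "b-"), ("c+", "c-"), ("d+", "d-")]
accuracy = discrimination_accuracy(pairs, toy_scores.get)
print(accuracy)
```

Chance level for this pairwise setup is 0.5, so a base model's accuracy above that would be the signal of interest.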