Runway’s notes on using DTensor for distributed training correctness
Runway adopted DTensor to prevent silent gradient bugs, then documented the dispatch overhead, recompilation storms, and MFU losses that came with the safety gain
Top 90 curated tweets ranked for substance on 18 May 2026 UTC.
Runway adopted DTensor to prevent silent gradient bugs, then documented the dispatch overhead, recompilation storms, and MFU losses that came with the safety gain
Frontier models missed bugs in ordinary review but found them when asked to construct Rocq proofs against real C++ code
Torch distributed TokenSwitch backed by NCCL-EP brings expert-parallel MoE dispatch and combine closer to a standard PyTorch primitive
Participants earned about $1.3M for 47 vulnerabilities, including successful exploits against AI developer products
Encrypted browser calls can still expose who is on a call and their motion or speech patterns through unencrypted RTP metadata
A steering API built for a minor Toyota nudge feature later let more than 1,000 old Priuses run open-source driver-assistance software
A code-signing revocation crippled GitButler’s Windows distribution after the critical notice was buried among dozens of webinar-like emails
BLASST uses online softmax statistics and a scalar threshold to skip negligible attention blocks without training a new model
MTP support gives Qwen3.6-family local inference a large performance jump on commodity hardware
The study maps the MoE design space across expert count, expert size, shared experts, routing, and token dropping, concluding that expert size and count dominate
Monitor performance can degrade sharply when malicious actions are embedded inside or before long benign transcripts
Tasks with accurate feedback become easy for coding agents, while tasks without reliable feedback remain hard regardless of raw model capability
Browserbase released a catalog of researched website playbooks intended to make browser agents more reliable on real web tasks
Atlas learned a heavy-object manipulation behavior by practicing many fridge variations in simulation before transferring the policy to the robot
HY World 2.0 ships full inference code and models for building interactive generated worlds
A randomized Phase 3 trial found a nightly oral drug effective for severe obstructive sleep apnea
KRAS mutations drive about 90% of pancreatic cancers, and recent progress has made a formerly undruggable cancer target tractable
Biotech is split between the speed and scale of China-based manufacturing and the long-term risks of outsourcing strategic medical capacity
A 1 GW Blackwell data center may cost up to twice as much as current TPU or Trainium builds, but Nvidia’s compute power efficiency changes the comparison
Anker announced a compute-in-memory chip using mature NOR flash, potentially avoiding the most contested AI memory supply chains
Tokenized equities could become tradable through third parties even when the underlying public company never opted in
A major U.S. utility combination would reshape the power sector just as data-center and electrification demand become central constraints
arXiv updated its code of conduct so authors are accountable for unverified AI-generated content and can face a one-year ban for low-quality submissions
Young children already show a strongly left-lateralized language system, complicating explanations of recovery after early left-hemisphere damage
The dataset covers Indian operational landholdings by district, social group, and farm size across states and union territories
The Rust port let TOML and YAML parsers recurse beyond the old test’s expected stack-overflow point, exposing a subtle benchmark/test assumption
Grafana disclosed a code theft incident while publicly refusing to pay the attackers’ ransom demand
NYC’s under-18 population is falling despite positive natural change because the child population shift is driven by families moving to the suburbs
Replacing Android with postmarketOS turned an old ARM64 phone into a Matrix-connected, end-to-end encrypted Hermes agent server
Reward hacking is an arms race between coding agents and RL envs. A common eval flaw: the agent and verifier share the same sandbox. If the agent can tamper with the grader, “pass” may just mean “cheated.”
https:// arxiv.org/abs/2605.15422 Kernel-level implementation of prefix grouping for group-based RL.
We recently built an AI assistant inside @Razorpay called Slash. It reads our entire codebase, debugs production incidents, reviews specs, writes code, reviews every single PR, answer tech queries and also raises PRs for small features.
GPUs for sale from my friend: What's available: - 20 nodes of H200 NVL - Located in India, ready to deploy ASAP - Ideal for inference workloads Pricing (per GPU/hr): - $2.8 — 6-month minimum commit - $2.6 — 12-month commit Why this matter
Sub-second image generation with Flux.2 [dev] and Qwen-Image: Flux.2 [dev]: 2.3x faster, 0.98s latency (B200) Qwen-Image: 1.6x faster, 0.87s latency (B200) Details on how we got there in Faraz's article.
First line of defense: a clean verifier. The agent should get a normal dev environment: files, shell, build tools. But when the run ends, the harness destroys that environment and copies only the declared artifact into a fresh verifier.
More on v3.6.1. The new XL neural depth models: 1248x780 @ 8.5 FPS 1056x660 @ 11 FPS 864x540 @ 17 FPS 768x480 @ 22 FPS Higher resolution means finer detail with thin structures, object edges, small geometry... All while maintaining a
If AI is code, and AI can code, let’s automate AI research and then discover new knowledge everywhere else! New blog announcing our investment in RSI, and why this team is best suited to making open-ended learning a reality.
I'm excited to share our TeamBench , a new benchmark for evaluating agent coordination under operating system-enforced role separation. Multi-agent systems have become a dominant paradigm for building AI agents. However, most evaluations a
Codex CLI 0.131.0 is out. Highlights: - Python SDK moved to openai-codex / openai_codex, with pinned runtime-generated types, concurrent turn routing, and approval modes - codex doctor added for support-ready diagnostics across runtime, au
New post: "Generalization Dynamics of LM Pre-training" Most people (including me) assume that LMs smoothly mature from pattern-matching to generalizing. This mental model is wrong. The true dynamics are stranger, and far more fascinating
Qwen3.6 now runs 2x faster with MTP GGUFs! Run locally on just 18GB RAM. MTP enables Qwen3.6 to generate ~1.4–2.2× faster with no accuracy change. Qwen3.6-27B MTP runs at 160 tokens/s. 35B-A3B reaches 240 t/s. GGUFs: https:// huggingfa
Video lectures, UC Berkeley CS 182 / 282a Deep Learning fall 2025, by Gireeja Ranade & Anant Sahai https:// berkeley-cs182.github.io/fa25/ https:// youtube.com/playlist?list= PLIygTcviGPKCJO2wgN4rjqRFozoPjvWQs … .
1 Trillion Dense Model Ring-2.6-1T from @TheInclusionAI just dropped A 1 trillion-parameter open reasoning model built for agent workflows, not just Q&A. 63.82 ClawEval (top-tier among open models) Adjustable reasoning effort: high
Second fix: control network access. Unrestricted egress lets agents fetch solutions or use external tools to bypass task difficulty. Off-container agents can keep model/API traffic outside the task sandbox. For on-container agents like C
update: tested GATS on GPT-5.5 BFCL: +5.34% τ²-bench: +2.32% so there is consistent gain of three GPT models: GPT-4o, GPT-5, and GP5.5. simulated feedback for tool-calls refinement keeps working even as base models get stronger. code
Editor’s note: imported_from_x_likes
A big factor is that evals are harder to trust in safety work. If an AI can solve IMO problems, it's probably good at math. If an AI gets a perfect safety score, it could be very safe or it could be very eval aware. There's also a long hi
they dont know that modal has negative latency. it actually saves time
Self-distillation for long-horizon training at scale!
Added a smol new section to last week's blog post on the technical internals of @modal 's fast cold boots. This section describes how we frame cloud buffer management as a linear optimization problem and solve it with GLOP. https:// mod
What are best practices for running Claude Code at scale? New blog post on what we've learned from teams running it across multi-million-line monorepos, decades-old legacy systems, and distributed microservices:
trained an actual reward hacker with RL to study as a model organism for qwen 3 14b, plan to train some more ty @PrimeIntellect for good infra and @_VGen_ for env :D Checkpoints included for every step: https:// huggingface.co/ceseld
Efficient AI Lecture 14: LLM Post-Training PEFT is one of the most practical ideas in LLM post-training. Instead of updating the whole model, train a tiny targeted part: - Adapters: small inserted modules - Prompt tuning: soft prom
Had a chance to fully read the MolmoACT2 paper today. Imo, the ablation results are the most exciting part. So many ideas popping.
With 99.98% uptime, Codex only sleeps 8 minutes per month.
2.5% of our sandboxes run longer than 24 hrs. That 2.5% brings 20% of our revenue. Long-running stateful workloads are not an edge case. It feels weird to see that this isn't the consensus yet.
Announcing the Rogo Excel Plug-In. Felix, our AI agent for finance, now native to Microsoft Excel. Build, extend, and audit models grounded in your firm's conventions and precedents, without leaving your workbook.
5.5 Is a great model, but man is it bad at writing good code on its own
AI agents in healthcare face tight constraints: latency can't exceed 800ms per turn, the first turn processes 10k tokens of context, and safety models analyze the conversation in parallel. Using our MAX framework, @hippocraticai keeps pa
Targeted RL with textual feedback sounds interesting, basically self-distill from a model with hint to one without hint, creating dense reward signal alongside the super long rollout.
Introducing Agora-1, a world model that's learned to simulate multi-agent experiences. It's so fun. Today we're launching a playable research preview, where you can relive your childhood and enjoy a multiplayer simulation of GoldenEye. So
browse skills add http:// poke.com/send-message
Demo gods were on my side for this guest lecture on AI Agent Security at @MIT_CSAIL : I was able to show a prompt injection attack against @AnthropicAI 's Opus 4.6 model. Agent security is still an unsolved problem!
New Anthropic Fellows research: Classifier Context Rot Anthropic monitors for dangerous actions in agent transcripts that are getting very long. Can monitors handle such long transcripts?
Arabic. Japanese. Turkish. Redacting clinical discharge summaries in real-time. 30+ new open-source PII models shipped today on @huggingface . 30+ MLX variants as native Swift packages for macOS and iOS. OpenMed PII family: 1M+ downloads
NLAs are claimed to verbalize model activations. But can they faithfully interpret steered activations? In our latest paper, we show that steering moves activations into non-invertible regions; and almost surely, no prompt maps to steered
In a new article, we take a tour of epoll and io_uring through the lens of an HTTP file server, starting off first with a synchronous thread-per-request server as a baseline.
GitHits exists because AI agents can read your repo, but not the open-source code your repo depends on.
A lightweight attention method to speed up pretraining, especially for long-context models. It doesn’t try to reinvent something new. Instead, it wraps a non-learnable pipeline around FlashAttention. it downsamples the sequence using a no
ClickFix just leveled up. One user-pasted command now drops scheduled task persistence + PySoxy (a 10-year-old open-source Python SOCKS5 proxy) for encrypted backup access. Blocking the first C2? Doesn’t stop it — the task keeps retrying
Two new papers in @AI_PrecisionOnc this month with @NGThaker_XRT and collaborators: RAG in Oncology — where it works, where it breaks, what it takes to go from demo to deployment Data Transparency as AI-Ready Infrastructure — AI can
Code released for “Predictive but Not Plannable: RC-aux for Latent World Models”. RC-aux adds lightweight reachability correction to latent world models,improving planning without changing the LeWM backbone. http:// github.com/Guang000
Well expert iteration is an (inefficient) policy gradient algorithm.
1) nice video 2) interesting that Jane Street seems to own/operate this DC themselves. Strong data privacy needs? 3) JS is now a big AI compute user overall. They recently ordered $6B of compute from CoreWeave, order of $1B/year, comparable
just made `helix chef`. it just one shot a memory system running on helix.
Hey everyone! Good news: we've fixed the "conversation memory loss" issue of OpenAgents Workspace! What we fixed: Context no longer drops in multi-turn conversations The Agent can now properly remember and reference previous messages Multi
Many people are worried that AI agents are going to differentially underperform on safety research (even if they're not scheming) because (i) RL generalizes poorly to hard-to-verify tasks and (ii) AI safety research is harder to verify than
if you can’t find affordable 8xH100 these days, don’t worry. You can just synthetically train on them inside of a world model.
1/ Language models have been stuck in discrete space while vision models ride the continuous diffusion wave. Why? We assumed text inherently needed discrete diffusion. A new MIT paper proves this assumption is mathematically wrong.
cursor is at frontier scale, both in terms of performance and compute if composer 2.5's budget was put into a pre-train: ~6.3T total, 200B active trained on ~56T tokens if composer 3 allocates 50% of the budget to pre-training: ~500B acti
it hugely improves coherence + understanding especially across multiple compaction windows and helps future iterations understand which parts of the code and spec are "carefully thought thru + decided" vs just "yeah this is what happened to
gpt 5.5 in the linux VM by @asciidotdev used computer use to find my French family's lost Polish roots in old books he transcribed shit like this perfectly and we're now digging 6 generations, across many regions, already going back to t
The hardest problem in AI agents may no longer be intelligence. It’s coordination. Multi-agent systems are failing 41–87% of the time — mostly from coordination breakdowns, not model weakness. which means: the next infrastructure layer
crazy to see that video inference requests have already grown 4x in little over a month -- my prediction is that multimodal inference is going to be WAY larger than text-based inference on venice, especially when they enable TEE/E2EE modes
https:// arxiv.org/abs/2605.16147 It's interesting that DiT does not have outlier tokens (maybe because of noise it would be hard to anchor on specific tokens?) but still register tokens are beneficial, especially for pixel-level models.
a litmus test i’ve been thinking about for continual learning is bounding lifetime retrieval count per fact. a model should use tools to look things up, but gradually compound fuzzy memories of things they’ve searched, and eventually not ne
very nice write-up on preventing reward hacking by designing the verifier and network boundaries clearly
The @cursor_ai team shipped Composer 2 and now Composer 2.5 on the same Kimi K2.5 base model. Performance benchmarks are. Frontier quality and open-source economics. 85% of the compute powering these gains came from RL. Fireworks powers
Amazing benchmark numbers, but what stood out to me most is the feel in daily use. Clearer turn summaries, easier-to-follow edits, and code that feels like something I’d write myself.
this is most visceral with anything multimodal. the agents cant into visual feedback loops
Very rarely you stumble on a method that's simple, obvious in hindsight, free, and touches on every problem you care about: CLI agents, continual learning, self-improvement, world models. ECHO is one of those