Backlist — 19 May 2026 UTC

1.

Google and Blackstone plan a 500 MW TPU cloud

Google is turning TPUs into an external cloud business with Blackstone capital, 500 MW planned for 2027, and a new operator outside GCP

by @anissagardizy8 (Anissa Gardizy) · backlist 2026-05-19 · rubric 84.0

2.

METR’s first Frontier Risk Report

Anthropic, Google, Meta, and OpenAI let METR test internal models with chain-of-thought access and review non-public evidence about agent control risks

by @METR_Evals (METR) · backlist 2026-05-19 · rubric 94.0

3.

Carbon: open generative DNA models 275× faster at inference

Carbon-3B was trained on 1T DNA tokens and claims leading DNA-model performance with inference fast enough to generate a whole human genome on a laptop

by @_lewtun (Lewis Tunstall) · backlist 2026-05-19 · rubric 96.0

4.

MIT built an OS for reverse engineering CPUs (t.co)

A purpose-built operating system let researchers probe Apple Silicon branch predictors and observe effects like phantom fetches that ordinary software cannot expose

by @0xjprx (Joseph Ravichandran) · backlist 2026-05-19 · rubric 97.0

5.

Vitest fixes arbitrary file exposure, code execution, and XSS issues

Vitest’s API and UI modes exposed arbitrary files, allowed arbitrary execution, and had an otelCarrier XSS bug, making the update immediately relevant to many JS projects

by @vitest_dev (Vitest) · backlist 2026-05-19 · rubric 94.0

6.

General-Agent: self-evolving synthetic RL environments

Prime Intellect released a fully synthetic agent task corpus that grows harder over time, spanning 4,504 tool-use tasks, 1,040 domains, and 8,159 tools

by @PrimeIntellect (Prime Intellect) · backlist 2026-05-19 · rubric 97.0

7.

Microsoft: Mini Shai-Hulud npm supply-chain attack targeting antv packages

Attackers compromised an antv maintainer account and published malicious versions of widely used npm packages, extending the Shai-Hulud-style supply-chain pattern

by @MsftSecIntel (Microsoft Threat Intelligence) · backlist 2026-05-19 · rubric 88.0

8.

NanoGPT-Bench: coding agents recover 9.3% of human AI R&D progress

On a months-long AI R&D task, Codex, Claude Code, and Autoresearch mostly tuned hyperparameters and recovered only 9.3% of human progress

by @IntologyAI (Intology) · backlist 2026-05-19 · rubric 96.0

9.

CXMT S-1: $30B revenue run rate and 60%+ net margins

CXMT’s filing implies a massive Chinese memory business with Hynix-like margins, high utilization, and LPDDR-heavy revenue despite no disclosed HBM

by @FredaDuan (Freda Duan) · backlist 2026-05-19 · rubric 79.0

10.

BEA paper argues U.S. healthcare productivity is massively understated (t.co)

The BEA argues standard statistics overstate healthcare inflation and miss productivity gains from treatments that extend healthy life

by @M_C_Klein (Matthew C. Klein) · backlist 2026-05-19 · rubric 67.0

11.

Turso contributor finds 10+ SQLite bugs using Quint validation (x.com)

Validation tooling built for Turso uncovered more than ten bugs in SQLite, showing how formal models can improve even mature database systems

by @glcst (Glauber Costa) · backlist 2026-05-19 · rubric 84.0

12.

Deep-learning gene-perturbation predictors still trail linear baselines

A Nature Methods paper finds deep-learning approaches to gene perturbation effect prediction do not yet beat simple linear baselines

by @mikejg84 (Michael Gandal) · backlist 2026-05-19 · rubric 78.0

13.

Unitree G1 generates arbitrary actions from voice commands in real time

A humanoid robot is shown translating external voice commands into varied real-time actions without pre-scripted motion playback

by @UnitreeRobotics (Unitree) · backlist 2026-05-19 · rubric 92.0

14.

Cloudflare Radar launches a browser-based MRT Explorer

MRT Explorer lets operators inspect BGP routing update files directly in the browser for outage and route-leak investigations

by @CloudflareRadar (Cloudflare Radar) · backlist 2026-05-19 · rubric 92.0

15.

Tonic joins the gRPC project (t.co)

The main Rust gRPC implementation is moving into the gRPC project, reducing ecosystem fragmentation for production Rust services

by @lucio_d_franco (Lucio Franco) · backlist 2026-05-19 · rubric 78.0

16.

tinygrad merges its first assembly backend for x86 CPUs

tinygrad now has instruction selection and register allocation for an x86 assembly backend, making generated kernels visible and optimizable below LLVM/PTX

by @__tinygrad__ (the tiny corp) · backlist 2026-05-19 · rubric 92.0

17.

An incremental approach to compiler construction

Ghuloum’s 2006 paper teaches compiler building by starting with a tiny working compiler and extending it step by step instead of front-loading hundreds of lines of machinery

by @paoloanzn (4nzn) · backlist 2026-05-19 · rubric 88.0

18.

Simulation Distillation for long-horizon real-world robotics

SimDist turns large-scale simulated experience into reusable world-model priors so robots can adapt faster on contact-rich real-world tasks

by @ty_westenbroek (Tyler Westenbroek) · backlist 2026-05-19 · rubric 91.0

19.

Samsung to show 16-layer 3D DRAM at VLSI 2026 (t.co)

A 16-layer 3D DRAM paper points toward new memory-density approaches as AI demand strains conventional DRAM capacity

by @DrFrederickChen (Fred Chen) · backlist 2026-05-19 · rubric 82.0

20.

Vision Pro eye tracking will control power wheelchairs in visionOS 27

Apple is using Vision Pro’s precision eye tracking as an input method for compatible power wheelchair drive systems

by @aaronp613 (Aaron) · backlist 2026-05-19 · rubric 67.0

21.

Google’s AI Co-Scientist is published in Nature and opens broadly (x.com)

Google’s Co-Scientist work has moved into a peer-reviewed Nature publication and is being made available through Gemini for Science

by @vivnat (Vivek Natarajan) · backlist 2026-05-19 · rubric 46.0

22.

Parallel Index: revenue for content used by AI agents (x.com)

Parallel is building a platform where content owners can see how agents use their work and earn revenue from that usage, with partners including The Atlantic, Fortune, PitchBook, and ZoomInfo

by @p0 (Parallel Web Systems) · backlist 2026-05-19 · rubric 84.0

23.

1,300 public-domain 19th-century landscapes, free to download (t.co)

A free archive bundles more than 1,300 public-domain landscape images as individual downloads or a full zip, pushing back on paid repackaging of open material

by @driceroland (Drice) · backlist 2026-05-19 · rubric 62.0

24.

An 80s business-tech satire site built with scroll-driven WebGPU (x.com)

The Shader Sweden site commits to a full retro-computing concept while using modern WebGPU scene transitions and scroll-driven rendering

by @crnacura (Manoela Ilic) · backlist 2026-05-19 · rubric 22.0

25.

What happens when you execute the simplest C++ program? (t.co)

A minimal C++ program opens a path into loaders, runtimes, linking, startup code, and the hidden machinery behind “hello world”

by @vivekgalatage (Vivek Galatage) · backlist 2026-05-19 · rubric 86.0

26.

Boston Logan will let travelers clear TSA in Framingham

Starting June 1, travelers can clear airport security off-site in Framingham and be dropped at Logan already beyond TSA for $9 each way

by @OnlyInBOS (Only In Boston) · backlist 2026-05-19 · rubric 26.0

27.

Mobile broadband linked to less teen socializing, lower fertility, and higher suicide (t.co)

A new paper reports that access to broadband mobile phone networks reduced in-person teen socializing, decreased teen fertility, and increased teen suicide

by @florianederer (Florian Ederer) · backlist 2026-05-19 · rubric 41.0

28.

Vercel adds CDN pricing that smooths traffic spikes

Vercel is changing CDN pricing to reduce surprise bills from viral traffic without routing users onto slower paths or lower-priority network tiers

by @rauchg (Guillermo Rauch) · backlist 2026-05-19 · rubric 84.0

29.

Is DRAM fab capacity the nearer-term AI scaling wall?

If AI already consumes 52% of DRAM wafer capacity this year and 69% next year, memory fabrication may constrain scaling before logic fabs do

by @snewmanpv (Steve Newman) · backlist 2026-05-19 · rubric 73.0

30.

Amazing to see what the @Hippocratic AI team is achieving with MAX. Their Polaris agent runs patient care convers…

Amazing to see what the @Hippocratic AI team is achieving with MAX. Their Polaris agent runs patient care conversations and needs to complete every turn in under 800ms, with safety models analyzing in parallel.

by @clattner_llvm (Chris Lattner) · backlist 2026-05-19 · rubric 96.0

31.

very good read on making models learn terminal/env dynamics!!

very good read on making models learn terminal/env dynamics!! 1. the authors add a CE loss on env output tokens alongside the GRPO loss on actions. 2. the model is trained to predict what the terminal will return, which forces the weights

by @vivek_2332 (Vivek) · backlist 2026-05-19 · rubric 96.0
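
A minimal sketch of the two-part objective described in item 31: a policy-gradient term on action tokens plus a cross-entropy term on environment-output tokens. It assumes a simplified stand-in for GRPO (group-normalized advantages, advantage-weighted log-likelihood, no clipping or KL term). The function names, the `ce_weight` parameter, and the token-mask layout are hypothetical and not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def group_advantages(rewards):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]

def combined_loss(logits, targets, is_env, advantages, ce_weight=1.0):
    """Return (total, pg, ce) for one trajectory.

    logits:     per-position list of vocabulary logits
    targets:    per-position token id that actually appeared
    is_env:     per-position flag, True if the token was emitted by the terminal/env
    advantages: per-position advantage (only read for action tokens)
    ce_weight:  hypothetical mixing coefficient for the env-prediction term
    """
    pg_terms, ce_terms = [], []
    for pos_logits, tok, env, adv in zip(logits, targets, is_env, advantages):
        logp = log_softmax(pos_logits)[tok]
        if env:
            ce_terms.append(-logp)        # predict what the terminal returned
        else:
            pg_terms.append(-adv * logp)  # reinforce actions by their advantage
    pg = sum(pg_terms) / len(pg_terms) if pg_terms else 0.0
    ce = sum(ce_terms) / len(ce_terms) if ce_terms else 0.0
    return pg + ce_weight * ce, pg, ce
```

The mask is the key design choice here: the same forward pass supplies both terms, and `is_env` decides which loss each token contributes to.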

32.

#CVPR2026 Can frontier LLMs write PhD-level 3D vision code?

#CVPR2026 Can frontier LLMs write PhD-level 3D vision code? We introduce GeoCodeBench, a benchmark that asks models to read real 3D geometric vision papers and implement core functions. Best result so far: GPT-5 reaches only 36.6%. This

by @HaoZhao_AIRSUN (Hao Zhao) · backlist 2026-05-19 · rubric 96.0

33.

Using LoRAs for determining dataset mixture (t.co)

https://arxiv.org/abs/2605.15220 Using LoRAs for determining dataset mixture. For a continual training setup, when new datasets are introduced, it is possible to train LoRAs for them and combine them with a LoRA on previous datasets.

by @rosinality (Rosinality) · backlist 2026-05-19 · rubric 96.0

34.

Unsloth Studio now has auto speculative decoding & MTP support for GGUFs! Get up to 2x faster inference with no a…

Unsloth Studio now has auto speculative decoding & MTP support for GGUFs! Get up to 2x faster inference with no accuracy loss! We ran many experiments from small models to MoEs, and optimized the params for Mac, GPUs & CPUs. There's also

by @danielhanchen (Daniel Han) · backlist 2026-05-19 · rubric 95.0

35.

By far the most impactful low hanging fruit for auto research type setups would be to find a setup that makes PPO…

By far the most impactful low hanging fruit for auto research type setups would be to find a setup that makes PPO or OPSD broadly work / stable Whether or not models are ready to make eureka level breakthroughs, this should be in reach. Th

by @JoshPurtell (Josh) · backlist 2026-05-19 · rubric 94.0

36.

Blackstone announced a joint venture with (x.com)

Blackstone announced a joint venture with @Google to create a new TPU cloud. We see a generational opportunity to invest at scale in AI infrastructure and help meet the unprecedented demand for compute. More: https://bit.ly/4uY936w

by @blackstone (Blackstone) · backlist 2026-05-19 · rubric 94.0

37.

We added RTX PRO 6000 Blackwell to Jarvislabs this week.

We added RTX PRO 6000 Blackwell to Jarvislabs this week. I was curious about one thing: can this make 30B-class inference simpler? So our team benchmarked Qwen3-32B on vLLM across BF16, FP8, and NVFP4. NVFP4 is NVIDIA’s new 4-bit floatin

by @vishnuvig (Vishnu - Jarvislabs.ai) · backlist 2026-05-19 · rubric 94.0

38.

1/4 (x.com)

1/4 New paper with @weijie444 ! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU

by @timlautk (Tim Lau) · backlist 2026-05-19 · rubric 93.0

39.

We are top growth dog!

We are top growth dog! And we are hiring distributed compute, SRE, and infrastructure engineers to work on the coolest and most challenging inference problems

by @chris_mladenov (Christian Mladenov) · backlist 2026-05-19 · rubric 93.0

40.

Gemini 3.5 flash + Gemini managed agents api just audited a real megatron-lm ci failure inside Eigent. root cause…

Gemini 3.5 flash + Gemini managed agents api just audited a real megatron-lm ci failure inside Eigent. root cause in minutes! watch the handoff: coordinator agent plans the audit, developer agent loads the ml-failure-audit skill and gather

by @Eigent_AI (Eigent AI) · backlist 2026-05-19 · rubric 92.0

41.

Salesbench is a very nice multi-agent negotiation environment, and the blog post contains detailed experiments, w…

Salesbench is a very nice multi-agent negotiation environment, and the blog post contains detailed experiments, written down concisely and easy to understand. You should read it!

by @omouamoua (snimu) · backlist 2026-05-19 · rubric 92.0

42.

Cerebras sets a new record: a one trillion parameter model @ 1,000 tokens/s

by @draecomino (James Wang) · backlist 2026-05-19 · rubric 92.0

43.

Our report focuses on risks from AI agents intentionally causing harm within an AI company. We highlight 6 key fi…

Our report focuses on risks from AI agents intentionally causing harm within an AI company. We highlight 6 key findings that span “means” (what harmful actions agents could take), “motive” (why they might try), and “opportunity” (whether at

by @METR_Evals (METR) · backlist 2026-05-19 · rubric 92.0

44.

We’re releasing Nemotron-Labs-Diffusion - the first Tri-mode LM family (3B/8B/14B) that switches between Autoregr…

We’re releasing Nemotron-Labs-Diffusion - the first Tri-mode LM family (3B/8B/14B) that switches between Autoregressive, Diffusion, and Self-Speculation decoding by simply changing the attention pattern/mask. One model Three decoding modes

by @PavloMolchanov (Pavlo Molchanov) · backlist 2026-05-19 · rubric 92.0

45.

Scaling evaluations—not just compute—is critical for AI-driven science.

Scaling evaluations—not just compute—is critical for AI-driven science. SimpleTES introduces a new framework to scale discovery loops, finding new SOTA solutions across 21 open science problems. Including: • >2× faster LASSO algorithm •

by @james_y_zou (James Zou) · backlist 2026-05-19 · rubric 92.0

46.

bullish on LangChain Labs. imo initiatives like this are important because continual learning for agents is funda…

bullish on LangChain Labs. imo initiatives like this are important because continual learning for agents is fundamentally an infrastructure problem...agents need systems that can collect trajectories, extract learning signal from behavior,

by @novasarc01 (λux) · backlist 2026-05-19 · rubric 92.0

47.

FutureSim Update

FutureSim Update We evaluated Opus 4.7 at max reasoning in Claude Code. Despite potential test-set contamination with knowledge cutoff of Jan '26, it scored just 21%, barely edging past Opus 4.6 and still behind GPT 5.5! Will Mythos

by @nikhilchandak29 (Nikhil Chandak) · backlist 2026-05-19 · rubric 92.0

48.

All Firewall mitigations are now fully free on (x.com)

All Firewall mitigations are now fully free on @vercel . Not just DDoS and system-level mitigations, but also any rule you configure. Vercel now absorbs the computational and network costs of any size of attack or traffic mitigation for y

by @rauchg (Guillermo Rauch) · backlist 2026-05-19 · rubric 92.0

49.

Your agent finished the task. Did it also read files it shouldn't have, call tools outside policy, or leak data a…

Your agent finished the task. Did it also read files it shouldn't have, call tools outside policy, or leak data across components? If you only score final outputs, you can't tell. 𝐇𝐚𝐫𝐧𝐞𝐬𝐬𝐀𝐮𝐝𝐢𝐭 evaluates the three safety layers

by @xwang_lk (Xin Eric Wang (hiring postdoc)) · backlist 2026-05-19 · rubric 92.0

50.

Aware of the login and auth issues people are having with Antigravity. Facing a significant increase in traffic a…

Aware of the login and auth issues people are having with Antigravity. Facing a significant increase in traffic and thundering herd issues. Fixing ASAP!

by @_mohansolo (Varun Mohan) · backlist 2026-05-19 · rubric 91.0

51.

GDM finally managed to run OSWorld!

by @TianbaoX (Tianbao Xie) · backlist 2026-05-19 · rubric 91.0

52.

Google’s new Gemini 3.5 Flash is the clear leader on the Intelligence vs Speed Pareto frontier and makes large ga… (x.com)

Google’s new Gemini 3.5 Flash is the clear leader on the Intelligence vs Speed Pareto frontier and makes large gains on GDPval-AA (real-world agentic tasks), but is 5x the cost of Gemini 3 Flash @GoogleDeepMind gave us pre-release access

by @ArtificialAnlys (Artificial Analysis) · backlist 2026-05-19 · rubric 91.0

53.

Meet Gemini 3.5 Flash — our strongest agentic and coding model yet.

Meet Gemini 3.5 Flash — our strongest agentic and coding model yet. It delivers frontier-level performance at 4x the speed of comparable frontier models — often at less than half the cost. Generally available, starting today. #GoogleIO

by @Google · backlist 2026-05-19 · rubric 91.0

54.

We’re opening up a new role at Abundance: Head of Data Engineering.

We’re opening up a new role at Abundance: Head of Data Engineering. The role is simple to describe and hard to do: build the data layer that lets AI agents reason across messy, high-stakes financial and domain specific data. If you’ve wo

by @apoorva_mehta (Apoorva Mehta) · backlist 2026-05-19 · rubric 91.0

55.

Excited to announce an open-source webui to experiment w/ steering vectors! Works OOTB w/ Gemma 26B A4B and Gem…

Excited to announce an open-source webui to experiment w/ steering vectors! Works OOTB w/ Gemma 26B A4B and Gemma 4B E4B (for smaller setups), and comes w/ 13 pre-built steering vectors, and lets you build your own - see a demo video belo

by @N8Programs (N8 Programs) · backlist 2026-05-19 · rubric 91.0

56.

DeMix targets the weak point in data mixture search: proxy fidelity on hard capabilities.

DeMix targets the weak point in data mixture search: proxy fidelity on hard capabilities. Instead of training a proxy for every sampled ratio, DeMix trains component models once, then uses weighted model merging to synthesize proxies for a

by @gm8xx8 (𝚐𝔪𝟾𝚡𝚡𝟾) · backlist 2026-05-19 · rubric 91.0
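
A minimal sketch of the weighted-merging idea in item 56: component models are trained once, then a proxy for any candidate mixture ratio is synthesized by averaging their parameters with the ratio as weights. This assumes plain weighted averaging of full parameter tensors, represented here as flat lists. The truncated summary does not say whether DeMix merges full weights or deltas from a shared base, so `merge_weights` and its data layout are hypothetical.

```python
def merge_weights(component_models, ratios):
    """Synthesize a proxy model as a ratio-weighted average of component parameters.

    component_models: list of dicts mapping parameter name -> list of floats,
                      all with identical keys and shapes (one model per dataset)
    ratios:           non-negative mixture weights, one per component model
    """
    if len(component_models) != len(ratios):
        raise ValueError("need exactly one ratio per component model")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    norm = [r / total for r in ratios]  # normalize so the weights sum to 1

    merged = {}
    for name, first in component_models[0].items():
        merged[name] = [
            sum(w * model[name][i] for w, model in zip(norm, component_models))
            for i in range(len(first))
        ]
    return merged


# Usage: evaluate many candidate ratios without any further training.
model_a = {"w": [0.0, 2.0]}
model_b = {"w": [4.0, 6.0]}
proxy = merge_weights([model_a, model_b], ratios=[1, 3])
```

Because merging is only arithmetic over existing weights, the cost of scoring one more candidate ratio is an evaluation pass rather than a training run.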

57.

Anthropic announced self-hosted sandboxes and MCP tunnels for Claude Managed Agents during its "Code with Claude"…

Anthropic announced self-hosted sandboxes and MCP tunnels for Claude Managed Agents during its "Code with Claude" event in London. > With self-hosted sandboxes, you keep sensitive files, packages, and services in your own infrastructure or

by @testingcatalog (AI News | TestingCatalog) · backlist 2026-05-19 · rubric 91.0

58.

"We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — es… (t.co)

"We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving

by @S_OhEigeartaigh (Seán Ó hÉigeartaigh) · backlist 2026-05-19 · rubric 91.0

59.

For years, AI safety has been about the model: alignment, refusal training, jailbreak resistance. When you deploy…

For years, AI safety has been about the model: alignment, refusal training, jailbreak resistance. When you deploy an agent in 2026, the model is not making most of the consequential decisions. The harness is. It chooses which tools the mode

by @xwang_lk (Xin Eric Wang (hiring postdoc)) · backlist 2026-05-19 · rubric 91.0

60.

Excited to launch Claude Managed Agents on Cloudflare today!

Excited to launch Claude Managed Agents on Cloudflare today! - Run sandboxes as microVMs or even lighter-weight isolates on CF - Zero-trust creds injection, custom egress proxies, better observability, private services via VPC - Agent Emai

by @MikeNomitch_CF (Mike Nomitch) · backlist 2026-05-19 · rubric 91.0

61.

Excited to share our new paper:

Excited to share our new paper: Continuous Diffusion Scales Competitively with Discrete Diffusion for Language We introduce RePlaid, a continuous diffusion language model (DLM) with Discrete likelihood bound Scaling laws competitive with

by @zhihanyang_ (Zhihan Yang) · backlist 2026-05-19 · rubric 91.0

62.

Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster

by @chaumian (alon turing (30+ education advocate)) · backlist 2026-05-19 · rubric 91.0

63.

i did an experiment a while back, codex was able to build a zero dependency reverse proxy with http 1/2/3 support…

i did an experiment a while back, codex was able to build a zero dependency reverse proxy with http 1/2/3 support that's faster than the cloudflare rust one and nginx in golang over a weekend and improve it in an auto research style loop.

by @dosco (spacy) · backlist 2026-05-19 · rubric 91.0

64.

Excited to share our new paper MIXSD: Mixed Contextual Self-Distillation for Knowledge Injection

Excited to share our new paper MIXSD: Mixed Contextual Self-Distillation for Knowledge Injection Supervised fine-tuning is the common way to teach LLMs new knowledge, but it often catastrophically forgets existing capabilities. We introduc

by @Jiarui_Liu_ (Jiarui Liu) · backlist 2026-05-19 · rubric 90.0

65.

we document the internal-external capabilities gap, demonstrate AI systems' spike on “hill-climbable” tasks, inve…

we document the internal-external capabilities gap, demonstrate AI systems' spike on “hill-climbable” tasks, investigate performance on somewhat more open-ended tasks, and much more besides.

by @joel_bkr (Joel Becker) · backlist 2026-05-19 · rubric 90.0

66.

okay this is basically it. does this work? if it does, this is the general consumer OpenClaw moment. personal clo…

okay this is basically it. does this work? if it does, this is the general consumer OpenClaw moment. personal cloud agent with persistent context and access. this is absolutely the kind of thing google *could* potentially pull off. but is

by @tenobrus (Tenobrus) · backlist 2026-05-19 · rubric 90.0

67.

It goes without saying but although OPSD is great, I think making PPO or some minor/reasonable variant work is by…

It goes without saying but although OPSD is great, I think making PPO or some minor/reasonable variant work is by far the best bet

by @JoshPurtell (Josh) · backlist 2026-05-19 · rubric 90.0

68.

Introducing Carbon a family of open generative DNA foundation models. Carbon-3B matches Evo2-7B while running 25…

Introducing Carbon a family of open generative DNA foundation models. Carbon-3B matches Evo2-7B while running 250x faster at inference. It can generate new DNA sequences and score the functional impact of mutations, zero-shot. We borrowed

by @LoubnaBenAllal1 (Loubna Ben Allal) · backlist 2026-05-19 · rubric 90.0

69.

so good to see more local model builders getting their hands on NVIDIA DGX Spark.

so good to see more local model builders getting their hands on NVIDIA DGX Spark. Laguna XS.2 already runs on DGX Spark. you can run XS.2 through vLLM, SGLang, and Ollama today, with TRT-LLM support coming soon. If you have one, go try i

by @poolsideai (poolside) · backlist 2026-05-19 · rubric 90.0

70.

We’ve added two security improvements to Claude Managed Agents.

We’ve added two security improvements to Claude Managed Agents. Self-hosted sandboxes keep the agent’s execution environment in your infrastructure or with a managed sandbox provider. MCP tunnels let the agent connect to services inside

by @ClaudeDevs · backlist 2026-05-19 · rubric 90.0

71.

oMLX 0.3.9rc1 released. (x.com)

oMLX 0.3.9rc1 released. Highlights: - Low-memory Macs stay stable instead of getting killed by the OS - DFlash bumped to v0.1.7 (thanks to @bstnxbt 's dflash-mlx). Qwen thinking/GDN fix, Etc. - Chunked prefill. A long prompt no longer bloc

by @jundotkim (Jun Kim) · backlist 2026-05-19 · rubric 90.0

72.

Frontier VLMs can be jailbroken by making them recover unsafe intent from visual context! (x.com)

Frontier VLMs can be jailbroken by making them recover unsafe intent from visual context! Example: we replace a harmful object (bomb) in an image with a banana, then ask how to make “the object that the banana replaced.” @GeminiApp compl

by @jan_dubinski_ (Jan Dubiński) · backlist 2026-05-19 · rubric 89.0

73.

> Zero-trust creds injection, custom egress proxies, better observability, private services via VPC

> Zero-trust creds injection, custom egress proxies, better observability, private services via VPC speaking my language!

by @elithrar (Matt Silverlock) · backlist 2026-05-19 · rubric 89.0

74.

no, it's still very important. just tell the agent to specify in detail HOW and WHY the bug happened. this has 2 …

no, it's still very important. just tell the agent to specify in detail HOW and WHY the bug happened. this has 2 benefits: 1. the agent has more context if it needs to fix the bug 2. the agent is allowed to ignore the instruction if conditi

by @menhguin (Minh Nhat Nguyen) · backlist 2026-05-19 · rubric 89.0

75.

across about 100 open PRs, robobun/claude is gradually realizing we ported bun to rust and rewriting them

across about 100 open PRs, robobun/claude is gradually realizing we ported bun to rust and rewriting them XML parser is one of those PRs

by @jarredsumner (Jarred Sumner) · backlist 2026-05-19 · rubric 89.0

76.

big day of building today

big day of building today We’re now doing RL training on the runtime of our new agent framework The implementation is a loop: run the native agent runtime through real and ambitious economic tasks, trace every step, score behavior agains

by @Inner_Axiom (Axobotl) · backlist 2026-05-19 · rubric 89.0

77.

Congratulations Cerebras on going public last week!

Congratulations Cerebras on going public last week! Artificial Analysis benchmarks were cited in Cerebras' S-1 filing regarding inference performance. We have benchmarked Cerebras’ serverless API since the day it launched in August 2024. S

by @ArtificialAnlys (Artificial Analysis) · backlist 2026-05-19 · rubric 88.0

78.

OpenAI is guaranteeing compute capacity for 1-3 years.

by @AndrewCurran_ (Andrew Curran) · backlist 2026-05-19 · rubric 88.0

79.

the report is out!!!!!

the report is out!!!!! i want to share the spookiest transcript i read while working on this where an OpenAI model, unprompted, tried to break out of METR infrastructure ;-;

by @vvvincent_c (Vincent) · backlist 2026-05-19 · rubric 88.0

80.

the evidence from somewhat more open-ended “challenge” problems is super interesting.

the evidence from somewhat more open-ended “challenge” problems is super interesting. one of the most capable models discovered a vulnerability that could have allowed the model to arbitrarily alter displayed transcripts and scores on METR

by @joel_bkr (Joel Becker) · backlist 2026-05-19 · rubric 88.0

81.

AI labs have started developing systems to monitor internally deployed AI agents for misaligned behavior.

AI labs have started developing systems to monitor internally deployed AI agents for misaligned behavior. Earlier this year, I spent a month embedded at Anthropic stress-testing these systems, to see how easily current/future AIs could “go

by @idavidrein (david rein) · backlist 2026-05-19 · rubric 88.0

82.

We created private reports for each participating company based on our model evaluations and analysis. Participan…

We created private reports for each participating company based on our model evaluations and analysis. Participants could then approve what non-public evidence we could disclose in our public report, but had no editorial control.

by @METR_Evals (METR) · backlist 2026-05-19 · rubric 88.0

83.

You have to read this one. (x.com)

You have to read this one. We just published a recap into how @wafer_ai pushed @AMD inference performance to a level that’s getting the entire ecosystem’s attention and the results are kind of wild. What makes this story interesting i

by @tensorwave (TensorWave) · backlist 2026-05-19 · rubric 88.0

84.

GRPO and its minor variants are just not viable. Useful baseline, that’s it. It is time to figure out how to make…

GRPO and its minor variants are just not viable. Useful baseline, that’s it. It is time to figure out how to make a real algorithm work

by @JoshPurtell (Josh) · backlist 2026-05-19 · rubric 88.0

85.

we just added self-hosted sandboxes to Claude Managed Agents.

we just added self-hosted sandboxes to Claude Managed Agents. i've been excited about this for a while: you can now connect many more "hands" (customizable execution environments) to the agent. here's a few interesting articles covering

by @RLanceMartin (Lance Martin) · backlist 2026-05-19 · rubric 88.0

86.

C2.5 is the same pretrain as C2, but powered by a much better and stronger midtrain (nearly an OOM more FLOPS)!

C2.5 is the same pretrain as C2, but powered by a much better and stronger midtrain (nearly an OOM more FLOPS)! The base model matters a ton for RL, so we're very excited for the power of Colossus 2 to push this way further

by @VeringJulius (Julius Vering) · backlist 2026-05-19 · rubric 88.0

87.

Google just showed a demo, Gemini Flash model running between 600-1400 tokens per second on TPU 8i

Google just showed a demo, Gemini Flash model running between 600-1400 tokens per second on TPU 8i It peaked out around 1480 tok/s, with average around 800 tok/s

by @mweinbach (Max Weinbach) · backlist 2026-05-19 · rubric 88.0

88.

absolutely no offense to stainless, but we have a generator for our sdks for each programming language we maintai…

absolutely no offense to stainless, but we have a generator for our sdks for each programming language we maintain (we did this before AI!!), you don't need a whole fucking company for this.

by @jessfraz (Jessie Frazelle) · backlist 2026-05-19 · rubric 88.0

89.

A (my) Pythia Search Engine find:

A (my) Pythia Search Engine find: https://12000.org Algebra, Mathematics, Control Systems, Signal Image Processing, Differential Equations, Simulations and more It goes deep with examples, solutions and it's very interestingly structur

by @wavefnx · backlist 2026-05-19 · rubric 88.0

90.

Excited to share our new paper using cognitive science to distinguish AI agents and humans!

Excited to share our new paper using cognitive science to distinguish AI agents and humans! We administered CogCAPTCHA30, a set of 30 cognitive tasks, to frontier VLMs (GPT-5, Sonnet 4.5, Gemini 2.5 Pro) and humans. We found that processes

by @mdahardy (Matt Hardy) · backlist 2026-05-19 · rubric 88.0