Supply-chain compromise hits 42 TanStack npm packages (x.com)
A live npm supply-chain attack pushed malicious versions of 42 TanStack packages, turning a popular frontend dependency into a credential-theft risk
Kept the AI slate to one major model release plus the most substantive infra, benchmark, biology, and labor-market items; left out several near-duplicate coding-agent and OpenAI launch posts to preserve breadth.
Microsoft and OpenAI describe a fault-tolerant datacenter networking design demonstrated on a 75,000-GPU pretraining run
External staff working on the NHS flagship data platform reportedly received broad access to identifiable patient records, raising the stakes for health-data governance
A data-center project breaks AI infrastructure spending into land, power, construction, equipment, financing, and operating economics instead of treating capex as a single number
Bun’s rewritten component now passes the project’s test suite on Linux, Windows, and macOS across x64 and arm64, and may close roughly 200 GitHub issues
Polars crossed 50 million monthly downloads and is positioning itself as a faster local and distributed dataframe engine with on-prem workspaces and spill-to-disk support
A floating-point NaN bit pattern could masquerade as a tagged pointer across a Firefox sandbox boundary, turning a numerical edge case into an exploit primitive
Cerebras increased both share count and price range in its IPO filing, signaling unusually strong appetite for non-Nvidia AI hardware exposure
Ginkgo’s Nebula lab now lets customers price and order experiments across a larger fleet of robots and lab instruments, pushing biological experimentation toward cloud-style procurement
Frontier LLM agents can beat a hand-engineered protein design pipeline on some tasks, but still fall short of a human expert across 76 expert-graded protein design challenges
Hydrogene reports non-viral DNA delivery into non-human-primate liver at expression levels comparable to commercial AAV products for hemophilia
Monge Inception Distance aims to replace FID with a more robust and faster generative-model metric that needs an order of magnitude fewer samples
A labor-market paper estimates which US occupations are easiest to improve through reinforcement-learning post-training, connecting AI exposure to how jobs are actually structured
River’s resumable jobs let expensive substeps such as spinning up an agent sandbox run once and survive retries, reducing wasted work in background processing
Workers, D1, KV, R2, Queues, Durable Objects, Workflows, and Browser Rendering now make a small Cloudflare plan look like a serious low-cost application backend
A detailed write-up walks through virtual memory from page tables and TLBs down to Linux internals, the kind of systems knowledge that remains useful across stacks
An LLM CLI can be used directly in a shebang line, turning English or YAML-templated prompts into executable scripts
Thinking Machines is previewing models that listen, speak, watch, interrupt, and react without relying on rigid turn boundaries
Encrypted RCS messaging between Android and iPhone users begins rolling out, closing a long-standing privacy gap in cross-platform texting
A transit visualization uses 15-minute demand data and rail timetables to show how 6.3 million weekday London trips move through the network
A citywide map identifies 2,180 abandoned San Francisco buildings, including hundreds of government-owned structures that have sat unused for decades
A true virtual cell will require combining AI’s pattern recognition with mechanistic models that can represent causal biological processes
Removing one endothelial cell from a brain capillary can trigger neighboring cells to rapidly extend and rebuild the vessel
Anthropic’s warning that unauthorized stock interests are void could invalidate layered SPVs and synthetic secondary exposure in one of the hottest private AI companies
A new sparsity approach claims to overcome GPU-unfriendly scattered memory reads, converting extreme FLOP reduction into more than 20% actual speedup
The first reported genome from the outbreak is 99% identical to a 2018 Argentina case and gives an estimated mutation rate for tracking transmission
A 14th-century private annuities market shows that sophisticated household finance and secondary trading appeared much earlier than many modern intuitions suggest
School construction in 1960s Chile raised education and earnings, narrowed gender gaps, increased second-generation schooling, and generated a high estimated public return
Zenbu.js explores software designed to be edited by end users and their coding agents after installation, making local applications more malleable
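For readers unfamiliar with the mechanism behind the Firefox NaN item above, here is a minimal sketch of why a NaN bit pattern can be mistaken for a tagged pointer. It assumes a made-up NaN-boxing layout (the tag value, masks, and helper names are hypothetical and are not Firefox's actual encoding). The point is only that many distinct 64-bit patterns are all "NaN" to floating-point code, so a value that passes a numeric check can still carry a pointer-shaped payload unless it is canonicalized at the trust boundary.

```python
import math
import struct

# Hypothetical NaN-boxing layout, for illustration only.
# This is NOT Firefox's real value encoding.
EXPONENT_MASK = 0x7FF0_0000_0000_0000  # all exponent bits set => Inf or NaN
MANTISSA_MASK = 0x000F_FFFF_FFFF_FFFF  # nonzero mantissa => NaN
CANONICAL_NAN = 0x7FF8_0000_0000_0000  # the single NaN the engine itself emits
POINTER_TAG = 0x7FF9_0000_0000_0000    # made-up tag meaning "payload is a pointer"
TAG_MASK = 0xFFFF_0000_0000_0000       # top 16 bits select the value type
PAYLOAD_MASK = 0x0000_FFFF_FFFF_FFFF   # low 48 bits carry the payload


def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_nan_bits(bits: int) -> bool:
    """True if the raw bit pattern encodes any NaN, whatever its payload."""
    return (bits & EXPONENT_MASK) == EXPONENT_MASK and (bits & MANTISSA_MASK) != 0


def looks_like_pointer(bits: int) -> bool:
    """True if the hypothetical engine would treat this value as a pointer."""
    return (bits & TAG_MASK) == POINTER_TAG


def unbox_pointer(bits: int) -> int:
    """Extract the 48-bit payload the engine would dereference."""
    return bits & PAYLOAD_MASK


def canonicalize(bits: int) -> int:
    """Collapse every NaN to one canonical pattern; leave other values alone."""
    return CANONICAL_NAN if is_nan_bits(bits) else bits


# A "number" crafted by untrusted code: a NaN whose payload holds an address.
crafted = POINTER_TAG | 0xDEADBEEF
```

In this sketch, `crafted` is a NaN as far as `math.isnan(bits_to_double(crafted))` is concerned, yet `looks_like_pointer(crafted)` is also true. Passing it through `canonicalize` before it crosses the boundary discards the payload, which is the usual shape of the defense.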
To train better open models, we need predictable scaling. Delphi is Marin’s first step: we pretrained many small models with one recipe, then extrapolated 300× to predict a 25B-param / 600B-token run with just 0.2% error. Getting there to
Accepted to NeurIPS 2025, IFBench tests how well language models follow precise output constraints. It asks models to do things like answer only with “yes” or “no,” mention a specific word at least three times, or hit an exact sentence, w
What happens when you mix evolution x LLMs x RL? We evolve tuned interfaces for RL agents using LLMs and find that the transformed observation and rewards work better than native environment ones. Paper accepted at Reinforcement Learning
DeepSeek V4 Pro brings long-context reasoning and SOTA coding performance to Together AI serverless. The next layer is serving it efficiently: KV cache, prefix reuse, hybrid attention, batching, kernels, and endpoint profiles. We go deepe
The most extensive independent benchmark of LLMs for software engineering just got a big update! - How does GPT-5.5 compare to Opus 4.7? - Are open models catching up, and in what areas? - How do cost and performance stack up?
TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accu
2/5 The Technique behind the Efficiency: LLaVA-UHD v4. MiniCPM-V 4.6 debuts the new LLaVA-UHD v4 architecture, slashing vision encoding FLOPs by 55.8% without performance degradation. Intra-ViT Early Compression: Dramatically reduces visual
Together, those constraints expose a common failure: models can understand the topic and still miss parts of a request. "IFBench measures instruction following in a way that feels closer to real-world use than earlier ... evals," says @Ar
3/5 Extreme Computation Efficiency: Thanks to the LLaVA-UHD v4 technique, MiniCPM-V 4.6 nearly flattens the "Resolution-Latency" curve. Even with 3136² high-res images, TTFT is just 75.7ms on a 4090 GPU, 2.2x faster than Qwen3.5-0.8B. On
You'll help build out our in-house order and execution management systems, add new order types, integrate with new protocols (perps, prediction markets), track orders, calculate PnL, and anything else to make JTX the place for those looking
Artificial Analysis relies on our IFBench eval to test how closely models follow user prompts. Most evals in their Intelligence Index saturate within months. IFBench hasn't because it measures what others miss—and what frontier models sti
My favorite “cursed” computer bug was when AMD’s Zen 2/3 would occasionally play robotic / demonic audio randomly for a few seconds while also freezing the system. The reason was pretty funny. Windows 11 upgrades were…kind of a mess. The
We surveyed 349 technical researchers, engineers, and managers (in February–April 2026) about how they use AI tools at work. On average, participants self-report that AI use made their work 1.6–2.1x more valuable, and that this multiplier
1/?) As promised to Sander Dieleman (@sedielem), we’re finally excited to share: "Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion". We show that continuous diffusion can achieve
What happens when you compare the distributions of real and simulated user behaviors? The gap is large. We introduce a method to measure this gap and evaluate 24 LLM-based user simulators across coding and writing tasks. @convai_uiuc
Introducing Benchling Biologics: an end-to-end platform for antibody R&D, built for the speed and complexity that scientists need. Antibody-aware data model. No-code configuration for any format. Automated registration linking proteins,
My fav way to factcheck info when I am implementing something is the Paper Breakdown CLI tool. Basically you can just type: paperbd ask --arxiv <ARXIV> --query "your question" and get paragraphs that contain your answer (we are doing agen
1/5 MiniCPM-V 4.6 (1.3B) is now live. High-res visual processing, optimized for consumer-grade and mobile hardware. We’ve leveraged the latest LLaVA-UHD v4 technique to cut vision encoding costs by 55%, enabling native edge deployment with
Qwen released WebWorld, an open world-model series for web agents: 8B/14B/32B plus dataset, Apache 2.0. +9.9% on MiniWob++, +10.9% on WebArena. Matches Claude Opus 4.1 & Gemini 3 Pro on factuality, beats GPT-5 as a world model. Unified action space, 30+ st
In this paper, a 7B language model trained with reinforcement learning learns to orchestrate larger frontier models like GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro. It does so by writing natural-language subtasks, assigning each to one of
2/ @jess__yan + i showed a toy example of Outcomes: i had a Managed Agent make a generative UI w/ metrics (charts, graphs) rendered as svg. i used an Outcomes loop to improve the render timing - Claude figured out various tricks (prompti
New paper: FlexSQL, a Text-to-SQL agent that lets gpt-oss reach 65.4% on Spider2, outperforming agents built with large models like DeepSeek-R1 and o3. The key: just let the agent explore flexibly. Don’t collapse a complex query into one sch
Cost per task (API token pricing) varies >30x, token use varies >3x and cache hit rate is fairly high across agents
Today we’re launching the OpenAI Deployment Company to help businesses build and deploy AI. It's majority-owned and controlled by OpenAI. It brings together 19 leading investment firms, consultancies, and system integrators to help organiz
We are excited to welcome @suragnair this Tuesday to present CompBioBench: A benchmark of 100 diverse tasks for evaluating agentic systems in computational biology! 2:30pm Tuesday May 12 | CoDa E160 | Stanford and Zoom
Previously when adding init embedding x0 to deeper layers' value vectors, we detached x0's gradient in this extra path. It killed the improvement so we concluded the lift is just gradient benefit to x0. However, @classiclarryd raised an
the Q1 earnings cycle just doubled datacenter capex expectations in 2026-2027 alone, from $450b/year to $800b in 2026 and $1.16T in 2027. weekly token consumption has 3.5x’d since start of year. we were bullish before and we’re still gett
Zeta2.1 is out. Our edit prediction model now emits 3x fewer output tokens. Predictions are 28% faster at p50, and we're running 30% fewer servers to handle the same traffic. Learn more:
New paper from my group: "Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors". Gemini models are quite an outlier in terms of instrumental behaviors...
Codex /goal now has a native Kanban board. Starting a /goal run now fires up a lightweight Kanban board with clickable cards that move as Codex completes tasks. npx goalbuddy. Or update with npx goalbuddy update. Start the goal-prep skill
the "small" model behind this demo is a 276B total 12B active MoE (larger pretrains are cooking), sparsity ratio looks pretty standard compared to open models of the same size
BREAKING: 84 TanStack npm packages were compromised in an ongoing Mini Shai-Hulud supply chain attack, adding suspected CI credential-stealing malware. Socket flagged every malicious version within six minutes of publication. This is a
In under a week, our checkpoints have been downloaded 74.5K times on HF. To help more people try MolmoAct2: if you have a bimanual YAM, DROID Franka, or SO-100 but don't have the compute to run locally, DM me — I can host it for you to try
FYI, the other person helping me with Wan's PhD hood is Phil Lehman of the *Lehman-Yao B-Link Tree* from 1981. You can see Phil's name at the top of the README for @PostgreSQL 's B+Tree implementation: https://github.com/postgres/postg r
We are still hiring! Looking for a new grad (Masters/PhD) with a background in combinatorial optimization to join our Constraint Satisfaction & Algorithms team. Updated JD and application here: https://job-boards.greenhouse.io/cerebrassy
3 weeks since ml-intern launched and we just hit 1M messages exchanged. that's 3.3 agent-years of ML research in 21 days. 2 months worth of research every day. 17,383 training jobs total. talk about AI acceleration. here's some of what pe
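The rates in the ml-intern post can be sanity-checked with a few lines of arithmetic. This is a sketch that uses only the figures quoted in the post (1M messages, 3.3 agent-years, 21 days). The function names are illustrative, and treating a year as 12 equal months is an assumption.

```python
# Figures quoted in the post.
MESSAGES = 1_000_000
AGENT_YEARS = 3.3
DAYS = 21


def research_months_per_day(agent_years: float, days: int) -> float:
    """Agent-months of work accumulated per calendar day (12 months per year)."""
    return agent_years * 12 / days


def messages_per_day(messages: int, days: int) -> float:
    """Average number of messages exchanged per calendar day."""
    return messages / days


months_per_day = research_months_per_day(AGENT_YEARS, DAYS)
daily_messages = messages_per_day(MESSAGES, DAYS)
```

Under these assumptions `months_per_day` comes out a little under 1.9, consistent with the post's rounded "2 months worth of research every day", and `daily_messages` is a little under 48,000.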
We agree that parameterization of eval samples by difficulty/discrimination via IRT allows us to better measure model capabilities -- shown through our DatBench work. Updates soon on how better calibrated evals inform VLM data curation h
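For context on the IRT item above, here is a minimal sketch of the standard two-parameter logistic (2PL) item response model, where each eval sample has a difficulty and a discrimination parameter. The post does not say which IRT variant DatBench uses, so the choice of 2PL and the function names here are assumptions for illustration.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """2PL probability that a model with the given ability answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def item_information(ability: float, difficulty: float, discrimination: float) -> float:
    """Fisher information of a 2PL item at a given ability: a^2 * p * (1 - p)."""
    p = p_correct(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)
```

An item is most informative when its difficulty is close to the ability of the model being measured, and high-discrimination items separate nearby ability levels more sharply. That is the intuition behind selecting or weighting eval samples by these parameters.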
We believe this gives a comprehensive overview on where the low-hanging fruit for RL scaling still lies and can be useful for labour market impact assessments for economic modelling
New at http://makeitanimated.dev: The keyboard opens, three cards fan out, the label floats up, all on the UI thread. Slash app Login input on focus animation. React Native + Expo
First time I've seen a regression test case to catch a future bug! tl;dr; @pavan4820 finds a bug in SQLite's xfer optimization, then notices Turso does not have the optimization yet, and, therefore, sends a test case to detect the bug whe
Demystifying Manifold Constraints in LLM Pre-training "The empirical success of large language model (LLM) pre-training relies heavily on heuristic stabilization techniques, such as explicit normalization layers and weight decay." "While
Compute futures have arrived: quarterly, monthly & yearly contracts on Nvidia H100/H200 prices trading on Architect. Purpose-built for companies with exposure to GPU price volatility. Intuitive trading interface, instant settlement, our tea
some motivating ideas: RL is structured around task completion, which maps directly onto how occupational classifications are built. Prior approaches were not. The gap between those two is large for specific occupation groups to be meanin
It's amazing the extent to which codex has changed how I use software. Needed some gnarly oauth/ admin stuff done yesterday, and I just told codex to go off and do it. It interrupted me once, to ask for a 2FA key to be copy pasted, but othe
Musk v. Altman: Ilya Sutskever testifies that his OpenAI stake is worth ~$7B and he had concerns about Altman for a year before Altman's brief ouster as CEO (@rachelmetz / Bloomberg)
This is a tutorial on diffusion and flow matching, based on my previous postings here. I’ve made available the PDF, a python notebook for people to play with it, and the TEX source so hopefully one of you can translate to your language. ht
1/ Outcomes in Claude Managed Agents is just a "Ralph loop" to verify output vs a user provided rubric. it uses a grader sub-agent for the verification. some interesting points on the benefits of an isolated verifier here: https://anthrop
new research from me @METR_Evals : technical workers claim that today's AI impacts value of their work to an extraordinary degree (& growing over time). of course, self-reports plausibly overestimate. the magnitudes nonetheless strike me
750,000 cars built at Giga Berlin. Here’s what it takes.
Introducing mxbai-rerank-v3-listwise: reranking that goes beyond binary relevance. It reads the whole candidate set, resolves conflicts, and ranks by directives like recency, source priority, and multi-step rules. +11% NDCG@10 on average
Introducing the OpenAI Deployment Company, which will help businesses maximally succeed with their deployments of AI. Starting with 150 Forward Deployed Engineers and Deployment Specialists, and $4 billion of initial investment from 19 par
I think Reachy is the one who needs chess lessons… Robotics meets WebAI: Gemma 4 running fully offline on WebGPU with Transformers.js, controlling Reachy Mini over WebSerial. No internet, just a browser and a USB-C cable. What should Re
What can you or my agent do with all the 3D data collected (ideally by robots or my scanner)? Here I built, on actual customer data, an AI that automatically calculates stockpiles and volumes. Beware: the deltas of the stockpiles show what the company
Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform on 3 leading benchmarks, token usage, cost and more When developers use AI to code they’r
Scores by benchmark: Individual benchmark results broadly follow the overall Artificial Analysis Coding Agent Index, but help identify particular strengths and weaknesses. GPT-5.5 in Codex is strongest on SWE-Atlas-QnA and Terminal-Bench v2
Exactly Romain. Though not perfect, AlphaFold is one of the cleanest illustrations of the Bitter Lesson in bio. Pre-AF methods were essentially doing clever but brittle ‘search+human priors’ on a small, expensive dataset (PDB). Switching th
TS-DFM, Trajectory-Shaped Discrete Flow Matching, targets the trajectory bottleneck in few-step DFM rather than changing the student, objective, or inference path. DFM turns noise or mask tokens into language through hundreds or thousands
Still amazed that the M4 Max is staying completely responsive under constant 100% CPU usage
ChatGPT/1.2026.125 beta (Android) adds mentions of Codex Remote (connection to Codex desktop), including Codex Remote bubbles, pets sync from desktop and Codex Voice (voice call) "Codex uses the power of your desktop computer to build soft
This episode features an interview with Yao Shunyu @ShunyuYao14 , Research Scientist at Google DeepMind. Yao has held research scientist roles at both Anthropic and Google DeepMind, contributing to the development of key models including
We just released TurboAPI v1.0.30. 2.3x faster than FastAPI on WebSockets. On par with Go gorilla. - real RFC 6455 on the Zig core - 30-75% FASTER hot-path HTTP vs v1.0.29 pip install turboapi==1.0.30