Mini Shai-Hulud supply-chain worm crosses from npm into PyPI (x.com)
A credential-stealing package worm has moved beyond npm into PyPI, compromising high-download artifacts including opensearch-project, mistralai, and guardrails-ai
Selected one main Mini Shai-Hulud incident item plus one practical mitigation rather than several near-duplicate compromise reports; AI infrastructure is present but balanced with security, biology, energy, graphics, markets, hardware, and policy.
Google Threat Intelligence said it detected a threat actor using an AI-developed zero-day exploit before a planned wider attack could land
A major publisher is planning around the end of search and social referral economics rather than treating traffic declines as a temporary cycle
Prior Labs claims no-training tabular prediction at enterprise scale with 10–1000x faster inference and support for million-row datasets
Imported solar panels totaling 51.5 GW suggest Pakistan is building a decentralized solar economy far larger than official net-metering statistics show
Modern inference now spans reasoning models, agents, KV caches, heterogeneous hardware, and routing rather than a single request running on a single machine
A new line of work extends language-model approaches from decoding 5′ UTR regulatory grammar toward designing RNA programs that control cellular behavior
A 100-year-old public travel company is being taken private with an explicit plan to transform operations through AI rather than merely add software tools
A native JavaScript port of SQLite’s parser reportedly beats existing JS and WASM SQL parsers by 2.5x to 200x depending on the comparison
Enterprise AI spending appears mostly reallocated from software, services, headcount, BPO, and license consolidation rather than added as fresh budget
A new benchmark of novel research math problems from 64 mathematicians has frontier models scoring under 30%, beyond saturated olympiad-style tests
Apple’s application-processor-side Wi-Fi stack now combines mitigations like MIE and the XZM allocator in ways that make exploitation harder than on other platforms
The deal covers 13 early-stage oncology, hematology, and immunology programs, making it one of the largest China–global biotech alliances to date
A diffusion model produced two-layer RF designs with vias and closed-loop EM verification, with fabricated filters used to validate the approach
Activation checkpointing is central to training large models, and its PyTorch API history exposes the tradeoffs between memory, recomputation, and usability
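The memory versus recomputation tradeoff can be made concrete with a toy cost model. The sketch below assumes a chain of identical layers where each stored activation costs one unit of memory and only the input of each segment is kept as a checkpoint; `checkpoint_cost` is a hypothetical helper for illustration, not part of the PyTorch API.

```python
import math


def checkpoint_cost(n_layers: int, segment: int) -> tuple[int, int]:
    """Toy model: return (peak_stored_activations, recomputed_layer_forwards).

    Assumption: one checkpoint per segment is kept for the whole backward
    pass, and one segment at a time is recomputed and held in memory.
    """
    if not 1 <= segment <= n_layers:
        raise ValueError("segment must be between 1 and n_layers")
    n_checkpoints = math.ceil(n_layers / segment)
    peak = n_checkpoints + segment   # saved checkpoints + one live segment
    recomputed = n_layers            # every layer is run forward a second time
    return peak, recomputed


if __name__ == "__main__":
    n = 100
    print("no checkpointing:", (n, 0))
    for k in (5, 10, 20, 50):
        print(f"segment={k}:", checkpoint_cost(n, k))
```

Under these assumptions, a segment length near the square root of the layer count minimizes peak memory (about 20 stored activations instead of 100 for a 100-layer chain), at the price of roughly one extra forward pass.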
Flux Matching learns broader vector fields with the data distribution as stationary, enabling faster mixing, interpretable dynamics, and structural priors
World models can improve by having reinforcement-learning agents explore simulators and games to discover adversarial trajectories and failure cases automatically
Modern laptop-scale geospatial tools can draw arbitrary regions, query millions of parcels, and tabulate surface ownership interactively
CODA maps suggest vascular systems across species obey the same space-filling fractal geometry with dimension three
A full rendering walkthrough connects atmospheric light scattering, sunsets, and planet-scale views into a real-time graphics implementation
A robotics foundation model was adapted to fly a drone by outputting directional velocities inside the flight control loop rather than waypoint commands
Agner Fog’s regularly updated C++ optimization guide remains a compact reference for low-level performance work across CPUs and compilers
A Boyden lab technique can fabricate nanoscale devices for manipulating visible light, potentially supporting future optical computing hardware
Orbital data centers are moving from speculative pitch to reported launch discussions between two of the companies capable of testing the idea
New revenue-sharing terms would cap OpenAI’s payments to Microsoft far below a prior path that could have reached $135B through 2030
A $12M no-bid contract led to a strange trail of corporate misrepresentation, including a fake-looking development chief still bearing a stock-photo watermark
Setting a minimum release age is not enough if packages can still pull remote GitHub references, which pnpm can block with blockExoticSubdeps
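To illustrate the kind of reference the item above warns about, here is a minimal sketch that flags dependency specifiers resolving outside the registry (git URLs, GitHub shorthand, remote tarballs). The prefix list and the function names are assumptions chosen for illustration; this is not pnpm's actual implementation of blockExoticSubdeps.

```python
import re

# Hypothetical list of specifier prefixes that bypass the package registry.
EXOTIC_PREFIXES = (
    "git+", "git://", "github:", "gitlab:", "bitbucket:",
    "http://", "https://", "file:",
)

# GitHub shorthand such as "user/repo" or "user/repo#commit".
GITHUB_SHORTHAND = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+(#.+)?$")


def is_exotic(spec: str) -> bool:
    """Return True if a dependency specifier points outside the registry."""
    spec = spec.strip()
    return spec.startswith(EXOTIC_PREFIXES) or bool(GITHUB_SHORTHAND.match(spec))


def find_exotic(dependencies: dict[str, str]) -> dict[str, str]:
    """Return the subset of a name -> specifier map that looks exotic."""
    return {name: spec for name, spec in dependencies.items() if is_exotic(spec)}


if __name__ == "__main__":
    deps = {
        "left-pad": "^1.3.0",
        "payload": "github:attacker/payload#abc123",
    }
    print(find_exotic(deps))
```

A release-age check only sees registry versions, so a specifier like the second one above would never be subject to it; that is the gap a block on exotic sub-dependencies is meant to close.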
Removing tutoring-style interaction pushes students back toward answer-giving assistants, which can create the illusion of learning without retention
Modded-NanoGPT optimization result #14 (2026/05/04): @Sam_Acqua has achieved a new record of 3150 steps (-60), by adding SOAP preconditioning before Muon orthogonalization for the MLP weights (SOAP-Muon).
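For readers unfamiliar with the orthogonalization step mentioned above, here is a minimal pure-Python sketch of a Newton-Schulz iteration that pushes a matrix toward its nearest orthogonal factor. It uses the textbook cubic iteration as an assumption for clarity; Muon itself uses a tuned quintic variant with far fewer steps, and the SOAP preconditioning from the record is not shown because the post does not describe its details.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_norm(a):
    return sum(x * x for row in a for x in row) ** 0.5


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix g.

    Cubic iteration X <- 1.5 X - 0.5 X X^T X, which drives every
    singular value toward 1 once they start inside (0, sqrt(3)).
    """
    norm = frobenius_norm(g)
    x = [[v / norm for v in row] for row in g]  # singular values now <= 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xi - 0.5 * yi for xi, yi in zip(rx, ry)]
             for rx, ry in zip(x, xxt_x)]
    return x


if __name__ == "__main__":
    update = newton_schulz_orthogonalize([[2.0, 1.0], [0.0, 1.0]])
    print(matmul(update, transpose(update)))  # close to the identity matrix
```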
Launching Agentick: a unified benchmark for training and evaluating general sequential decision-making agents. RL agents, LLMs, VLMs, hybrids, bots, and humans can all be evaluated on the same tasks, the same seeds, and the same score. First result: n
Modded-NanoGPT optimization result #13: @benjamintherien has achieved a new record of 3210 steps (-15), by wrapping NorMuonH in a MuLoCo-style outer Nesterov SGD. Compared to the target loss, this result has a p-value of p=1.3e-4. Compar
You can have between 10,000 and 5 million sandboxes running concurrently on @daytonaio. Try calling AWS and asking for that type of concurrency. They'll ask a ton of questions, and it's going to take a lot of time to get provisioned. For
GB 200s change how one does the prefill and decode disaggregation when serving large MoEs like Qwen. We’ve published details of our stack quantifying the throughput benefits compared to serving on Hoppers.
Microsoft is investigating mistralai PyPI package v2.4.6 compromise. Attackers injected code in mistralai/client/__init__.py that executes on import, downloads hxxps://83[.]142[.]209[.]194/transformers.pyz to /tmp/transformers.pyz, and laun
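A minimal sketch of a local check for the two indicators named in the report above: the compromised version number and the dropped file path. It reads installed package metadata without importing the package, since the injected code reportedly executes on import. The helper names are hypothetical, and a clean result from this check is not proof that a machine is unaffected.

```python
from importlib import metadata
from pathlib import Path

# Indicators taken from the report above; nothing else is assumed.
COMPROMISED_VERSIONS = {"2.4.6"}
DROPPED_FILE = Path("/tmp/transformers.pyz")


def installed_version(package: str):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_indicators(version, dropped_file=DROPPED_FILE):
    """Return a list of human-readable findings (empty if none match)."""
    findings = []
    if version in COMPROMISED_VERSIONS:
        findings.append(f"mistralai {version} matches the reported compromised release")
    if Path(dropped_file).exists():
        findings.append(f"{dropped_file} exists on disk")
    return findings


if __name__ == "__main__":
    for finding in check_indicators(installed_version("mistralai")):
        print("WARNING:", finding)
```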
San Francisco Compute builds both the financial & technical layer. We build the order book & then we build everything else: the VM orchestration, the clusters, and the data centers. We were the first to do this, we hit scale, are growing
super cool and ambitious initiative to try to fully automate a significant chunk of optimization research! also can't wait for this low-rank hyperparameter transfer paper.
Crazy I was able to make a ~1200 page math document with Codex that should be correct after agents checking through it over and over. /goal is cool
cool work :) if you've ever tried world models, you know how easily they break (e.g. stare into grass in minecraft and you'll easily fall OOD) using RL to find adversarial trajectories and then improve the world models is great - esp. if
The elegance of Slime lies in combining the best existing components (SGLang + Megatron + Ray) in the cleanest way. Its top-level logic is simple with only dozens of lines yet each module has enough depth to handle complex engineering det
Crabbox 0.12.0 is live: Azure Windows desktop + WSL2; Proxmox + Tensorlake providers; preflight, failure bundles, phase timing; keep failed boxes around for SSH debugging. Remote test boxes got much less slippery.
The probability of this result not reaching the target loss is p=1.1e-4. The probability of this result being obtainable by shortening the previous record is p=1.4e-6. Reproducible log:
@augmentcode rebuilt their context compaction layer around Mercury 2. 82% latency cut. 90% cost cut. Comparable quality to Opus 4.7. Running in production today. "We took a counter-intuitive bet. We decoupled summarization entirely, offlo
1/ A single neuron is sufficient to bypass safety alignment in LLMs. Across 7 models, 2 families, and scales from 1.7B to 70B, suppressing one MLP neuron bypasses refusal behavior — with no fine-tuning and no prompt engineering. We call
You can extend every step of Claude Code's agentic loop. I've been thinking a lot about what that means for the last one. What are you doing to help Claude verify its own work? Genuinely want to hear what workflows people have.
We’ve built a tool called Genie that turns meetings into software. If someone on the team says “I wish we had a tool for X” during a meeting, Genie automatically builds it. How it works: • analyzes granola meeting transcripts • creates L
Also, I realized that JAX itself isn't magic per-se. E.g. training a regular GPT2 on the latest 6th gen TPU hardware is around 85 minutes, while modded GPT2 on PyTorch can do under 2 minutes
Recently we showed that the minimax optimal rate for multicalibration is T^{2/3}. But that doesn't mean you have to do that badly on all instances. We give an algorithm that can adapt to easy instances and get better rates while still being
This is a really fun and multi purpose feature! I currently use these APIs to hold and then cleanly evict kv cache from spawned subagents since StreamingSessions are not added to the RadixTree or written to lower memory tiers.
Hmm, could not handle the FOMO of @antirez DS4 so I made it work on my Strix Halo using ROCm HIPify
Verification bottlenecks progress. Bandwidth bottlenecks verification.
Real time, multimodal, full duplex. Super excited about this model. You can also feel the tremendous multimodal infra behind this demo.
We started out trying to benchmark the AIs... We had experts create the benchmark... we had experts validate the benchmark... Then AIs started doing well on the benchmark... Now AIs found critical errors in the benchmark itself the human
1/ Following our previous MoE paper w/ @hayou_soufiane (https://arxiv.org/abs/2604.09780), we confirmed that scaling the residual stream: h^{\ell+1} = h^{\ell} + \alpha \Delta^{\ell} improves MoE load balancing at initialization by reduci
The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics.
The third semis memo is out. We talk about power & analog semis, the orchestration plane in the agentic era, the neoclouds trade, the interconnect bottleneck (probably the biggest limiter for 2026-27), Korea Unlocked
Today: OpenMed Agent ships in preview. Built on @huggingface : → HF endpoints power clinical extraction + terminology → MCP for your own services → Every tool call, every plan, fully visible 1,000+ OpenMed medical models on HF. Preview
Cost_train = Cost_inference. It's never too late, https://arxiv.org/abs/2503.14647 Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache
2.5GB cold start in less than 2 minutes... At my previous company, the optimization I did for serverless deployment of large models was 8GB cold start in less than 20 seconds.
What if letting frontier LLMs design their own test-time scaling strategies is much easier than it sounds? Introducing AutoTTS — an environment-driven discovery framework. Humans define the right environment; frontier coding agents discove
seems like amd's ATOM is capable of providing the FASTEST open source inference. we used AMD's ATOM to beat all other providers on @ArtificialAnlys and provided the code below. truly excited for the new wave of heterogeneous compute!!!
Read this article carefully. You will be hearing a lot more about the PCB/interconnect bottleneck when mass production of TPU v8, Rubin, and Trainium3 starts in Q4 2026
7/ Regarding the frontend/backend design (what they call the "interaction" and "background" models): (i) How do you teach the frontend model when to defer to the backend? LLMs famously have problems knowing what they don't know. For a
Congrats to @andrew_li03 and the @JudgementLabs team on their fundraise! I vividly remember back in early 2025 when Andrew explained to me why agent monitoring and evaluation would be so crucial for any enterprise. It’s awesome to see t
And, of course, they should be plotted with compute, latency, or cost on the x-axis.
Preparing an AI evaluation budget by just estimating how many human hours the task would take and discounting prevailing human wages. This is @joel_bkr's thought.
xAI revamps the "Grok Computer" section that was mentioned last week. Now the setting more accurately says "Work Folder" and gives you 2 options: Default (Grok's sandbox "computer") or Google Drive. This will allow Grok to work dire
it's very important our inference business has customers worldwide. the entire game is keeping GPUs busy. hard to do that if your customers are all in one timezone
Even without releasing Mythos, cyberattack threat surfaces are way larger than you'd think. Attackers can put agents into an RL-like loop until they find vulnerabilities and lift attack success rates. Expect to see big scale-up of cyberatt
Diffusion world models can help test and improve robot policies before running them on real robots. But can the choice of latent space make the WM more faithful? We show that semantic spaces beat reconstruction spaces on task relevant met
1/ The "20 tokens per parameter" Chinchilla scaling law is flawed. It is an artifact of your tokenizer. Scaling shouldn't be measured in tokens at all. It should be measured in bytes.
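The claim above can be illustrated with simple arithmetic: the same tokens-per-parameter ratio corresponds to different amounts of raw data depending on how many bytes a tokenizer packs into each token. The bytes-per-token values below are hypothetical and chosen only for illustration.

```python
def bytes_per_parameter(tokens_per_param: float, bytes_per_token: float) -> float:
    """Convert a tokens-per-parameter ratio into bytes of text per parameter."""
    return tokens_per_param * bytes_per_token


if __name__ == "__main__":
    # Hypothetical tokenizers with different compression rates.
    for name, bpt in (("tokenizer_a", 3.0), ("tokenizer_b", 4.5)):
        print(name, bytes_per_parameter(20, bpt), "bytes per parameter")
```

With these assumed rates, "20 tokens per parameter" means 60 bytes per parameter for one tokenizer and 90 for the other, so the token-based ratio does not pin down a fixed amount of data.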
one of the big challenges with byte-based (hierarchical) LLMs is the slow decoding, since that is byte by byte, even with a smaller decoder. Glad to see model architectures addressing this bottleneck.
Benchmark on AI literature review quality: depth × reliability × breadth. DeepSeek-V4-Pro (1 in 92) and Claude Opus 4.7 (0 in 104) show the lowest hallucination rates on this task. DeepSeek-V4-Pro’s writing ability feels like a real qual
On-policy distillation (OPD) is one of the most effective LLM post-training methods, but it traditionally requires a costly live teacher server throughout training. In our latest work, Lightning OPD, we show that OPD can be performed fully
thoughts after doing a bunch of synthetic data gen for eval + environment building - LLMs are incredible projections of the world bundled into a set of weights - but doing targeted extraction of certain distributions from those weights is
Not all diffusion noise is equally useful for training! We introduce NoiseRater: a meta-learned framework that scores and selects informative noise instances during diffusion training. Instead of treating Gaussian noise uniformly, we lear
Update 5:05 PT: The attack has now expanded well beyond @TanStack and @Mistral. 373 malicious package-version entries across 169 npm package names, including @uipath, @squawk, @tallyui, @beproduct, and more. The malware propa
How well do MLLMs and agentic video frameworks handle questions (e.g., tracking objects or abstracting recurring behavior patterns) over long-horizon videos, which often require memory to retrieve and aggregate information across time? To
My first Cloudflare ship is in wrangler@4.90.1. It fixes remote bindings hanging indefinitely when closing a wrangler dev session. A little side ship while I was working my way through onboarding, simple fix but tricky to nail down - fun w
Update on the jax-js thing: I've gone back to TensorFlow.js. It's simpler (direct WGSL instead of 2 compilation stages), and performance was easier to improve in tf.js. Kernel fusion only gives you marginal benefits, and for interpretabilit
There will be many winners and losers in the next 10 years while the stack for AI compute matures. If you want to succeed, focus on timeless numbers like FLOPS/$, FLOPS/W, GB/s/$ and GB/$. Any focus on applications will have a short shelf
Compare Speech to Speech models on Tau voice: https://artificialanalysis.ai/speech-to-speech … Methodology: https://artificialanalysis.ai/speech-to-speech/methodology …
someone already wrote a love letter to pi, by @badlogicgames . so we wrote a love paper to pi :) with my teammates @xuzihuan4 and @lintool . a few days ago, i promised i’d share some fun plots once Pi-Serini joined the BrowseComp-Plu
These days, companies are struggling to keep their AI agents from running amuck. Judgment Labs, led by 22-year-old Alex Shan, is tackling agent monitoring and evals and raised 2 back-to-back rounds from Lightspeed, most recently at a $175m
For anyone building scientific agents on top of stochastic generative tools, another proof that the bottleneck right now is the evaluate-and-filter loop, not the model and not the tool catalog. Striking new benchmark of LLM agents for prot
if you are building knowledge worker agents that require more setup than the equivalent of "here's a laptop in the mail show up to the office Tuesday at 8:30am" you're ngmi
There is a consistent thread among frontier researchers: the best training grounds for model breakthroughs are domains with massive, discrete search spaces and easily verifiable outcomes. Think of Sudoku. In principle, you can brute-force
The tradeoff here should be trading a slower cold start speed for a faster inference speed, similar to how vLLM pre-caches a CUDA graph during each startup to reduce the overhead of continuously launching kernels during inference. Since thi
this is exactly correct. if the agent can’t use it immediately, it won’t be used. build the software for agents first, humans are stakeholders. challenge: you have to build for the dumbest agent model someone might use. “doesn’t work! Btw
the window has closed on building products that I, the customer, must integrate with. I should be able to drop agents into the workspace with zero setup and they set themselves up
One fun accidental discovery during my PhD was when I heated up my superconducting resonator by spamming the piezo motor in the dil fridge to see if it was even working. This is the resonator thermal noise peak, the frequency s