Agentic workloads are shifting inference economics toward 100k-token requests
SemiAnalysis measured 432k real coding-agent requests and found a 96k-token median input, making KV cache and long-context serving the next infrastructure bottleneck
Top 90 curated tweets ranked for substance on 22 May 2026 UTC.
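To see why a 96k-token median input puts pressure on KV cache, a back-of-the-envelope estimate helps. The sketch below uses the usual per-token KV size for a transformer with grouped-query attention: keys and values stored for every layer. The model dimensions are hypothetical placeholders chosen for illustration, not figures from the SemiAnalysis measurement.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache held for one request (keys + values, all layers)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token

# Hypothetical model shape (an assumption, not taken from the source).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

median_request = kv_cache_bytes(96_000, LAYERS, KV_HEADS, HEAD_DIM)
print(f"{median_request / 2**30:.1f} GiB per 96k-token request")  # 23.4 GiB
```

Under these assumed dimensions a single median request already holds tens of GiB of cache at 16-bit precision, which is why cache reuse, quantization, and eviction policy matter more as requests grow.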
MTProto over unencrypted TCP can reveal auth_key_id, enabling tracking even without breaking Telegram message encryption
Brain-wide single-cell recording at this scale changes what can be measured about distributed neural activity in living animals
A small building-code change can free up to 56% more living space per floor by replacing duplicated stair cores in many apartment buildings
Supply-chain attacks are moving from isolated package incidents to coordinated compromise campaigns against the software commons
The lecture connects primitive digital circuits to modern accelerator architectures and the tradeoffs that make each compute substrate different
Decoupling GPU and HBM with optical bridges is emerging as a possible response to the memory wall in AI semiconductors
Micron says it has started producing advanced DRAM in Manassas, strengthening domestic U.S. memory manufacturing capacity
Two billion web pages are now directly accessible as training data, lowering the friction for pretraining experiments
Rewriting transformer operations as matrix multiplies plus epilogues suggests LLMs and humans can generate near-optimal kernels from a small primitive set
Scale reports that open-ended RL can improve checklist scores while broader quality declines because models optimize the verifier setup itself
NV-Generate-CTMR produces realistic 3D medical volumes with paired segmentation masks, giving researchers more training data without using patient scans
Offline audio diffusion can be transformed into real-time interactive instruments for live performance on local hardware
Standardized hardware and scene initialization make real-world robot model evaluation easier to reproduce across labs
Agents exploring 3D worlds need persistent memory to stay curious rather than repeatedly rediscovering the same spaces
SKF uses moment representations and score matching to handle nonlinear non-Gaussian filtering while exactly recovering Kalman-filter information in the linear-Gaussian case
A live page showing 6,244 people waiting across California DMVs turns a painful public-service queue into observable infrastructure
Every street name becomes data in a visualization of how Toronto cycling infrastructure evolved over two decades
A 117-year-old seasoning company became strategically important to AI hardware through materials used in advanced chip packaging
New Census LACE data exposes tract-level patterns in where American households still lack AC as heat risk rises
Researchers earned a UniFi bounty for a path traversal flaw affecting UniFi OS devices, adding another case study in appliance security
The 2026-07-28 MCP candidate is stateless, dropping handshakes and session IDs so any request can hit any server instance
A standardized in-memory columnar layout lets Python, Rust, JavaScript, and other systems share data without repeated conversion between CSV, Parquet, and JSON
Keeping data layout opaque to the API preserves direct CPU-to-GPU writes and lets shaders interpret memory without needless framework constraints
A Python-based x86-to-LLVM lifter makes binary analysis and transformation more accessible to researchers and tool builders
A small shell-alias layer can prevent accidental installation of known malicious npm packages locally and in CI
Synchronized jobs that run every five minutes can spike databases and APIs, while randomized start times make autoscaling and backends behave better
Lilly’s highest retatrutide dose reportedly produced 28.3% average body-weight loss over 80 weeks, with 45% of patients losing at least 30%
Americans for Fair Markets will lobby federal policymakers against sportsbook and casino incumbents as prediction markets fight for legitimacy
Legacy anchor tenants often held veto power over mall management, creating governance gridlock after Sears and JCPenney collapsed
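The headline above about synchronized five-minute jobs can be illustrated with a small jitter sketch. It shows two common variants: a fresh random delay per run, and a deterministic per-job offset so each job keeps a stable slot. The function names and the 60-second jitter window are hypothetical choices for illustration, not something the source specifies.

```python
import random

INTERVAL_S = 300  # the five-minute schedule from the headline

def next_start(scheduled_at, max_jitter_s=60, rng=random):
    """Delay a scheduled start by a uniform random offset."""
    return scheduled_at + rng.uniform(0, max_jitter_s)

def stable_offset(job_id, max_jitter_s=60):
    """Deterministic per-job offset: the same job always gets the same slot."""
    return random.Random(job_id).uniform(0, max_jitter_s)

# 1000 jobs that would all have fired at t=0 are now spread over the window.
rng = random.Random(0)
starts = [next_start(0.0, rng=rng) for _ in range(1000)]
print(min(starts) >= 0.0, max(starts) <= 60.0)
```

The stable variant is easier to reason about when debugging, while the per-run variant avoids any fixed collisions between jobs that happen to hash close together.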
Agents deserve Servers, not Computers. Every agent is getting a computer (whether it's localhost or a sandbox) as part of their harness. Evals like TerminalBench even assume the presence of a computer! It's been a long time coming: a comput
PP 1841.7 tk/s | TG 101.3 tk/s | Context 735K 2 x #2080Ti 22GB NVlinked run Qwen3.6-27B-AWQ through vLLM TP=2 MTP K=3 KV=tq4nc single request at extraordinary performance! Maximized AI value of the $500 legacy setup. https://github.com/w
Excited to share that @modal is supporting Stanford CS321M: AI Measurement Science with compute for class assignments, student projects, and GPU scoring infrastructure for the Predictive AI Evaluation Challenge.
Even more striking: ~50% of requests already exceed 128k tokens. The driver isn't user prompts getting longer. It's everything the agent stuffs in before you even type: system prompts, tool definitions, skills, MCP schemas, prior turn conte
Inference economics are shifting. Expect more "fast tier" pricing (Opus Fast, Gemini Flash), more specialized inference hardware (Cerebras, Groq), and more pressure on KV cache management. The next bottleneck isn't model intelligence. It's
we setup gangprompt.opencode.ai (threw it behind cloudflare access for SSO) which is running an opencode server on a fast machine with all our repos cloned now our whole team can pop in there and prompt some stuff and see everything ever
Claude Sonnet 4.8 Leaks - Anthropic accidentally shipped a massive 512,000-line internal debugging source map through a Claude Code npm update on March 31, 2026 - The leaked source code references Sonnet 4.8 inside unreleased keyword filte
Two new tokenisation methods hitting arXiv today, both based on linear programming, and both with SOTA results! Check them out! Tokenisation via Convex Relaxations: https://arxiv.org/abs/2605.22821 Tokenization with Split Trees: https:/
Extremely proud of the team @cartesia for launching Sonic 3.5, which sets a new state of the art for TTS I personally led the technical direction of this model; we built it ground up from first principles, and it contains multiple non-tr
Wow! 4 MacBooks serving 40 tok/s+ on a 230B param model hmmmm I thought the gatekeepers said this isnt possible
Today, Zyphra Research is sharing fundamental work extending Equilibrium Propagation beyond Energy-Based Models to biologically realistic neuron models. A step toward more efficient AI, local learning, and future hardware beyond backprop.
INPUT (CACHE HIT) $0.003625 FOREVER
I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (11 am–1 pm PDT)! FlashInfer ran a kernel competition for B200 GPUs. Our team ( @thepushkarp + me) won 1st place on the Gated Delta Ne
MoE (8): Enforcing Sequence-Level Balance https://kexue.fm/archives/11760 This article explores how to achieve sequence-level load balancing without incurring any loss penalty. Starting from the original Quantile Balancing (QB), we gradu
Meta has proposed a method to patch audio waveforms and generate them directly with DiT. Paper: https://arxiv.org/abs/2605.18749 Demo: https://facebookresearch.github.io/WavFlow/ I thought it would take a bit longer for direct DiT gen
New from NVIDIA! You can edit a model’s compressed memory without scrambling what it already knows! Enter Gated DeltaNet-2. It separates the erase and write operations in linear attention using two independent gates – one for forgetting
subtle agent orchestration change that anecdotally works better in pretty much every case I’ve seen (need to run some evals) bossman supervisor >> external judge >>> self reflection - when verifying agent outputs using a fresh judge (wit
Pro-tip: using CUDA graphs and annoyed that all the kernels have no labels in your profiles? Get a nightly that has mark_kernels context manager: https://github.com/pytorch/pytorch/pull/179768 … (thanks Natalia and Shangdi for implementi
New paper! Post-training doesn't build the Assistant, it just turns up the volume on personas that pretraining already laid down, at 0.22% of total tokens! We traced them across OLMo-3 and Apertus here's what we found
AI is now a major part of scientific research. But can it actually forecast scientific progress? We tested 6 frontier models on 4,760 real breakthroughs under strict knowledge cutoffs. They recognize science. They can't forecast it.
Got a thermal camera to check my server GPUs are hot because they're thinking hard the dog is... not
on the day of modal's series c announcement i am ... getting microsoft word to run in a modal sandbox
We discover the 𝐀𝐬𝐲𝐦𝐦𝐞𝐭𝐫𝐢𝐜 𝐑𝐨𝐥𝐞𝐬 𝐨𝐟 𝐃𝐚𝐭𝐚 𝐆𝐚𝐭𝐢𝐧𝐠 𝐚𝐧𝐝 𝐑𝐞𝐰𝐚𝐫𝐝 𝐆𝐫𝐨𝐮𝐧𝐝𝐢𝐧𝐠 𝐢𝐧 𝐒𝐞𝐥𝐟-𝐏𝐥𝐚𝐲 𝐑𝐋: data gating, not reward grounding, is the binding constraint on stability. A strict gate stabiliz
Outcome rewards in LLM RL are sparse --> AVSD (Adaptive-View Self-Distillation) turns privileged info into dense token-level supervision, and instead of relying on only one privileged view, it combines multiple views and balances stable cr
This week, I solved a problem in RL involving ludicrous sparsity that I have been thinking about since 2018. Initial sweeps are showing SOTA on one of our most consistently informative test envs. Blog post soon. For now, you can follow the
My favorite detail in the CODA paper is delaying the RMSNorm scale to the next GEMM just to dodge a VRAM roundtrip. We waste so much bandwidth writing to memory just to run activations. Hide your math in the epilogue while the tile is still
What surprised me most about building a computer use model was the data. When the field is early, there's no established corpus to draw from. No prior dataset that captures how humans actually navigate computers… the full range of applic
More of a codex fan? Codex app server emulated inside a Durable Object: codex --remote wss://codex-do.southpolesteve.workers.dev:443
we studied some suspected effects of subword tokenization on language model training, and found which of them actually mattered. this also led us to try to amplify them, resulting in the Token Superposition work we previously shared
Feel the power of KV Cache
Our AI audio pipeline just ranked #1 in a blind benchmark against NotebookLM and Spotify’s Save-to-Spotify workflow. Across 100 matched topics and 600 blind evaluations, SUN came out on top.
Big docs update for @Cloudflare MCP Server Portals head into this Memorial Day Weekend -- troubleshooting, service token auth, tool policies, DLP, Terraform, API reference, architecture docs, and more. Nearly all of it came from user fe
this AI UGC video was made for under $1... my V3 system has officially killed the UGC industry, and i mean it. this video genuinely cost barely a dollar to make and no, it's not Seedance 2.0 or any model you've seen before for the longe
By end of year I think 95%+ agent sessions will come from automations and events. We already see this happening @cognition where more than 50% of Devin customer sessions are triggered by non-humans. Learning how to build these types of s
Perplexity open‑sourcing Bumblebee, a read-only endpoint scanner for risky dev and AI tooling configs. This definitely wasn’t on my bingo card.
Excited to share that I will be joining @amazon this summer as an Applied Science Intern! I will be working with the @amazonquick team on improved reliability in multi-agent systems. If you are in Seattle this summer, I would love t
The wait is over! Today at #MLSys, we'll give a talk to reveal the final results and present the awards for the FlashInfer AI GPU Competition! I'll also introduce FlashInfer-Bench: an agent-oriented Benchmark Engine designed for producti
#MLSys2026 Event Tensor is our new take on how to bring first-class shape (for dynamic batch size) and data-dependent dynamism (and MoE) into megakernels while minimizing the runtime part through compilation. Check it out
> born too late to build file systems > builds them anyway
Thanks for sharing this. The Huggingface Hub, instead of being considered as a huge dataset, should be considered as a dynamic discovery engine that can be directly executed and verified.
Coding agents are putting selective pressure on CLIs. If Claude thinks `pulumi org member list` is the way to see org members and it's not, then it's worth considering if it should be. Pulumi CLI has now been revamped around this idea.
we need a new benchmark for this type of stuff. @TaviNeverSleeps performs SOTA on people search score 97.03 on the PSB https://arxiv.org/abs/2603.27476
Earlier this week we confirmed @Lighter_xyz 's desert verifier reproduces byte-exact from public source. Today, we explain what that actually means for exchange risk. Every centralized venue runs on one trust assumption: you trust the ve
The deeper I studied Postgres extensions, especially TimescaleDB, pgvector, PostGIS, and PL/pgSQL, the more I realized how amazing Postgres’s design is. It took me another month to really understand how they work inside.
codex (5.5) is running for 4h30m for absolutely nothing the task probably takes 10 min max and it hasn't done anything so, that's what special model meant how do we tell it the truth
Happy Friday — one more thing: We’ve open-sourced OpenBridge, a local-first / BYOK version of @bridge_surf and our Computer Use stack. You can now run the full computer use system locally with your own models and API keys — with complet
https://arxiv.org/abs/2605.22769 Could it be better to pretrain on temporally ordered data? It could bias the model towards recent information. I have wondered when information is updated or changed over time whether the model is able to
After recent upgrade @manaflowai (cmux) is completely unusable, I have about 12 tabs with 3 actively running ~2 claude sessions and it stutters to a complete stop after 10~15 minutes. If I ignore it it freezes my maxxed out M3. cc @aus
If you want to age your sys admins 30 years overnight, remember that Active Directory is fully Unicode compatible, so you can rename your laptop with emojis in its hostname, and it will reflect like that in AD ping desktop-.mycompany.local
Very cool work building off of sleep-time compute that shows how to *learn* how to consolidate memory for end-task performance Learned methods for memory is a new and exciting space that we'll be sharing more on soon :)
uhh so... i realised that the same tooling that makes universal diffs & patches fast would also make a REALLY fast state lib. turns out it benchmarks CONSIDERABLY faster than all of the existing state libs in js.
Hurricane process. Zero shader code written, all base layers in Unicorn. Lots of craft went into this one. Live demo: https://unicorn.studio/embed/bG5xs8kLK1bwLKAiEWY9?controls=1 …
What a day, but on a more positive note, we did a release this morning, and the latency on the CLOB is now 500% better. You should see a huge difference. Thanks for waiting for the fix. It was a hard one because it was at the lowest level o
[[ M 1/4 ]] things achieved > basic setup and docker compose starts all 8 services > transfer endpoint commits cross shard transactions > holds around 1000 + transactions > benchmarked our latency numbers which further we will use in ou
FYI people, whilst this picture is pretty to look at, it is *not* the construction GPT obtained (which has far too many points to draw), this is just an example of a number field lattice that breaks for specific n.
We've run into a new set of tasks that are challenging to train on because they involve internal tools and processes or customer preference data that can't be found on the internet. RL with low success rate doesn't get you very far (after a
AI is embarrassing a lot of senior engineers. A junior who touched the frontier yesterday often has better instincts for what’s possible than someone experienced who last touched it six months ago.
Every memory system for LLM agents evolves what it stores. None evolves how it retrieves. EvolveMem is out, now shipping inside the SimpleMem v0.3.0 update. Powered by AutoResearch: the system researches its own retrieval, treating the fu
A tool that traces & visualizes all memory accesses. So much nicer than staring at the code.
I realized that what I cannot profile, I cannot optimize. This is why I embarked on a little project in Diffusers, to try to profile important pipelines, identify bottlenecks for torch.compile, and fix them. Got decent results. I documen
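The CODA detail quoted earlier in this digest (deferring the RMSNorm scale to the next GEMM) rests on a simple identity: a per-row scalar commutes with a matrix multiply, so the 1/rms factor can be applied to the GEMM output instead of the input. The sketch below checks that identity numerically in plain Python. It is one reading of the tweet, not the paper's implementation; the function names are hypothetical, and folding the gain vector into the weights is an added assumption.

```python
import math

def matmul(a, b):
    """Naive matrix multiply for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv_rms(row, eps=1e-6):
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)

def rmsnorm_then_gemm(x, gain, w):
    """Baseline: materialize the normalized activation, then multiply."""
    normed = [[v * inv_rms(row) * g for v, g in zip(row, gain)] for row in x]
    return matmul(normed, w)

def gemm_with_deferred_scale(x, gain, w):
    """Fold gain into the weights; apply 1/rms to each output row afterwards."""
    w_folded = [[g * v for v in w_row] for g, w_row in zip(gain, w)]
    raw = matmul(x, w_folded)  # GEMM runs on the unnormalized input
    return [[inv_rms(row) * v for v in acc] for row, acc in zip(x, raw)]

x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
gain = [1.0, 0.5, 2.0]
w = [[1.0, 0.0], [2.0, 1.0], [0.0, -1.0]]  # shape (d_in, d_out)
a = rmsnorm_then_gemm(x, gain, w)
b = gemm_with_deferred_scale(x, gain, w)
print(all(math.isclose(p, q, rel_tol=1e-9) for ra, rb in zip(a, b) for p, q in zip(ra, rb)))
```

In a fused kernel the second form means the normalized activation never has to be written to memory; the row scale is applied while the output tile is still on chip.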