Evals Are Strategic IP (x.com)
Measurement only becomes a moat when it captures outcomes nobody else can easily see
Trimmed most of the AI launch noise and kept the strongest infrastructure, research, policy, market, and a few standout tools
A company that can verify work instead of just generating it compounds trust and quality
Benchmarks are useful only when someone actively tries to break them
Testing offensive cyber on real systems moves the debate from toy evals to deployable capability
The cost model of software changes when subscriptions behave like base load and APIs like peakers
Compute futures turn AI infrastructure into a tradable asset class before the underlying market even matures
Leveraged loans are now financing the physical buildout behind the AI boom
China just crossed from research to treatment with the first CAR-T approved for a solid tumor
A major political resignation changes the British government’s immediate trajectory
Age-verification rules and online safety bills are starting to collide with basic internet access
A single deal makes frontier compute capacity look more like industrial supply than software spend
Memory and storage have become strategic inputs in frontier-model partnerships
Groq’s raise shows how much capital still chases low-latency inference infrastructure
A $90M restatement is the kind of footnote that can reset trust in a defense contractor
The pager and radio blasts remain a live case study in how supply-chain compromise can kill
Quantum research is now being treated as a foreign-intelligence target
A giant knockout database gives genetics a rare scale advantage over anecdote
A handheld tremor-correcting pen is the kind of low-tech assistive device that can matter immediately
Model serving is becoming a systems problem of graphs, placement, and batching rather than just APIs
Robots can now be trained from ordinary human video instead of expensive bespoke demos
Companies with tight feedback loops spend less time on theater and more time shipping
Good design is mostly about covering the ugly states nobody wants to draw
Performance tools that are one command away get used instead of admired
A kerning tool that explains every adjustment turns typography into something measurable
Screenshots carry enough context to make prompting feel like working with a designer, not a search box
A simple physical demo makes simulation-heavy tooling legible in a way slideware never does
A local flight controller built on SHAKTI points to a real domestic drone stack
Defense products are shaped as much by adversaries as by government buyers
Constraints, not convenience, are what make ordinary activities feel meaningful
we've added unique user rankings. some models are token heavy so they skew upwards in rankings; unique people using the model is a more accurate ranking. we'll orient more of our data around this metric
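A minimal sketch of why the two rankings can diverge. The usage records, model names, and the `rank_models` helper are all hypothetical, made up for illustration; this is not the leaderboard's actual data or code.

```python
from collections import defaultdict

# Hypothetical usage log: (user_id, model, tokens). Illustrative values only.
USAGE = [
    ("u1", "model-a", 900_000),  # one heavy user dominates model-a's tokens
    ("u1", "model-a", 800_000),
    ("u2", "model-b", 20_000),
    ("u3", "model-b", 30_000),
    ("u4", "model-b", 10_000),
    ("u2", "model-c", 50_000),
    ("u5", "model-c", 40_000),
]


def rank_models(usage):
    """Return (models ranked by total tokens, models ranked by unique users)."""
    tokens = defaultdict(int)
    users = defaultdict(set)
    for user_id, model, n_tokens in usage:
        tokens[model] += n_tokens
        users[model].add(user_id)
    by_tokens = sorted(tokens, key=lambda m: tokens[m], reverse=True)
    by_users = sorted(users, key=lambda m: len(users[m]), reverse=True)
    return by_tokens, by_users


by_tokens, by_users = rank_models(USAGE)
print("by tokens:      ", by_tokens)
print("by unique users:", by_users)
```

In this toy log a single token-heavy user puts model-a first by volume, while it comes last once each person counts once.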
I noticed that Apple Notes has a UI similar to the AI chat apps, so I turned it into a Claude/ChatGPT frontend. Use any LLM API to chat inside Apple Notes
Production autoresearch is usually killed by reward hacking or side effects. But we still see a pattern that survives: the unit being evaluated is functional or near-functional code. Some examples: (1/5)
Anime.js 4.5 is out and it's a fun one: Introducing the @threejs adapter - Up to 50% less code for 3D animations - CSS transform-like API for 3D objects (rotate, skew…) - Simpler material color animations - Easy instanced mesh animation
Animated, hierarchical origin-destination commute flows in Dallas-Fort Worth, Texas. Downtowns, medical districts, airports, and suburban job hubs light up from 2023 LODES data. This technique uses http://Flowmap.gl, now available in th
introducing, the @stripe directory: from the cli, people and agents can search for, and pay, businesses on stripe. > stripe search "serverless postgres database" - payments between two stripe users are free - includes mpp and projects
got my kernel on the qr_v2 leaderboard @GPU_MODE. one thing I can say is you need to engineer for every bench shape to squeeze the most from it. I found cluster reduce helps on some shapes while it can be worse on others, and yeah, it was a nice try.
Think VLM-based OCR might finally be close to working on historic newspapers! Many models I've tried before failed (hallucinations, repetition, context overflow...) Surya OCR 2 (a 650M model!) on first inspection seems to do a very good j
1/4: A couple notes on the implementation. The async RL training itself is powered by SkyRL, with the research agent’s goal being resolving setup issues (in this case a libnuma dependency) and analyzing runs autonomously.
If access to a stronger generator model is given, the much simpler solution to improving capabilities is just to use the generator model as a teacher model for OPD. Building tasks with the stronger generator more so serves the purpose of pr
https://tol.is/paperplane I made a paper plane. I didn't want it to fly like an object but like paper. There's no rigid body. The wings bow, the body flexes under load, and it bobs as it goes, the way a real glider does. You're not flyin
"You may be using planning as a safe simulation of competence. Execution threatens that simulation because it produces evidence: errors, slowness, missing knowledge. That feels bad, so the brain protects you by escaping."
1/ Codex is quietly killing your SSD. It writes diagnostic logs to disk non-stop, even when you're not doing anything. Your SSD has a write limit. Codex is burning through it in the background. One command fixes it
noticed my codex has an exponentially growing 11.36gb rollout where the replacement history grows over the number of compactions and the larger blob gets appended to the file. diagnosed it to each compaction record embedding all attached i
Surya 2, which has 650M params and scores 83.3% on olmocr, is the most accurate small OCR model. One reason why is character tokenization. Constant compute over chars improves accuracy and model size.
Our main goal was to shrink model size while improving multilingual accuracy (matching Surya 1). We got vocab size down to ~65k, and even managed to make certain languages/document types more efficient. Constant compute means fewer rare c
The web is full of egocentric human videos. But robots can’t directly use them as demonstrations yet. Meet EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations, for zero-shot visuomotor learning without re
first 30k problems on hugging face for anyone who wants to run experiments. it's the nemotron math v2 AoPS split with solve rates for gpt-oss-120B at 2-6/8. hints generated by gpt 5.5 https:// huggingface.co/datasets/ar0ck et1/hintedselfteach
Introducing Sakana Fugu: A full multi-agent orchestration system accessible via a single model API. Our ‘Fugu Ultra’ model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls. Tr
People keep saying that software will improve for the GPU rigs over time. But what improves more, the client or the server? GB300 NVL72 is here. We have the cost per Mtok at 8:1 input:output ratio at $0.21 for an iso-interactivity of 35 t
i just tested GLM-5.2 on my rig. 753B parameter MoE. 2x RTX PRO 6000 Blackwells, Threadripper PRO 9995WX with 1TB DDR5. prefills at 64 tok/s. decode holds at 13-15. system RAM bandwidth is the bottleneck. running UD-Q4_K_XL 4-bit.
I don't see how model routers / orchestrating multiple models is a viable business model. If most inference is multi-turn agents, routing to a new model in the middle, and paying for the full context that would otherwise be KV-cached, seem
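The cost argument above can be made concrete with a rough sketch. The prices, token counts, and the `turn_input_cost` helper are hypothetical placeholders rather than any provider's real pricing; the only assumption carried over from the post is that a mid-conversation switch loses the KV cache and re-bills the whole context at the uncached rate.

```python
def turn_input_cost(context_tokens, new_tokens, price_per_mtok,
                    cached_price_per_mtok, cache_hit):
    """Input-side cost in dollars for one agent turn.

    Prior context is billed at the cached rate on a cache hit and at the
    full rate otherwise; newly added tokens are always billed at full rate.
    """
    context_rate = cached_price_per_mtok if cache_hit else price_per_mtok
    total = context_tokens * context_rate + new_tokens * price_per_mtok
    return total / 1_000_000


# Hypothetical numbers, chosen only to show the shape of the trade-off.
PRICE = 3.00      # $ per million uncached input tokens
CACHED = 0.30     # $ per million cached input tokens
CONTEXT = 100_000 # tokens already in the conversation
NEW = 2_000       # tokens added this turn

stay = turn_input_cost(CONTEXT, NEW, PRICE, CACHED, cache_hit=True)
switch = turn_input_cost(CONTEXT, NEW, PRICE, CACHED, cache_hit=False)

print(f"stay on the same model:   ${stay:.3f}")
print(f"route to a cold model:    ${switch:.3f}")
print(f"switch penalty this turn: {switch / stay:.1f}x")
```

Both models are priced identically here to isolate the cache effect; under these assumed numbers the routed-to model would have to be several times cheaper per token just to break even on that turn.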
Web crawlers are dead. This PixelRAG in the video completely skips HTML parsing. It takes screenshots of web pages directly, then lets the visual model read the answers from the pixels. In the past, AI reading web pages meant first break
IK alone can take you pretty far. I put together a really cute demo to demonstrate this. Watch this Panda robot CNC engrave the MuJoCo logo into a curved dome using mink.
a minimal version of "Fugu" using ax that recreates the visible runtime pattern: conductor, specialist workers, explicit context passing, verification and synthesis. real "Fugu" learns the conductor/router; this version prompts it. the i
Most investors buying hyperscaler data center bonds think they're underwriting the tenant. Oftentimes, they're actually underwriting a purpose-built facility they'll have to re-lease a decade from now.
(1/6) New μP paper: Under μP, GQA passes coordinate checks but fails to transfer learning rate! That contradiction isn't just a quirk of GQA. It exposes a silent failure mode in the TPV framework that the original theory simply can't dete
On Friday, we released six new state-of-the-art drafters for accelerated inference. We also put out a blog post on why spec dec is so great. Supporting that was a roofline model of speedup from speculation. Play with it in our LLM Enginee
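For a feel of where speculative decoding gains come from, here is a small sketch of the standard acceptance-rate model of speedup. It is not the roofline model the post refers to. It assumes each drafted token is accepted independently with probability `alpha`, that the drafter proposes `k` tokens per step, and that one drafter pass costs a fraction `c` of one target-model pass; all three parameter names and the example values are illustrative.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target-model verification pass.

    With i.i.d. acceptance probability alpha and k drafted tokens, this is
    the geometric sum 1 + alpha + ... + alpha**k.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha, k, c):
    """Wall-clock speedup over plain decoding.

    Each step costs k drafter passes (c each, relative to the target)
    plus one target pass that verifies all k drafts in parallel.
    """
    return expected_tokens_per_step(alpha, k) / (k * c + 1)


# Illustrative sweep: a cheap drafter (c = 0.05) at several acceptance rates.
for alpha in (0.6, 0.8, 0.9):
    for k in (2, 4, 8):
        print(f"alpha={alpha} k={k} speedup={speedup(alpha, k, 0.05):.2f}x")
```

Under this model, longer drafts only pay off when acceptance is high, since each extra drafted token adds cost `c` but contributes only `alpha**k` expected tokens.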
This is Sourcelike, a remake of Source Engine in Godot 4 complete with an asset/map importer and multiplayer.
Many didn't understand this, so here's a direct comparison. Border + shadow = static color, no contrast. Hairline over the shadow = crisp, blends in.
the metric I really care about when it comes to the economic impacts of AI is how much revenue do the labs make from things apart from selling model access
AI demand has surged so high that import prices for computers and semiconductors rose 3.6% in May, now up 14.4% year-to-year. This is so far from anything in the historical record that 'fastest ever' doesn't do justice to it. (2/4)
Built a design system with @paper + @cursor_ai . No Figma involved. I wrote everything in a design.md file - colors, spacing scale, type system, component rules. Cursor read it and generated components in Paper that actually matched. T
The opencode discord bot in 2 steps: 1. @iamdavidhill uploaded a Figma screenshot of the new opentui logo. 2. The @opencode bot read the relevant Discord attachment, implemented it in opentui, and finally uploaded a screenshot.
to be clear, this is a closed source orchestrator on top of closed source models. if before you didn't control the models, now you don't even control which ones are used or how much. this is not "AI sovereignty". i've also read the tech rep
okay i'm officially loop pilled. my current experiment + eval stack is 13 prs deep and i need to just get it merged. i told codex to create 1 thread for each pr to fix comments/lint/conflicts + another to review the changes. he spun up 26 t
I was thinking about Vector Sets and the Redis approach to this stuff in general. Now that the hype with RAG is gone, I'm 100% sure I made the right call there, saying: RAG will mostly go away, but raw vector search is a useful, fundamental
"Rename this internal environment variable" - "I changed the name but kept the old one as a fallback" - "but the original one was never committed" - "you're absolutely right".
GLM-5.2 by @Zai_org is 2nd on Game Dev Arena on Design Arena with an Elo of 1368. This is a 6 position and 29 Elo jump from GLM-5.1, putting GLM-5.2 in the same performance band as Claude Fable 5 by @Anthropic . GLM-5.2 is the top open
Trace AI SDK calls without a custom integration. AI SDK 7 adds 𝚊𝚒:𝚝𝚎𝚕𝚎𝚖𝚎𝚝𝚛𝚢, a Node.js tracing channel for observability providers to follow model calls and tool executions.
create animated text shimmers w/ background-clip .txt { color: #0000; background: var(--gradient) 0 0 / 300% 100%; /* after the shorthand, which would otherwise reset it */ background-clip: text; animation: shim 2s infinite; } @keyframes shim { 0% { background-position: 100% 0; } }
built something that saves me a stupid amount of time now. Shelf is a tiny browser extension for Safari and Chrome that saves links to Telegram in one click, for those who keep everything in Telegram chats. go try it http://useshelf.dev
Was chatting with friends across different AI companies about the classic algorithm vs infra debate. One thing people often miss is survivorship bias. The algorithm researchers you see are usually the ones with strong papers, strong projec
loop engineering playing into my name really well lol. eat your lopo loops fam. seriously though loop engineering is finding ways to invoke your agents with cron, event-based triggers, and tail calls. glhf
>if the authors wanted to just form a “RL environments startup” they could probably sell it for millions of dollars wrong; the recipe, like most synthetic RL env papers, relies on a strong generator and is hence not useful for frontier wor
Our work on multi-agent teamwork is accepted to #ICML2026! When agents operate beyond fixed workflows, teamwork is critical. We show that frontier LLMs often make poor teammates—not because they're unhelpful, but because they're too accomm
What absolute nonsense. The paper doesn't claim that gravity must be quantised. It discusses a theory based on expectation values (mean field gravity, sometimes called "semi-classical gravity"). 1/2
no one asked, but one of the more damaging self-owns openai did to internal culture and safety in particular was framing a bunch of things that were obviously handed to the press by execs or comms as "leaks"
Last April I wrote about the idea of an "adaptation buffer"—the window between when cutting-edge AI can do something dangerous and when that capability is widely available. You use the buffer to prepare & defend. In unrelated news, apparen
we made an interactive movie in a day - powered by a world model - running in real time - you can explore and make your own choices. this is Operation Pandora. play now
I definitely think papers like Tmax are interesting! I just want to clarify that one cannot derive that it is easy to create synthetic tasks for frontier models. Most pipelines that labs and vendors are building are much more complex than w
My family is donating another $400,000 to the Zig Software Foundation. Zig is exceptional software. I use AI every day. Zig has one of the strongest anti-AI policies in open source. We disagree on some things, but respect doesn’t require ag
wrote about it briefly. a sleep apnea (cpap) test can change your life to the point where you end up waking up with 30-50% more energy... If you snore (or someone you care about does), getting tested is a must
> someone deployed a contract on blockchain > didn't advertise it to any users > a random mev bot starts interacting with it > gets drained > looking for legal action whose fault is it though?
i haven’t written code by hand for 6 months
Deep Agents v0.6 feature spotlight: a code interpreter. Agents can now call tools from inside a runtime, keep intermediate results out of model context, and only pass the relevant output back. Fewer round trips. Less token waste. https:
The HYVIV Index, the first $HYPE (Hyperliquid) volatility index, is now live. The 14-day tenor implied volatility index launched today, with more indices coming soon. The ticker is $HYVIV. Hyperliquid. Volmex charts: https:// charts.
Pretty interesting that so many different companies have basically converged to similar problems ~ the same time: 1) Frontier models are expensive. What parts of our stack can be offloaded to a dumber/faster/cheaper model (canonical exampl
Router models are great, but one thing that no one realizes is that it's insanely hard to maintain frontier performance with them. All labs are releasing new models at breakneck pace, this means that you have to train router weights, collec
LLMs have a way of doing technical writing where I can read a full report, nod along, and then walk away without retaining any new information