Agentic benchmarks are riddled with defects (x.com)
Fixing 31% of Terminal-Bench tasks moved every model’s score by 6–12 points, showing that benchmark maintenance can look like model progress
Kept the Anthropic/Fable launch mostly out of the slate in favor of benchmark reliability, security, hardware, biology, robotics, policy, and market-structure items; several strong AI-infra candidates were left out to avoid an agent roundup.
Metal tensors, TensorOps, and Core AI custom ops expose lower-level primitives for running modern AI workloads efficiently on Apple hardware
Letting autonomous agents access Deno Deploy required a custom Go firewall because existing security tools could not handle tunneling, Postgres, and Kubernetes workflows
Battery recycling has become large enough that Redwood Materials claims to be the largest cobalt producer in the United States
The new distributed key generation protocol improves previous async DKG tradeoffs without requiring a silent setup
Rhaister predicts drug phenotypes in new contexts while Emerald Bay supplies long time-course measurements across thousands of drug–cell-line interactions
A US Navy surface drone located and rescued two helicopter crew members at sea, an apparent first for the military
A software company running more than ten major products with about 700 people is reporting revenue per employee above Google and Meta
Standardized chiplet packaging needs routing and interconnect tools before customers can mix and match off-the-shelf chiplets
Robot learning cannot scrape the web for motor torques, tactile forces, and physical trajectories, making data collection the limiting system problem
Merging Q, K, and V projections challenges a core transformer assumption and could reduce memory pressure in long-context models
LCLMs compress token context into latent vectors and claim a better latency–accuracy frontier for long-context inference
Apple is adding lightweight persistent Linux environments with home directories and repositories automatically mounted on the Mac
Linking US federal drinking water investments to Medicare records finds lower mortality among older Americans and a benefit-cost ratio above 20
Rent-adjusted wages now punish many non-college workers for living in high-density places while college-educated workers still receive an urban premium
Client-side scanning would require software on personal devices to inspect screens and content before encryption protects them
Costco copies best-in-class retail norms but pushes pricing, product selection, and employee retention far enough to create unusual outcomes
Lovable reports 50M projects built, 720M monthly visits to generated apps, and 80% of builders coming from non-technical backgrounds
Morpho’s token round, co-led by Paradigm, a16z crypto, and Ribbit, frames onchain credit as infrastructure rather than another lending wrapper
Editor’s note: no crypto
ChangXin Memory Technologies has been filing on 3D DRAM since at least 2022, underscoring China’s push into next-generation memory
Existing power futures do not hedge the electricity risk profile created by AI datacenters, which changes both consumption scale and location needs
The paper argues that learned PDE solvers become more cost-effective as the underlying PDE task gets harder
North Mini Code is a small agentic coding model with 30B total parameters, 3B active parameters, 256K context, OpenCode compatibility, and Apache 2.0 weights
The package campaign is now tracked across hundreds of npm and PyPI artifacts, including newer PyPI samples aimed at bioinformatics and MCP developers
The list surfaces small, playful web artifacts including digital flowers, tweet destruction, Kanye album guessing, friendship tracking, design books, and NYC stoops
PenPen merges drawing, animation, and code into a freely tryable tool from Etter Studio and Studio Feixen
The suggested split uses XGBoost for five-minute-and-below forecasts and ridge regression for 15-minute-and-above horizons, with nonparametric feature transforms in between
As dozens of companies rebrand as sandbox providers, the underlying distinction between real isolated execution environments and adjacent tooling matters for buyers
A proposed curriculum for the singularity age pairs heavy AI use with Lindy skills like virtue, meditation, fitness, music, and social competence
A post’s reach depends not just on semantic meaning but on how instantly legible, relevant, and interpretable it is to many different readers
Reposting because this is still one of the most misunderstood topics in baseball. Velocity and spin rate didn’t magically explode. The technology changed. The old JUGS guns read slower than Stalker. Stalker read differently than Stalker P
Automatic research from mathematics to AI research: We transfer the ScaleAutoResearch pipeline, which improves a 32-year-old Ramsey number bound, to the NanoGPT Speedrun optimizer track, using Claude Code and Codex with only 1–2 A40 nodes.
I'd like to add to the momentum discourse that nesterov-momentum can be seen as a form of PSGD w criteria c(m; g) = <m, m> - <m, g>, which has minima m = E[g]. The application of m to g is an affine lie-group preconditioner, P = [[m, c], [0
standard softmax attention takes a convex combination of values in context, but parallax lets you extrapolate beyond them. e.g. if you have key-value pairs (1, 1), (2, 2), (3, 3) and a query q=4, softmax attention will output something li
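The convex-combination half of that claim can be checked numerically. A minimal sketch, assuming scalar keys and queries scored by a plain product; the scoring rule and the helper name `softmax_attention` are illustrative assumptions, since the post does not specify them, and this does not implement parallax:

```python
import math

def softmax_attention(query, keys, values):
    """Scalar softmax attention: weights are softmax(query * key)."""
    scores = [query * k for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, values)), weights

# Key-value pairs (1, 1), (2, 2), (3, 3) and query q = 4, as in the post.
keys = [1.0, 2.0, 3.0]
values = [1.0, 2.0, 3.0]
output, weights = softmax_attention(4.0, keys, values)

# Weights are non-negative and sum to 1, so the output
# cannot leave the interval [min(values), max(values)].
print(f"weights={weights}, output={output:.3f}")
```

The output lands just below 3, the largest value in context, and making the query larger only pushes it closer to 3, never past it.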
Mythos is live! so excited to have our FrontierCode recognized as the next frontier coding bench. on FC Diamond, BOTH Opus 4.8 and GPT 5.5 don't meaningfully scale with effort, which many of you caught yesterday. Mythos/Fable posttraining
“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam
today; the goal is to reduce the time of training for this by 10x. This took 27.8 minutes to train for 200M steps at 120k steps per second. reproduction from mujoco playground using mjwarp
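The quoted figures are self-consistent, which a quick check shows. This sketch assumes "200m" means 200 million steps and that the throughput was constant for the whole run; the variable names are illustrative:

```python
total_steps = 200_000_000    # "200m steps" from the post (assumed to mean 200 million)
steps_per_second = 120_000   # "120k steps per second" from the post

minutes = total_steps / steps_per_second / 60
target_minutes = minutes / 10  # the stated 10x reduction goal

print(f"{minutes:.1f} min now, {target_minutes:.1f} min at 10x")
```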
Well, $SPCX is $2T, raising ~$75B. That is a key detail here. 120x would be a $9T book, a third of US GDP. 2x on a deal this size is already bigger dollar demand than most of that list at their headline multiples.
AFAIK all of Meta, GDM, MAI, Amazon (heh), xai use Claude. I believe Nvidia too but unsure? And i don't know but i would guess that the four big Chinese ones do too? And all of these have some sort of agi/frontier push effort. Also unclea
Took 1 week for Kalshi Perps to get to $1B, and we haven't even launched publicly yet. Prediction markets took 3.5 years to get to $1B.
I 3D scanned my stump and my ax and my wood and recorded my ax motions and splitting sounds and made it into a super satisfying firewood splitting simulator (vibecoded with Antigravity/Claude in threejs)
adding another batch to the list:
- @RekaAILabs (@fboucheros, @artetxem)
- @arcee_ai (@goodhunt, @latkins, @code_star)
- @essential_ai (@ashVaswani)
- @ZyphraAI (@BerenMillidge)
- @MirendilAI (@HarshMeh1a, @bneys
the new “see your future self” feature is too good: nearly a 1/5 hit rate of users subscribing + purchasing the future self feature
how Flex maximizes borrower fun and lender protection
1. partial liquidations
2. dynamic liquidation fees
3. atomic bad debt resolution
(1) Trove goes above max LTV (e.g. 90%) --> it can only be liquidated back to the safe LTV (e.g. 80%) (
NEWS: Pat McAfee, ESPN negotiating $60M per year deal, The Athletic has learned.
the most humanity-aligned business model for a frontier lab is recursive self-commoditization keep the big models closed, enable distillation into open models from the same family, capture value on both sides both openai and google could
Is MIP-131 (request of 150M $MORPHO ) related to the $175M raise? Would imply $1.16 per token
"under the right conditions" they pay property taxes... Unfortunately, that is NOT what is happening. Many of these boondoggle projects are getting 20-30 year abatements.
Past theory used ODEs or local bounds that only apply once you arrive at the edge. To solve the global problem, we introduce the “edge coupling,” a discrete generating function imported from Hamiltonian mechanics whose criticality condition
I'm confused, how are 1M context models not already extreme BPTT? I think the main difficulty is not in training, but in infra to handle the memory issue.
Here’s a pretty weird and surprising result - retrieval-augmented generation works unreasonably well for robot learning – but only when parameterized using difference vectors! We introduce Difference-Aware Retrieval Policies for Imitation
Differencing this condition and invoking a conservation law argument, we can show mathematically that sharpness rises to 2/lr. This means the cause of EoS is an inescapable artifact of discrete optimization, not a property of neural network
Even a modest increase in basic one-shot capability may produce huge gains in asymptotic capability, defined as what the AI can achieve in long agentic workflows with unlimited token budget. +1 SD in human g might not be obvious in a shor
We just crossed $500,000,000 ARR at Lovable but behind those numbers are:
- 50,000,000 projects built
- 200,000 new ones every single day and
- 720,000,000 monthly visits to the apps that didn't exist before
The strongest predictor of w
wrong. it's simulators
i denied an 8% raise and promotion at work and they countered with 16.67% raise. holy hell being a bitch actually worked
I joined anthropic a ~month ago and have written ~no code myself. I went from getting quite frustrated with coding agents even 6 months ago and giving up and writing some of the code myself to a big part of my role now being agent managemen
Gradient descent on neural networks frequently drives the sharpest Hessian eigenvalue to exactly 2/learning_rate. This is the Edge of Stability. For five years, ML theory has failed to explain why this happens globally from any initializati
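The 2/learning_rate threshold itself is a textbook property of gradient descent on a quadratic, and a one-dimensional example makes it concrete. This is a minimal sketch of that classical stability bound only, not of the thread's global argument; the function name `run_gd` and the chosen constants are illustrative assumptions:

```python
def run_gd(sharpness, lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = 0.5 * sharpness * x**2; returns |x| after `steps`."""
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # the gradient of f is sharpness * x
    return abs(x)

lr = 0.1
threshold = 2 / lr  # curvature at which the per-step multiplier reaches -1

below = run_gd(0.9 * threshold, lr)  # per-step multiplier is about -0.8: shrinks
above = run_gd(1.1 * threshold, lr)  # per-step multiplier is about -1.2: grows

print(f"below threshold: {below:.2e}, above threshold: {above:.2e}")
```

Each step multiplies x by (1 - lr * sharpness), so the iterates shrink only while the curvature stays under 2/lr. On a quadratic the curvature is fixed; the surprising part described in the post is that neural network training drives the sharpness up to this boundary.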
Just got my hands on the sell-side research for the Anthropic IPO Honestly, 40x EBITDA isn't a bad price for such a fast-growing business!
A Turkish app studio, now based in UAE, makes $200M+/year. Their most successful app does one thing: clean your phone storage. Simple idea. Obvious niche. Massive market. Codeway. 25 apps. 12 clearing $100K/mo. Phone cleanup. Language le
Furman on property tax bias in NYC multifamily:
• Rentals pay 3.7% tax rate, while single family is 0.7%
• Rentals account for a quarter of market value but 43% of taxes
• DOF applies “expense caps” forcing tax payment for buildings with l
A head is a kind of eyeball, being able to point the eyeball allows you to reduce the computational processing load that you need to run on pixels. You have a fovea, and your eyes dart along as you read. Makes it easier to train in simulati
Lovable says it has hit $500M in annualized revenue, with 1 million new projects a week
The trick when you catch Russian hackers re-selling your LLM tokens isn't to stop them / disable their account. It's to start surreptitiously serving them Llama 8B tokens. Much worse to get bad tokens (sometimes!) than no tokens at all
OpenRouter is the #3 most-purchased service by AgentCard, after Amazon and OpenAI
Early-stage startups should generally be wary of hiring from BigCos. They're optimized for a completely different environment where everything is in order. Clear hierarchies, defined roles, and a reactive culture. Early-stage startups are
recreated @emilkowalski's linear dithered effect w one prompt. feels so good! but still has a way to go before it gets the details right
fable not sandbagging on showing 15% compute multiplier wins on the nanogpt speedrun for a new muon variant is the new you know its over when a gdm paper makes it past internal review to hit arxiv
Systems used to manage people now apply to the AI workforce: identity, role, access, budget, performance. Each AI coworker has a name, a mandate, a manager, and the tools to carry work from request to resolution. The existing workforce mo
sometimes gem(m)s come from remembering random tweet from months ago?
nice! on the auto advance event, consider pushing the card fully off the stack. That way, it can slide under the stack on reentry without the 'teleporting' effect you see here. Always a tricky detail to get right
Captivated by @phutrick's Faculty Meetings Theory of AGI, which posits that as AGI simultaneously erodes one's sense of personal control and drives down day-to-day material frictions, the amount of time and energy to put towards exercisin
I reworked tool approvals in AI SDK 7 so you can combine any tool approval strategy with any tool (approvals are independent of tool definition)
Winston initially said Harvey only sells to Big 4 legal and tax teams but in fact they basically sell to all practices including corp fin and advisory (as stated in their own press release). Then the argument became that Big 4 teams often
$8M ARR, 3X YoY Growth AND Profitable! I still remember 4 years ago sitting in one of @sjs_day1's sessions at @_surgeahead on the power of AND answers in startups. The message was simple and powerful - You should try very hard to fi
In a field near Wicken, Cambridgeshire larkspur is grown for London florists along with cornflowers & Nigella. It’s quite a sight just now-some floral dopamine for you :
input vs output we are still pretty far away lol (red are generated collision meshes, I presume the 90 degree flip is just a bug but even then it's not great). I wish the best of luck to all the companies in this incredibly difficult (but
next wave of businesses won’t be B2B or B2C. They’ll be C2C: Claude to Claude
If you see someone from OpenAI / Anthropic posting on X a lot recently, they're probably leaving their job.
Diffusion is differentiable. LLMs aren't. So why is the diffusion community copying RL methods (GRPO etc.) from LLMs? The native post-training for diffusion is gradient descent such as ReFL and LeapAlign. Paper: http:// arxiv.org/abs/260
Institutional investors are aggressively shorting the Japanese Yen even as Japan is attempting to intervene: Combined Yen short positions held by leveraged funds and asset managers are up to -$11 billion, the highest since July 2024. We h
Want to work on a datacenter with 1k GPUs? I'm hiring people to write code for our small 500 kW cluster (~3 MW in a couple of years). You'll work with things like SLURM, PyTorch, NFS, Linux, miniray, minikeyvalue.
Tianle "crazy horse" Yu's 240lx final project --- playing music on 5 hard drives, using microphones + auto tuning, custom power, bare metal hand-written drivers for everything. Glad I'm not getting graded on this curve....
Not all Senior SWE comp is created equal. Equity can range from nothing to 62% of your TC depending on the company and incentive structure. We mapped out 10 Bay Area companies using real offer data from the last 6 months. Late stage priva
Rohan should have titled this “A trip down memory lane”- but he can be forgiven. Lots of folks conflating continual learning with RL optimization; the latter may be necessary but likely insufficient. Here he reviews the different veins of
Excited to see how Fable 5 performs on ARC-AGI-3, and here is my take: 1) The capability jumps on agentic benchmarks often come from the external shell, e.g., a self-referential, code-as-harness scaffold (Darwin Gödel Machine: freeze the m
Training VLMs to use vision-only inputs to play games is not just limited to Anthropic. We showed this was possible using Qwen3-VL-Instruct-8b prior to Fable 5 beating pokemon firered. It is great to see a scaled up version in the latest
incredibly obvious to support this and happy to sign this letter. The BRCA explained: money transmission law assumes you hold customer funds. it requires things only a custodian can do: freeze assets, file reports, return money. applied
did you test it? im guessing it performs really poorly it does for our bot (as did opus 4.7 and 4.8)
While we’re at it, Diamond Aircraft is still owned by Wanfeng Aviation Industry (Chinese)