A Lean-Checked Ethereum Verification Benchmark
Lean-checked tasks turn Ethereum verification into a benchmark you can actually trust
Balanced technical research, market structure, policy, hardware, and a few durable startup/product stories; skipped most congratulatory replies and near-duplicate AI launch chatter.
Direct access to compute, metals, and other underlyings would make futures markets more useful for real hedging
The system combines exploration with test-time training in a way that could generalize beyond one paper
The paper lays out a data recipe for agentic coding that is detailed enough to evaluate and reuse
Thirty minutes of human demonstrations regularize self-play enough to produce much more human-like driving policies
The paper says long training eventually erodes a model’s ability to adapt, even in stationary pretraining setups
Environment simulation becomes the training objective instead of a post-hoc hack for agents
The benchmark measures whether models can tell they are being evaluated, a failure mode that matters for real deployments
Token-level credit assignment without a value network is a useful step toward cheaper long-horizon training
The argument is that explainability still lacks clear tasks, stakeholder framing, and evaluation standards
Streaming raw accelerometer data from a wearable turns a consumer device into a control surface
A dedicated lens path for green pixels gives Sony a meaningful resolution boost without brute force
The new slider can animate arbitrary paint expressions, which makes live cartography far more expressive
The project shows how algebraic definitions can be used to identify and fuse kernels automatically
Giving agents direct access to pipelines, connectors, and tunnels makes them operators rather than just code writers
Workflow automation inside the canvas makes model-driven design much more shareable and team-native
A humanoid robotics company going public is a sign the category has moved beyond pure demo stage
Yield from lending tokenized stocks matters more than the wrapper itself because it creates a real use case
Heavy-duty turbines are complex enough that AI-era power demand can reshape their supply chain and economics
The tweet points at a key AI-hardware assumption: memory pricing can make or break the economics
A small checkout change can materially lower the legal risk of running subscriptions at scale
The change shows how quickly a tax rule can reshape the value of a supposedly protected savings wrapper
The terms create a sharp mismatch between where the product operates and where it says disputes must be handled
It shows how a genuinely useful grassroots tool can collide with large-company risk management
The result is that disagreement itself can be informative when you ask the right follow-up question
Auto-generated docs are a force multiplier for ecosystems, and TypeScript still lacks a truly great version
The combination of revenue acceleration and a million users makes this more than a routine funding post
Treating colds and flu as solvable problems reframes respiratory infection as infrastructure, not nuisance
Useful memory has to be biased and proactive if agents are going to improve across repeated work
We’ve found an empirical law governing plasticity loss in transformer models. The surprising part: pretraining on uniform data distributions doesn’t seem to make models immune.
Your coding agent reads every web page wrong. And pays 2.5x extra to do it. pixelbrowse fixes it with one screenshot. −74% tokens, 4x faster. give Claude Code eyes. New: pixelbrowse. pip install pixelrag · http:// git
There are a few things that I look back on as my mistakes in the early days. Quake was overly ambitious technically. We could have done all the great multiplayer and modding work inside a Doom++ engine, allowing the designers to work with
"what is the project?" in codex 5 high without: 17k tokens with: 144k tokens
Two roles in the future: FDEs and Recruiters (context acquisition specialists). FDEs get permissioned access to/instrument pockets of reality into something computers can effectively search and learn over. Recruiters acquire the incremental
Assuming llama-like scaling with GQA: 2.25h², MLP: 3×3.5h²; this is a total of 12.75h² bytes in FP8 per layer, so theoretically, at hidden size=1536 we could store an entire layer in the SRAM (plus something for KV cache) of a B200; this would
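The arithmetic in that post can be checked with a short sketch. The coefficients 2.25 (attention with GQA) and 3×3.5 (MLP) are taken from the post as given; the function name and the one-byte-per-parameter FP8 assumption are illustrative choices, and no SRAM capacity figure is assumed here.

```python
def layer_bytes_fp8(hidden_size: int,
                    attn_coeff: float = 2.25,
                    mlp_coeff: float = 3 * 3.5) -> float:
    """Per-layer weight bytes, assuming 1 byte per parameter (FP8).

    attn_coeff and mlp_coeff are the h^2 multipliers quoted in the post;
    layer_bytes_fp8 itself is a hypothetical helper, not a library call.
    """
    params = (attn_coeff + mlp_coeff) * hidden_size ** 2
    bytes_per_param = 1  # FP8 stores one byte per weight
    return params * bytes_per_param


h = 1536
total = layer_bytes_fp8(h)
# 12.75 * 1536^2 bytes, reported in raw bytes and in MiB
print(f"{total:,.0f} bytes, about {total / 2**20:.1f} MiB")
# prints "30,081,024 bytes, about 28.7 MiB"
```

Whether that fits alongside a KV cache depends on the on-chip memory actually available, which the post leaves as the open question.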
PeptAI Update: VEGFR2 Results Back, De Novo Panel Complete What's New: • Assay validated: We ran a test on VEGFR2 using molecules with known binding behavior: real binders, engineered variants, and a scrambled sequence that should show no
Watch our robot repeatedly assemble a collapsible crate, even while I perturb it It's a long-horizon task that requires many subtle non-prehensile actions. This model was trained from scratch locally on my 5090 with only ~2 hours of data.
AgentWorldBench: 7-domain benchmark with ground-truth observations from real environments, constructed from 5 frontier model trajectories on 9 established benchmarks. Results: Qwen-AgentWorld-397B-A17B achieves the highest overall score (
Meet Qwen-AgentWorld — a native language world model that simulates 7 agent environments (MCP, Search, Terminal, SWE, Web, OS, Android) within a single model. Environment modeling is the training objective from day one, not a post-hoc ada
We have finished full fine-tuning of gemma-4-e2b-it for Nepali language. This model was trained on 1xA100 GPU for approximately 9 days on 2M rows of mixed SFT dataset: Nepali: himalaya-ai/nepali-sft-dataset English: teknium/OpenHermes-2.
We got approval from JFSA and issued the first bank-backed JPY stablecoin on Ethereum. On day 1, we issue $70m ish, and we will go bigger. This is an internal launch and withdrawal to Ethereum and other chains from the SBI exchange will b
The Goldfinch lesson isn’t that underwriting is hard. TradFi underwrites credit every day. The mechanics are well understood. The lesson is liability. If an underwriter controls where capital flows but only holds a small portion of the d
here's what this looks like in practice two PRs that were opened using my testing framework, the first one runs the real opencode flow, opens the auth url in the browser, and returns second one is a recording of the web ui, once you exper
CoT summaries should be in third person IMO. first-person CoT summaries are bizarre -- they introduce another first-person narrator which the user is encouraged to conflate with the assistant, but which isn't produced by or even *visible t
The teacher distribution matters a ton - just because we expect it to be "better" or "more knowledgeable" doesn't mean it's good to distill its distribution. We show the blindspots of vanilla context distillation, and how it can hurt rather
1/n Thrilled to open-source DFD (Data-Forcing Distillation) — a one-line fix that restores diversity and fidelity in DMD for few-step video distillation. No GAN, and just 100–300 finetuning steps. It improves on three video generation tasks
been trying this for a week - they seem to have forked the Chrome engine, never blocked or detected as automation - none of the flaky experiences you get with agent browsing, like alert() blocking. it just works
yesterday I was debugging a poorly-performing training run with Claude Code and I discovered that instead of training on 30 batches of data it had somehow decided to train a new model for 500 steps on each batch and then average the 30 sets
Daily update: July 1 decays to 20%, but with more volume July 31 up to 76% and August 31 at 91%. That seems overconfident to me in terms of ability to implement sufficient KYC quickly or otherwise find a solution, but I am hopeful.
While Meta was slow yesterday we had: - Single best subscription rate day - 3rd highest total new subscriber day - 100% of new subscribers take our high AOV subscription offer (never happened before) - 0% of these people paused or cancelled
At Google executives project their importance by how slow they are at approvals. A slowness doom loop. This is why I banned approvals at Vercel. We only have vetos. Want to block something? No problem, speak up. Have nothing to say or taki
the best engineering teams have been merging thousands of PRs per day in slack with @capydotai if you want the most natural slack coworker experience paired with a SOTA coding harness that works with any model AND supports external subs
Agent Zero v2.0 is out. Meet the A0 Launcher. A desktop app (Win/Mac/Linux) that manages your local and remote instances, and onboards new users by installing the container runtime for them. You don't need to know what Docker is anymore.
On-policy distillation has the same systems bottleneck as RL: rollouts dominate training time on reasoning workloads. Going async fixes throughput but feeds the learner stale-policy data, and what staleness does to OPD specifically was unst
*The Transformer Cookbook* by @pentagonalize @davidweichiang et al. A beautiful introduction to "hardcoding" algorithms inside the weights of a transformer (addition, lookup, branching) following the seminal RASP paper ( https:// srush
There are more bookstores, restaurants, wine bars, and plant shops than ever before in the US, and our libraries on average have longer opening hours. We don't have a shortage of consumption-oriented third places. What we are missing is th
- it’s pre-validated because it already was a high engagement reddit post - it contains some obvious elements of LARP and fishy numbers so people (including me) feel compelled to comment on that - it contains Big Number so it gets attention
a tiny detail we shipped recently to @AsideAI is popover notifications i kept missing agent's confirm notifications and leaving agents idle for hours so instead of just sending a notification, we show a popover when the agent needs fina
RAMTRON F-RAM. Peak logotype, logomark and name.
We find that plasticity loss happens both in nonstationary continual-learning environments and in stationary pretraining-like setups. This means that if you pretrain long enough your LLM will eventually lose the ability to adapt to new
When you build an LLM agent team (planner, executor, verifier, ...), cost & accuracy stop being properties of one model. Instead, they depend on which model fills which role, and where it's hosted. The right choices can lift accuracy by up
Because AI is agents now, not chatbots. Agents use tools to ground results, and you can RL them till they do it correctly. Reward hacking is the new hallucination problem.
Most companies think their design problem is a designer problem. Most often the ceiling on design inside most organizations isn't set by the design team. It's set by how much leadership understands, invests in, and connects design to wh
i think this is the reason why the claude desktop app is behind codex, and it's totally fine, i think slack is actually a great way for "teams of humans" to interact with "teams of agents"
not quite 2D not quite 3D
built a complex version of a tesseract with superellipsoids rippling cube to sphere in a 7x7x7 lattice #p5js
A year later and Deel has their own stablecoin (DLUSD), wallet, and yield product (w/ a card soon to follow).... the BYC (Bank Your Customer) era is officially underway
Everyone's racing to build "AI scientists." So we asked a blunt question: Can today's best coding agents beat the published SOTA of real Nature papers — on their own, no web search, with the original method hidden? Introducing NatureBench
Benchmarks are everything! As evals make more tasks measurable, the difference between a $1,000/hour and a $3,000/hour partner will be a proven benchmark metric.
tech adoption usually starts ugly: spreadsheet held together by fear. operator who knows the weird exception. customer who refuses to wait for the roadmap. then the tool eats one task. then one handoff. then one budget line. that is the
this is actually insane, am I going to become a case study for how AI spend can get out of control https://brainrot.nicolasbrillante.com is what cost $80k btw.
How my /goals look atm. The process ultimately spits out the Intent with relevant docs made in the shaping process. I /copy it and start the goal with it. I run the same process over and over (manual loop) for each slice of work - trying
Testing the @insta360 Luna Ultra 12x zoom on an Alaskan Brown Bear.. I doubt they expected people to use their camera on bears
LLMs don't just hallucinate because they lack knowledge—they hallucinate because they don't know what they don't know. Existing knowledge augmentation blindly injects more data, treating every error as a knowledge gap. But overconfident wro
Algorithmic Art a 6 hour plot #plottertwitter
Look at how cool the action is on this Sykes herringbone generator. This design allows you to cut teeth down to a sharp vee (though this part has a central groove). All of these machines use a 30° helix angle on the guide. My father used
FSD 14.3.4 can dodge flying leaves, squirrels crossing the road, and random trash on the ground… but somehow completely misses a 10-gallon gas can that fell off the car in front. Drove right over it and popped it. My whole car is now soa
Benchmarks are a very important source of truth. Oftentimes, we don't really know what works best for what. BenchPLM is just the start, there are so many other benchmarks we are building internally to understand what really works and what doe
I didn’t really consider the possibility that an Anthropic customer would sue to enjoin BIS’s export control action. The legal arguments in the complaint are very strong, as you’d expect. If Anthropic was making them, they’d probably prev
it will take a while before silicon valley really internalizes that when you identify a technology literally existential for a business you really can’t sell it as a service
Your Cash ISA tax free amount is going down from 20k to 12k. Unless of course, you're over 65. Then it stays at 20k. Also over 65s can keep transferring non-cash ISAs into cash ISAs. You can't. Why is everything in British politics like
GRASS is terrible for crypto. Here's why: - The business is farming user data and selling to AI labs for nearly $100m ARR. - In return, they give you a memecoin with 0 value accrual - We have no idea where the revenue goes (assume it lines
New genre: system prompts that elicit the worst possible research taste
people are clowning on him for this post bc they don't realize how big a deal this is - Claude Code feels like I've got a pairing partner, tag feels like managing a team. I take on way more 1-off projects and parallel research tracks with t
go write 'be thorough' in a prompt and come back and tell me it's the same without it 1) this is an LLM, any content can change its nature 2) you need to trigger depth in some scenarios. there are many ways to do it, 'thorough' is one http
This one is for builders who want an agent that can operate their computer. Today, we're releasing HoloDesktop CLI. Powered by Holo3 models, it brings H Agent directly into the agent harnesses you already use, including @Claude Code, @
Periodic reminder that you cannot breed out a polygenic trait and the generational response to selection is extremely slow. The Nazis sterilized or killed nearly all schizophrenics in Germany (a far more heritable phenotype than criminality) and
Related: Bytedance's viral makeup filter back then was a revelation. No mesh, no traditional occlusion, just pure GAN I believe for AR use-cases, we should directly go from stereoscopic view to augmented view, no polygons in-between
New research from our lab: do the "experts" inside a frontier Mixture-of-Experts model form real, separable modules, a math expert, an Arabic expert, a code expert? The tempting assumption: If experts specialize into clean modules you coul
LOVE the direct quote