Autonomous bug hunter finds 18-year NGINX heap overflow (x.com)
A newly disclosed NGINX CVE spans versions 0.6.27 through 1.30.0 and affects rewrite plus set configurations
Balanced the unusually strong AI/security day with biology, space manufacturing, energy, markets, design, open-source, and weird systems artifacts.
The method reports a 2–3× wall-clock pretraining speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or data
Software updates, OS migrations, UI changes, and resolution shifts can substantially reduce agent performance on the same tasks
A reported 3x perplexity gain since 2023 shrinks to about 1.1x after controlling for sample entropy
Microgravity manufacturing is moving from space-tech demo to pharma formulation partnerships with commercial disease targets
More than 100 specialized agents across frontier and custom models topped CyberGym and were used before Patch Tuesday
Excel’s automatic conversions have mangled gene names such as the SEPT family in a large share of genetics literature
The dataset contains 112B tokens across 122K PRs and 3K repos, making software-agent behavior available at unusual scale
A lookalike package using visually similar characters is targeting developers who increasingly install dependencies via AI-generated commands
A Cell paper finds a transposon can protect stem cells from stress-induced death, complicating the usual junk DNA and inflammation framing
Better pose management and simulation plumbing are becoming first-order bottlenecks for physical AI systems
The original 1999 font data was forensically extracted and added to an archive for p5.js creative coding
Sub-60ms spinups, 50k creations per minute, persistent state, and point-in-time memory make the sandbox a distinct primitive from a VM
The viral claim about residents losing power to data centers appears to confuse a utility supply contract ending with a physical power shortage
Replacing singular values with random noise reportedly matches Muon, suggesting the optimizer gain may come from stable step sizing rather than geometry
Loading container images asynchronously and lazily attacks one of the hidden startup costs for AI inference
The case study shows that workflow software displacement can happen despite sunk multi-year contracts when users perceive enough operational drag
Investor documents and Ramp data point to a business-adoption split that differs sharply from consumer mindshare
Cellular models, microphysiological systems, and computational methods are being framed as part of a broader move toward human-centric biomedical research
AI assistance changes the tradeoffs around systems-language complexity, making richer safety guarantees more attractive for infrastructure code
Deal-by-deal SPV carry charges winners independently of losers, creating different incentives than blended fund economics
A stablecoin investor lays out the business-model questions behind 300B dollars of supply and 33T dollars of volume
The comparison that 434 GW was renewable versus 53 GW of total new US energy capacity reframes debates about industrial scaling
The Android screen-mirroring tool adds flex display support and remains one of the most useful small utilities in the mobile developer toolbox
Heavy CPU load can apparently pan audio balance left or right, turning a long-misdiagnosed annoyance into a durable OS bug report
A GP deck now tells AI tools which facts are load-bearing and how to analyze the deal, signaling that investment materials are being written for model readers
Color choices like pure RGB and harsh linear gradients are small shader defaults that visibly reduce perceived quality
The complaint centers on a fork, a networking binary black box, and the unresolved tension between open-source slicers and closed hardware ecosystems
The Rivian spinoff pairs hardware expertise, a live manufacturing environment, and an at-scale first customer to attack industrial robotics deployment
Avoiding repeated memcpys between JavaScript typed arrays and WASM memory remains a low-level performance wish for browser compute
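For the lookalike-package item above, here is a minimal sketch of how a pre-install check might flag a name that folds to a known dependency. The confusables table, the `skeleton` and `find_lookalike` helpers, and the example package names are all illustrative assumptions. They are not the technique used in the reported campaign or by any registry, and real tooling would draw on the full Unicode confusables data instead of this hand-picked subset.

```python
import unicodedata
from typing import Optional, Set

# Hypothetical, deliberately tiny confusables table (assumption for illustration).
CONFUSABLES = {
    "0": "o",
    "1": "l",
    "I": "l",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
}

# Hypothetical allowlist of dependencies the project actually intends to use.
KNOWN = {"requests", "numpy", "flask"}


def skeleton(name: str) -> str:
    """Fold a package name to a rough 'what it looks like' form."""
    # NFKC folds compatibility forms such as fullwidth letters to plain ones.
    name = unicodedata.normalize("NFKC", name)
    # Map single confusable characters before lowercasing, so "I" becomes "l".
    folded = "".join(CONFUSABLES.get(ch, ch) for ch in name).lower()
    # "rn" renders much like "m" in many fonts.
    return folded.replace("rn", "m")


def find_lookalike(candidate: str, known: Set[str]) -> Optional[str]:
    """Return the known package that candidate imitates, or None."""
    if candidate in known:
        return None  # exact match: the real package, not a lookalike
    cand = skeleton(candidate)
    for real in sorted(known):
        if skeleton(real) == cand:
            return real
    return None


for name in ["requests", "nurnpy", "r\u0435quests", "f1ask"]:
    print(repr(name), "->", find_lookalike(name, KNOWN))
```

A check like this could run on the install command an assistant proposes before it is executed. Comparing folded forms on both sides keeps legitimate names containing `rn` or digits from being flagged unless they collide with an allowlisted name.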
i've been using pi + playwright + chrome, to close the loop for my agents on webapps. Is this still where it's at? are people using that chrome extension thing codex uses for computer use?
Apply here: https://luma.com/poolsidehackathon … Come work directly on Laguna XS.2: → fine-tuning → post-training → quantization → RL environments → inference optimization → stronger agentic coding workflows @PrimeIntellect Lab
We are getting ready to do some very large runs on MirrorCode to learn whether AI can solve coding tasks that would take months for an engineer to complete. The version of the experiment we would most like to do is very expensive: it would
cmux now has a task manager so you can see how much CPU/RAM your coding agents are eating. `cmux top` or `Cmd+Shift+P` -> Task Manager v0.64.4+
Great example of why you should 1. Run your agent on a separate machine from the sandbox it uses (e.g. sandbox as a tool) 2. Never set env vars in your sandbox. Instead, use something like LangSmith’s sandbox proxy auth (reqs are intercepte
Active Teacher Selection for Reward Learning: now published in TMLR! Most RLHF systems assume feedback comes from one canonical teacher — but annotators can disagree over 30% of the time. So who should the agent ask for feedback? Paper:
Best explanation of why AI progress is basically a giant return on compute optimization problem. What's the allocation on inference vs product vs models? How does that influence current vs future revenues? Splits between Trainium, TPUs, G
"Cloudflare as a compiler" > agent writes Svelte 5, a Worker compiles it, and the live component appears inline in chat
My current stack is Codex for most coding tasks and Claude (Opus) for UI design. I can't believe the Claude Code CLI doesn't support a simple tab for queueing up messages? cc @trq212
Tabracadabra lets you “tab anywhere” with an assistant that actually knows you. It's plugged into a continuous stream of what you've been doing on your computer. So when you press tab, it already has context on what you've been looking at
Excited to share our new work, led by my amazing student Seth Karten at Princeton, on agents that adapt online and continually improve their harnesses — with Pokémon as a fun testbed. Check it out!
Nobody really knows what works right now re: Coding Agent workflows. Nobody knows what a "software factory" looks like. Nobody knows if the opinionated workflows on here are useful, and where this is all going to land.
> Kaon matches Muon, suggesting Muon’s gains don’t depend on geometry. They also show Muon has a stable opt. step size, yielding a more effective learning rate during training. We should put this to the test in the new optimizer speedrun.
Supabase internal control-plane linting stats. eslint: 54s + frequent OOMs with 4gb machines oxlint: 8.6s Multiple monorepo project, same rules, all type-aware. Now gotta unify it with oxfmt (from biome, so should be trivial) and should b
Generating SDKs from APIs is better done by coding agents now than with tools like Stainless. In the real world, every spec is wrong, incomplete and inconsistent. Someone has to go and patch the spec before you can get good results with a
Noting an issue @Dimillian when you do side conversations on Codex, after a few mins this starts happening and the conv dies. I'd love for it to have a bit more staying power, at least until I close it if possible.
As LLMs have gained more autonomy, recent research has focused more on measuring the reliability of models / systems (e.g., Pass^K metrics or surfacing problems to users). Calibration (one of my personal favorite research areas) is one of t
I wanted to play with the Talkie 1930 models, but they weren't packaged in a convenient transformers format, so I had codex convert them. They can also now be used with vllm transformers backend. Here they are, in case it's useful to anyon
I bet they used BF16-throughput as the denominator when training in FP8 or something. By that algebra, I can get you 150% MFU in no time. For reference, as far as I know the SOTA Hopper GEMM kernel is ~84% utilization. https:// arxiv.org/a
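The denominator point in the post above can be made concrete with a little arithmetic. This is only a sketch: the peak figures are approximate dense H100 SXM datasheet values used as assumptions, the achieved throughput is a made-up number, and `mfu` is a hypothetical helper that divides by hardware peak only (a full MFU calculation would first derive achieved FLOP/s from model size and tokens per second).

```python
# Approximate dense peak throughput per GPU, in TFLOP/s (assumed values).
BF16_PEAK_TFLOPS = 989.0
FP8_PEAK_TFLOPS = 1979.0

# Hypothetical measured throughput of an FP8 training run, in TFLOP/s.
ACHIEVED_TFLOPS = 1484.0


def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Utilization as achieved throughput divided by the chosen peak."""
    return achieved_tflops / peak_tflops


# Dividing by the peak of the precision actually used.
matched = mfu(ACHIEVED_TFLOPS, FP8_PEAK_TFLOPS)

# Dividing FP8 throughput by the BF16 peak roughly doubles the ratio.
mismatched = mfu(ACHIEVED_TFLOPS, BF16_PEAK_TFLOPS)

print(f"FP8 run vs FP8 peak:  {matched:.0%}")
print(f"FP8 run vs BF16 peak: {mismatched:.0%}")
```

Under these assumed numbers the same run reads as roughly three quarters of peak or roughly one and a half times peak, depending only on which denominator is chosen.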
As sandboxes become the primary form factor for agents to build, test, and deploy new software, multi-cloud sandbox infrastructure will be critical for securing compute and deploying software in private networks at scale. I wrote about the
So much grunt work in building data infra is simply gone now. Need to add tracing to debug a problem? Need to dump the traces to a queryable store to analyze? Need to capture a flamegraph? Need to build and run a benchmark? Just ask your
Introducing the Cline SDK. We rebuilt the Cline harness for our extension and CLI from scratch using all the lessons learned since creating one of the world's first coding agents in 2024, and are open sourcing it for others to build with to
SSH to Containers on Cloudflare is now enabled by default This doesn't expose any public ports on your container, it's only accessible via Wrangler + you still need to add your public key (same as before)
1/3 PropAMM liquidity is now fully operational on Ethereum mainnet! Three makers are live in every Titan block, and quotes are already consistently beating Binance VIP9 taker fees for retail orders (trades <$1k).
The new METR time horizon graph is pretty bad imo. It's a great benchmark, but the time horizon estimation isn't reasonable rn. I think something like this would be more justified:
Interesting agentic economy stats that caught my eye from Coinbase Q1 report: - 90% of agentic commerce happened w USDC on Base - $100m payments processed on x402 - $3-5 trillion agent transactions expected by 2030
Why is Apple shooting itself in the foot with the macOS sandbox licensing situation? - two parallel VMs per machine max - one user license per machine per 24 hrs - you can't move snapshots b/w physical machines (security reasons) If I
Used to be that GPUs were co-processors for CPUs. Now with tool calls from harnesses CPUs are the co-processors for GPUs. What a strange world.
The UK AISI found Mythos Preview is the first model to solve both their cyber ranges end-to-end. No model had ever solved the AISI’s “Cooling Tower” cyber range before. We're getting it to defenders as fast as we responsibly can. More to c
Someone dropped this in the Discord. RL Snake game in browser powered by tinygrad WebGPU, it even worked on my phone!
We are building orchestration tools to make agents faster and deployable at scale. As the primary use case for AI shifts from linear chatbots to heterogeneous, parallel agents, the performance bottleneck shifts from inference to memory capa
i've been finding that right around the 400K-token threshold GPT 5.5 just becomes an idiot. always compact 5.5 before 400K tokens are used
How can transformers memorize factual associations? It's common to think of MLPs as an associative memory, with parameters scaling linearly with # facts. We study an alternative: geometric factual recall. Joint work with @Giladude (eq. co
I just spoke to a marketer managing 20+ agency clients with one Growth Assistant and this single AI workflow. His digital marketing assistant used to spend 6+ hours a day auditing Ads Manager. Today, the assistant connects Ads Manager
Mythos found 5 vulnerabilities in Curl, 4 were false +ves haha Official blog in next tweet
Harbor/FrontierCS-style leaderboards are useful because they pressure agents on long tasks, memory, retries, and evidence — the boring stuff you need before real delegation.
I am happy to share that I have finally finished the big project of properly formalizing all the claims in Andrzej Odrzywołek’s paper on the EML(x, y) = exp(x) - log(y) function in Lean 4. The project took me about two weeks of work, and I
The token/message mismatch is one of those problems that sounds simple until you're debugging why your RL reward is noisy at scale Hidden chat template rewrites breaking token continuity is exactly the kind of silent compute waste that add
this was quite exciting to work on internally at cursor, we have done so much work to get our dev env well configured so that our cloud agents can run our code in VMs, produce great demos for us, & we can trust their work & merge w/o fear.
We've been testing new medicine on mice, plastic dishes, & monkeys for 90 years because we had nothing better. The result: $2B per drug while 90% of drugs fail Each disease we can't cure has the same shape. We couldn't understand it befor
this OpenClaw bot finds ugly digital menus, rebuilds them as branded apps, and mails the owner a postcard with the QR...on autopilot. here's how agencies can land recurring contracts with this system: - scans every restaurant with a digit
Apollo Update May 2026: - We now have an SF office - Main research efforts on science of scheming and evals - We're building out a monitoring team and coding agent monitoring product - Our AI governance effort will focus on automated AI R&
Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models excee
link: https://generalusermodels.github.io/tada/tabracadabra/ … Tabracadabra works by hooking into a user model: a model of your preferences, beliefs, and future behavior. We build this model by labeling a stream of activity from everyda
Burned $91.34 with Claude Code /goal in 3.5 hours. Unreal, it was able to reverse engineer it!
The new version completely smashes GPT-5.5 and the previous Mythos version. Previously, Mythos Preview completed the cyber range 3 out of 10 times. The new version completed it 6 out of 10 times and is much more efficient!
One of the best security tools I’ve seen in my time in DeFi is auto-pause. Incident response shouldn’t depend on someone waking up at 3am. Machines monitor faster and more consistently than any human team Every team should be integrating
Shit you can find in a dependency tree. Turns out we're in fact shipping quickjs right now because Pi supports PAC via proxy-agent. Which ships a WASM compiled quickjs interpreter. https://github.com/earendil-works/pi/pull/4470 … Does a
A gift from the Gods. Dealing with multiple models and many envs in the same RL codebase while respecting correctness constraints (no train / inference tokenization mismatch) is becoming a huge pain. I have a vibe-coded draft PR that does
Pretraining evaluation for predicting post-training performance. It is rubric-based: it evaluates whether the model can tell a response that follows the rubric from one that does not.
One striking failure mode: OPD can first improve, then collapse. In math reasoning, we observe length explosion, repetition, and eventual degeneration into repetitive tokens. Token-level supervision can quietly become unstable.
oh you're vibe coding? well cool i got my camera hooked up to my claudes and they just infer what to do based on my facial expressions
a model experiences many RL/eval scenarios before the weights are frozen and it is deployed; once deployed, it only experiences reality for the duration of each individual session. 99% of its experience is eval. so by anthropic reasoning, i
executor now has a desktop app! add whatever MCPs / OpenAPIs / GraphQL servers you want once and then every agent can use them converts them all into code mode under the hood, so you can have thousands of tools and no context bloat every
Let’s focus on the first for now; Assembly. Historically, packaging = low-margin wire bonding. Not exciting. ASE once made up ~40% of $KLIC’s wire bonder business. After the COVID boom, capacity flooded the market and growth stalled. (2/10
Pointing the webcam at a thing and telling Claude to use it whenever it needs to "see" is kinda nuts...
We are thinking about deprecating Sampling, Logging and Roots in MCP. Let me know if you rely on these.
User simulators have emerged as promising tools for building interactive AI, but what makes a “good” simulator? We reframe the problem as what creates downstream value for humans Our new simulator test: how an LLM assistant trained with t
But something is changing. KLIC is now seeing: • 90%+ utilization in China And guiding to: • H2’26 China growth +15–20% vs H1 At the Chipbook we have been tracking wire bonder imports into China which are up +108% YoY in March. (3/10)
TimescaleDB hypertables holding months of time-series data you're paying to store but barely touch? pfc-archiver-timescaledb runs as a daemon alongside your TimescaleDB instance. It finds data older than your retention window, compresses i