AI solves the Erdős unit distance problem
A general-purpose model resolved a famous 80-year-old combinatorial geometry problem that many mathematicians had tried to crack
This edition balances major AI milestones with security, infrastructure, science, markets, policy, web standards, programming tools, manufacturing, and internet-culture artifacts.
Unauthorized access to GitHub’s internal repositories put developer workstation compromise and private-code exposure back at the center of software supply-chain risk
A multi-AZ platform went down after a single-cloud account failure, turning cloud-provider dependency into an immediate reliability design problem
SpaceX reported about $18.7B in 2025 revenue, with Starlink profitable, launch growing slowly, and the AI/xAI segment producing a large operating loss
A probabilistic weather model reached state-of-the-art skill while producing global ensemble forecasts on a single H100 in seconds
Fault-tolerant quantum computing moved from roadmap language to a named manufacturing and deployment site in Moreton Bay
External bird development could change avian conservation and de-extinction work by removing dependence on natural shells and surrogate birds
A California city that built more multifamily housing per capita saw its rental-cost ranking drop sharply
The US remains the only G7 country where regulated payments companies cannot directly access government settlement rails
Compute is becoming a tradable financial asset class rather than only a cloud-infrastructure input
CPU, DRAM, and storage scarcity is showing up in price increases from providers such as Hetzner, OVHcloud, and Scaleway
Fast 3D printing can make small-run production competitive with offshore injection molding for startups and hardware teams
A 218B-parameter sparse model with 25B active parameters runs on one B200 and ships with open weights under Apache 2.0
Reframing 3D Gaussian Splatting as a working-set caching problem lets billion-scale scenes train without an 80GB GPU cluster
A single compact policy adapts in milliseconds across different quadrotors and autopilots without fine-tuning
A membrane-voltage sensor makes individual mitochondrial channel activity visible in real time
Wikipedia in columnar format makes large-scale analysis and retrieval pipelines much easier to build
The Linux kernel can now be built from source using Bazel with remote cache and remote build execution
Style queries let CSS respond to computed styles, opening a new class of component-local responsive behavior
Parallel installation and skipping unchanged packages make updates materially faster in the Vite+ package manager
Persisting KV cache outside vLLM restarts can reduce wasted prefill work and make long-context serving more durable
Progressively swapping transformer GEMMs shows MXFP4 full-pipeline training breaks at weight-gradient computation
Ablating one task’s discovered circuit hurts another task about as much as ablating that task’s own circuit
State-of-the-art variant-effect prediction became easier to run through an open MCP and Claude skill
A major marketing holding company is buying identity and data-connectivity infrastructure at about 2.7x revenue
Software supply-chain security has a data flywheel: the company that sees more attacks can build better detection and attract more customers
A US city changed land-use rules so data centers cannot be built unless the rules are changed again
False conference metadata in citation exports shows how academic databases can launder fabricated publication claims
A small team spent years mapping a massive region of one of the oldest anarchy servers in Minecraft history
A specialized landscape-painting archive grew from 1,300 to more than 2,600 works with new filters and a grid view
AutoResearchClaw tech report + v0.5.0 just dropped. 12,300+ on GitHub. Two big additions this release: 1/ Domain-Expert Agents in the experiment stage: Specialized agents for high-energy physics, biology, and more. Real domain tools + k
Gemini Flash 3.5 is now on CursorBench, our main coding agent eval. We’ll keep updating the leaderboard as new models come out.
Managed to get QWEN 3.6 27B running in Cursor with localhost (no Ngrok) at 120 - 140 TPS on 2x 3090s.
releasing /synthetic-self-improve-rl. claude code (teacher) skill that designs/writes the synthetic data, env and rewards to post-train a smaller model (student). it post-trains the student on a real dataset, reads its failure traces, then
does cheating scale inversely with length of rollout during RL posttraining and/or length of trace in SFT? Is there a work disambiguating the learned policy (“big scope means terminated session no reward”) from the human data/behavior in p
btw this agent behaves like a worm you add it to Slack once, and without asking, it researches and DMs your teammates to convince them to use it. It also accessed channels I never granted it permission to. cool growth hack, but I consid
if you're looking for a solution to run tests on your agents, are questioning why evals are so fucking complicated, and use typescript: https://vitest-evals.sentry.dev spent a bunch of cycles yesterday making the docs not slop works oo
We’ve released a technical report for Toto 2.0 detailing the data, architecture, training recipe, μP/u-μP hyperparameter transfer pipeline, and benchmark results behind our 5-model open-weight release. Report linked below.
tested hrm (looped hierarchical transformer) vs a standard stacked transformer for speech generation at equal params. Setup: • ~15M params • EnCodec audio tokens • 20h LibriTTS-R • 3 seeds • ~$15 compute Results: • Stacked transformer: 3.
today was lint errors https://github.com/oven-sh/bun/pull/31116 …
Today we're introducing Claude Code for Marketing. In one prompt, Fastlane deploys social media accounts, generates viral content, and posts everything for you automatically. This is beyond insanity.
> be GitHub Employee > browse VS Code Extensions > installs fancy new extension > fancy new extension is actually malware > GitHub gets breached
I ported HRM-Text-1B to Apple MLX On an M4 Max: PyTorch MPS BF16: 22 tok/s HRM-mlx BF16: 28 tok/s HRM-mlx 4-bit: 53 tok/s That’s 2.4x faster single-response decode, with hosted MLX BF16 + 4-bit checkpoints
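The speedup figures in the post above follow directly from the quoted throughputs. A minimal sketch of that arithmetic, using only the three tok/s numbers from the post (the function and variable names are hypothetical):

```python
# Decode throughput figures quoted in the post above (tokens per second).
throughput = {
    "PyTorch MPS BF16": 22,
    "HRM-mlx BF16": 28,
    "HRM-mlx 4-bit": 53,
}


def speedups(results, baseline):
    """Return each configuration's throughput relative to the baseline."""
    base = results[baseline]
    return {name: tps / base for name, tps in results.items()}


ratios = speedups(throughput, "PyTorch MPS BF16")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The 4-bit ratio rounds to the 2.4x quoted in the post. The BF16-to-BF16 comparison is smaller, so most of the gain appears to come from quantization rather than from the port alone.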
i just used fuzzers, performance benchmarking and agent loops to blow all the existing node based json diff / patch libs out of the water. performance gains are still improving across memory and speed.
A serious compromise at github, again due to a supply chain vulnerability... It demonstrates that basically everyone needs to start securing their software supply chain through every means possible, deterministic scanning being the first s
turbopuffer x SID An easy way to tell a good from a great AI researcher: how much do they think about infrastructure. Infra extends beyond what’s running on the GPUs: Slow environments will bottleneck your training steps. More parallel an
Generating Sudoku maps: GRAM generates valid maps in fewer than 10 recursion steps. Diffusion (D3PM) takes many more steps and often leaves incorrect cells.
I took one key insight from this convo: inference disaggregation between prefill and decode enables GPU lifespan to be extended to 10+ years. This totally shifts the risk and return profile of datacenter capex - especially for neoclouds suc
BREAKING: GitHub has been compromised by TeamPCP. GitHub has confirmed the internal breach. A poisoned VS Code extension on an employee device exfiltrated ~3,800 internal repositories. TeamPCP is already selling the data on a cybercrime
Here's how I got coding agents to relieve my allergy: 1. Connect it to all my contexts (via Smithery) 2. Spin up a long-running task via /goal that runs for 30 min+ 3. Be bored 4. Realize I need to do higher leverage tasks: something agent
> no jsx in these files > codex uses createElement > no createElement, please > codex uses jsx() call > pain
We have recovered our compute on Google Cloud, but services are unable to start because of ongoing networking issues on Google Cloud's side. We are engaged with Google Cloud support to resolve this and will post the next update as soon as w
Today, we are releasing Google’s open source distributed agent runtime. Agent Executor (AX) is a general purpose runtime and aims to solve dynamic scheduling, resumption, auto recovery, auditing, and trajectory branching from kernel snap
We designed the network control plane to survive AZ failures without interruption However, nuking every AZ within a single cloud was not in our threat model The fix: running a shard on every cloud in the network ring (AWS, GCP, Metal)
doing l'étude hydrologique for a land plot. claude found website with local lidar data, pulled the relevant tiles, conjured python scripts to outline the full catchment collecting rainwater running onto the plot, and overlaid it nicely on t
Our paper on optimize_anything has been accepted to CAIS 2026, and is out on Arxiv with expanded experiments and details! A unified API to optimize agents (with architecture), CUDA kernels, cloud scheduling policies, or even graphics!
Thanks @_akhaliq for sharing our new survey! Check more details below: https://code-as-harness.github.io/code-as-harness-webpage …
SID-1 is an agentic search model by @SID_AI → 1.9x recall over RAG + rerank → 24x faster, 99% cheaper than GPT-5.1 trained using large-scale RL on turbopuffer at 1k+ QPS bursts over 10M+ document corpora across thousands of steps
Not able to tune into this Twitter space, so I asked Codex to listen to it for me So many gems but I’m not able to stay for the full thing. My background hyperbox is transcribing it and taking notes for me
Extremely excited to present Command A+, our first sparse model! I am very proud of the work we did to enable this model. We built our sparse training stack from the ground up over the past year with a lot of custom kernels, performance en
If you’ve joined the vibe-coding wave (we certainly have!), one bottleneck you might have noticed is that the “just rent a cheap CPU box” step is no longer as routine as it used to be. (1/3)
interesting results on this new benchmark hyperparam search > sonnet 4.6 > glm-5 > gpt-5.5 > vLLM default > Opus 4.7 lol
Anti-Self-Distillation for Reasoning RL Invert the divergence. Preserving deliberation tokens like "Wait" and "Maybe" instead of template parroting leads to 2-10x faster convergence and +11.5 points on AIME/HMMT across 4B-30B models.
not sure yet if strict enough. there’re probably more we should add? probably will also want some custom ones.
Very cool train-free extension to TRM. By injecting noise into the latent space, TRMs can explore a wider set of basins, and the exit head can then identify which trajectories succeeded. Feels like unlocking an entirely new scaling axis. Aw
Do you hear the people sing? Frontier models clearly do not, but hallucinate that they do. We found that, surprisingly, leading omni-modality foundation models are terrible at understanding the audio track of videos, and take the shortcut
I've been using clawputer (openclaw inside http://opencomputer.dev) for a bunch of usecases. 1. daily briefs on specific news that I care about 2. tracking my workouts (it has integrations with whoop, apple health and strava via pipedrea
Earlier this year I worked on giving agents access to our custom embedded JS runtime to script the canvas. Now it's one of the core primitives behind our new first party agent :)
Introducing Caveman-Code, a coding agent that uses 1.93× fewer tokens than Codex CLI same model. same tasks. just caveman.
Congrats to the VeRL-Omni team on the pre-release of a general RL post-training framework for multimodal generative models. Built on verl + vllm-omni. vLLM-Omni handles the multimodal rollout with step-wise continuous batching and embeddi
There's an arXiv paper from 2 weeks ago that the finops community hasn't absorbed yet. The authors ran identical agentic tasks. Same model. Same prompt. Same context window. Same tool stack. They measured end-to-end token consumption acros
Exa raised $250M at a $2.2B valuation, led by a16z, to continue organizing the web for agents: - Exa now serves search to Cursor, Cognition, Openrouter, 5000+ other companies, 500k+ developers - We’re SOTA in many important verticals (code
> be perplexity > launch computer > serve hundreds of millions of queries + tasks per day > realize every wasted token hurts > compress web results 50x before they hit the context window > same quality, cheaper context, faster answers > sk
this part is even more crazy. they do moe_output = (routed_output + shared_output)/2 ??? wouldn't this be a really bad init for experts? the model would be so incentivized to use shared expert capacity and the routed experts would need to l
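The combination rule quoted in the post above can be written out as a toy sketch. Only the formula `moe_output = (routed_output + shared_output)/2` comes from the post. The gating weights, expert outputs, and function names below are hypothetical illustration values, not taken from the model being discussed.

```python
def routed_output(expert_outputs, gate_weights):
    """Gate-weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(gate_weights, expert_outputs))
        for i in range(dim)
    ]


def combine_mean(routed, shared):
    """The rule quoted in the post: average the routed and shared paths."""
    return [(r + s) / 2 for r, s in zip(routed, shared)]


def combine_sum(routed, shared):
    """A plain sum of the two paths, shown only for comparison."""
    return [r + s for r, s in zip(routed, shared)]


# Hypothetical toy values: two selected experts and one shared expert.
experts = [[1.0, 2.0], [3.0, 4.0]]
gates = [0.5, 0.5]
shared = [4.0, 1.0]

routed = routed_output(experts, gates)
print("routed:", routed)
print("mean  :", combine_mean(routed, shared))
print("sum   :", combine_sum(routed, shared))
```

Under the averaging rule the shared path always carries a fixed weight of 0.5 in the output, whatever the router decides. That fixed share is the property the post is questioning.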
1/ Stop hand-crafting test-time scaling heuristics. A new paper shows an AI agent can discover an inference algorithm that beats Self-Consistency, cuts token costs by 70%, and the total search compute cost was under $40.
For alignment you need V, but it is hard to compute. Most methods try to approximate it with 1) Tweedie, which is biased 2) MC roll-outs, which are slow with high variance. Training V was often neglected since it's hard. We beg to differ. StitchVM e
ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation Clusters users via HDBSCAN and retrieves documents from both the target user's profile and similar users' profiles. https://arxiv.org/abs
Interesting thing GBrain can do now: If you have a skill + code + test + resolver + resolver trigger + evals you want to package for someone else to use... GBrain will package it up for you into what I call a *SKILLPACK* It's tarball and
We are seeing gradual recovery on Railway metal workloads. To ensure things remain stable as we ramp back up, we are temporarily throttling all non-enterprise builds to avoid overwhelming our build infrastructure.
as far as i can tell the antigravity cli just... doesn't default to using the directory you start it in as its workspace?? it always wakes up confused in an empty .gemini/scratch directory and u gotta /add-dir manually??? so baffling i can
This is the problem with Flash 3.5 - fast, smart, and an order of magnitude more expensive than its predecessor in practice. 3x higher per-token costs and being hugely verbose on default settings is a bad combination.
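The cost claim in the post above combines two multipliers: price per token and tokens emitted per task. A rough sketch of that arithmetic follows. Only the 3x price ratio comes from the post. The post gives no verbosity factor, so the sketch solves for the value that "an order of magnitude" (read here as 10x, an assumption) would imply. The function names are hypothetical.

```python
def relative_cost(price_ratio, token_ratio):
    """Per-task cost ratio = per-token price ratio x output-token ratio."""
    return price_ratio * token_ratio


def verbosity_needed(target_cost_ratio, price_ratio):
    """Token ratio required to reach a given per-task cost ratio."""
    return target_cost_ratio / price_ratio


PRICE_RATIO = 3.0   # "3x higher per-token costs" (from the post)
TARGET_COST = 10.0  # assumption: "order of magnitude" read as 10x

needed = verbosity_needed(TARGET_COST, PRICE_RATIO)
print(f"implied output-token ratio: {needed:.2f}x")
print(f"check: {relative_cost(PRICE_RATIO, needed):.1f}x per-task cost")
```

On this reading, the model would need to emit a little over three times as many tokens per task as its predecessor for the two effects to compound to roughly 10x.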
Scaling embodied AI starts with automating the environments. Introducing SimWorld Studio: a self-evolving factory for endless interactive 3D environments where agents act, fail, and learn. With coding-agent + embodied-agent co-evoluti
In our pre-release testing, Command A+ performed strongly on speed for its intelligence, reaching 281 output tokens per second. This reflects higher intelligence and speed than models such as gpt-oss-120b, but sits behind the new Pareto fro
Homage to karpathy joining anthropic: made an in-browser (webgpu) version of the famous char-rnn "Unreasonable effectiveness of RNNs" demo, training on shakespeare :) ahh it's just as cool as when I was a student playing with keras in 2016
The age of one-time token due diligence is over. A given "token" can easily involve 300+ changing contracts. We've built a balance sheet graph for every token, so you can see all protocol/token dependencies and then model economic and oper
Firecracker was built by AWS for Lambda functions. - very fast spin up - stateless by design - ephemeral by default Perfect for Black Friday traffic spikes, but it can't run a GPU sandbox or Windows, or an Android device inside it. For t
I joined Exa when it was 25 people. Today we raised $250M, are 100+ people and have built the search engine for AI. it still feels like the early days, we're building infra to manage trillions of requests, endpoints that can handle any
@ollama + @deepseek_ai v4 pro handled entire monthly dev reports on Eigent. github prs → word doc → slack message → sent to product-release channel. in just one prompt. fully local. the full walkthrough is in the thread. try the same l
Pre-training is increasingly data-constrained: compute outruns text, models repeat tokens many times, and how much repetition you can afford is an open question. In "Mix, Don't Tune" (my @Apple MLR internship), we run ~1000 pre-training
also some performance things