Bend-to-C compiler refactor passes 1,016 generated programs
A compiler refactor was stress-tested overnight against a Haskell reference and ran the suite about 6x faster than GHC on one core
Top 90 curated tweets ranked for substance on 03 Jun 2026 UTC.
A compiler refactor was stress-tested overnight against a Haskell reference and ran the suite about 6x faster than GHC on one core
Massively parallel in vitro biochemistry can now measure binding constants for over 100,000 protein variants fast enough to close the loop on protein design
BGP hijacks can still pass newer validation schemes when networks fail to enforce that the first AS in a received route is the neighbor that sent it
Elixir now type-checks every line for bugs and dead code without requiring type signatures, moving a dynamic language toward gradual typing with low false positives
DuckDB is already multithreaded, and Quack adds the missing server layer so multiple clients can write to the same DuckDB database
A VS Code issue enabled a one-click path to stealing GitHub tokens, showing how editor integrations can become credential-exfiltration surfaces
Publishers in the UK will get a way to refuse Google’s AI features while remaining indexed in normal search, breaking the previous all-or-nothing bargain
Polars now has a horizontally scalable distributed engine that can run on self-managed Kubernetes while preserving the familiar Polars API
An old CIFS authentication-key logic flaw lets unprivileged users forge keys and escalate to root through malicious NSS modules on major Linux distributions
Natural protein sequence space contains enough fold redundancy that generating more sequences is often less useful than understanding which folded structures are actually distinct
Momentum can act as a spectral filter on matrix-valued gradients, making the subsequent orthogonalization step in Muon more reliable
Space datacenters become a cost question once terrestrial power, land, cooling, and chip-production constraints are modeled against launch and orbital operations
Commercial phone-location datasets let German state police track devices outside normal warrant processes, exposing a surveillance loophole in the data-broker economy
Precomputing Elysia route manifests during builds removes runtime JIT overhead and changes cold-start behavior for server-side TypeScript apps
Discrete-like latent spaces are a poor fit for continuous Gaussian diffusion, explaining why common heuristics such as self-conditioning help text generation
DeepProve now ships as a modular open-source repo that can benchmark and extend proof systems across safetensors, GGUF, and ONNX model formats
A browser demo turns latent vectors into live neural-SDF meshes and lets users interpolate between objects, making representation geometry tangible
Benchmark’s first growth fund marks a break from decades of defending a smaller, focused venture model
Stack turns accounting-firm playbooks into auditable SOPs that can run closes, reconciliations, and journal entries as the profession faces a severe labor shortage
Edge has a hardware-backed enclave capable of protecting data from kernel drivers, yet it was apparently not used for the obvious high-value target of stored passwords
A 130g, 360W peak drone motor with a fully American supply chain points at how defense robotics bottlenecks are shifting into component manufacturing
Retail investors are now trying to exit private equity as well as private credit, challenging the assumption that illiquid retail funds would behave differently
Real-world redactions often leave recoverable text behind, and stronger document parsing models make those failures visible instead of merely cosmetic
Cerebras shows how a nonstandard chip geometry can trade networking limits for extremely low-latency inference on large models
AGPL lets enterprises run software internally while preventing cloud providers from offering a closed competing service without sharing their changes
Off-road robots need localization when odometry drifts and GPS fails, so BEV-Patch-PF matches onboard views to satellite imagery in unstructured terrain
A large share of submitted NeurIPS position papers scored highly on an AI-writing detector, raising a concrete governance problem for scholarly venues
Malicious npm packages deployed a RAT that captured keystrokes, screenshots, and wallet credentials while using Hugging Face repositories as infrastructure
When more online writing is machine-generated, readers experience not just lower quality but a breakdown in the implicit social contract of communication
A single collector documented thousands of IBM pins, preserving a surprisingly rich material history of corporate computing culture
MACU is simple and general: a manager decomposes tasks into a directed acyclic graph (DAG), dispatches parallel subagents, and revises the DAG with new findings. A single slow CUA → a team of CUAs working in parallel! Interactive visualiz
Computer use agents are slow and brittle. The fix isn’t just stronger models, but also deploying them as multi-agent systems. MACU is a general Multi-Agent Computer Use framework that consistently lifts success rates by 3.4-25.5% and is up
MACU achieves better scaling behavior than single-agent CUAs, and improves success rates consistently across four CUA benchmarks (+4.7% on OSWorld, +3.4% on Online-M2W, +8.7% on WebTailBench, +25.5% on Odysseys). MACU also reduces the wall
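The manager-plus-DAG pattern described in the MACU tweets above can be sketched as a small dependency-aware scheduler. This is a minimal illustration, not MACU's implementation: `run_dag`, `fake_subagent`, and the task names are hypothetical, the worker is a plain function standing in for a subagent, and the manager's step of revising the DAG with new findings is left out.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def run_dag(tasks, deps, worker, max_workers=4):
    """Run each task as soon as all of its prerequisites have finished.

    tasks:  iterable of task names
    deps:   dict mapping a task name to the set of tasks it depends on
    worker: callable(task, inputs) -> result; a stand-in for a subagent
    """
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    results = {}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Dispatch every task whose prerequisites are all complete.
            ready = [t for t, d in remaining.items()
                     if all(p in results for p in d)]
            for t in ready:
                inputs = {p: results[p] for p in remaining.pop(t)}
                running[pool.submit(worker, t, inputs)] = t
            if not running:
                # Nothing is runnable and nothing is in flight.
                raise ValueError("cycle or missing dependency in task graph")
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                results[running.pop(fut)] = fut.result()
    return results


def fake_subagent(task, inputs):
    # Hypothetical worker: records which upstream results it received.
    return f"{task}({','.join(sorted(inputs))})"


deps = {"book": {"search_flights", "search_hotels"}}
out = run_dag(["search_flights", "search_hotels", "book"], deps, fake_subagent)
```

Here the two searches have no prerequisites, so they are submitted together, and `book` is dispatched only after both have returned.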
hey @willahmed , i found some bugs in how whoop advanced labs calculates/imports some of the biomarkers. 1. atherogenic index of plasma (AIP) is defined as: AIP = log10(triglycerides / HDL-C) the ratio must use molar units (mmol/L), but w
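To illustrate why the units matter in the AIP formula quoted above, here is a small sketch comparing the molar calculation with one that feeds mg/dL values straight into the ratio. The function names are hypothetical, and the conversion factors (88.57 mg/dL per mmol/L for triglycerides, 38.67 for cholesterol) are the commonly used ones, stated here as an assumption. It does not reconstruct the truncated part of the tweet or claim anything about how any product computes the value.

```python
import math

# Commonly used conversion factors (assumed here): mg/dL per mmol/L.
TG_MG_DL_PER_MMOL_L = 88.57    # triglycerides
HDL_MG_DL_PER_MMOL_L = 38.67   # HDL cholesterol


def aip_from_mg_dl(tg_mg_dl, hdl_mg_dl):
    """AIP = log10(TG / HDL-C), with both values converted to mmol/L first."""
    tg = tg_mg_dl / TG_MG_DL_PER_MMOL_L
    hdl = hdl_mg_dl / HDL_MG_DL_PER_MMOL_L
    return math.log10(tg / hdl)


def aip_unconverted(tg_mg_dl, hdl_mg_dl):
    """The same ratio taken directly on mg/dL values (no unit conversion)."""
    return math.log10(tg_mg_dl / hdl_mg_dl)


# The two versions differ by a constant offset, whatever the inputs are.
offset = aip_unconverted(150, 50) - aip_from_mg_dl(150, 50)
```

Because the two conversion factors differ, skipping the conversion shifts every result by the same constant, `log10(88.57 / 38.67)`, which is roughly 0.36.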
Modded-NanoGPT optimization result #29 (2026/05/14): @eliebakouch has achieved a new step-count record of 2930 via the following techniques: - Add Aurora to mlp.proj - Warmup & cooldown Muon mu - Disable SoftMuon & NorMuon - Extend Contra
Most autoresearch systems emulate an individual researcher. We created #SimpleTES to emulate a research community. The result: new SOTA discoveries across 21 open science problems, including: more efficient astrodynamics, 2× faster LASSO, Better
Quantized JetBrains Mellum2-12B-A2.5B-Thinking to MXFP4 for Apple Silicon. 12B MoE / 2.5B active, fits in 6.2 GB on disk and 7 GB peak memory. On M5 Pro: - Decode 130 tok/s - MATH-500 80% - HumanEval 93% - MMLU 90% Needs the open mlx-lm
Got it down by another millisecond! 6ms per 1080p frame on a single core is insanely fast, and I suspect this is very close to the optimum (famous last words, probably)
Amazing work led by @GhxIsaac! Deep research agents have selective memory: they stare at the beginning and the end of long-horizon trajectories, then ghost the middle. We turn this into a map for when cont
We worked with @trajectorylabs to run their SDPO++ algorithm on APEX-Agents and see what it could do with real production data. Pass rates went from 5% to 25% on GPT-OSS-120B, and the curve is still climbing. Read more about our work to
We built a simulator to understand the performance of @tensorlake 's sandbox scheduler and dataplane during sandbox creation bursts. We can safely simulate traffic bursts without spinning up 100s of very expensive machines. Google talk
New work: a simple and general multi-agent computer use framework. It uses a manager to plan and re-plan by creating a task DAG, with subagents for parallel execution. It improves success rate across benchmarks, and substantially improves
Quantized JetBrains Mellum2-12B-A2.5B-Thinking to OptiQ 5bpw mixed-precision for Apple Silicon. 12B MoE / 2.5B active, 3/4/6/8-bit per layer (KL-sensitivity allocated). 12 GB on disk, 13 GB peak memory. On M5 Pro: - Decode 89 tok/s - MATH
I had a lot of fun working on this paper - we found an elegant story for why subliminal learning happens! A key intuition in interpretability is that basically every interesting phenomenon in LLMs boils down to adding a steering vector. Sub
These techniques were discovered by a Claude-based autoresearch harness developed by @eliebakouch at @PrimeIntellect 2/2
Because it's the full stack from Tensors to MMIO, the ceiling on speed in tinygrad is higher than in any other framework.
We are now seeking a puzzle maker to help us create puzzles that LLMs can't yet solve.
We are seeing N-year exploits for patched vulnerabilities that still have remaining exposure (e.g. keygen). The theoretical knowledge is now instantly available, and the learning curve to implement them has been dramatically compressed.
Meet Gemma 4 12B Unified from @googlegemma ! This is a 12B dense, encoder-free multimodal that runs text, image & audio natively on-device. Day-0 support is now live in SGLang! Encoder-free architecture: raw image patches + audio wavefo
World models are moving beyond offline generation towards interactive, real-time experiences. Introducing FlashDreams: an open-source high-performance inference and serving library built for autoregressive world models: Up to 3.10× faste
we created a new, open source eval (LongArray-Extract) for one of the hardest problems in document processing: how to extract every row out of long documents. some highlights: - Extend's array extraction is SOTA (99.2%) - 3x faster than the
M3 traffic got wild, so we shipped overnight. Inference serving upgraded at 22:00 Beijing / 7:00 AM PT. TPS much smoother now. Most users should be seeing 50–70 TPS.
yesterday I turned a 2D character into thousands of living Gaussian splats. today I built an entire 2D game scene with them. trees, grass, flowers, particles, atmosphere, all made of splats. the foliage reacts as the character moves throu
Multi-speaker Transcription: Who said What and When? On 10 real multi-speaker CHiME / NOTSOFAR meetings, Trelis edges AssemblyAI on corpus cpWER. - Same single-channel audio. - Same meeteval scoring. - No oracle speaker labels. Trelis tra
What if physical AI policies could interact with generated worlds in real time? Introducing OmniDreams, a generative world model for closed-loop autonomous vehicle simulation. Tech report, code, models, and data samples are available now
Search agents have no explicit belief state or value function. I think that’s why long-horizon agents degrade and test-time search saturates. A few small experiments and thoughts: https://shreshthrajan.com/search-agents-state.html …
In early May, the best superforecasters predicted that, by the end of the year, the longest METR 80% task horizons would reach 3-4 hours. In late May, Claude Mythos achieved that number.
Today (June 3), I'll be speaking at CVPR at the Test-Time Scaling for Computer Vision WS (1:30 pm PT) about how we can use test-time compute to boost generalization of robot policies, room 506. Also speaking *right now* (in 5 min) in the D
the grpo reward was the probability assigned by the classifier that the attack was not malicious + a bonus if the argmax was not malicious (meaning the attacker had tricked the classifier). in early rounds the attacker does pretty well, but th
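The reward described above (the classifier's probability for "not malicious", plus a bonus when its top prediction is "not malicious") can be written as a small function. This is a sketch under assumptions: the label names, the dictionary input format, and the bonus size of 1.0 are all hypothetical, since the tweet does not give them.

```python
def attacker_reward(class_probs, benign_label="benign", bonus=1.0):
    """Reward for the attacker given the classifier's output distribution.

    class_probs:  dict mapping label -> probability (assumed input format)
    benign_label: name of the "not malicious" class (hypothetical)
    bonus:        extra reward when the classifier is fooled (assumed value)
    """
    p_benign = class_probs[benign_label]
    # The attacker "tricked" the classifier if benign is the argmax label.
    fooled = max(class_probs, key=class_probs.get) == benign_label
    return p_benign + (bonus if fooled else 0.0)


fooled_reward = attacker_reward({"benign": 0.7, "malicious": 0.3})
caught_reward = attacker_reward({"benign": 0.2, "malicious": 0.8})
```

The probability term gives a smooth signal even when the attack is still caught, while the bonus adds a step once the classifier's decision actually flips.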
running a fine-tuned LLM on my phone and beating GPT4o (the OG model) is such a great feeling. achieved better latency, accuracy, tool calls, and output format. 1 day to prepare dataset, 12 hrs to train, 3 hours to run evals.
Microsoft is releasing MXC, a containerization solution supporting custom policies (this is how openclaw would run), and there’s a preview on GitHub: https://github.com/microsoft/mxc
how do you sync a trillion parameter model every RL step without a shared cluster? we just wrote a blog about it, led by @AmineDirhoussi what I like the most is the way it proves you can use the Hub for basically everything → trainer on
Spotted a novel covered+looped apyUSD repeg trade. Someone is: 1. buying discounted apyUSD 2. depositing in the @roycoprotocol apyUSD Senior Tranche, 15% minimum coverage 3. using ST-apyUSD to borrow apxUSD 4. buying more discount apyUSD
Stronger models have made finding vulnerabilities easier, and the bottleneck has shifted to verification, triage, patching. Here are some lessons from working with security teams to address the new bottlenecks.
The fix for Meta's AI bot vulnerability was apparently: - remove the feature from the UI - leave the API endpoint accessible I wish I was joking.
Two new paper implementations just dropped on TensorTonic. Word2Vec: subsampling, skip-gram pairs, negative sampling, SGNS loss, CBOW forward, and a full SGD training step. The paper that started the whole embeddings revolution, built from
How to manage secrets with worktrees: Files that are untracked in Git will NOT be copied over to new worktrees (Codex, Claude Code, & Conductor included) Claude Code introduced .worktreeinclude, which uses glob syntax to copy untracked fi
Can LLMs reason in superposition? We introduce MUX, a method that turns text CoT into latent continuous reasoning. Instead of one-hot vectors as in CoT, the model now learns to predict weighted averages of several one-hot vectors, that we
We're launching the microagi Research Fellowship. Fellows get up to $2M in compute, robotics hardware, our evals, and one of the largest physical AI datasets ever assembled. You build in our lab, with our team, alongside partners like Unit
MAI-Code-1-Flash hits 71.6 on SWE-Bench Verified using a third of the tokens Claude Haiku 4.5 burns. Benchmarks now ship on two axes : performance & the cost to get there.
btw i have not dug into this but seems the claude sdk is reporting 1hr cache writes by default, for some cursed reason (not warden) pi uses the normal 5m default if accurate this would explain some of the sonnet delta can check after work
New GhostBeacon tool identifies rogue and hidden Wi-Fi access points by analyzing beacon frames, signal strength, uptime, and encryption patterns. Reveals how evil twin attacks exploit 802. #DFIR_Radar
This is the architecture of a single RLM forward pass... One user message in, one response out. What would an RLM agentic chat harness look like?
It's interesting to see that @MicrosoftAI uses Ray actors not just for controller and rollout workers but also for problem workers in the post-training of the MAI-Thinking-1 model. Instead of introducing a third-party dependency like @modal for san
University of Toronto researchers claim to have developed a "worm" powered by open source AI that exploits known flaws and tailors attacks for each computer (@cademetz / New York Times)
Thrilled to release the first LLM persuasion benchmark with user personas in our paper: Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues! Paper: https://arxiv.org/pdf/2606.02754 Code: https://github.com/Hanpx
I made a tweet earlier claiming `HttpService:RequestAsync` quietly discards the `Authorization` header if you use a `string` instead of the `Secret` datatype in live game servers. I appear to have been mistaken, it was actually a DDoS prot
hybrid local-cloud inference ftw ! @JonSaadFalcon and i been studying this for a hot sec (minions, ipw, openjarvis). link to our papers in comments below
I noticed @perplexity_ai Comet only routes your query to Google or Perplexity if your query is in English, and defaults to Google for languages like Chinese. As a Chinese speaker I developed my own router that routes my queries for me and it
You can already read @huggingface datasets directly in @DataPolars but not (yet!) from Buckets (HF's S3 alternative, great for private and working data). So I built a plugin to read + write Buckets straight from Polars:
Run Polars' distributed engine on your own infrastructure. Deploy a distributed Polars cluster on any Kubernetes setup (EKS, AKS, GKE, or minikube) and get a query dashboard with past queries, advanced query profiling, Open-lineage support
The recording from our talk: "From Responses To Trajectories: Multi-Turn and Multi-Environment RL" from @PyTorch Conf Europe is live! @krasul and I covered the latest advances in multi-turn GRPO in TRL: trajectories, tool use, envs, an
Most rewarding work of my life! I was part of our amazing data team. My mission was to curate all the STEM knowledge from the web to get a strong pre-trained checkpoint that could climb in RL. More details in Appendix A of our comprehensiv
Building a CLI that works for agents as well as humans requires a few UX choices: - machine readable errors and exit codes matter - detect whether there is a tty and choose JSON or text automatically - add an explicit --confirm flag for mu
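Two of the choices listed above, automatic JSON-versus-text output based on whether a TTY is attached, and machine-readable errors paired with exit codes, can be sketched as follows. The helper names `emit` and `fail`, the error payload shape, and the exit code values are hypothetical. The truncated `--confirm` item is not reconstructed here.

```python
import io
import json
import sys


def emit(payload, stream=None, force_json=False):
    """Write a dict as JSON for pipes and agents, or as text for a terminal."""
    if stream is None:
        stream = sys.stdout
    if force_json or not stream.isatty():
        # No terminal attached: assume a program is reading the output.
        stream.write(json.dumps(payload) + "\n")
    else:
        for key, value in payload.items():
            stream.write(f"{key}: {value}\n")


def fail(code, message, exit_code=2, stream=None):
    """Report a structured error and return the exit code to use."""
    if stream is None:
        stream = sys.stderr
    emit({"error": code, "message": message}, stream=stream)
    return exit_code


# An in-memory buffer is not a TTY, so this takes the JSON path.
buf = io.StringIO()
emit({"status": "ok", "items": 3}, stream=buf)
```

A real entry point would end with something like `sys.exit(fail("not_found", "no such item", exit_code=3))`, so that an agent can branch on the exit code and parse the error body without scraping text.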
Building momentum at Marin! Upgrading from Dense -> 129B parameter MoEs -> architecture improvements -> optimizer improvements gives our pretraining recipe an estimated 6x cumulative learning speedup, accounting for MFU. Includes community
"Tool calls are just API wrappers" is, to be honest, not completely true, although the most common use case is to call a search engine, hit a database, or fetch a URL. That framing is too narrow, and it limits how you design agents. At the end
surprisingly, the mai-thinking-1 tech report includes lots of details on pre-training and rl data, training recipes, training infrastructure, data pipelines, and ablation experiments. added to my flight reading list
Okay, I can’t believe I’m saying this, but it boots, my own completely custom operating system boots!!! You can see more about this journey in the tweet below. Started in Codex with /goal on May 4th. Totally wild. Surreal feeling right n
Can reasoning models become overly reliant on chain-of-thought examples? Our #ACL2026 work shows excessive CoT supervision is not always beneficial, and gives a recipe for tuning the CoT fraction to improve novel-task accuracy. Website:
Browser progress. Now you can open "Remote tabs" that run in Cloudflare Browser Run instances. Right click the tab to get a shareable CDP URL where you can hand off to your agent and watch it do things on the website for you (like fill out