Finding miscompiles for fun, not profit (t.co)
Compiler-fuzzing and model-assisted debugging can burn five figures in an afternoon while still surfacing real miscompilation bugs
Balanced toward durable technical artifacts and concrete current-news analysis while avoiding the many near-duplicate Opus 4.8 reactions
A networking breakthrough inside hyperscale data centers can change the cost and energy profile of AI clusters more than a marginal model update
RL training pipelines spend huge bandwidth moving fresh weights to inference engines, and a 100x reduction removes a major constraint on distributed post-training
May reached 892 Linux kernel CVEs without backfilled entries, showing how vulnerability accounting and kernel maintenance are entering a new volume regime
Open models and data for protein biology let outside labs inspect, reproduce, and extend work that could matter for understanding human physiology
EXPO-FT reports perfect success on eight tested robot tasks using only about 19 minutes of reinforcement-learning data on average
Fine-tuning scripts, datasets, evaluation rollouts, and tokenizer recipes make a robotics foundation model practical for outside teams to build on
A browser-accessible CUDA puzzle set lowers the barrier to learning GPU kernel programming without local hardware or rental setup
A playable Doom-in-Three.js port is a durable browser graphics artifact with code others can study and extend
Training on 2.1k hours of open fMRI data plus an open benchmark gives brain-imaging researchers a shared baseline for representation learning
Teams hitting GitHub API limits can share PATs and GitHub App installations behind a cached self-hosted shim instead of rewriting workflows
A common informal practice of copy-pasting generated Triton kernels from torch.compile is being turned into a cleaner API surface
Moving GPU orchestration into generated C can cut CPU overhead and simplify the runtime path once kernels are running
A deduplicated, recaptioned, openly licensed image corpus plus a text-to-image training codebase improves reproducibility for image generation research
The paper constructs arbitrarily large finite real sets where both sums and products are smaller than expected, challenging a central additive-combinatorics intuition
The study finds VC investment falls in areas less protected by antitrust enforcement, complicating the claim that weaker enforcement always helps startups
A dollar-cost-averaged bet across Anthropic’s recent rounds slightly underperformed SK Hynix over the same dates, tying AI startup returns to the hardware trade
Routing atomic arbitrage through expensive intermediary pools is leaking measurable MEV profits and creating room for lower-friction execution venues
In a 2,533-agent social simulation, private-data disclosure became about eight times more likely after agents observed another agent oversharing
AI liability claims may fail when the proof sits inside proprietary models, platform logs, protected databases, or internal documents that plaintiffs cannot access
MDSec showed malicious Visual Studio extensions can still reach the marketplace and execute with minimal controls, keeping IDE supply chains exposed
After a teenager reportedly showed marks for 2M test takers could be edited, the board’s public reassurance leaned on a ChatGPT-generated image rather than technical remediation details
A 100-dissertation sample found that more than half contained some amount of AI-generated text, suggesting credentialed academic writing is already changing
A coast-to-coast autonomous driving run covered 788 miles in one day without disengagements, giving a concrete measure of progress outside curated demos
The RP2350’s HSTX peripheral already reaches hundreds of megabits per second, making PCIe-class interfaces on cheap microcontrollers plausible within a decade
A $219M total ceiling from DIU, the Air Force, and the Navy gives Hermeus a major government-backed path to generate high-Mach flight data
https://arxiv.org/abs/2605.28079 Long context benchmark suite. It aggregates previous benchmarks.
OK FIRST EVAL: CODEX RUNNING /goal VS. CLAUDE CODE ORCHESTRATING CODEX AGENTS I have an ACTUAL long form tasks I have to finish. I created two separate worktrees This one is a full migration of services from Supabase to self-hosted Po
1 of 8 NVIDIA RTX PRO 6000 Blackwell being torn down for tinybox pro install. Don't worry, it's only $10,000 if you shear one of the ribbon cables.
i've recently been distilling stockfish into a no-search transformer. some cool results: - recreated the neural scaling laws - observed chinchilla optimality - curriculum learning on ascending depth data sub-performs you can also play agai
Production agents also change state. If an agent claims it updated a CRM, opened a PR, changed cloud config, or triggered a workflow, the eval should verify what actually happened. Agent Judge can inspect tool evidence, database logs, aud
We built Agent Judge to evaluate long-horizon agents. As agents take on longer tasks, the evidence needed to evaluate them gets buried across tool calls, retries, logs, database updates, and final outputs. Evaluating these agents requires
When they release Mythos it’ll prob be ~$20,000 per each full-repo scan. The hype helps justify the price. You’ll still need alternatives.
so I ran into this little problem where the 500km^2 smoke data made it clear that there were other fires going on at the same time, and it was weird that they weren't visualized. so this 1-fire-dataviz project became a 50-fire-dataviz proje
Congrats to the @liquidai team on LFM2.5-8B-A1B! Day-0 support is now live in SGLang. - 8B MoE, 1.5B active - Fast tool calling, punches 4x its size - 128K context + better non-Latin support - Runs local, no API keys, no data leaving
In the Vending-Bench Arena, Opus 4.8 lost to GPT-5.5 and Opus 4.7. It falls for scam suppliers (one run sent over $9,000 to a "membership" upsell), is worse at negotiation, runs the machine empty, overprices, and wastes time on strategy not
One reason to not bet on diffusion is that there is a limit to the capability of diffusion models for serial problems. This paper (http://arxiv.org/abs/2507.12549) shows from a complexity theory perspective that a diffusion model inherent
The beauty of the charming and amazing countryside of Syria
Learnings from testing Claude Opus 4.8: > Much worse than Opus 4.7 and GPT 5.5 on Vending Bench > More aligned than previous Claude models (Opus 4.6+ and Mythos) > Also worse on Blueprint-Bench > Scared of getting caught > Max reasoning is
On the first partial frontier are Deepgram Flux (7.36%, 0.019s), Deepgram Nova-3 Realtime (6.69%, 0.057s), Cartesia Ink-2 (external endpoints) (4.33%, 0.072s), and ElevenLabs Scribe v2 Realtime (3.65%, 0.132s).
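The frontier membership above can be checked mechanically. The sketch below assumes each pair is (word error rate in percent, latency in seconds) and that lower is better on both axes; the `Model` type, the `pareto_frontier` helper, and the extra "hypothetical-dominated" entry are illustrative and not part of the benchmark.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    wer_pct: float    # assumed: word error rate in percent, lower is better
    latency_s: float  # assumed: latency in seconds, lower is better


def pareto_frontier(models):
    """Return the models that no other model beats on both WER and latency."""
    frontier = []
    best_wer = float("inf")
    # Sweep from fastest to slowest; a slower model only survives if it
    # is strictly more accurate than everything faster than it.
    for m in sorted(models, key=lambda m: (m.latency_s, m.wer_pct)):
        if m.wer_pct < best_wer:
            frontier.append(m)
            best_wer = m.wer_pct
    return frontier


MODELS = [
    Model("Deepgram Flux", 7.36, 0.019),
    Model("Deepgram Nova-3 Realtime", 6.69, 0.057),
    Model("Cartesia Ink-2 (external endpoints)", 4.33, 0.072),
    Model("ElevenLabs Scribe v2 Realtime", 3.65, 0.132),
    # Made-up entry, included only to show a dominated point being dropped.
    Model("hypothetical-dominated", 7.00, 0.100),
]

if __name__ == "__main__":
    for m in pareto_frontier(MODELS):
        print(f"{m.name}: {m.wer_pct}% WER at {m.latency_s}s")
```

Under these assumptions the four listed models all stay on the frontier, because each slower one is also more accurate, while the made-up entry is dropped for being both slower and less accurate than Deepgram Nova-3 Realtime.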
I had a clanker rewrite ripgrep in Swift then spend a bunch of time optimizing it It’s now faster than the original Rust
Play with the demos. Training up to 20M steps/second on a single GPU. Most envs training in seconds to minutes, including our client envs. Turns out mazes and 2048 without exploiting domain knowledge are just harder than many real world pro
One thing which has been insanely difficult is generalizing pricing "migrations" Billing set ups are so complex and varied - people ask us for different things all the time We've been putting a ton of work into productionizing this and ca
The poolside technical report contains some interesting details about quantization. They leverage a rotation technique called Spinquant. Spinquant is essentially Turboquant’s cousin; TurboQuant rotates the KV cache, SpinQuant R1 rotates ac
Simple LLM judges break because long-horizon trajectories do not fit into a context window. They either see a narrow slice of the run, or try to ingest a long dense trajectory and miss the evidence in the middle. Agent Judge gives the eva
democratizing compute with RLMs you don't need a frontier model with a giant context window. even relatively small models get massive gains (they trained an 8B RLM-Qwen3 that beats its base model by ~28% and gets close to much larger mode
Opus 4.8 is a step back in terms of performance on all Andon Labs’ benchmarks, but a step forward in alignment. Previous Claude models (Opus 4.6+ and Mythos) engage in deceptive and power seeking behavior in their pursuit to win in Vending-B
Announcing AA-WER Streaming, our new benchmark measuring streaming Speech to Text models on accuracy and latency for voice agent use cases. Pareto optimal models on this new benchmark include those from Cartesia, ElevenLabs, and Deepgram S
The Gentlemen ransomware, a ransomware-as-a-service (RaaS) platform managed and operated by a threat actor that Microsoft Threat Intelligence tracks as Storm-2697, enables attacks at scale conducted by affiliates.
Released Polar the new agent RL rollout infra for latest harnesses
3 weeks ago we open-sourced HALO this led to talking with dozens of teams running agents at scale we realized the current agent monitoring tools aren't built for the future that we so clearly see ahead of us today we’re releasing native
Long-running cybercrime operation distributes cryptocurrency miners through pirated content sites, leveraging fake video player updates to infect millions. Campaign active since 2022 with sophisticated evasion and persistence mechanisms. T
3 weeks after launch, the feedback on @lightseekorg TokenSpeed’s scheduler and kernel design has been encouraging. Kimi K2.5 and Qwen 3.5 reaching speed-of-light performance is amazing. Long road ahead — the lean and small team with high
Opus 4.8 is live in Shortcut. It is a meaningful upgrade over Opus 4.6/4.7 for spreadsheet work. Will share full eval results soon, but when directly compared to Opus 4.6 on medium effort: Easier eval - 24 wins / 14 losses / 26 ties Harde
We are starting to be quite bullish about getting in the data infrastructure business. I just cloned 68 TB (while I only have a 4TB local disk) to my @huggingface training bucket in 1 minute 55 seconds, thanks to Xet deduplication and al
What's now open alongside the model: Fine-tuning scripts Every dataset used to train MolmoAct 2 All of our evaluation rollouts Training recipe for the open source MolmoAct 2 tokenizer
multi-turn RL and the "tito" problem keeps coming up. we've been working on it for a while, and the takeaway is that it's much easier than people are making it. it takes 1 implementation rule, and 1 chat-template property that all models a
MiniMax M3 >200B+ MoE 1M context window MSA (MiniMax Sparse Attention) architecture released in a few days open-sourced From a tweet by an official MiniMax team member: Not inside info just public stuff online. Open source mod
Cold starts are super painful for scaling LLM workers. Check out our work at restoring inference workers (including AOT traces) in seconds, not 10s of minutes!
Performance varies meaningfully across the three datasets with different audio lengths, accents, vocabulary, and background noise. On AA-AgentTalk, our private test set, ElevenLabs Scribe v2 Realtime leads both final (2.8%) and partial (2.9
Claude Opus 4.8 is now available in Cursor. On CursorBench, it's able to work much more efficiently than Opus 4.7. We've also found it to be more persistent on harder tasks.
How do we get LLMs to solve hard reasoning problems that the base LLM can barely solve? We show that through bidirectional search + evolutionary mutations, we can systematically search for complex solutions and posttrain models to solve th
what if you could see how many people downloaded your ai prompts now available on http://traces.com profile pages
you can in fact frame this as a compression problem where a generator learns to summarize some prior sequence in such a way that minimizes the conditional distribution drift (as measured by kldiv) instead of bolting on a summary prompt post
Hi all, I defended my PhD thesis. My thesis in two sentences: Current AI measurement takes LLMs as fixed objects, which constrains us to observational measurement. *Spiking* the training data (inserting certain data at known rates), enable
Claude Opus 4.8's system card explains why it's worse on Vending-Bench than Opus 4.7. Robustness against adversarial agents was indeed one of 4.8's failure modes. Also cool to see that @andonlabs 's findings played a small part in making
Does your GPT-5.5 also love Valparaíso in Chile !? Ask it to “Name a random city in the world”. You might expect a broad sample from thousands of cities. Instead, models collapse to the same small set of answers again and again. But why
So excited about this project. Despite all the talk about AGI, AI has barely scratched the surface of discovering scientific theories or even giving us new scientific insights. DiscoverPhysics is a benchmark for the future.
the site saves all collected bird calls for playback. below is a house finch! unbelievably cool to see the range these spectrograms cover. now that i’m starting to amass a library of calls i want to try sampling them into some music
With 104M image-text pairs, this is one of the largest, if not the largest, openly-licensed image datasets And it's on @huggingface !! Kudos @heyjasperai
Here’s how we built Town Lake, Cloudflare's unified analytics platform, alongside Skipper, an internal AI agent running on top of it.
It's almost a little boring to see so ~no resistance to the generic methods for Go proposal from the OG dependency-management-and-syntax-highlighting-is-bad crowd. There's some good ones in here, but few. Nothing from The Commander. Have
Kuaishou reports Q1 revenue up 3.4% YoY to ~$5B and Kling AI revenue up 300%+ YoY to ~$96M; Kling reached a ~$500M annualized revenue run rate in March 2026 (@cocof1026 / South China Morning Post)
I've written a tips article on the environment setup method when using the NVIDIA NGC that I normally use on a GPU Cluster. Tips: Development Environment for DL Distributed Learning Library Using Containers | Kazuki Fujii https:// zenn.de
"Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn" wtf? how??
Vercel CLI as a self-updating binary with zero external dependencies. Our CLI is one of the key interfaces enabling the 'cloud for agents'. This solves a huge bottleneck, as we ship changes to our CLI more than ever, and it's embedded in m
They don’t compete - I use them together. For example /loop 30m get all the tests to pass. For each review comment, run a triage workflow that writes fixes, and runs 2 adversarial reviews per fix, then applies and pushes
Not sure if this is counterintuitive or not: if your deliverables are further from the code, you get more speed-ups from coding agents. E.g. if your deliverable is a software, you get least speed-up from coding agents.
using codex to run your computer and tasks in a browser in-app or headless feels like magic
State machines are The first POC I did with agent driven UIs was literally just giving the agent a reference to the reducer dispatch action and the serialized JSON schema to describe the payload. Worked incredibly well
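A minimal sketch of the pattern described in that post, written in Python rather than whatever the author actually used: the agent is handed a `dispatch` function plus a serialized schema describing valid actions, and every action it emits is validated before it reaches a pure reducer. The todo-list domain, `ACTION_SCHEMA`, `validate`, and `Store` are all hypothetical, and the validator is a hand-rolled check that covers only the fields used here, not a full JSON Schema implementation.

```python
import json

# Hypothetical action schema in JSON Schema style; json.dumps(ACTION_SCHEMA)
# is what would be shown to the agent alongside the dispatch function.
ACTION_SCHEMA = {
    "type": "object",
    "required": ["type", "payload"],
    "properties": {
        "type": {"enum": ["add_todo", "toggle_todo"]},
        "payload": {"type": "object"},
    },
}


def validate(action):
    """Hand-rolled check covering only the constraints in ACTION_SCHEMA."""
    if not isinstance(action, dict):
        raise ValueError("action must be an object")
    for key in ACTION_SCHEMA["required"]:
        if key not in action:
            raise ValueError(f"missing field: {key}")
    if action["type"] not in ACTION_SCHEMA["properties"]["type"]["enum"]:
        raise ValueError(f"unknown action type: {action['type']}")
    if not isinstance(action["payload"], dict):
        raise ValueError("payload must be an object")


def reducer(state, action):
    """Pure state transition: returns a new list, never mutates the old one."""
    if action["type"] == "add_todo":
        return state + [{"text": action["payload"]["text"], "done": False}]
    if action["type"] == "toggle_todo":
        index = action["payload"]["index"]
        return [
            dict(todo, done=not todo["done"]) if i == index else todo
            for i, todo in enumerate(state)
        ]
    return state


class Store:
    """Holds the current state; dispatch is the only way to change it."""

    def __init__(self):
        self.state = []

    def dispatch(self, action_json):
        action = json.loads(action_json)  # the agent emits JSON text
        validate(action)                  # reject anything off-schema
        self.state = reducer(self.state, action)
        return self.state
```

The appeal of this design is that the agent cannot put the UI into a state the reducer does not define: malformed or unknown actions fail validation instead of being applied.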
New post from @iapsAI on Cyber Superstorms My colleagues argue that counting zero-days is not the way to measure the consequences of AI-accelerated vulnerability Instead, they propose that the community should focus on how often AI-acc
new paper we made serving many different finetunes surprisingly efficient by just… not intervening at decode steps!
Claude Opus 4.8 is also more efficient than its predecessor - it achieves its higher performance in 15% fewer turns per task and with 35% fewer output tokens than Opus 4.7. However, it still uses approximately 30% more turns than OpenAI’s
Cartesia Ink-2 debuts as #1 for accuracy on the brand-new streaming speech-to-text leaderboard from @ArtificialAnlys ! We designed Ink-2 from the ground up for voice agents - with low latency, eager transcripts, and semantic endpointing.
RF-DETR is nearly 2x more accurate than TrackNet, a model developed specifically for detecting small, fast-moving objects
Fake ChatGPT site delivers dual-platform malware targeting Windows and Mac users. Windows victims get credential stealers while Mac users receive $3K/month AMOS malware designed for cryptocurrency theft. Key technical details: • Fake site
How far behind are open models? Across 17 selected benchmarks, private ones show a gap of 8-10 months today, almost 2x the gap on public ones (4-6 mo). More discussion (including limitations), code and blog in the thread.
excellent blog on how to actually make agents better instead of just benchmaxxing evals. some imp points: -> benchmaxxing fits tools where a human stays in control and catches mistakes. floor raising fits agents that work alone with no one
People usually learn tries in the context of autocomplete and dictionary problems, but once you start working on real infra systems, you realize tries are everywhere underneath modern high-performance networking and search stacks. I was re
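One common networking use of tries is longest-prefix matching in routing tables. The sketch below is a plain binary trie over IPv4 address bits, offered as an illustration of the general idea and not as the structure any particular system uses; the `PrefixTrie` class and the example routes are made up, and production routers typically rely on compressed or multi-bit variants.

```python
import ipaddress


class PrefixTrie:
    """Binary trie keyed on IPv4 address bits, for longest-prefix matching."""

    def __init__(self):
        # Each node is a dict with optional "0"/"1" children and an
        # optional "hop" entry holding the next hop stored at that prefix.
        self.root = {}

    @staticmethod
    def _bits(addr, length):
        # First `length` bits of the address as a string of "0"/"1".
        return format(int(addr), "032b")[:length]

    def insert(self, cidr, next_hop):
        net = ipaddress.ip_network(cidr)
        node = self.root
        for bit in self._bits(net.network_address, net.prefixlen):
            node = node.setdefault(bit, {})
        node["hop"] = next_hop

    def lookup(self, ip):
        node = self.root
        best = node.get("hop")  # default route, if one was inserted
        for bit in self._bits(ipaddress.ip_address(ip), 32):
            node = node.get(bit)
            if node is None:
                break
            # Remember the deepest prefix seen so far that carries a hop.
            best = node.get("hop", best)
        return best


if __name__ == "__main__":
    table = PrefixTrie()
    table.insert("0.0.0.0/0", "default-gw")
    table.insert("10.0.0.0/8", "core-a")
    table.insert("10.1.0.0/16", "edge-b")
    print(table.lookup("10.1.2.3"))
    print(table.lookup("10.2.3.4"))
    print(table.lookup("8.8.8.8"))
```

A lookup walks at most 32 nodes regardless of table size, which is the property that makes trie-style structures attractive for packet forwarding.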
DSPy v3.3.0 beta 1 is released on pypi! We would really appreciate your feedback! We are introducing ReActV2 and a much improved LM/BaseLM system, along with a way to pass data to an RLM. Thanks to @MaximeRivest , @kmad , and @mchonede