What Is Autoresearch? Karpathy's ML Agent Explained
Autoresearch ran 700 ML experiments in 2 days, cutting GPT-2 training time by 11%. How it works, what it costs, and why it needs a GPU.
You go to sleep. An AI agent stays up, rewriting your training code, running trials every five minutes, keeping only what works. By morning, it's tested 100 setups you'd never have tried.
That's autoresearch. Andrej Karpathy released it in March 2026 as a 630-line Python script. It hit 53,100 GitHub stars in weeks (karpathy/autoresearch on GitHub, 2026). Not a framework. Not a platform. Just a loop: modify code, train, evaluate, repeat. No human in the loop.
What caught people off guard wasn't the idea — it was that it worked. In two days, autoresearch completed 700 trials and surfaced 20 genuine wins for GPT-2 training, shaving time from 2.02 hours down to 1.80 hours (VentureBeat's autoresearch coverage, 2026). That's an 11% speedup found entirely by machine.
TL;DR: Autoresearch is Karpathy's open-source tool that lets an AI agent (Claude or Codex) autonomously modify training code and run GPU experiments every 5 minutes overnight. It ran 700 experiments in 2 days, found 20 improvements, and cut GPT-2 training time by 11% (VentureBeat, 2026). You need one NVIDIA GPU and an API key.
Original analysis: The cost-per-experiment breakdown below is calculated from public GPU cloud pricing — no other coverage does this math.
What Does Autoresearch Actually Do?
Autoresearch pairs an AI coding agent with a real LLM training setup, then lets that agent run wild without supervision (The New Stack's breakdown of the experiment loop, 2026). It doesn't suggest changes. It writes code, kicks off training runs, and judges the results on its own.
Here's how the loop works:
1. The AI agent reads the current training code. Using Claude or Codex via API, it studies the codebase and picks a change — a new learning rate schedule, a bigger batch size, a tweak to the architecture.
2. It edits the code directly. The change hits disk. No human approval needed.
3. Training runs for exactly 5 minutes. This fixed time budget is the key design choice. Model size, optimizer, data pipeline — doesn't matter. Every trial gets the same 5-minute window, so results are directly comparable.
4. The agent checks the loss. Better? Keep it. Worse? Revert.
5. Repeat. Next idea, next trial, no pause.
At 12 trials per hour, an 8-hour overnight session covers around 100 configurations (Data Science Dojo's autoresearch explainer, 2026). That's a search space no human researcher would cover in a week. So what happens when you let it run for two full days?
Why does the 5-minute cap matter so much? Traditional hyperparameter sweeps let each run take as long as it needs. That makes comparison messy. Autoresearch sidesteps the problem entirely. Every trial is equal. Swap the architecture, triple the batch size, rewrite the data loader — the results still land on the same scoreboard.
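The loop above is simple enough to sketch in a few lines of Python. This is a hedged reconstruction, not Karpathy's actual script: `propose_change` and `train_and_eval` are hypothetical stand-ins for the LLM call and the fixed-budget training run, with placeholder behavior so the structure is visible.

```python
import random

TRIAL_SECONDS = 300  # the fixed 5-minute budget that makes every trial comparable

def propose_change(code: str) -> str:
    """Hypothetical stand-in for the Claude/Codex API call that edits the script."""
    # A real agent returns a modified training script; here we just tag the code.
    return code + f"\n# tweak {random.randint(0, 9999)}"

def train_and_eval(code: str, seconds: int = TRIAL_SECONDS) -> float:
    """Hypothetical stand-in for a fixed-budget training run returning eval loss."""
    return random.uniform(2.0, 4.0)  # placeholder; a real run trains a small LM

def run_session(code: str, trials: int) -> tuple[str, float]:
    """Keep-or-revert loop: each surviving change becomes the next baseline."""
    best_loss = train_and_eval(code)      # score the unmodified code first
    for _ in range(trials):
        candidate = propose_change(code)  # agent proposes one edit
        loss = train_and_eval(candidate)  # every trial gets the same time window
        if loss < best_loss:              # better? keep it as the new baseline
            code, best_loss = candidate, loss
        # worse? fall through and discard the candidate (an implicit revert)
    return code, best_loss
```

At 12 trials per hour, `run_session(code, 96)` approximates one 8-hour overnight session; because each win becomes the baseline for the next trial, small improvements compound across the run.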
Why Does Training Require a GPU?
Here's the part that trips people up: the AI agent itself doesn't need a GPU. Claude and Codex run in the cloud via API. The GPU is for training the small language model in each trial loop.
NVIDIA's H100 delivers 3-6x faster transformer training than the A100, with up to 30x faster inference (Northflank's H100 vs A100 benchmark, 2025). What a GPU handles in 5 minutes would take a CPU hours. That gap isn't a minor detail — it's the reason autoresearch exists.
Key insight: The fixed 5-minute budget creates a natural GPU dependency — not because of raw compute, but because experiments become non-comparable at different time scales.
Think about it this way. The fixed 5-minute budget means GPU throughput isn't optional. Swap in a CPU, and that same trial stretches to hours. You'd get 1-2 runs per night instead of 100. The entire feedback loop falls apart.
There's a deeper hardware story too. Compute throughput (FLOPS) has scaled 60,000x over 20 years, but memory bandwidth has scaled only 100x — making bandwidth the real bottleneck for LLM training (APXML's compute requirements guide, 2025). GPUs solve this with high-bandwidth memory (HBM). CPUs don't have it.
Autoresearch needs a single NVIDIA GPU running CUDA. Karpathy kept the code to 630 lines by targeting one platform. Supporting CPU, Apple Silicon MPS, and AMD ROCm would have doubled or tripled the codebase. Community forks exist for Apple Silicon, but they're not in the main repo.
What Results Has Autoresearch Achieved So Far?
Karpathy's original run produced 700 trials over two days, surfacing 20 genuine improvements to GPT-2 training (VentureBeat's autoresearch coverage, 2026). The wins stacked. GPT-2 training time dropped from 2.02 hours to 1.80 hours — an 11% speedup from changes no human proposed.
But the results that turned heads came from outside Karpathy's lab.
Shopify CEO Tobi Lutke ran 37 trials overnight. His 0.8 billion parameter model beat his hand-tuned 1.6 billion parameter model — a 19% gain with half the parameters (Fortune's analysis of autonomous AI agents, 2026). He then pointed the same pattern at Shopify's Liquid templating engine: 53% faster rendering and 61% fewer memory allocations from 93 automated commits.
Read that again. A CEO aimed an AI agent at production infrastructure, went to bed, and woke up to a measurably faster system. No ML expertise needed for the overnight run. Just a GPU, an API key, and faith in the loop.
The first night alone produced around 50 trials with zero human input (The New Stack's breakdown of the experiment loop, 2026). Most trials get discarded. That's fine. The ones that survive compound — each win becomes the baseline for the next round.
How Much Does It Actually Cost to Run Autoresearch?
Nobody in the current coverage does this math. So we did.
Autoresearch has two cost buckets: the GPU for training and the LLM API for the agent's reasoning.
GPU cost per overnight session:
An H100 80GB runs $2.74 per hour on cloud platforms (Northflank's H100 vs A100 benchmark, 2025). An A100 80GB costs $1.76 per hour. For an 8-hour session:
- H100: $2.74 × 8 = $21.92
- A100: $1.76 × 8 = $14.08
At 100 trials per session, that works out to $0.14 to $0.22 per trial. How does that compare to a human? An ML researcher running the same tests manually — say 3 per day — would need 33 working days to match one overnight session.
LLM API cost per session:
The agent makes an API call per trial — reading code, proposing changes, writing the diff. Based on current Claude and Codex pricing, each interaction costs between $0.02 and $0.10 depending on context length and model tier. For 100 trials: $2-10 in API costs.
Total overnight price tag: $16-32. That buys the equivalent of a month's worth of manual testing.
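The arithmetic above fits in a small calculator. The GPU rates are the cited public cloud prices; the per-trial API cost uses the rough $0.02-$0.10 estimate from this section, which is an assumption rather than a published price sheet.

```python
def session_cost(gpu_rate_per_hr: float, hours: float,
                 trials: int, api_cost_per_trial: float) -> dict:
    """Estimate the two cost buckets for one autoresearch session."""
    gpu = gpu_rate_per_hr * hours        # GPU rental for the whole session
    api = api_cost_per_trial * trials    # one LLM call per trial
    return {
        "gpu": round(gpu, 2),
        "api": round(api, 2),
        "total": round(gpu + api, 2),
        "per_trial": round((gpu + api) / trials, 4),
    }

# 8-hour overnight session, 100 trials, using the cited cloud rates
cheap = session_cost(1.76, 8, 100, 0.02)   # A100 + short-context API calls
pricey = session_cost(2.74, 8, 100, 0.10)  # H100 + long-context API calls
```

With these inputs, `cheap["total"]` comes out to $16.08 and `pricey["total"]` to $31.92 — the $16-32 range quoted above.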
How Does Autoresearch Compare to Traditional AutoML?
The global AutoML market hit $4.92 billion in 2025, projected to reach $92.31 billion by 2034 at 38.52% CAGR (Fortune Business Insights AutoML report, 2025). But autoresearch isn't playing in that market. It's doing something different.
AutoML tools like AutoGluon, FLAML, and Optuna search a predefined parameter space. You set the ranges, they optimize within bounds. Systematic, but boxed in.
Autoresearch doesn't search parameters. It rewrites source code. The agent can swap the optimizer, redesign the architecture, add data augmentation, or restructure the entire training loop. This isn't hyperparameter tuning. It's automated research — and the distinction matters.
| Feature | Traditional AutoML | Autoresearch |
|---|---|---|
| Search space | Predefined hyperparameters | Entire codebase |
| Changes | Numerical parameters | Source code modifications |
| Agent | Optimization algorithm | LLM (Claude/Codex) |
| Creativity | None (grid/random/Bayesian) | Can invent new approaches |
| Setup | Framework-specific config | 630 lines of Python |
| Cost | Compute only | Compute + LLM API |
| Reproducibility | Deterministic | Non-deterministic (LLM varies) |
The tradeoff? AutoML is predictable and reproducible. Autoresearch is creative but non-deterministic. Run it twice and you'll likely get different improvements. That's a feature when you're exploring. It's a problem when you need audit trails.
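The table's first two rows are the crux, and a side-by-side sketch makes them concrete. Both functions below are illustrative, not real tool APIs: `automl_trial` mimics how a tuner samples from fixed numeric ranges, while `autoresearch_trial` shows the agent pattern, where the "search space" is the source file itself and `llm` is a hypothetical callable standing in for a Claude or Codex API client.

```python
import random

def automl_trial(space: dict) -> dict:
    """Traditional AutoML: sample numbers from predefined ranges. Boxed in."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in space.items()}

def autoresearch_trial(code: str, llm) -> str:
    """Autoresearch: hand the whole script to an LLM, get edited code back."""
    prompt = (
        "Here is a training script. Propose one change that might lower "
        "eval loss within a 5-minute budget, and return the full modified "
        "file:\n\n" + code
    )
    return llm(prompt)  # any edit is legal: optimizer, architecture, data loader

space = {"learning_rate": (1e-4, 1e-1), "batch_size": (16, 256)}
params = automl_trial(space)  # a point inside a fixed box
# new_code = autoresearch_trial(src, call_claude)  # a point in program space
```

The AutoML side can never leave its box; the autoresearch side can, which is exactly why its results are creative but non-deterministic.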
Can You Run Autoresearch Without an NVIDIA GPU?
Short answer: not with the official repo. Karpathy targeted CUDA only to keep the code at 630 lines. Multi-backend support would have doubled or tripled the codebase.
The community moved fast though. Within weeks, forks appeared:
- Apple Silicon (MLX): Community ports using Apple's MLX framework for M1/M2/M3 Macs. Training is slower than NVIDIA GPUs but functional for smaller models.
- AMD (ROCm): Experimental support through PyTorch's ROCm backend. Less mature than CUDA but workable.
- CPU-only: Technically possible but impractical. A 5-minute GPU experiment becomes a multi-hour ordeal, breaking the fast feedback loop that makes autoresearch work.
Here's the thing: the GPU requirement isn't arbitrary. It's structural. The system's entire value — 100 trials overnight — depends on each one finishing in 5 minutes. Remove the GPU and you don't get a slower version of autoresearch. You get a broken one.
What Does Autoresearch Mean for AI Research?
Seventy-three percent of engineering teams now use AI coding tools daily, up from 41% in 2025 (Pragmatic Engineer's AI tooling survey, 2026). Claude Code alone writes around 4% of all public GitHub commits — 135,000 per day — and that share is projected to hit 20% by year's end (Gradually.ai's Claude Code statistics, 2026).
Key insight: Autoresearch represents a shift from AI-assisted coding (human directs, AI executes) to AI-directed research (AI directs, GPU executes) — the human becomes the reviewer, not the driver.
Autoresearch sits where two trends collide: autonomous AI agents and cheap GPU compute. It isn't the first tool to automate ML work. But it's the first to treat training code — not just parameters — as the search space. That's a conceptual leap worth paying attention to.
The implications reach well beyond ML. Lutke proved you can aim this pattern at any codebase with a measurable metric. Rendering speed. Memory usage. API latency. If you can score it, an agent can optimize it while you sleep.
With 4.3 million AI repos on GitHub and a 178% year-over-year jump in LLM-focused projects (GitHub Octoverse 2025 report, 2025), those 53,100 stars tell a clear story. Developers don't just want tools that assist. They want tools that produce results overnight.
Frequently Asked Questions
What do you need to run autoresearch?
Three things: a single NVIDIA GPU with CUDA support, a Python environment, and an API key for Claude or Codex. The script is 630 lines of Python with minimal dependencies (karpathy/autoresearch on GitHub, 2026). An A100 or H100 is recommended — cloud rental runs $14-22 for an 8-hour overnight session.
How is autoresearch different from hyperparameter tuning?
Traditional hyperparameter tuning searches predefined numerical ranges — learning rate between 0.001 and 0.1, batch size from 16 to 256. Autoresearch modifies actual source code. The AI agent can rewrite the optimizer, change the model architecture, or restructure the entire training pipeline. It's closer to automated research than parameter optimization.
Can autoresearch work on problems beyond LLM training?
Yes. Shopify CEO Tobi Lutke applied the same pattern to Shopify's Liquid templating engine, achieving 53% faster rendering and 61% fewer memory allocations from 93 automated commits (Fortune's analysis of autonomous AI agents, 2026). Any codebase with a measurable performance metric can benefit.
Is autoresearch safe to run on production code?
Autoresearch works on isolated training scripts, not production systems. Each experiment modifies code in a sandboxed environment, trains for 5 minutes, and reverts if results don't improve. Lutke's Shopify experiments ran on a separate branch with human review before merging. The agent proposes; humans approve production deployments.
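The sandbox-and-revert pattern described above can be sketched in pure Python. This is a minimal illustration of the idea, not autoresearch's actual mechanism: `edit` and `evaluate` are hypothetical hooks standing in for the agent's rewrite and the 5-minute training run, and the temp-directory copy plays the role of the isolated environment.

```python
import shutil
import tempfile
from pathlib import Path

def sandboxed_trial(script: Path, edit, evaluate) -> bool:
    """Run one experiment on a copy of the script; keep the edit only if it wins.

    `edit` rewrites the file's text; `evaluate` returns a loss for a script path.
    """
    baseline = evaluate(script)
    with tempfile.TemporaryDirectory() as tmp:
        trial = Path(tmp) / script.name
        shutil.copy(script, trial)                # experiment on a copy, never in place
        trial.write_text(edit(trial.read_text()))
        if evaluate(trial) < baseline:            # improved: promote the change
            shutil.copy(trial, script)
            return True
    return False  # temp dir is deleted on exit: rejection is an automatic revert
```

Losing trials never touch the original file, so a bad edit costs nothing but the 5-minute budget it burned.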
How long until autoresearch supports Apple Silicon or AMD GPUs?
Community forks for Apple Silicon (via MLX) already exist but aren't in the main repo. Karpathy has indicated he's keeping the official code minimal — CUDA only. AMD ROCm support is experimental through PyTorch's existing backend. For Mac users, the MLX forks work for smaller models but run slower than NVIDIA GPUs.
The Bottom Line
Autoresearch isn't magic. It's a tight loop: an AI agent writes code, a GPU trains models, a scoring function decides what survives. The real story is the economics. For $16-32, you get a month's worth of manual testing done overnight.
Those 53,100 stars reflect something real. Developers are rethinking optimization. Why hand-tune when an agent can test 100 setups while you sleep? Why stay inside hyperparameter grids when the agent can rewrite the whole training loop?
Got an NVIDIA GPU and a measurable metric? Give autoresearch one overnight run. Worst case: you spend $20 and learn nothing. Best case: you wake up to an 11% gain you'd never have found on your own.
Related reading:
- How agentic AI is reshaping engineering workflows in 2026 — CIO's deep dive into how AI agents are changing the developer role
- Cloud GPU pricing comparison for ML workloads — GMI Cloud's engineering guide to GPU costs across providers
- Get started with the Claude API — Anthropic's official docs for building with Claude