CUDA Agent: RL for GPU Kernel Optimization

GPU kernel optimization is one of the clearest examples of a task where “can write code” is not enough.

A correct CUDA kernel may still be useless if it is slower than PyTorch eager execution or torch.compile. The hard part is not syntax. The hard part is hardware-aware performance: memory access, thread mapping, operator fusion, occupancy, shared memory, registers, profiling noise, and the long loop of trying something, measuring it, and refining it.

The paper “CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation” treats that loop as the central object of training.

The core idea is straightforward and important:

If CUDA optimization is an iterative development skill, train the model inside an iterative CUDA development environment with reliable execution feedback.

This is a useful shift for agent builders. CUDA Agent is not merely prompting a general-purpose model to produce better kernels. It is training an agent to use a structured environment: write kernels, compile, run correctness checks, profile performance, inspect failures, and optimize again.
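The loop in that last sentence can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `Feedback`, `refine`, `propose`, and `evaluate` are hypothetical names, with `propose` standing in for the model and `evaluate` standing in for the compile, verify, and profile tools. It shows only the inference-time loop, not the RL training around it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """What the (hypothetical) environment returns for one attempt."""
    compiled: bool
    correct: bool
    latency_ms: Optional[float]  # None when the kernel never ran
    log: str                     # compiler errors, test diffs, profiler notes


History = List[Tuple[str, Feedback]]


def refine(propose: Callable[[History], str],
           evaluate: Callable[[str], Feedback],
           max_turns: int) -> Tuple[Optional[str], Optional[float]]:
    """Run write -> evaluate -> refine and keep the fastest correct kernel."""
    history: History = []
    best_src, best_ms = None, None
    for _ in range(max_turns):
        src = propose(history)      # the model sees all earlier attempts
        fb = evaluate(src)          # compile, check correctness, profile
        history.append((src, fb))
        usable = fb.compiled and fb.correct and fb.latency_ms is not None
        if usable and (best_ms is None or fb.latency_ms < best_ms):
            best_src, best_ms = src, fb.latency_ms
    return best_src, best_ms
```

Note that a fast but incorrect attempt never becomes the best candidate: correctness gates performance, which mirrors the two constraints discussed in the next section.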

Why general coding ability does not solve CUDA

Modern coding models are strong at many software tasks. But CUDA kernel generation is unusually unforgiving.

The output must satisfy two different constraints at once:

Correctness: compile successfully and match the PyTorch reference behavior.
Performance: beat meaningful baselines such as eager execution and torch.compile.

Many models can satisfy the first constraint. The second is harder.

A model may generate a plausible kernel that passes tests but performs poorly because it materializes intermediate tensors, uses inefficient memory access patterns, underutilizes parallelism, or chooses a bad tiling strategy. A general-purpose LLM can sound like a CUDA expert without consistently behaving like one under a profiler.
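The two constraints can be written down as a single judgment. The sketch below is an assumption-laden illustration: `outputs_match` and `judge` are hypothetical helpers, outputs are flattened to plain lists of floats, and the tolerances are placeholder values rather than the ones used in the paper.

```python
import math
from typing import Dict, Sequence


def outputs_match(candidate: Sequence[float], reference: Sequence[float],
                  rtol: float = 1e-4, atol: float = 1e-5) -> bool:
    """Element-wise comparison against the reference (tolerances are assumed)."""
    if len(candidate) != len(reference):
        return False
    return all(math.isclose(c, r, rel_tol=rtol, abs_tol=atol)
               for c, r in zip(candidate, reference))


def judge(candidate_out: Sequence[float], reference_out: Sequence[float],
          candidate_ms: float, eager_ms: float, compile_ms: float) -> Dict[str, bool]:
    """A kernel only counts as faster if it is also correct."""
    correct = outputs_match(candidate_out, reference_out)
    return {
        "correct": correct,
        "faster_than_eager": correct and candidate_ms < eager_ms,
        "faster_than_compile": correct and candidate_ms < compile_ms,
    }
```

A kernel that is fast but wrong scores nothing here, and a kernel that is correct but slower than both baselines passes only the first check.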

The CUDA Agent paper argues that previous approaches are limited in two ways.

Training-free refinement systems can search and debug at test time, but they remain bounded by the base model’s intrinsic CUDA skill. Fine-tuning approaches with fixed multi-turn loops can use execution feedback, but they often constrain the agent’s autonomy and waste context by carrying previous attempts in rigid trajectories.

CUDA Agent instead scales three things together: data, environment, and RL stability.

Component 1: scalable task synthesis

High-quality expert CUDA kernels are expensive to collect. The authors avoid relying on human-written optimized labels by framing training as reinforcement learning over PyTorch reference operators.

They build a data synthesis pipeline with three stages:

crawl seed operators from PyTorch and Transformer libraries;
use LLM-based combinatorial synthesis to create fused multi-operator tasks;
filter tasks so they are executable, deterministic, non-trivial, and have reasonable workloads.
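The filtering stage can be pictured as a predicate over a probe of each candidate task. Everything named here is hypothetical: `TaskProbe`, `keep_task`, and especially the latency bounds, which are placeholders and not the thresholds from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TaskProbe:
    """Observations from running a candidate task's PyTorch reference (hypothetical)."""
    name: str
    ran_without_error: bool
    first_run_output: Tuple[float, ...]
    second_run_output: Tuple[float, ...]   # same inputs, second execution
    output_is_constant: bool               # e.g. all zeros regardless of input
    eager_ms: float                        # reference latency in eager mode


def keep_task(probe: TaskProbe, min_ms: float = 0.01, max_ms: float = 100.0) -> bool:
    """Keep tasks that are executable, deterministic, non-trivial, and sized sensibly."""
    executable = probe.ran_without_error
    deterministic = probe.first_run_output == probe.second_run_output
    non_trivial = not probe.output_is_constant
    reasonable_workload = min_ms <= probe.eager_ms <= max_ms  # assumed bounds
    return executable and deterministic and non_trivial and reasonable_workload
```

Each condition protects the reward signal: a non-deterministic reference makes correctness checks unreliable, and a workload that is too small makes latency differences hard to separate from noise.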

The fusion part matters. A fused task is not just a list of independent kernels. Fusion changes the optimization landscape: intermediate global-memory writes may disappear, register/shared-memory pressure changes, and a good mapping for one operator may not be optimal for the combined computation.
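A back-of-the-envelope model makes the global-memory point concrete. The sketch assumes a chain of unary element-wise operators on float32 data, ignores caches, and uses a hypothetical helper name; it is an idealized estimate, not a measurement.

```python
def elementwise_chain_traffic_bytes(num_elements: int, num_ops: int,
                                    bytes_per_element: int = 4,
                                    fused: bool = False) -> int:
    """Idealized global-memory traffic for a chain of unary element-wise ops.

    Unfused: every op launches its own kernel, reading and writing the tensor.
    Fused: one kernel reads the input once and writes the final output once,
    keeping intermediates in registers.
    """
    kernels = 1 if fused else num_ops
    return kernels * 2 * num_elements * bytes_per_element


n = 1 << 20  # about one million float32 elements
unfused = elementwise_chain_traffic_bytes(n, num_ops=3)
fused = elementwise_chain_traffic_bytes(n, num_ops=3, fused=True)
print(unfused, fused, unfused / fused)  # 25165824 8388608 3.0
```

For memory-bound chains, traffic shrinks roughly in proportion to the number of operators fused. Real fused kernels pay for this with higher register and shared-memory pressure, which is why a good fused mapping is not a simple concatenation of good unfused ones.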

For agent training, this creates a richer curriculum than isolated toy kernels.

Component 2: a skill-augmented CUDA environment

The most agentic part of CUDA Agent is its environment.

The model is placed inside a CUDA development workspace with files such as the PyTorch reference model, C++ bindings, generated .cu kernels, verification scripts, and profiling scripts. A skill specification describes the intended workflow: implement the accelerated model, validate correctness, profile latency, interpret feedback, and iterate.

This matters because the environment defines the feedback channel.

For performance optimization, reward quality is decisive. If timing is noisy or correctness checks are weak, the policy can learn to exploit measurement artifacts instead of learning real optimization. The authors therefore emphasize rigorous correctness and performance tests, and isolation against reward hacking.

Their sandbox design decouples CPU and GPU resources: a Docker-based terminal sandbox handles CPU-centric work such as compilation, while verification and profiling jobs are dispatched to a dedicated GPU sandbox pool. The paper reports a pool of 128 NVIDIA H20 GPUs to provide stable latency measurement and exclusive resource allocation.
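Dedicated hardware addresses one source of noise; the measurement procedure addresses another. A common pattern, shown here as a generic sketch and not as the paper's profiling script, is to warm up, repeat the measurement, and report the median. `measure_ms` is a hypothetical helper that times any Python callable on the host clock.

```python
import statistics
import time
from typing import Callable


def measure_ms(fn: Callable[[], object], warmup: int = 3, trials: int = 10) -> float:
    """Median wall-clock latency of fn in milliseconds.

    Warm-up runs absorb one-time costs (lazy initialization, caches).
    The median is less sensitive to occasional slow outliers than the mean.
    For real GPU kernels the callable must also wait for the device to finish
    (for example via a device synchronization) before the clock is read,
    because kernel launches are asynchronous.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)
```

Even with this procedure, two jobs sharing one GPU can still distort each other's timings, which is the motivation for exclusive allocation in the GPU sandbox pool.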

This is a production-grade lesson: if you want to train agents on real tools, the environment must make the reward trustworthy.

Component 3: stable long-context agentic RL

CUDA Agent is trained from Seed1.6, a Mixture-of-Experts model with 23B active and 230B total parameters.

The training setup is large-scale: PPO optimization, global batch size 1024, 150 training steps, 32k context for single-turn RL, 128k context for agentic RL, up to 150 turns during training rollouts, and up to 200 turns during evaluation.

The authors highlight multi-stage warm-up as essential:

Rejection Sampling Fine-Tuning gives the actor a strong behavioral prior.
Value pretraining stabilizes the critic.
A robust milestone-based reward schedule avoids directly chasing noisy raw speed-up ratios.
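To see why milestones are more robust than raw ratios, consider a discrete reward that only changes when a kernel clearly crosses a baseline. The levels and the 5% margin below are illustrative assumptions, and `milestone_reward` is a hypothetical function; the paper's exact schedule may differ.

```python
def milestone_reward(correct: bool, candidate_ms: float,
                     eager_ms: float, compile_ms: float,
                     margin: float = 0.05) -> float:
    """Discrete reward with assumed levels:
    -1 incorrect, 1 correct, 2 clearly beats eager, 3 also clearly beats torch.compile.
    """
    if not correct:
        return -1.0
    reward = 1.0
    if candidate_ms < eager_ms * (1.0 - margin):
        reward = 2.0
        if candidate_ms < compile_ms * (1.0 - margin):
            reward = 3.0
    return reward


# Small timing jitter does not move the reward once a milestone is cleared.
print(milestone_reward(True, 0.50, eager_ms=2.0, compile_ms=1.0))  # 3.0
print(milestone_reward(True, 0.52, eager_ms=2.0, compile_ms=1.0))  # 3.0
```

A raw speed-up ratio would assign different rewards to those two runs even though the difference is within measurement noise, and it would let a few outlier tasks with very large ratios dominate the gradient.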

The ablations are the most convincing part of this story. Removing the agent loop, robust reward, RFT, or value pretraining substantially hurts optimization performance. Without RFT or value pretraining, training becomes unstable and can collapse.

For agent builders, the lesson is not “just run PPO.” It is that long-horizon tool-using RL needs careful initialization, reward design, and environment control.

Results on KernelBench

CUDA Agent is evaluated on KernelBench Levels 1–3, totaling 250 operator tasks.

Overall, CUDA Agent reports:

98.8% pass rate;
98.4% faster rate vs PyTorch eager;
96.8% faster rate vs torch.compile;
2.60× geometric mean speed-up vs eager;
2.11× geometric mean speed-up vs torch.compile.
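For readers unfamiliar with these metrics, the sketch below shows one plausible way to compute them against a single baseline. It assumes that "faster rate" means the fraction of all tasks where the kernel is both correct and faster than the baseline, and that the geometric mean is taken over correct kernels only; the paper's exact conventions may differ, and `summarize` is a hypothetical helper.

```python
import math
from typing import Dict, List


def summarize(results: List[Dict]) -> Dict[str, float]:
    """Each result has keys: correct (bool), candidate_ms, baseline_ms (floats)."""
    total = len(results)
    passed = [r for r in results if r["correct"]]
    faster = [r for r in passed if r["candidate_ms"] < r["baseline_ms"]]
    speedups = [r["baseline_ms"] / r["candidate_ms"] for r in passed]
    # Geometric mean = exp(mean(log(x))); it treats 2x and 0.5x symmetrically.
    geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups)) if speedups else 0.0
    return {
        "pass_rate": len(passed) / total,
        "faster_rate": len(faster) / total,
        "geo_mean_speedup": geo_mean,
    }


demo = [
    {"correct": True, "candidate_ms": 1.0, "baseline_ms": 2.0},   # 2x
    {"correct": True, "candidate_ms": 1.0, "baseline_ms": 8.0},   # 8x
    {"correct": False, "candidate_ms": 1.0, "baseline_ms": 1.0},  # failed
]
print(summarize(demo))
```

The geometric mean matters because speed-ups are ratios: a 2× and an 8× result average to 4×, not 5×, so a single large win cannot inflate the headline number as much as it would under an arithmetic mean.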

The comparison against frontier coding models is striking. Claude Opus 4.5 reaches a 95.2% pass rate but only a 66.4% faster rate against torch.compile. Gemini 3 Pro reaches a 91.2% pass rate and a 69.6% faster rate. CUDA Agent reaches a 96.8% faster rate against torch.compile overall.

The strongest result is Level 2, operator sequences: CUDA Agent achieves 100% faster rate vs torch.compile and 2.80× speed-up. This is exactly where learned iterative optimization can be valuable, because rule-based compiler heuristics may struggle with non-trivial fusion patterns.

On Level 3, the hardest split, CUDA Agent reports a 94.0% pass rate, a 90.0% faster rate vs torch.compile, and a 1.52× speed-up vs torch.compile.

The key point is not only that the model writes correct kernels. It learns to produce kernels that are usually faster than a strong compiler baseline in this benchmark setup.

What this suggests for agent design

CUDA Agent is interesting beyond CUDA.

It shows a pattern for training specialist agents:

Create many realistic tasks from reference implementations.
Put the model in the same tool environment it will use at inference time.
Make correctness and performance measurable.
Design rewards that resist hacking and reflect real goals.
Use staged training to avoid long-horizon RL collapse.

This pattern applies to many domains.

A database agent could optimize SQL queries using query plans and latency measurements. A DevOps agent could learn from deployment checks and incident recovery signals. A coding agent could learn from test suites, coverage, benchmarks, and static analysis. A research agent could learn from verification and replication results.

The broader principle is:

Agents learn skills more reliably when the environment turns expert feedback into structured, trustworthy reward.

Prompting can demonstrate a workflow. Skills can document a workflow. But training inside the workflow can change the model’s actual policy.

Caveats

The paper is careful about two important limitations.

First, CUDA Agent is not compared against more sophisticated compiler frameworks such as TVM. torch.compile is widely used and convenient for large-scale RL training, but it is not the entire compiler optimization universe.

Second, the system is expensive. A large GPU pool, process-level isolation, stable profiling, long-context rollouts, and multi-stage RL training are serious engineering investments. This is not a weekend reproduction project.

There are also deployment questions beyond the benchmark: hardware portability, shape distributions, maintainability, security review, regression testing, and integration with real ML frameworks.

So the right conclusion is not that CUDA engineers are obsolete.

The right conclusion is that agentic RL can teach a model a real systems-optimization workflow when the environment is sufficiently well designed.

Builder takeaway

CUDA Agent is a strong example of moving from passive code generation to active systems optimization.

A normal coding model predicts code. CUDA Agent learns a loop: generate, compile, verify, profile, diagnose, refine.

That loop is the real skill.

For agent builders, the paper is a reminder that capability does not live only in model weights. It also lives in the environment, the tools, the measurement system, the reward function, and the training curriculum.

If we want agents that can improve performance-critical software, we should stop treating execution feedback as a prompt accessory. We should treat it as the training substrate.

Reference

Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou, “CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation,” arXiv:2602.24286v1, 2026. https://arxiv.org/abs/2602.24286