RLVR GPU training costs, benchmarks, and pricing
This note connects RLVR training costs to measured baselines and published runs. RLVR (reinforcement learning from verifiable rewards) uses checkable signals—rules, parsers, tests—not learned human-preference reward models.39 When rollouts stay short, wall time looks like ordinary RLHF-style loops; when long chain-of-thought dominates (8K–32K tokens), generation often eats 70–90% of wall time and training can land ~5–10× slower than plain SFT.34 The sections below use 7B short-rollout throughput and cloud list prices to illustrate $0.29–$1.11 per million tokens at that scale on committed and on-demand rates (marketplace spot dips to ~$0.15; MI300X leads on raw $/throughput in the tables), then extrapolate as rollouts and model size grow. Example run costs range from small open demos to large-cluster GRPO.56742
Interactive cost & time estimator
Adjust model, algorithm, rollout length, and token target to compare H100, H200, and MI300X. Figures use measured 7B GRPO/PPO baselines where available, rollout-length penalties from the long-form note below, and April 2026 list pricing. Estimates are illustrative—see confidence notes under the charts.
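For readers who want the estimator's arithmetic in code, here is a minimal sketch under stated assumptions: the measured 7B short-rollout baselines from the tables below, an assumed H200 rate (no public RLVR throughput row exists for H200), and an uncalibrated long-rollout derating. Function and constant names are illustrative, not the estimator's actual implementation.

```python
# Minimal cost/time estimator sketch. Throughputs: measured 7B GRPO rows
# for H100/MI300X; the H200 figure is an ASSUMPTION (no public RLVR row).
BASE_TOK_PER_GPU_S = {"H100": 1_544, "H200": 1_700, "MI300X": 1_748}
PRICE_PER_GPU_HR = {"H100": 2.76, "H200": 3.44, "MI300X": 1.99}  # list rates

def estimate(gpu: str, total_tokens: float, rollout_len: int = 1024) -> dict:
    """Rough GPU-hours and cost to process `total_tokens` during RL training."""
    tok_s = BASE_TOK_PER_GPU_S[gpu]
    # Beyond the 1K benchmark regime, KV pressure can shrink batches; this
    # flat 0.7 derating is an uncalibrated placeholder -- re-measure.
    if rollout_len > 1024:
        tok_s *= 0.7
    gpu_hours = total_tokens / (tok_s * 3600)
    return {"gpu_hours": gpu_hours, "cost_usd": gpu_hours * PRICE_PER_GPU_HR[gpu]}

print(estimate("MI300X", 1e9))  # ~159 GPU-hrs, ~$316 at 7B short rollouts
```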
Measured GRPO throughput (ROCm + veRL; short GSM8K rollouts)
The AMD ROCm walkthrough (April 2025) reports GRPO and PPO on GSM8K with veRL v0.3.0 on MI300X and H100—i.e. RLHF with rule-based rewards on math, which is the standard pattern when response lengths are short (512–1024 tokens in their table).1 Those runs give a clean GPU and algorithm comparison (GRPO 1.7–2.5× faster than PPO in the same harness) before we extrapolate to regimes where long rollouts dominate cost.12
GSM8K in this stack: GSM8K is Cobbe et al.’s grade-school math word problems (arXiv:2110.14168). veRL’s GSM8K example scores completions with a simple rule (parse the answer after `####`, compare to the label)—a verifiable reward that fits short generations and aligns with how many math RL pipelines are built; veRL’s docs describe that agent under the broad RLHF umbrella.38 The ROCm numbers are measured system throughputs for that recipe; the rest of the note scales from them toward long-CoT RLVR where decoding dominates.
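As a concrete illustration, a rule-based GSM8K reward in that spirit looks like the sketch below — a simplified stand-in, not veRL's exact scoring code.

```python
import re

def gsm8k_reward(completion: str, label: str) -> float:
    """Verifiable reward: parse the final answer after '####' and compare to
    the gold label. Simplified sketch of the GSM8K recipe's scoring rule."""
    match = re.search(r"####\s*(-?[\d,\.]+)", completion)
    if match is None:
        return 0.0  # no parseable final answer
    answer = match.group(1).replace(",", "").rstrip(".")
    return 1.0 if answer == label else 0.0

assert gsm8k_reward("... so the total is 18.\n#### 18", "18") == 1.0
assert gsm8k_reward("no final answer given", "18") == 0.0
```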
Additional related measurements come from Yotta Labs on MI300X + veRL8 and the OpenRLHF paper.2 This 7B measured slice is still where most public tok/GPU/s tables live—few comparable public rows exist for 14B, 32B, or 70B in the same reporting format.
| GPU | Model | Algorithm | TP | Tokens/GPU/sec | Response Length | Framework | Source |
|---|---|---|---|---|---|---|---|
| H100 SXM | Qwen2-7B | GRPO | 2 | 1,544 | 1,024 | veRL v0.3 | AMD ROCm Blog 1 |
| MI300X | Qwen2-7B | GRPO | 2 | 1,748 | 1,024 | veRL v0.3 | AMD ROCm Blog 1 |
| H100 SXM | DeepSeek-7B | GRPO | 2 | 1,624 | 1,024 | veRL v0.3 | AMD ROCm Blog 1 |
| MI300X | DeepSeek-7B | GRPO | 2 | 1,899 | 1,024 | veRL v0.3 | AMD ROCm Blog 1 |
| H100 SXM | Qwen2-7B | PPO | 2 | 907 | 512 | veRL v0.3 | AMD ROCm Blog 1 |
| MI300X | Qwen2-7B | PPO | 2 | 921 | 512 | veRL v0.3 | AMD ROCm Blog 1 |
| H100 SXM | DeepSeek-7B | PPO | 4 | 624 | 512 | veRL v0.3 | AMD ROCm Blog 1 |
| MI300X | DeepSeek-7B | PPO | 4 | 767 | 512 | veRL v0.3 | AMD ROCm Blog 1 |
These benchmarks used 8 GPUs per node with train_batch_size=1024.1 Two critical patterns emerge: GRPO achieves 1.7–2.5× higher throughput than PPO because it eliminates the critic model entirely,9 and MI300X outperforms H100 by 1–23% across all configurations, likely due to its larger 192GB HBM3 memory enabling more efficient batching.18 Yotta Labs separately confirmed MI300X efficiency with veRL v0.5.0, achieving 14.01% training MFU for 7B GRPO with optimal TP=1 on a single MI300X, and near-linear scaling across data parallelism dimensions.8
The OpenRLHF paper provides a framework-level comparison: OpenRLHF completes one GSM8K GRPO epoch in 1,657 seconds versus TRL's 5,189 seconds — a 3.1× speedup — on identical hardware and hyperparameters.2
Estimated throughput across model sizes requires significant extrapolation
Published RLVR throughput data is almost exclusively at 7B scale. The estimates below combine the measured 7B baselines with a rough, non-cited vLLM-style inference scaling heuristic (7B-class vs 32B-class decode often differs by roughly 5× on a single GPU in community reports; re-measure for your stack),40 the GPTOSS-20B MoE training benchmark (~500–598 tok/s on 512 H800 GPUs with veRL+Megatron),10 and the OLMo 3 32B RL training disclosure showing inference-to-training compute ratios of 5–14×.11
| Model Size | H100 SXM (tok/GPU/s) | MI300X (tok/GPU/s) | A100 80GB (tok/GPU/s) | Confidence | Basis |
|---|---|---|---|---|---|
| 7B | 1,544–1,624 | 1,748–1,899 | ~900–1,100 | Measured | AMD ROCm Blog 1 (veRL v0.3, GRPO, 1K resp.) |
| 14B | ~700–950 | ~800–1,100 | ~400–600 | Estimated | ~0.5–0.6× of 7B; heuristic scaling 40 |
| 32B | ~300–450 | ~350–520 | ~150–250 | Estimated | ~5× drop from 7B (heuristic 40); GPTOSS-20B MoE ~500–598 tok/s on 512 GPUs 10 |
| 70B | ~120–200 | ~140–240 | ~60–100 | Estimated | Requires multi-GPU TP; extrapolated from 32B |
| 235B-A22B (MoE) | ~200–350 | ~230–400 | N/A | Estimated | MoE with ~22B active params ≈ 32B-scale compute |
| 671B-A37B (MoE) | ~100–200 | ~120–250 | N/A | Estimated | Requires 96+ GPUs; EP+TP+PP parallelism 7 |
Critical caveat: These estimates assume short (1K) response lengths matching the benchmark conditions. At RLVR-typical response lengths of 8K–32K tokens, per-token throughput stays roughly flat (each step simply emits more tokens), but each step takes 8–32× longer to complete because autoregressive generation scales linearly with output length. KV cache pressure at longer contexts can also force batch size reductions, further degrading effective throughput. The QeRL paper notes that typical RL training for reasoning models takes 20–100 hours on 8× H100 GPUs — suggesting effective throughput is significantly lower than these per-token rates imply when accounting for real-world RLVR conditions.12
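To make the step-time claim concrete, here is the back-of-envelope arithmetic under assumed settings: the measured 7B H100 GRPO rate, 8 GPUs, and 1,024 prompts with G=8 completions per step (the group size is an assumption for illustration, not from the benchmark).

```python
# Why long rollouts stretch step time even at a constant per-token rate.
tok_per_gpu_s = 1_544          # measured 7B GRPO rate on H100
gpus = 8
completions = 1_024 * 8        # 1,024 prompts x G=8 rollouts (assumed)

for resp_len in (1_024, 8_192, 32_768):
    step_tokens = completions * resp_len
    step_seconds = step_tokens / (tok_per_gpu_s * gpus)
    print(f"{resp_len:>6} tok/response -> {step_seconds / 60:6.1f} min/step")
# ->   1024 tok/response ->   11.3 min/step
# ->   8192 tok/response ->   90.6 min/step  (8x longer)
# ->  32768 tok/response ->  362.5 min/step  (32x longer)
```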
Wall-clock training times from published RLVR runs
The table below compiles selected RLVR-style training runs with disclosed compute details. Dollar costs use illustrative $/GPU-hour assumptions where sources do not publish invoices (see 42). Costs span from ~$2.62 (a third-party README figure) to ~$200,000 (an order-of-magnitude estimate), driven primarily by model scale and rollout length.
| Run | Model | Size | Algorithm | GPUs | Wall-Clock | GPU-Hours | Est. Cost | Framework |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Zero 67 | DeepSeek-V3-Base | 671B MoE | GRPO | 512×H800 | ~198 hrs | ~101K | ~$200K | Custom |
| DeepSeek-R1 (RL stage) 67 | DeepSeek-V3-Base | 671B MoE | GRPO | 512×H800 | ~80 hrs | ~41K | ~$82K | Custom |
| ProRL-1.5B-v2 13 | — | 1.5B | RLVR | H100s | — | >20K | ~$60K+ | OpenRLHF |
| DAPO 14 | Qwen2.5-32B | 32B | DAPO | 128×H20 | Several days | Not disclosed | — | veRL |
| SimpleRL-Zoo (32B) 15 | Qwen2.5-32B | 32B | GRPO | 64×H100 | ~36 hrs | ~2,300 | ~$4,600 | veRL |
| DeepScaleR 16 | R1-Distill-Qwen-1.5B | 1.5B | GRPO | 8–32×A100 | Multi-stage | 3,800 | ~$4,500 | veRL |
| SimpleRL-Zoo (7B) 15 | Qwen2.5-7B | 7B | GRPO | 16×H100 | ~15 hrs | ~240 | ~$500 | veRL |
| Dr. GRPO 17 | Qwen2.5-Math-7B | 7B | Dr. GRPO | 8×A100 | ~27 hrs | ~216 | ~$430 | Oat |
| Open-R1 18 | Qwen2.5-Math-7B | 7B | GRPO | 8×H100 | ~3 hrs | 24 | ~$72 | TRL |
| Mini-R1 18 | Qwen2.5-3B-Instruct | 3B | GRPO | 4×H100 | ~6 hrs | 24 | ~$72 | TRL |
| microR1 18 | Qwen2.5-3B-Instruct | 3B | GRPO | 8×A100 | ~3 hrs | 24 | ~$44 | Pure PyTorch |
| TinyZero 9 | Qwen2.5-3B | 3B | PPO | 2×H200 | <5 hrs | <10 | <$30 | veRL |
| R1-V 5 | Qwen2-VL-2B | 2B VLM | GRPO | 8×A100 | 30 min | 4 | $2.62 | TRL |
DeepSeek-R1-Zero represents the largest published RLVR run: 512 H800 GPUs for 198 hours training the 671B MoE model with GRPO.67 The full DeepSeek-R1 pipeline (including V3 pre-training) consumed 2.788 million H800 GPU-hours at an estimated $5.58M (same order-of-magnitude $/GPU-hr caveat as 42).7 At the other extreme, community GRPO-on-VLM writeups report ~30 minutes on 8 A100s with a README-derived total of ~$2.62 (GPU-hours × quoted rates)—use as an illustration, not a peer-reviewed benchmark.5 DeepScaleR's iterative context-length scaling approach (8K→16K→24K) reduced estimated compute from ~70K to just 3,800 A100-hours — demonstrating that curriculum-based context scaling is a critical cost optimization for RLVR.16
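The Est. Cost column is GPU-hours multiplied by an illustrative rate (see 42); the sketch below reproduces a few rows, with each $/GPU-hr figure a labeled assumption rather than a disclosed invoice.

```python
# Est. Cost column reproduced: GPU-hours x assumed $/GPU-hr (note 42).
runs = {                                  # (GPUs x hours, assumed $/GPU-hr)
    "DeepSeek-R1-Zero":       (512 * 198, 2.00),   # H800 at ~$2/hr
    "DeepSeek-R1 (RL stage)": (512 * 80,  2.00),
    "SimpleRL-Zoo (32B)":     (64 * 36,   2.00),   # H100 at ~$2/hr (assumed)
    "Open-R1":                (8 * 3,     3.00),   # H100 at ~$3/hr (assumed)
}
for name, (gpu_hours, rate) in runs.items():
    print(f"{name:<24} {gpu_hours:>7,} GPU-hrs  ~${gpu_hours * rate:,.0f}")
# DeepSeek-R1-Zero         101,376 GPU-hrs  ~$202,752   (the table's ~$200K)
```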
RLVR is faster than RLHF but far slower than SFT
GRPO eliminates the critic model that PPO requires,9 and practitioners often cite ~40–50% memory savings versus four-model PPO+RM stacks—treat that band as a heuristic, not a controlled measurement in this note. The ROCm + veRL GRPO vs PPO comparison is measured: GRPO is 1.7–2.5× faster than PPO under the same settings.1 Traditional PPO-based RLHF requires four large models in GPU memory simultaneously (policy, reference, reward model, critic), while GRPO needs only two (policy and reference).9 DAPO goes further by removing the KL penalty entirely, eliminating even the reference model.14
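A weight-only back-of-envelope shows where that band comes from; this counts BF16 weights only (optimizer state, gradients, activations, and KV cache excluded), so it illustrates the resident-model effect rather than real totals.

```python
# Weight-only memory for resident models at 7B scale (BF16 = 2 bytes/param).
def weight_gb(params_b: float, n_models: int, bytes_per_param: int = 2) -> float:
    return params_b * n_models * bytes_per_param  # 1e9 params x bytes -> GB

ppo_stack  = weight_gb(7, 4)  # policy + reference + reward model + critic
grpo_stack = weight_gb(7, 2)  # policy + reference only
print(f"PPO-style: {ppo_stack:.0f} GB weights, GRPO: {grpo_stack:.0f} GB "
      f"({1 - grpo_stack / ppo_stack:.0%} less)")
# -> PPO-style: 56 GB weights, GRPO: 28 GB (50% less)
```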
The throughput hierarchy is clear from measured data. GRPO on a 7B model achieves 1,544 tok/GPU/s versus PPO's 907 tok/GPU/s on an H100 — a 1.70× improvement.1 On MI300X, GRPO reaches 1,748 tok/GPU/s versus PPO's 921 tok/GPU/s (1.90×).1 One non-peer-reviewed Substack overview claims GRPO can reduce overall training cost to roughly 1/18 of some traditional PPO-style RL setups when memory savings enable larger batches—use as anecdotal context, not a universal ratio.9
However, RLVR's rule-based verification advantage (no neural reward model forward pass) is substantially offset by its long rollouts. The DC-SFT paper reports that in their VLM setting, SFT achieved about 4.9× higher training efficiency than GRPO (see their tables for exact comparisons—do not read this as a universal constant across tasks).19 Multiple sources confirm the 5–10× slowdown of online RL versus offline SFT, driven by autoregressive rollout generation consuming 70–90% of RL training time.3420 A single RLVR training step can take minutes to over an hour depending on model size and rollout length, compared to seconds for SFT. The OLMo 3 32B reasoner allocated 20 H100 nodes for inference alongside 8 for training — inference consumed 5–14× more compute than policy updates, with learner GPUs idle 75% of the time waiting for rollout data.11
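That 70–90% generation share also bounds what update-side optimization can buy; an Amdahl-style check:

```python
# If rollout generation takes fraction f of wall time, making policy updates
# infinitely fast yields at most a 1/f end-to-end speedup.
for f in (0.70, 0.80, 0.90):
    print(f"rollout fraction {f:.0%}: max speedup from faster updates = {1/f:.2f}x")
# -> 70%: 1.43x, 80%: 1.25x, 90%: 1.11x -- the big wins must come from
#    accelerating or overlapping generation itself.
```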
Framework landscape for RLVR training
Four major frameworks dominate RLVR training, each with distinct strengths. The table below reflects capabilities as of early April 2026.
| Feature | OpenRLHF v0.9.9 221 | veRL v0.7.1 22 | TRL v1.0.0 23 | NeMo RL 2425 |
|---|---|---|---|---|
| GRPO | ✅ | ✅ | ✅ | ✅ |
| DAPO | ✅ (via flags) | ✅ (reference impl.) | ❌ | ✅ |
| Dr. GRPO | ✅ | ✅ | ❌ | ❌ |
| REINFORCE++ | ✅ (native, recommended) | ✅ | ❌ | ❌ |
| PRIME | ❌ | ✅ | ❌ | ❌ |
| Max proven scale | 70B+ dense | 671B MoE | ~72B (with DeepSpeed) | 340B+ |
| Inference backend | vLLM | vLLM + SGLang | vLLM | Megatron native |
| Training backend | DeepSpeed ZeRO-3 | FSDP/Megatron | HF Accelerate | Megatron Core |
| Async training | ✅ (--async_train) | ✅ (1-step off-policy) | Experimental | ✅ |
| FP8 end-to-end | ❌ | ❌ | ❌ | ✅ |
| Ease of use | Medium | Medium-High | Highest | Low |
OpenRLHF (widely used open-source RLHF/RLVR stack; EMNLP 2025) is the throughput leader for dense models up to 70B in the OpenRLHF team’s own comparisons, achieving 1.22–1.68× speedup over veRL in long-CoT RLVR settings across 1.5B–14B models with 1K–8K generation lengths.2 Its Ray+vLLM+DeepSpeed architecture is battle-tested by Google, ByteDance, NVIDIA, and Tencent.21 The team explicitly recommends REINFORCE++-baseline for RLVR tasks due to its robustness across reward patterns.2
veRL (EuroSys 2025; large public footprint on GitHub) has broad algorithm support and large proven scale — the DAPO paper's results were produced using veRL,14 and it has been validated on DeepSeek-V3 671B and Qwen3-235B MoE models.22 Its Megatron backend enables expert parallelism essential for MoE training. Note that OpenRLHF's published speed advantage was measured against veRL v0.4.0; veRL has since released significant optimizations through v0.7.1 that may have closed this gap.222
TRL (Hugging Face; frequent releases) prioritizes accessibility over raw throughput.23 Its GRPOTrainer requires minimal code and integrates natively with Hugging Face's ecosystem. It is ~3.1× slower than OpenRLHF on GRPO benchmarks,2 making it best suited for prototyping and smaller-scale training. The Open-R1 and Mini-R1 reproduction projects both use TRL.18
NeMo RL (successor to deprecated NeMo-Aligner) is NVIDIA's enterprise offering, uniquely supporting end-to-end FP8 training and Megatron Core's full 3D parallelism suite.24 It trained Nemotron 3 Nano with GRPO across multiple environments simultaneously, using up to 49K-token generation lengths.25
Notable emerging frameworks include AReaL (Ant Group/Tsinghua), achieving 2.77× speedup via fully asynchronous RL,20 and ROLL Flash (Alibaba), reaching 2.24× speedup on RLVR through queue scheduling and rollout-train decoupling.3
Cloud GPU pricing for RLVR workloads in April 2026
RLVR training demands multi-GPU clusters with high-bandwidth interconnect (NVLink intra-node, InfiniBand inter-node). Prices below are per GPU per hour, transcribed from provider websites and aggregators in March–April 2026 where footnoted.2627282930 Rows without an inline footnote marker (e.g. some prepaid or bare-metal SKUs) should be spot-checked on the vendor site—live quotes move weekly.41
H100 SXM 80GB
| Provider | On-Demand | Spot/Community | Reserved | InfiniBand |
|---|---|---|---|---|
| Vast.ai | ~$1.54 | ~$1.50+ | Available | Varies 29 |
| RunPod (Community) | ~$1.99–$2.49 | — | Enterprise | NVLink 26 |
| FluidStack | ~$2.10 | — | Custom | ✅ |
| Vultr (36-mo) | $2.30 | — | 36-mo prepaid | NVLink |
| Lambda (1-Click) | $2.76 | — | 1–3 yr custom | ✅ 31 |
| RunPod (Secure) | ~$2.69–$2.99 | — | Enterprise | NVLink 26 |
| DigitalOcean (8×) | $2.99 | — | $1.99 (committed) | NVLink 28 |
| Crusoe | $3.90 | Contact sales | Custom | ✅ |
| Lambda (Instance) | $3.99–$4.29 | — | Custom | NVLink 31 |
| CoreWeave | $6.16 | — | Up to 60% off | ✅ 27 |
H200 SXM 141GB
| Provider | On-Demand | Reserved | Notes |
|---|---|---|---|
| Vast.ai | ~$1.50–$4.00 | Available | Marketplace, variable 29 |
| DigitalOcean | $3.44 | Multi-month discount | NVLink HGX 28 |
| Crusoe | $4.29 | Custom | InfiniBand |
| CoreWeave | $6.31 | Up to 60% off | InfiniBand 27 |
MI300X 192GB
| Provider | On-Demand | Reserved | Notes |
|---|---|---|---|
| Vast.ai | ~$0.95–$3.12 | Available | Marketplace spot from $0.95 29 |
| Vultr | $1.85 (preempt.) | $1.85 (24-mo) | 24-month prepaid |
| TensorWave | $2.25 | $1.71 (dedicated) | Bare metal, Infinity Fabric |
| DigitalOcean | $1.99 | Multi-month | Infinity Fabric 28 |
| Crusoe | $3.45 | Custom | Infinity Fabric |
A100 80GB SXM
| Provider | On-Demand | Spot | Notes |
|---|---|---|---|
| Vast.ai | ~$0.50–$2.00 | From ~$0.50 | Marketplace 29 |
| RunPod (Community) | ~$0.89 | — | Per-second billing 26 |
| Crusoe | $1.95 | $1.30 | Clean energy |
| RunPod (Secure) | ~$1.89 | — | SOC2 26 |
| CoreWeave | $2.70 | — | InfiniBand 27 |
| Lambda | $2.79 | — | 8× nodes 31 |
For serious RLVR training at scale (multi-node with InfiniBand), Lambda 1-Click Clusters at $2.76/GPU/hr for H100 and DigitalOcean at $1.99/GPU/hr for MI300X represent the strongest price-performance combinations with production-grade interconnect.2831 Vast.ai offers the lowest absolute prices but lacks guaranteed InfiniBand connectivity.29
Cost per million tokens processed during RLVR training
The following calculations combine measured GRPO throughput at 7B scale (1K response length) with current GPU pricing.1 These represent best-case costs — real-world RLVR with longer rollouts will have higher per-step costs (though similar per-token costs if memory permits maintaining batch size).
H100 SXM — 7B GRPO (1,544 tok/GPU/s = 5.56M tok/GPU/hr) 1
| Provider | $/GPU/hr | $/M Tokens | Notes |
|---|---|---|---|
| Vast.ai | $1.54 | $0.28 | Marketplace; reliability varies 29 |
| RunPod (Community) | $1.99 | $0.36 | Per-second billing 26 |
| Lambda (1-Click) | $2.76 | $0.50 | InfiniBand clusters 31 |
| DigitalOcean (8×) | $2.99 | $0.54 | NVLink HGX 28 |
| Crusoe | $3.90 | $0.70 | Clean energy |
| Lambda (Instance) | $3.99 | $0.72 | Self-serve 31 |
| CoreWeave | $6.16 | $1.11 | Enterprise; reserved ~$2.46→$0.44 27 |
MI300X — 7B GRPO (1,748 tok/GPU/s = 6.29M tok/GPU/hr) 1
| Provider | $/GPU/hr | $/M Tokens | Notes |
|---|---|---|---|
| Vast.ai (spot) | $0.95 | $0.15 | Marketplace; lowest possible 29 |
| Vultr (24-mo) | $1.85 | $0.29 | Prepaid commitment |
| DigitalOcean | $1.99 | $0.32 | Best reliable value 28 |
| TensorWave | $2.25 | $0.36 | Bare metal, dedicated |
| Crusoe | $3.45 | $0.55 | InfiniBand |
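Each $/M-token cell in these tables is a single division of price by hourly token volume; a sketch reproducing two rows:

```python
# $/M tokens = ($/GPU-hr) / (millions of tokens processed per GPU-hour).
def dollars_per_m_tokens(price_per_gpu_hr: float, tok_per_gpu_s: float) -> float:
    return price_per_gpu_hr / (tok_per_gpu_s * 3600 / 1e6)

print(f"{dollars_per_m_tokens(2.76, 1_544):.2f}")  # H100, Lambda 1-Click -> 0.50
print(f"{dollars_per_m_tokens(1.99, 1_748):.2f}")  # MI300X, DigitalOcean -> 0.32
```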
Estimated cost scaling by model size (H100, Lambda 1-Click ~$2.76/hr) 31
| Model Size | Est. tok/GPU/s | M tok/GPU/hr | $/M Tokens | Est. Cost for 1B Token Run |
|---|---|---|---|---|
| 7B | 1,544 (measured 1) | 5.56 | $0.50 | $500 |
| 14B | ~800 (est.) | ~2.88 | ~$0.96 | $960 |
| 32B | ~375 (est.) | ~1.35 | ~$2.04 | $2,040 |
| 70B | ~160 (est.) | ~0.58 | ~$4.76 | $4,760 |
MI300X provides a 30–45% cost advantage over H100 for RLVR training at current market rates, combining higher throughput (+13% for 7B GRPO 1) with lower pricing (DigitalOcean MI300X at $1.99 28 vs Lambda H100 at $2.76 31). veRL documents strong MI300X support (ROCm integration and tuning notes).8 For other frameworks, confirm AMD paths in upstream docs before committing to a stack.
Long rollouts dominate RLVR cost structure
The defining characteristic of RLVR versus traditional RLHF is the length of generated rollouts. While RLHF typically generates 512–2048 token responses, RLVR reasoning chains routinely reach 8K–32K tokens, with production systems like DAPO using max_response_length of 20,480 tokens14 and NeMo RL's Nemotron training reaching 49K tokens.25 Production RLVR workload characterization (PolyTrace) shows math reasoning tasks averaging ~9,839 output tokens per sample.4
Rollout generation consumes 70–90% of total RLVR training time.34 This means the bottleneck is autoregressive decoding — a memory-bandwidth-bound operation that cannot be trivially accelerated by adding more compute. Each GRPO step generates G completions per prompt (typically G=8–64; DeepSeek-R1 used G=64 6), multiplying the generation burden. For a concrete example from the HuggingFace "Keep the Tokens Flowing" analysis:20 generating 512 rollouts at 8K tokens for a 32B model on 8 H100 inference GPUs takes approximately 7 minutes for generation alone, before any gradient computation.
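A quick sanity check on that example's implied decode rate (illustrative arithmetic only):

```python
# 512 rollouts x 8K tokens on 8 inference GPUs in ~7 minutes implies:
total_tokens = 512 * 8_192
gpu_seconds = 8 * 7 * 60
print(f"~{total_tokens / gpu_seconds:,.0f} tok/GPU/s effective decode rate")
# -> ~1,248 tok/GPU/s for a 32B model at 8K context -- below even the
#    short-rollout 7B measured rates, before any gradient computation.
```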
The long-tail distribution of rollout lengths creates severe GPU idling. ROLL Flash found that the longest responses can exceed the median by over 20×, meaning most GPUs finish early and wait for stragglers in synchronous systems.3 The OLMo 3 32B reasoner's learner GPUs spent 75% of time idle waiting for inference data.11 This has spawned several optimization approaches:
- Asynchronous training (ROLL Flash,3 AReaL,20 PipelineRL) decouples generation from training, achieving 2.0–2.8× speedups by overlapping these phases
- Dynamic sampling (DAPO 14) filters prompts where all responses are correct or all wrong, avoiding wasted computation on zero-gradient batches — achieving the same performance with 1/3 the training steps (see the sketch after this list)
- Overlong reward shaping (DAPO 14) applies soft penalties for responses exceeding a threshold rather than hard truncation, preventing training instability
- Token-level policy gradient loss (DAPO,14 veRL 22) averages loss across total tokens rather than per-sample-then-per-batch, preventing gradient dilution for long sequences
- Dr. GRPO's debiased advantage 17 removes variance normalization and length divisors from GRPO, eliminating bias toward shorter responses and providing unbiased policy gradients
- Clip-Higher (DAPO 14) uses asymmetric clipping (ε_low=0.2, ε_high=0.28) to preserve exploration by allowing more room for increasing low-probability tokens, combating entropy collapse; related asymmetric clipping ideas appear in VAPO.37
- NAT (Not All Tokens are Needed) 33 performs policy optimization on only ~50% of tokens from each rollout while computing rewards on full responses, reducing activation memory
- FP8 precision (JetRL 34) achieves 1.07–1.33× rollout speedup, but naive mixed-precision (BF16 training + FP8 rollout) fails catastrophically at context lengths beyond 8K due to numerical precision mismatches
- Iterative context scaling (DeepScaleR 16) trains at 8K→16K→24K progressively, reducing total compute by ~18× versus training at maximum length from the start
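A minimal sketch of the dynamic-sampling filter referenced in the list above — an assumption-level illustration of the idea, not DAPO's implementation:

```python
# DAPO-style dynamic sampling: drop prompt groups whose G rewards all agree,
# since identical rewards give zero group-relative advantage (zero gradient).
def keep_prompt(rewards: list[float]) -> bool:
    """Keep a prompt only if its rollout rewards disagree."""
    return max(rewards) != min(rewards)

batch = {
    "p1": [1.0] * 8,                                  # all correct: no signal
    "p2": [0.0] * 8,                                  # all wrong:   no signal
    "p3": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],   # mixed: keep
}
kept = [p for p, r in batch.items() if keep_prompt(r)]
print(kept)  # -> ['p3']
```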
KV cache memory scales linearly with sequence length. For a Qwen-32B model, hosting with full 131K context increases memory requirements from ~70GB to approximately 400GB. Frameworks address this through dynamic memory management (veRL's `free_cache_engine=True` offloads KV cache after rollout generation 35) and PagedAttention (vLLM) for non-contiguous memory allocation. Prefix caching is particularly valuable for GRPO since all G completions per prompt share the same prompt prefix — SGLang's RadixAttention provides 3–5× cache hit improvement in this multi-completion scenario.36
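The linear scaling is easy to see with a back-of-envelope KV-cache formula; the 32B-class shape parameters below (64 layers, 8 KV heads via GQA, head dim 128, BF16) are assumptions for illustration.

```python
# KV cache per sequence: 2 (K and V) x layers x kv_heads x head_dim
# x seq_len x bytes_per_element. Shape parameters are assumed, not quoted.
def kv_cache_gb(seq_len: int, layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 1e9

for seq in (8_192, 32_768, 131_072):
    print(f"{seq:>7} tokens: {kv_cache_gb(seq):5.1f} GB per sequence")
# ->    8192:   2.1 GB   32768:   8.6 GB   131072:  34.4 GB -- linear in
#    length, which is why long contexts force smaller rollout batches.
```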
Data quality assessment and key caveats
Every number in this report carries a confidence level that readers should understand when making cost projections.
Measured benchmarks (high confidence): AMD ROCm publishes GRPO/PPO throughputs for veRL v0.3 on GSM8K with rule-based rewards and short completions (7B models, 8 GPUs per node)—a solid baseline for GPU and algorithm comparisons.1 Yotta Labs reports complementary MI300X + veRL numbers.8 DeepSeek-R1-Zero's training configuration (512×H800, ~198 hours) was disclosed via Stanford FMTI and Nature supplementary materials.67 OpenRLHF vs TRL timing comparisons come from the peer-reviewed EMNLP 2025 paper.2
Derived with caveats (medium confidence): Wall-clock times and GPU-hours for open-source reproductions (DeepScaleR,16 SimpleRL-Zoo,15 Dr. GRPO 17) come from GitHub READMEs, blog posts, and WandB logs — credible but not peer-reviewed. The 70–90% rollout time proportion is consistent across ROLL Flash,3 veRL,22 OpenRLHF,2 and NAT 33 papers. The RLVR vs SFT slowdown band (5–10×) combines DC-SFT’s in-paper ~4.9× VLM comparison19 with production-style reports.11
Estimated with significant uncertainty (lower confidence): Throughput estimates for 14B, 32B, and 70B models are extrapolations from 7B measured data combined with a heuristic inference scaling ratio (not a single cited benchmark row).40 Real-world throughput depends heavily on batch size, parallelism strategy, sequence length, and framework optimizations. The 14B–70B rows in the throughput and cost tables should be treated as rough order-of-magnitude guides, not precision benchmarks. Cloud GPU pricing fluctuates; spot/marketplace rates (especially Vast.ai 29) can vary by 2–3× within a single week. Un-footnoted list prices should be reverified on provider pages.41 The OpenRLHF vs veRL framework comparison was published by the OpenRLHF team 2 against an older veRL version (v0.4.0); veRL's subsequent optimizations through v0.7.1 may have changed this relationship.22
Unresolved gaps: Few public tables report long-rollout RLVR throughputs (8K–32K tokens) with the same tok/GPU/s detail as short-completion 7B baselines, and almost no RLVR throughput rows of the kind used in this note exist for H200 GPUs. No framework has published comparable benchmarks across all GPU types. Qwen/QwQ training compute has never been publicly disclosed. DAPO's total GPU-hours on 128×H20 were not reported.14 The interaction between long rollout lengths (8K–32K) and per-token throughput under GPU memory pressure lacks systematic benchmarking — current data either measures short (1K) rollouts or reports only wall-clock totals without per-token rates.
Conclusion
RLVR training costs are dominated by a single bottleneck: autoregressive rollout generation of long reasoning chains.34 The algorithmic efficiency of GRPO over PPO (1.7–2.5× measured throughput advantage,1 plus the heuristic “fewer resident models” memory story 9) is real but secondary to the 70–90% of wall-clock time spent generating 8K–32K token completions.311 The most impactful cost optimizations are therefore architectural — asynchronous training (2–2.8× speedup 320), dynamic sampling (3× step reduction 14), and iterative context scaling (18× compute reduction 16) — rather than hardware-level. At current market rates, MI300X at $1.99/GPU/hr with 13% higher GRPO throughput than H100 offers the best raw price-performance,128 though H100's broader framework support and InfiniBand availability make it the safer choice for production runs. The framework choice matters enormously: OpenRLHF's 3.1× advantage over TRL and 1.2–1.7× over veRL v0.4 on identical hardware 2 represents a larger throughput delta than any GPU generational improvement. For organizations planning RLVR training, the decision tree is: veRL or NeMo RL for MoE models above 200B,2224 OpenRLHF for dense models up to 70B where throughput is critical,2 and TRL for rapid prototyping 23 — then invest heavily in the async and dynamic sampling optimizations that cut wall-clock time by 2–3× regardless of hardware choice.
References
- AMD ROCm Blog. "Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration." April 2025. GRPO/PPO throughput on GSM8K with veRL v0.3, rule-based rewards, 7B models, 512–1024-token responses, 8 GPUs per node. https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html
- Hu, J. et al. "OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework." EMNLP 2025. https://arxiv.org/html/2501.03262v4 | https://arxiv.org/pdf/2405.11143 | https://aclanthology.org/2025.emnlp-demos.48/
- ROLL Flash Team. "Part II: ROLL Flash — Accelerating RLVR and Agentic Training with Asynchrony." arXiv:2510.11345, October 2025. https://arxiv.org/abs/2510.11345 | https://arxiv.org/html/2510.11345
- "RL in the Wild: Characterizing RLVR Training in LLM Deployment." arXiv:2509.25279. https://arxiv.org/html/2509.25279
- $2.62 / “R1-V” row: The table summarizes community README / blog claims (GPU-hours × quoted $/hr), not a paper. For the R1-VL line of work see e.g. Chen et al. "R1-VL: Advancing Multimodal Reasoning from Optimized Cold Start to Staged Reinforcement Learning." arXiv:2503.12937. https://arxiv.org/abs/2503.12937 | GitHub: https://github.com/jingyi0000/R1-VL — Related small-VLM GRPO artifacts include lmms-lab’s Qwen2-VL-2B-GRPO-8k card: https://huggingface.co/lmms-lab/Qwen2-VL-2B-GRPO-8k — Third-party cost writeup (example): PhotoAtomic. "R1-V: Witness the aha moment of VLM with less than $3." https://github.com/PhotoAtomic/deep-agent-R1-V
- Stanford CRFM. "DeepSeek Transparency Report." FMTI December 2025. https://crfm.stanford.edu/fmti/December-2025/company-reports/DeepSeek_FinalReport_FMTI2025.html
- DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437. https://arxiv.org/pdf/2412.19437
- Yotta Labs. "Performance Optimization for Reinforcement Learning on AMD GPUs." 2025. https://www.yottalabs.ai/post/performance-optimization-for-reinforcement-learning-on-amd-gpus
- Wolfe, C. "Group Relative Policy Optimization (GRPO)." Cameron Wolfe's Substack. https://cameronrwolfe.substack.com/p/grpo
- "Enabling Large Scale RLHF of GPTOSS with Megatron backend in VeRL." Hugging Face Blog. https://huggingface.co/blog/yiakwy-xpu-team/enabling-large-scale-rlhf-of-gptoss-with-megatron
- Wolfe, C. "OLMo 3 and the Open LLM Renaissance." Cameron Wolfe's Substack. https://cameronrwolfe.substack.com/p/olmo-3 | See also: Kim, T. "OLMo 3: The Architecture of 'Fully' Open Sourced Reasoning Models." https://www.terrencekim.net/2026/01/olmo-3-architecture-of-fully-open.html
- Neurohive. "QeRL: Training 32B Models on Single H100 vs Three GPUs, Beating LoRA in Accuracy." https://neurohive.io/en/state-of-the-art/qerl-2/
- OpenRLHF GitHub. ProRL-1.5B-v2. https://github.com/OpenRLHF/OpenRLHF
- DAPO Team (ByteDance Seed). "DAPO: An Open-Source LLM Reinforcement Learning System at Scale." arXiv:2503.14476. https://arxiv.org/abs/2503.14476 | https://arxiv.org/pdf/2503.14476 | https://dapo-sia.github.io/
- SimpleRL-Zoo (THU-ML / community recipes on veRL). "SimpleRL-Zoo: Investigating and Taming Zero-shot Reinforcement Learning for Open Base Models in the Wild." arXiv:2503.18892. https://arxiv.org/abs/2503.18892 | Hugging Face Papers: https://huggingface.co/papers/2503.18892 | veRL GitHub (framework used in many recipes): https://github.com/volcengine/verl
- "DeepScaleR: Achieving Superior Performance with a Small Model Through Reinforcement Learning." Medium / arXiv. https://medium.com/@jenray1986/deepscaler-achieving-superior-performance-with-a-small-model-through-reinforcement-learning-562a4381c11f
- Dr. GRPO (debiased GRPO / “Understanding R1-Zero-like training”): Liu et al. "Understanding R1-Zero-Like Training: A Perspective of Model Specialization." arXiv:2503.20783. https://arxiv.org/abs/2503.20783 | Code: https://github.com/sail-sg/understand-r1-zero | Oat implementation (as in the cost table): https://github.com/sail-sg/oat — *Note:* Other papers reuse “Dr. GRPO” wording for different fixes (e.g. noise-corrected variants on arXiv); this note means Liu et al. unless stated otherwise.
- Open-R1 (Hugging Face). GitHub: https://github.com/huggingface/open-r1 (includes `grpo.py` and training scripts). Mini-R1 tutorial (Countdown / “aha moment”): https://huggingface.co/blog/open-r1/mini-r1-contdown-game — microR1 and similar rows in the table are other small-model GRPO reproductions tracked from READMEs; treat timings/costs as illustrative.
- "Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training." arXiv:2602.10815. https://arxiv.org/html/2602.10815v1
- Hugging Face. "Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries." https://huggingface.co/blog/async-rl-training-landscape
- vLLM Blog. "Accelerating RLHF with vLLM, Best Practice from OpenRLHF." April 2025. https://blog.vllm.ai/2025/04/23/openrlhf-vllm.html | OpenRLHF GitHub: https://github.com/OpenRLHF/OpenRLHF
- veRL (Volcano Engine Reinforcement Learning). GitHub: https://github.com/volcengine/verl | Documentation: https://verl.readthedocs.io/en/latest/perf/best_practices.html | DeepWiki: https://deepwiki.com/volcengine/verl
- Hugging Face TRL. GRPOTrainer docs. https://huggingface.co/docs/trl/main/en/grpo_trainer | GitHub: https://github.com/huggingface/trl
- NVIDIA NeMo RL Documentation. https://docs.nvidia.com/nemo/rl/latest/ | DAPO walkthrough: https://docs.nvidia.com/nemo/rl/latest/guides/dapo.html | GitHub: https://github.com/NVIDIA-NeMo/RL
- NVIDIA Developer Blog. "Reinforcement Learning with NVIDIA NeMo-RL: Reproducing a DeepScaleR Recipe Using GRPO." https://developer.nvidia.com/blog/reinforcement-learning-with-nvidia-nemo-rl-reproducing-a-deepscaler-recipe-using-grpo/ | See also: arXiv:2512.20848 (Nemotron 3 Nano). https://arxiv.org/html/2512.20848v1
- RunPod GPU Pricing. https://www.runpod.io/gpu-pricing | Review: https://dupple.com/tools/runpod | Guide: https://compute.hivenet.com/post/runpod-pricing-complete-guide-to-gpu-cloud-costs
- CoreWeave Cloud Pricing. https://www.coreweave.com/pricing | Review: https://www.thundercompute.com/blog/coreweave-gpu-pricing-review
- DigitalOcean GPU Droplets Pricing. https://docs.digitalocean.com/products/droplets/details/pricing/
- Vast.ai Pricing Documentation. https://docs.vast.ai/documentation/instances/pricing
- ByteIota. "GPU Cloud Pricing: H100 Costs $2.49 or $12.30 in 2026." https://byteiota.com/gpu-cloud-pricing-h100-costs-2-49-or-12-30-in-2026/
- Lambda Labs GPU Cloud. https://lambdalabs.com/service/gpu-cloud (1-Click Clusters and on-demand instances)
- "Not All Tokens are Needed: Token-Efficient Reinforcement Learning." arXiv:2603.06619. https://arxiv.org/html/2603.06619
- "Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow." arXiv:2601.14243. https://arxiv.org/html/2601.14243
- veRL Performance Tuning Guide. https://verl.readthedocs.io/en/latest/perf/perf_tuning.html
- SGLang GitHub. https://github.com/sgl-project/sglang
- "VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks." arXiv:2504.05118. https://arxiv.org/html/2504.05118v1
- veRL Documentation. "GSM8K Example." Notes the Cobbe et al. paper focuses on a verifier for Best-of-N, while the veRL walkthrough uses a rule-based reward on GSM8K and refers to the setup as an RLHF agent. https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html | Cobbe et al. "Training Verifiers to Solve Math Word Problems." arXiv:2110.14168. https://arxiv.org/pdf/2110.14168
- Su, Y. et al. "Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains." arXiv:2503.23829 (RL with verifiable rewards across domains). https://arxiv.org/abs/2503.23829
- Inference throughput ratio (7B vs 32B): vLLM documents throughput *measurement* APIs and practices (e.g. benchmark utilities in the project docs). https://docs.vllm.ai/en/latest/api/vllm/benchmarks/throughput.html — This note does not cite a single row that yields “6.3k → 1.2k tok/s”; the ~5× class ratio used in extrapolations is an uncalibrated heuristic from informal community reports—re-benchmark for your model, batch, and backend.
- Pricing rows without inline citations: Figures for some prepaid, bare-metal, or list SKUs (e.g. FluidStack, Vultr long-commit, TensorWave, Crusoe) were transcribed from public pages in March–April 2026 and will drift; confirm list/contract rates before budgeting.
- Dollar cost column (DeepSeek and similar): Where a source publishes GPU-hours but not total spend, this note uses an illustrative ~USD 2 per H800 GPU-hour (order-of-magnitude cloud list pricing) to turn hours into ~$200K / ~$82K style totals—not DeepSeek’s invoice. Stanford FMTI and DeepSeek-V3 report give the underlying hour disclosures (references 6 and 7 above).