This note connects RLVR training costs to measured baselines and published runs. RLVR (reinforcement learning from verifiable rewards) uses checkable signals—rules, parsers, tests—not learned human-preference reward models.39 When rollouts stay short, wall time looks like ordinary RLHF-style loops; when long chain-of-thought dominates (8K–32K tokens), generation often eats 70–90% of wall time and training can land ~5–10× slower than plain SFT.34 The sections below use 7B short-rollout throughput and cloud list prices to illustrate $0.29–$1.11 per million tokens at that scale (MI300X leads on raw $/throughput in the tables), then extrapolate as rollouts and model size grow. Example run costs range from small open demos to large-cluster GRPO.56742

Interactive cost & time estimator

Adjust model, algorithm, rollout length, and token target to compare H100, H200, and MI300X. Figures use measured 7B GRPO/PPO baselines where available, rollout-length penalties from the long-rollout section below, and April 2026 list pricing. Estimates are illustrative; see the data quality section at the end for confidence notes.
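The estimator's core arithmetic reduces to two formulas. A minimal sketch in Python, with illustrative names (this is not the widget's actual code):

```python
# Hedged sketch of the estimator's arithmetic: wall-clock hours and total
# dollars from a per-GPU throughput, GPU count, token target, and hourly
# price, assuming linear scaling across GPUs.

def estimate_run(total_tokens: float, tok_per_gpu_s: float,
                 n_gpus: int, price_per_gpu_hr: float) -> tuple[float, float]:
    """Return (wall-clock hours, total dollars)."""
    hours = total_tokens / (tok_per_gpu_s * n_gpus * 3600)
    return hours, hours * n_gpus * price_per_gpu_hr

# 1B tokens at the measured 7B GRPO H100 rate (1,544 tok/GPU/s),
# 8 GPUs, Lambda 1-Click pricing ($2.76/GPU/hr):
hours, cost = estimate_run(1e9, 1544, 8, 2.76)  # ~22.5 hrs, ~$497
```

The same function reproduces the per-model-size cost table later in the note when fed the extrapolated throughputs.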

(Interactive estimator widgets omitted from this text version: GPU spec cards with confirmed April 2026 pricing, a summary at 4× GPUs, hours-to-converge and total-cost charts from 1 to 8 GPUs, and a strip of published and open-source RLVR runs used as calibration anchors; those runs are compiled in the wall-clock table later in this note.)

Measured GRPO throughput (ROCm + veRL; short GSM8K rollouts)

The AMD ROCm walkthrough (April 2025) reports GRPO and PPO on GSM8K with veRL v0.3.0 on MI300X and H100—i.e. RLHF with rule-based rewards on math, which is the standard pattern when response lengths are short (512–1024 tokens in their table).1 Those runs give a clean GPU and algorithm comparison (GRPO 1.7–2.5× faster than PPO in the same harness) before we extrapolate to regimes where long rollouts dominate cost.12

GSM8K in this stack: GSM8K is Cobbe et al.’s grade-school math word problems (arXiv:2110.14168). veRL’s GSM8K example scores completions with a simple rule (parse the answer after `####`, compare to the label)—a verifiable reward that fits short generations and aligns with how many math RL pipelines are built; veRL’s docs describe that agent under the broad RLHF umbrella.38 The ROCm numbers are measured system throughputs for that recipe; the rest of the note scales from them toward long-CoT RLVR where decoding dominates.
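The rule described above fits in a few lines. A minimal sketch in the spirit of veRL's GSM8K example (written for illustration, not copied from veRL's code):

```python
import re

# Illustrative GSM8K-style rule-based reward: pull the number after "####"
# and compare it to the gold label. Unparseable completions earn zero.

def gsm8k_reward(completion: str, gold: str) -> float:
    m = re.search(r"####\s*(-?[\d.,]+)", completion)
    if m is None:
        return 0.0  # no "####" marker: no verifiable answer
    pred = m.group(1).replace(",", "").rstrip(".")
    return 1.0 if pred == gold.replace(",", "") else 0.0

gsm8k_reward("...half of 36 is 18.\n#### 18", "18")  # 1.0: verified correct
gsm8k_reward("...the answer is around 20", "18")     # 0.0: no marker
```

This is the whole "verifier": no reward-model forward pass, which is why RLVR's per-step cost is dominated by generation rather than scoring.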

Additional related measurements come from Yotta Labs on MI300X + veRL8 and the OpenRLHF paper.2 This 7B measured slice is still where most public tok/GPU/s tables live—few comparable public rows exist for 14B, 32B, or 70B in the same reporting format.

| GPU | Model | Algorithm | TP | Tokens/GPU/sec | Response Length | Framework | Source |
|---|---|---|---|---|---|---|---|
| H100 SXM | Qwen2-7B | GRPO | 2 | 1,544 | 1,024 | veRL v0.3 | AMD ROCm Blog 1 |
| MI300X | Qwen2-7B | GRPO | 2 | 1,748 | 1,024 | veRL v0.3 | AMD ROCm Blog 1 |
| H100 SXM | DeepSeek-7B | GRPO | 2 | 1,624 | 1,024 | veRL v0.3 | AMD ROCm Blog 1 |
| MI300X | DeepSeek-7B | GRPO | 2 | 1,899 | 1,024 | veRL v0.3 | AMD ROCm Blog 1 |
| H100 SXM | Qwen2-7B | PPO | 2 | 907 | 512 | veRL v0.3 | AMD ROCm Blog 1 |
| MI300X | Qwen2-7B | PPO | 2 | 921 | 512 | veRL v0.3 | AMD ROCm Blog 1 |
| H100 SXM | DeepSeek-7B | PPO | 4 | 624 | 512 | veRL v0.3 | AMD ROCm Blog 1 |
| MI300X | DeepSeek-7B | PPO | 4 | 767 | 512 | veRL v0.3 | AMD ROCm Blog 1 |

These benchmarks used 8 GPUs per node with train_batch_size=1024.1 Two critical patterns emerge: GRPO achieves 1.7–2.5× higher throughput than PPO because it eliminates the critic model entirely,9 and MI300X outperforms H100 by 1–23% across all configurations, likely due to its larger 192GB HBM3 memory enabling more efficient batching.18 Yotta Labs separately confirmed MI300X efficiency with veRL v0.5.0, achieving 14.01% training MFU for 7B GRPO with optimal TP=1 on a single MI300X, and near-linear scaling across data parallelism dimensions.8

The OpenRLHF paper provides a framework-level comparison: OpenRLHF completes one GSM8K GRPO epoch in 1,657 seconds versus TRL's 5,189 seconds — a 3.1× speedup — on identical hardware and hyperparameters.2


Estimated throughput across model sizes requires significant extrapolation

Published RLVR throughput data is almost exclusively at 7B scale. The estimates below combine the measured 7B baselines with a rough, non-cited vLLM-style inference scaling heuristic (7B-class vs 32B-class decode often differs by roughly ~5× on a single GPU in community reports; re-measure for your stack),40 the GPTOSS-20B MoE training benchmark (~500–598 tok/s on 512 H800 GPUs with veRL+Megatron),10 and the OLMo 3 32B RL training disclosure showing inference-to-training compute ratios of 5–14×.11
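To make the extrapolation explicit, here is the heuristic as code. The scaling ratios are assumed, uncited community numbers, not benchmarks; re-measure on your own stack:

```python
# Uncalibrated heuristic behind the estimated rows: scale the measured 7B
# H100 rate by an assumed per-size decode ratio. The ratios here are rough
# community figures, not cited benchmark rows.

MEASURED_7B_H100 = 1544  # tok/GPU/s (ROCm blog, GRPO, 1K responses)
SCALE = {"7B": 1.0, "14B": 0.55, "32B": 0.20, "70B": 0.10}  # assumptions

est = {size: MEASURED_7B_H100 * s for size, s in SCALE.items()}
# est["32B"] is ~309 tok/GPU/s, inside the table's ~300-450 band.
```

Changing any ratio shifts the corresponding cost row proportionally, which is why the 14B-70B rows should be read as order-of-magnitude guides.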

| Model Size | H100 SXM (tok/GPU/s) | MI300X (tok/GPU/s) | A100 80GB (tok/GPU/s) | Confidence | Basis |
|---|---|---|---|---|---|
| 7B | 1,544–1,624 | 1,748–1,899 | ~900–1,100 | Measured | AMD ROCm Blog 1 (veRL v0.3, GRPO, 1K resp.) |
| 14B | ~700–950 | ~800–1,100 | ~400–600 | Estimated | ~0.5–0.6× of 7B; heuristic scaling 40 |
| 32B | ~300–450 | ~350–520 | ~150–250 | Estimated | ~5× drop from 7B (heuristic 40); GPTOSS-20B MoE ~500–598 tok/s on 512 GPUs 10 |
| 70B | ~120–200 | ~140–240 | ~60–100 | Estimated | Requires multi-GPU TP; extrapolated from 32B |
| 235B-A22B (MoE) | ~200–350 | ~230–400 | N/A | Estimated | MoE with ~22B active params ≈ 32B-scale compute |
| 671B-A37B (MoE) | ~100–200 | ~120–250 | N/A | Estimated | Requires 96+ GPUs; EP+TP+PP parallelism 7 |

Critical caveat: These estimates assume short (1K) response lengths matching the benchmark conditions. At RLVR-typical response lengths of 8K–32K tokens, per-token throughput stays roughly flat (each step simply emits more tokens), but each step takes 8–32× longer because autoregressive generation scales linearly with output length. KV cache pressure at longer contexts can also force batch size reductions, further degrading effective throughput. The QeRL paper notes that typical RL training for reasoning models takes 20–100 hours on 8× H100 GPUs, suggesting effective throughput under real-world RLVR conditions is significantly lower than these per-token rates imply.12
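The linear part of that caveat can be made concrete. A sketch assuming flat per-token decode throughput (real long-context behavior is worse once KV-cache pressure shrinks the feasible batch):

```python
# Linear model of the step-time penalty: with per-token throughput held
# flat, step time grows proportionally with response length relative to
# the 1K-token benchmark conditions.

def step_seconds(resp_len: int, batch_per_gpu: int, tok_per_gpu_s: float) -> float:
    """Seconds to generate one batch of responses on one GPU."""
    return resp_len * batch_per_gpu / tok_per_gpu_s

base = step_seconds(1024, 128, 1544)    # benchmark-like short rollouts
long = step_seconds(16384, 128, 1544)   # 16K-token reasoning rollouts
# long / base is 16: sixteen times the output length, sixteen times the step.
```

The batch size of 128 is illustrative; the ratio is independent of it under this linear model.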


Wall-clock training times from published RLVR runs

The table below compiles selected RLVR-style training runs with disclosed compute details. Dollar costs use illustrative $/GPU-hour assumptions where sources do not publish invoices (see 42). Costs span roughly five orders of magnitude, from a third-party README total of ~$2.62 to an estimated ~$200,000, driven primarily by model scale and rollout length.

| Run | Model | Size | Algorithm | GPUs | Wall-Clock | GPU-Hours | Est. Cost | Framework |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Zero 67 | DeepSeek-V3-Base | 671B MoE | GRPO | 512×H800 | ~198 hrs | ~101K | ~$200K | Custom |
| DeepSeek-R1 (RL stage) 67 | DeepSeek-V3-Base | 671B MoE | GRPO | 512×H800 | ~80 hrs | ~41K | ~$82K | Custom |
| ProRL-1.5B-v2 13 | | 1.5B | RLVR | H100s | | >20K | ~$60K+ | OpenRLHF |
| DAPO 14 | Qwen2.5-32B | 32B | DAPO | 128×H20 | Several days | Not disclosed | | veRL |
| SimpleRL-Zoo (32B) 15 | Qwen2.5-32B | 32B | GRPO | 64×H100 | ~36 hrs | ~2,300 | ~$4,600 | veRL |
| DeepScaleR 16 | R1-Distill-Qwen-1.5B | 1.5B | GRPO | 8–32×A100 | Multi-stage | 3,800 | ~$4,500 | veRL |
| SimpleRL-Zoo (7B) 15 | Qwen2.5-7B | 7B | GRPO | 16×H100 | ~15 hrs | ~240 | ~$500 | veRL |
| Dr. GRPO 17 | Qwen2.5-Math-7B | 7B | Dr. GRPO | 8×A100 | ~27 hrs | ~216 | ~$430 | Oat |
| Open-R1 18 | Qwen2.5-Math-7B | 7B | GRPO | 8×H100 | ~3 hrs | 24 | ~$72 | TRL |
| Mini-R1 18 | Qwen2.5-3B-Instruct | 3B | GRPO | 4×H100 | ~6 hrs | 24 | ~$72 | TRL |
| microR1 18 | Qwen2.5-3B-Instruct | 3B | GRPO | 8×A100 | ~3 hrs | 24 | ~$44 | Pure PyTorch |
| TinyZero 9 | Qwen2.5-3B | 3B | PPO | 2×H200 | <5 hrs | <10 | <$30 | veRL |
| R1-V 5 | Qwen2-VL-2B | 2B VLM | GRPO | 8×A100 | 30 min | 4 | $2.62 | TRL |

DeepSeek-R1-Zero represents the largest published RLVR run: 512 H800 GPUs for 198 hours training the 671B MoE model with GRPO.67 The full DeepSeek-R1 pipeline (including V3 pre-training) consumed 2.788 million H800 GPU-hours at an estimated $5.58M (same order-of-magnitude $/GPU-hr caveat as 42).7 At the other extreme, community GRPO-on-VLM writeups report ~30 minutes on 8 A100s with a README-derived total of ~$2.62 (GPU-hours × quoted rates)—use as an illustration, not a peer-reviewed benchmark.5 DeepScaleR's iterative context-length scaling approach (8K→16K→24K) reduced estimated compute from ~70K to just 3,800 A100-hours — demonstrating that curriculum-based context scaling is a critical cost optimization for RLVR.16
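The dollar figures above can be reproduced from the disclosed GPU-hours with the note's illustrative rate assumption:

```python
# Reproducing the order-of-magnitude totals from disclosed GPU-hours, using
# this note's illustrative ~$2 per H800 GPU-hour (not DeepSeek's invoice).

def run_cost(n_gpus: int, wall_hours: float, usd_per_gpu_hr: float):
    """Return (total GPU-hours, estimated dollars)."""
    gpu_hours = n_gpus * wall_hours
    return gpu_hours, gpu_hours * usd_per_gpu_hr

gpu_hours, usd = run_cost(512, 198, 2.0)  # DeepSeek-R1-Zero disclosure
# 101,376 GPU-hours -> ~$203K, matching the ~101K / ~$200K figures above.
```

The same two-line calculation underlies every Est. Cost cell in the wall-clock table; only the hourly rate assumption changes per GPU type.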


RLVR is faster than RLHF but far slower than SFT

GRPO eliminates the critic model that PPO requires,9 and practitioners often cite roughly 40–50% memory savings versus four-model PPO+RM stacks—treat that band as a heuristic, not a controlled measurement in this note. The ROCm + veRL GRPO vs PPO comparison is measured: GRPO is 1.7–2.5× faster than PPO under the same settings.1 Traditional PPO-based RLHF requires four large models in GPU memory simultaneously (policy, reference, reward model, critic), while GRPO needs only two (policy and reference).9 DAPO goes further by removing the KL penalty entirely, eliminating even the reference model.14

The throughput hierarchy is clear from measured data. GRPO on a 7B model achieves 1,544 tok/GPU/s versus PPO's 907 tok/GPU/s on an H100 — a 1.70× improvement.1 On MI300X, GRPO reaches 1,748 tok/GPU/s versus PPO's 921 tok/GPU/s (1.90×).1 One non-peer-reviewed Substack overview claims GRPO can reduce overall training cost to roughly 1/18 of some traditional PPO-style RL setups when memory savings enable larger batches—use as anecdotal context, not a universal ratio.9

However, RLVR's rule-based verification advantage (no neural reward model forward pass) is substantially offset by its long rollouts. The DC-SFT paper reports that in their VLM setting, SFT achieved about 4.9× higher training efficiency than GRPO (see their tables for exact comparisons—do not read this as a universal constant across tasks).19 Multiple sources confirm the 5–10× slowdown of online RL versus offline SFT, driven by autoregressive rollout generation consuming 70–90% of RL training time.3420 A single RLVR training step can take minutes to over an hour depending on model size and rollout length, compared to seconds for SFT. The OLMo 3 32B reasoner allocated 20 H100 nodes for inference alongside 8 for training — inference consumed 5–14× more compute than policy updates, with learner GPUs idle 75% of the time waiting for rollout data.11


Framework landscape for RLVR training

Four major frameworks dominate RLVR training, each with distinct strengths. The table below reflects capabilities as of early April 2026.

| Feature | OpenRLHF v0.9.9 221 | veRL v0.7.1 22 | TRL v1.0.0 23 | NeMo RL 2425 |
|---|---|---|---|---|
| GRPO | ✅ | ✅ | ✅ | ✅ |
| DAPO | ✅ (via flags) | ✅ (reference impl.) | | |
| Dr. GRPO | | | | |
| REINFORCE++ | ✅ (native, recommended) | | | |
| PRIME | | | | |
| Max proven scale | 70B+ dense | 671B MoE | ~72B (with DeepSpeed) | 340B+ |
| Inference backend | vLLM | vLLM + SGLang | vLLM | Megatron native |
| Training backend | DeepSpeed ZeRO-3 | FSDP/Megatron | HF Accelerate | Megatron Core |
| Async training | ✅ (--async_train) | ✅ (1-step off-policy) | Experimental | |
| FP8 end-to-end | | | | ✅ |

OpenRLHF (widely used open-source RLHF/RLVR stack; EMNLP 2025) is the throughput leader for dense models up to 70B in the OpenRLHF team’s own comparisons, achieving 1.22–1.68× speedup over veRL in long-CoT RLVR settings across 1.5B–14B models with 1K–8K generation lengths.2 Its Ray+vLLM+DeepSpeed architecture is battle-tested by Google, ByteDance, NVIDIA, and Tencent.21 The team explicitly recommends REINFORCE++-baseline for RLVR tasks due to its robustness across reward patterns.2

veRL (EuroSys 2025; large public footprint on GitHub) has broad algorithm support and large proven scale — the DAPO paper's results were produced using veRL,14 and it has been validated on DeepSeek-V3 671B and Qwen3-235B MoE models.22 Its Megatron backend enables expert parallelism essential for MoE training. Note that OpenRLHF's published speed advantage was measured against veRL v0.4.0; veRL has since released significant optimizations through v0.7.1 that may have closed this gap.222

TRL (Hugging Face; frequent releases) prioritizes accessibility over raw throughput.23 Its GRPOTrainer requires minimal code and integrates natively with Hugging Face's ecosystem. It is ~3.1× slower than OpenRLHF on GRPO benchmarks,2 making it best suited for prototyping and smaller-scale training. The Open-R1 and Mini-R1 reproduction projects both use TRL.18

NeMo RL (successor to deprecated NeMo-Aligner) is NVIDIA's enterprise offering, uniquely supporting end-to-end FP8 training and Megatron Core's full 3D parallelism suite.24 It trained Nemotron 3 Nano with GRPO across multiple environments simultaneously, using up to 49K-token generation lengths.25

Notable emerging frameworks include AReaL (Ant Group/Tsinghua), achieving 2.77× speedup via fully asynchronous RL,20 and ROLL Flash (Alibaba), reaching 2.24× speedup on RLVR through queue scheduling and rollout-train decoupling.3


Cloud GPU pricing for RLVR workloads in April 2026

RLVR training demands multi-GPU clusters with high-bandwidth interconnect (NVLink intra-node, InfiniBand inter-node). Prices below are per GPU per hour, transcribed from provider websites and aggregators in March–April 2026 where footnoted.2627282930 Rows without an inline footnote marker (e.g. some prepaid or bare-metal SKUs) should be spot-checked on the vendor site—live quotes move weekly.41

H100 SXM 80GB

| Provider | On-Demand | Spot/Community | Reserved | InfiniBand |
|---|---|---|---|---|
| Vast.ai | ~$1.54 | ~$1.50+ | Available | Varies 29 |
| RunPod (Community) | ~$1.99–$2.49 | | Enterprise | NVLink 26 |
| FluidStack | ~$2.10 | | Custom | |
| Vultr (36-mo) | $2.30 | | 36-mo prepaid | NVLink |
| Lambda (1-Click) | $2.76 | | 1–3 yr custom 31 | |
| RunPod (Secure) | ~$2.69–$2.99 | | Enterprise | NVLink 26 |
| DigitalOcean (8×) | $2.99 | | $1.99 (committed) | NVLink 28 |
| Crusoe | $3.90 | | Contact sales | Custom |
| Lambda (Instance) | $3.99–$4.29 | | Custom | NVLink 31 |
| CoreWeave | $6.16 | | Up to 60% off 27 | |

H200 SXM 141GB

| Provider | On-Demand | Reserved | Notes |
|---|---|---|---|
| Vast.ai | ~$1.50–$4.00 | Available | Marketplace, variable 29 |
| DigitalOcean | $3.44 | Multi-month discount | NVLink HGX 28 |
| Crusoe | $4.29 | Custom | InfiniBand |
| CoreWeave | $6.31 | Up to 60% off | InfiniBand 27 |

MI300X 192GB

| Provider | On-Demand | Reserved | Notes |
|---|---|---|---|
| Vast.ai | ~$0.95–$3.12 | Available | Marketplace spot from $0.95 29 |
| TensorWave | $2.25 | $1.71 (dedicated) | Bare metal, Infinity Fabric |
| Vultr | $1.85 (preempt.) | $1.85 (24-mo) | 24-month prepaid |
| DigitalOcean | $1.99 | Multi-month | Infinity Fabric 28 |
| Crusoe | $3.45 | Custom | Infinity Fabric |

A100 80GB SXM

| Provider | On-Demand | Spot | Notes |
|---|---|---|---|
| Vast.ai | ~$0.50–$2.00 | From ~$0.50 | Marketplace 29 |
| RunPod (Community) | ~$0.89 | | Per-second billing 26 |
| Crusoe | $1.95 | $1.30 | Clean energy |
| RunPod (Secure) | ~$1.89 | | SOC2 26 |
| CoreWeave | $2.70 | | InfiniBand 27 |
| Lambda | $2.79 | | 8× nodes 31 |

For serious RLVR training at scale (multi-node with InfiniBand), Lambda 1-Click Clusters at $2.76/GPU/hr for H100 and DigitalOcean at $1.99/GPU/hr for MI300X represent the strongest price-performance combinations with production-grade interconnect.2831 Vast.ai offers the lowest absolute prices but lacks guaranteed InfiniBand connectivity.29


Cost per million tokens processed during RLVR training

The following calculations combine measured GRPO throughput at 7B scale (1K response length) with current GPU pricing.1 These represent best-case costs — real-world RLVR with longer rollouts will have higher per-step costs (though similar per-token costs if memory permits maintaining batch size).
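The conversion used in the tables below is straightforward:

```python
# $/M-token conversion: dollars per GPU-hour divided by millions of tokens
# produced per GPU-hour at a given sustained decode rate.

def usd_per_million_tokens(usd_per_gpu_hr: float, tok_per_gpu_s: float) -> float:
    m_tok_per_gpu_hr = tok_per_gpu_s * 3600 / 1e6
    return usd_per_gpu_hr / m_tok_per_gpu_hr

usd_per_million_tokens(2.76, 1544)  # ~0.50: Lambda 1-Click H100, 7B GRPO
usd_per_million_tokens(1.99, 1748)  # ~0.32: DigitalOcean MI300X, 7B GRPO
```

Note that this treats the measured training throughput as the effective token rate, i.e. it already folds rollout generation, forward/backward passes, and synchronization into one number.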

H100 SXM — 7B GRPO (1,544 tok/GPU/s = 5.56M tok/GPU/hr) 1

| Provider | $/GPU/hr | $/M Tokens | Notes |
|---|---|---|---|
| Vast.ai | $1.54 | $0.28 | Marketplace; reliability varies 29 |
| RunPod (Community) | $1.99 | $0.36 | Per-second billing 26 |
| Lambda (1-Click) | $2.76 | $0.50 | InfiniBand clusters 31 |
| DigitalOcean (8×) | $2.99 | $0.54 | NVLink HGX 28 |
| Crusoe | $3.90 | $0.70 | Clean energy |
| Lambda (Instance) | $3.99 | $0.72 | Self-serve 31 |
| CoreWeave | $6.16 | $1.11 | Enterprise; reserved ~$2.46→$0.44 27 |

MI300X — 7B GRPO (1,748 tok/GPU/s = 6.29M tok/GPU/hr) 1

| Provider | $/GPU/hr | $/M Tokens | Notes |
|---|---|---|---|
| Vast.ai (spot) | $0.95 | $0.15 | Marketplace; lowest possible 29 |
| Vultr (24-mo) | $1.85 | $0.29 | Prepaid commitment |
| DigitalOcean | $1.99 | $0.32 | Best reliable value 28 |
| TensorWave | $2.25 | $0.36 | Bare metal, dedicated |
| Crusoe | $3.45 | $0.55 | InfiniBand |

Estimated cost scaling by model size (H100, Lambda 1-Click ~$2.76/hr) 31

| Model Size | Est. tok/GPU/s | M tok/GPU/hr | $/M Tokens | Est. Cost for 1B-Token Run |
|---|---|---|---|---|
| 7B | 1,544 (measured 1) | 5.56 | $0.50 | $500 |
| 14B | ~800 (est.) | ~2.88 | ~$0.96 | $960 |
| 32B | ~375 (est.) | ~1.35 | ~$2.04 | $2,040 |
| 70B | ~160 (est.) | ~0.58 | ~$4.76 | $4,760 |

MI300X provides a 30–45% cost advantage over H100 for RLVR training at current market rates, combining higher throughput (+13% for 7B GRPO 1) with lower pricing (DigitalOcean MI300X at $1.99 28 vs Lambda H100 at $2.76 31). veRL documents strong MI300X support (ROCm integration and tuning notes).8 For other frameworks, confirm AMD paths in upstream docs before committing to a stack.


Long rollouts dominate RLVR cost structure

The defining characteristic of RLVR versus traditional RLHF is the length of generated rollouts. While RLHF typically generates 512–2048 token responses, RLVR reasoning chains routinely reach 8K–32K tokens, with production systems like DAPO using max_response_length of 20,480 tokens14 and NeMo RL's Nemotron training reaching 49K tokens.25 Production RLVR workload characterization (PolyTrace) shows math reasoning tasks averaging ~9,839 output tokens per sample.4

Rollout generation consumes 70–90% of total RLVR training time.34 This means the bottleneck is autoregressive decoding — a memory-bandwidth-bound operation that cannot be trivially accelerated by adding more compute. Each GRPO step generates G completions per prompt (typically G=8–64; DeepSeek-R1 used G=64 6), multiplying the generation burden. For a concrete example from the HuggingFace "Keep the Tokens Flowing" analysis:20 generating 512 rollouts at 8K tokens for a 32B model on 8 H100 inference GPUs takes approximately 7 minutes for generation alone, before any gradient computation.
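That generation-time figure can be reconstructed with one assumed number, the aggregate decode rate:

```python
# Reconstructing the "~7 minutes of pure generation" example: 512 rollouts
# at 8K tokens on 8 inference GPUs for a 32B model. The decode rate below
# is an assumed figure chosen to be plausible for 32B-class batched serving;
# it is not a measured benchmark.

rollouts, resp_len, n_infer_gpus = 512, 8192, 8
assumed_tok_per_gpu_s = 1250  # assumption: batched decode, 32B class

gen_seconds = rollouts * resp_len / (n_infer_gpus * assumed_tok_per_gpu_s)
gen_minutes = gen_seconds / 60  # ~7 minutes before any gradient computation
```

The sensitivity is linear in every input, so doubling response length or halving the decode rate doubles the generation stall per step.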

The long-tail distribution of rollout lengths creates severe GPU idling. ROLL Flash found that the longest responses can exceed the median by over 20×, meaning most GPUs finish early and wait for stragglers in synchronous systems.3 The OLMo 3 32B reasoner's learner GPUs spent 75% of time idle waiting for inference data.11 This has spawned several optimization approaches:

  • Asynchronous training (ROLL Flash,3 AReaL,20 PipelineRL) decouples generation from training, achieving 2.0–2.8× speedups by overlapping these phases
  • Dynamic sampling (DAPO 14) filters prompts where all responses are correct or all wrong, avoiding wasted computation on zero-gradient batches — achieving the same performance with 1/3 the training steps
  • Overlong reward shaping (DAPO 14) applies soft penalties for responses exceeding a threshold rather than hard truncation, preventing training instability
  • Token-level policy gradient loss (DAPO,14 veRL 22) averages loss across total tokens rather than per-sample-then-per-batch, preventing gradient dilution for long sequences
  • Dr. GRPO's debiased advantage 17 removes variance normalization and length divisors from GRPO, eliminating bias toward shorter responses and providing unbiased policy gradients
  • Clip-Higher (DAPO 14) uses asymmetric clipping (ε_low=0.2, ε_high=0.28) to preserve exploration by allowing more room for increasing low-probability tokens, combating entropy collapse; related asymmetric clipping ideas appear in VAPO.37
  • NAT (Not All Tokens are Needed) 33 performs policy optimization on only ~50% of tokens from each rollout while computing rewards on full responses, reducing activation memory
  • FP8 precision (JetRL 34) achieves 1.07–1.33× rollout speedup, but naive mixed-precision (BF16 training + FP8 rollout) fails catastrophically at context lengths beyond 8K due to numerical precision mismatches
  • Iterative context scaling (DeepScaleR 16) trains at 8K→16K→24K progressively, reducing total compute by ~18× versus training at maximum length from the start
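Three of the list items above translate directly into code. These are standard simplified formulations, not any framework's implementation:

```python
# GRPO's group-relative advantage, Dr. GRPO's debiased variant (no variance
# divisor), and DAPO-style dynamic sampling (drop prompts whose rewards are
# all identical, since every advantage in the group is then zero).

from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    mu = mean(rewards)            # Dr. GRPO keeps only the mean baseline
    return [r - mu for r in rewards]

def keep_prompt(rewards: list[float]) -> bool:
    return len(set(rewards)) > 1  # all-correct or all-wrong -> zero gradient

keep_prompt([1.0, 1.0, 1.0, 1.0])  # False: filtered, no learning signal
keep_prompt([1.0, 0.0, 1.0, 0.0])  # True
```

With binary verifiable rewards, the dynamic-sampling filter is exactly the "all responses correct or all wrong" test from the DAPO bullet.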

KV cache memory scales linearly with sequence length. For a Qwen-32B model, hosting with full 131K context increases memory requirements from ~70GB to approximately 400GB. Frameworks address this through dynamic memory management (veRL's `free_cache_engine=True` offloads KV cache after rollout generation 35) and PagedAttention (vLLM) for non-contiguous memory allocation. Prefix caching is particularly valuable for GRPO since all G completions per prompt share the same prompt prefix — SGLang's RadixAttention provides 3–5× cache hit improvement in this multi-completion scenario.36
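A back-of-envelope KV-cache calculation, with layer and head dimensions assumed from public Qwen2.5-32B configs (verify against your checkpoint before sizing hardware):

```python
# KV-cache sizing for a Qwen-32B-class model. The layer count, GQA KV-head
# count, and head dim below are assumptions taken from public Qwen2.5-32B
# configs, not values stated in this note.

layers, kv_heads, head_dim, bytes_fp16 = 64, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K and V

per_seq_gib = kv_bytes_per_token * 131_072 / 2**30  # one full-length sequence
# ~0.25 MiB per token and 32 GiB per full 131K sequence: a handful of
# concurrent sequences plus ~65 GB of weights lands in the ~400 GB range.
```

This is also why prefix caching pays off so well for GRPO: the shared prompt prefix across all G completions is stored once instead of G times.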


Data quality assessment and key caveats

Every number in this report carries a confidence level that readers should understand when making cost projections.

Measured benchmarks (high confidence): AMD ROCm publishes GRPO/PPO throughputs for veRL v0.3 on GSM8K with rule-based rewards and short completions (7B models, 8 GPUs per node)—a solid baseline for GPU and algorithm comparisons.1 Yotta Labs reports complementary MI300X + veRL numbers.8 DeepSeek-R1-Zero's training configuration (512×H800, ~198 hours) was disclosed via Stanford FMTI and Nature supplementary materials.67 OpenRLHF vs TRL timing comparisons come from the peer-reviewed EMNLP 2025 paper.2

Derived with caveats (medium confidence): Wall-clock times and GPU-hours for open-source reproductions (DeepScaleR,16 SimpleRL-Zoo,15 Dr. GRPO 17) come from GitHub READMEs, blog posts, and WandB logs — credible but not peer-reviewed. The 70–90% rollout time proportion is consistent across ROLL Flash,3 veRL,22 OpenRLHF,2 and NAT 33 papers. The RLVR vs SFT slowdown band (5–10×) combines DC-SFT’s in-paper ~4.9× VLM comparison19 with production-style reports.11

Estimated with significant uncertainty (lower confidence): Throughput estimates for 14B, 32B, and 70B models are extrapolations from 7B measured data combined with a heuristic inference scaling ratio (not a single cited benchmark row).40 Real-world throughput depends heavily on batch size, parallelism strategy, sequence length, and framework optimizations. The 14B–70B rows in the throughput and cost tables should be treated as rough order-of-magnitude guides, not precision benchmarks. Cloud GPU pricing fluctuates; spot/marketplace rates (especially Vast.ai 29) can vary by 2–3× within a single week. Un-footnoted list prices should be reverified on provider pages.41 The OpenRLHF vs veRL framework comparison was published by the OpenRLHF team 2 against an older veRL version (v0.4.0); veRL's subsequent optimizations through v0.7.1 may have changed this relationship.22

Unresolved gaps: Few public tables report long-rollout RLVR throughputs (8K–32K tokens) with the same tok/GPU/s detail as short-completion 7B baselines, and H200 GPUs have almost no RLVR-oriented throughput benchmarks of the kind used in this note. No framework has published comparable benchmarks across all GPU types. Qwen/QwQ training compute has never been publicly disclosed. DAPO's total GPU-hours on 128×H20 were not reported.14 The interaction between long rollout lengths (8K–32K) and per-token throughput under GPU memory pressure lacks systematic benchmarking — current data either measures short (1K) rollouts or reports only wall-clock totals without per-token rates.


Conclusion

RLVR training costs are dominated by a single bottleneck: autoregressive rollout generation of long reasoning chains.34 The algorithmic efficiency of GRPO over PPO (1.7–2.5× measured throughput advantage,1 plus the heuristic “fewer resident models” memory story 9) is real but secondary to the 70–90% of wall-clock time spent generating 8K–32K token completions.311 The most impactful cost optimizations are therefore architectural — asynchronous training (2–2.8× speedup 320), dynamic sampling (3× step reduction 14), and iterative context scaling (18× compute reduction 16) — rather than hardware-level. At current market rates, MI300X at $1.99/GPU/hr with 13% higher GRPO throughput than H100 offers the best raw price-performance,128 though H100's broader framework support and InfiniBand availability make it the safer choice for production runs. The framework choice matters enormously: OpenRLHF's 3.1× advantage over TRL and 1.2–1.7× over veRL v0.4 on identical hardware 2 represents a larger throughput delta than any GPU generational improvement. For organizations planning RLVR training, the decision tree is: veRL or NeMo RL for MoE models above 200B,2224 OpenRLHF for dense models up to 70B where throughput is critical,2 and TRL for rapid prototyping 23 — then invest heavily in the async and dynamic sampling optimizations that cut wall-clock time by 2–3× regardless of hardware choice.


References

  1. AMD ROCm Blog. "Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration." April 2025. GRPO/PPO throughput on GSM8K with veRL v0.3, rule-based rewards, 7B models, 512–1024-token responses, 8 GPUs per node. https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html
  2. Hu, J. et al. "OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework." EMNLP 2025. https://arxiv.org/html/2501.03262v4 | https://arxiv.org/pdf/2405.11143 | https://aclanthology.org/2025.emnlp-demos.48/
  3. ROLL Flash Team. "Part II: ROLL Flash — Accelerating RLVR and Agentic Training with Asynchrony." arXiv:2510.11345, October 2025. https://arxiv.org/abs/2510.11345 | https://arxiv.org/html/2510.11345
  4. "RL in the Wild: Characterizing RLVR Training in LLM Deployment." arXiv:2509.25279. https://arxiv.org/html/2509.25279
  5. $2.62 / “R1-V” row: The table summarizes community README / blog claims (GPU-hours × quoted $/hr), not a paper. For the R1-VL line of work see e.g. Chen et al. "R1-VL: Advancing Multimodal Reasoning from Optimized Cold Start to Staged Reinforcement Learning." arXiv:2503.12937. https://arxiv.org/abs/2503.12937 | GitHub: https://github.com/jingyi0000/R1-VL — Related small-VLM GRPO artifacts include lmms-lab’s Qwen2-VL-2B-GRPO-8k card: https://huggingface.co/lmms-lab/Qwen2-VL-2B-GRPO-8k — Third-party cost writeup (example): PhotoAtomic. "R1-V: Witness the aha moment of VLM with less than $3." https://github.com/PhotoAtomic/deep-agent-R1-V
  6. Stanford CRFM. "DeepSeek Transparency Report." FMTI December 2025. https://crfm.stanford.edu/fmti/December-2025/company-reports/DeepSeek_FinalReport_FMTI2025.html
  7. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437. https://arxiv.org/pdf/2412.19437
  8. Yotta Labs. "Performance Optimization for Reinforcement Learning on AMD GPUs." 2025. https://www.yottalabs.ai/post/performance-optimization-for-reinforcement-learning-on-amd-gpus
  9. Wolfe, C. "Group Relative Policy Optimization (GRPO)." Cameron Wolfe's Substack. https://cameronrwolfe.substack.com/p/grpo
  10. "Enabling Large Scale RLHF of GPTOSS with Megatron backend in VeRL." Hugging Face Blog. https://huggingface.co/blog/yiakwy-xpu-team/enabling-large-scale-rlhf-of-gptoss-with-megatron
  11. Wolfe, C. "OLMo 3 and the Open LLM Renaissance." Cameron Wolfe's Substack. https://cameronrwolfe.substack.com/p/olmo-3 | See also: Kim, T. "OLMo 3: The Architecture of 'Fully' Open Sourced Reasoning Models." https://www.terrencekim.net/2026/01/olmo-3-architecture-of-fully-open.html
  12. Neurohive. "QeRL: Training 32B Models on Single H100 vs Three GPUs, Beating LoRA in Accuracy." https://neurohive.io/en/state-of-the-art/qerl-2/
  13. OpenRLHF GitHub. ProRL-1.5B-v2. https://github.com/OpenRLHF/OpenRLHF
  14. DAPO Team (ByteDance Seed). "DAPO: An Open-Source LLM Reinforcement Learning System at Scale." arXiv:2503.14476. https://arxiv.org/abs/2503.14476 | https://arxiv.org/pdf/2503.14476 | https://dapo-sia.github.io/
  15. SimpleRL-Zoo (THU-ML / community recipes on veRL). "SimpleRL-Zoo: Investigating and Taming Zero-shot Reinforcement Learning for Open Base Models in the Wild." arXiv:2503.18892. https://arxiv.org/abs/2503.18892 | Hugging Face Papers: https://huggingface.co/papers/2503.18892 | veRL GitHub (framework used in many recipes): https://github.com/volcengine/verl
  16. "DeepScaleR: Achieving Superior Performance with a Small Model Through Reinforcement Learning." Medium / arXiv. https://medium.com/@jenray1986/deepscaler-achieving-superior-performance-with-a-small-model-through-reinforcement-learning-562a4381c11f
  17. Dr. GRPO (debiased GRPO / “Understanding R1-Zero-like training”): Liu et al. "Understanding R1-Zero-Like Training: A Perspective of Model Specialization." arXiv:2503.20783. https://arxiv.org/abs/2503.20783 | Code: https://github.com/sail-sg/understand-r1-zero | Oat implementation (as in the cost table): https://github.com/sail-sg/oat — *Note:* Other papers reuse “Dr. GRPO” wording for different fixes (e.g. noise-corrected variants on arXiv); this note means Liu et al. unless stated otherwise.
  18. Open-R1 (Hugging Face). GitHub: https://github.com/huggingface/open-r1 (includes `grpo.py` and training scripts). Mini-R1 tutorial (Countdown / “aha moment”): https://huggingface.co/blog/open-r1/mini-r1-contdown-game — microR1 and similar rows in the table are other small-model GRPO reproductions tracked from READMEs; treat timings/costs as illustrative.
  19. "Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training." arXiv:2602.10815. https://arxiv.org/html/2602.10815v1
  20. Hugging Face. "Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries." https://huggingface.co/blog/async-rl-training-landscape
  21. vLLM Blog. "Accelerating RLHF with vLLM, Best Practice from OpenRLHF." April 2025. https://blog.vllm.ai/2025/04/23/openrlhf-vllm.html | OpenRLHF GitHub: https://github.com/OpenRLHF/OpenRLHF
  22. veRL (Volcano Engine Reinforcement Learning). GitHub: https://github.com/volcengine/verl | Documentation: https://verl.readthedocs.io/en/latest/perf/best_practices.html | DeepWiki: https://deepwiki.com/volcengine/verl
  23. Hugging Face TRL. GRPOTrainer docs. https://huggingface.co/docs/trl/main/en/grpo_trainer | GitHub: https://github.com/huggingface/trl
  24. NVIDIA NeMo RL Documentation. https://docs.nvidia.com/nemo/rl/latest/ | DAPO walkthrough: https://docs.nvidia.com/nemo/rl/latest/guides/dapo.html | GitHub: https://github.com/NVIDIA-NeMo/RL
  25. NVIDIA Developer Blog. "Reinforcement Learning with NVIDIA NeMo-RL: Reproducing a DeepScaleR Recipe Using GRPO." https://developer.nvidia.com/blog/reinforcement-learning-with-nvidia-nemo-rl-reproducing-a-deepscaler-recipe-using-grpo/ | See also: arXiv:2512.20848 (Nemotron 3 Nano). https://arxiv.org/html/2512.20848v1
  26. RunPod GPU Pricing. https://www.runpod.io/gpu-pricing | Review: https://dupple.com/tools/runpod | Guide: https://compute.hivenet.com/post/runpod-pricing-complete-guide-to-gpu-cloud-costs
  27. CoreWeave Cloud Pricing. https://www.coreweave.com/pricing | Review: https://www.thundercompute.com/blog/coreweave-gpu-pricing-review
  28. DigitalOcean GPU Droplets Pricing. https://docs.digitalocean.com/products/droplets/details/pricing/
  29. Vast.ai Pricing Documentation. https://docs.vast.ai/documentation/instances/pricing
  30. ByteIota. "GPU Cloud Pricing: H100 Costs $2.49 or $12.30 in 2026." https://byteiota.com/gpu-cloud-pricing-h100-costs-2-49-or-12-30-in-2026/
  31. Lambda Labs GPU Cloud. https://lambdalabs.com/service/gpu-cloud (1-Click Clusters and on-demand instances)
  33. "Not All Tokens are Needed: Token-Efficient Reinforcement Learning." arXiv:2603.06619. https://arxiv.org/html/2603.06619
  34. "Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow." arXiv:2601.14243. https://arxiv.org/html/2601.14243
  35. veRL Performance Tuning Guide. https://verl.readthedocs.io/en/latest/perf/perf_tuning.html
  36. SGLang GitHub. https://github.com/sgl-project/sglang
  37. "VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks." arXiv:2504.05118. https://arxiv.org/html/2504.05118v1
  38. veRL Documentation. "GSM8K Example." Notes the Cobbe et al. paper focuses on a verifier for Best-of-N, while the veRL walkthrough uses a rule-based reward on GSM8K and refers to the setup as an RLHF agent. https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html | Cobbe et al. "Training Verifiers to Solve Math Word Problems." arXiv:2110.14168. https://arxiv.org/pdf/2110.14168
  39. Su, Y. et al. "Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains." arXiv:2503.23829 (RL with verifiable rewards across domains). https://arxiv.org/abs/2503.23829
  40. Inference throughput ratio (7B vs 32B): vLLM documents throughput *measurement* APIs and practices (e.g. benchmark utilities in the project docs). https://docs.vllm.ai/en/latest/api/vllm/benchmarks/throughput.html — This note does not cite a single row that yields "6.3k → 1.2k tok/s"; the ~5× class ratio used in extrapolations is an uncalibrated heuristic from informal community reports—re-benchmark for your model, batch, and backend.
  41. Pricing rows without inline citations: Figures for some prepaid, bare-metal, or list SKUs (e.g. FluidStack, Vultr long-commit, TensorWave, Crusoe) were transcribed from public pages in March–April 2026 and will drift; confirm list/contract rates before budgeting.
  42. Dollar cost column (DeepSeek and similar): Where a source publishes GPU-hours but not total spend, this note uses an illustrative ~USD 2 per H800 GPU-hour (order-of-magnitude cloud list pricing) to turn hours into ~$200K / ~$82K style totals—not DeepSeek's invoice. Stanford FMTI and DeepSeek-V3 report give the underlying hour disclosures (references 6 and 7 above).