Tech · Hugging Face

vLLM V0 to V1: Correctness Before Corrections in RL

Wed, May 6 · 7:06 PM UTC

Compiled by KHAO Editorial, aggregated from 1 source.

Figure 1. Trainer-side metrics for the vLLM V0 reference (blue), the initial vLLM V1 attempt (red), and the final vLLM V1 run after our fixes (green), including the fp32 lm_head. The final V1 run returns close to the V0 trajectory across clip rate, KL, entropy, and reward.

TL;DR. The vLLM V1 run matched the vLLM V0 reference after the authors fixed four things: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and the fp32 lm_head used for the final projection.

Key facts

The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1
Once processed_logprobs is on for V1, the mean policy ratio stays centered extremely close to 1.0 across all three runs
The trainer-side metrics (clip rate, KL, entropy, reward) came from a GSPO training run, the objective used for this experiment
In this online RL setup, cache lifetime and reuse was a V1-only difference relative to the V0 reference path
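
To make the "mean policy ratio close to 1.0" check concrete, here is a minimal Python sketch. It assumes a per-token ratio exp(trainer_logprob - rollout_logprob) averaged over sampled tokens; the source does not show how the ratio is computed, and the function name and all numbers below are hypothetical.

```python
import math


def mean_policy_ratio(trainer_logprobs, rollout_logprobs):
    """Mean per-token ratio pi_trainer / pi_rollout over the sampled tokens.

    Assumption: the ratio is formed per token and then averaged; the
    actual experiment may aggregate differently.
    """
    if not trainer_logprobs:
        raise ValueError("need at least one token")
    if len(trainer_logprobs) != len(rollout_logprobs):
        raise ValueError("logprob sequences must have the same length")
    ratios = [
        math.exp(t - r) for t, r in zip(trainer_logprobs, rollout_logprobs)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical logprobs of the same four sampled tokens.
trainer = [-0.51, -1.20, -0.05, -2.30]
# Rollout backend agrees with the trainer up to small numerical noise.
matched_rollout = [-0.50, -1.21, -0.05, -2.29]
# Rollout backend reports logprobs with a different meaning.
mismatched_rollout = [-0.80, -1.90, -0.02, -3.10]

print("matched:   ", mean_policy_ratio(trainer, matched_rollout))
print("mismatched:", mean_policy_ratio(trainer, mismatched_rollout))
```

With matching logprobs the mean stays near 1.0; when the two sides disagree on what a logprob means, the mean drifts away from 1.0 even though the policy weights are identical.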

Summary

The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1. The trainer-side metrics came from a GSPO training run, the objective used for this experiment. One check was to verify that V1 returned rollout logprobs in the form the trainer expected. A semantic mismatch occurs when the backend returns logprobs whose meaning differs from what the trainer expects.
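
A semantic mismatch of this kind can be sketched in a few lines of Python. The example assumes that "processed" means logprobs computed after temperature scaling and "raw" means logprobs computed from unscaled logits; the summary does not spell out that definition, and the logits, temperature, and function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def raw_logprobs(logits):
    """Logprobs from unscaled logits (assumed meaning of 'raw')."""
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    """Logprobs after temperature scaling (assumed meaning of 'processed')."""
    return log_softmax([x / temperature for x in logits])


logits = [2.0, 1.0, 0.0]   # hypothetical three-token vocabulary
temperature = 0.7          # hypothetical sampling temperature
token = 1                  # index of the sampled token

# The trainer is assumed to score tokens under the sampling distribution.
trainer_lp = processed_logprobs(logits, temperature)[token]

raw_lp = raw_logprobs(logits)[token]
processed_lp = processed_logprobs(logits, temperature)[token]

ratio_raw = math.exp(trainer_lp - raw_lp)
ratio_processed = math.exp(trainer_lp - processed_lp)

print("ratio with raw rollout logprobs:      ", ratio_raw)
print("ratio with processed rollout logprobs:", ratio_processed)
```

Under these assumptions the weights are the same on both sides, yet the ratio built from raw logprobs is off from 1.0 purely because the two numbers describe different distributions.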
