Tech · Hugging Face
vLLM V0 to V1: Correctness Before Corrections in RL
Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.
TL;DR. vLLM V1 matched the authors' vLLM V0 reference after four fixes: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and the fp32 lm_head used for the final projection.
Key facts
- The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1
- Once processed_logprobs is on for V1, the mean policy ratio stays centered extremely close to 1.0 across all three runs
- Those metrics came from a GSPO training run, the objective used for this experiment
- In this online RL setup, it was a V1-only difference in cache lifetime and reuse relative to the V0 reference path
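The policy-ratio check mentioned in the key facts can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and sample values are hypothetical, and the length-normalized sequence ratio is an assumption based on how GSPO is usually formulated. When the inference backend and the trainer agree on what a logprob means, both ratios sit at 1.0 before any policy update.

```python
import math


def token_policy_ratios(trainer_logprobs, rollout_logprobs):
    """Per-token ratio exp(trainer - rollout) for the sampled tokens."""
    if len(trainer_logprobs) != len(rollout_logprobs):
        raise ValueError("logprob lists must have the same length")
    return [math.exp(t - r) for t, r in zip(trainer_logprobs, rollout_logprobs)]


def sequence_policy_ratio(trainer_logprobs, rollout_logprobs):
    """Length-normalized sequence ratio: exp(mean(trainer - rollout)).

    Assumption: this mirrors the sequence-level ratio commonly used in GSPO.
    """
    if not trainer_logprobs:
        raise ValueError("need at least one token")
    if len(trainer_logprobs) != len(rollout_logprobs):
        raise ValueError("logprob lists must have the same length")
    diffs = [t - r for t, r in zip(trainer_logprobs, rollout_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical logprobs for one three-token rollout.
rollout = [-0.31, -1.20, -0.05]  # reported by the inference backend
trainer = [-0.31, -1.20, -0.05]  # recomputed by the trainer, same weights

ratios = token_policy_ratios(trainer, rollout)
mean_ratio = sum(ratios) / len(ratios)
seq_ratio = sequence_policy_ratio(trainer, rollout)
```

A mean ratio that drifts away from 1.0 on fresh rollouts, before any gradient step, points at the backend or the weight-sync path rather than at the RL objective.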
Summary
The reference run used vLLM 0.8.5, and the V1 runs used vLLM 0.18.1. The reported metrics came from a GSPO training run, the objective used for this experiment. One check was to verify that V1 returned rollout logprobs in the form the trainer expected. A semantic mismatch occurs when the backend returns logprobs whose meaning differs from what the trainer expects.
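To make the semantic mismatch concrete, the sketch below compares logprobs computed from unscaled logits with logprobs computed after temperature scaling. It assumes that "processed" means "after sampling-time transforms such as temperature", which is an interpretation rather than something the source spells out; the logits, temperature, and function names are hypothetical, and no vLLM API is called.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def raw_logprobs(logits):
    """Logprobs from the model's logits, ignoring sampling parameters."""
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    """Logprobs of the distribution actually sampled from (temperature only)."""
    return log_softmax([x / temperature for x in logits])


logits = [2.0, 1.0, 0.0]  # hypothetical logits for a 3-token vocabulary
temperature = 0.5         # hypothetical sampling temperature
token = 1                 # index of the sampled token

# Assumption: the trainer scores tokens under the temperature-scaled distribution.
trainer_lp = processed_logprobs(logits, temperature)[token]

# Backend reports processed logprobs: same meaning, so the ratio is 1.
ratio_processed = math.exp(trainer_lp - processed_logprobs(logits, temperature)[token])

# Backend reports raw logprobs: different meaning, so the ratio is off
# even though the weights are identical.
ratio_raw = math.exp(trainer_lp - raw_logprobs(logits)[token])
```

With these values `ratio_processed` is 1.0 while `ratio_raw` is well below 1.0, even though nothing about the policy has changed. That is the kind of gap a correction term would mask rather than fix.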