vLLM V0 to V1: Correctness Before Corrections in RL

Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.

Figure 1. Trainer-side metrics for the vLLM V0 reference (blue), the initial vLLM V1 attempt (red), and the final vLLM V1 run after the fixes (green), including the fp32 lm_head. The final V1 run returns close to the V0 trajectory across clip rate, KL, entropy, and reward.

TL;DR. The vLLM V1 run matched the vLLM V0 reference after the authors fixed four things: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and the fp32 lm_head used for the final projection.

Summary

The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1. The trainer-side metrics came from a GSPO training run, the objective used for this experiment. One step was to verify that V1 returned rollout logprobs in the form the trainer expected. A semantic mismatch occurs when the backend returns logprobs whose meaning differs from what the trainer expects.
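
To make the idea of a semantic mismatch concrete, here is a minimal sketch in plain Python. It is not the article's code. It assumes one plausible form of mismatch: the backend reports logprobs of the raw distribution (temperature 1), while the trainer expects logprobs of the temperature-scaled distribution the tokens were actually sampled from. The logits, token ids, temperature, and helper names are all hypothetical, and the length-normalized sequence ratio is only a simplified stand-in for a sequence-level objective such as GSPO.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

# Hypothetical toy data: logits for 3 decoding steps over a 4-token vocabulary.
step_logits = [
    [2.0, 1.0, 0.5, -1.0],
    [0.1, 3.0, 0.2, 0.0],
    [1.5, 1.4, -0.5, 0.3],
]
sampled = [0, 1, 1]   # token id chosen at each step (assumed)
temperature = 0.7     # sampling temperature (assumed)

def token_logprobs(temp):
    """Logprob of each sampled token under the distribution at `temp`."""
    return [log_softmax(row, temp)[tok] for row, tok in zip(step_logits, sampled)]

raw = token_logprobs(1.0)                # before temperature is applied
processed = token_logprobs(temperature)  # distribution actually sampled from
trainer = token_logprobs(temperature)    # trainer recomputation, same weights

def sequence_ratio(new, old):
    """Length-normalized sequence-level importance ratio."""
    return math.exp(sum(n - o for n, o in zip(new, old)) / len(new))

# With matching semantics and identical weights the ratio is exactly 1.
ratio_processed = sequence_ratio(trainer, processed)
# With raw logprobs the ratio drifts from 1 although nothing else changed.
ratio_raw = sequence_ratio(trainer, raw)

print(ratio_processed, ratio_raw)
```

In this toy setup the weights are identical on both sides, so any deviation of the ratio from 1 comes purely from the two sides meaning different things by "logprob". A deviation of that kind would show up in trainer-side metrics such as clip rate and KL.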

Read full article at Hugging Face →