Nvidia · Hugging Face
Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation
Compiled by KHAO Editorial — aggregated from 1 source + 2 references discovered via search. See llms.txt for citation guidance.
NVIDIA Cosmos Predict 2.5 is a large-scale world model capable of generating physically plausible videos conditioned on text, images, or video clips.
Key facts
- LoRA adapters are injected into the DiT's attention projections (to_q, to_k, to_v, to_out.0) and feedforward layers (ff.net.0.proj, ff.net.2)
- Conclusion: Training for 100 epochs (~2.5 hours on 8× H100s) is already sufficient to substantially improve all three metrics
- The team uses Cosmos Reason2 as an LLM judge, which scores each example from 1 to 5
- The team uses rank=32 as a starting point, resulting in ~50M trainable parameters
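The link between rank and trainable parameter count in the facts above can be checked with a little arithmetic. The following is a minimal sketch, not the guide's code: a LoRA adapter on a linear layer with d_in inputs and d_out outputs adds two small matrices holding rank × (d_in + d_out) parameters in total. The layer names and dimensions in `example_shapes` are hypothetical placeholders, not the real Cosmos Predict 2.5 shapes, and the suffix matching is an assumption about how target names are resolved.

```python
# Target suffixes taken from the key facts above.
TARGET_SUFFIXES = ("to_q", "to_k", "to_v", "to_out.0", "ff.net.0.proj", "ff.net.2")


def is_lora_target(module_name):
    """Return True if the module name equals or ends with a target suffix."""
    return any(
        module_name == suffix or module_name.endswith("." + suffix)
        for suffix in TARGET_SUFFIXES
    )


def lora_param_count(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * (d_in + d_out)


def total_lora_params(shapes, rank):
    """Sum LoRA parameters over every targeted layer in a {name: (d_in, d_out)} map."""
    return sum(
        lora_param_count(d_in, d_out, rank)
        for name, (d_in, d_out) in shapes.items()
        if is_lora_target(name)
    )


# HYPOTHETICAL layer names and sizes for one transformer block (illustration only).
example_shapes = {
    "blocks.0.attn1.to_q": (2048, 2048),
    "blocks.0.attn1.to_k": (2048, 2048),
    "blocks.0.attn1.to_v": (2048, 2048),
    "blocks.0.attn1.to_out.0": (2048, 2048),
    "blocks.0.ff.net.0.proj": (2048, 8192),
    "blocks.0.ff.net.2": (8192, 2048),
    "blocks.0.norm1": (2048, 2048),  # not a target, so it is skipped
}

if __name__ == "__main__":
    per_block = total_lora_params(example_shapes, rank=32)
    print(f"LoRA parameters in the hypothetical block at rank 32: {per_block:,}")
```

Because the count is linear in rank, doubling the rank doubles the adapter size while the frozen base weights stay untouched.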
Summary
Training robot policies requires demonstration data, but collecting real-robot trajectories is slow and expensive. This guide walks through parameter-efficient fine-tuning of Cosmos Predict 2.5 with LoRA and DoRA, using the diffusers and accelerate libraries with support for both single- and multi-GPU training. Because only the small adapter weights are trained, it is practical to fine-tune on a single GPU and flexibly swap adapters for different domains at inference. Software requirements: diffusers (which pulls in transformers and peft automatically) and accelerate. Hardware requirements: at minimum one 80 GB GPU for single-GPU training; 8× H100s are recommended for faster iteration.