NVIDIA introduces multi-environment text and omni RL training in Nemotron 3 Nano Omni
Compiled by KHAO Editorial — aggregated from 2 outlets. See llms.txt for citation guidance.
Omni RL trains the model to reason across images, video, audio, and text within a unified framework, covering tasks that range from single-modality to fully multimodal scenarios.
Key facts
- Efficiency highlights: compared to other open omni models with the same interactivity, Nemotron 3 Nano Omni delivers 7.4x higher system efficiency for multi-document use cases and 9.2x higher system efficiency in other reported scenarios
- The audio side is powered by Parakeet-TDT-0.6B-v2, connected to the backbone through its own 2-layer MLP projector (see the projector sketch after this list)
- The SFT stages are trained on NVIDIA H100 GPUs, scaling from 32 to 128 nodes depending on the stage
- The model backbone interleaves three key components: 23 Mamba selective state-space layers for efficient long-context processing; 23 MoE layers with 128 experts, top-6 routing, and a shared expert (see the routing sketch after this list); and self-attention layers
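The announcement does not describe the projector internals beyond "2-layer MLP". As a minimal illustrative sketch, a projector mapping Parakeet encoder features into the backbone's embedding space could look like the following; the dimensions `audio_dim=1024` and `model_dim=2688`, the GELU activation, and the class name are assumptions, not published values:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """2-layer MLP mapping audio-encoder features into the LLM embedding space.

    Dimensions and activation are illustrative placeholders, not the released config.
    """
    def __init__(self, audio_dim: int = 1024, model_dim: int = 2688):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, model_dim),  # layer 1: lift encoder features to backbone width
            nn.GELU(),                        # nonlinearity between the two layers
            nn.Linear(model_dim, model_dim),  # layer 2: refine in backbone space
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, time, audio_dim) from the Parakeet encoder
        return self.proj(audio_features)      # (batch, time, model_dim) audio "tokens"

# Example: project 10 frames of encoder output into the backbone embedding space.
frames = torch.randn(1, 10, 1024)
print(AudioProjector()(frames).shape)  # torch.Size([1, 10, 2688])
```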
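The MoE facts above (128 experts, top-6 routing, a shared expert) follow a standard routed-expert pattern. The sketch below shows one didactic way such a layer could work: the router picks 6 of 128 experts per token, and the shared expert processes every token unconditionally. The toy width, expert MLP shape, and naive per-token loop are assumptions for readability; the released model uses a fused, batched implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative MoE layer: 128 routed experts, top-6 selection, one shared expert."""
    def __init__(self, dim: int = 64, n_experts: int = 128, top_k: int = 6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        # The shared expert runs on every token, regardless of routing.
        self.shared = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Score all experts, keep the top 6 per token.
        scores = self.router(x)                                    # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)             # (tokens, top_k)
        weights = F.softmax(weights, dim=-1)                       # normalize kept scores
        outs = []
        for t in range(x.size(0)):                                 # naive per-token loop
            y = self.shared(x[t])                                  # shared-expert path
            for k in range(self.top_k):
                y = y + weights[t, k] * self.experts[int(idx[t, k])](x[t])
            outs.append(y)
        return torch.stack(outs)

x = torch.randn(4, 64)
print(MoELayer()(x).shape)  # torch.Size([4, 64])
```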
Summary
NVIDIA Nemotron 3 Nano Omni is a new omni-modal understanding model built for real-world document analysis, multi-image reasoning, automatic speech recognition, long audio-video understanding, agentic computer use, and general reasoning. It extends the Nemotron multimodal line from a strong vision-language system to a broader text + image + video + audio model. Nemotron 3 Nano Omni delivers best-in-class accuracy on complex document intelligence leaderboards such as MMLongBench-Doc and OCRBench v2, while also leading on video and audio leaderboards like WorldSense and DailyOmni. Under the hood, it combines the Nemotron 3 hybrid Mamba-Transformer Mixture-of-Experts backbone with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder. The architecture is designed to preserve fine visual detail, add native audio understanding, and scale to long multimodal contexts covering dense images, documents, videos, and mixed-modality reasoning.
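To visualize how these pieces fit together, here is a schematic sketch of the front-end data flow: each modality encoder feeds its own projector, and the projected embeddings join the text tokens before entering the hybrid backbone. Every class name, dimension, vocabulary size, and the simple concatenation scheme below are illustrative assumptions, not the released implementation, which interleaves modality spans at marked positions inside the prompt:

```python
import torch
import torch.nn as nn

class OmniInputAssembler(nn.Module):
    """Schematic omni front end: per-modality projectors feeding one token sequence.

    Stand-in for the real pipeline (C-RADIOv4-H vision path, Parakeet audio path);
    all shapes and modules here are toy placeholders.
    """
    def __init__(self, model_dim: int = 512, vision_dim: int = 1280, audio_dim: int = 1024):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, model_dim)  # stand-in for the vision projector
        self.audio_proj = nn.Sequential(                     # 2-layer MLP, per the model card
            nn.Linear(audio_dim, model_dim), nn.GELU(), nn.Linear(model_dim, model_dim)
        )
        self.text_embed = nn.Embedding(32000, model_dim)     # toy vocabulary size

    def forward(self, text_ids, vision_feats, audio_feats):
        # Simple concatenation for illustration; the real model interleaves
        # modality spans where they appear in the prompt.
        parts = [self.text_embed(text_ids),
                 self.vision_proj(vision_feats),
                 self.audio_proj(audio_feats)]
        return torch.cat(parts, dim=1)  # (batch, total_len, model_dim) -> backbone

assembler = OmniInputAssembler()
seq = assembler(torch.randint(0, 32000, (1, 8)),   # 8 text tokens
                torch.randn(1, 16, 1280),          # 16 vision patches
                torch.randn(1, 10, 1024))          # 10 audio frames
print(seq.shape)  # torch.Size([1, 34, 512])
```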