Sa2VA is an open-source multi-modal large language model (MLLM) developed by ByteDance that unifies dense segmentation, visual understanding, and language-based reasoning across both images and videos. It merges the segmentation power of a state-of-the-art video segmentation model (based on SAM‑2) with the vision-language reasoning of a strong LLM backbone (derived from models such as the InternVL2.5 / Qwen-VL series), yielding a system that can answer questions about visual content, perform referring segmentation, and maintain temporal consistency across video frames. With minimal instruction tuning, Sa2VA can handle prompts such as "segment the main subject," "what are the objects in this scene?", or "track this object through the video," outputting pixel-level masks or textual answers as appropriate.
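A minimal sketch of how such instruction-driven inference might be wired up. The checkpoint id, the `trust_remote_code` loading path, and the `<image>` prompt template are assumptions modeled on typical Hugging Face MLLM releases, not the confirmed Sa2VA API; consult the released inference scripts for the exact conventions.

```python
def build_referring_prompt(expression: str) -> str:
    """Build a referring-segmentation instruction for one image.

    The "<image>" placeholder convention is an assumption; the released
    inference scripts define the actual template.
    """
    return f"<image>\nPlease segment: {expression}"


def build_vqa_prompt(question: str) -> str:
    """Build a plain visual question-answering instruction."""
    return f"<image>\n{question}"


if __name__ == "__main__":
    # Loading real weights needs the published checkpoint and a GPU;
    # shown only as a sketch (identifiers below are assumptions).
    # from transformers import AutoModel, AutoTokenizer
    # model = AutoModel.from_pretrained("ByteDance/Sa2VA-4B", trust_remote_code=True)
    # tokenizer = AutoTokenizer.from_pretrained("ByteDance/Sa2VA-4B", trust_remote_code=True)
    print(build_referring_prompt("the man in the red jacket"))
```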
Features
- Unified image/video + language understanding: supports both visual question-answering and dense segmentation on images and videos
- Referring segmentation: given a natural-language prompt (like "segment the man in the red jacket"), it outputs precise segmentation masks aligned with the semantic intent
- Video-level temporal consistency: maintains stable segmentation/tracking of objects across frames in a video, useful for video editing, object tracking, or temporal analysis
- Multi-size model family (1B, 4B, 8B, 26B, etc.) to match different hardware/resource constraints or performance needs
- Open-source with pretrained weights, demo code, inference scripts, and evaluation tooling — ready to integrate or extend for custom applications
- Combines segmentation (from SAM-2) with strong language understanding (from the MLLM backbone), enabling complex multi-modal tasks (e.g., description + segmentation + reasoning) in one model
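The temporal-consistency property above can be checked downstream: given the per-frame binary masks a video segmentation model emits for one object, the mean IoU between consecutive frames measures how stable the mask is over time. This is a generic, hypothetical utility for consuming such outputs, not part of Sa2VA's own tooling.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

def temporal_consistency(masks) -> float:
    """Mean IoU between consecutive frame masks (1.0 = perfectly stable)."""
    return sum(mask_iou(m0, m1) for m0, m1 in zip(masks, masks[1:])) / (len(masks) - 1)

# Toy example: a 2x2 object shifting one pixel right per frame on a 4x4 grid.
frames = []
for t in range(3):
    m = np.zeros((4, 4), dtype=bool)
    m[1:3, t:t + 2] = True
    frames.append(m)

print(round(temporal_consistency(frames), 3))  # → 0.333
```

A score near 1.0 suggests stable tracking; a sudden drop flags frames where the mask jumped, which is useful when validating video-editing or tracking pipelines.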