Open Vision Agents by Stream is an open-source framework for building real-time, multimodal AI agents that watch, listen, and respond to live video streams. It combines video understanding models, such as YOLO and Roboflow-based detectors, with realtime LLM APIs like OpenAI Realtime and Gemini Live to create interactive experiences. The framework runs on Stream’s ultra-low-latency edge network, so agents can join sessions quickly and keep audio and video latency low while processing frames and generating responses. Developers work with an agent abstraction that connects video edge providers, LLMs, and processors into pipelines, making it easier to orchestrate tasks like object detection, pose estimation, and conversational guidance. The project ships SDKs for React, Android, iOS, Flutter, React Native, and Unity, so agents can be integrated into a wide variety of client environments, including mobile apps, web apps, and games.
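The pipeline idea above can be sketched in a few lines of plain Python. This is a self-contained illustration of the concept only: the names (`Agent`, `detect_objects`, the frame dict shape) are assumptions for the sketch, not the framework's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stand-in for a decoded video frame plus accumulated metadata.
Frame = dict

@dataclass
class Agent:
    # Hypothetical wiring: an edge session identifier, an LLM callable,
    # and an ordered chain of frame processors.
    edge_provider: str
    llm: Callable[[Frame], str]
    processors: list[Callable[[Frame], Frame]] = field(default_factory=list)

    def on_frame(self, frame: Frame) -> str:
        # Each processor enriches the frame (detections, poses, custom
        # state); the LLM then turns the enriched state into a response.
        for proc in self.processors:
            frame = proc(frame)
        return self.llm(frame)

def detect_objects(frame: Frame) -> Frame:
    # Placeholder detector standing in for a YOLO/Roboflow integration.
    frame["objects"] = ["person", "ball"]
    return frame

agent = Agent(
    edge_provider="stream-edge",  # illustrative label, not a real endpoint
    llm=lambda f: f"I can see: {', '.join(f.get('objects', []))}",
    processors=[detect_objects],
)

print(agent.on_frame({"pixels": b""}))  # → "I can see: person, ball"
```

In the real framework the LLM call and frame delivery are asynchronous and stream over the edge network; the sketch keeps everything synchronous to show only the processor-chain-then-LLM ordering.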
Features
- Framework for multimodal agents that process live video, audio, and text
- Integrations with YOLO, Roboflow, and real-time LLMs like OpenAI Realtime and Gemini Live
- Ultra-low-latency streaming via Stream’s global edge network
- Agent abstraction with processors for detection, pose, and custom logic
- SDKs for React, Android, iOS, Flutter, React Native, and Unity
- Ready-made examples for sports coaching, safety monitoring, and interactive apps
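To make the "processors for detection, pose, and custom logic" feature concrete, here is a hedged sketch of a custom processor in the safety-monitoring vein: a plain callable that enriches per-frame state. The `Processor` protocol and the field names are assumptions for illustration, not the framework's interface.

```python
from typing import Protocol

class Processor(Protocol):
    # Assumed shape: a processor takes per-frame state and returns it enriched.
    def __call__(self, state: dict) -> dict: ...

def safety_monitor(state: dict) -> dict:
    # Custom logic example: raise an alert when a person is detected
    # without a hard hat (labels are hypothetical detector outputs).
    detections = set(state.get("objects", []))
    state["alert"] = "person" in detections and "hardhat" not in detections
    return state

state = safety_monitor({"objects": ["person", "ladder"]})
print(state["alert"])  # → True
```

A processor like this would sit in the agent's pipeline alongside detection and pose steps, so the conversational LLM can mention the alert in its next response.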