Framework for building real-time voice and multimodal AI agents
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Speech-to-text (voice recognition) tool
Large Multimodal Models for Video Understanding and Editing
A Python tool that uses GPT-4, FFmpeg, and OpenCV
Code and models for the ICML 2024 paper NExT-GPT
AI framework for automated short-video creation and editing
Code for running inference and fine-tuning with the SAM 3 model
Powerful open-source team chat application
Search all of YouTube from the command line
Multimodal embedding and reranking models built on Qwen3-VL
Multimodal Diffusion with Representation Alignment
Label Studio is a multi-type data labeling and annotation tool
A Multi-Modal World Model for Reconstructing, Generating, and Simulating
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Generating Immersive, Explorable, and Interactive 3D Worlds
Use Microsoft Edge's online text-to-speech service from Python
Uses Qwen3-ASR, a local LLM, Whisper, and TEN-VAD
Qwen2.5-VL is a multimodal large language model series
Automatically translates the text of a video based on a subtitle file
OCR expert VLM powered by Hunyuan's native multimodal architecture
Public opinion analysis system
Data infrastructure for multimodal AI workloads
Build multimodal language agents for fast prototyping and production
21 Lessons, Get Started Building with Generative AI