Large Multimodal Models for Video Understanding and Editing
Voice Recognition to Text Tool
A Python tool that uses GPT-4, FFmpeg, and OpenCV
Code and models for the ICML 2024 paper NExT-GPT
AI framework for automated short video creation and editing tools
Code for running inference and finetuning with the SAM 3 model
Search all of YouTube from the command line
Multimodal embedding and reranking models built on Qwen3-VL
Multimodal Diffusion with Representation Alignment
Label Studio is a multi-type data labeling and annotation tool
A Multi-Modal World Model for Reconstruction, Generation, and Simulation
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Generating Immersive, Explorable, and Interactive 3D Worlds
Use Microsoft Edge's online text-to-speech service from Python
Uses Qwen3-ASR, local LLM, Whisper, TEN-VAD
Qwen2.5-VL is a multimodal large language model series
Automatically translates the text of a video based on a subtitle file
OCR expert VLM powered by Hunyuan's native multimodal architecture
Public opinion analysis system
21 Lessons, Get Started Building with Generative AI
Data Infrastructure providing an approach to multimodal AI workloads
Build multimodal language agents for fast prototyping and production
A Pioneering Open-Source Alternative to GPT-4o
A Web UI for easy subtitle generation using the Whisper model
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning