A state-of-the-art open visual language model
Generate audiobooks from EPUBs, PDFs and text with captions
Easily turn large sets of image URLs into an image dataset
A robust, efficient, low-latency speech-to-text library
Abstraction layer over YouTube's internal API
Simple HTML5, YouTube and Vimeo player
Towards Real-World Vision-Language Understanding
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Let's use AI to Earn
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
CLIP: predict the most relevant text snippet given an image
Software version control visualization
4M: Massively Multimodal Masked Modeling
A simple screen parsing tool towards a pure vision-based GUI agent
OpenAI Swift async text-to-image for SwiftUI apps
ShanaEncoder is an audio/video encoding program based on FFmpeg.
An enhanced HTML5 file input for Bootstrap 5.x/4.x/3.x
A standalone lightweight auxiliary CLI video player for BlackVideo.
Implementation of Dreambooth
Packages with more than 80 components for all Delphi versions
An open-source framework for training large multimodal models
A convenient and easy-to-use image viewer for your iOS app
The ultimate tool to automate custom telegram message forwarding