Showing 213 open source projects for "speech to text"

  • 1
    Fish Speech

    SOTA Open Source TTS

    Fish Speech is a state-of-the-art open-source text-to-speech project that has evolved into the OpenAudio series of advanced TTS models. The repository hosts the code and tooling for training, fine-tuning, and serving high-quality TTS, while the current flagship models (OpenAudio-S1 and S1-mini) are distributed via Fish Audio’s playground and Hugging Face.
    Downloads: 13 This Week
  • 2
    Hugging Face - Speech To Speech

    Open speech-to-speech models and pipelines from Hugging Face

    This project from Hugging Face focuses on enabling direct speech-to-speech processing using modern machine learning models. It provides tools and reference implementations that allow audio input to be transformed into audio output without requiring an intermediate text representation. Hugging Face - Speech To Speech builds on recent advances in speech modeling, combining components such as speech recognition, translation, and synthesis into unified pipelines. ...
    Downloads: 2 This Week
  • 3
    Speech-AI-Forge

    Speech-AI-Forge is a project built around TTS generation models

    Speech-AI-Forge is a full-stack project built around modern text-to-speech generation models, providing both an API server and a Gradio-based web UI for interactive use. At its core, it acts as a hub that wires together multiple speech-related capabilities, including TTS, speech-to-text and LLM-based control flows, behind a consistent interface.
    Downloads: 2 This Week
  • 4
    Whisper

    Robust Speech Recognition via Large-Scale Weak Supervision

    OpenAI Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is a multitasking Transformer sequence-to-sequence model trained on multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented...
    Downloads: 69 This Week
  • 5
    MLX-Audio

    A text-to-speech, speech-to-text and speech-to-speech library

    MLX-Audio is a speech library built on Apple’s MLX framework and optimized for Apple Silicon machines (M-series Macs). It focuses on text-to-speech and speech-to-speech workflows, with APIs and a command-line interface that make it easy to generate high-quality audio from text. Because it uses MLX and targets Apple Silicon, inference is fast and can take advantage of hardware acceleration and quantization for efficient on-device performance. ...
    Downloads: 13 This Week
  • 6
    abogen

    Generate audiobooks from EPUBs, PDFs and text with captions

    abogen is a tool designed to generate audiobooks (or speech narrations) from textual sources such as EPUBs, PDFs, or plain text, with synchronized captions. In other words, it automates the pipeline of reading a digital book or document, converting its text into speech via a TTS engine, and packaging the result into an audiobook format, along with captions that align with the spoken audio.
    Downloads: 8 This Week
  • 7
    Voice-Pro

    Comprehensive Gradio WebUI for audio processing

    Voice-Pro is a Gradio WebUI for transcription, translation, and text-to-speech. It installs with one click into a Miniconda virtual environment that runs completely separate from the Windows system (fully portable). It supports real-time transcription and translation, as well as batch mode.
    Downloads: 36 This Week
  • 8
    ChatTTS

    A generative speech model for daily dialogue

    ChatTTS is an open-source conversational text-to-speech model optimized for dialogue, developed by 2Noise. Trained on 100,000+ hours of English and Chinese conversation data, it excels at generating expressive prosody—pauses, interjections, laughter—for more natural-sounding speech synthesis in assistant and chatbot applications.
    Downloads: 6 This Week
  • 9
    RealtimeSTT

    A robust, efficient, low-latency speech-to-text library

    RealtimeSTT is a Python-based realtime speech-to-text engine emphasizing low latency, wake-word detection, voice activity detection, and automatic speech segmentation. It provides asynchronous callbacks, nanosecond-precision timestamps, and CLI tools, suitable for building voice assistants, meeting transcribers, or live caption systems.
    Downloads: 5 This Week
  • 10
    VoxCPM2

    Tokenizer-Free TTS for Multilingual Speech Generation

    VoxCPM2 is an advanced open-source text-to-speech system that redefines speech synthesis by eliminating traditional tokenization and instead generating continuous speech representations through a diffusion-based autoregressive architecture. Built on top of the MiniCPM model family, it enables highly natural, expressive, and context-aware speech generation that adapts tone, emotion, and pacing directly from input text.
    Downloads: 21 This Week
  • 11
    SpeechRecognition

    Speech recognition module for Python

    A library for performing speech recognition, with support for several engines and APIs, online and offline. It can recognize speech input from the microphone, transcribe an audio file, and save audio data to an audio file. It can also show extended recognition results, calibrate the recognizer energy threshold for ambient noise levels (see recognizer_instance.energy_threshold for details), and listen to a microphone in the background, among other recognizer features. The easiest way to install this is using...
    Downloads: 9 This Week
  • 12
    Parlant

    The behavior guidance framework for customer-facing LLM agents

    Parlant is a framework for guiding and controlling the behavior of customer-facing LLM agents.
    Downloads: 8 This Week
  • 13
    VoxCPM

    TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

    VoxCPM is a tokenizer-free text-to-speech system that models speech in a continuous space, aiming for extremely realistic, context-aware synthesis and true-to-life zero-shot voice cloning. Instead of converting speech into discrete tokens, it uses an end-to-end diffusion-autoregressive architecture built on the MiniCPM-4 backbone, combining hierarchical language modeling, finite scalar quantization (FSQ), and local Diffusion Transformers.
    Downloads: 58 This Week
  • 14
    TADA

    Open Source Speech Language Model

    TADA is an open-source speech-language modeling framework designed to unify spoken audio and text representations within a single generative architecture. The system focuses on aligning speech and text streams using a dual-alignment mechanism that synchronizes the acoustic signal with its textual representation. By modeling both modalities together, the framework allows developers to build systems capable of generating, understanding, and transforming speech and language simultaneously. ...
    Downloads: 0 This Week
  • 15
    StreamSpeech

    StreamSpeech is a seamless model for offline and simultaneous speech recognition, translation, and synthesis

    ...The model supports eight tasks: offline ASR, speech-to-text translation, speech-to-speech translation, and TTS, as well as their streaming or simultaneous counterparts, all handled by the same underlying system. During simultaneous translation, StreamSpeech can optionally output intermediate ASR transcripts and text translations, giving users or downstream applications real-time visibility into what the system is hearing and how it is translating.
    Downloads: 0 This Week
  • 16
    Qwen3-TTS

    Qwen3-TTS is an open-source series of TTS models

    Qwen3-TTS is an open-source text-to-speech (TTS) project built around the Qwen3 large language model family, focused on generating high-quality, natural-sounding speech from plain text input. It provides researchers and developers with tools to transform text into expressive, intelligible audio, supporting multiple languages and voice characteristics tuned for clarity and fluidity.
    Downloads: 9 This Week
  • 17
    Spark TTS

    Spark-TTS Inference Code

    Spark TTS is an open-source, PyTorch-based text-to-speech inference system that leverages large language models to produce highly natural, intelligible speech from text input. It uses an efficient single-stream architecture where speech tokens are directly reconstructed from the predictions of an LLM, removing the need for external acoustic models or complex vocoders and making the generation pipeline cleaner and faster.
    Downloads: 2 This Week
  • 18
    Matcha-TTS

    A fast TTS architecture with conditional flow matching

    Matcha-TTS is a non-autoregressive neural text-to-speech architecture that uses conditional flow matching to generate speech quickly while maintaining natural quality. It models speech as an ODE-based generative process, and conditional flow matching lets it reach high-quality audio in only a few synthesis steps, which greatly reduces latency compared to score-matching diffusion approaches.
    Downloads: 5 This Week
  • 19
    Qwen3-Omni

    Qwen3-Omni is a natively end-to-end, omni-modal LLM

    Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model that processes text, images, audio, and video and delivers real-time streaming responses in text and natural speech. It uses a Thinker-Talker architecture with a Mixture-of-Experts (MoE) design, early text-first pretraining, and mixed multimodal training to support strong performance across all modalities without sacrificing text or image quality. The model supports 119 text languages, 19 speech input languages, and 10 speech output languages. ...
    Downloads: 2 This Week
  • 20
    WhisperX

    Automatic Speech Recognition with Word-level Timestamps

    ...Its architecture combines multiple components to enhance both performance and usability in real-world transcription tasks. Overall, WhisperX provides a more robust and scalable solution for high-quality speech-to-text applications.
    Downloads: 15 This Week
  • 21
    pyttsx3

    Offline text-to-speech synthesis for Python

    pyttsx3 is an offline text-to-speech library for Python that wraps native speech engines instead of calling cloud APIs. It is designed to work entirely without an internet connection, making it suitable for local automation, kiosks, accessibility tools, and embedded applications. On Windows it uses SAPI5, on Linux it typically uses eSpeak or eSpeak-NG, and on macOS it can use NSSpeechSynthesizer or AVSpeechSynthesizer, giving it broad cross-platform compatibility. ...
    Downloads: 17 This Week
  • 22
    edge-tts

    Use Microsoft Edge's online text-to-speech service from Python

    edge-tts is a Python module and command-line tool that gives you direct access to Microsoft Edge’s online text-to-speech service without needing the Edge browser, Windows, or any API key. It wraps the same cloud voices used by Edge, exposing them through a simple CLI (edge-tts, edge-playback) and a Python API, so you can script high-quality speech generation in your own applications. The tool lets you list available voices, specify locale and voice name, and generate audio files in common formats like MP3 or WAV. ...
    Downloads: 23 This Week
  • 23
    IndexTTS2

    Industrial-level controllable zero-shot text-to-speech system

    IndexTTS is a modern, zero-shot text-to-speech (TTS) system engineered to deliver high-quality, natural-sounding speech synthesis with few requirements and strong voice-cloning capabilities. It builds on state-of-the-art models such as XTTS and other modern neural TTS backbones, improving them with a conformer-based speech conditional encoder and upgrading the decoder to a high-quality vocoder (BigVGAN2), leading to clearer and more natural audio output. ...
    Downloads: 7 This Week
  • 24
    Pocket TTS

    A TTS that fits in your CPU (and pocket)

    Pocket TTS is a lightweight text-to-speech project designed to run efficiently on CPUs, targeting developers who want local speech generation without depending on GPUs or hosted web APIs. It is built to feel practical in everyday applications, where installation and usage should be as simple as adding a dependency and calling a function. The project focuses on keeping the runtime footprint manageable while still producing natural-sounding speech, which makes it attractive for offline tools, prototypes, and privacy-sensitive workflows. ...
    Downloads: 8 This Week
  • 25
    Text To Speech Unlimited

    Unlimited text-to-speech conversion

    Convert text to speech with no word limit, with adjustable reading speed and voice.
    Downloads: 0 This Week