audio synthesis free download

Showing 12 open source projects for "audio synthesis"

1

Step-Audio

Open-source framework for intelligent speech interaction

Step-Audio is a unified, open-source framework for building intelligent speech systems that combine comprehension and generation. It integrates large language models (LLMs) with speech input and output to handle not only semantic understanding but also vocal characteristics such as tone, style, dialect, emotion, and prosody. Rather than chaining separate components (ASR → text model → TTS), it offers a multimodal model that ingests speech or audio and produces speech in response, enabling natural dialogue, voice cloning, and expressive speech synthesis. ...

Downloads: 4 This Week

Last Update: 2026-03-16
See Project
2

Step-Audio 2

Multi-modal large language model designed for audio understanding

Step-Audio2 is an advanced, end-to-end multimodal large language model designed for high-fidelity audio understanding and natural speech conversation. Unlike pipelines that separate speech recognition, processing, and synthesis, Step-Audio2 processes raw audio, reasons about semantic and paralinguistic content (such as emotion, speaker characteristics, and non-verbal cues), and can generate contextually appropriate responses, potentially including generated or transformed audio output. ...

Downloads: 0 This Week

Last Update: 2026-03-16
See Project
3

GLM-TTS

Controllable & emotion-expressive zero-shot TTS

GLM-TTS is an advanced text-to-speech synthesis system built on large language model technologies that focuses on producing high-quality, expressive, and controllable spoken output, with features such as emotion modulation and zero-shot voice cloning. It uses a two-stage architecture: a generative LLM first converts text into intermediate speech token sequences, and a flow-based neural model then converts those tokens into natural audio waveforms, enabling rich prosody and voice character even for unseen speakers. ...

Downloads: 2 This Week

Last Update: 2026-04-10
See Project
4

Qwen2.5-Omni

Capable of understanding text, audio, vision, video

...Very strong benchmark performance across modalities (audio understanding, speech recognition, image/video reasoning), often matching or outperforming single-modality models at a similar scale. Real-time streaming responses, including natural speech synthesis (text-to-speech), and chunked inputs for low-latency interaction.

Downloads: 1 This Week

Last Update: 2025-09-23
See Project
5

HY-World 1.5

A Systematic Framework for Interactive World Modeling

...It blends advanced reasoning with multimodal synthesis, enabling agents to describe scenes, generate context-appropriate responses, and contribute to narrative or gameplay flows. The underlying framework typically supports large-context state tracking across extended interactions, blending temporal and spatial multimodal signals.

Downloads: 19 This Week

Last Update: 5 days ago
See Project
6

TADA

Open Source Speech Language Model

...This approach can support applications such as conversational AI, speech synthesis, multimodal language modeling, and speech understanding systems. The project explores ways to treat speech and text as integrated data streams rather than separate pipelines, enabling more coherent interactions between language and audio. Because it operates as a generative framework, TADA can be used for research into advanced speech-language models and multimodal artificial intelligence systems.

Downloads: 0 This Week

Last Update: 2026-03-24
See Project
7

IndexTTS2

Industrial-level controllable zero-shot text-to-speech system

IndexTTS is a modern, zero-shot text-to-speech (TTS) system engineered to deliver high-quality, natural-sounding speech synthesis with few requirements and strong voice-cloning capabilities. It builds on state-of-the-art models such as XTTS and other modern neural TTS backbones, improving them with a conformer-based speech conditional encoder and upgrading the decoder to a high-quality vocoder (BigVGAN2), leading to clearer and more natural audio output.

Downloads: 5 This Week

Last Update: 2025-11-27
See Project
8

Kitten TTS

State-of-the-art TTS model under 25MB

KittenTTS is an open-source, ultra-lightweight, high-quality text-to-speech model with just 15 million parameters and a binary size under 25 MB. It is designed for real-time, CPU-based deployment across diverse platforms. Features: CPU-optimized, running without a GPU on any device; high-quality voices, with several premium voice options available; fast inference, optimized for real-time speech synthesis.

Downloads: 17 This Week

Last Update: 2026-02-24
See Project
9

Qwen3-TTS

Qwen3-TTS is an open-source series of TTS models

Qwen3-TTS is an open-source text-to-speech (TTS) project built around the Qwen3 large language model family, focused on generating high-quality, natural-sounding speech from plain text input. It provides researchers and developers with tools to transform text into expressive, intelligible audio, supporting multiple languages and voice characteristics tuned for clarity and fluidity. The project includes pre-trained models and inference scripts that let users synthesize speech locally or...

Downloads: 11 This Week

Last Update: 2026-03-17
See Project
10

GLM-4-Voice

End-to-End Chinese-English Conversational Model

GLM-4-Voice is an open-source speech-enabled model from ZhipuAI, extending the GLM-4 family into the audio domain. It integrates advanced voice recognition and generation with the multimodal reasoning capabilities of GLM-4, enabling smooth natural interaction via spoken input and output. The model supports real-time speech-to-text transcription, spoken dialogue understanding, and text-to-speech synthesis, making it suitable for conversational AI, virtual assistants, and accessibility applications. ...

Downloads: 2 This Week

Last Update: 3 days ago
See Project
11

CSM (Conversational Speech Model)

A Conversational Speech Generation Model

The CSM (Conversational Speech Model) is a speech generation model developed by Sesame AI that creates RVQ audio codes from text and audio inputs. It uses a Llama backbone and a smaller audio decoder to produce audio codes for realistic speech synthesis. The model has been fine-tuned for interactive voice demos and is hosted on platforms like Hugging Face for testing. CSM offers a flexible setup and is compatible with CUDA-enabled GPUs for efficient execution.

Downloads: 5 This Week

Last Update: 2025-03-19
See Project
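
As a rough illustration of the RVQ (residual vector quantization) audio codes mentioned in the CSM entry above, here is a minimal pure-Python sketch. The codebooks, the input vector, and the function names (`nearest`, `rvq_encode`, `rvq_decode`) are illustrative assumptions, not CSM's actual codec or API. The idea shown is only the general technique: each stage quantizes whatever error the previous stage left behind, and decoding sums the selected entries.

```python
# Toy residual vector quantization (RVQ) in pure Python.
# All codebook values below are made up for illustration; a real neural
# audio codec learns its codebooks and works on much larger vectors.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    codes = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage codebooks over 2-dimensional vectors.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage
]

codes = rvq_encode(CODEBOOKS, [1.2, 0.9])
approx = rvq_decode(CODEBOOKS, codes)
print(codes, approx)  # [1, 1] [1.25, 1.0]
```

Each input vector is thus represented by one small integer per stage, which is the kind of discrete code a language-model backbone can predict before a decoder turns the codes back into audio.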
12

VALL-E

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech)

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. ...

Downloads: 0 This Week

Last Update: 2023-04-14
See Project