Showing 76 open source projects for "inference engine"

  • 1
    Transformer Engine

    A library for accelerating Transformer models on NVIDIA GPUs

    Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower memory utilization in both training and inference. TE provides a collection of highly optimized building blocks for popular Transformer architectures and an automatic-mixed-precision-style API that can be used seamlessly with your framework-specific code; a minimal usage sketch follows this entry.
    Downloads: 2 This Week
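
    As a quick illustration of the mixed-precision API mentioned above, here is a minimal sketch using TE's PyTorch bindings. The layer sizes and input are placeholders, and FP8 execution requires a supported GPU (e.g., Hopper); treat this as a sketch of the documented usage pattern, not a complete program.

      # Minimal sketch of Transformer Engine's PyTorch API (sizes are placeholders)
      import torch
      import transformer_engine.pytorch as te
      from transformer_engine.common import recipe

      # te.Linear is a drop-in replacement for torch.nn.Linear with FP8 support
      layer = te.Linear(768, 768, bias=True).cuda()
      fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

      x = torch.randn(16, 768, device="cuda")
      with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
          y = layer(x)  # the GEMM runs in FP8 on supported hardware
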
  • 2
    MLX Engine

    LM Studio's Apple MLX engine

    MLX Engine is the Apple MLX-based inference backend used by LM Studio to run large language models efficiently on Apple Silicon hardware. Built on top of the mlx-lm and mlx-vlm ecosystems, the engine provides a unified architecture that supports both text-only and multimodal models. Its design focuses on high-performance on-device inference, leveraging Apple’s MLX stack to accelerate computation on M-series chips; a sketch of the underlying mlx-lm API follows this entry.
    Downloads: 1 This Week
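
    MLX Engine itself is consumed through LM Studio, but it builds on mlx-lm, whose documented Python API gives a feel for the underlying stack. A minimal sketch, assuming the mlx-lm package is installed; the model id is an example from the mlx-community hub:

      # Sketch of the mlx-lm API that MLX Engine builds on (model id is an example)
      from mlx_lm import load, generate

      model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
      text = generate(model, tokenizer,
                      prompt="Explain KV caching in one sentence.",
                      max_tokens=100)
      print(text)
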
  • 3
    Temporal Inference Engine

    A real-time inference engine for temporal logic specifications

    A real-time inference engine for temporal logic specifications that can acquire, process, and generate any binary or real-valued signal through POSIX IPC, files, or UNIX sockets. Specifications of signals and dynamic systems are represented as special graphs and executed in real time, with a predictable sampling time of a few milliseconds. Real-time signal processing, dynamic system control, state machine modeling, and logical property verification are some of this software's fields of application. ...
    Downloads: 0 This Week
  • 4
    ReactiveMP.jl

    High-performance reactive message-passing-based Bayesian inference engine

    ReactiveMP.jl is a Julia package that provides an efficient, reactive message-passing-based Bayesian inference engine on a factor graph. The package is part of a larger, user-friendly ecosystem for automatic Bayesian inference called RxInfer. While ReactiveMP.jl exports only the inference engine, RxInfer provides convenient tools for specifying models and inference constraints, as well as routines for running efficient inference on both static and real-time datasets.
    Downloads: 2 This Week
  • 5
    vLLM

    A high-throughput and memory-efficient inference and serving engine

    vLLM is a fast and easy-to-use library for LLM inference and serving. It delivers high-throughput serving with various decoding algorithms, including parallel sampling and beam search; a minimal usage sketch follows this entry.
    Downloads: 43 This Week
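
    A minimal sketch of vLLM's offline batch-generation API, following its documented quickstart pattern; the model id is an example and any supported Hugging Face model works:

      # Offline batched generation with vLLM (model id is an example)
      from vllm import LLM, SamplingParams

      llm = LLM(model="facebook/opt-125m")
      params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
      outputs = llm.generate(["The capital of France is"], params)
      for out in outputs:
          print(out.outputs[0].text)
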
  • 6
    SimpleLLM

    A 950-line, minimal, extensible LLM inference engine built from scratch

    SimpleLLM is a minimal, extensible large language model inference engine implemented in roughly 950 lines of code, built from scratch to serve both as a learning tool and a research platform for novel inference techniques. It provides the core components of an LLM runtime—such as tokenization, batching, and asynchronous execution—without the abstraction overhead of more complex engines, making it easier for developers and researchers to understand and modify. ...
    Downloads: 2 This Week
  • 7
    Nano-vLLM

    A lightweight vLLM implementation built from scratch

    Nano-vLLM is a lightweight implementation of the vLLM inference engine designed to run large language models efficiently while maintaining a minimal and readable codebase. The project recreates the core functionality of vLLM in a simplified architecture written in approximately a thousand lines of Python, making it easier for developers and researchers to understand how modern LLM inference systems work.
    Downloads: 2 This Week
  • 8
    uzu

    A high-performance inference engine for AI models

    uzu is a high-performance inference engine designed to run artificial intelligence models efficiently on Apple Silicon hardware. Written primarily in Rust and leveraging Apple’s Metal framework, the project focuses on maximizing performance when executing large language models and other AI workloads on devices such as Mac computers with M-series chips. The engine implements a hybrid architecture in which model layers can be executed either as custom GPU kernels or through Apple’s MPSGraph API, allowing it to balance performance and compatibility depending on the workload. ...
    Downloads: 0 This Week
  • 9
    gemma.cpp

    Lightweight, standalone C++ inference engine for Google's Gemma models

    Gemma.cpp is a C++ implementation for running inference with Gemma models efficiently on CPUs and GPUs. Developed by Google, it allows running large language models (LLMs) like Gemma with minimal hardware, focusing on optimized performance and low latency. Gemma.cpp is intended for developers seeking to deploy LLMs in production environments without needing massive computational resources.
    Downloads: 2 This Week
  • 10
    Jlama

    Jlama is a modern LLM inference engine for Java

    Jlama is a modern inference engine written entirely in Java that enables developers to run large language models locally within Java applications. Unlike frameworks that rely on external APIs or remote services, Jlama performs inference directly on the local machine using pre-trained models. This allows organizations to integrate generative AI features into their systems while maintaining full control over data privacy and infrastructure.
    Downloads: 0 This Week
  • 11
    RTP-LLM

    Alibaba's high-performance LLM inference engine for diverse apps

    RTP-LLM is an open-source large language model inference acceleration engine developed by Alibaba to provide high-performance serving infrastructure for modern LLM deployments. The system focuses on improving throughput, latency, and resource utilization when running large models in production environments. It achieves this by implementing optimized GPU kernels, batching strategies, and memory management techniques tailored for transformer inference workloads. ...
    Downloads: 0 This Week
  • 12
    Chitu

    High-performance inference framework for large language models

    Chitu is a high-performance inference engine designed to deploy and run large language models efficiently in production environments. The framework focuses on improving efficiency, flexibility, and scalability for organizations that need to run LLM inference workloads across different hardware platforms. It supports heterogeneous computing environments, including CPUs, GPUs, and various specialized AI accelerators, allowing models to run across a wide range of infrastructure configurations. ...
    Downloads: 0 This Week
  • 13
    HunyuanWorld-Voyager

    RGBD video generation model conditioned on camera input

    ...The system jointly produces aligned RGB and depth video sequences, making it directly applicable to 3D reconstruction tasks. At its core, Voyager integrates a world-consistent video diffusion model with an efficient long-range world exploration engine powered by auto-regressive inference. To support training, the team built a scalable data engine that automatically curates large video datasets with camera pose estimation and metric depth prediction. As a result, Voyager delivers state-of-the-art performance on world exploration benchmarks while maintaining photometric, style, and 3D consistency.
    Downloads: 17 This Week
  • 14
    SAM 3

    Code for running inference and fine-tuning with the SAM 3 model

    SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts like “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks) and returns high-quality masks, boxes, and scores for the requested concepts. Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an...
    Downloads: 51 This Week
  • 15
    AI Runner

    Offline inference engine for art, real-time voice conversations

    AI Runner is an offline inference engine designed to run a collection of AI workloads on your own machine, including image generation for art, real-time voice conversations, LLM-powered chatbots and automated workflows. It is implemented as a desktop-oriented Python application and emphasizes privacy and self-hosting, allowing users to work with text-to-speech, speech-to-text, text-to-image and multimodal models without sending data to external services.
    Downloads: 12 This Week
  • 16
    mllm

    Fast Multimodal LLM on Mobile Devices

    mllm is an open-source inference engine designed to run multimodal large language models efficiently on mobile devices and edge computing environments. The framework focuses on delivering high-performance AI inference in resource-constrained systems such as smartphones, embedded hardware, and lightweight computing platforms. Implemented primarily in C and C++, it is designed to operate with minimal external dependencies while taking advantage of hardware-specific acceleration technologies such as ARM NEON and x86 AVX2 instructions. ...
    Downloads: 1 This Week
  • 17
    llama2.c

    Inference Llama 2 in one file of pure C

    llama2.c is a minimalist implementation of the Llama 2 language model architecture designed to run entirely in pure C. Created by Andrej Karpathy, this project offers an educational and lightweight framework for performing inference on small Llama 2 models without external dependencies. It provides a full training and inference pipeline: models can be trained in PyTorch and later executed using a concise 700-line C program (run.c). While it can technically load Meta’s official Llama 2...
    Downloads: 3 This Week
  • 18
    CTranslate2

    Fast inference engine for Transformer models

    CTranslate2 is a C++ and Python library for efficient inference with Transformer models. The project implements a custom runtime that applies many performance optimization techniques, such as weight quantization, layer fusion, and batch reordering, to accelerate and reduce the memory usage of Transformer models on CPU and GPU. The execution is significantly faster and requires fewer resources than general-purpose deep learning frameworks on supported models and tasks thanks to many... A minimal usage sketch follows this entry.
    Downloads: 13 This Week
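
    A minimal sketch of the CTranslate2 Python API for a translation model. The model directory is a placeholder for a model previously converted with CTranslate2's converter tools, and the input must already be tokenized (here, SentencePiece pieces):

      # Batch translation with a converted CTranslate2 model (path and tokens are placeholders)
      import ctranslate2

      translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")
      results = translator.translate_batch([["▁Hello", "▁world", "!"]])
      print(results[0].hypotheses[0])  # list of output tokens
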
  • 19
    Pruna AI

    Pruna is a model optimization framework built for developers

    Pruna is an open-source, self-hostable model optimization framework designed to help developers make machine learning models smaller, faster, and cheaper to run. Built with performance and developer ergonomics in mind, Pruna combines optimization techniques such as quantization, pruning, caching, and compilation behind a simple interface, and is compatible with popular open-source models.
    Downloads: 0 This Week
  • 20
    MLC LLM

    Universal LLM Deployment Engine with ML Compilation

    MLC LLM is a machine learning compiler and deployment framework designed to enable efficient execution of large language models across a wide range of hardware platforms. The project focuses on compiling models into optimized runtimes that can run natively on devices such as GPUs, mobile processors, browsers, and edge hardware. By leveraging machine learning compilation techniques, mlc-llm produces high-performance inference engines that maintain consistent APIs across platforms. The system... A sketch of its Python engine API follows this entry.
    Downloads: 22 This Week
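
    A sketch of MLC LLM's Python engine following its documented quickstart; the model id is an example of a pre-compiled MLC model, and the chat interface mirrors OpenAI's API:

      # Streaming chat with MLC LLM's Python engine (model id is an example)
      from mlc_llm import MLCEngine

      model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
      engine = MLCEngine(model)
      for chunk in engine.chat.completions.create(
          messages=[{"role": "user", "content": "What is an inference engine?"}],
          model=model,
          stream=True,
      ):
          for choice in chunk.choices:
              print(choice.delta.content, end="", flush=True)
      engine.terminate()
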
  • 21
    wllama

    WebAssembly binding for llama.cpp - Enabling on-browser LLM inference

    wllama is a WebAssembly-based library that enables large language model inference directly inside a web browser. Built as a binding for the llama.cpp inference engine, the project allows developers to run LLM models locally without requiring a server backend or dedicated GPU hardware. The library leverages WebAssembly SIMD capabilities to achieve efficient execution within modern browsers while maintaining compatibility across platforms.
    Downloads: 1 This Week
  • 22
    Mooncake

    Mooncake is the serving platform for Kimi

    ...The platform was originally developed as part of the serving infrastructure for the Kimi large language model system. Its architecture centers on a high-performance transfer engine that provides unified data transfer across different storage and networking technologies. This engine enables efficient movement of tensors and model data across heterogeneous environments such as GPU memory, system memory, and distributed storage systems. Mooncake also introduces distributed key-value cache storage that allows inference systems to reuse previously computed attention states, significantly improving throughput in large-scale deployments. ...
    Downloads: 0 This Week
  • 23
    mistral.rs

    Fast, flexible LLM inference

    mistral.rs is a fast and flexible LLM inference engine implemented in Rust, designed to run and serve modern language models with an emphasis on performance and practical deployment. It provides multiple entry points for developers, including a CLI for running models locally and an HTTP server that exposes an OpenAI-compatible API surface for easy integration with existing clients; a client-side sketch follows this entry.
    Downloads: 1 This Week
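
    Because the server speaks an OpenAI-compatible protocol, any OpenAI client can talk to it. A minimal client-side sketch, assuming a mistral.rs server is already running locally; the port and model name below depend on how the server was launched and are assumptions here:

      # Talking to a local mistral.rs server through the OpenAI Python client
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:1234/v1",  # assumed port
                      api_key="not-needed")                 # local servers ignore the key
      resp = client.chat.completions.create(
          model="default",  # assumed name; use whatever model the server loaded
          messages=[{"role": "user", "content": "Hello!"}],
      )
      print(resp.choices[0].message.content)
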
  • 24
    LightLLM

    LightLLM is a Python-based LLM (Large Language Model) inference and serving framework

    LightLLM is a high-performance inference and serving framework designed specifically for large language models, focusing on lightweight architecture, scalability, and efficient deployment. The framework enables developers to run and serve modern language models with significantly improved speed and resource efficiency compared to many traditional inference systems. Built primarily in Python, the project integrates optimization techniques and ideas from several leading open-source...
    Downloads: 0 This Week
  • 25
    Open WebUI

    User-friendly AI Interface

    Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform designed to operate entirely offline. It supports various LLM runners like Ollama and OpenAI-compatible APIs, with a built-in inference engine for Retrieval Augmented Generation (RAG), making it a powerful AI deployment solution. Key features include effortless setup via Docker or Kubernetes, seamless integration with OpenAI-compatible APIs, granular permissions and user groups for enhanced security, responsive design across devices, and full Markdown and LaTeX support for enriched interactions. ...
    Downloads: 127 This Week