inference engine free download

45 projects for "inference engine" with 1 filter applied:

Filter applied: BSD

1

MLX Engine

LM Studio Apple MLX engine

MLX Engine is the Apple MLX-based inference backend used by LM Studio to run large language models efficiently on Apple Silicon hardware. Built on top of the mlx-lm and mlx-vlm ecosystems, the engine provides a unified architecture capable of supporting both text-only and multimodal models. Its design focuses on high-performance on-device inference, leveraging Apple’s MLX stack to accelerate computation on M-series chips.

Downloads: 1 This Week

Last Update: 2 days ago
See Project
2

Temporal Inference Engine

A real-time inference engine for temporal logical specifications

A real-time inference engine for temporal logical specifications that can acquire, process, and generate any binary or real signal through POSIX IPC, files, or UNIX sockets. Specifications of signals and dynamic systems are represented as special graphs and executed in real time, with a predictable sampling time of a few milliseconds. Fields of application include real-time signal processing, dynamic system control, state machine modeling, and logical property verification. ...

1 Review

Downloads: 0 This Week

Last Update: 2026-02-18
See Project
3

SimpleLLM

950-line, minimal, extensible LLM inference engine built from scratch

SimpleLLM is a minimal, extensible large language model inference engine implemented in roughly 950 lines of code, built from scratch to serve both as a learning tool and a research platform for novel inference techniques. It provides the core components of an LLM runtime—such as tokenization, batching, and asynchronous execution—without the abstraction overhead of more complex engines, making it easier for developers and researchers to understand and modify. ...

Downloads: 2 This Week

Last Update: 2026-01-28
See Project
4

uzu

A high-performance inference engine for AI models

uzu is a high-performance inference engine designed to run artificial intelligence models efficiently on Apple Silicon hardware. Written primarily in Rust and leveraging Apple’s Metal framework, the project focuses on maximizing performance when executing large language models and other AI workloads on devices such as Mac computers with M-series chips. The engine implements a hybrid architecture in which model layers can be executed either as custom GPU kernels or through Apple’s MPSGraph API, allowing it to balance performance and compatibility depending on the workload. ...

Downloads: 1 This Week

Last Update: 2026-03-15
See Project
5

Nano-vLLM

A lightweight vLLM implementation built from scratch

Nano-vLLM is a lightweight implementation of the vLLM inference engine designed to run large language models efficiently while maintaining a minimal and readable codebase. The project recreates the core functionality of vLLM in a simplified architecture written in approximately a thousand lines of Python, making it easier for developers and researchers to understand how modern LLM inference systems work.

Downloads: 2 This Week

Last Update: 2026-04-13
See Project
6

RTP-LLM

Alibaba's high-performance LLM inference engine for diverse apps

RTP-LLM is an open-source large language model inference acceleration engine developed by Alibaba to provide high-performance serving infrastructure for modern LLM deployments. The system focuses on improving throughput, latency, and resource utilization when running large models in production environments. It achieves this by implementing optimized GPU kernels, batching strategies, and memory management techniques tailored for transformer inference workloads. ...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
7

Chitu

High-performance inference framework for large language models

Chitu is a high-performance inference engine designed to deploy and run large language models efficiently in production environments. The framework focuses on improving efficiency, flexibility, and scalability for organizations that need to run LLM inference workloads across different hardware platforms. It supports heterogeneous computing environments, including CPUs, GPUs, and various specialized AI accelerators, allowing models to run across a wide range of infrastructure configurations. ...

Downloads: 0 This Week

Last Update: 2026-04-09
See Project
8

HunyuanWorld-Voyager

RGBD video generation model conditioned on camera input

...The system jointly produces aligned RGB and depth video sequences, making it directly applicable to 3D reconstruction tasks. At its core, Voyager integrates a world-consistent video diffusion model with an efficient long-range world exploration engine powered by auto-regressive inference. To support training, the team built a scalable data engine that automatically curates large video datasets with camera pose estimation and metric depth prediction. As a result, Voyager delivers state-of-the-art performance on world exploration benchmarks while maintaining photometric, style, and 3D consistency.

Downloads: 18 This Week

Last Update: 7 days ago
See Project
9

SAM 3

Code for running inference and finetuning with SAM 3 model

SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts like “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks) and returns high-quality masks, boxes, and scores for the requested concepts. Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an...

Downloads: 50 This Week

Last Update: 6 days ago
See Project
10

AI Runner

Offline inference engine for art, real-time voice conversations

AI Runner is an offline inference engine designed to run a collection of AI workloads on your own machine, including image generation for art, real-time voice conversations, LLM-powered chatbots and automated workflows. It is implemented as a desktop-oriented Python application and emphasizes privacy and self-hosting, allowing users to work with text-to-speech, speech-to-text, text-to-image and multimodal models without sending data to external services.

Downloads: 9 This Week

Last Update: 2025-12-11
See Project
11

mllm

Fast Multimodal LLM on Mobile Devices

mllm is an open-source inference engine designed to run multimodal large language models efficiently on mobile devices and edge computing environments. The framework focuses on delivering high-performance AI inference in resource-constrained systems such as smartphones, embedded hardware, and lightweight computing platforms. Implemented primarily in C and C++, it is designed to operate with minimal external dependencies while taking advantage of hardware-specific acceleration technologies such as ARM NEON and x86 AVX2 instructions. ...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
12

MLC LLM

Universal LLM Deployment Engine with ML Compilation

MLC LLM is a machine learning compiler and deployment framework designed to enable efficient execution of large language models across a wide range of hardware platforms. The project focuses on compiling models into optimized runtimes that can run natively on devices such as GPUs, mobile processors, browsers, and edge hardware. By leveraging machine learning compilation techniques, mlc-llm produces high-performance inference engines that maintain consistent APIs across platforms. The system...

Downloads: 21 This Week

Last Update: 2026-03-09
See Project
13

Mooncake

Mooncake is the serving platform for Kimi

...The platform was originally developed as part of the serving infrastructure for the Kimi large language model system. Its architecture centers on a high-performance transfer engine that provides unified data transfer across different storage and networking technologies. This engine enables efficient movement of tensors and model data across heterogeneous environments such as GPU memory, system memory, and distributed storage systems. Mooncake also introduces distributed key-value cache storage that allows inference systems to reuse previously computed attention states, significantly improving throughput in large-scale deployments. ...

Downloads: 0 This Week

Last Update: 10 hours ago
See Project
14

LightLLM

LightLLM is a Python-based LLM (Large Language Model) inference

LightLLM is a high-performance inference and serving framework designed specifically for large language models, focusing on lightweight architecture, scalability, and efficient deployment. The framework enables developers to run and serve modern language models with significantly improved speed and resource efficiency compared to many traditional inference systems. Built primarily in Python, the project integrates optimization techniques and ideas from several leading open-source...

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
15

Open WebUI

User-friendly AI Interface

Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform designed to operate entirely offline. It supports various LLM runners like Ollama and OpenAI-compatible APIs, with a built-in inference engine for Retrieval Augmented Generation (RAG), making it a powerful AI deployment solution. Key features include effortless setup via Docker or Kubernetes, seamless integration with OpenAI-compatible APIs, granular permissions and user groups for enhanced security, responsive design across devices, and full Markdown and LaTeX support for enriched interactions. ...

Downloads: 130 This Week

Last Update: 1 day ago
See Project
16

wllama

WebAssembly binding for llama.cpp - Enabling on-browser LLM inference

wllama is a WebAssembly-based library that enables large language model inference directly inside a web browser. Built as a binding for the llama.cpp inference engine, the project allows developers to run LLM models locally without requiring a server backend or dedicated GPU hardware. The library leverages WebAssembly SIMD capabilities to achieve efficient execution within modern browsers while maintaining compatibility across platforms.

Downloads: 0 This Week

Last Update: 2026-03-10
See Project
17

Parallax

Parallax is a distributed model serving framework

Parallax is a decentralized inference framework designed to run large language models across distributed computing resources. Instead of relying on centralized GPU clusters in data centers, the system allows multiple heterogeneous machines to collaborate in serving AI inference workloads. Parallax divides model layers across different nodes and dynamically coordinates them to form a complete inference pipeline. A two-stage scheduling architecture determines how model layers are allocated to...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
18

PicoLM

Run a 1-billion parameter LLM on a $10 board with 256MB RAM

PicoLM is an open-source inference framework designed to run large language models on extremely constrained hardware environments such as inexpensive single-board computers and embedded systems. The project focuses on enabling efficient local inference by optimizing memory usage, computation, and system dependencies so that relatively large models can operate on devices with minimal RAM. It is written primarily in C and designed with a minimalist architecture that removes unnecessary...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
19

FlowGram

Extensible workflow development framework

FlowGram is an open-source, node-based workflow development framework and toolkit aimed at helping developers build custom AI-workflow platforms or automation systems through a visual, drag-and-drop interface. Instead of shipping as a ready-made product, it provides the building blocks — a canvas for wiring together nodes, a form engine for configuring node parameters, a variable-scope and type-inference engine, and a set of “materials” (pre-built node types such as code execution, conditional logic, LLM calls, etc.) that can be composed into larger workflows. This makes FlowGram highly flexible: you can prototype data-processing pipelines, AI-agent flows, automation scripts, or even business process automation without writing all the plumbing yourself. ...

Downloads: 2 This Week

Last Update: 6 days ago
See Project
20

Secret Llama

Fully private LLM chatbot that runs entirely in a browser

Secret Llama is a privacy-first large-language-model chatbot that runs entirely inside your web browser, meaning no server is required and your conversation data never leaves your device. It focuses on open-source model support, letting you load families like Llama and Mistral directly in the client for fully local inference. Because everything happens in-browser, it can work offline once models are cached, which is helpful for air-gapped environments or travel. The interface mirrors the modern chat UX you’d expect—streaming responses, markdown, and a clean layout—so there’s no usability tradeoff to gain privacy. Under the hood it uses a web-native inference engine to accelerate model execution with GPU/WebGPU when available, keeping responses responsive even without a backend. ...

Downloads: 1 This Week

Last Update: 2025-11-07
See Project
21

Matrix

Multi-Agent daTa geneRation Infra and eXperimentation framework

...That design makes Matrix particularly well-suited for large-batch inference, model benchmarking, data curation, augmentation, or generation — whether for language, code, dialogue, or multimodal tasks. It supports both open-source LLMs and proprietary models (via integration with model backends), and works with containerized or sandboxed environments for safe tool execution or external code runs.

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
22

MiniSearch

Minimalist web-searching platform with an AI assistant

MiniSearch is a minimalist web search application with a built-in AI assistant that runs largely inside the browser for privacy-focused information retrieval. The project combines metasearch capabilities with local or remote language model inference to provide conversational answers alongside traditional search results. It is designed to be lightweight, easy to deploy with Docker, and configurable for both personal and hosted use cases. The platform supports browser-level integration so users can set it as their default search engine for quick access. Its architecture emphasizes privacy by avoiding tracking and minimizing data collection while still enabling advanced AI features. ...

Downloads: 2 This Week

Last Update: 6 days ago
See Project
23

Superduper

Superduper: Integrate AI models and machine learning workflows

Superduper is a Python-based framework for building end-to-end AI-data workflows and applications on your own data, integrating with major databases. It supports the latest technologies and techniques, including LLMs, vector search, RAG, and multimodality, as well as classical AI and ML paradigms. Developers may leverage Superduper by building compositional and declarative objects that outsource the details of deployment, orchestration, versioning, and more to the Superduper engine. This...

Downloads: 0 This Week

Last Update: 2025-08-26
See Project
24

FlexLLMGen

Running large language models on a single GPU

FlexLLMGen is an open-source inference engine designed to run large language models efficiently on limited hardware resources such as a single GPU. The system focuses on high-throughput generation workloads where large batches of text must be processed quickly, such as large-scale data extraction or document analysis tasks. Instead of requiring expensive multi-GPU systems, the framework uses techniques such as memory offloading, compression, and optimized batching to run large models on commodity hardware. ...

Downloads: 0 This Week

Last Update: 2026-03-10
See Project
25

PasteGuard

Masks sensitive data and secrets before they reach AI

...PasteGuard supports two primary modes: mask mode, which anonymizes data and still uses external APIs; and route mode, which forwards sensitive requests to a local LLM inference engine while sending the rest to the cloud. It can be self-hosted via Docker, works with a wide range of SDKs and tools, and includes a browser extension for automatic protection in everyday AI chats.

Downloads: 0 This Week

Last Update: 2026-03-13
See Project