Showing 18 open source projects for "inference engine"

  • 1
    gemma.cpp

Lightweight, standalone C++ inference engine for Google's Gemma models

    Gemma.cpp is a C++ implementation for running inference with Gemma models efficiently on CPUs and GPUs. Developed by Google, it allows running large language models (LLMs) like Gemma with minimal hardware, focusing on optimized performance and low latency. Gemma.cpp is intended for developers seeking to deploy LLMs in production environments without needing massive computational resources.
    Downloads: 2 This Week
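    Below is a minimal sketch of driving the gemma.cpp binary from Python. The flag names and weights filename follow the project's documented invocation but vary by release, so treat them as assumptions rather than confirmed syntax.

        # Hedged sketch: wraps the gemma.cpp command-line binary. Flag names
        # and the weights filename may differ in your checkout.
        import subprocess

        def run_gemma(prompt: str) -> str:
            result = subprocess.run(
                ["./gemma",
                 "--tokenizer", "tokenizer.spm",  # SentencePiece tokenizer file
                 "--weights", "2b-it-sfp.sbs",    # compressed 2B instruction-tuned weights
                 "--model", "2b-it"],             # model variant selector
                input=prompt, capture_output=True, text=True, check=True)
            return result.stdout

        print(run_gemma("Summarize what an inference engine does."))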
  • 2
    RTP-LLM

    Alibaba's high-performance LLM inference engine for diverse apps

    RTP-LLM is an open-source large language model inference acceleration engine developed by Alibaba to provide high-performance serving infrastructure for modern LLM deployments. The system focuses on improving throughput, latency, and resource utilization when running large models in production environments. It achieves this by implementing optimized GPU kernels, batching strategies, and memory management techniques tailored for transformer inference workloads. ...
    Downloads: 0 This Week
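    RTP-LLM's own API is not reproduced here; as a concept-only illustration of the batching strategies mentioned above, the toy scheduler below groups pending requests into a single step. Every name in it is hypothetical.

        # Concept sketch only (not RTP-LLM's API): dynamic batching drains
        # whatever requests are queued into one "forward pass".
        from collections import deque

        class ToyBatcher:
            def __init__(self, max_batch: int = 8):
                self.queue: deque[str] = deque()
                self.max_batch = max_batch

            def submit(self, prompt: str) -> None:
                self.queue.append(prompt)

            def step(self) -> None:
                n = min(self.max_batch, len(self.queue))
                batch = [self.queue.popleft() for _ in range(n)]
                if batch:
                    print(f"running batch of {len(batch)}: {batch}")

        b = ToyBatcher()
        for p in ["hi", "translate this", "write a haiku"]:
            b.submit(p)
        b.step()  # -> running batch of 3: ['hi', 'translate this', 'write a haiku']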
  • 3
    mllm

    Fast Multimodal LLM on Mobile Devices

    mllm is an open-source inference engine designed to run multimodal large language models efficiently on mobile devices and edge computing environments. The framework focuses on delivering high-performance AI inference in resource-constrained systems such as smartphones, embedded hardware, and lightweight computing platforms. Implemented primarily in C and C++, it is designed to operate with minimal external dependencies while taking advantage of hardware-specific acceleration technologies such as ARM NEON and x86 AVX2 instructions. ...
    Downloads: 1 This Week
  • 4
    CTranslate2

    Fast inference engine for Transformer models

CTranslate2 is a C++ and Python library for efficient inference with Transformer models. The project implements a custom runtime that applies many performance optimization techniques, such as weights quantization, layers fusion, and batch reordering, to accelerate and reduce the memory usage of Transformer models on CPU and GPU. The execution is significantly faster and requires fewer resources than general-purpose deep learning frameworks on supported models and tasks thanks to many...
    Downloads: 13 This Week
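    A minimal usage sketch with CTranslate2's Python API, assuming a translation model already converted with the project's converter tools (the model directory name here is illustrative):

        # Assumes a converted model directory, e.g. produced beforehand by:
        #   ct2-transformers-converter --model <hf-model> --output_dir model_ct2
        import ctranslate2

        translator = ctranslate2.Translator(
            "model_ct2",          # illustrative path to a converted model
            device="cpu",
            compute_type="int8")  # weight quantization, one of the optimizations noted above

        # translate_batch expects pre-tokenized input (lists of subword tokens).
        results = translator.translate_batch([["▁Hello", "▁world", "!"]])
        print(results[0].hypotheses[0])  # best hypothesis as a token list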
  • 5
    Cactus

    Low-latency AI inference engine optimized for mobile devices

    Cactus is a low-latency, energy-efficient AI inference framework designed specifically for mobile devices and wearables, enabling advanced machine learning capabilities directly on-device. It provides a full-stack architecture composed of an inference engine, a computation graph system, and highly optimized hardware kernels tailored for ARM-based processors. Cactus emphasizes efficient memory usage through techniques such as zero-copy computation graphs and quantized model formats, allowing large models to run within the constraints of mobile hardware. ...
    Downloads: 0 This Week
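    Cactus's SDK is not shown here; as a generic illustration of the quantized model formats the entry mentions, the sketch below rounds float32 weights to int8 with one per-tensor scale, the basic trade-off mobile engines use to fit large models in memory.

        # Generic illustration (not Cactus's API): symmetric per-tensor int8
        # quantization, trading a little accuracy for a 4x size reduction.
        import numpy as np

        def quantize_int8(w: np.ndarray):
            scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
            q = np.round(w / scale).astype(np.int8)
            return q, scale

        def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
            return q.astype(np.float32) * scale

        w = np.random.randn(4, 4).astype(np.float32)
        q, s = quantize_int8(w)
        print("max abs error:", np.abs(w - dequantize(q, s)).max())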
  • 6
    Mooncake

    Mooncake is the serving platform for Kimi

    ...The platform was originally developed as part of the serving infrastructure for the Kimi large language model system. Its architecture centers on a high-performance transfer engine that provides unified data transfer across different storage and networking technologies. This engine enables efficient movement of tensors and model data across heterogeneous environments such as GPU memory, system memory, and distributed storage systems. Mooncake also introduces distributed key-value cache storage that allows inference systems to reuse previously computed attention states, significantly improving throughput in large-scale deployments. ...
    Downloads: 0 This Week
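    Mooncake's transfer engine API is not reproduced here; the toy cache below only illustrates the KV-reuse idea described above, where a request whose prompt extends a cached prefix can skip recomputing that prefix's attention state. All names are hypothetical.

        # Concept sketch only: a KV cache keyed by token prefixes.
        cache: dict[tuple, str] = {}

        def put_kv(tokens: list[str], state: str) -> None:
            cache[tuple(tokens)] = state

        def longest_cached_prefix(tokens: list[str]):
            # Search from the full sequence down to length-1 prefixes.
            for end in range(len(tokens), 0, -1):
                key = tuple(tokens[:end])
                if key in cache:
                    return key, cache[key]
            return (), None

        put_kv(["sys", "prompt"], "kv-state-A")
        prefix, state = longest_cached_prefix(["sys", "prompt", "user", "question"])
        print(f"reuse {state} for prefix {prefix}")  # reuses kv-state-A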
  • 7
    PowerInfer

    High-speed Large Language Model Serving for Local Deployment

    PowerInfer is a high-performance inference engine designed to run large language models efficiently on personal computers equipped with consumer-grade GPUs. The project focuses on improving the performance of local AI inference by optimizing how neural network computations are distributed between CPU and GPU resources. Its architecture exploits the observation that only a subset of neurons in large models are frequently activated, allowing the system to preload frequently used neurons into GPU memory while processing less common activations on the CPU. ...
    Downloads: 1 This Week
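    As a concept-only sketch of the hot/cold split described above (not PowerInfer's API), the snippet ranks neurons by how often they activate and assigns the hottest ones to a toy GPU budget, leaving the long tail on the CPU.

        # Concept sketch only: place frequently activated ("hot") neurons in
        # GPU memory and the rarely activated tail on the CPU.
        activation_counts = {"n0": 980, "n1": 12, "n2": 875, "n3": 3, "n4": 640}
        gpu_budget = 3  # toy capacity: how many neurons fit on the GPU

        ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
        gpu_set, cpu_set = ranked[:gpu_budget], ranked[gpu_budget:]
        print("GPU (hot):", gpu_set)   # ['n0', 'n2', 'n4']
        print("CPU (cold):", cpu_set)  # ['n1', 'n3']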
  • 8
    qvac-fabric-llm.cpp

    QVAC Fabric: cross-platform LLM inference and fine-tuning

    qvac-fabric-llm.cpp is a cross-platform large language model inference and fine-tuning engine built as an advanced fork of llama.cpp, designed to run efficiently across desktops, mobile devices, and heterogeneous GPU environments. The project focuses on removing hardware limitations traditionally associated with LLM deployment by enabling support for a wide range of backends, including Vulkan, Metal, CUDA, and CPU, making it accessible on devices ranging from smartphones to enterprise servers. ...
    Downloads: 0 This Week
  • 9
    DALI

    A GPU-accelerated library containing highly optimized building blocks

    ...Deep learning applications require complex, multi-stage data processing pipelines that include loading, decoding, cropping, resizing, and many other augmentations. These data processing pipelines, which are currently executed on the CPU, have become a bottleneck, limiting the performance and scalability of training and inference. DALI addresses the problem of the CPU bottleneck by offloading data preprocessing to the GPU. Additionally, DALI relies on its own execution engine, built to maximize the throughput of the input pipeline.
    Downloads: 1 This Week
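    A minimal DALI pipeline using the library's Python API; the file_root directory is illustrative, and a CUDA-capable GPU is assumed for the "mixed" decoder.

        # Minimal DALI pipeline: read JPEGs, decode on the GPU, resize.
        from nvidia.dali import pipeline_def, fn

        @pipeline_def(batch_size=8, num_threads=2, device_id=0)
        def image_pipeline():
            jpegs, labels = fn.readers.file(file_root="images/")  # illustrative path
            images = fn.decoders.image(jpegs, device="mixed")     # CPU parse + GPU decode
            images = fn.resize(images, resize_x=224, resize_y=224)
            return images, labels

        pipe = image_pipeline()
        pipe.build()
        images, labels = pipe.run()  # batches stay GPU-resident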
  • 10
    stable-diffusion.cpp

Diffusion model (SD, Flux, Wan, Qwen Image, Z-Image, ...) inference

    stable-diffusion.cpp is a lightweight, high-performance implementation of Stable Diffusion and related generative models written entirely in portable C/C++, designed to run on virtually any device without heavy dependencies. It enables text-to-image and image-to-image generation, supports a growing set of models like SD1.x, SD2.x, SDXL, SD-Turbo, Qwen Image, and more, and is continually updated with support for cutting-edge model variants including video and image editing models. The project...
    Downloads: 21 This Week
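    A hedged sketch of invoking the stable-diffusion.cpp CLI from Python; the -m/-p/-o flags follow the project's README examples, while the binary location and model filename are assumptions.

        # Hedged sketch: drives the stable-diffusion.cpp command-line tool.
        import subprocess

        subprocess.run(
            ["./sd",
             "-m", "sd-v1-5.safetensors",  # illustrative model file
             "-p", "a lighthouse at dawn, oil painting",
             "-o", "output.png"],
            check=True)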
  • 11
    OceanBase seekdb

    The AI-Native Search Database

    seekdb is an AI-native search database from OceanBase that unifies vector, full-text, relational, JSON, and GIS data into a single query engine. The system is designed to support hybrid search workloads and in-database AI workflows without requiring multiple specialized databases. It enables developers to perform semantic search, keyword search, and structured SQL queries within the same platform, simplifying modern AI application stacks. seekdb also embeds AI capabilities directly in the database layer, including embedding generation, reranking, and LLM inference for end-to-end RAG pipelines. ...
    Downloads: 6 This Week
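    Since seekdb is queried with SQL, a vector search can be sketched from Python through a standard MySQL-protocol client. The schema, port, and the l2_distance() function name below are hypothetical, not confirmed seekdb syntax.

        # Hypothetical sketch only: table, port, and l2_distance() are assumptions.
        import pymysql

        conn = pymysql.connect(host="127.0.0.1", port=2881, user="root", database="demo")
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, title
                FROM docs
                ORDER BY l2_distance(embedding, %s)  -- vector similarity (assumed)
                LIMIT 5
                """,
                ("[0.12, 0.34, 0.56]",))
            print(cur.fetchall())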
  • 12
    OnnxStream

    Lightweight inference library for ONNX files, written in C++

The challenge is to run Stable Diffusion 1.5, which includes a large transformer model with almost 1 billion parameters, on a Raspberry Pi Zero 2, a microcomputer with 512MB of RAM, without adding more swap space and without offloading intermediate results to disk. The recommended minimum RAM/VRAM for Stable Diffusion 1.5 is typically 8GB. Generally, major machine learning frameworks and libraries focus on minimizing inference latency and/or maximizing throughput, usually at the cost of RAM usage. So I decided to write a super small and hackable inference library specifically focused on minimizing memory consumption: OnnxStream. OnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights, which is a class derived from WeightsProvider. ...
    Downloads: 22 This Week
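    OnnxStream's C++ WeightsProvider interface is only mirrored conceptually below: the Python sketch loads each tensor on demand and lets it be freed after use, instead of keeping the whole model resident. Names and the on-disk layout are hypothetical.

        # Concept sketch only: decouple the engine from weight storage, as the
        # entry above describes, by fetching one tensor at a time.
        import numpy as np

        class DiskWeightsProvider:
            """Hypothetical provider that streams tensors from disk."""
            def get(self, name: str) -> np.ndarray:
                return np.load(f"weights/{name}.npy")  # illustrative layout

        def run_layer(x: np.ndarray, provider: DiskWeightsProvider, name: str):
            w = provider.get(name)  # weights live only for this call...
            return x @ w            # ...then become garbage-collectable

        provider = DiskWeightsProvider()
        x = np.ones((1, 64), dtype=np.float32)
        for layer in ["fc1", "fc2"]:
            x = run_layer(x, provider, layer)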
  • 13
    MACE

    Deep learning inference framework optimized for mobile platforms

Mobile AI Compute Engine (or MACE for short) is a deep learning inference framework optimized for mobile heterogeneous computing on Android, iOS, Linux, and Windows devices. The runtime is optimized with NEON, OpenCL, and Hexagon, and the Winograd algorithm is used to speed up convolution operations. Initialization is also optimized to be faster. Chip-dependent power options such as big.LITTLE scheduling and Adreno GPU hints are included as advanced APIs.
    Downloads: 0 This Week
  • 14
An automatic knowledge inference engine. Given a set of statements, it can derive actions and further statements from the set of assumptions (see the sketch below).
    Downloads: 0 This Week
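    As the sketch promised above, a forward-chaining loop derives new statements from a set of assumptions by applying rules until a fixed point; this is a generic illustration, not this project's code.

        # Toy forward chaining: rules are (premises -> conclusion); keep
        # applying them until no new fact can be derived.
        rules = [
            ({"raining"}, "wet_ground"),
            ({"wet_ground", "freezing"}, "icy_road"),
        ]
        facts = {"raining", "freezing"}

        changed = True
        while changed:
            changed = False
            for premises, conclusion in rules:
                if premises <= facts and conclusion not in facts:
                    facts.add(conclusion)
                    changed = True

        print(facts)  # {'raining', 'freezing', 'wet_ground', 'icy_road'}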
  • 15
    DeepSpeech

    Open source embedded speech-to-text engine

    ...If you want to use the pre-trained English model for performing speech-to-text, you can download it (along with other important inference material) from the DeepSpeech releases page.
    Downloads: 14 This Week
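    A minimal transcription sketch with the DeepSpeech Python package, assuming the pre-trained English model files from the releases page and a 16 kHz, 16-bit, mono WAV file:

        # Uses the DeepSpeech Python API with the released English model files.
        import wave
        import numpy as np
        import deepspeech

        model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
        model.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # optional LM

        with wave.open("audio.wav", "rb") as w:  # must be 16 kHz, 16-bit mono PCM
            audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

        print(model.stt(audio))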
  • 16
EulerMoz is an inference engine supporting logic-based proofs, based on the EulerSharp project.
    Downloads: 0 This Week
  • 17
TooCoM, a Tool to Operationalize an Ontology with the COnceptual graph Model. TooCoM allows the user to graphically edit, test, operationalize, and use a heavyweight ontology in an inference engine, using the Entity-Relationship paradigm.
    Downloads: 0 This Week
  • 18
This project provides an "embedded expert system", i.e. a limited-ability inference engine, with a demo rule set taken from the field of X-ray Photoelectron Spectroscopy.
    Downloads: 6 This Week