Showing 26 open source projects for "inference engine"

View related business solutions
  • Skillfully - The future of skills based hiring Icon
    Skillfully - The future of skills based hiring

    Realistic Workplace Simulations that Show Applicant Skills in Action

    Skillfully transforms hiring through AI-powered skill simulations that show you how candidates actually perform before you hire them. Our platform helps companies cut through AI-generated resumes and rehearsed interviews by validating real capabilities in action. Through dynamic job specific simulations and skill-based assessments, companies like Bloomberg and McKinsey have cut screening time by 50% while dramatically improving hire quality.
    Learn More
  • The Most Powerful Software Platform for EHSQ and ESG Management Icon
    The Most Powerful Software Platform for EHSQ and ESG Management

    Addresses the needs of small businesses and large global organizations with thousands of users in multiple locations.

    Choose from a complete set of software solutions across EHSQ that address all aspects of top performing Environmental, Health and Safety, and Quality management programs.
    Learn More
  • 1
    Transformer Engine

    Transformer Engine

    A library for accelerating Transformer models on NVIDIA GPUs

    Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower memory utilization in both training and inference. TE provides a collection of highly optimized building blocks for popular Transformer architectures and an automatic mixed precision-like API that can be used seamlessly with your framework-specific code.
    Downloads: 31 This Week
    Last Update:
    See Project
  • 2
    vLLM

    vLLM

    A high-throughput and memory-efficient inference and serving engine

    vLLM is a fast and easy-to-use library for LLM inference and serving. High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.
    Downloads: 42 This Week
    Last Update:
    See Project
  • 3
    Chitu

    Chitu

    High-performance inference framework for large language models

    Chitu is a high-performance inference engine designed to deploy and run large language models efficiently in production environments. The framework focuses on improving efficiency, flexibility, and scalability for organizations that need to run LLM inference workloads across different hardware platforms. It supports heterogeneous computing environments, including CPUs, GPUs, and various specialized AI accelerators, allowing models to run across a wide range of infrastructure configurations. ...
    Downloads: 19 This Week
    Last Update:
    See Project
  • 4
    SimpleLLM

    SimpleLLM

    950 line, minimal, extensible LLM inference engine built from scratch

    SimpleLLM is a minimal, extensible large language model inference engine implemented in roughly 950 lines of code, built from scratch to serve both as a learning tool and a research platform for novel inference techniques. It provides the core components of an LLM runtime—such as tokenization, batching, and asynchronous execution—without the abstraction overhead of more complex engines, making it easier for developers and researchers to understand and modify. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Securing the Cloud Made Easy Icon
    Securing the Cloud Made Easy

    Multi-cloud security delivered — now and in the future.

    Designed for organizations operating in the cloud who need complete, centralized visibility of their entire cloud estate and want more time and resources dedicated to remediating the actual risks that matter, Orca Security is an agentless cloud Security Platform that provides security teams with 100% coverage their entire cloud environment.
    Learn More
  • 5
    Nano-vLLM

    Nano-vLLM

    A lightweight vLLM implementation built from scratch

    Nano-vLLM is a lightweight implementation of the vLLM inference engine designed to run large language models efficiently while maintaining a minimal and readable codebase. The project recreates the core functionality of vLLM in a simplified architecture written in approximately a thousand lines of Python, making it easier for developers and researchers to understand how modern LLM inference systems work.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    Cactus

    Cactus

    Low-latency AI inference engine optimized for mobile devices

    Cactus is a low-latency, energy-efficient AI inference framework designed specifically for mobile devices and wearables, enabling advanced machine learning capabilities directly on-device. It provides a full-stack architecture composed of an inference engine, a computation graph system, and highly optimized hardware kernels tailored for ARM-based processors. Cactus emphasizes efficient memory usage through techniques such as zero-copy computation graphs and quantized model formats, allowing large models to run within the constraints of mobile hardware. ...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 7
    SAM 3

    SAM 3

    Code for running inference and finetuning with SAM 3 model

    SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts like “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks) and returns high-quality masks, boxes, and scores for the requested concepts. Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an...
    Downloads: 71 This Week
    Last Update:
    See Project
  • 8
    Pruna AI

    Pruna AI

    Pruna is a model optimization framework built for developers

    Pruna is an open-source, self-hostable AI inference engine designed to help teams deploy and manage large language models (LLMs) efficiently across private or hybrid infrastructures. Built with performance and developer ergonomics in mind, Pruna simplifies inference workflows by enabling multi-model orchestration, autoscaling, GPU resource allocation, and compatibility with popular open-source models.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 9
    HunyuanWorld-Voyager

    HunyuanWorld-Voyager

    RGBD video generation model conditioned on camera input

    ...The system jointly produces aligned RGB and depth video sequences, making it directly applicable to 3D reconstruction tasks. At its core, Voyager integrates a world-consistent video diffusion model with an efficient long-range world exploration engine powered by auto-regressive inference. To support training, the team built a scalable data engine that automatically curates large video datasets with camera pose estimation and metric depth prediction. As a result, Voyager delivers state-of-the-art performance on world exploration benchmarks while maintaining photometric, style, and 3D consistency.
    Downloads: 10 This Week
    Last Update:
    See Project
  • Empower Your Workforce and Digitize Your Shop Floor Icon
    Empower Your Workforce and Digitize Your Shop Floor

    Benefits to Manufacturers

    Easily connect to most tools and equipment on the shop floor, enabling efficient data collection and boosting productivity with vital insights. Turn information into action to generate new ideas and better processes.
    Learn More
  • 10
    AI Runner

    AI Runner

    Offline inference engine for art, real-time voice conversations

    AI Runner is an offline inference engine designed to run a collection of AI workloads on your own machine, including image generation for art, real-time voice conversations, LLM-powered chatbots and automated workflows. It is implemented as a desktop-oriented Python application and emphasizes privacy and self-hosting, allowing users to work with text-to-speech, speech-to-text, text-to-image and multimodal models without sending data to external services.
    Downloads: 13 This Week
    Last Update:
    See Project
  • 11
    llama2.c

    llama2.c

    Inference Llama 2 in one file of pure C

    llama2.c is a minimalist implementation of the Llama 2 language model architecture designed to run entirely in pure C. Created by Andrej Karpathy, this project offers an educational and lightweight framework for performing inference on small Llama 2 models without external dependencies. It provides a full training and inference pipeline: models can be trained in PyTorch and later executed using a concise 700-line C program (run.c). While it can technically load Meta’s official Llama 2...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 12
    Parallax

    Parallax

    Parallax is a distributed model serving framework

    Parallax is a decentralized inference framework designed to run large language models across distributed computing resources. Instead of relying on centralized GPU clusters in data centers, the system allows multiple heterogeneous machines to collaborate in serving AI inference workloads. Parallax divides model layers across different nodes and dynamically coordinates them to form a complete inference pipeline. A two-stage scheduling architecture determines how model layers are allocated to...
    Downloads: 5 This Week
    Last Update:
    See Project
  • 13
    MLC LLM

    MLC LLM

    Universal LLM Deployment Engine with ML Compilation

    MLC LLM is a machine learning compiler and deployment framework designed to enable efficient execution of large language models across a wide range of hardware platforms. The project focuses on compiling models into optimized runtimes that can run natively on devices such as GPUs, mobile processors, browsers, and edge hardware. By leveraging machine learning compilation techniques, mlc-llm produces high-performance inference engines that maintain consistent APIs across platforms. The system...
    Downloads: 36 This Week
    Last Update:
    See Project
  • 14
    LightLLM

    LightLLM

    LightLLM is a Python-based LLM (Large Language Model) inference

    LightLLM is a high-performance inference and serving framework designed specifically for large language models, focusing on lightweight architecture, scalability, and efficient deployment. The framework enables developers to run and serve modern language models with significantly improved speed and resource efficiency compared to many traditional inference systems. Built primarily in Python, the project integrates optimization techniques and ideas from several leading open-source...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    marqo

    marqo

    Tensor search for humans

    A tensor-based search and analytics engine that seamlessly integrates with your applications, websites, and workflows. Marqo is a versatile and robust search and analytics engine that can be integrated into any website or application. Due to horizontal scalability, Marqo provides lightning-fast query times, even with millions of documents. Marqo helps you configure deep-learning models like CLIP to pull semantic meaning from images. It can seamlessly handle image-to-image, image-to-text and...
    Downloads: 5 This Week
    Last Update:
    See Project
  • 16
    Superduper

    Superduper

    Superduper: Integrate AI models and machine learning workflows

    Superduper is a Python-based framework for building end-2-end AI-data workflows and applications on your own data, integrating with major databases. It supports the latest technologies and techniques, including LLMs, vector-search, RAG, and multimodality as well as classical AI and ML paradigms. Developers may leverage Superduper by building compositional and declarative objects that out-source the details of deployment, orchestration versioning, and more to the Superduper engine. This...
    Downloads: 9 This Week
    Last Update:
    See Project
  • 17
    LMCache

    LMCache

    Supercharge Your LLM with the Fastest KV Cache Layer

    LMCache is an extension layer for LLM serving engines that accelerates inference, especially with long contexts, by storing and reusing key-value (KV) attention caches across requests. Instead of rebuilding KV states for repeated or shared text segments, LMCache persists and retrieves them from multiple tiers—GPU memory, CPU DRAM, and local disk—then injects them into subsequent requests to reduce TTFT and increase throughput.
    Downloads: 16 This Week
    Last Update:
    See Project
  • 18
    Matrix

    Matrix

    Multi-Agent daTa geneRation Infra and eXperimentation framework

    ...That design makes Matrix particularly well-suited for large-batch inference, model benchmarking, data curation, augmentation, or generation — whether for language, code, dialogue, or multimodal tasks. It supports both open-source LLMs and proprietary models (via integration with model backends), and works with containerized or sandboxed environments for safe tool execution or external code runs.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    X-AnyLabeling

    X-AnyLabeling

    Effortless data labeling with AI support from Segment Anything

    X-AnyLabeling is an open-source data annotation platform designed to streamline the process of labeling datasets for computer vision and multimodal AI applications. The software integrates an AI-powered labeling engine that allows users to generate annotations automatically with the assistance of modern vision models such as Segment Anything and various object detection frameworks. It supports labeling tasks across images and videos and enables developers to prepare training datasets for...
    Downloads: 61 This Week
    Last Update:
    See Project
  • 20
    FlexLLMGen

    FlexLLMGen

    Running large language models on a single GPU

    FlexLLMGen is an open-source inference engine designed to run large language models efficiently on limited hardware resources such as a single GPU. The system focuses on high-throughput generation workloads where large batches of text must be processed quickly, such as large-scale data extraction or document analysis tasks. Instead of requiring expensive multi-GPU systems, the framework uses techniques such as memory offloading, compression, and optimized batching to run large models on commodity hardware. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Lightning Bolts

    Lightning Bolts

    Toolbox of models, callbacks, and datasets for AI/ML researchers

    Bolts package provides a variety of components to extend PyTorch Lightning, such as callbacks & datasets, for applied research and production. Torch ORT converts your model into an optimized ONNX graph, speeding up training & inference when using NVIDIA or AMD GPUs. We can introduce sparsity during fine-tuning with SparseML, which ultimately allows us to leverage the DeepSparse engine to see performance improvements at inference time.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    FrankMocap

    FrankMocap

    A Strong and Easy-to-use Single View 3D Hand+Body Pose Estimator

    FrankMocap is a monocular 3D human capture system that estimates body, hand, and optionally face pose from a single RGB image or video. It regresses parametric human models (e.g., SMPL/SMPL-X) directly, producing temporally stable meshes and joint angles suitable for animation or analytics. The pipeline couples a robust 2D keypoint detector with 3D mesh regression networks and priors that keep results anatomically plausible. It can run frame-by-frame or with temporal smoothing, and includes...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    TensorFlowTTS

    TensorFlowTTS

    Real-Time State-of-the-art Speech Synthesis for Tensorflow 2

    TensorFlowTTS is a state-of-the-art, open-source speech synthesis library built on TensorFlow 2. It offers a variety of architectures for text-to-speech, including classic and modern models such as Tacotron‑2, FastSpeech / FastSpeech2, and neural vocoders like MelGAN and Multiband‑MelGAN. Because it’s based on TensorFlow 2, it can leverage optimizations such as fake-quantization aware training and pruning — which allow models to run faster than real time and to be deployable on mobile or...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    Minkowski Engine

    Minkowski Engine

    Auto-diff neural network library for high-dimensional sparse tensors

    The Minkowski Engine is an auto-differentiation library for sparse tensors. It supports all standard neural network layers such as convolution, pooling, unspooling, and broadcasting operations for sparse tensors. The Minkowski Engine supports various functions that can be built on a sparse tensor. We list a few popular network architectures and applications here.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    DeepSpeech

    DeepSpeech

    Open source embedded speech-to-text engine

    ...If you want to use the pre-trained English model for performing speech-to-text, you can download it (along with other important inference material) from the DeepSpeech releases page.
    Downloads: 11 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • Next
MongoDB Logo MongoDB