Replace OpenAI GPT with another LLM in your app
A high-throughput and memory-efficient inference and serving engine for LLMs
High-performance inference framework for large language models
Low-latency REST API for serving text embeddings
Inference Llama 2 in one file of pure C
High-performance Inference and Deployment Toolkit for LLMs and VLMs
A lightweight vLLM implementation built from scratch
A 950-line, minimal, extensible LLM inference engine built from scratch
AirLLM: 70B-model inference on a single 4GB GPU
Operating LLMs in production
GLM-4.5: Open-source LLM for intelligent agents by Z.ai
Performance-optimized AI inference on your GPUs
A course on LLM inference serving on Apple Silicon
Accelerate local LLM inference and finetuning
Qwen3 is the large language model series developed by the Qwen team
State-of-the-art Parameter-Efficient Fine-Tuning
Parallax is a distributed model serving framework
Ling is a MoE LLM developed and open-sourced by InclusionAI
Run Local LLMs on Any Device. Open-source
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework
Phi-3.5 for Mac: Locally-run Vision and Language Models
ChatGLM-6B: An Open Bilingual Dialogue Language Model
Technical principles related to large models
PyTorch library of curated Transformer models and their components
Unified KV Cache Compression Methods for Auto-Regressive Models