Port of Facebook's LLaMA model in C/C++
Run local LLMs on any device, open-source
Large Language Model Text Generation Inference
Operating LLMs in production
C++ implementation of ChatGLM-6B, ChatGLM2-6B, ChatGLM3, and GLM4(V)
A high-throughput and memory-efficient inference and serving engine
Phi-3.5 for Mac: Locally-run Vision and Language Models
Sparsity-aware deep learning inference runtime for CPUs
OpenAI-style API for open large language models
Neural Network Compression Framework for enhanced OpenVINO
Replace OpenAI GPT with another LLM in your app
FlashInfer: Kernel Library for LLM Serving
Efficient few-shot learning with Sentence Transformers
State-of-the-art Parameter-Efficient Fine-Tuning
A high-performance ML model serving framework offering dynamic batching
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
PyTorch library of curated Transformer models and their components
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
DoWhy is a Python library for causal inference
Libraries for applying sparsification recipes to neural networks
LLM training code for MosaicML foundation models
The easiest and laziest way to build multi-agent LLM applications
Optimizing inference proxy for LLMs
Low-latency REST API for serving text-embeddings
20+ high-performance LLMs with recipes to pretrain and finetune at scale