FastChat is an open platform for training, serving, and evaluating large language model-based chatbots.

If you do not have enough memory, you can enable 8-bit compression by adding `--load-8bit` to the commands above. This can reduce memory usage by around half with slightly degraded model quality, and it is compatible with the CPU, GPU, and Metal backends. Vicuna-13B with 8-bit compression can run on a single NVIDIA 3090/4080/T4/V100 (16GB) GPU. You can also add `--cpu-offloading` to the commands above to offload weights that do not fit on your GPU into CPU memory. This requires 8-bit compression to be enabled and the `bitsandbytes` package to be installed, and it is only available on Linux.
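As a minimal sketch, both flags can be appended to the single-GPU CLI command; the Vicuna model path below is illustrative, so substitute the checkpoint you are actually serving:

```bash
# Single-GPU inference with 8-bit compression enabled.
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-13b-v1.5 --load-8bit

# Additionally offload weights that do not fit on the GPU into CPU RAM
# (Linux only; requires 8-bit compression and the bitsandbytes package).
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-13b-v1.5 --load-8bit --cpu-offloading
```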
## Features
- The weights, training code, and evaluation code for state-of-the-art models
- A distributed multi-model serving system with a web UI and OpenAI-compatible RESTful APIs (a sketch of the serving stack follows this list)
- An end-to-end workflow for training, serving, and evaluating large language models
- Reduced CPU RAM requirements for weight conversion
- Inference through a command-line interface
- Support for a variety of models
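As a sketch of the distributed serving system mentioned above, the stack is typically brought up as three processes; the model path and port here are illustrative assumptions, not fixed values:

```bash
# Start the controller that coordinates the model workers.
python3 -m fastchat.serve.controller

# Start a model worker that loads and serves one model
# (the model path is an example; use your own checkpoint).
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-13b-v1.5

# Expose the OpenAI-compatible RESTful API (port 8000 is an assumption).
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```

Because the API is OpenAI-compatible, existing OpenAI client code can be pointed at this endpoint by changing its base URL.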