- Device memory is now allocated dynamically on demand so that it can be fully utilized. There is no longer a preset scratch size or memory size.
- Drop Baichuan/InternLM support since they were integrated in llama.cpp.
- API changes:
  - CMake CUDA option: `-DGGML_CUBLAS` changed to `-DGGML_CUDA`.
  - CMake CUDA architecture: `-DCUDA_ARCHITECTURES` changed to `-DCMAKE_CUDA_ARCHITECTURES`.
  - `num_threads` in `GenerationConfig` was removed: the optimal thread settings will be automatically selected.
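A minimal sketch of how a CUDA build invocation changes with the renamed options (the build directory name and the `"80"` architecture value are illustrative; pick the architecture matching your GPU):

```shell
# Old flags (no longer recognized):
#   cmake -B build -DGGML_CUBLAS=ON -DCUDA_ARCHITECTURES="80"

# New flags:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="80"
cmake --build build -j
```

Note that `CMAKE_CUDA_ARCHITECTURES` is CMake's standard variable for CUDA targets, so it also accepts a semicolon-separated list (e.g. `"75;80;86"`).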