
Intel LLM Library for PyTorch: v0.1.0-mas (Multi-Arc Serving)
| Name | Modified | Size |
| --- | --- | --- |
| Multi-Arc Serving release 0.1.0 source code.tar.gz | 2025-04-07 | 4.2 MB |
| Multi-Arc Serving release 0.1.0 source code.zip | 2025-04-07 | 6.0 MB |
| README.md | 2025-04-07 | 1.9 kB |

Totals: 3 items, 10.2 MB

Overview

This release introduces the latest update to the Multi-Arc vLLM serving solution, optimized for Intel Xeon + Arc platforms running vLLM with ipex-llm. The new version delivers low-latency, high-throughput LLM serving with improved model compatibility and resource efficiency. Major component upgrades include:

  • vLLM upgraded to 0.6.6
  • PyTorch upgraded to 2.6
  • oneAPI upgraded to 2025.0
  • oneCCL patch updated to 0.0.6.6
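
For context, a typical launch of the OpenAI-compatible server on this stack looks like the sketch below. The entrypoint module and flags follow ipex-llm's vLLM serving documentation; the model path, low-bit format, and sizing values are illustrative assumptions, not release-specific settings.

```bash
# Hedged sketch: serve a model across two Arc GPUs with ipex-llm's vLLM fork.
# Model path and tuning flags below are placeholder assumptions.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --model /llm/models/Qwen2.5-7B-Instruct \
  --served-model-name Qwen2.5-7B-Instruct \
  --device xpu \
  --dtype float16 \
  --load-in-low-bit fp8 \
  --max-model-len 4096 \
  --tensor-parallel-size 2 \
  --port 8000
```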

New Features

  • Optimized vLLM serving for Intel Xeon + Arc multi-GPU platforms, enabling lower latency and higher throughput.
  • Supported a broader range of LLM models.
  • Enhanced support for loading models with a minimal memory footprint.
  • Refined the Docker image for easier use and deployment.
  • Improved WebUI model connectivity and stability.
  • Added the `VLLM_LOG_OUTPUT=1` option to enable detailed input/output logging for vLLM (see the snippet after this list).
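
As referenced in the last item, the new logging switch is an environment variable set before the server starts; a minimal sketch:

```bash
# Enable detailed input/output logging for vLLM requests (new in this release).
# Assumption: when unset, logging stays at its default verbosity.
export VLLM_LOG_OUTPUT=1
```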

Bug Fixes

  • Resolved multimodal issues, including `get_image` failures and inference errors with models such as MiniCPM-V-2_6, Qwen2-VL, and GLM-4v-9B.
  • Fixed a Qwen2-VL multi-request crash by removing `Qwen2VisionAttention`'s `attention_mask` and addressing `mrope_positions` instability.
  • Updated `profile_run` usage to avoid out-of-memory (OOM) crashes.
  • Resolved GQA kernel issues that caused errors with multiple concurrent outputs.
  • Fixed a crash related to `--enable-prefix-caching` in specific cases.
  • Addressed a low-bit overflow that produced garbled `!!!!!!` output with DeepSeek-R1-Distill-Qwen-14B.
  • Resolved GPTQ- and AWQ-related errors to improve compatibility across more models.
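
Several of the fixes above (GQA kernels, the Qwen2-VL multi-request crash, prefix caching) only surface under concurrent load. A quick way to exercise those paths, assuming the server from the earlier sketch is listening on port 8000:

```bash
# Fire two concurrent completion requests at the standard OpenAI-compatible
# endpoint to exercise batched/concurrent decoding paths; the endpoint and
# model name match the placeholder launch sketch above.
for i in 1 2; do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen2.5-7B-Instruct", "prompt": "Hello", "max_tokens": 32}' &
done
wait
```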

Docker Images
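
A minimal sketch of running a serving container on multi-Arc hardware follows; the image name and tag are assumptions to verify against the project's published images, while `/dev/dri` passthrough and a large shared-memory segment are the usual requirements for Intel multi-GPU containers.

```bash
# Hedged sketch: run the serving container with Arc GPUs passed through.
# Image name/tag are placeholder assumptions; confirm against the README.
docker run -itd \
  --net=host \
  --device=/dev/dri \
  --shm-size=16g \
  -v /path/to/models:/llm/models \
  --name=multi-arc-serving \
  intelanalytics/ipex-llm-serving-xpu:latest
```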

Source: README.md, updated 2025-04-07