vLLM fails to run this model on AMD Instinct MI210. Are there any solutions? Are there any other deepseek-r1-671b models that run successfully on AMD Instinct MI210 + vLLM? Thanks!
Error message:
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 372, in __init__
    assert self.quant_method is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
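The assertion says the fused MoE layer ended up without a quantization method, so one thing worth checking is what quantization the checkpoint itself declares. The sketch below is not part of the original report: it assumes the checkpoint follows the usual Hugging Face layout, where `config.json` carries a `quantization_config` block, and the helper names `read_quant_config` and `describe_quant_config` are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def read_quant_config(model_dir: str) -> Optional[dict]:
    """Return the `quantization_config` block from config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


def describe_quant_config(quant_config: Optional[dict]) -> str:
    """Summarize the declared quantization in one line for a bug report."""
    if quant_config is None:
        return "no quantization_config found"
    # Key names are assumed from typical AWQ checkpoints; missing keys
    # are reported as "unknown" rather than guessed.
    method = quant_config.get("quant_method", "unknown")
    bits = quant_config.get("bits", "unknown")
    return f"quant_method={method}, bits={bits}"
```

Calling `describe_quant_config(read_quant_config("/models/DeepSeek-R1-awq"))` inside the container prints what the checkpoint declares. Whether the installed vLLM build has a MoE kernel for that method on ROCm is a separate question that this check does not answer.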
vLLM version? start up command? OS env?
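Most of the details requested here can be gathered with a short standard-library script. This is only a sketch: the package list is an assumption about what is relevant, the function names are hypothetical, and it does not report GPU or driver details.

```python
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_env_summary(packages=("vllm", "torch", "transformers")) -> dict:
    """Collect OS, Python, and package versions into a flat dict."""
    summary = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        summary[name] = package_version(name)
    return summary


def format_summary(summary: dict) -> str:
    """Render the summary as `key: value` lines for pasting into a thread."""
    return "\n".join(f"{key}: {value}" for key, value in summary.items())


print(format_summary(collect_env_summary()))
```

Running it inside the same container that serves the model keeps the reported versions consistent with the failing environment.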
Hi @v2ray!
Here are the details of the vLLM version and OS environment: https://github.com/vllm-project/vllm/issues/16386
My startup commands are:
Start a docker container:
docker run -it --rm --ipc=host --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri/card5 --device=/dev/mem --group-add render --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --network host \
  --name dsr1awq \
  --shm-size 896g \
  -v "/root/models:/models" \
  --privileged \
  -p 6381:6381 \
  -p 1001:1001 \
  -p 2001:2001 \
  -e NCCL_IB_HCA=mlx5 \
  -e NCCL_P2P_DISABLE=1 \
  vllm-dsr1:v1 bash
In the started docker container, run model:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server --model /models/DeepSeek-R1-awq --tensor-parallel-size 8 --port 1001 --enforce_eager --distributed-executor-backend mp --pipeline-parallel-size 1 --max-model-len 1024 --dtype float16 --max-num-batched-tokens 1024 --trust-remote-code --enable-prefix-caching
The docker image vllm-dsr1:v1 is an alias for rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250311.
Thanks for your help!
vllm0.7.3
Try building from source.
Hi! Even when building from source, I still hit the same assertion error. See https://github.com/vllm-project/vllm/issues/15101
Welp, then I have no idea. I only tested it on CUDA hardware.