GPU requirement for hosting this model?
#9 by csgxy2022 - opened
I have two A100 GPUs and am trying to host this model, but I hit an OOM issue with this command:
docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=xxx" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model gradientai/Llama-3-8B-Instruct-Gradient-1048k --tensor-parallel-size 2
I got:
torch.cuda.OutOfMemoryError: CUDA out of memory
I have no problem hosting the original Llama-3-8B-Instruct model.
The following should do the job for vLLM (2x A100):
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--shm-size=8g \
--ipc=host \
vllm/vllm-openai:latest \
--model gradientai/Llama-3-8B-Instruct-Gradient-1048k \
--tensor-parallel-size 2 \
--max-model-len 65536
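
If the server comes up cleanly, a quick way to sanity-check it is through the OpenAI-compatible API it exposes on port 8000. A minimal sketch in Python, assuming the server from the command above and a dummy API key (vLLM accepts any key unless one was configured):

# Minimal smoke test against the vLLM OpenAI-compatible server started above.
# Assumes it is reachable on localhost:8000; "EMPTY" is a dummy API key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="gradientai/Llama-3-8B-Instruct-Gradient-1048k",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)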
vLLM is not optimal here; IIRC it would require around ~1000 GB of VRAM to serve a model with this hidden dimension at the full context length.
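
Whatever the exact figure, the dominant cost at this context length is the KV cache. A rough back-of-the-envelope estimate, assuming the standard Llama-3-8B architecture (32 layers, 8 KV heads with GQA, head dim 128) and an fp16 cache:

# Per-token KV-cache footprint: K and V, across all layers, fp16 (2 bytes/element).
num_layers, num_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # 131072 B

seq_len = 1_048_576  # the advertised 1048k context
full_seq_gib = kv_bytes_per_token * seq_len / 2**30
print(f"{kv_bytes_per_token} bytes/token -> {full_seq_gib:.0f} GiB per full-length sequence")
# ~128 GiB of KV cache for a single full-length sequence, on top of ~16 GB of weights.
# vLLM also pre-allocates its cache pool for concurrent sequences, which is why
# capping --max-model-len is what makes this fit on 2x 80 GB A100s.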