🤯 Complete and available deployment environment
#48 · opened by Lokis
Instance tested: AWS g5 L40S
CUDA 12.1 · 48 GB VRAM · 376 GB RAM · 2 × Intel 6248 (80 threads)
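Before installing anything, it is worth confirming the box actually matches this spec. A minimal check, assuming the NVIDIA driver is already present (as on the stock DLAMI):
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
nvcc --version   # CUDA toolkit (may live under /usr/local/cuda/bin) – needed later to build Flash-Attn
nproc            # expect 80
free -g          # expect ≈376 GB total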
1. What finally works
Item | Version / Value |
---|---|
Model | openbmb/MiniCPM-o-2_6 |
PyTorch | 2.5.1 + cu121 |
TorchVision / TorchAudio | 0.20.1 + cu121 / 2.5.1 + cu121 |
Flash‑Attention 2 | 2.7.1 (built from source) |
vLLM | OpenBMB fork – minicpmo branch (≈ 0.9 dev) |
Auth | vLLM --api-keys (OpenAI-style sk-…) |
Runtime flags | --dtype bfloat16 • --max-model-len 3072 • --max-num-batched-tokens 16384 |
2. Copy‑&‑paste install script
#######################################################################
# A. pyenv + Python 3.10 virtualenv (name: hf310)
#######################################################################
sudo apt update && sudo apt install -y \
make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev \
libsqlite3-dev wget curl llvm libncursesw5-dev xz-utils tk-dev \
libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
curl https://pyenv.run | bash
cat >> ~/.bashrc <<'EOF'
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
EOF
exec $SHELL # reload shell
pyenv install 3.10.14
pyenv virtualenv 3.10.14 hf310
pyenv activate hf310
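# optional check: the hf310 virtualenv should now be active
python -V        # expect: Python 3.10.14
pyenv version    # expect: hf310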
#######################################################################
# B. Torch 2.5.1 + Flash‑Attn 2.7.1 (CUDA 12.1)
#######################################################################
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 \
torchaudio==2.5.1+cu121 \
--index-url https://download.pytorch.org/whl/cu121 --no-cache-dir
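# optional check: the cu121 wheels should see the GPU before building Flash-Attn
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"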
# compile Flash‑Attn for L40S (SM 89)
export TORCH_CUDA_ARCH_LIST="8.9"
pip install --no-build-isolation --no-cache-dir \
"flash-attn @ git+https://github.com/Dao-AILab/[email protected]"
#######################################################################
# C. vLLM (MiniCPMO branch) + multimodal extras
#######################################################################
pip install --no-cache-dir \
"git+https://github.com/OpenBMB/vllm.git@minicpmo#egg=vllm[audio,video]"
#######################################################################
# D. Sanity check
#######################################################################
python - <<'PY'
import torch, vllm, importlib.util as iu
print("Torch :", torch.__version__)
print("vLLM :", vllm.__version__)
print("Flash :", "OK" if iu.find_spec("flash_attn_2_cuda") else "MISSING")
PY
3. Run the server
# Optional: create keys
echo "sk-$(head -c48 /dev/urandom | base64 | tr -dc A-Za-z0-9)" > /opt/keys.txt
export VLLM_API_KEYS_FILE=/opt/keys.txt
vllm serve openbmb/MiniCPM-o-2_6 \
--trust-remote-code \
--dtype bfloat16 \
--port 8000 \
--gpu-memory-utilization 0.95 \
--max-model-len 3072 \
--max-num-batched-tokens 16384 \
--task generate \
--limit-mm-per-prompt image=4,video=1,audio=1 \
--download-dir /opt/dlami/nvme/pkgs
The first request compiles CUDA graphs and is noticeably slow; subsequent calls return quickly.
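A quick text-only smoke test from the same host (assumes the default port and the key file created above, and that the fork accepts standard Bearer auth):
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $(cat /opt/keys.txt)" \
  -H "Content-Type: application/json" \
  -d '{"model":"openbmb/MiniCPM-o-2_6","messages":[{"role":"user","content":"ping"}],"max_tokens":8}'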
4. Client example (OpenAI‑compatible)
from openai import OpenAI
client = OpenAI(api_key="sk-yourKeyHere", base_url="http://YOUR_IP:8000/v1")
resp = client.chat.completions.create(
model="openbmb/MiniCPM-o-2_6",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image briefly."},
{"type": "image_url", "image_url": {"url": "https://picsum.photos/seed/cat/512"}}
],
}],
max_tokens=64,
)
print(resp.choices[0].message.content)
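Streaming goes through the same OpenAI-compatible endpoint; a minimal sketch reusing the client object above:
stream = client.chat.completions.create(
    model="openbmb/MiniCPM-o-2_6",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content  # None on role-only/final chunks
    if delta:
        print(delta, end="", flush=True)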
5. Performance knobs (single L40S)
Flag | Default | Tweak | Effect |
---|---|---|---|
--max-model-len | 3072 | 2048 | frees ≈2 GB VRAM |
--max-num-batched-tokens | 16384 | 12288 | ↓ latency |
--gpu-memory-utilization | 0.95 | 0.92–0.96 | find sweet-spot |
--enforce-eager | off | on | −1 GB peak, −10 % speed |
--quantization awq … | off | on | weights 15 GB → 9 GB |
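For example, a lower-latency, lower-VRAM launch combining several of these knobs (a starting point to profile on your own workload, not a definitive recommendation):
vllm serve openbmb/MiniCPM-o-2_6 \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 8000 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 2048 \
  --max-num-batched-tokens 12288 \
  --enforce-eager \
  --limit-mm-per-prompt image=4,video=1,audio=1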
Final thoughts
Setting up MiniCPM‑o 2.6 was an Olympic triathlon of dependency mismatches: Flash‑Attn wants Torch 2.4, vLLM wants 2.7, CUDA wheels exist only for 2.5. After enough pip sorcery to void every warranty, it finally works—so you don’t have to suffer. Happy multimodal hacking! 🎉