🤯 Complete, working deployment environment

#48
by Lokis - opened

Instance tested: AWS g5 L40S
CUDA 12.1 · 48 GB VRAM · 376 GB RAM · 2 × Intel 6248 (80 threads)
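
Before installing anything, a quick pre-flight check that the box matches these specs (plain NVIDIA tooling, nothing model-specific):

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
nvcc --version | grep release        # needs the CUDA toolkit installed, not just the driver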


1. What finally works

| Item | Version / Value |
|---|---|
| Model | openbmb/MiniCPM-o-2_6 |
| PyTorch | 2.5.1+cu121 |
| TorchVision / TorchAudio | 0.20.1+cu121 / 2.5.1+cu121 |
| Flash-Attention 2 | 2.7.1 (built from source) |
| vLLM | OpenBMB fork, minicpmo branch (≈ 0.9 dev) |
| Auth | vLLM `--api-key` (OpenAI-style `sk-…`) |
| Runtime flags | `--dtype bfloat16` • ctx = 3072 • batch = 16384 |

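Once the install script below has run, the pins in this table can be double-checked with a quick grep, in addition to the sanity check in section D:

pip list | grep -Ei "torch|flash|vllm"
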
2. Copy‑&‑paste install script

#######################################################################
# A. pyenv + Python 3.10 virtualenv (name: hf310)
#######################################################################
sudo apt update && sudo apt install -y \
  make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev \
  libsqlite3-dev wget curl llvm libncursesw5-dev xz-utils tk-dev \
  libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

curl https://pyenv.run | bash
cat >> ~/.bashrc <<'EOF'
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
EOF
exec $SHELL                          # reload shell

pyenv install 3.10.14
pyenv virtualenv 3.10.14 hf310
pyenv activate hf310
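
# Optional sanity check: confirm the hf310 env is active before installing anything heavy
pyenv version         # should print: hf310
python --version      # should print: Python 3.10.14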

#######################################################################
# B. Torch 2.5.1 + Flash‑Attn 2.7.1 (CUDA 12.1)
#######################################################################
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 \
            torchaudio==2.5.1+cu121 \
            --index-url https://download.pytorch.org/whl/cu121 --no-cache-dir

# compile Flash‑Attn for L40S (SM 89)
export TORCH_CUDA_ARCH_LIST="8.9"
pip install --no-build-isolation --no-cache-dir \
  "flash-attn @ git+https://github.com/Dao-AILab/[email protected]"

#######################################################################
# C. vLLM (MiniCPMO branch) + multimodal extras
#######################################################################
pip install --no-cache-dir \
  "git+https://github.com/OpenBMB/vllm.git@minicpmo#egg=vllm[audio,video]"

#######################################################################
# D. Sanity check
#######################################################################
python - <<'PY'
import torch, vllm, importlib.util as iu
print("Torch :", torch.__version__)
print("vLLM  :", vllm.__version__)
print("Flash :", "OK" if iu.find_spec("flash_attn_2_cuda") else "MISSING")
PY

3. Run the server

# Optional: generate an API key; the --api-key flag below makes the server require it
# (drop both the key file and the flag if you don't want auth)
echo "sk-$(head -c48 /dev/urandom | base64 | tr -dc A-Za-z0-9)" > /opt/keys.txt

vllm serve openbmb/MiniCPM-o-2_6 \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 8000 \
  --api-key "$(cat /opt/keys.txt)" \
  --gpu-memory-utilization 0.95 \
  --max-model-len 3072 \
  --max-num-batched-tokens 16384 \
  --task generate \
  --limit-mm-per-prompt image=4,video=1,audio=1 \
  --download-dir /opt/dlami/nvme/pkgs

The first request triggers CUDA-graph capture and is noticeably slow; subsequent calls are fast.
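
A quick smoke test against the OpenAI-compatible endpoints, assuming the server runs on the same box and you created the key file above:

curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer $(cat /opt/keys.txt)" | python -m json.tool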


4. Client example (OpenAI‑compatible)

from openai import OpenAI
client = OpenAI(api_key="sk-yourKeyHere", base_url="http://YOUR_IP:8000/v1")

resp = client.chat.completions.create(
    model="openbmb/MiniCPM-o-2_6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image briefly."},
            {"type": "image_url", "image_url": {"url": "https://picsum.photos/seed/cat/512"}}
        ],
    }],
    max_tokens=64,
)
print(resp.choices[0].message.content)
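
The same request can also be sent from the shell if you just want to poke the endpoint without Python (identical payload shape, via curl):

curl -s http://YOUR_IP:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-yourKeyHere" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openbmb/MiniCPM-o-2_6",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image briefly."},
            {"type": "image_url", "image_url": {"url": "https://picsum.photos/seed/cat/512"}}
          ]
        }],
        "max_tokens": 64
      }'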

5. Performance knobs (single L40S)

| Flag | Default | Tweak | Effect |
|---|---|---|---|
| `--max-model-len` | 3072 | 2048 | frees ≈2 GB VRAM |
| `--max-num-batched-tokens` | 16384 | 12288 | ↓ latency |
| `--gpu-memory-utilization` | 0.95 | 0.92–0.96 | find the sweet spot |
| `--enforce-eager` | off | on | −1 GB peak, −10 % speed |
| `--quantization awq …` | off | on | weights 15 GB → 9 GB |
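
For reference, a lower-VRAM variant of the section-3 launch command that combines a few of these tweaks (values are starting points, not benchmarked):

vllm serve openbmb/MiniCPM-o-2_6 \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 8000 \
  --api-key "$(cat /opt/keys.txt)" \
  --gpu-memory-utilization 0.92 \
  --max-model-len 2048 \
  --max-num-batched-tokens 12288 \
  --enforce-eager \
  --task generate \
  --limit-mm-per-prompt image=4,video=1,audio=1 \
  --download-dir /opt/dlami/nvme/pkgs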

Final thoughts

Setting up MiniCPM‑o 2.6 was an Olympic triathlon of dependency mismatches: Flash‑Attn wants Torch 2.4, vLLM wants 2.7, CUDA wheels exist only for 2.5. After enough pip sorcery to void every warranty, it finally works—so you don’t have to suffer. Happy multimodal hacking! 🎉
