🤯 Complete, working deployment environment

#48
by Lokis - opened

Instance tested: AWS g5 L40S
CUDA 12.1 · 48 GB VRAM · 376 GB RAM · 2 × Intel 6248 (80 threads)
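
Before installing anything, a quick pre-flight check that the box matches these specs (plain NVIDIA tooling, nothing model-specific):

nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
nvcc --version | grep release        # needs the CUDA toolkit installed, not just the driver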


1. What finally works

| Item | Version / Value |
|---|---|
| Model | openbmb/MiniCPM-o-2_6 |
| PyTorch | 2.5.1+cu121 |
| TorchVision / TorchAudio | 0.20.1+cu121 / 2.5.1+cu121 |
| Flash-Attention 2 | 2.7.1 (built from source) |
| vLLM | OpenBMB fork, minicpmo branch (≈ 0.9 dev) |
| Auth | vLLM `--api-key` (OpenAI-style `sk-…`) |
| Runtime flags | `--dtype bfloat16` • ctx = 3072 • batch = 16384 |

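Once the install script below has run, the pins in this table can be double-checked with a quick grep, in addition to the sanity check in section D:

pip list | grep -Ei "torch|flash|vllm"
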
2. Copy‑&‑paste install script

#######################################################################
# A. pyenv + Python 3.10 virtualenv (name: hf310)
#######################################################################
sudo apt update && sudo apt install -y \
  make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev \
  libsqlite3-dev wget curl llvm libncursesw5-dev xz-utils tk-dev \
  libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

curl https://pyenv.run | bash
cat >> ~/.bashrc <<'EOF'
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
EOF
exec $SHELL                          # reload shell

pyenv install 3.10.14
pyenv virtualenv 3.10.14 hf310
pyenv activate hf310
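
# Optional sanity check: confirm the hf310 env is active before installing anything heavy
pyenv version         # should print: hf310
python --version      # should print: Python 3.10.14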

#######################################################################
# B. Torch 2.5.1 + Flash‑Attn 2.7.1 (CUDA 12.1)
#######################################################################
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 \
            torchaudio==2.5.1+cu121 \
            --index-url https://download.pytorch.org/whl/cu121 --no-cache-dir

# compile Flash‑Attn for L40S (SM 89)
export TORCH_CUDA_ARCH_LIST="8.9"
pip install --no-build-isolation --no-cache-dir \
  "flash-attn @ git+https://github.com/Dao-AILab/[email protected]"

#######################################################################
# C. vLLM (MiniCPMO branch) + multimodal extras
#######################################################################
pip install --no-cache-dir \
  "git+https://github.com/OpenBMB/vllm.git@minicpmo#egg=vllm[audio,video]"

#######################################################################
# D. Sanity check
#######################################################################
python - <<'PY'
import torch, vllm, importlib.util as iu
print("Torch :", torch.__version__)
print("vLLM  :", vllm.__version__)
print("Flash :", "OK" if iu.find_spec("flash_attn_2_cuda") else "MISSING")
PY

3. Run the server

# Optional: generate an API key; the --api-key flag below makes the server require it
# (drop both the key file and the flag if you don't want auth)
echo "sk-$(head -c48 /dev/urandom | base64 | tr -dc A-Za-z0-9)" > /opt/keys.txt

vllm serve openbmb/MiniCPM-o-2_6 \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 8000 \
  --api-key "$(cat /opt/keys.txt)" \
  --gpu-memory-utilization 0.95 \
  --max-model-len 3072 \
  --max-num-batched-tokens 16384 \
  --task generate \
  --limit-mm-per-prompt image=4,video=1,audio=1 \
  --download-dir /opt/dlami/nvme/pkgs

The first request triggers CUDA-graph capture and is noticeably slow; subsequent calls are fast.
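
A quick smoke test against the OpenAI-compatible endpoints, assuming the server runs on the same box and you created the key file above:

curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer $(cat /opt/keys.txt)" | python -m json.tool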


4. Client example (OpenAI‑compatible)

from openai import OpenAI
client = OpenAI(api_key="sk-yourKeyHere", base_url="http://YOUR_IP:8000/v1")

resp = client.chat.completions.create(
    model="openbmb/MiniCPM-o-2_6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image briefly."},
            {"type": "image_url", "image_url": {"url": "https://picsum.photos/seed/cat/512"}}
        ],
    }],
    max_tokens=64,
)
print(resp.choices[0].message.content)
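
The same request can also be sent from the shell if you just want to poke the endpoint without Python (identical payload shape, via curl):

curl -s http://YOUR_IP:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-yourKeyHere" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openbmb/MiniCPM-o-2_6",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image briefly."},
            {"type": "image_url", "image_url": {"url": "https://picsum.photos/seed/cat/512"}}
          ]
        }],
        "max_tokens": 64
      }'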

5. Performance knobs (single L40S)

| Flag | Default | Tweak | Effect |
|---|---|---|---|
| `--max-model-len` | 3072 | 2048 | frees ≈2 GB VRAM |
| `--max-num-batched-tokens` | 16384 | 12288 | ↓ latency |
| `--gpu-memory-utilization` | 0.95 | 0.92–0.96 | find the sweet spot |
| `--enforce-eager` | off | on | −1 GB peak, −10 % speed |
| `--quantization awq …` | off | on | weights 15 GB → 9 GB |
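
For reference, a lower-VRAM variant of the section-3 launch command that combines a few of these tweaks (values are starting points, not benchmarked):

vllm serve openbmb/MiniCPM-o-2_6 \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 8000 \
  --api-key "$(cat /opt/keys.txt)" \
  --gpu-memory-utilization 0.92 \
  --max-model-len 2048 \
  --max-num-batched-tokens 12288 \
  --enforce-eager \
  --task generate \
  --limit-mm-per-prompt image=4,video=1,audio=1 \
  --download-dir /opt/dlami/nvme/pkgs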

Final thoughts

Setting up MiniCPM‑o 2.6 was an Olympic triathlon of dependency mismatches: Flash‑Attn wants Torch 2.4, vLLM wants 2.7, CUDA wheels exist only for 2.5. After enough pip sorcery to void every warranty, it finally works—so you don’t have to suffer. Happy multimodal hacking! 🎉
