Instructions to use Denali-AI/Rainier-VL-2B-Base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Denali-AI/Rainier-VL-2B-Base with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Denali-AI/Rainier-VL-2B-Base", trust_remote_code=True)
```

```
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Denali-AI/Rainier-VL-2B-Base", trust_remote_code=True, dtype="auto")
```

Local Apps

vLLM

How to use Denali-AI/Rainier-VL-2B-Base with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Denali-AI/Rainier-VL-2B-Base"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Denali-AI/Rainier-VL-2B-Base",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
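The same completion request can be built from Python. The sketch below uses only the standard library and assumes a server is listening at `http://localhost:8000` as in the curl call; the helper names `build_completion_request` and `send_completion` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "Denali-AI/Rainier-VL-2B-Base"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request, timeout=60):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage (requires a running server):
#   text = send_completion(build_completion_request("Once upon a time,"))
```

Separating request construction from sending makes the payload easy to inspect or log before anything touches the network. For the SGLang server, the same helper should work with `base_url="http://localhost:30000"`.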

Use Docker

```
docker model run hf.co/Denali-AI/Rainier-VL-2B-Base
```

SGLang

How to use Denali-AI/Rainier-VL-2B-Base with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Denali-AI/Rainier-VL-2B-Base" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Denali-AI/Rainier-VL-2B-Base",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Denali-AI/Rainier-VL-2B-Base" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Denali-AI/Rainier-VL-2B-Base",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Docker Model Runner
How to use Denali-AI/Rainier-VL-2B-Base with Docker Model Runner:
```
docker model run hf.co/Denali-AI/Rainier-VL-2B-Base
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Access to the Rainier-VL model family is granted on request. Denali AI releases these weights under the Apache 2.0 license for product and research use in industrial and product-line visual inspection. By requesting access you agree to use the model in compliance with the license and applicable law, and not for unlawful surveillance or harm. Requests are reviewed manually.

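Accuracy vs. memory: Rainier-VL-2B-Base sits on the Pareto front, with the highest VisA AUROC in the comparison (0.949) at a 4.19 GB bf16 footprint, roughly half that of the 3 B+ models. Only Qwen3-VL-2B is smaller (4.00 GB), and it scores lower (0.839), so no evaluated model beats Rainier on both axes.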

Domain defect detection — AUROC (higher=better)

Model	VisA	BSData	BTAD
Rainier-VL-2B-Base	0.949	0.982	0.983
Qwen2.5-VL-3B	0.702	0.619	0.785
Qwen3-VL-4B	0.840	0.915	0.571
Qwen3.5-4B	0.683	0.794	0.766
Gemma-4-E2B	0.625	0.933	0.321
Holo-3.1-4B	0.728	0.855	0.637
NuExtract3	0.709	0.911	0.270
GLM-4.6V-Flash	0.852	0.943	0.898
GRM-2.5	0.683	0.794	0.766
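
For readers less familiar with the metric: image-level AUROC is the probability that a randomly chosen defective image gets a higher defect score than a randomly chosen clean one. The sketch below computes it by direct pairwise comparison; the scores and labels are made-up toy values, not data from any benchmark in the table, and the evaluation harness used for the table may compute AUROC differently (for example via a library routine).

```python
def auroc(scores, labels):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy defect scores for three defective (label 1) and three clean (label 0) images.
toy_scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(round(auroc(toy_scores, toy_labels), 3))  # 8 of 9 pairs are ordered correctly
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the scores below 0.5 in the BTAD column indicate rankings worse than chance on that dataset.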

Served on vLLM — accuracy, latency, throughput, footprint (same harness)

Model	Params (B)	VRAM (GB)	VisA AUROC	Latency (ms)	Decode tok/s	Throughput tok/s
Rainier-VL-2B-Base	2.10	4.19	0.949	27	276	2830
Qwen3-VL-4B	4.44	8.88	0.877	64	138	2094
Qwen3-VL-2B	2.00	4.00	0.839	67	288	2809
NuExtract3	4.54	9.32	0.734	69	138	1589
Qwen2.5-VL-3B	3.75	7.51	0.718	79	173	2211
Qwen3.5-4B	4.54	9.32	0.673	57	139	1679
GRM-2.5	4.54	9.32	0.673	66	139	1599
Holo-3.1-4B	4.54	10.35	0.660	70	139	1697
Gemma-4-E2B	5.10	10.25	0.642	31	163	3260

VRAM is the bf16 weight residency, i.e. the lower bound on the serving footprint; Rainier's measured peak at batch 1 is 4.61 GB. Rainier needs roughly half the VRAM of the 3 B+ models (7.51 to 10.35 GB).
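
As a rough cross-check of the VRAM column, bf16 stores two bytes per parameter, so weight residency should be close to twice the parameter count in billions. The sketch below applies that rule to rows of the table; it assumes 1 GB = 10^9 bytes, which the source does not state, and the function name is illustrative.

```python
BYTES_PER_BF16_PARAM = 2  # bf16 is a 16-bit format


def bf16_weight_gb(params_billion):
    """Estimated bf16 weight size in GB, assuming 1 GB = 1e9 bytes."""
    return params_billion * 1e9 * BYTES_PER_BF16_PARAM / 1e9


# (parameters in billions, reported VRAM in GB) copied from the table above.
reported = {
    "Rainier-VL-2B-Base": (2.10, 4.19),
    "Qwen3-VL-4B": (4.44, 8.88),
    "Qwen3-VL-2B": (2.00, 4.00),
    "Qwen2.5-VL-3B": (3.75, 7.51),
    "NuExtract3": (4.54, 9.32),
    "Gemma-4-E2B": (5.10, 10.25),
}

for name, (params_b, vram_gb) in reported.items():
    estimate = bf16_weight_gb(params_b)
    print(f"{name:20s} estimate {estimate:5.2f} GB  reported {vram_gb:5.2f} GB")
```

The first four rows agree with the estimate to within rounding of the parameter count. Rows such as NuExtract3 report somewhat more than the estimate; the source does not say what accounts for the difference.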

Latency is per-query inference time; the one-time SigLIP image encode is a fixed per-image setup cost (run once, reused across every query on that image) and is not a per-query term.
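
The cost model described above (a fixed per-image encode plus a per-query term) can be written out as a small sketch. The 27 ms per-query figure comes from the table; the encode time is not reported in the source, so `encode_ms` below is a placeholder value chosen only for illustration.

```python
def total_latency_ms(num_queries, per_query_ms, encode_ms):
    """Total time for num_queries queries against one already-loaded image."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    # The image encode is paid once; each query adds only the per-query cost.
    return encode_ms + num_queries * per_query_ms


def amortized_latency_ms(num_queries, per_query_ms, encode_ms):
    """Average cost per query once the one-time encode is spread across queries."""
    return total_latency_ms(num_queries, per_query_ms, encode_ms) / num_queries


PER_QUERY_MS = 27.0   # from the table
ENCODE_MS = 50.0      # placeholder, NOT a measured value

for n in (1, 10, 100):
    print(n, amortized_latency_ms(n, PER_QUERY_MS, ENCODE_MS))
```

As the number of queries per image grows, the amortized cost approaches the per-query latency, which is the quantity the table reports.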

Rainier-VL-2B-Base offers the best overall balance of footprint, accuracy, latency, and throughput in this comparison. It is outright best on accuracy (0.949 VisA AUROC, vs. next-best 0.877) and latency (27 ms, below the fastest peer's 31 ms), and it is smaller than every 3 B+ model (2.10 B parameters, 4.19 GB); only Qwen3-VL-2B is marginally smaller (2.00 B, 4.00 GB). It remains competitive on throughput (2830 tok/s, second to Gemma-4-E2B's 3260) and decode speed (276 tok/s, second to Qwen3-VL-2B's 288). It is also the only model in the set that emits pixel masks in addition to defect boxes. Its Mamba state-space layers hold a fixed-size state, whereas a transformer's KV-cache grows with every token, so those layers add no per-token memory as context grows.
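
A Pareto check over the VRAM and VisA AUROC columns of the vLLM table makes the accuracy-versus-memory trade-off explicit. The sketch below treats a model as dominated if some other model has no more VRAM and no lower AUROC, and is strictly better on at least one of the two; the helper names are illustrative.

```python
# (VRAM in GB, VisA AUROC) copied from the vLLM comparison table.
MODELS = {
    "Rainier-VL-2B-Base": (4.19, 0.949),
    "Qwen3-VL-4B": (8.88, 0.877),
    "Qwen3-VL-2B": (4.00, 0.839),
    "NuExtract3": (9.32, 0.734),
    "Qwen2.5-VL-3B": (7.51, 0.718),
    "Qwen3.5-4B": (9.32, 0.673),
    "GRM-2.5": (9.32, 0.673),
    "Holo-3.1-4B": (10.35, 0.660),
    "Gemma-4-E2B": (10.25, 0.642),
}


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    vram_a, auroc_a = a
    vram_b, auroc_b = b
    no_worse = vram_a <= vram_b and auroc_a >= auroc_b
    strictly_better = vram_a < vram_b or auroc_a > auroc_b
    return no_worse and strictly_better


def pareto_front(models):
    """Names of models that no other model dominates, sorted alphabetically."""
    front = []
    for name, point in models.items():
        others = (p for other, p in models.items() if other != name)
        if not any(dominates(p, point) for p in others):
            front.append(name)
    return sorted(front)


print(pareto_front(MODELS))  # ['Qwen3-VL-2B', 'Rainier-VL-2B-Base']
```

On these two axes the front contains Rainier-VL-2B-Base and Qwen3-VL-2B: Rainier dominates every 3 B+ model, while Qwen3-VL-2B trades 0.19 GB less VRAM for 0.110 lower AUROC.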

Capabilities vs. every evaluated model

✓ = supported in the public release; ~ = partial/limited (e.g. JSON via prompting, not schema-enforced); ✗ = not available. Grounding = documented bbox output; Seg. = pixel masks; Stream. O(1) = constant-time-per-frame streaming.

Model	VQA	Struct. JSON	Grounding (bbox)	Seg. (mask)	Stream. O(1)
Rainier-VL-2B-Base	✓	✓	✓	✓	✓
Qwen2.5-VL-3B	✓	~	✓	✗	✗
Qwen3-VL-2B	✓	~	✓	✗	✗
Qwen3-VL-4B	✓	~	✓	✗	✗
Gemma-4-E2B	✓	~	✗	✗	✗
GLM-4.6V-Flash	✓	~	✓	✗	✗
Holo-3.1-4B	✓	~	✓	✗	✗
NuExtract3	✓	✓	✗	✗	✗
Qwen3.5-4B	✓	~	✗	✗	✗
GRM-2.5	✓	~	✗	✗	✗

Rainier is the only model in the evaluated set that emits pixel masks and exposes a constant-time-per-frame streaming path.

VRAM vs. context length (3 B+ field, bf16)

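Peak inference VRAM (GB) when generating at context lengths from 128 to 32K tokens. Rainier holds the lowest footprint at every measured length: roughly half the field at short contexts (4.58 GB vs. 7.81 to 10.76 GB at 128 tokens), and 6.14 GB vs. 8.21 to 11.52 GB at 16K. Its hybrid backbone keeps a fixed-size Mamba state for most layers, so those layers add no per-token memory, unlike a transformer's KV-cache. In this eager-mode measurement Rainier's footprint still rises by about 1.6 GB between 128 and 16K tokens, more than most peers in absolute terms; the note under the table explains why this curve is conservative. Peers were measured to 16K, Rainier to 32K; "—" = not measured.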

Model	128	512	2048	8192	16384	32768
Rainier-VL-2B-Base	4.58	4.47	4.61	5.27	6.14	7.89
Qwen2.5-VL-3B	7.81	7.81	7.81	7.91	8.21	—
Qwen3-VL-4B	9.19	9.19	9.37	10.29	11.52	—
NuExtract3	9.34	9.34	9.34	9.46	9.75	—
Qwen3.5-4B	9.34	9.34	9.34	9.46	9.75	—
GRM-2.5	9.34	9.34	9.34	9.46	9.75	—
Holo-3.1-4B	9.35	9.35	9.35	9.48	9.76	—
Gemma-4-E2B	10.76	10.76	10.76	10.76	11.22	—

Measured in eager mode at bf16 with a single image. The Rainier curve is conservative because the cached single-step decode kernel is not engaged on the eager path. Decode is also O(1) per token: Rainier's cached state-update decode holds a flat ~76 tok/s from 512 to 8192 tokens, whereas a cache-free re-scan degrades from 28 to 20 tok/s over the same range.
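
The difference between a growing KV-cache and a fixed-size recurrent state can be sketched with the standard size formulas. Every dimension below (layer count, KV heads, head size, state size) is a hypothetical value chosen for illustration; none is taken from Rainier's or any peer's actual configuration, and batch size 1 is assumed.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Attention KV-cache size: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def fixed_state_bytes(n_layers, state_elems_per_layer, bytes_per_elem=2):
    """Recurrent (state-space) state size: independent of sequence length."""
    return n_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical configuration, bf16 (2 bytes per element).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
STATE_ELEMS = 1_000_000

for seq_len in (128, 2048, 8192, 32768):
    kv_mib = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 2**20
    state_mib = fixed_state_bytes(N_LAYERS, STATE_ELEMS) / 2**20
    print(f"{seq_len:6d} tokens  KV-cache {kv_mib:8.1f} MiB  fixed state {state_mib:6.1f} MiB")
```

The KV-cache term scales linearly with `seq_len`, while the fixed-state term does not depend on it at all. A hybrid model pays the linear term only for its attention layers, which is consistent with a footprint that grows with context but from a lower base.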

Defect detection, localization & segmentation

Capability	Metric	Value
Defect detection (image)	AUROC — BTAD / BSData / VisA	0.983 / 0.982 / 0.949
Defect segmentation (OneFormer-FT)	val mIoU	0.753
Defect segmentation	pixel-AUROC	0.731
Defect segmentation (end-to-end)	mIoU	0.274
Defect box (GIoU head)	Acc@0.25	0.295
Defect box (GIoU head)	mIoU	0.176
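
The box metrics in the table can be illustrated with a minimal sketch. It assumes boxes are `(x1, y1, x2, y2)` corner tuples, that box mIoU is the mean IoU over predicted/ground-truth pairs, and that Acc@0.25 is the fraction of pairs with IoU of at least 0.25; the source does not spell out these definitions, so treat them as assumptions. GIoU follows the usual definition (IoU minus the fraction of the smallest enclosing box not covered by the union).

```python
def _area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def _intersection_and_union(a, b):
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    return inter, _area(a) + _area(b) - inter


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter, union = _intersection_and_union(a, b)
    return inter / union if union > 0 else 0.0


def giou(a, b):
    """Generalized IoU: IoU penalized by empty space in the enclosing box."""
    inter, union = _intersection_and_union(a, b)
    enclosing = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    enclosing_area = _area(enclosing)
    if union <= 0 or enclosing_area <= 0:
        return 0.0
    return inter / union - (enclosing_area - union) / enclosing_area


def mean_iou(preds, gts):
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def acc_at(preds, gts, threshold=0.25):
    """Fraction of predicted boxes whose IoU with ground truth meets the threshold."""
    return sum(iou(p, g) >= threshold for p, g in zip(preds, gts)) / len(preds)


# Toy boxes: one exact match and one partial overlap.
preds = [(0, 0, 2, 2), (1, 1, 3, 3)]
gts = [(0, 0, 2, 2), (0, 0, 2, 2)]
print(mean_iou(preds, gts), acc_at(preds, gts))
```

Unlike IoU, GIoU stays informative when boxes do not overlap, which is one common reason to train a box head with it.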

Qualitative defect inspection (held-out VisA)

Real held-out VisA images (PCB, candle, cashew, chewing-gum). Each cell: Rainier-VL-2B-Base predicted defect box (red) vs ground-truth box (green), the P(yes) defect verdict, and box IoU. The model separates defective from clean and lands boxes on the defect region — tight on clear surface defects, looser on small/diffuse ones.
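
The source does not say how the P(yes) verdict is computed. One common approach for yes/no prompts is a two-way softmax over the logits of the "yes" and "no" answer tokens, sketched below; the function names, the 0.5 threshold, and the example logits are all assumptions for illustration.

```python
import math


def p_yes(logit_yes, logit_no):
    """Two-way softmax over the 'yes' and 'no' logits, computed stably."""
    shift = max(logit_yes, logit_no)  # subtract the max to avoid overflow
    exp_yes = math.exp(logit_yes - shift)
    exp_no = math.exp(logit_no - shift)
    return exp_yes / (exp_yes + exp_no)


def verdict(probability, threshold=0.5):
    """Map a defect probability to a label using a hypothetical threshold."""
    return "defect" if probability >= threshold else "clean"


# Example logits are made up; a real pipeline would read them from the model output.
prob = p_yes(2.0, 0.0)
print(round(prob, 4), verdict(prob))
```

Because the result is a continuous score rather than a hard label, it can be thresholded for a verdict or ranked across images to compute AUROC.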

Collection including Denali-AI/Rainier-VL-2B-Base

Rainier-VL Model Family (4 items): compact ~2.1B VLM for industrial/product defect inspection, built as a SigLIP + OrthoBernstein + Zamba2 hybrid. Eager + GGUF, base + toy fine-tune.