Fix UniversalTransformerCache.get_mask_sizes for batched generation

by KristianS7 - opened Mar 9

base: refs/heads/main ← from: refs/pr/4

Files changed: +12 -0

Problem

Batched generation (batch_size > 1) produces corrupted output for all sequences except the longest (unpadded) one in the batch.

UniversalTransformerCache inherits Cache.get_mask_sizes, which falls back to returning (cache_position.shape[0], 0) when layer_idx >= len(self.layers).

Since UniversalTransformerCache manages its own flat key_cache/value_cache lists and keeps self.layers empty ([]), this fallback always fires:

Prefill: works correctly (cache_position spans full input length)
Autoregressive decoding: fails — cache_position has length 1, so the 4D attention mask is built for kv_length=1 instead of cached_length + 1
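
The arithmetic behind the two cases above can be sketched in a few lines. This is only an illustration of the sizes involved, not the library code: the helper names fallback_mask_sizes and expected_mask_sizes and the prompt length of 12 are assumptions made for the example.

```python
def fallback_mask_sizes(query_length: int) -> tuple[int, int]:
    """Mimics the fallback described above: kv_length is just the query length."""
    return query_length, 0


def expected_mask_sizes(cached_length: int, query_length: int) -> tuple[int, int]:
    """What the mask needs to cover: cached tokens plus the new query tokens."""
    return cached_length + query_length, 0


prompt_length = 12  # hypothetical padded prompt length

# Prefill: nothing is cached yet and the query spans the whole prompt.
prefill_fallback = fallback_mask_sizes(prompt_length)
prefill_expected = expected_mask_sizes(0, prompt_length)

# First decoding step: the prompt is cached and the query is a single token.
decode_fallback = fallback_mask_sizes(1)
decode_expected = expected_mask_sizes(prompt_length, 1)

print(prefill_fallback, prefill_expected)  # (12, 0) (12, 0)
print(decode_fallback, decode_expected)    # (1, 0) (13, 0)
```

The two results agree during prefill and diverge as soon as decoding starts, which is why the problem only shows up after the first generated token.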

This undersized mask is then broadcast across the full KV cache, which loses the per-position padding information and corrupts every padded sequence in the batch.
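
A toy model of the broadcast may make the loss of padding information easier to see. It is a simplified sketch: it uses plain 0/1 lists instead of the real 4D mask tensor, and the batch contents and the helpers decode_step_mask and broadcast_to are made up for illustration.

```python
# Hypothetical left-padded batch after prefill: 1 = real token, 0 = padding.
padding_mask = [
    [0, 0, 0, 1, 1],  # short prompt, three pad slots on the left
    [1, 1, 1, 1, 1],  # longest prompt, no padding
]


def decode_step_mask(padding_mask, kv_length):
    """Key visibility for one new query token, limited to kv_length columns.

    The new token is appended as a real position, then each row is cut to its
    last kv_length columns, which is all a mask of that size can describe.
    """
    full = [row + [1] for row in padding_mask]
    return [row[-kv_length:] for row in full]


def broadcast_to(mask, width):
    """Stretch a width-1 mask across `width` key positions, like broadcasting."""
    return [row * width if len(row) == 1 else row for row in mask]


cached_length = len(padding_mask[0])

correct = decode_step_mask(padding_mask, cached_length + 1)
undersized = broadcast_to(decode_step_mask(padding_mask, 1), cached_length + 1)

print(correct)     # [[0, 0, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1]]
print(undersized)  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]
```

In this toy setup the unpadded row is identical in both masks, while the padded row ends up attending to its pad slots, consistent with only the padded sequences being corrupted.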

The failure manifests differently depending on the attention implementation:

"eager": garbled whitespace output
"sdpa": RuntimeError: (*bias): last dimension must be contiguous
"flash_attention_2": works fine (ignores 4D mask)

Fix

Override get_mask_sizes to return the correct (seq_length + query_length, 0), matching the semantics of DynamicCacheLayer.get_mask_sizes.
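
For readers who want to see the shape of such an override, here is a minimal sketch. It is not the diff in this PR: FlatCacheSketch is a hypothetical stand-in class, it stores token counts instead of key tensors, and a SimpleNamespace with a shape attribute stands in for the cache_position tensor so the example runs without torch.

```python
from types import SimpleNamespace


class FlatCacheSketch:
    """Hypothetical stand-in for a cache that keeps flat per-layer lists."""

    def __init__(self):
        self.layers = []     # stays empty, as described above
        self.key_cache = []  # simplification: one cached-token count per slot

    def get_seq_length(self, layer_idx: int = 0) -> int:
        # Assumption: nothing cached for this slot means a length of 0.
        if layer_idx >= len(self.key_cache):
            return 0
        return self.key_cache[layer_idx]

    def get_mask_sizes(self, cache_position, layer_idx: int) -> tuple[int, int]:
        # kv_length = tokens already cached + tokens in the current query.
        query_length = cache_position.shape[0]
        kv_length = self.get_seq_length(layer_idx) + query_length
        return kv_length, 0  # kv_offset is 0: the mask starts at the first key


cache = FlatCacheSketch()

# Prefill with 12 tokens: nothing is cached yet.
prefill_sizes = cache.get_mask_sizes(SimpleNamespace(shape=(12,)), layer_idx=0)
cache.key_cache.append(12)

# One decoding step: 12 cached tokens plus 1 new query token.
decode_sizes = cache.get_mask_sizes(SimpleNamespace(shape=(1,)), layer_idx=0)

print(prefill_sizes, decode_sizes)  # (12, 0) (13, 0)
```

The point of the sketch is that the result no longer depends on self.layers, so the decoding step reports the full key length rather than 1.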

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ByteDance/Ouro-1.4B"  # or Ouro-2.6B
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto",
    attn_implementation="eager",  # "sdpa" raises the RuntimeError instead
)
# Pad on the left so every prompt ends at the same position before generation.
tokenizer.padding_side = "left"
tokenizer.pad_token_id = 0

# Two prompts of different lengths, so the shorter one gets padded.
prompts = ["What is 2+2?", "Explain why the sky is blue in one sentence."]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=64, do_sample=False, eos_token_id=2, pad_token_id=0)

# Without the fix, the padded (shorter) prompt decodes to garbled output.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
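
One way to check the result of a reproduction like this is to compare each batched continuation with the continuation obtained by running the same prompt alone. The helpers below are a hypothetical sketch operating on plain lists of token ids; the names continuation and rows_that_differ and the sample ids are made up, and exact equality is an assumption that may not hold in bfloat16 even for a correct model.

```python
def continuation(output_ids, prompt_width, eos_token_id=2):
    """Return the generated ids after the prompt, cut at the first EOS."""
    new_ids = list(output_ids[prompt_width:])
    if eos_token_id in new_ids:
        new_ids = new_ids[: new_ids.index(eos_token_id)]
    return new_ids


def rows_that_differ(batched, batched_width, singles, single_widths):
    """Indices where the batched continuation differs from the single-prompt one."""
    return [
        i
        for i, (row, single, width) in enumerate(zip(batched, singles, single_widths))
        if continuation(row, batched_width) != continuation(single, width)
    ]


# Made-up token ids: 0 = pad, 2 = EOS. Row 0 is the short, left-padded prompt.
batched = [[0, 0, 5, 6, 9, 9, 9, 2], [3, 4, 5, 6, 7, 8, 2, 0]]
singles = [[5, 6, 7, 8, 2], [3, 4, 5, 6, 7, 8, 2]]

differing = rows_that_differ(batched, 4, singles, [2, 4])
print(differing)  # [0]
```

With real tensors, the lists could come from outputs.tolist() and the batched prompt width from batch["input_ids"].shape[1].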

See also: the same fix was merged for the Thinking variant at https://huggingface.co/ByteDance/Ouro-1.4B-Thinking/discussions/5

Commit 9ff42b9a: Fix UniversalTransformerCache.get_mask_sizes for batched generation
