Lower evaluation results
Dear Authors,
Thank you for your contribution to this research direction. I'm currently trying to reproduce the GSM8K results reported for Ouro 1.4B R4 and Ouro 2.6B R4, but I'm encountering some difficulties.
I ran the following evaluation code:
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ByteDance/Ouro-1.4B,trust_remote_code=True,dtype=float32",
    tasks=["gsm8k_cot"],
    num_fewshot=3,
    batch_size=1,
    limit=50,
    device="cuda:0",
)
With this setup, I obtain ~0.5 accuracy for Ouro 1.4B and ~0.6 for Ouro 2.6B. May I ask whether there is anything incorrect in my configuration, or whether I am missing any additional steps required to replicate the reported results?
Thank you for your time and guidance.
Hi @MianchuWang,
It may be too late for you, but for future reference: the main issue in your config is limit=50. Evaluating on only 50 samples introduces high variance. You need to remove the limit and run on the full dataset to get stable results.
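To put a rough number on the variance point, here is a small back-of-the-envelope sketch. It assumes each sample is an independent draw (a simplification, since a limited run is not a random subset), uses the ~0.5 accuracy from the question, and takes 1319 as the size of the GSM8K test split. The helper name `accuracy_stderr` is made up for this illustration.

```python
import math


def accuracy_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# 0.5 is the accuracy reported in the question; 1319 is the GSM8K test split size.
for n in (50, 1319):
    se = accuracy_stderr(0.5, n)
    print(f"n={n:4d}: stderr={se:.3f}, ~95% interval = 0.5 +/- {1.96 * se:.3f}")
```

Under these assumptions, a 50-sample run carries an uncertainty of roughly 14 points either way, while the full split narrows that to about 3 points.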
Additionally, make sure no chat template is applied to the prompts (as you already did), and report exact match under the flexible-extract filter.
With the full dataset and raw text formatting, I can reproduce all paper results with both vLLM and HF backends using the standard lm_eval settings.
Versions:
- vllm: 0.16.0
- transformers: 4.57.6
- lm-eval: 0.4.11
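For anyone landing here later, the adjustments above might look roughly like the sketch below. It reuses the configuration from the original post with `limit` removed. The function names are made up for this example, the `"exact_match,flexible-extract"` key assumes the results layout used by lm-eval 0.4.x, and `num_fewshot=3` is carried over from the question rather than confirmed as the paper's setting.

```python
def flexible_extract_em(results: dict, task: str = "gsm8k_cot") -> float:
    """Read exact match under the flexible-extract filter from an lm_eval results dict.

    Assumes the "<metric>,<filter>" key layout used by lm-eval 0.4.x.
    """
    return results["results"][task]["exact_match,flexible-extract"]


def run_full_gsm8k(model_name: str = "ByteDance/Ouro-1.4B") -> float:
    """Hypothetical wrapper: evaluate on the full test split with raw-text prompts."""
    import lm_eval  # imported lazily so the helper above works without lm_eval installed

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_name},trust_remote_code=True,dtype=float32",
        tasks=["gsm8k_cot"],
        num_fewshot=3,  # carried over from the original post, not confirmed by the reply
        batch_size=1,
        device="cuda:0",
        apply_chat_template=False,  # raw text prompts, no chat template
        # no `limit` argument: evaluate on the full test split
    )
    return flexible_extract_em(results)
```

Calling `run_full_gsm8k()` needs a GPU and downloads the model, so expect it to take a while on the full split.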