Overview
HyperCLOVA X SEED 32B Think is an updated vision-language thinking model that advances the SEED Think 14B line beyond simple scaling, pairing a unified vision-language Transformer backbone with a reasoning-centric training recipe. SEED 32B Think processes text tokens and visual patches within a shared embedding space, supports long-context multimodal understanding up to 128K tokens, and provides an optional “thinking mode” for deep, controllable reasoning. Building on the earlier 14B model, SEED 32B Think further strengthens Korean-centric reasoning and agentic capabilities, improving practical reasoning quality and reliability in real-world use.
Basic Information
- Architecture: Transformer-based dense vision-language model (VLM)
- Parameters: 32B
- Input Format: Text/Image/Video
- Output Format: Text
- Context Length: 128K tokens
- Knowledge Cutoff: May 2025
Benchmarks
- General Knowledge (Korean Text): KoBalt, CLIcK, HAERAE Bench 1.0
- Vision Understanding: ChartVQA, TextVQA, K-MMBench, K-DTCBench
- Agentic Tasks: Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom
Examples
- Solving a 2026 Korean CSAT math problem
- Understanding text layout
Inference
We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.
Capabilities
- Inputs: Text, Image, Video
- Outputs: Text
Requirements
- 4x NVIDIA A100 80GB
- Docker & Docker Compose
- NVIDIA Driver 525+, CUDA 12.1+
Installation
```bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~60GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Think-32B \
  --local-dir ./models/HyperCLOVAX-SEED-Think-32B

# Convert model to component format
python convert_model.py \
  --input ./models/HyperCLOVAX-SEED-Think-32B \
  --output ./track_a \
  --track a

# Configure environment
cp .env.example .env
# Edit .env:
# VLM_MODEL_PATH=./track_a/llm/HyperCLOVAX-SEED-Think-32B
# VLM_ENCODER_VISION_MODEL_PATH=./track_a/ve/HyperCLOVAX-SEED-Think-32B

# Build and run
docker compose --profile track-a build
docker compose --profile track-a up -d

# Wait for model loading (~5 minutes)
docker compose logs -f vlm
```
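Once the containers are up, you can probe readiness from Python instead of tailing logs. This is a sketch under an assumption: most OpenAI-compatible servers expose a `GET /models` route, but this card does not confirm that OmniServe does.

```python
# Hypothetical readiness probe. Assumes OmniServe mirrors the standard
# OpenAI-compatible GET /models route under the /a/v1 prefix used below;
# this is not documented above, so verify against your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/a/v1", api_key="not-needed")
print(client.models.list())  # succeeds once the model has finished loading
```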
Basic Usage
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/a/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
print(response.choices[0].message.content)
```
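If you want tokens as they are generated, the standard OpenAI SDK streaming interface is a natural fit. A minimal sketch, assuming OmniServe honors `stream=True` with the usual chunked delta protocol (not confirmed by this card):

```python
# Streaming variant of the request above; stream=True support is assumed,
# not confirmed for OmniServe.
stream = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Summarize this model's capabilities."}],
    max_tokens=256,
    stream=True,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
for chunk in stream:
    # Each chunk carries an incremental delta; guard against empty chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```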
Reasoning Mode
Enable chain-of-thought reasoning for complex tasks:
```python
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {"role": "user", "content": "Solve step by step: 3x + 7 = 22"}
    ],
    max_tokens=1024,
    extra_body={
        "thinking_token_budget": 500,
        "chat_template_kwargs": {"thinking": True}
    }
)
# Response includes <think>...</think> with reasoning process
print(response.choices[0].message.content)
```
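Because the reasoning arrives inline with the answer, downstream code often wants them separated. A minimal parsing sketch, assuming a single `<think>...</think>` block appears verbatim before the final answer in `message.content` (the exact placement is an assumption):

```python
import re

def split_thinking(content: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer); assumes one
    <think>...</think> block precedes the final answer."""
    match = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    if match is None:
        return "", content.strip()
    return match.group(1).strip(), content[match.end():].strip()

reasoning, answer = split_thinking(response.choices[0].message.content)
print("Answer:", answer)
```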
More Examples
Video Understanding
```python
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                # Videos are passed via the image_url content type, as shown in this card
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
```
Base64 Image Input
```python
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
```
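For local files of mixed formats, hardcoding `image/png` is brittle. Below is a small standard-library helper that guesses the MIME type from the extension; the helper name is illustrative, not part of OmniServe:

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL; falls back to PNG
    when the extension is unrecognized. Illustrative helper only."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    return f"data:{mime or 'image/png'};base64,{encoded}"

# Drop-in replacement for the literal data URL above:
# {"type": "image_url", "image_url": {"url": to_data_url("image.png")}}
```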
Using curl
Note that `extra_body` is client-side sugar in the Python SDK: its contents are merged into the top level of the JSON request body. A raw request therefore passes `chat_template_kwargs` directly:

```bash
curl -X POST http://localhost:8000/a/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_a_model",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    "max_tokens": 512,
    "chat_template_kwargs": {"thinking": false}
  }'
```
Model Capabilities
| Input | Output |
|---|---|
| Text | Text |
| Image | Text |
| Video | Text |
| Image + Text | Text |
| Video + Text | Text |
Features:
- Reasoning mode with `<think>...</think>` output
- Multi-turn conversation support (see the sketch below)
- Image/Video understanding
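Multi-turn use follows the standard chat-completions pattern of resending prior turns; a minimal sketch, reusing the client from Basic Usage:

```python
# Multi-turn sketch: context is carried by resending the full history.
history = [{"role": "user", "content": "What is the capital of Korea?"}]
first = client.chat.completions.create(
    model="track_a_model",
    messages=history,
    max_tokens=128,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
history.append({"role": "assistant", "content": first.choices[0].message.content})
history.append({"role": "user", "content": "What is its approximate population?"})
second = client.chat.completions.create(
    model="track_a_model",
    messages=history,
    max_tokens=128,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(second.choices[0].message.content)
```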
Architecture
```
User Request
(Image/Video/Text)
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ OmniServe │
│ POST /a/v1/chat/completions │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ [1] INPUT ENCODING │ │
│ │ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ Vision Encoder │ │ │
│ │ └────────┬────────┘ │ │
│ │ │ embeddings │ │
│ └────────────────────────────┼─────────────────────────────────────┘ │
│ ▼ │
│ ┌──────────────┐ │
│ │ LLM (32B) │◀──── text │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ Text Response │
│ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
Response
(Text)
```
Hardware Requirements
| Component | GPU | VRAM |
|---|---|---|
| Vision Encoder | 1x | ~8GB |
| LLM (32B) | 2x | ~60GB |
| Total | 3x | ~68GB |
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| `chat_template_kwargs.thinking` | Enable reasoning | `false` |
| `thinking_token_budget` | Max reasoning tokens | 500 |
| `max_tokens` | Max output tokens | - |
| `temperature` | Sampling temperature | 0.7 |
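Putting the knobs together: a request that enables thinking under a tighter budget with a lower temperature. Parameter names come from the table above; the budget of 300 is an arbitrary illustration, and the defaults shown are taken from this card, not verified against the server:

```python
# Combines the parameters from the table above; 300 is an arbitrary budget.
response = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Prove that the sum of two even integers is even."}],
    max_tokens=1024,
    temperature=0.2,
    extra_body={
        "thinking_token_budget": 300,
        "chat_template_kwargs": {"thinking": True},
    },
)
print(response.choices[0].message.content)
```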
For more details, see the OmniServe documentation.
Citation
TBU (Technical Report)
Questions
For any other questions, please feel free to contact us at [email protected].
License
The model is licensed under the HyperCLOVA X SEED 32B Think Model License Agreement.