Overview
HyperCLOVA X SEED 32B Think is an updated vision-language thinking model that advances the SEED Think 14B line beyond simple scaling, pairing a unified vision-language Transformer backbone with a reasoning-centric training recipe. SEED 32B Think processes text tokens and visual patches within a shared embedding space, supports long-context multimodal understanding up to 128K tokens, and provides an optional “thinking mode” for deep, controllable reasoning. Building on the earlier 14B model, SEED 32B Think further strengthens Korean-centric reasoning and agentic capabilities, improving practical reasoning quality and reliability in real-world use.
Basic Information
- Architecture: Transformer-based dense vision-language model (VLM)
- Parameters: 32B
- Input Format: Text/Image/Video
- Output Format: Text
- Context Length: 128K tokens
- Knowledge Cutoff: May 2025
Benchmarks
- General Knowledge (Korean Text): KoBalt, CLIcK, HAERAE Bench 1.0
- Vision Understanding: ChartVQA, TextVQA, K-MMBench, K-DTCBench
- Agentic Tasks: Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom
Examples
- Solving a 2026 Korean CSAT math problem
- Understanding text layout
Inference
We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.
Capabilities
- Inputs: Text, Image, Video
- Outputs: Text
Requirements
- 4x NVIDIA A100 80GB
- Docker & Docker Compose
- NVIDIA Driver 525+, CUDA 12.1+
Installation
```bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~60GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Think-32B \
  --local-dir ./models/HyperCLOVAX-SEED-Think-32B

# Convert model to component format
python convert_model.py \
  --input ./models/HyperCLOVAX-SEED-Think-32B \
  --output ./track_a \
  --track a

# Configure environment
cp .env.example .env
# Edit .env:
# VLM_MODEL_PATH=./track_a/llm/HyperCLOVAX-SEED-Think-32B
# VLM_ENCODER_VISION_MODEL_PATH=./track_a/ve/HyperCLOVAX-SEED-Think-32B

# Build and run
docker compose --profile track-a build
docker compose --profile track-a up -d

# Wait for model loading (~5 minutes)
docker compose logs -f vlm
```
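Once the containers are up, you can probe readiness from Python instead of tailing logs. This is a sketch under an assumption: most OpenAI-compatible servers expose a `GET /models` route, but this card does not confirm that OmniServe does.

```python
# Hypothetical readiness probe. Assumes OmniServe mirrors the standard
# OpenAI-compatible GET /models route under the /a/v1 prefix used below;
# this is not documented above, so verify against your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/a/v1", api_key="not-needed")
print(client.models.list())  # succeeds once the model has finished loading
```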
Basic Usage
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/a/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
print(response.choices[0].message.content)
```
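If you want tokens as they are generated, the standard OpenAI SDK streaming interface is a natural fit. A minimal sketch, assuming OmniServe honors `stream=True` with the usual chunked delta protocol (not confirmed by this card):

```python
# Streaming variant of the request above; stream=True support is assumed,
# not confirmed for OmniServe.
stream = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Summarize this model's capabilities."}],
    max_tokens=256,
    stream=True,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
for chunk in stream:
    # Each chunk carries an incremental delta; guard against empty chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```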
Reasoning Mode
Enable chain-of-thought reasoning for complex tasks:
```python
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {"role": "user", "content": "Solve step by step: 3x + 7 = 22"}
    ],
    max_tokens=1024,
    extra_body={
        "thinking_token_budget": 500,
        "chat_template_kwargs": {"thinking": True}
    }
)
# Response includes <think>...</think> with reasoning process
print(response.choices[0].message.content)
```
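Because the reasoning arrives inline with the answer, downstream code often wants them separated. A minimal parsing sketch, assuming a single `<think>...</think>` block appears verbatim before the final answer in `message.content` (the exact placement is an assumption):

```python
import re

def split_thinking(content: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer); assumes one
    <think>...</think> block precedes the final answer."""
    match = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    if match is None:
        return "", content.strip()
    return match.group(1).strip(), content[match.end():].strip()

reasoning, answer = split_thinking(response.choices[0].message.content)
print("Answer:", answer)
```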
More Examples
Video Understanding
```python
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                # Videos are passed via the image_url content type, as shown in this card
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
```
Base64 Image Input
```python
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
```
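For local files of mixed formats, hardcoding `image/png` is brittle. Below is a small standard-library helper that guesses the MIME type from the extension; the helper name is illustrative, not part of OmniServe:

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL; falls back to PNG
    when the extension is unrecognized. Illustrative helper only."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    return f"data:{mime or 'image/png'};base64,{encoded}"

# Drop-in replacement for the literal data URL above:
# {"type": "image_url", "image_url": {"url": to_data_url("image.png")}}
```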
Using curl
Note that `extra_body` is client-side sugar in the Python SDK: its contents are merged into the top level of the JSON request body. A raw request therefore passes `chat_template_kwargs` directly:

```bash
curl -X POST http://localhost:8000/a/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_a_model",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    "max_tokens": 512,
    "chat_template_kwargs": {"thinking": false}
  }'
```
Model Capabilities
| Input | Output |
|---|---|
| Text | Text |
| Image | Text |
| Video | Text |
| Image + Text | Text |
| Video + Text | Text |
Features:
- Reasoning mode with `<think>...</think>` output
- Multi-turn conversation support (see the sketch below)
- Image/Video understanding
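Multi-turn use follows the standard chat-completions pattern of resending prior turns; a minimal sketch, reusing the client from Basic Usage:

```python
# Multi-turn sketch: context is carried by resending the full history.
history = [{"role": "user", "content": "What is the capital of Korea?"}]
first = client.chat.completions.create(
    model="track_a_model",
    messages=history,
    max_tokens=128,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
history.append({"role": "assistant", "content": first.choices[0].message.content})
history.append({"role": "user", "content": "What is its approximate population?"})
second = client.chat.completions.create(
    model="track_a_model",
    messages=history,
    max_tokens=128,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(second.choices[0].message.content)
```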
Architecture
```
User Request
(Image/Video/Text)
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ OmniServe │
│ POST /a/v1/chat/completions │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ [1] INPUT ENCODING │ │
│ │ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ Vision Encoder │ │ │
│ │ └────────┬────────┘ │ │
│ │ │ embeddings │ │
│ └────────────────────────────┼─────────────────────────────────────┘ │
│ ▼ │
│ ┌──────────────┐ │
│ │ LLM (32B) │◀──── text │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ Text Response │
│ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
Response
(Text)
```
Hardware Requirements
| Component | GPU | VRAM |
|---|---|---|
| Vision Encoder | 1x | ~8GB |
| LLM (32B) | 2x | ~60GB |
| Total | 3x | ~68GB |
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| `chat_template_kwargs.thinking` | Enable reasoning | `false` |
| `thinking_token_budget` | Max reasoning tokens | 500 |
| `max_tokens` | Max output tokens | - |
| `temperature` | Sampling temperature | 0.7 |
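Putting the knobs together: a request that enables thinking under a tighter budget with a lower temperature. Parameter names come from the table above; the budget of 300 is an arbitrary illustration, and the defaults shown are taken from this card, not verified against the server:

```python
# Combines the parameters from the table above; 300 is an arbitrary budget.
response = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Prove that the sum of two even integers is even."}],
    max_tokens=1024,
    temperature=0.2,
    extra_body={
        "thinking_token_budget": 300,
        "chat_template_kwargs": {"thinking": True},
    },
)
print(response.choices[0].message.content)
```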
For more details, see the OmniServe documentation.
Citation
TBU (Technical Report)
Questions
For any other questions, please feel free to contact us at [email protected].
License
The model is licensed under the HyperCLOVA X SEED 32B Think Model License Agreement.