# Skywork-R1V-38B-AWQ
This is the AWQ quantized version of Skywork-R1V-38B, offering improved inference efficiency while maintaining model quality.
## Model Description
Skywork R1V is a pioneering multimodal model with advanced reasoning capabilities through Chain-of-Thought. This quantized version maintains the core strengths of the original model while reducing computational requirements.
For detailed information about the model architecture and capabilities, please refer to the original [Skywork-R1V repository](https://github.com/SkyworkAI/Skywork-R1V) and technical report.
## Benchmark Results
The AWQ quantized model maintains strong performance across key benchmarks:
| Benchmark | Score |
|---|---|
| MMMU | 0.6 |
| MathV | 0.59 |
| AIME_2024 | 0.6 |
| MATH_500 | 0.83 |
These results demonstrate that the quantized model preserves the mathematical and multimodal reasoning capabilities of the original model.
## Usage
You can use the quantized model with different inference frameworks:
### Using vLLM
#### Python API
```python
from vllm import LLM, SamplingParams

model_name = "Skywork/Skywork-R1V-38B-AWQ"  # or local path

llm = LLM(model_name,
          dtype="float16",
          quantization="awq",
          gpu_memory_utilization=0.85,
          max_model_len=4096,
          trust_remote_code=True,
          )

# Minimal text-only generation (illustrative); adapt the prompt format
# and inputs for multimodal use.
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["What is 1 + 1?"], sampling_params)
print(outputs[0].outputs[0].text)
```
#### OpenAI-compatible API Server
```bash
MODEL_ID="Skywork/Skywork-R1V-38B-AWQ"  # or local path

CUDA_VISIBLE_DEVICES=0 \
python -m vllm.entrypoints.openai.api_server \
  --model $MODEL_ID \
  --dtype float16 \
  --quantization awq \
  --port 23334 \
  --max-model-len 12000 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code
```
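Once the server is running, it can be queried with any OpenAI-compatible client. Below is a minimal sketch using the official `openai` Python package; the port and model name match the launch command above, and the placeholder API key is arbitrary since vLLM does not require one by default:

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:23334/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Skywork/Skywork-R1V-38B-AWQ",
    messages=[{"role": "user", "content": "Solve: what is 15% of 240?"}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```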
### Using LMDeploy
```python
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model_path = "Skywork/Skywork-R1V-38B-AWQ"  # or local path

# Reserve 75% of free GPU memory for the KV cache.
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.75)
chat_template_config = ChatTemplateConfig(model_name=model_path)

pipe = pipeline(model_path,
                backend_config=engine_config,
                chat_template_config=chat_template_config,
                )

# Example: multimodal inference
image = load_image('table.jpg')
response = pipe(('Describe this image.', image))
print(response.text)
```
## Hardware Requirements
The AWQ quantization reduces the memory footprint compared to the original FP16 model. We recommend:
- At least one GPU with 30GB+ VRAM for inference
- For optimal performance with longer contexts, 40GB+ VRAM is recommended
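As a rough sanity check on these recommendations, the back-of-the-envelope arithmetic below assumes ~4-bit AWQ weights; actual usage also depends on context length, batch size, and the vision encoder:

```python
# Back-of-the-envelope VRAM estimate for the quantized weights alone.
params = 38e9            # ~38B parameters
bytes_per_param = 0.5    # ~4-bit AWQ weights
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB for weights")  # ~17.7 GiB

# KV cache, activations, and the vision tower add several more GiB,
# which is why 30GB+ of VRAM is the practical floor.
```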
## Citation
If you use this model in your research, please cite:
```bibtex
@article{skywork2025r1v,
  title   = {Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought},
  author  = {Yi Peng and Chris and Xiaokun Wang and Yichen Wei and Jiangbo Pei and Weijie Qiu and Ai Jian and Yunzhuo Hao and Jiachun Pan and Tianyidan Xie and Li Ge and Rongxian Zhuang and Xuchen Song and Yang Liu and Yahui Zhou},
  year    = {2025},
  journal = {https://github.com/SkyworkAI/Skywork-R1V/blob/main/report/Skywork_R1V.pdf},
  url     = {https://huggingface.co/Skywork/Skywork-R1V-38B}
}
```