Model Overview

  • Model_Architecture: DeepSeek V3
    • Input: Text
    • Output: Text
  • Supported_Hardware_Microarchitecture: AMD MI350/MI355
  • ROCm: "7.0"
  • Operating Systems: Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark
  • Quantization:
    • Weight:
      • Type: OCP MXFP4
      • Mode: Static
    • Activation:
      • Type: OCP MXFP4
      • Mode: Dynamic
    • KV_Cache:
      • Type: OCP FP8
      • Mode: Static
  • Calibration_Dataset: Pile

This model was built from DeepSeek-V3 by applying AMD-Quark for MXFP4 quantization.

Model Quantization

The model was quantized from unsloth/DeepSeek-V3-0324-BF16 using AMD-Quark.
Weights and activations were quantized to MXFP4, and KV caches were quantized to FP8.
The AutoSmoothQuant algorithm was applied to enhance accuracy during quantization.
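
For intuition, below is a minimal sketch (in NumPy, not AMD-Quark's actual implementation) of how a single OCP MX block is quantized to MXFP4: 32 elements share one power-of-two (E8M0) scale, and each element is rounded to the nearest FP4 (E2M1) value. The function names and the round-to-nearest policy are illustrative assumptions.

import numpy as np

# Representable magnitudes of FP4 E2M1; signs are handled separately.
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block: np.ndarray):
    """Quantize one 32-element block to MXFP4 (simplified round-to-nearest)."""
    assert block.size == 32, "OCP MX formats use 32-element blocks"
    # Shared E8M0 scale: a power of two chosen so the largest magnitude lands
    # near the top of the FP4 range (largest FP4 normal is 6.0 = 1.5 * 2^2).
    amax = np.abs(block).max()
    shared_exp = int(np.floor(np.log2(amax))) - 2 if amax > 0 else 0
    scale = 2.0 ** shared_exp
    # Round each scaled element to the nearest representable FP4 magnitude.
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID[None, :]).argmin(axis=1)
    dequantized = np.sign(scaled) * FP4_E2M1_GRID[idx] * scale
    return shared_exp, dequantized

rng = np.random.default_rng(0)
w = rng.standard_normal(32).astype(np.float32)
exp, w_hat = quantize_mxfp4_block(w)
print("shared exponent:", exp, "max abs error:", float(np.abs(w - w_hat).max()))

In the static weight mode this scale is fixed at quantization time, while the dynamic activation mode computes it per block at runtime.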

Quantization Scripts

cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py \
    --model_dir "/deepseek-ai/DeepSeek-V3-0324-BF16/" \
    --quant_scheme "w_mxfp4_a_mxfp4" \
    --quant_algo_config_file "llm_ptq/models/deepseekv2v3/autosmoothquant_config.json" \
    --num_calib_data 128 \
    --exclude_layers "$exclude_layers" \
    --multi_gpu true \
    --quant_algo "autosmoothquant" \
    --model_export "hf_format" \
    --output_dir "$output_dir"
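
As a quick sanity check after export (a hedged sketch; the path is hypothetical and this assumes the hf_format export writes a quantization_config entry into config.json), the exported quantization settings can be inspected with transformers:

from transformers import AutoConfig

# "$output_dir" from the command above; replace with the actual export path.
config = AutoConfig.from_pretrained("path/to/output_dir", trust_remote_code=True)
print(getattr(config, "quantization_config", "no quantization_config found"))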

Deployment

  • Backend: vLLM
  • Description: This model can be deployed efficiently using the vLLM backend.
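
A minimal offline-inference sketch with vLLM's Python API follows; the prompt is illustrative, and tensor_parallel_size=8, kv_cache_dtype="fp8", and gpu_memory_utilization=0.85 mirror the evaluation settings below rather than being required values.

from vllm import LLM, SamplingParams

# Offline-inference sketch; adjust parallelism and memory settings to the
# available MI350/MI355 GPUs.
llm = LLM(
    model="amd/DeepSeek-V3-0324-WMXFP4-AMXFP4-MoE-Quant-ASQ",
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",            # matches the FP8 KV-cache quantization
    gpu_memory_utilization=0.85,
)
outputs = llm.generate(
    ["Explain MXFP4 quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)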

Evaluation

  • Tasks:
    • Wikitext
    • GSM8K
  • Framework: lm-evaluation-harness
  • Engine: vLLM

Accuracy

  • wikitext-ppl: 3.33074593544006

Reproduction Command

Wikitext:  
  lm_eval \
    --model vllm \
    --model_args pretrained="amd/DeepSeek-V3-0324-WMXFP4-AMXFP4-MoE-Quant-ASQ",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks wikitext \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 5 \
    --batch_size auto

GSM8K:
  lm_eval \
    --model vllm \
    --model_args pretrained="amd/DeepSeek-V3-0324-WMXFP4-AMXFP4-MoE-Quant-ASQ",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks gsm8k_llama \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 8 \
    --batch_size auto

License

Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.