Qwen3-0.6B-FP8-KV

Lightweight OCP FP8_e4m3 quantization of Qwen3-0.6B with end-to-end FP8 KV-cache support, built with AMD Quark for ROCm.

Introduction

Qwen3-0.6B-FP8-KV is an OCP-standard FP8_e4m3 quantization of Qwen/Qwen3-0.6B, produced with AMD Quark. Linear-layer weights, activations, and the KV cache are all stored in FP8, roughly halving the memory footprint of the 16-bit baseline while remaining deployable on ROCm hardware.
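
For deployment, a minimal loading sketch with vLLM is shown below. The `quantization` and `kv_cache_dtype` options exist in vLLM's Python API, but whether this particular Quark-produced checkpoint loads through vLLM's fp8 path is an assumption; adjust for your vLLM/ROCm build.

```python
# Hedged usage sketch: serving this checkpoint with vLLM and an FP8 KV cache.
# Assumption: this Quark-produced checkpoint loads via vLLM's "fp8" path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="EliovpAI/Qwen3-0.6B-FP8-KV",
    quantization="fp8",     # assumed weight-loading path for this checkpoint
    kv_cache_dtype="fp8",   # keep the KV cache in FP8_e4m3
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```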

Quantization Strategy

  • Quantizer: AMD Quark v0.9+
  • Numeric Format: OCP FP8_e4m3, symmetric per-tensor
  • Scope: all Linear layers (excluding lm_head), activations, and the KV cache
  • Block Size: 128 (OCP-aligned)
  • Calibration: 128 samples from the Pile dataset
  • Metadata: scales and block info stored in JSON; weights stored in SafeTensors
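
To make the symmetric per-tensor scheme above concrete, here is a minimal pure-PyTorch sketch of FP8_e4m3 round-tripping. It is illustrative only: the released weights were produced with AMD Quark, whose API is not reproduced here, and it requires PyTorch >= 2.1 for the `torch.float8_e4m3fn` dtype.

```python
# Minimal sketch of symmetric per-tensor OCP FP8_e4m3 quantization.
# Illustrative only; the released weights were produced with AMD Quark.
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude in OCP FP8_e4m3

def quantize_fp8_e4m3(w: torch.Tensor):
    """Return (fp8 tensor, scale) using one symmetric scale per tensor."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (w / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, s = quantize_fp8_e4m3(w)
err = (dequantize(q, s) - w).abs().mean().item()
print(f"scale={s.item():.6e}, mean abs error={err:.6f}")
```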

Performance Snapshot

| Metric               | FP16 Baseline | FP8_e4m3 Quantized |
|----------------------|---------------|--------------------|
| WikiText2 Perplexity | ~22.1         | ~25.8              |
| Memory Footprint     | 1.0×          | 0.50×              |
| Inference Throughput | 1.0×          | 1.3×               |
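
The 0.50× memory figure follows directly from storage width: FP8 uses one byte per value versus two bytes for FP16, so the quantized weights and KV cache occupy half the space.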

Evaluation

We measured perplexity on WikiText2:

  • FP16 (Qwen3-0.6B) → 22.1 PPL
  • FP8_e4m3 (this model) → 25.8 PPL
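
The exact evaluation harness behind these numbers is not documented in the card. The sketch below is a standard non-overlapping-window perplexity loop over WikiText2 using Hugging Face transformers and datasets; the model ID, 2048-token window, and test split are assumptions, so results may differ slightly from the figures above.

```python
# Hedged sketch: WikiText2 perplexity with a non-overlapping 2048-token window.
# Assumptions: model ID, window size, and split; not the card's exact harness.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # swap in EliovpAI/Qwen3-0.6B-FP8-KV to compare
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
).eval()

data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tok("\n\n".join(data["text"]), return_tensors="pt").input_ids.to(model.device)

window = 2048
nll_sum, n_tokens = 0.0, 0
for start in range(0, ids.size(1) - 1, window):
    chunk = ids[:, start : start + window]
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL over shifted targets
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print(f"PPL = {math.exp(nll_sum / n_tokens):.2f}")
```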

License

This model inherits the Qwen3-0.6B license.

