# Qwen3-0.6B-FP8-KV
Lightweight OCP FP8_e4m3 quant of Qwen3-0.6B with end-to-end KV-cache FP8 support, built with AMD Quark for ROCm.
## Introduction
Qwen3-0.6B-FP8-KV is an OCP-standard FP8_e4m3 quantization of Qwen/Qwen3-0.6B, produced with AMD Quark.
## Quantization Strategy
- Quantizer: AMD Quark v0.9+
- Numeric Format: OCP FP8_e4m3, symmetric per-tensor
- Scope: all `Linear` layers (excl. `lm_head`), plus activations and the KV cache
- Block Size: 128 (OCP-aligned)
- Calibration: 128 samples from the Pile dataset
- Metadata: scales & block info in JSON; weights in SafeTensors
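To make the "symmetric per-tensor" scheme concrete, here is a minimal illustrative sketch (not the Quark implementation): the per-tensor scale maps the tensor's absolute maximum onto the OCP FP8_e4m3 range, whose largest finite value is 448. Rounding to the actual e4m3 grid is omitted for brevity; only scaling and clamping are shown.

```python
# Illustrative sketch of symmetric per-tensor FP8_e4m3 quantization.
# OCP E4M3 has a maximum finite magnitude of 448.
E4M3_MAX = 448.0

def fp8_scale(values):
    """Per-tensor symmetric scale: maps the tensor's amax to the FP8 range."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize_dequantize(values):
    """Round-trip through the scaled FP8 range (scale + clamp only;
    rounding to the e4m3 grid is intentionally omitted)."""
    scale = fp8_scale(values)
    out = []
    for v in values:
        q = max(-E4M3_MAX, min(E4M3_MAX, v / scale))  # clamp to FP8 range
        out.append(q * scale)                          # dequantize
    return out, scale

weights = [0.5, -1.25, 3.0]
dq, scale = quantize_dequantize(weights)
```

Per-tensor symmetric scaling keeps metadata tiny (one scale per tensor), at the cost of sensitivity to outliers relative to per-channel schemes.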
## Performance Snapshot
| Metric | FP16 Baseline | FP8_e4m3 Quantized |
|---|---|---|
| Wikitext2 Perplexity | ~22.1 | ~25.8 |
| Memory Footprint | 1.0× | 0.50× |
| Inference Throughput | 1.0× | 1.3× |
## Evaluation
We measured perplexity on WikiText2:
- FP16 (Qwen3-0.6B): ≈ 22.1 PPL
- FP8_e4m3 (this model): ≈ 25.8 PPL
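For reference, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation set. A minimal sketch of that final step (model loading and scoring are omitted):

```python
import math

def perplexity(token_nlls):
    """Perplexity from a list of per-token negative log-likelihoods (nats):
    exp of the mean NLL. Lower is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# If every token had probability 1/4, the NLL is ln(4) and PPL is exactly 4.
uniform_nlls = [math.log(4.0)] * 3
```

Under this metric, the move from ≈22.1 to ≈25.8 PPL corresponds to a modest increase in average per-token uncertainty in exchange for the memory and throughput gains shown above.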
## License
This model inherits the Qwen3-0.6B license.