# Qwen3-0.6B-FP8-KV
Lightweight OCP FP8_e4m3 quant of Qwen3-0.6B with end-to-end KV-cache FP8 support, built with AMD Quark for ROCm.
## Introduction
Qwen3-0.6B-FP8-KV is an OCP-standard FP8_e4m3 quantization of Qwen/Qwen3-0.6B, produced with AMD Quark.
## Quantization Strategy
- Quantizer: AMD Quark v0.9+
- Numeric Format: OCP FP8_e4m3, symmetric per-tensor
- Scope: all `Linear` layers (excl. `lm_head`), plus activations and the KV cache
- Block Size: 128 (OCP-aligned)
- Calibration: 128 samples from the Pile dataset
- Metadata: scales & block info in JSON; weights in SafeTensors
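To make the "symmetric per-tensor" scheme concrete, here is a minimal illustrative sketch (not the Quark implementation): the per-tensor scale maps the tensor's absolute maximum onto the OCP FP8_e4m3 range, whose largest finite value is 448. Rounding to the actual e4m3 grid is omitted for brevity; only scaling and clamping are shown.

```python
# Illustrative sketch of symmetric per-tensor FP8_e4m3 quantization.
# OCP E4M3 has a maximum finite magnitude of 448.
E4M3_MAX = 448.0

def fp8_scale(values):
    """Per-tensor symmetric scale: maps the tensor's amax to the FP8 range."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize_dequantize(values):
    """Round-trip through the scaled FP8 range (scale + clamp only;
    rounding to the e4m3 grid is intentionally omitted)."""
    scale = fp8_scale(values)
    out = []
    for v in values:
        q = max(-E4M3_MAX, min(E4M3_MAX, v / scale))  # clamp to FP8 range
        out.append(q * scale)                          # dequantize
    return out, scale

weights = [0.5, -1.25, 3.0]
dq, scale = quantize_dequantize(weights)
```

Per-tensor symmetric scaling keeps metadata tiny (one scale per tensor), at the cost of sensitivity to outliers relative to per-channel schemes.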
## Performance Snapshot
| Metric | FP16 Baseline | FP8_e4m3 Quantized |
|---|---|---|
| Wikitext2 Perplexity | ~22.1 | ~25.8 |
| Memory Footprint | 1.0× | 0.50× |
| Inference Throughput | 1.0× | 1.3× |
## Evaluation
We measured perplexity on WikiText2:
- FP16 (Qwen3-0.6B): ≈ 22.1 PPL
- FP8_e4m3 (this model): ≈ 25.8 PPL
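For reference, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation set. A minimal sketch of that final step (model loading and scoring are omitted):

```python
import math

def perplexity(token_nlls):
    """Perplexity from a list of per-token negative log-likelihoods (nats):
    exp of the mean NLL. Lower is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# If every token had probability 1/4, the NLL is ln(4) and PPL is exactly 4.
uniform_nlls = [math.log(4.0)] * 3
```

Under this metric, the move from ≈22.1 to ≈25.8 PPL corresponds to a modest increase in average per-token uncertainty in exchange for the memory and throughput gains shown above.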
## License
This model inherits the Qwen3-0.6B license.