Update README.md
The model is quantized with nvidia-modelopt **v0.15.1** <br>
**Test Hardware:** H100 <br>

## Post Training Quantization
This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B to the FP8 data type, ready for inference with TensorRT-LLM and vLLM. Only the weights and activations of the linear operators within the transformer blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, cutting the disk size and GPU memory requirements by approximately 50%. On H100, we achieved a 1.3x speedup.
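As a back-of-the-envelope check on the size claim (a minimal sketch: the 8B parameter count is taken from the model name, and the small overhead from tensors that stay in higher precision is ignored):

```python
# Rough FP16 -> FP8 weight-size estimate for an 8B-parameter model.
# Assumes every quantized parameter drops from 16 bits (2 bytes) to
# 8 bits (1 byte); non-quantized tensors are ignored for simplicity.
params = 8_000_000_000

fp16_bytes = params * 2  # 2 bytes per parameter
fp8_bytes = params * 1   # 1 byte per parameter

print(f"FP16 weights: {fp16_bytes / 1e9:.0f} GB")            # 16 GB
print(f"FP8 weights:  {fp8_bytes / 1e9:.0f} GB")             # 8 GB
print(f"Reduction:    {1 - fp8_bytes / fp16_bytes:.0%}")     # 50%
```

In practice the measured reduction lands near, but not exactly at, 50%, since embeddings and layer norms are not quantized.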
## Usage
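A minimal sketch of loading the checkpoint with vLLM's offline Python API. The `MODEL_ID` below is a placeholder, not this checkpoint's actual repo id, and the sketch assumes a vLLM build that supports ModelOpt FP8 checkpoints on FP8-capable hardware such as H100:

```python
from vllm import LLM, SamplingParams

# Placeholder: substitute the actual Hugging Face repo id of this checkpoint.
MODEL_ID = "<org>/<model-name-FP8>"

# vLLM can detect the quantization method from the checkpoint config;
# passing quantization="modelopt" makes the choice explicit.
llm = LLM(model=MODEL_ID, quantization="modelopt")

sampling = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is FP8 quantization?"], sampling)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be deployed through TensorRT-LLM; consult its documentation for the engine-build workflow.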