Update README.md
README.md (CHANGED)
@@ -9,7 +9,7 @@ This model is ready for commercial and non-commercial use. <br>

This model is not owned or developed by NVIDIA. This model has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA [Meta-Llama-3.1-8B-Instruct Model Card](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).

### License/Terms of Use:
-[llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
+[llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/LICENSE)

## Model Architecture:
@@ -54,7 +54,7 @@ The model is quantized with nvidia-modelopt **v0.15.1** <br>

**Test Hardware:** H100 <br>

## Post Training Quantization
-This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B to FP8 data type, ready for inference with TensorRT-LLM and vLLM. Only the weights and activations of the linear operators within transformer blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. On H100, we achieved a 1.3x speedup.
+This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to FP8 data type, ready for inference with TensorRT-LLM and vLLM. Only the weights and activations of the linear operators within transformer blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. On H100, we achieved a 1.3x speedup.

## Usage
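The commit does not show how the FP8 checkpoint was produced. For orientation only, a minimal sketch of post-training quantization with nvidia-modelopt might look like the following; the base model ID, calibration texts, and the default FP8 config here are assumptions, not the exact recipe behind this checkpoint.

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed base model; the actual recipe and calibration data are not part of this commit.
base_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

calib_texts = ["Hello, how are you today?"]  # placeholder calibration set

def forward_loop(model):
    # Run a few calibration batches so activation ranges can be observed before quantization.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        model(**inputs)

# Quantize the weights and activations of the linear layers to FP8 with modelopt's default FP8 config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```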
@@ -121,10 +121,10 @@ The accuracy (MMLU, 5-shot) and throughputs (tokens per second, TPS) benchmark r

We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 for the throughput measurements with in-flight batching enabled. We achieved a **~1.3x** speedup with FP8.

### Deploy with vLLM
-To deploy the quantized checkpoint with [vLLM](https://github.com/vllm-project/vllm.git), follow the instructions below:
+To deploy the quantized checkpoint with [vLLM](https://github.com/vllm-project/vllm.git), follow the instructions below:

1. Install vLLM following the directions [here](https://github.com/vllm-project/vllm?tab=readme-ov-file#getting-started).
-2. To use a Model Optimizer PTQ checkpoint with vLLM, the `quantization=modelopt` flag must be passed when initializing the `LLM` engine.
+2. To use a Model Optimizer PTQ checkpoint with vLLM, the `quantization=modelopt` flag must be passed when initializing the `LLM` engine.

Example deployment on H100:
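The hunk cuts off before the deployment snippet itself. As a hedged illustration of step 2, a minimal vLLM example might look like the sketch below; the checkpoint ID is a placeholder for wherever the quantized model is actually hosted.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint ID; point this at the actual FP8 quantized checkpoint.
checkpoint = "nvidia/Llama-3.1-8B-Instruct-FP8"

# quantization="modelopt" tells vLLM to load a Model Optimizer PTQ checkpoint.
llm = LLM(model=checkpoint, quantization="modelopt")

sampling_params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=64)
outputs = llm.generate(["What is the capital of France?"], sampling_params)
print(outputs[0].outputs[0].text)
```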