zhiyucheng committed on
Commit 4b4e754 · verified · 1 Parent(s): adb8efe

Update README.md

Files changed (1): README.md +4 -4
README.md CHANGED
@@ -9,7 +9,7 @@ This model is ready for commercial and non-commercial use. <br>
 This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA [(Meta-Llama-3.1-8B-Instruct) Model Card](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
 
 ### License/Terms of Use:
- [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
+ [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/LICENSE)
 
 
 ## Model Architecture:
@@ -54,7 +54,7 @@ The model is quantized with nvidia-modelopt **v0.15.1** <br>
 **Test Hardware:** H100 <br>
 
 ## Post Training Quantization
- This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B to FP8 data type, ready for inference with TensorRT-LLM and vLLM. Only the weights and activations of the linear operators within transformers blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. On H100, we achieved 1.3x speedup.
+ This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to FP8 data type, ready for inference with TensorRT-LLM and vLLM. Only the weights and activations of the linear operators within transformers blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. On H100, we achieved 1.3x speedup.
 
 ## Usage
 
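The FP8 post-training quantization described in the hunk above follows nvidia-modelopt's PTQ workflow. The sketch below is a rough illustration, not the recipe actually used to produce this checkpoint; the model id, calibration prompts, and forward loop are assumptions.

```python
# Hedged sketch of FP8 PTQ with nvidia-modelopt (TensorRT Model Optimizer).
# The calibration data and model id are placeholders, not this checkpoint's settings.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")

# A toy calibration set; a real run would use a much larger, representative corpus.
calib_prompts = ["FP8 quantization reduces memory because", "The capital of France is"]

def forward_loop(m):
    # Run calibration prompts through the model so modelopt can collect activation scales.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize weights and activations of the linear layers to FP8, as the README describes.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

The calibrated model would then be exported to a deployable checkpoint with modelopt's export utilities (not shown here) before serving with TensorRT-LLM or vLLM.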
 
@@ -121,10 +121,10 @@ The accuracy (MMLU, 5-shot) and throughputs (tokens per second, TPS) benchmark r
 We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 for the throughputs with in-flight batching enabled. We achieved **~1.3x** speedup with FP8.
 
 ### Deploy with vLLM
- To deploy the quantized checkpoint with [vLLM](https://github.com/vllm-project/vllm.git), follow the instructions below:
+ To deploy the quantized checkpoint with [vLLM](https://github.com/vllm-project/vllm.git), follow the instructions below:
 
 1. Install vLLM from directions [here](https://github.com/vllm-project/vllm?tab=readme-ov-file#getting-started).
- 2. To use a Model Optimizer PTQ checkpoint with vLLM, `quantization=modelopt` flag must be passed into the config while initializing the `LLM` Engine.
+ 2. To use a Model Optimizer PTQ checkpoint with vLLM, `quantization=modelopt` flag must be passed into the config while initializing the `LLM` Engine.
 
 Example deployment on H100:
 
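The "Example deployment on H100" code itself falls outside this hunk. A minimal vLLM sketch consistent with step 2 above might look like the following; the model path is a placeholder rather than a value taken from the diff.

```python
# Hedged vLLM deployment sketch; the model path below is a placeholder for the
# FP8 checkpoint published by this repository, not a value taken from the diff.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<path-or-hub-id-of-the-FP8-checkpoint>",  # placeholder
    quantization="modelopt",  # so vLLM loads the Model Optimizer PTQ checkpoint
)

outputs = llm.generate(
    ["Summarize FP8 post-training quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Passing `quantization="modelopt"` tells vLLM to read the FP8 scaling factors stored in the checkpoint instead of attempting to quantize the model on the fly.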