zhiyucheng committed
Commit 1492bd8 · 1 Parent(s): 2ae2dab

update readme

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -66,10 +66,13 @@ v0.23.0 <br>
 * [Human] <br>
 
 ## Medusa Speculative Decoding and Post Training Quantization
-Synthesized data was obtained from a FP8 quantized version of Meta-Llama-3.1-8B-Instruct, which is then used to finetune the Medusa heads. This model was then obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct together with the Medusa heads to FP8 data type, ready for inference with TensorRT-LLM in Medusa speculative decoding mode. Only the weights and activations of the linear operators within transformers blocks and Medusa heads are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Medusa heads are used to predict candidate tokens beyond the next token. In the generation step, each Medusa head generates a distribution of tokens beyond the previous. Then a tree-based attention mechanism samples some candidate sequences for the original model to validate. The longest accepted candidate sequence is selected so that more than 1 token is returned in the generation step. The number of tokens generated in each step is called acceptance rate.
+Synthesized data was obtained from an FP8-quantized version of Meta-Llama-3.1-8B-Instruct and was then used to fine-tune the Medusa heads. This model was then obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct, together with the Medusa heads, to the FP8 data type, ready for inference with TensorRT-LLM in Medusa speculative decoding mode. Only the weights and activations of the linear operators within the transformer blocks and Medusa heads are quantized. This optimization reduces the number of bits per parameter from 16 to 8, cutting the disk size and GPU memory requirements by approximately 50%.
+
+Medusa heads are used to predict candidate tokens beyond the next token. In each generation step, every Medusa head produces a distribution over tokens one position beyond that of the previous head. A tree-based attention mechanism then samples candidate sequences for the original model to validate, and the longest accepted candidate sequence is selected, so more than one token can be returned per generation step. The number of tokens generated in each step is called the acceptance rate.
 
 ## Usage
-To generate text using TensorRT-LLM LLMAPI (supported in v0.17):
+To run inference with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) (supported from [v0.17](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.17.0)), we recommend using the LLM API, as shown in [this example](https://github.com/NVIDIA/TensorRT-LLM/blob/v0.17.0/examples/llm-api/llm_medusa_decoding.py#L34), either with `python llm_medusa_decoding.py --use_modelopt_ckpt` or with the snippet below. The LLM API abstracts away steps such as checkpoint conversion, engine building, and inference.
+
 ```python
 ### Generate Text Using Medusa Decoding
 
@@ -106,7 +109,7 @@ def main():
         [4, 0], [2, 1], [0, 0, 4], [0, 0, 5], [0, 1, 1], [0, 0, 6], [0, 3, 0], [5, 0], [1, 3], [0, 0, 7], \
         [0, 0, 8], [0, 0, 9], [6, 0], [0, 4, 0], [1, 4], [7, 0], [0, 1, 2], [2, 0, 0], [3, 1], [2, 2], [8, 0], [0, 5, 0], [1, 5], [1, 0, 1], [0, 2, 1], [9, 0], [0, 6, 0], [1, 6], [0, 7, 0]]
     )
-    llm = LLM(model="./hf_ckpt",
+    llm = LLM(model="nvidia/Llama-3.1-8B-Medusa-FP8",
               build_config=build_config,
               speculative_config=speculative_config)
 
@@ -124,10 +127,8 @@ if __name__ == '__main__':
 
 ```
 
-To deploy the quantized checkpoint with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), follow the sample commands for [Medusa decoding](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.17.0/examples/medusa#usage) in the TensorRT-LLM GitHub repo to convert the checkpoint and build TensorRT-LLM engine.
-
-* Throughputs evaluation:
-Please refer to the [TensorRT-LLM benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/Suite.md) for details.
+Alternatively, you can follow the [sample CLIs for Medusa decoding](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.17.0/examples/medusa#usage) in the TensorRT-LLM GitHub repo.
+Support in [TensorRT-LLM benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) with `trtllm-bench` is coming soon.
 
 ## Evaluation
 The accuracy (MMLU, 5-shot) and Medusa acceptance rate benchmark results are presented in the table below:
@@ -143,4 +144,3 @@ The accuracy (MMLU, 5-shot) and Medusa acceptance rate benchmark results are presented in the table below:
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
 Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.NVIDIA.com/en-us/support/submit-security-vulnerability/).
-
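
The ~50% size claim in the updated paragraph is easy to sanity-check. Below is a back-of-the-envelope sketch (not from the README): it assumes roughly 8B parameters and that the quantized linear layers dominate the footprint, ignoring embeddings, norms, KV cache, and Medusa-head overhead.

```python
# Rough arithmetic behind "16 -> 8 bits per parameter ~= 50% less memory".
# Illustrative only: the parameter count and the "linear layers dominate"
# assumption are approximations, not measurements from the checkpoint.
NUM_PARAMS = 8e9                  # approx. parameter count of an 8B model

bf16_gb = NUM_PARAMS * 2 / 1e9    # 16-bit weights: 2 bytes per parameter
fp8_gb = NUM_PARAMS * 1 / 1e9     # 8-bit weights: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")           # ~16 GB
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")            # ~8 GB
print(f"Reduction:    {1 - fp8_gb / bf16_gb:.0%}")  # 50%
```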
 
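How the heads turn into extra tokens per step can be made concrete with a toy example. This is only an illustration of the accept-longest-prefix idea described in the new paragraph, not TensorRT-LLM's actual implementation (which validates a whole candidate tree via tree attention):

```python
# Toy Medusa-style acceptance check (greedy matching, single candidate):
# keep the longest draft prefix the target model agrees with, then add the
# target model's own next token, so each step yields at least one token.

def accepted_length(draft: list[int], target: list[int]) -> int:
    """Length of the longest draft prefix that matches the target's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n

draft = [11, 42, 7, 99]    # candidate tokens proposed by the Medusa heads
target = [11, 42, 13, 5]   # tokens the target model produces when validating
tokens_this_step = accepted_length(draft, target) + 1  # +1 bonus target token
print(tokens_this_step)    # 3 tokens in one step -> acceptance rate of 3
```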
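
For reference, the usage flow the updated README points to can be condensed into one self-contained script. This is a sketch modeled on the linked llm_medusa_decoding.py example, assuming the v0.17 `tensorrt_llm.llmapi` imports (`BuildConfig`, `MedusaDecodingConfig`); the build settings and the abridged `medusa_choices` tree are illustrative placeholders, so use the full 63-path tree from the README snippet for real runs.

```python
# Condensed sketch of Medusa decoding via the TensorRT-LLM LLM API.
# Assumptions (not verbatim from the README): the BuildConfig values and the
# abridged medusa_choices tree below are placeholders for illustration.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import BuildConfig, MedusaDecodingConfig


def main():
    build_config = BuildConfig(max_batch_size=1, max_seq_len=1024)

    # Each entry is a path of Medusa-head indices in the draft tree.
    # Abridged here; the README's full tree has 63 paths.
    speculative_config = MedusaDecodingConfig(
        max_draft_len=63,
        medusa_choices=[[0], [0, 0], [1], [0, 1], [2], [0, 0, 0], [1, 0], [0, 2]],
    )

    # The LLM API pulls the checkpoint from the Hub, converts it, and
    # builds the engine before running generation.
    llm = LLM(model="nvidia/Llama-3.1-8B-Medusa-FP8",
              build_config=build_config,
              speculative_config=speculative_config)

    sampling_params = SamplingParams(max_tokens=64)
    for output in llm.generate(["The future of AI is"], sampling_params):
        print(output.outputs[0].text)


if __name__ == '__main__':
    main()
```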