This model is part of the Llama Nemotron Collection. You can find the other models in this family here:

- [Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1)
- [Llama-3.1-Nemotron-Ultra-253B-v1](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1)

This model is ready for commercial use.
You can try this model out through the preview API, using this link: [Llama-3_3-Nemotron-Super-49B-v1](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1).

### Use It with Transformers

See the snippet below for usage with the [Hugging Face Transformers](https://huggingface.co/docs/transformers/main/en/index) library. Reasoning mode (ON/OFF) is controlled via the system prompt.

We recommend using the *transformers* package with version 4.48.3.
```
import torch
import transformers

model_id = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Reasoning mode is toggled via the system prompt: "on" or "off"
thinking = "off"

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"},{"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))
```
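Since reasoning mode is switched purely through the system prompt, toggling it amounts to changing one string. A minimal sketch (the helper name here is ours, not part of the model card):

```python
def reasoning_system_message(thinking: str) -> dict:
    # Build the system message that toggles Nemotron reasoning mode ("on"/"off").
    assert thinking in ("on", "off")
    return {"role": "system", "content": f"detailed thinking {thinking}"}

# The same user turn can then be run in either mode:
question = {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
messages_on = [reasoning_system_message("on"), question]
messages_off = [reasoning_system_message("off"), question]
print(messages_off[0]["content"])  # detailed thinking off
```

Either message list can be passed directly to the pipeline above.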

### Use It with vLLM

```
pip install vllm==0.8.3
```

An example of how to serve the model with vLLM:
```
python3 -m vllm.entrypoints.openai.api_server \
    --model "nvidia/Llama-3_3-Nemotron-Super-49B-v1" \
    --trust-remote-code \
    --seed=1 \
    --host="0.0.0.0" \
    --port=5000 \
    --served-model-name "nvidia/Llama-3_3-Nemotron-Super-49B-v1" \
    --tensor-parallel-size=8 \
    --max-model-len=32768 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager
```
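Once the server is up it exposes an OpenAI-compatible endpoint on the host/port configured above. A minimal client sketch using only the Python standard library (the endpoint path and payload shape follow the OpenAI chat-completions convention that vLLM implements; reasoning mode is again set through the system prompt):

```python
import json
import urllib.request

# Chat request for the vLLM server launched above.
payload = {
    "model": "nvidia/Llama-3_3-Nemotron-Super-49B-v1",
    "messages": [
        {"role": "system", "content": "detailed thinking off"},
        {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
    ],
    "max_tokens": 512,
}

def query(url="http://0.0.0.0:5000/v1/chat/completions"):
    # POST the request and return the assistant's reply text.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# query() returns the assistant message once the server from the previous
# snippet is running.
```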

## Inference:

**Engine:**