--- license: llama3.1 language: - en inference: false fine-tuning: false tags: - nvidia - llama3.1 datasets: - nvidia/HelpSteer2 base_model: meta-llama/Llama-3.1-70B-Instruct pipeline_tag: text-generation library_name: transformers ---

Llama-3.1-Nemotron-70B-Instruct-HF Model Icon

Validated Badge ## Description: Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries. This model reaches [Arena Hard](https://github.com/lmarena/arena-hard-auto) of 85.0, [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) of 57.6 and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet. As of Oct 24th, 2024 the model has Elo Score of 1267(+-7), rank 9 and style controlled rank of 26 on [ChatBot Arena leaderboard](https://lmarena.ai/?leaderboard). This model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) and [HelpSteer2-Preference prompts](https://huggingface.co/datasets/nvidia/HelpSteer2) on a Llama-3.1-70B-Instruct model as the initial policy. Llama-3.1-Nemotron-70B-Instruct-HF has been converted from [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) to support it in the HuggingFace Transformers codebase. Please note that evaluation results might be slightly different from the [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) as evaluated in NeMo-Aligner, which the evaluation results below are based on. Try hosted inference for free at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct) - it comes with an OpenAI-compatible API interface. See details on our paper at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257) - as a preview, this model can correctly the question ```How many r in strawberry?``` without specialized prompting or additional reasoning tokens: ``` A sweet question! Let’s count the “R”s in “strawberry”: 1. S 2. T 3. R 4. A 5. W 6. B 7. E 8. R 9. R 10. Y There are **3 “R”s** in the word “strawberry”. ``` Note: This model is a demonstration of our techniques for improving helpfulness in general-domain instruction following. It has not been tuned for performance in specialized domains such as math. ## License Your use of this model is governed by the [NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf). Additional Information: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/). Built with Llama. ## Deployment This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Openshift AI, as shown in the example below. Deploy on vLLM ```python from vllm import LLM, SamplingParams from transformers import AutoTokenizer model_id = "RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF" number_gpus = 4 sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256) tokenizer = AutoTokenizer.from_pretrained(model_id) prompt = "Give me a short introduction to large language model." llm = LLM(model=model_id, tensor_parallel_size=number_gpus) outputs = llm.generate(prompt, sampling_params) generated_text = outputs[0].outputs[0].text print(generated_text) ``` vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
Deploy on Red Hat AI Inference Server ```bash podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \ --ipc=host \ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \ --name=vllm \ registry.access.redhat.com/rhaiis/rh-vllm-cuda \ vllm serve \ --tensor-parallel-size 8 \ --max-model-len 32768 \ --enforce-eager --model RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF ``` See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
Deploy on Red Hat Enterprise Linux AI ```bash # Download model from Red Hat Registry via docker # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified. ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-1-nemotron-70b-instruct-hf:1.5 ``` ```bash # Serve model via ilab ilab model serve --model-path ~/.cache/instructlab/models/llama-3-1-nemotron-70b-instruct-hf # Chat with model ilab model chat --model ~/.cache/instructlab/models/llama-3-1-nemotron-70b-instruct-hf ``` See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
Deploy on Red Hat Openshift AI ```python # Setting up vllm server with ServingRuntime # Save as: vllm-servingruntime.yaml apiVersion: serving.kserve.io/v1alpha1 kind: ServingRuntime metadata: name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name annotations: openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' labels: opendatahub.io/dashboard: 'true' spec: annotations: prometheus.io/port: '8080' prometheus.io/path: '/metrics' multiModel: false supportedModelFormats: - autoSelect: true name: vLLM containers: - name: kserve-container image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm command: - python - -m - vllm.entrypoints.openai.api_server args: - "--port=8080" - "--model=/mnt/models" - "--served-model-name={{.Name}}" env: - name: HF_HOME value: /tmp/hf_home ports: - containerPort: 8080 protocol: TCP ``` ```python # Attach model to vllm server. This is an NVIDIA template # Save as: inferenceservice.yaml apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: annotations: openshift.io/display-name: Llama-3.1-Nemotron-70B-Instruct-HF # OPTIONAL CHANGE serving.kserve.io/deploymentMode: RawDeployment name: Llama-3.1-Nemotron-70B-Instruct-HF # specify model name. This value will be used to invoke the model in the payload labels: opendatahub.io/dashboard: 'true' spec: predictor: maxReplicas: 1 minReplicas: 1 model: modelFormat: name: vLLM name: '' resources: limits: cpu: '2' # this is model specific memory: 8Gi # this is model specific nvidia.com/gpu: '1' # this is accelerator specific requests: # same comment for this block cpu: '1' memory: 4Gi nvidia.com/gpu: '1' runtime: vllm-cuda-runtime # must match the ServingRuntime name above storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-nemotron-70b-instruct-hf:1.5 tolerations: - effect: NoSchedule key: nvidia.com/gpu operator: Exists ``` ```bash # make sure first to be in the project where you want to deploy the model # oc project # apply both resources to run model # Apply the ServingRuntime oc apply -f vllm-servingruntime.yaml # Apply the InferenceService oc apply -f qwen-inferenceservice.yaml ``` ```python # Replace and below: # - Run `oc get inferenceservice` to find your URL if unsure. # Call the server using curl: curl https://-predictor-default./v1/chat/completions -H "Content-Type: application/json" \ -d '{ "model": $(model-name), "stream": true, "stream_options": { "include_usage": true }, "max_tokens": 1, "messages": [ { "role": "user", "content": "How can a bee fly when its wings are so small?" } ] }' ``` See [Red Hat Openshift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
## Evaluation Metrics As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Instruct performs best on Arena Hard, AlpacaEval 2 LC (verified tab) and MT Bench (GPT-4-Turbo) | Model | Arena Hard | AlpacaEval | MT-Bench | Mean Response Length | |:-----------------------------|:----------------|:-----|:----------|:-------| |Details | (95% CI) | 2 LC (SE) | (GPT-4-Turbo) | (# of Characters for MT-Bench)| | _**Llama-3.1-Nemotron-70B-Instruct**_ | **85.0** (-1.5, 1.5) | **57.6** (1.65) | **8.98** | 2199.8 | | Llama-3.1-70B-Instruct | 55.7 (-2.9, 2.7) | 38.1 (0.90) | 8.22 | 1728.6 | | Llama-3.1-405B-Instruct | 69.3 (-2.4, 2.2) | 39.3 (1.43) | 8.49 | 1664.7 | | Claude-3-5-Sonnet-20240620 | 79.2 (-1.9, 1.7) | 52.4 (1.47) | 8.81 | 1619.9 | | GPT-4o-2024-05-13 | 79.3 (-2.1, 2.0) | 57.5 (1.47) | 8.74 | 1752.2 | ## Usage: You can use the model using HuggingFace Transformers library with 2 or more 80GB GPUs (NVIDIA Ampere or newer) with at least 150GB of free disk space to accomodate the download. This code has been tested on Transformers v4.44.0, torch v2.4.0 and 2 A100 80GB GPUs, but any setup that supports ```meta-llama/Llama-3.1-70B-Instruct``` should support this model as well. If you run into problems, you can consider doing ```pip install -U transformers```. ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF" model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto") tokenizer = AutoTokenizer.from_pretrained(model_name) prompt = "How many r in strawberry?" messages = [{"role": "user", "content": prompt}] tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True) response_token_ids = model.generate(tokenized_message['input_ids'].cuda(),attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=4096, pad_token_id = tokenizer.eos_token_id) generated_tokens =response_token_ids[:, len(tokenized_message['input_ids'][0]):] generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0] print(generated_text) # See response at top of model card ``` ## References(s): * [NeMo Aligner](https://arxiv.org/abs/2405.01481) * [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257) * [HelpSteer2](https://arxiv.org/abs/2406.08673) * [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/) * [Meta's Llama 3.1 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1) * [Meta's Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) ## Model Architecture: **Architecture Type:** Transformer
**Network Architecture:** Llama 3.1
## Input: **Input Type(s):** Text
**Input Format:** String
**Input Parameters:** One Dimensional (1D)
**Other Properties Related to Input:** Max of 128k tokens
## Output: **Output Type(s):** Text
**Output Format:** String
**Output Parameters:** One Dimensional (1D)
**Other Properties Related to Output:** Max of 4k tokens
## Software Integration: **Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Ampere
* NVIDIA Hopper
* NVIDIA Turing
**Supported Operating System(s):** Linux
## Model Version: v1.0 # Training & Evaluation: ## Alignment methodology * REINFORCE implemented in NeMo Aligner ## Datasets: **Data Collection Method by dataset**
* [Hybrid: Human, Synthetic]
**Labeling Method by dataset**
* [Human]
**Link:** * [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2) **Properties (Quantity, Dataset Descriptions, Sensor(s)):**
* 21, 362 prompt-responses built to make more models more aligned with human preference - specifically more helpful, factually-correct, coherent, and customizable based on complexity and verbosity. * 20, 324 prompt-responses used for training and 1, 038 used for validation. # Inference: **Engine:** [Triton](https://developer.nvidia.com/triton-inference-server)
**Test Hardware:** H100, A100 80GB, A100 40GB
## Ethical Considerations: NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). ## Citation If you find this model useful, please cite the following works ```bibtex @misc{wang2024helpsteer2preferencecomplementingratingspreferences, title={HelpSteer2-Preference: Complementing Ratings with Preferences}, author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong}, year={2024}, eprint={2410.01257}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2410.01257}, } ```