robgreenberg3 and jennyyyi committed
Commit c263a90 · verified · 1 Parent(s): bfe3a2c

Update README.md (#1)


- Update README.md (8c923e722a2fc5b6c4ece788b41751de1af295d6)
- Update README.md (c466f7b8c23b7d5a803f7ee2d4cc4d2965d703cb)


Co-authored-by: Jenny Y <[email protected]>

Files changed (1)
  1. README.md +197 -0
README.md CHANGED
@@ -188,6 +188,15 @@ extra_gated_description: The information you provide will be collected, stored,
and shared in accordance with the [Meta Privacy Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
---
<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
  Llama-3.1-8B-Instruct
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>

<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>

## Model Information
**Built with Llama**

@@ -292,6 +301,194 @@ Where to send questions or comments about the model Instructions on how to provi

This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase.

### Deployment

This model can be deployed efficiently using vLLM, Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.

Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-3.1-8B-Instruct"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Build the prompt with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
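
As a minimal sketch, once an OpenAI-compatible server is running (for example via `vllm serve RedHatAI/Llama-3.1-8B-Instruct`), it can be queried with the OpenAI Python client; the local URL, placeholder API key, and sampling settings below are illustrative, not part of the original card.

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible vLLM server, e.g. started with:
#   vllm serve RedHatAI/Llama-3.1-8B-Instruct
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```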

<details>
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>

```bash
podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/Llama-3.1-8B-Instruct
```

See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>

```bash
# Download model from Red Hat Registry via docker
# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-1-8b-instruct:1.5
```

```bash
# Serve model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/llama-3-1-8b-instruct

# Chat with model
ilab model chat --model ~/.cache/instructlab/models/llama-3-1-8b-instruct
```

See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>

```yaml
# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: llama-3-1-8b-instruct # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: llama-3-1-8b-instruct # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# Make sure you are in the project where you want to deploy the model
# oc project <project-name>

# Apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-1-8b-instruct",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```

See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>

### Use with transformers

Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function.
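
A minimal sketch of the `pipeline` route is shown below; the sampling settings and prompts are illustrative, not part of the original card.

```python
import torch
import transformers

model_id = "RedHatAI/Llama-3.1-8B-Instruct"

# Text-generation pipeline loading the model in bfloat16 across available devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

outputs = pipeline(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])
```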