diff for compatibility
- README.md: +12 −86
- config.json: +2 −2
README.md
CHANGED
@@ -49,96 +49,22 @@ metrics:
Before:

pipeline_tag: audio-text-to-text
---

Ultravox is a multimodal Speech LLM built around a pretrained [Llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) and [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) backbone.

The input to the model is given as a text prompt with a special `<|audio|>` pseudo-token, and the model processor will replace this magic token with embeddings derived from the input audio.
Using the merged embeddings as input, the model will then generate output text as usual.
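As an illustrative sketch (not part of the original card), the pseudo-token sits inside an ordinary chat turn; the user-turn wording below is invented for the example, and the processor is what swaps the placeholder for audio-derived embeddings:

```python
# Illustrative only: a prompt layout carrying the audio pseudo-token.
# The processor replaces <|audio|> with embeddings computed from the clip,
# and the merged embedding sequence is fed to the Llama backbone as usual.
turns = [
    {"role": "system", "content": "You are a friendly and helpful character."},
    {"role": "user", "content": "What is being said in this clip? <|audio|>"},
]
```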
No preference tuning has been applied to this revision of the model.

- **Developed by:** Fixie.ai
- **License:** MIT

### Model Sources

- **Repository:** https://ultravox.ai
- **Demo:** See repo

## Usage

Think of the model as an LLM that can also hear and understand speech. As such, it can be used as a voice agent, and also to do speech-to-speech translation, analysis of spoken audio, etc.

To use the model, try the following:

```python
# pip install transformers peft librosa

import transformers
import numpy as np
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_5-llama-3_3-70b', trust_remote_code=True)

path = "<path-to-input-audio>"  # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```
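A 70B backbone rarely fits on a single default device. A hedged variant of the same call using standard `transformers` pipeline options (`device_map`, `torch_dtype`), which the original card does not specify, might look like this:

```python
# Assumption: standard transformers pipeline kwargs; not prescribed by the card.
import torch
import transformers

pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_5-llama-3_3-70b',
    trust_remote_code=True,
    device_map="auto",            # shard the 70B backbone across available GPUs
    torch_dtype=torch.bfloat16,   # matches the BF16 training regime noted below
)
```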
## Training Details

The model uses a pre-trained [Llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) backbone as well as the encoder part of [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo).

The multi-modal adapter is trained, the Whisper encoder is fine-tuned, and the Llama model is kept frozen.

We use a knowledge-distillation loss where Ultravox tries to match the logits of the text-based Llama backbone.
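As a rough sketch of such an objective (the real implementation lives in the linked training code; the names and temperature handling below are assumptions), the speech-conditioned student's next-token distribution is pulled toward that of the frozen text-only Llama teacher with a KL divergence:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Sketch of a knowledge-distillation loss: KL(teacher || student) over the
    text vocabulary, temperature-scaled. Not the repository's exact objective."""
    s = F.log_softmax(student_logits / temperature, dim=-1)  # student log-probs
    t = F.softmax(teacher_logits / temperature, dim=-1)      # teacher probs
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```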
### Training Data

The training dataset is a mix of ASR datasets, extended with continuations generated by Llama 3.1 8B, and speech translation datasets, which yield a modest improvement in translation evaluations.

### Training Procedure

Supervised speech instruction finetuning via knowledge-distillation. For more info, see [training code in the Ultravox repo](https://github.com/fixie-ai/ultravox/blob/main/ultravox/training/train.py).

#### Training Hyperparameters

- **Training regime:** BF16 mixed-precision training
- **Hardware used:** 8x H100 GPUs

#### Speeds, Sizes, Times

The current version of Ultravox, when invoked with audio content, has a time-to-first-token (TTFT) of approximately 150 ms and a tokens-per-second rate of ~50-100 when using an A100-40GB GPU, all with a Llama 3.3 70B backbone.

Check out the audio tab on [TheFastest.ai](https://thefastest.ai/?m=audio) for daily benchmarks and a comparison with other existing models.

## Evaluation

| | Ultravox 0.4 70B | Ultravox 0.4.1 70B | **Ultravox 0.5 70B** |
| --- | ---: | ---: | ---: |
| **covost2 en_ar** | 14.97 | 19.64 | 20.21 |
| **covost2 en_ca** | 35.02 | 37.58 | 40.01 |
| **covost2 en_de** | 30.30 | 32.47 | 34.53 |
| **covost2 es_en** | 39.55 | 40.76 | 43.29 |
| **covost2 ru_en** | 44.16 | 45.07 | 48.99 |
| **covost2 zh_en** | 12.16 | 17.98 | 21.37 |
| **big bench audio** | -- | 76.20 | 82.70 |
After:

pipeline_tag: audio-text-to-text
---

<!-- header start -->
<p align="center">
  <img src="https://huggingface.co/datasets/FriendliAI/documentation-images/resolve/main/model-card-assets/friendliai.png" width="100%" alt="FriendliAI Logo">
</p>
<!-- header end -->

# fixie-ai/ultravox-v0_5-llama-3_3-70b

* Model creator: [fixie-ai](https://huggingface.co/fixie-ai)
* Original model: [ultravox-v0_5-llama-3_3-70b](https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_3-70b)

## Differences

* Pre-pulled meta-llama/Llama-3.3-70B-Instruct weights

## License

Refer to the license of the original model card.
config.json
CHANGED
@@ -70,8 +70,8 @@
Before:

"projector_act": "swiglu",
"projector_ln_mid": true,
"stack_factor": 8,
"text_model_id": "meta-llama/Llama-3.3-70B-Instruct",
"torch_dtype": "bfloat16",
"transformers_version": "4.48.2",
"vocab_size": 128256
}
After:

"projector_act": "swiglu",
"projector_ln_mid": true,
"stack_factor": 8,
"text_model_id": "/model/Llama-3.3-70B-Instruct",
"torch_dtype": "bfloat16",
"transformers_version": "4.48.2",
"vocab_size": 128256
}
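If the Llama weights are not mounted at `/model/Llama-3.3-70B-Instruct`, the `text_model_id` attribute can typically be overridden at load time. The sketch below relies on standard `transformers` config handling rather than anything documented in this commit, and `<this-repo>` is a placeholder for this repository's id:

```python
# Hedged sketch: point text_model_id back at the Hub id (or another local copy)
# when the pre-pulled path is not available in the runtime environment.
import transformers

config = transformers.AutoConfig.from_pretrained("<this-repo>", trust_remote_code=True)
config.text_model_id = "meta-llama/Llama-3.3-70B-Instruct"  # or a local mirror

pipe = transformers.pipeline(model="<this-repo>", config=config, trust_remote_code=True)
```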