diff for compatibility
- README.md: +12 −86
- config.json: +2 −2
README.md
CHANGED
@@ -49,96 +49,22 @@ metrics:
Before:

pipeline_tag: audio-text-to-text
---

Ultravox is a multimodal Speech LLM built around a pretrained [Llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) and [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) backbone.

The input to the model is given as a text prompt with a special `<|audio|>` pseudo-token, and the model processor will replace this magic token with embeddings derived from the input audio.
Using the merged embeddings as input, the model will then generate output text as usual.
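As an illustrative sketch (not part of the original card), the pseudo-token sits inside an ordinary chat turn; the user-turn wording below is invented for the example, and the processor is what swaps the placeholder for audio-derived embeddings:

```python
# Illustrative only: a prompt layout carrying the audio pseudo-token.
# The processor replaces <|audio|> with embeddings computed from the clip,
# and the merged embedding sequence is fed to the Llama backbone as usual.
turns = [
    {"role": "system", "content": "You are a friendly and helpful character."},
    {"role": "user", "content": "What is being said in this clip? <|audio|>"},
]
```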
No preference tuning has been applied to this revision of the model.

- **Developed by:** Fixie.ai
- **License:** MIT

### Model Sources

- **Repository:** https://ultravox.ai
- **Demo:** See repo

## Usage

Think of the model as an LLM that can also hear and understand speech. As such, it can be used as a voice agent, and also to do speech-to-speech translation, analysis of spoken audio, etc.

To use the model, try the following:

```python
# pip install transformers peft librosa

import transformers
import numpy as np
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_5-llama-3_3-70b', trust_remote_code=True)

path = "<path-to-input-audio>"  # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```
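A 70B backbone rarely fits on a single default device. A hedged variant of the same call using standard `transformers` pipeline options (`device_map`, `torch_dtype`), which the original card does not specify, might look like this:

```python
# Assumption: standard transformers pipeline kwargs; not prescribed by the card.
import torch
import transformers

pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_5-llama-3_3-70b',
    trust_remote_code=True,
    device_map="auto",            # shard the 70B backbone across available GPUs
    torch_dtype=torch.bfloat16,   # matches the BF16 training regime noted below
)
```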
## Training Details

The model uses a pre-trained [Llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) backbone as well as the encoder part of [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo).

The multi-modal adapter is trained, the Whisper encoder is fine-tuned, and the Llama model is kept frozen.

We use a knowledge-distillation loss where Ultravox tries to match the logits of the text-based Llama backbone.
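As a rough sketch of such an objective (the real implementation lives in the linked training code; the names and temperature handling below are assumptions), the speech-conditioned student's next-token distribution is pulled toward that of the frozen text-only Llama teacher with a KL divergence:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Sketch of a knowledge-distillation loss: KL(teacher || student) over the
    text vocabulary, temperature-scaled. Not the repository's exact objective."""
    s = F.log_softmax(student_logits / temperature, dim=-1)  # student log-probs
    t = F.softmax(teacher_logits / temperature, dim=-1)      # teacher probs
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```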
### Training Data

The training dataset is a mix of ASR datasets, extended with continuations generated by Llama 3.1 8B, and speech translation datasets, which yield a modest improvement in translation evaluations.

### Training Procedure

Supervised speech instruction finetuning via knowledge-distillation. For more info, see [training code in the Ultravox repo](https://github.com/fixie-ai/ultravox/blob/main/ultravox/training/train.py).

#### Training Hyperparameters

- **Training regime:** BF16 mixed-precision training
- **Hardware used:** 8x H100 GPUs

#### Speeds, Sizes, Times

The current version of Ultravox, when invoked with audio content, has a time-to-first-token (TTFT) of approximately 150 ms and a tokens-per-second rate of ~50-100 when using an A100-40GB GPU, all with a Llama 3.3 70B backbone.

Check out the audio tab on [TheFastest.ai](https://thefastest.ai/?m=audio) for daily benchmarks and a comparison with other existing models.

## Evaluation

| | Ultravox 0.4 70B | Ultravox 0.4.1 70B | **Ultravox 0.5 70B** |
| --- | ---: | ---: | ---: |
| **covost2 en_ar** | 14.97 | 19.64 | 20.21 |
| **covost2 en_ca** | 35.02 | 37.58 | 40.01 |
| **covost2 en_de** | 30.30 | 32.47 | 34.53 |
| **covost2 es_en** | 39.55 | 40.76 | 43.29 |
| **covost2 ru_en** | 44.16 | 45.07 | 48.99 |
| **covost2 zh_en** | 12.16 | 17.98 | 21.37 |
| **big bench audio** | -- | 76.20 | 82.70 |
After:

pipeline_tag: audio-text-to-text
---

<!-- header start -->
<p align="center">
  <img src="https://huggingface.co/datasets/FriendliAI/documentation-images/resolve/main/model-card-assets/friendliai.png" width="100%" alt="FriendliAI Logo">
</p>
<!-- header end -->

# fixie-ai/ultravox-v0_5-llama-3_3-70b

* Model creator: [fixie-ai](https://huggingface.co/fixie-ai)
* Original model: [ultravox-v0_5-llama-3_3-70b](https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_3-70b)

## Differences

* Pre-pulled meta-llama/Llama-3.3-70B-Instruct weights

## License

Refer to the license of the original model card.
config.json
CHANGED
@@ -70,8 +70,8 @@
Before:

"projector_act": "swiglu",
"projector_ln_mid": true,
"stack_factor": 8,
"text_model_id": "meta-llama/Llama-3.3-70B-Instruct",
"torch_dtype": "bfloat16",
"transformers_version": "4.48.2",
"vocab_size": 128256
}
After:

"projector_act": "swiglu",
"projector_ln_mid": true,
"stack_factor": 8,
"text_model_id": "/model/Llama-3.3-70B-Instruct",
"torch_dtype": "bfloat16",
"transformers_version": "4.48.2",
"vocab_size": 128256
}
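If the Llama weights are not mounted at `/model/Llama-3.3-70B-Instruct`, the `text_model_id` attribute can typically be overridden at load time. The sketch below relies on standard `transformers` config handling rather than anything documented in this commit, and `<this-repo>` is a placeholder for this repository's id:

```python
# Hedged sketch: point text_model_id back at the Hub id (or another local copy)
# when the pre-pulled path is not available in the runtime environment.
import transformers

config = transformers.AutoConfig.from_pretrained("<this-repo>", trust_remote_code=True)
config.text_model_id = "meta-llama/Llama-3.3-70B-Instruct"  # or a local mirror

pipe = transformers.pipeline(model="<this-repo>", config=config, trust_remote_code=True)
```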