minpeter committed on
Commit 1e2145f · verified · 1 Parent(s): 8230a56

diff for compatibility

Files changed (2)
  1. README.md +12 -86
  2. config.json +2 -2
README.md CHANGED
@@ -49,96 +49,22 @@ metrics:
  pipeline_tag: audio-text-to-text
  ---

- # Model Card for Ultravox
-
- Ultravox is a multimodal Speech LLM built around a pretrained [Llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) and [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) backbone.
-
- See https://ultravox.ai for the GitHub repo and more information.
-
-
- ## Model Details
-
- ### Model Description
-
- Ultravox is a multimodal model that can consume both speech and text as input (e.g., a text system prompt and voice user message).
- The input to the model is given as a text prompt with a special `<|audio|>` pseudo-token, and the model processor will replace this magic token with embeddings derived from the input audio.
- Using the merged embeddings as input, the model will then generate output text as usual.
-
- In a future revision of Ultravox, we plan to expand the token vocabulary to support generation of semantic and acoustic audio tokens, which can then be fed to a vocoder to produce voice output.
- No preference tuning has been applied to this revision of the model.
-
- - **Developed by:** Fixie.ai
- - **License:** MIT
-
- ### Model Sources
-
- - **Repository:** https://ultravox.ai
- - **Demo:** See repo
-
- ## Usage
-
- Think of the model as an LLM that can also hear and understand speech. As such, it can be used as a voice agent, and also to do speech-to-speech translation, analysis of spoken audio, etc.
-
- To use the model, try the following:
- ```python
- # pip install transformers peft librosa
-
- import transformers
- import numpy as np
- import librosa
-
- pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_5-llama-3_3-70b', trust_remote_code=True)
-
- path = "<path-to-input-audio>"  # TODO: pass the audio here
- audio, sr = librosa.load(path, sr=16000)
-
-
- turns = [
-   {
-     "role": "system",
-     "content": "You are a friendly and helpful character. You love to answer questions for people."
-   },
- ]
- pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
- ```
-
-
- ## Training Details
-
- The model uses a pre-trained [Llama3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) backbone as well as the encoder part of [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo).
-
- The multi-modal adapter is trained, the Whisper encoder is fine-tuned, and the Llama model is kept frozen.
-
- We use a knowledge-distillation loss where Ultravox is trying to match the logits of the text-based Llama backbone.
-
- ### Training Data
-
- The training dataset is a mix of ASR datasets, extended with continuations generated by Llama 3.1 8B, and speech translation datasets, which yield a modest improvement in translation evaluations.
-
- ### Training Procedure
-
- Supervised speech instruction finetuning via knowledge-distillation. For more info, see [training code in Ultravox repo](https://github.com/fixie-ai/ultravox/blob/main/ultravox/training/train.py).
-
-
- #### Training Hyperparameters
-
- - **Training regime:** BF16 mixed precision training
- - **Hardward used:** 8x H100 GPUs
-
- #### Speeds, Sizes, Times
-
- The current version of Ultravox, when invoked with audio content, has a time-to-first-token (TTFT) of approximately 150ms, and a tokens-per-second rate of ~50-100 when using an A100-40GB GPU, all using a Llama 3.3 70B backbone.
-
- Check out the audio tab on [TheFastest.ai](https://thefastest.ai/?m=audio) for daily benchmarks and a comparison with other existing models.
-
- ## Evaluation
-
- | | Ultravox 0.4 70B | Ultravox 0.4.1 70B | **Ultravox 0.5 70B** |
- | --- | ---: | ---: | ---: |
- | **covost2 en_ar** | 14.97 | 19.64 | 20.21 |
- | **covost2 en_ca** | 35.02 | 37.58 | 40.01 |
- | **covost2 en_de** | 30.30 | 32.47 | 34.53 |
- | **covost2 es_en** | 39.55 | 40.76 | 43.29 |
- | **covost2 ru_en** | 44.16 | 45.07 | 48.99 |
- | **covost2 zh_en** | 12.16 | 17.98 | 21.37 |
- | **big bench audio**| -- | 76.20 | 82.70 |
+ <!-- header start -->
+ <p align="center">
+ <img src="https://huggingface.co/datasets/FriendliAI/documentation-images/resolve/main/model-card-assets/friendliai.png" width="100%" alt="FriendliAI Logo">
+ </p>
+ <!-- header end -->
+
+
+ # fixie-ai/ultravox-v0_5-llama-3_3-70b
+
+ * Model creator: [fixie-ai](https://huggingface.co/fixie-ai)
+ * Original model: [ultravox-v0_5-llama-3_3-70b](https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_3-70b)
+
+ ## Differences
+
+ * Pre-pulled meta-llama/Llama-3.3-70B-Instruct weights
+
+ ## License
+
+ Refer to the license of the original model card.

config.json CHANGED
@@ -70,8 +70,8 @@
  "projector_act": "swiglu",
  "projector_ln_mid": true,
  "stack_factor": 8,
- "text_model_id": "meta-llama/Llama-3.3-70B-Instruct",
+ "text_model_id": "/model/Llama-3.3-70B-Instruct",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.2",
  "vocab_size": 128256
- }
+ }
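
For reference, a minimal usage sketch (not part of this commit), adapted from the original Ultravox model card. It assumes the pre-pulled Llama-3.3-70B-Instruct weights are available at /model/Llama-3.3-70B-Instruct, the path the updated `text_model_id` points to, and uses `<your-repo-id>` and `<path-to-input-audio>` as placeholders; the pipeline call itself is taken from the original card.

```python
# pip install transformers peft librosa
# Sketch only: <your-repo-id> and <path-to-input-audio> are placeholders, and the
# Llama-3.3-70B-Instruct weights are assumed to be pre-pulled to
# /model/Llama-3.3-70B-Instruct, where the updated text_model_id points.
import librosa
import transformers

# Same pipeline call as the original Ultravox model card; Ultravox ships custom
# modeling code, so trust_remote_code is required.
pipe = transformers.pipeline(model="<your-repo-id>", trust_remote_code=True)

# Load 16 kHz mono audio, the sampling rate expected by the Whisper encoder.
audio, sr = librosa.load("<path-to-input-audio>", sr=16000)

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people.",
    },
]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=30))
```

If the weights are mounted at a different path, `text_model_id` in config.json would need to point there instead.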