Audio-to-Audio
Transformers
Safetensors
speech_language_model
Inference Endpoints
gallilmaimon and nielsr (HF staff) committed · Commit b301b9f · verified · 1 parent: 8e79dfd

Fix typos (#1)


- Correct pipeline tag in model card (ff0ab65f3d1598bb6622b2a85c217ce175723d6e)
- Update README.md (e6504e164bd68545ee3d356761e5261e2c783f0b)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +21 -13
README.md CHANGED
@@ -1,28 +1,36 @@
  ---
- library_name: transformers
- license: mit
  datasets:
  - openslr/librispeech_asr
  - slprl/SpokenSwag
  - slprl/sTinyStories
- base_model:
- - Qwen/Qwen2.5-0.5B
  pipeline_tag: audio-to-audio
  ---

  # Model Card for Model ID
- This is a Speech Lanaguage Model trained for generating speech contiuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).


  ## Model Details

  ### Model Description
- This is a Speech Lanaguage Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.
  It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
  the 11-th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

- The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and a synthetic data
- [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over
  [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

  - **Developed by:** [SLP-RL](https://huggingface.co/slprl)
@@ -34,10 +42,10 @@ The model was trained by next-token prediction over a subset of LibriSpeech, Lib

  - **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
  - **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
- - **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

  ## Uses
- This is a base SpeechLM and as such can be used to generate contiuations for speech segments, or as base for further tuning. See the _SlamKit_
  [codebase](https://github.com/slp-rl/slamkit) for more details on usage, and checkout the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples

  ### Out-of-Scope Use
@@ -46,7 +54,7 @@ This model was trained on curated speech datasets which contain mainly audio-boo


  ## How to Get Started with the Model
- We refer users to the official repository for full usage explainations - [github](https://github.com/slp-rl/slamkit).


  ## Training Details
@@ -60,7 +68,7 @@ This model was trained on a subset of [LibriSpeech](https://huggingface.co/datas
  dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

  ### Training Procedure
- This model was trained by next token prediction over several dataset, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
  Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.

  #### Preprocessing
@@ -105,7 +113,7 @@ This model was trained using **only 2 Nvidia A100 GPU** for **48 hours**.

  #### Software
  The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase which builds upon 🤗transformers extending it to support
- easy and efficent training of Speech Language Models.

  ## Citation
 
 
  ---
+ base_model:
+ - Qwen/Qwen2.5-0.5B
  datasets:
  - openslr/librispeech_asr
  - slprl/SpokenSwag
  - slprl/sTinyStories
+ library_name: transformers
+ license: mit
  pipeline_tag: audio-to-audio
  ---

+ # Slamming: Training a Speech Language Model on One GPU in a Day
+
+ The model was presented in the paper [Slamming: Training a Speech Language Model on One GPU in a Day](https://arxiv.org/abs/2502.15814).
+
+ # Paper abstract
+
+ We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute getting results on par with leading SLMs in a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view to SLM feasibility. See code, data, models, samples at - https://pages.cs.huji.ac.il/adiyoss-lab/slamming .
+
  # Model Card for Model ID
+ This is a Speech Language Model (SLM) trained for generating speech continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).


  ## Model Details

  ### Model Description
+ This Speech Language Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focuses on efficient training.
  It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
  the 11-th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

+ The model was pre-trained using next-token prediction on a subset of LibriSpeech, Libri-Light and a synthetic dataset
+ [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was subsequently fine-tuned with DPO on
  [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
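
As a rough illustration of the tokenization this card describes (not the official slamkit pipeline), the sketch below quantises layer-11 mhubert-25hz features against a 500-entry codebook; the codebook file, the audio path and the assumption that the checkpoint loads as a plain `HubertModel` are all hypothetical.

```python
# Hypothetical sketch: waveform -> discrete speech units.
# Assumes slprl/mhubert-base-25hz loads as a standard HubertModel and that a
# 500-centroid k-means codebook (kmeans_500.pt, not part of this repo) is available.
import torch
import torchaudio
from transformers import HubertModel

hubert = HubertModel.from_pretrained("slprl/mhubert-base-25hz").eval()

wav, sr = torchaudio.load("prompt.wav")                                  # (channels, samples)
wav = torchaudio.functional.resample(wav.mean(0, keepdim=True), sr, 16000)

with torch.no_grad():
    hidden = hubert(wav, output_hidden_states=True).hidden_states[11]    # 11th layer, (1, frames, dim)

codebook = torch.load("kmeans_500.pt")                                   # (500, dim) k-means centroids
units = torch.cdist(hidden, codebook.unsqueeze(0)).argmin(-1)            # (1, frames) ids in [0, 500)
```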

  - **Developed by:** [SLP-RL](https://huggingface.co/slprl)

  - **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
  - **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
+ - **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/slamming/](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

  ## Uses
+ This base SpeechLM can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_
  [codebase](https://github.com/slp-rl/slamkit) for more details on usage, and checkout the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples

  ### Out-of-Scope Use


  ## How to Get Started with the Model
+ We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit).
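
For orientation only, here is a minimal sketch of what generating a continuation could look like with plain 🤗transformers; the model id, the unit formatting and the vocoder step are assumptions, and the slamkit repository remains the authoritative reference.

```python
# Hypothetical sketch: continue a prompt of discrete speech units with the SLM.
# Assumes the checkpoint loads as a standard causal LM; ids and paths are placeholders.
import torch
from transformers import AutoModelForCausalLM

slm = AutoModelForCausalLM.from_pretrained("slprl/slam")     # placeholder model id
prompt_units = torch.tensor([[17, 403, 403, 88, 251]])       # unit ids from the HuBERT tokenizer sketch above

with torch.no_grad():
    continuation = slm.generate(prompt_units, max_new_tokens=200, do_sample=True, top_p=0.95)

# `continuation` is a longer unit sequence; a unit-to-waveform vocoder (not shown) turns it back into audio.
```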


  ## Training Details

  dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

  ### Training Procedure
+ This model was trained by next token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
  Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
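
To make the preference stage concrete, the following is a small, self-contained sketch of the standard DPO objective (not the slamkit implementation): the policy is pushed to prefer the chosen continuation over the rejected one relative to a frozen reference model.

```python
# Illustrative DPO loss over whole unit sequences; beta and the batching are assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a (batch,) tensor of summed log-probabilities of a unit sequence."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # how much more the policy likes the chosen answer
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # ... and the rejected one
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```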

  #### Preprocessing


  #### Software
  The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase which builds upon 🤗transformers extending it to support
+ easy and efficient training of Speech Language Models.

  ## Citation