gallilmaimon committed · verified
Commit 8e79dfd · 1 Parent(s): f978b00

Update README.md

Files changed (1)
  1. README.md +16 -6
README.md CHANGED
@@ -17,7 +17,7 @@ This is a Speech Language Model trained for generating speech continuations over
 ## Model Details
 
 ### Model Description
-This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
+This is a Speech Language Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.
 It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
 the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
 
@@ -33,7 +33,7 @@ The model was trained by next-token prediction over a subset of LibriSpeech, Lib
 ### Model Sources
 
 - **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
-- **Paper:** [Soon!]
+- **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
 - **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
 
 ## Uses
@@ -50,7 +50,7 @@ We refer users to the official repository for full usage explanations - [github
 
 
 ## Training Details
-We highly encourage users to read the full [paper](), for full training details, a brief overview is provided below.
+We highly encourage users to read the full [paper](https://arxiv.org/abs/2502.15814) for full training details; a brief overview is provided below.
 
 
 ### Training Data
@@ -61,7 +61,7 @@ dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
 
 ### Training Procedure
 This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
-Please refer to the [paper]() or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
+Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
 
 #### Preprocessing
 Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
@@ -98,7 +98,7 @@ The paper provides full results; we give some results here and also refer to
 
 
 ### Compute Infrastructure
-This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training.
+This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.
 
 #### Hardware
 This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.
@@ -110,4 +110,14 @@ easy and efficient training of Speech Language Models.
 ## Citation
 
 **BibTeX:**
-Soon!
+```
+@misc{maimon2025slamming,
+      title={Slamming: Training a Speech Language Model on One GPU in a Day},
+      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
+      year={2025},
+      eprint={2502.15814},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2502.15814},
+}
+```
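
The preprocessing step described in the README (layer-11 HuBERT features quantised into 500 units) can be pictured with a minimal sketch. This is not the slamkit pipeline: the `facebook/hubert-base-ls960` encoder, the freshly fitted `KMeans`, and `example.wav` below are placeholders standing in for the card's mhubert-base-25hz tokenizer and its released 500-unit quantiser.

```python
# Illustrative sketch only: continuous HuBERT features -> discrete speech tokens.
# The actual recipe uses mhubert-base-25hz (layer 11) with a pre-trained 500-unit
# quantiser; the encoder id, k-means fit, and audio path here are placeholders.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav, sr = torchaudio.load("example.wav")              # mono 16 kHz audio expected
with torch.no_grad():
    hidden = hubert(wav, output_hidden_states=True).hidden_states
feats = hidden[11][0]                                  # layer-11 features, shape (frames, dim)

# Toy quantiser fit on one utterance (needs >= 500 frames); real quantisers are
# trained once on a large corpus and then reused.
kmeans = KMeans(n_clusters=500).fit(feats.numpy())
units = kmeans.predict(feats.numpy())                  # one discrete speech token per frame
print(units[:20])
```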
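The next-token-prediction stage can be sketched the same way, under stated assumptions: the Qwen2.5-0.5B base and the 500-unit vocabulary come from the card, while the optimiser settings, dummy batches, and sampling call below are illustrative rather than the slamkit training recipe.

```python
# Illustrative sketch only: adapt Qwen2.5-0.5B to a 500-unit speech-token vocabulary,
# take one next-token-prediction step on dummy data, then sample a continuation.
import torch
from transformers import AutoModelForCausalLM

N_UNITS = 500  # speech-token vocabulary size, per the model card

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model.resize_token_embeddings(N_UNITS)                 # swap the text vocab for speech units

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # placeholder hyper-parameters
input_ids = torch.randint(0, N_UNITS, (2, 256))        # dummy batch of speech-token sequences
loss = model(input_ids=input_ids, labels=input_ids).loss     # shifted next-token cross-entropy
loss.backward()
optimizer.step()

# Speech continuation = continuing the token sequence; vocoding the generated units
# back to a waveform is a separate step and is not shown here.
model.generation_config.eos_token_id = None            # old text eos id lies outside the new vocab
model.generation_config.pad_token_id = 0
prompt = torch.randint(0, N_UNITS, (1, 50))
continuation = model.generate(prompt, max_new_tokens=100, do_sample=True, top_k=50)
```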