gallilmaimon committed · verified
Commit 8e79dfd · 1 Parent(s): f978b00

Update README.md

Files changed (1)
  1. README.md +16 -6
README.md CHANGED
@@ -17,7 +17,7 @@ This is a Speech Language Model trained for generating speech continuations over
 ## Model Details
 
 ### Model Description
-This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
+This is a Speech Language Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.
 It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
 the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
 
@@ -33,7 +33,7 @@ The model was trained by next-token prediction over a subset of LibriSpeech, Lib
 ### Model Sources
 
 - **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
-- **Paper:** [Soon!]
+- **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
 - **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
 
 ## Uses
@@ -50,7 +50,7 @@ We refer users to the official repository for full usage explanations - [github
 
 
 ## Training Details
-We highly encourage users to read the full [paper](), for full training details, a brief overview is provided below.
+We highly encourage users to read the full [paper](https://arxiv.org/abs/2502.15814) for full training details; a brief overview is provided below.
 
 
 ### Training Data
@@ -61,7 +61,7 @@ dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
 
 ### Training Procedure
 This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
-Please refer to the [paper]() or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
+Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
 
 #### Preprocessing
 Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
@@ -98,7 +98,7 @@ The paper provides full results; we give some results here and also refer to
 
 
 ### Compute Infrastructure
-This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training.
+This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.
 
 #### Hardware
 This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.
@@ -110,4 +110,14 @@ easy and efficient training of Speech Language Models.
 ## Citation
 
 **BibTeX:**
-Soon!
+```
+@misc{maimon2025slamming,
+      title={Slamming: Training a Speech Language Model on One GPU in a Day},
+      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
+      year={2025},
+      eprint={2502.15814},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2502.15814},
+}
+```
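
The preprocessing step described in the README (layer-11 HuBERT features quantised into 500 units) can be pictured with a minimal sketch. This is not the slamkit pipeline: the `facebook/hubert-base-ls960` encoder, the freshly fitted `KMeans`, and `example.wav` below are placeholders standing in for the card's mhubert-base-25hz tokenizer and its released 500-unit quantiser.

```python
# Illustrative sketch only: continuous HuBERT features -> discrete speech tokens.
# The actual recipe uses mhubert-base-25hz (layer 11) with a pre-trained 500-unit
# quantiser; the encoder id, k-means fit, and audio path here are placeholders.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav, sr = torchaudio.load("example.wav")              # mono 16 kHz audio expected
with torch.no_grad():
    hidden = hubert(wav, output_hidden_states=True).hidden_states
feats = hidden[11][0]                                  # layer-11 features, shape (frames, dim)

# Toy quantiser fit on one utterance (needs >= 500 frames); real quantisers are
# trained once on a large corpus and then reused.
kmeans = KMeans(n_clusters=500).fit(feats.numpy())
units = kmeans.predict(feats.numpy())                  # one discrete speech token per frame
print(units[:20])
```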
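The next-token-prediction stage can be sketched the same way, under stated assumptions: the Qwen2.5-0.5B base and the 500-unit vocabulary come from the card, while the optimiser settings, dummy batches, and sampling call below are illustrative rather than the slamkit training recipe.

```python
# Illustrative sketch only: adapt Qwen2.5-0.5B to a 500-unit speech-token vocabulary,
# take one next-token-prediction step on dummy data, then sample a continuation.
import torch
from transformers import AutoModelForCausalLM

N_UNITS = 500  # speech-token vocabulary size, per the model card

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model.resize_token_embeddings(N_UNITS)                 # swap the text vocab for speech units

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # placeholder hyper-parameters
input_ids = torch.randint(0, N_UNITS, (2, 256))        # dummy batch of speech-token sequences
loss = model(input_ids=input_ids, labels=input_ids).loss     # shifted next-token cross-entropy
loss.backward()
optimizer.step()

# Speech continuation = continuing the token sequence; vocoding the generated units
# back to a waveform is a separate step and is not shown here.
model.generation_config.eos_token_id = None            # old text eos id lies outside the new vocab
model.generation_config.pad_token_id = 0
prompt = torch.randint(0, N_UNITS, (1, 50))
continuation = model.generate(prompt, max_new_tokens=100, do_sample=True, top_k=50)
```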