Update pipeline tag, add project page and paper links
#1 opened by nielsr (HF Staff)

README.md CHANGED
````diff
@@ -1,24 +1,25 @@
 ---
+datasets:
+- LibriLight
 language:
 - en
 library_name: transformers
-
+license: apache-2.0
+pipeline_tag: audio-to-audio
 tags:
 - audio
 - speech
 - autoregressive
 - transformers
 - custom_code
-datasets:
-- LibriLight
-license: apache-2.0
 pretty_name: AuriStream1B
 ---
 
-
 # AuriStream-1B
 
-
+[📚 Paper](https://huggingface.co/papers/2508.11598) - [🌐 Project Page](https://tukoresearch.github.io/auristream-speech/)
+
+**AuriStream** is a biologically inspired, GPT-style autoregressive Transformer trained to predict tokens from the speech stream (denoted **cochlear tokens**). These cochlear tokens are discrete codes produced by a companion “WavCoch” tokenizer (a model trained to predict the time-frequency cochleagram from a waveform, with an LFQ bottleneck for token read-out). AuriStream uses a long context window (~20 s, ~4096 tokens) and is trained on **LibriLight (~60k hours)** for **500k steps**. It learns meaningful representations of, e.g., phoneme and word identity, and can predict future tokens to generate **speech continuations**. Inputs are cochlear **token IDs**; pair the model with a WavCoch tokenizer to convert audio into tokens.
 
 ---
 
@@ -126,8 +127,6 @@ with torch.no_grad():
     prompt_tokens, rollout_steps, temp=0.7, top_k=50, top_p=0.95, seed=0
 )
 full_tokens = torch.cat([prompt_tokens, pred_tokens], dim=1)  # (1, L+K)
-
-
 ```
 
 ## Architecture overview
@@ -152,5 +151,4 @@ If you use this model, please cite:
   doi = {10.21437/Interspeech.2025-2044},
   issn = {2958-1796}
 }
-```
-
+```
````
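For reference, the same front-matter change (adding `pipeline_tag: audio-to-audio` together with the `license` and `datasets` keys) can also be applied programmatically with the `metadata_update` helper from `huggingface_hub`. This is a minimal sketch; the repo id below is a placeholder, not the actual model repository.

```python
# Minimal sketch: push the same card-metadata update through the Hub API.
# NOTE: "your-org/AuriStream-1B" is a placeholder repo id.
from huggingface_hub import metadata_update

metadata = {
    "pipeline_tag": "audio-to-audio",  # the tag this PR sets
    "license": "apache-2.0",
    "datasets": ["LibriLight"],
}

# overwrite=True replaces existing values for these keys;
# create_pr=True opens a pull request instead of committing to main.
url = metadata_update(
    "your-org/AuriStream-1B",
    metadata,
    overwrite=True,
    create_pr=True,
)
print(url)
```

Opening the change as a pull request keeps it reviewable, which is how this update was proposed.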
|
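The usage snippet touched by the second hunk samples continuations with `temp=0.7, top_k=50, top_p=0.95, seed=0`. As a rough, generic illustration of what those knobs do to a single step's logits (not AuriStream's own rollout code, which lives in the repo's custom modeling files):

```python
# Generic temperature / top-k / top-p (nucleus) sampling over one step's logits.
# Illustrative only; AuriStream's actual generation helper is defined in the model repo.
import torch


def sample_next_token(logits, temp=0.7, top_k=50, top_p=0.95):
    # Temperature: lower values sharpen the distribution.
    logits = logits / temp

    # Top-k: drop everything below the k-th largest logit.
    if top_k > 0:
        kth_value = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    # Top-p: keep the smallest prefix of sorted tokens whose cumulative
    # probability exceeds top_p (always keeping the single best token).
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p
    remove[1:] = remove[:-1].clone()
    remove[0] = False
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)

    # Sample one token id from the filtered distribution.
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())


# Toy example over random logits; the fixed seed mirrors seed=0 in the snippet.
torch.manual_seed(0)
token_id = sample_next_token(torch.randn(1024))
print(token_id)
```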