Vikhrmodels/salt-qwen2.5-0.5b-asr-tts


ksych committed 27 days ago

Commit 44fce99 (verified) · 1 Parent(s): 7af7127

Create README.md

Files changed (1)

README.md +44 -0

README.md ADDED

	@@ -0,0 +1,44 @@

+---
+datasets:
+- openslr/librispeech_asr
+- amphion/Emilia-Dataset
+- its5Q/bigger-ru-book
+- mozilla-foundation/common_voice_12_0
+language:
+- en
+- ru
+- uk
+base_model:
+- Qwen/Qwen2.5-0.5B
+---
+#### **Model Performance Overview**
+**Metrics**:
+- **PESQ**: Perceptual Evaluation of Speech Quality (higher = better).
+- **STOI**: Short-Time Objective Intelligibility (closer to 1 = better).
+- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio (higher = better).
+| Model            | PESQ@200 | STOI@200 | SI-SDR@200 |
+|------------------|----------|----------|------------|
+| Fish-audio-1.5   | 1.20     | 0.16     | 23.0       |
+| **SALT-tts**     | 1.11     | 0.16     | 23.58      |
+| **SALT-tts+asr** | 1.09     | 0.18     | 23.09      |
+---
+#### **Our Solution**
+- **Method**: Extends a pre-trained LLM's vocabulary with audio tokens and fine-tunes the model on **TTS** and **ASR** tasks.
+- **Training**:
+  - BigCodec tokenizer (supports Slavic languages) for speech generation.
+  - SpeechTokenizer (semantic tokens only) for speech recognition.
+  - Training time: **168 H100 GPU hours**.
+- **Advantages**: A single LM loss covers both tasks, with minimal training overhead.
+---
+#### **Resources**
+- Code: [GitHub Repo](https://github.com/VikhrModels/Vikhr4o)
+---
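
The method summarized in the README above (extending an LLM vocabulary with audio tokens and training TTS and ASR with one language-modeling loss) might look roughly like the toy sketch below. It is not taken from the Vikhr4o code. The token names (`<|tts|>`, `<|asr|>`, `<|codec_i|>`, `<|sem_i|>`), the tiny vocabulary, the codebook size, and the sequence layout are all assumptions made for illustration. The only idea carried over from the README is that generation-side codec tokens and recognition-side semantic tokens become ordinary vocabulary entries, so both tasks reduce to next-token prediction over one sequence.

```python
# Toy sketch only: every name below is hypothetical, not the Vikhr4o API.
TEXT_VOCAB = ["<pad>", "<bos>", "<eos>", "hello", "world"]  # stand-in for the LLM vocabulary
NUM_CODES = 8  # assumed toy codebook size; real audio tokenizers use far more codes


def extend_vocab(text_vocab, num_codes):
    """Append task markers plus one new token per audio code to the text vocabulary."""
    task_markers = ["<|tts|>", "<|asr|>"]
    codec = [f"<|codec_{i}|>" for i in range(num_codes)]   # assumed: acoustic codes for TTS output
    semantic = [f"<|sem_{i}|>" for i in range(num_codes)]  # assumed: semantic codes for ASR input
    vocab = list(text_vocab) + task_markers + codec + semantic
    return {token: index for index, token in enumerate(vocab)}


def build_tts_example(words, codec_codes):
    """Text first, then the audio tokens the model should learn to generate."""
    audio = [f"<|codec_{c}|>" for c in codec_codes]
    return ["<bos>", "<|tts|>"] + words + audio + ["<eos>"]


def build_asr_example(semantic_codes, words):
    """Audio tokens first, then the transcript the model should learn to generate."""
    audio = [f"<|sem_{c}|>" for c in semantic_codes]
    return ["<bos>", "<|asr|>"] + audio + words + ["<eos>"]


def encode(tokens, vocab):
    """Map token strings to integer ids in the extended vocabulary."""
    return [vocab[token] for token in tokens]


vocab = extend_vocab(TEXT_VOCAB, NUM_CODES)
tts_ids = encode(build_tts_example(["hello", "world"], [3, 1, 7]), vocab)
asr_ids = encode(build_asr_example([3, 1, 7], ["hello", "world"]), vocab)

# Both id sequences can be fed to the same next-token (LM) loss.
print(len(vocab), tts_ids, asr_ids)
```

With a real Hugging Face model, the equivalent of `extend_vocab` would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. Whether the authors did it this way is not stated in the README.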