---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---

#### **Model Performance Overview**
**Metrics**:
- **PESQ**: Perceptual Evaluation of Speech Quality (higher = better).
- **STOI**: Short-Time Objective Intelligibility (closer to 1 = better).
- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio, in dB (higher = better).

| Model            | PESQ@200 | STOI@200 | SI-SDR@200 |
|------------------|----------|----------|------------|
| Fish-Audio-1.5   | 1.20     | 0.16     | 23.00      |
| **SALT-tts**     | 1.11     | 0.16     | 23.58      |
| **SALT-tts+asr** | 1.09     | 0.18     | 23.09      |

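Of the three metrics, SI-SDR has the simplest closed form. Below is a minimal NumPy sketch of how it is computed (illustrative only, not the project's evaluation code):

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB."""
    # Project the estimate onto the reference (optimal rescaling).
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    target = alpha * ref   # the part of `est` explained by `ref`
    noise = est - target   # everything else counts as distortion
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16_000)              # 1 s of audio at 16 kHz
est = ref + 0.1 * rng.standard_normal(16_000)  # noisy estimate

score = si_sdr(est, ref)                       # roughly 20 dB here

# Scale invariance: rescaling the estimate leaves the score unchanged.
assert np.isclose(si_sdr(3.0 * est, ref), score)
```

The projection step is what makes the metric scale-invariant: any global gain applied to the estimate cancels out.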
---

#### **Our Solution**
- **Method**: Extends a pre-trained LLM's vocabulary with audio tokens and fine-tunes it jointly on **TTS** and **ASR** tasks.
- **Training**:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: **168 H100 GPU hours**.
- **Advantages**: A single unified LM loss covers both tasks with minimal training overhead.
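The vocabulary-extension step can be illustrated with a plain NumPy sketch. The sizes below are assumptions for illustration (`n_audio_tokens` and the init scale are not taken from the released code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-trained text embedding table: (vocab_size, hidden_dim).
# Dimensions chosen to mirror a Qwen2.5-0.5B-sized model.
text_vocab, hidden_dim = 151_936, 896
text_emb = rng.normal(0.0, 0.02, size=(text_vocab, hidden_dim))

# Append freshly initialized rows for discrete audio codec tokens.
n_audio_tokens = 8_192  # illustrative codebook size
audio_emb = rng.normal(0.0, 0.02, size=(n_audio_tokens, hidden_dim))
extended_emb = np.concatenate([text_emb, audio_emb], axis=0)

# Text ids stay in [0, text_vocab); audio ids map to
# [text_vocab, text_vocab + n_audio_tokens). Because both share one
# table, a single next-token LM loss applies to mixed sequences.
assert extended_emb.shape == (text_vocab + n_audio_tokens, hidden_dim)
```

Because text and audio ids live in one shared vocabulary, TTS (text in, audio tokens out) and ASR (audio tokens in, text out) reduce to the same next-token prediction objective.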

---

#### **Resources**
- Code: [GitHub Repo](https://github.com/VikhrModels/Vikhr4o)

---