---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---
## Model Performance Overview
Metrics:
- PESQ: Perceptual Evaluation of Speech Quality (higher = better).
- STOI: Short-Time Objective Intelligibility (closer to 1 = better).
- SI-SDR: Scale-Invariant Signal-to-Distortion Ratio (higher = better).
| Model | PESQ@200 | STOI@200 | SI-SDR@200 |
|---|---|---|---|
| Fish-audio-1.5 | 1.20 | 0.16 | 23.0 |
| SALT-tts | 1.11 | 0.16 | 23.58 |
| SALT-tts+asr | 1.09 | 0.18 | 23.09 |
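For reference, here is a minimal sketch of how these three metrics can be computed for one synthesized utterance against its ground-truth recording. It assumes the third-party `pesq` and `pystoi` packages for PESQ and STOI; SI-SDR is implemented directly from its definition. The file paths and the 16 kHz sample rate are illustrative placeholders, not part of our evaluation pipeline.

```python
# Sketch: PESQ, STOI, and SI-SDR for a single utterance.
# Assumes `pip install pesq pystoi soundfile`; paths and sample rate are placeholders.
import numpy as np
import soundfile as sf
from pesq import pesq
from pystoi import stoi

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB (higher = better).

    Assumes both signals are the same length and sample rate.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

ref, sr = sf.read("reference.wav")   # ground-truth speech (placeholder path)
deg, _ = sf.read("generated.wav")    # model output, same length and rate

print("PESQ:  ", pesq(sr, ref, deg, "wb"))           # wideband mode expects 16 kHz audio
print("STOI:  ", stoi(ref, deg, sr, extended=False))  # in [0, 1], closer to 1 = better
print("SI-SDR:", si_sdr(ref, deg))                    # in dB, higher = better
```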
## Our Solution
- Method: extends a pre-trained LLM (Qwen2.5-0.5B) with discrete audio tokens and fine-tunes it jointly on TTS and ASR tasks (see the sketch after this list).
- Training:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: 168 H100 GPU hours.
- Advantages: a single language-modeling loss covers both tasks, with minimal training overhead.
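As a rough illustration of the method, the sketch below extends the Qwen2.5-0.5B vocabulary with new audio-token entries and renders TTS and ASR examples as ordinary token sequences, so one cross-entropy LM loss trains both tasks. The token names, the 8192-entry codebook size, and the prompt layout are assumptions chosen for illustration, not our exact training setup; the discrete codes themselves would come from BigCodec (TTS) or SpeechTokenizer (ASR) offline.

```python
# Sketch only: token naming, codebook size, and sequence format are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 1. Add one new token per discrete audio code (hypothetical names).
num_audio_codes = 8192  # assumed codebook size
audio_tokens = [f"<audio_{i}>" for i in range(num_audio_codes)]
task_tokens = ["<tts>", "<asr>", "<audio_start>", "<audio_end>"]
tokenizer.add_tokens(audio_tokens + task_tokens)
model.resize_token_embeddings(len(tokenizer))

# 2. Render both tasks as plain token sequences over the shared vocabulary.
def tts_example(text: str, codes: list[int]) -> str:
    audio = "".join(f"<audio_{c}>" for c in codes)
    return f"<tts>{text}<audio_start>{audio}<audio_end>"

def asr_example(codes: list[int], text: str) -> str:
    audio = "".join(f"<audio_{c}>" for c in codes)
    return f"<asr><audio_start>{audio}<audio_end>{text}"

# 3. One standard causal-LM loss over a mixed TTS/ASR batch.
batch = tokenizer([tts_example("hello world", [3, 14, 159]),
                   asr_example([3, 14, 159], "hello world")],
                  return_tensors="pt", padding=True)
# In real training, pad positions in the labels would be masked with -100.
out = model(**batch, labels=batch["input_ids"])
print(out.loss)
```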
## Resources
- Code: GitHub Repo