---
datasets:
  - openslr/librispeech_asr
  - amphion/Emilia-Dataset
  - its5Q/bigger-ru-book
  - mozilla-foundation/common_voice_12_0
language:
  - en
  - ru
  - uk
base_model:
  - Qwen/Qwen2.5-0.5B
---

## Model Performance Overview

**Metrics:**

- **PESQ**: Perceptual Evaluation of Speech Quality (higher is better).
- **STOI**: Short-Time Objective Intelligibility (closer to 1 is better).
- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio, in dB (higher is better).
| Model          | PESQ@200 | STOI@200 | SI-SDR@200 |
|----------------|----------|----------|------------|
| Fish-audio-1.5 | 1.20     | 0.16     | 23.00      |
| SALT-tts       | 1.11     | 0.16     | 23.58      |
| SALT-tts+asr   | 1.09     | 0.18     | 23.09      |
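
The metrics above can be reproduced with standard open-source implementations. Below is a minimal sketch, assuming the third-party `pesq` and `pystoi` packages (not part of this repo); SI-SDR is computed directly from its definition, and the function names are illustrative:

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi


def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))


def evaluate(reference: np.ndarray, estimate: np.ndarray, sr: int = 16000) -> dict:
    """Score one reference/generated waveform pair at sample rate `sr`."""
    return {
        "pesq": pesq(sr, reference, estimate, "wb"),  # wideband mode for 16 kHz
        "stoi": stoi(reference, estimate, sr),
        "si_sdr": si_sdr(reference, estimate),
    }
```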

## Our Solution

- **Method**: Extends a pre-trained LLM with audio tokens and fine-tunes it jointly on TTS and ASR tasks (see the sketch below).
- **Training**:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: 168 H100 GPU hours.
- **Advantages**: A single unified LM loss covers both tasks, with minimal training overhead.
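
As a rough illustration of the vocabulary-extension step, here is a minimal sketch assuming the Hugging Face `transformers` API; the audio token count and naming scheme are hypothetical, not this model's actual configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Hypothetical codebook size: one new token per codec unit.
num_audio_tokens = 8192
audio_tokens = [f"<|audio_{i}|>" for i in range(num_audio_tokens)]
tokenizer.add_tokens(audio_tokens)

# Grow the embedding table (and tied LM head) to cover the new tokens.
model.resize_token_embeddings(len(tokenizer))

# Fine-tuning then uses the ordinary causal-LM loss on mixed sequences,
# e.g. text prompt -> audio tokens for TTS and audio tokens -> text for
# ASR, so one loss serves both tasks.
```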

## Resources