---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---

#### **Model Performance Overview**

**Metrics**:

- **PESQ**: Perceptual Evaluation of Speech Quality (higher = better).
- **STOI**: Short-Time Objective Intelligibility (closer to 1 = better).
- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio, in dB (higher = better).

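As a point of reference for the third metric, here is a minimal NumPy sketch of the standard SI-SDR computation; the function below is illustrative and is not taken from the SALT evaluation code:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB (higher = better)."""
    # Remove DC offset so the metric is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate (projection).
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference   # scaled target component
    noise = estimate - target    # residual distortion
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))
```
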
| Model | PESQ@200 | STOI@200 | SI-SDR@200 |
|-------|----------|----------|------------|
| Fish-audio-1.5 | 1.20 | 0.16 | 23.0 |
| [**SALT-tts**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-tts) | 1.11 | 0.16 | 23.58 |
| [**SALT-tts+asr**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-asr-tts) | 1.09 | 0.18 | 23.09 |

---

#### **Our Solution**

- **Method**: Extends a pre-trained LLM with audio tokens and fine-tunes it on **TTS** and **ASR** tasks (see the sketch after this list).
- **Training**:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: **168 H100 GPU hours**.
- **Advantages**: A single unified LM loss covers both tasks, with minimal training overhead.
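
The core mechanism in code: the tokenizer vocabulary is extended with discrete audio tokens and the embedding matrix is resized, after which training uses the ordinary next-token LM loss. Below is a minimal sketch with Hugging Face Transformers; the token naming scheme and codebook size are illustrative assumptions, not the exact SALT configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Hypothetical scheme: one special token per codec codebook entry.
# The real SALT token format and codebook size may differ.
num_codes = 8192  # assumed codec codebook size
audio_tokens = [f"<|audio_{i}|>" for i in range(num_codes)]
tokenizer.add_special_tokens({"additional_special_tokens": audio_tokens})

# Give the new audio tokens trainable embedding rows; fine-tuning then
# proceeds with the standard LM loss over interleaved text/audio
# sequences for both the TTS and ASR directions.
model.resize_token_embeddings(len(tokenizer))
```
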

---

#### **Resources**

- Code: [GitHub Repo](https://github.com/VikhrModels/Vikhr4o)

---