VoxPolska Auralis: Bringing Polish Speech to Life with Cutting-Edge TTS

image/png

πŸ“Œ Model Highlights

  • Context-Aware Voice: Generates speech that captures the nuances and tone of the Polish language.
  • Showcases advanced proficiency in speech synthesis and Polish language processing.
  • Converts written Polish text into natural, fluent, and expressive speech.
  • Advanced Deep Learning: Built using cutting-edge deep learning techniques for optimal performance.
  • State-of-the-Art Technology: Showcases advanced proficiency in speech synthesis and Polish language processing.

πŸ”§ Technical Details

  • Base Model: Llasa TTS
  • Optimized Fine-Tuning: Employs LoRA (Low-Rank Adaptation) with high-rank setting for efficient and scalable adaptation.
  • High-Fidelity Audio: Produces clear 16 kHz sample rate audio, delivering crisp and realistic voice output.
  • Dataset: Trained with 24000+ Polish transcript and audio pairs
  • Merge: Merged 16 bit quantization
  • Audio Decoding: Customized layer-wise audio generation pipeline
  • Repetition Penalty: 1.1 to avoid repetitive phrases
  • Robust Deep Learning: Integrates advanced training optimizations including gradient checkpointing and mixed precision training.

🎧 Example Usage (Pipeline)

  • Here is an exampe code cell to run the model on a notebook:
!pip install transformers ipython
from transformers import pipeline
from IPython.display import Audio
pipe = pipeline("text-to-speech", model="salihfurkaan/VoxPolska-Auralis")
output = pipe("CzeΕ›Δ‡, jestem modelem sztucznej inteligencji mΓ³wiΔ…cym po polsku")
Audio(output["audio"], rate=output["sampling_rate"])

🎧 Example Usage (Directly)

  • Here is an exampe code cell to run the model on a notebook:
!pip install --no-deps unsloth==2025.4.1 bitsandbytes unsloth_zoo trl==0.15.2
!pip install xcodec2==0.1.5 --no-deps
!pip install vector_quantize_pytorch

from unsloth import FastLanguageModel
import torch
from xcodec2.modeling_xcodec2 import XCodec2Model
import torchaudio
import soundfile as sf
from IPython.display import display, Audio
from transformers import AutoTokenizer, AutoModelForCausalLM

input_text = "CzeΕ›Δ‡, jestem modelem sztucznej inteligencji mΓ³wiΔ…cym po polsku."
XCODEC2_MODEL_NAME = "HKUST-Audio/xcodec2"
SAMPLE_RATE = 16000
device = "cuda" if torch.cuda.is_available() else "cpu"

codec_model = XCodec2Model.from_pretrained(XCODEC2_MODEL_NAME)
codec_model = codec_model.to(device).eval()
codec_model.to('cpu')


tokenizer = AutoTokenizer.from_pretrained("salihfurkaan/VoxPolska-Auralis")
model = AutoModelForCausalLM.from_pretrained("salihfurkaan/VoxPolska-Auralis")

FastLanguageModel.for_inference(model)

def ids_to_speech_tokens(speech_ids):

   speech_tokens_str = []
   for speech_id in speech_ids:
       speech_tokens_str.append(f"<|s_{speech_id}|>")
   return speech_tokens_str

def extract_speech_ids(speech_tokens_str):

   speech_ids = []
   for token_str in speech_tokens_str:
       if token_str.startswith('<|s_') and token_str.endswith('|>'):
           num_str = token_str[4:-2]

           num = int(num_str)
           speech_ids.append(num)
       else:
           print(f"Unexpected token: {token_str}")
   return speech_ids


with torch.inference_mode():
   with torch.amp.autocast(device,dtype=model.dtype):
       formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

       chat = [
           {"role": "user", "content": "Convert the text to speech:" + formatted_text},
           {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
       ]

       input_ids = tokenizer.apply_chat_template(
           chat,
           tokenize=True,
           return_tensors='pt',
           continue_final_message=True
       )

       speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

       # Generate the speech autoregressively
       outputs = model.generate(
           input_ids,
           max_length=2048, 
           eos_token_id= speech_end_id ,
           do_sample=True,
           top_p=1.2,           #  Adjusts the diversity of generated content
           temperature=1.2,   #  Controls randomness in output
       )

   generated_ids = outputs[0][input_ids.shape[1]:-1]
   speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
   speech_tokens = extract_speech_ids(speech_tokens)
   speech_tokens = torch.tensor(speech_tokens).cpu().unsqueeze(0).unsqueeze(0)
   gen_wav = codec_model.decode_code(speech_tokens)

sf.write("output.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)

display(Audio(gen_wav[0, 0, :].cpu().numpy(), rate=16000))

You can get your huggingface token from here

πŸ“« Contact and Support

For questions, suggestions, and feedback, please open an issue on HuggingFace. You can also reach the author via: LinkedIn

Model Misuse

Do not use this model for impersonation without consent, misinformation or deception (including fake news or fraudulent calls), or any illegal or harmful activity. By using this model, you agree to follow all applicable laws and ethical guidelines.

Citation

@misc{
  title={salihfurkaan/VoxPolska-Auralis},
  author={Salih Furkan Erik},
  year={2025},
  url={https://huggingface.co/salihfurkaan/VoxPolska-Auralis/}
}
Downloads last month
238
Safetensors
Model size
1.37B params
Tensor type
F16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for salihfurkaan/VoxPolska-Auralis

Finetuned
unsloth/Llasa-1B
Finetuned
(1)
this model

Dataset used to train salihfurkaan/VoxPolska-Auralis

Collection including salihfurkaan/VoxPolska-Auralis