VoxPolska: Next-Gen Polish Voice Generation

📌 Model Highlights

Context-Aware Voice: Generates speech that captures the nuances and tone of the Polish language.
Natural Speech: Converts written Polish text into natural, fluent, and expressive speech.
Deep Learning: Fine-tuned from a pretrained TTS model with LoRA (see Technical Details).
State-of-the-Art Technology: Showcases advanced proficiency in speech synthesis and Polish language processing.

🔧 Technical Details

Base Model: Orpheus TTS
Fine-Tuning: LoRA (Low-Rank Adaptation) applied to optimize model performance.
Sample Rate: 24 kHz audio output for high-fidelity sound.
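Training Data: 24,000+ Polish transcript and audio pairs.
Merged Weights: LoRA weights merged into the base model and stored in 16-bit precision.
Audio Decoding: Generated tokens are regrouped into three code layers and decoded to a waveform with the SNAC 24 kHz codec (hubertsiuzdak/snac_24khz).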
Repetition Penalty: 1.1 to avoid repetitive phrases
Gradient Checkpointing: Enabled during training to reduce memory usage.
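
To make the repetition penalty concrete, here is a minimal pure-Python sketch of the rule that Hugging Face `transformers` applies for `repetition_penalty`: the logit of every token already present in the sequence is divided by the penalty when it is positive and multiplied by the penalty when it is negative. The function name and the toy logits are illustrative assumptions and are not part of VoxPolska.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.1):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Positive logits shrink toward zero; negative logits move further below zero.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Hypothetical scores for a 4-token vocabulary; tokens 0 and 1 were already generated.
toy_logits = [2.2, -1.0, 0.5, 3.3]
print(apply_repetition_penalty(toy_logits, seen_token_ids=[0, 1, 0]))
```

With a penalty of 1.1 the effect on each step is small, which is why it discourages loops without noticeably distorting the sampled speech tokens.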

🎧 Example Usage (Pipeline)

Here is an example code snippet to run the model in a notebook:

!pip install transformers
from transformers import pipeline
pipe = pipeline("text-to-speech", model="salihfurkaan/VoxPolska-V1-Merged-16bit")

🎧 Example Usage (Directly)

Here is example code to run the model in a notebook:

  !pip install snac torch transformers

  import os
  
  import torch
  from snac import SNAC
  from transformers import AutoTokenizer, AutoModelForCausalLM
  from IPython.display import display, Audio
  
  # Set the token before any download so that from_pretrained can pick it up.
  os.environ["HF_TOKEN"] = "your huggingface token here"
  
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  
  tokenizer = AutoTokenizer.from_pretrained("salihfurkaan/VoxPolska-V1-Merged-16bit")
  model = AutoModelForCausalLM.from_pretrained("salihfurkaan/VoxPolska-V1-Merged-16bit").to(device)
  
  # SNAC codec that turns the generated audio codes back into a 24 kHz waveform.
  snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device)
  
  prompts = [
      "Cześć, jestem dużym modelem języka sztucznej inteligencji"
  ]  # an example prompt
  chosen_voice = None
  
  prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
  all_input_ids = []
  for prompt in prompts_:
      input_ids = tokenizer(prompt, return_tensors="pt").input_ids
      all_input_ids.append(input_ids)
  
  start_token = torch.tensor([[128259]], dtype=torch.int64)  # Start of human
  end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # End of text, End of human
  
  all_modified_input_ids = []
  for input_ids in all_input_ids:
      modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
      all_modified_input_ids.append(modified_input_ids)
  
  all_padded_tensors = []
  all_attention_masks = []
  max_length = max([x.shape[1] for x in all_modified_input_ids])
  for modified_input_ids in all_modified_input_ids:
      padding = max_length - modified_input_ids.shape[1]
      padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
      attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
      all_padded_tensors.append(padded_tensor)
      all_attention_masks.append(attention_mask)
  
  all_padded_tensors = torch.cat(all_padded_tensors, dim=0).to(device)
  all_attention_masks = torch.cat(all_attention_masks, dim=0).to(device)
  
  generated_ids = model.generate(
      input_ids=all_padded_tensors,
      attention_mask=all_attention_masks,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=128258,
      use_cache=True
  )
  
  token_to_find = 128257
  token_to_remove = 128258
  token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
  
  if len(token_indices[1]) > 0:
      last_occurrence_idx = token_indices[1][-1].item()
      cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
  else:
      cropped_tensor = generated_ids
  
  processed_rows = []
  for row in cropped_tensor:
      masked_row = row[row != token_to_remove]
      processed_rows.append(masked_row)
  
  code_lists = []
  for row in processed_rows:
      # Keep only whole 7-token frames, then shift token IDs down to audio-code values.
      row_length = row.size(0)
      new_length = (row_length // 7) * 7
      trimmed_row = row[:new_length]
      code_lists.append((trimmed_row - 128266).tolist())
  
  def redistribute_codes(code_list):
      """Split a flat list of 7-code frames into the three SNAC layers and decode it."""
      layer_1 = []
      layer_2 = []
      layer_3 = []
      for i in range(len(code_list) // 7):
          # Each position in a frame carries its own offset of 4096 * position.
          layer_1.append(code_list[7 * i])
          layer_2.append(code_list[7 * i + 1] - 4096)
          layer_3.append(code_list[7 * i + 2] - (2 * 4096))
          layer_3.append(code_list[7 * i + 3] - (3 * 4096))
          layer_2.append(code_list[7 * i + 4] - (4 * 4096))
          layer_3.append(code_list[7 * i + 5] - (5 * 4096))
          layer_3.append(code_list[7 * i + 6] - (6 * 4096))
      
      codes = [
          torch.tensor(layer_1).unsqueeze(0).to(device),
          torch.tensor(layer_2).unsqueeze(0).to(device),
          torch.tensor(layer_3).unsqueeze(0).to(device)
      ]
      # Decoding needs no gradients.
      with torch.inference_mode():
          audio_hat = snac_model.decode(codes)
      return audio_hat
      
  my_samples = []
  for code_list in code_lists:
      samples = redistribute_codes(code_list)
      my_samples.append(samples)
  
  if len(prompts) != len(my_samples):
      raise Exception("Number of prompts and samples do not match")
  else:
      for i in range(len(my_samples)):
          print(prompts[i])
          samples = my_samples[i]
          display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
  
  del my_samples, samples
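
The `redistribute_codes` function above assumes that every generated token is a valid audio code. As a sketch of the same 7-token frame layout in plain Python (no tensors), the hypothetical helper below splits a flat code list into the three layers and rejects values outside the 4096-entry range implied by the offsets used above. The helper name, constants, and error handling are illustrative assumptions and not part of the model's API.

```python
CODEBOOK_SIZE = 4096  # per-position offset used by redistribute_codes above
FRAME_SIZE = 7        # codes per frame: 1 for layer 1, 2 for layer 2, 4 for layer 3

# Layer index (0, 1, or 2) for each of the 7 positions inside a frame.
POSITION_TO_LAYER = (0, 1, 2, 2, 1, 2, 2)


def split_into_layers(code_list):
    """De-interleave a flat list of codes into three per-layer lists."""
    if len(code_list) % FRAME_SIZE != 0:
        raise ValueError("code_list length must be a multiple of 7")
    layers = ([], [], [])
    for index, code in enumerate(code_list):
        position = index % FRAME_SIZE
        value = code - position * CODEBOOK_SIZE  # undo the per-position offset
        if not 0 <= value < CODEBOOK_SIZE:
            raise ValueError(f"code {code} at index {index} is not a valid audio code")
        layers[POSITION_TO_LAYER[position]].append(value)
    return layers


# One hypothetical frame: position p holds the value p + 1 plus its offset.
frame = [p * CODEBOOK_SIZE + (p + 1) for p in range(FRAME_SIZE)]
print(split_into_layers(frame))
```

Running a check like this before calling `snac_model.decode` can turn an opaque indexing error into a clear message when the model emits a non-audio token in the middle of a sequence.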

You can create a Hugging Face access token under Settings > Access Tokens in your Hugging Face account.
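
Outside a notebook, `display(Audio(...))` is not available. A minimal sketch for writing the samples to a 24 kHz WAV file with only the Python standard library follows; the function name and the output file name are hypothetical, and the input is assumed to be a flat sequence of floats in the range [-1.0, 1.0].

```python
import struct
import wave


def save_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))


# Hypothetical stand-in for model output: one second of silence.
save_wav("voxpolska_sample.wav", [0.0] * 24000)
```

With the example above, a call such as `save_wav("out.wav", samples.detach().squeeze().to("cpu").tolist())` should produce a playable file, assuming `samples` holds a single decoded waveform.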

📫 Contact and Support

For questions, suggestions, and feedback, please open an issue on Hugging Face. You can also reach the author via LinkedIn.

Model Misuse

Do not use this model for impersonation without consent, misinformation or deception (including fake news or fraudulent calls), or any illegal or harmful activity. By using this model, you agree to follow all applicable laws and ethical guidelines.

Citation

@misc{erik2025voxpolska,
  title={salihfurkaan/VoxPolska-V1-Merged-16bit},
  author={Salih Furkan Erik},
  year={2025},
  url={https://huggingface.co/salihfurkaan/VoxPolska-V1-Merged-16bit/}
}
