license: cc-by-nc-4.0
datasets:
  - OOPPEENN/56697375616C4E6F76656C5F44617461736574
  - amphion/Emilia-Dataset
  - OmniAICreator/ASMR-Archive-Processed
language:
  - ja
base_model:
  - HKUSTAudio/xcodec2
pipeline_tag: audio-to-audio
tags:
  - audio-to-audio
  - speech
  - not-for-all-audiences

Anime‑XCodec2: Japanese Fine‑Tuned Variant of XCodec2

License: CC BY‑NC 4.0

TL;DR: Anime‑XCodec2 is a fine‑tuned variant of HKUSTAudio/xcodec2, trained on ~25k hours of Japanese anime/game‑style voices.

Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).



1) Model Summary

  • What it is: A neural speech codec / speech tokenizer model based on XCodec2 with a decoder fine‑tuned for Japanese speech, particularly anime/game‑style voices.
  • Training scope: Decoder‑only fine‑tuning on ~25,000 hours of Japanese data; encoder and codebook are frozen.
  • Compatibility: Because the encoder and codebook are unchanged, the speech tokens produced at encode time are identical to those of the original XCodec2. Any downstream model that expects XCodec2 codes (e.g., Llasa) can use Anime‑XCodec2 as a drop‑in decoder.
  • Sampling rate: 16 kHz (XCodec2 operates at 16 kHz).
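A minimal sketch of the drop‑in decoder idea, assuming the `xcodec2` Python package and the `XCodec2Model` interface (`from_pretrained`, `encode_code`, `decode_code`) shown in the upstream HKUSTAudio/xcodec2 card. It also assumes this checkpoint loads through the same class. `load_model` and `cross_decode` are illustrative helper names, not part of any library, and the repo id for this model is left as a placeholder.

```python
def load_model(repo_id):
    """Load an XCodec2-style checkpoint (assumes `pip install xcodec2`)."""
    # Imported lazily so the rest of this sketch has no heavy dependencies.
    from xcodec2.modeling_xcodec2 import XCodec2Model
    return XCodec2Model.from_pretrained(repo_id).eval()


def cross_decode(encoder_model, decoder_model, wav_16k):
    """Encode with one model, then decode the same codes with another.

    Works with any two objects exposing `encode_code(input_waveform=...)`
    and `decode_code(codes)`. Since only the decoder was fine-tuned, the
    codes from either model should be interchangeable.
    """
    codes = encoder_model.encode_code(input_waveform=wav_16k)
    audio = decoder_model.decode_code(codes)
    return codes, audio


# Usage sketch (not executed here; ANIME_REPO is a placeholder):
#   import torch
#   base = load_model("HKUSTAudio/xcodec2")
#   anime = load_model(ANIME_REPO)
#   with torch.no_grad():
#       # wav: float tensor of shape (1, T), sampled at 16 kHz
#       codes, audio = cross_decode(base, anime, wav)
```

In a token‑generation pipeline, `codes` would come from the upstream generator instead of `encode_code`, and only the `decode_code` call on the fine‑tuned model would be needed.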

2) Intended Use

  • Decode XCodec2 speech tokens (e.g., from Llasa or other AR token generators trained on XCodec2 codes) into Japanese speech with improved naturalness for anime/game‑style voices.
  • Reconstruct Japanese speech from XCodec2 tokens when analyzing or building Japanese‑focused speech pipelines.

3) Limitations & Trade‑offs

  • Language scope: Optimized for Japanese. Quality on other languages may be worse than with the baseline XCodec2 decoder.
  • Sampling rate: 16 kHz only (resample inputs to 16 kHz before encoding; decode assumes 16 kHz).
  • Content domain: Tuned toward anime/game‑style voices; out‑of‑domain speech may not benefit.
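The 16 kHz requirement means that common 48 kHz or 44.1 kHz recordings must be resampled before encoding. The sketch below works out the integer up/down factors for a rational resampler. The helper names are illustrative, and `resample_to_16k` assumes SciPy is installed and uses `scipy.signal.resample_poly`; other resamplers would work equally well.

```python
from fractions import Fraction

TARGET_SR = 16_000  # XCodec2 operates at 16 kHz


def resample_factors(source_sr, target_sr=TARGET_SR):
    """Return (up, down) so that source_sr * up / down == target_sr."""
    ratio = Fraction(target_sr, source_sr)  # reduced automatically
    return ratio.numerator, ratio.denominator


def resampled_length(num_samples, source_sr, target_sr=TARGET_SR):
    """Number of output samples after resampling (rounded up)."""
    up, down = resample_factors(source_sr, target_sr)
    return -(-num_samples * up // down)  # ceiling division


def resample_to_16k(wav, source_sr):
    """Resample a 1-D array to 16 kHz (assumes SciPy is available)."""
    from scipy.signal import resample_poly  # lazy import, optional dependency
    up, down = resample_factors(source_sr)
    return resample_poly(wav, up, down)


print(resample_factors(48_000))  # (1, 3)
print(resample_factors(44_100))  # (160, 441)
```

One second of 44.1 kHz audio (44,100 samples) therefore maps to 16,000 samples at the codec's rate.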

4) Data (High‑Level)

  • ~25,000 hours of Japanese speech, with a focus on anime/game‑style voices (acting, character voices, etc.).
  • Data preparation included resampling to 16 kHz and standard loudness/peak checks where appropriate.
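The card does not specify which loudness and peak checks were used. As a purely illustrative sketch, a clip could be screened by its peak level and its RMS level (a crude stand‑in for loudness). The function names and both thresholds below are assumptions, not the values used for this model.

```python
import math


def peak_dbfs(samples):
    """Peak level in dBFS for float samples in [-1.0, 1.0]."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")


def rms_dbfs(samples):
    """RMS level in dBFS, used here as a rough loudness proxy."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def passes_level_checks(samples, max_peak_db=-0.1, min_rms_db=-40.0):
    """Reject clips that reach full scale or are nearly silent.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return peak_dbfs(samples) <= max_peak_db and rms_dbfs(samples) >= min_rms_db
```

A production pipeline would more likely measure integrated loudness with a dedicated meter; this version only shows the shape of such a filter.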

5) Training Procedure (High‑Level)

  • Updated (fine‑tuned): generator.backbone, generator.head, fc_post_a
  • Frozen: all other components, including the encoder and codebook

Goal: preserve token compatibility with HKUSTAudio/xcodec2 while improving reconstruction quality for Japanese anime/game‑style speech.
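The updated/frozen split above can be expressed as a name‑prefix filter over a model's parameters. This is a sketch of the general technique, not the training code used for this model. `freeze_all_but` is a hypothetical helper that works with any object exposing `named_parameters()` (such as a PyTorch `nn.Module`), and the demo parameter names are made up for illustration.

```python
TRAINABLE_PREFIXES = ("generator.backbone", "generator.head", "fc_post_a")


def is_trainable(param_name, prefixes=TRAINABLE_PREFIXES):
    """True if the parameter belongs to one of the fine-tuned submodules."""
    return any(
        param_name == prefix or param_name.startswith(prefix + ".")
        for prefix in prefixes
    )


def freeze_all_but(model, prefixes=TRAINABLE_PREFIXES):
    """Set requires_grad per parameter and return the trainable names."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Illustrative parameter names only (not read from the real checkpoint).
demo_names = [
    "generator.backbone.layers.0.weight",
    "generator.head.out.weight",
    "fc_post_a.weight",
    "encoder.conv.weight",
]
print([n for n in demo_names if is_trainable(n)])
```

Matching on `prefix + "."` rather than on the bare prefix avoids accidentally unfreezing a sibling module whose name merely starts with the same characters.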


6) Samples (A/B Listening)

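ID | Original (reference) | Baseline reconstruction (HKUSTAudio/xcodec2) | Anime‑XCodec2 reconstruction (this model)
1 | [audio] | [audio] | [audio]
2 | [audio] | [audio] | [audio]
3 | [audio] | [audio] | [audio]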

Note: The original audio is at 48 kHz or 44.1 kHz, while the reconstructed audio is at 16 kHz.

These samples come from NandemoGHS/Japanese-Eroge-Voice and were not included in the training or validation data.


7) License

This model is released under CC BY‑NC 4.0.

8) Acknowledgements

  • Original: HKUSTAudio/xcodec2
  • Thanks to contributors and the community around Japanese speech resources.