metadata
license: cc-by-nc-4.0
datasets:
- OOPPEENN/56697375616C4E6F76656C5F44617461736574
- amphion/Emilia-Dataset
- OmniAICreator/ASMR-Archive-Processed
language:
- ja
base_model:
- HKUSTAudio/xcodec2
pipeline_tag: audio-to-audio
tags:
- audio-to-audio
- speech
- not-for-all-audiences
Anime‑XCodec2: Japanese Fine‑Tuned Variant of XCodec2
TL;DR: Anime‑XCodec2 is a fine‑tuned variant of HKUSTAudio/xcodec2, trained on ~25k hours of Japanese anime/game‑style voices.
Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).
🔗 Quick Links
- Demo (Gradio / Hugging Face Spaces): https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo
- Baseline model (pretrained): HKUSTAudio/xcodec2
- This repository (fine‑tuned): NandemoGHS/Anime‑XCodec2
- Training Logs (Weights & Biases): View Report
1) Model Summary
- What it is: A neural speech codec / speech tokenizer model based on XCodec2 with a decoder fine‑tuned for Japanese speech, particularly anime/game‑style voices.
- Training scope: Decoder‑only fine‑tuning on ~25,000 hours of Japanese data; encoder and codebook are frozen.
- Compatibility: Because the encoder and codebook are unchanged, the speech tokens produced at encode time are identical to those of the original XCodec2. Any downstream model that expects XCodec2 codes (e.g., Llasa) can use Anime‑XCodec2 as a drop‑in decoder.
- Sampling rate: 16 kHz (XCodec2 operates at 16 kHz).
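The drop‑in claim above can be sketched as follows: encode with the baseline checkpoint, then decode the same codes with the fine‑tuned one. This is a minimal sketch, not a verified recipe. It assumes the `xcodec2` Python package and its `XCodec2Model.encode_code` / `decode_code` methods as used on the baseline model card, that the fine‑tuned checkpoint loads through the same class, and that the repository id is spelled with plain ASCII hyphens. `swap_decoder` and `demo` are hypothetical helper names, and the file paths are placeholders.

```python
BASELINE_REPO = "HKUSTAudio/xcodec2"
# Assumption: the repository id uses plain ASCII hyphens.
ANIME_REPO = "NandemoGHS/Anime-XCodec2"


def load_codec(repo_id):
    """Load an XCodec2-style checkpoint in eval mode.

    The import is deferred so this file can be imported even when the
    xcodec2 package is not installed.
    """
    from xcodec2.modeling_xcodec2 import XCodec2Model

    return XCodec2Model.from_pretrained(repo_id).eval()


def swap_decoder(wav_16k, token_source, decoder):
    """Encode with one checkpoint and decode the same codes with another."""
    codes = token_source.encode_code(input_waveform=wav_16k)
    recon = decoder.decode_code(codes)
    return codes, recon


def demo(path_in, path_out):
    """Hypothetical end-to-end run on a mono 16 kHz file."""
    import soundfile as sf
    import torch

    wav, sr = sf.read(path_in)
    if sr != 16000:
        raise ValueError(f"expected 16 kHz input, got {sr} Hz")
    wav_t = torch.from_numpy(wav).float().unsqueeze(0)  # shape (1, T)
    baseline = load_codec(BASELINE_REPO)
    anime = load_codec(ANIME_REPO)
    with torch.no_grad():
        _, recon = swap_decoder(wav_t, baseline, anime)
    # Assumption: decode_code returns shape (1, 1, T'), as on the baseline card.
    sf.write(path_out, recon[0, 0, :].cpu().numpy(), 16000)
```

Because the encoder and codebook are frozen, `token_source` could equally be the fine‑tuned checkpoint itself, or any generator that emits XCodec2 codes.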
2) Intended Use
- Decode XCodec2 speech tokens (e.g., from Llasa or other AR token generators trained on XCodec2 codes) into Japanese speech with improved naturalness for anime/game‑style voices.
- Reconstruct Japanese speech from XCodec2 tokens when analyzing or building Japanese‑focused speech pipelines.
3) Limitations & Trade‑offs
- Language scope: Optimized for Japanese. Quality on other languages may be worse than with the baseline XCodec2.
- Sampling rate: 16 kHz only (resample inputs to 16 kHz before encoding; decode assumes 16 kHz).
- Content domain: Tuned toward anime/game‑style voices; out‑of‑domain speech may not benefit.
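For the 16 kHz requirement above, one way to resample is a polyphase filter with integer up/down factors. The sketch below assumes SciPy is available and uses `scipy.signal.resample_poly`. The helper names `resample_ratio` and `to_16k` are hypothetical, and nothing here is claimed to match the resampler used by the author.

```python
from math import gcd

TARGET_SR = 16000  # XCodec2 operates at 16 kHz


def resample_ratio(src_sr, dst_sr=TARGET_SR):
    """Return the reduced integer (up, down) factors for src_sr -> dst_sr."""
    g = gcd(src_sr, dst_sr)
    return dst_sr // g, src_sr // g


def to_16k(wav, src_sr):
    """Resample a 1-D float array to 16 kHz; pass 16 kHz input through unchanged."""
    if src_sr == TARGET_SR:
        return wav
    from scipy.signal import resample_poly  # deferred third-party import

    up, down = resample_ratio(src_sr)
    return resample_poly(wav, up, down)
```

For the two common source rates, `resample_ratio(48000)` gives `(1, 3)` and `resample_ratio(44100)` gives `(160, 441)`.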
4) Data (High‑Level)
- ~25,000 hours of Japanese speech, with a focus on anime/game‑style voices (acting, character voices, etc.).
- Data preparation included resampling to 16 kHz and standard loudness/peak checks where appropriate.
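The card does not say which loudness or peak checks were used. As an illustration only, a simple gate on peak level and RMS level for float samples in [-1, 1] might look like the sketch below. The function names and both thresholds are hypothetical, and RMS is a crude stand‑in for a proper loudness measure.

```python
import math


def peak_dbfs(samples):
    """Peak level in dBFS for float samples in [-1, 1]; -inf for silence."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20.0 * math.log10(peak) if peak > 0 else float("-inf")


def rms_dbfs(samples):
    """RMS level in dBFS; -inf for empty or all-zero input."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def passes_checks(samples, max_peak_db=-0.1, min_rms_db=-40.0):
    """Reject clips that sit at full scale or are nearly silent.

    Both thresholds are hypothetical defaults, not values from the card.
    """
    return peak_dbfs(samples) <= max_peak_db and rms_dbfs(samples) >= min_rms_db
```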
5) Training Procedure (High‑Level)
- Updated (fine‑tuned): generator.backbone, generator.head, fc_post_a
- Frozen: all other non‑listed components
Goal: preserve token compatibility with HKUSTAudio/xcodec2 while improving reconstruction quality for Japanese anime/game‑style speech.
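The freeze rule above can be expressed as a prefix match on parameter names. This is a minimal sketch, assuming a PyTorch‑style module whose `named_parameters()` yields dotted names beginning with the three component names listed. It is not the author's training code, and `is_trainable` and `apply_freeze` are hypothetical helpers.

```python
# Components the card lists as updated; everything else stays frozen.
TRAINABLE_PREFIXES = ("generator.backbone", "generator.head", "fc_post_a")


def is_trainable(param_name):
    """True if the name is one of the listed components or lies under one."""
    return any(
        param_name == prefix or param_name.startswith(prefix + ".")
        for prefix in TRAINABLE_PREFIXES
    )


def apply_freeze(named_parameters):
    """Set requires_grad on each parameter; return the names left trainable."""
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = is_trainable(name)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a loaded model this would be called as `apply_freeze(model.named_parameters())`, and only the parameters with `requires_grad` set would be handed to the optimizer.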
6) Samples (A/B Listening)
ID | Original (reference) | Baseline Reconstruct (HKUSTAudio/xcodec2) | Anime‑XCodec2 Reconstruct (this model) |
---|---|---|---|
1 | | | |
2 | | | |
3 | | | |
Note: the original audio is 48 kHz or 44.1 kHz, while the reconstructed audio is 16 kHz.
These samples come from NandemoGHS/Japanese-Eroge-Voice and were not included in the training or validation data.
7) License
- CC‑BY‑NC 4.0 (same as the original XCodec2 license).
- See: https://creativecommons.org/licenses/by-nc/4.0/
8) Acknowledgements
- Original: HKUSTAudio/xcodec2
- Thanks to contributors and the community around Japanese speech resources.