metadata

license: cc-by-nc-4.0
datasets:
  - OOPPEENN/56697375616C4E6F76656C5F44617461736574
  - amphion/Emilia-Dataset
  - OmniAICreator/ASMR-Archive-Processed
language:
  - ja
base_model:
  - HKUSTAudio/xcodec2
pipeline_tag: audio-to-audio
tags:
  - audio-to-audio
  - speech
  - not-for-all-audiences

Anime‑XCodec2: Japanese Fine‑Tuned Variant of XCodec2

TL;DR: Anime‑XCodec2 is a fine‑tuned variant of HKUSTAudio/xcodec2, trained on ~25k hours of Japanese anime/game‑style voices.

Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).

🔗 Quick Links

Demo (Gradio / Hugging Face Spaces): https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo
Baseline model (pretrained): HKUSTAudio/xcodec2
This repository (fine‑tuned): NandemoGHS/Anime‑XCodec2
Training Logs (Weights & Biases): View Report

1) Model Summary

What it is: A neural speech codec / speech tokenizer model based on XCodec2 with a decoder fine‑tuned for Japanese speech, particularly anime/game‑style voices.
Training scope: Decoder‑only fine‑tuning on ~25,000 hours of Japanese data; encoder and codebook are frozen.
Compatibility: Because the encoder and codebook are unchanged, the speech tokens produced at encode time are identical to those of the original XCodec2. Any downstream model that expects XCodec2 codes (e.g., Llasa) can use Anime‑XCodec2 as a drop‑in decoder.
Sampling rate: 16 kHz (XCodec2 operates at 16 kHz).
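The drop‑in property described above can be sketched as follows: encode with the baseline model, then decode the same tokens with this model. This is a minimal sketch, not an official usage snippet. It assumes the fine‑tuned checkpoint loads through the same XCodec2Model class and the encode_code / decode_code calls shown on the baseline HKUSTAudio/xcodec2 model card, and that the repository ID uses a plain ASCII hyphen. The helper names load_codec and reconstruct are hypothetical.

```python
from typing import Any

BASELINE_ID = "HKUSTAudio/xcodec2"
FINETUNED_ID = "NandemoGHS/Anime-XCodec2"  # assumed repo ID (ASCII hyphen)


def load_codec(model_id: str) -> Any:
    """Load an XCodec2-style checkpoint in eval mode.

    The import is lazy so this module can be imported without the
    `xcodec2` package installed. Assumption: the fine-tuned checkpoint
    loads through the same class as the baseline.
    """
    from xcodec2.modeling_xcodec2 import XCodec2Model

    return XCodec2Model.from_pretrained(model_id).eval()


def reconstruct(encoder_model: Any, decoder_model: Any, wav_16k: Any) -> Any:
    """Encode with one model and decode with another.

    wav_16k is expected to be a float tensor of shape (1, T) at 16 kHz.
    In a real run, call this inside `torch.no_grad()`.
    """
    codes = encoder_model.encode_code(input_waveform=wav_16k)
    return decoder_model.decode_code(codes)


# Example (requires model downloads, so it is left as a comment):
#   baseline = load_codec(BASELINE_ID)
#   anime = load_codec(FINETUNED_ID)
#   recon = reconstruct(baseline, anime, wav_16k)
```

Since the encoder and codebook are frozen, the tokens could equally be produced by this model's own encoder; using the baseline encoder here only makes the compatibility claim explicit.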

2) Intended Use

Decode XCodec2 speech tokens (e.g., from Llasa or other AR token generators trained on XCodec2 codes) into Japanese speech with improved naturalness for anime/game‑style voices.
Reconstruct Japanese speech from XCodec2 tokens when analyzing or building Japanese‑focused speech pipelines.

3) Limitations & Trade‑offs

Language scope: Optimized for Japanese. Performance on other languages may degrade compared to the baseline XCodec2.
Sampling rate: 16 kHz only (resample inputs to 16 kHz before encoding; decode assumes 16 kHz).
Content domain: Tuned toward anime/game‑style voices; out‑of‑domain speech may not benefit.
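For the 16 kHz constraint above, the integer up/down factors needed by a polyphase resampler can be worked out from the source rate. The sketch below shows only that arithmetic; the helper names are hypothetical and no particular resampling library is implied by this model card.

```python
from math import gcd
from typing import Tuple

TARGET_SR = 16_000  # XCodec2 operates at 16 kHz


def resample_ratio(src_sr: int, dst_sr: int = TARGET_SR) -> Tuple[int, int]:
    """Return the reduced (up, down) factors for src_sr -> dst_sr."""
    if src_sr <= 0 or dst_sr <= 0:
        raise ValueError("sampling rates must be positive")
    g = gcd(src_sr, dst_sr)
    return dst_sr // g, src_sr // g


def resampled_length(num_samples: int, src_sr: int, dst_sr: int = TARGET_SR) -> int:
    """Number of output samples, rounded up, after rational resampling."""
    up, down = resample_ratio(src_sr, dst_sr)
    return -((-num_samples * up) // down)  # ceiling division


print(resample_ratio(48_000))  # (1, 3)
print(resample_ratio(44_100))  # (160, 441)
```

These factors can be passed to a band‑limited resampler such as scipy.signal.resample_poly(x, up, down); plain decimation without a low‑pass filter would introduce aliasing.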

4) Data (High‑Level)

~25,000 hours of Japanese speech, with a focus on anime/game‑style voices (acting, character voices, etc.).
Data preparation included resampling to 16 kHz and standard loudness/peak checks where appropriate.
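The exact loudness/peak checks used for this dataset are not specified here. As an illustration only, a simple check of this kind might look like the sketch below, which assumes float samples in [-1, 1], uses RMS level as a crude stand‑in for perceptual loudness, and uses hypothetical thresholds.

```python
import math
from typing import Sequence


def peak_dbfs(samples: Sequence[float]) -> float:
    """Peak level in dBFS for float samples in [-1, 1]."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20.0 * math.log10(peak) if peak > 0 else float("-inf")


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dBFS (a rough proxy for loudness, not LUFS)."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")


def passes_checks(
    samples: Sequence[float],
    max_peak_dbfs: float = -0.1,  # hypothetical: reject clips at full scale
    min_rms_dbfs: float = -40.0,  # hypothetical: reject near-silent clips
) -> bool:
    """True if the clip is neither at full scale nor too quiet."""
    return peak_dbfs(samples) <= max_peak_dbfs and rms_dbfs(samples) >= min_rms_dbfs
```

A production pipeline would more likely measure loudness in LUFS per ITU‑R BS.1770 rather than raw RMS.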

5) Training Procedure (High‑Level)

Updated (fine‑tuned): generator.backbone, generator.head, fc_post_a
Frozen: all other components, including the encoder and codebook

Goal: preserve token compatibility with HKUSTAudio/xcodec2 while improving reconstruction quality for Japanese anime/game‑style speech.
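The updated/frozen split listed above can be sketched as a name‑prefix filter over model parameters. This is an illustration, not the actual training script. It assumes a PyTorch‑style model exposing named_parameters() whose parameter names begin with the module names listed above; the helper names are hypothetical.

```python
from typing import Any, Iterable, List, Tuple

# Module names taken from the "Updated (fine-tuned)" list above.
TRAINABLE_PREFIXES: Tuple[str, ...] = (
    "generator.backbone",
    "generator.head",
    "fc_post_a",
)


def is_trainable(param_name: str, prefixes: Iterable[str] = TRAINABLE_PREFIXES) -> bool:
    """True if the parameter belongs to one of the fine-tuned modules.

    Matching on "prefix." avoids catching unrelated modules that merely
    share a name prefix.
    """
    return any(param_name == p or param_name.startswith(p + ".") for p in prefixes)


def freeze_all_but_decoder(model: Any) -> List[str]:
    """Set requires_grad per parameter and return the trainable names."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

Only parameters left with requires_grad set to True would then be handed to the optimizer, so the encoder and codebook stay bit‑identical to the baseline and the tokens remain compatible.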

6) Samples (A/B Listening)

ID	Original (reference)	Baseline reconstruction (`HKUSTAudio/xcodec2`)	Anime‑XCodec2 reconstruction (this model)
1	(audio)	(audio)	(audio)
2	(audio)	(audio)	(audio)
3	(audio)	(audio)	(audio)

Note: The original audio is 48 kHz or 44.1 kHz, while the reconstructed audio is 16 kHz.

These samples come from NandemoGHS/Japanese-Eroge-Voice and were not included in the training or validation data.

7) License

CC‑BY‑NC 4.0 (same as the original XCodec2 license).
See: https://creativecommons.org/licenses/by-nc/4.0/

8) Acknowledgements

Original: HKUSTAudio/xcodec2
Thanks to contributors and the community around Japanese speech resources.