---
license: cc-by-nc-4.0
datasets:
- OOPPEENN/56697375616C4E6F76656C5F44617461736574
- amphion/Emilia-Dataset
- OmniAICreator/ASMR-Archive-Processed
language:
- ja
base_model:
- HKUSTAudio/xcodec2
pipeline_tag: audio-to-audio
tags:
- audio-to-audio
- speech
- not-for-all-audiences
---

# Anime‑XCodec2: Japanese Fine‑Tuned Variant of XCodec2

[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

**TL;DR**: Anime‑XCodec2 is a fine‑tuned variant of **HKUSTAudio/xcodec2**, trained on \~25k hours of **Japanese anime/game‑style voices**.

Only the **decoder** was updated; the **encoder and codebook remain frozen**, so the **speech tokens are identical to those of the original XCodec2**. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (*e.g., Llasa*).

---

## 🔗 Quick Links

* **Demo (Gradio / Hugging Face Spaces)**: [https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo](https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo)
* **Baseline model (pretrained)**: `HKUSTAudio/xcodec2`
* **This repository (fine‑tuned)**: `NandemoGHS/Anime-XCodec2`
* **Training Logs (Weights & Biases)**: [View Report](https://api.wandb.ai/links/aratako-lm/0pf7mfmj)

---

## 1) Model Summary

* **What it is**: A neural speech codec / speech tokenizer model based on **XCodec2** with a decoder fine‑tuned for Japanese speech, particularly **anime/game‑style voices**.
* **Training scope**: **Decoder‑only** fine‑tuning on \~**25,000 hours** of Japanese data; the **encoder** and **codebook** are **frozen**.
* **Compatibility**: Because the encoder and codebook are unchanged, the **speech tokens produced at encode time are identical** to those of the original XCodec2. Any downstream model expecting XCodec2 codes can use **Anime‑XCodec2** as a **drop‑in decoder** (*e.g., Llasa*).
* **Sampling rate**: **16 kHz** (XCodec2 operates at 16 kHz).

---

## 2) Intended Use

* **Decode XCodec2 speech tokens** (e.g., from Llasa or other AR token generators trained on XCodec2 codes) into **Japanese speech** with improved naturalness for **anime/game‑style** voices.
* **Reconstruction** of Japanese speech from XCodec2 tokens when analyzing or building Japanese‑focused speech pipelines.
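
The drop‑in decoding described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' reference code: it assumes the `xcodec2` Python package and its `XCodec2Model` class with `from_pretrained`, `encode_code`, and `decode_code` as documented for the baseline model, and it assumes this repository's checkpoint loads through the same class. The helper names `load_codec`, `encode_then_decode`, and `demo` are hypothetical. Tokens are produced by the baseline and decoded by this model to illustrate that the codes are interchangeable; in practice a single model can likely handle both steps.

```python
def load_codec(model_id):
    """Load an XCodec2-style checkpoint (assumes the `xcodec2` package is installed)."""
    from xcodec2.modeling_xcodec2 import XCodec2Model  # imported lazily

    return XCodec2Model.from_pretrained(model_id).eval()


def encode_then_decode(encoder_model, decoder_model, waveform):
    """Tokenize with one model and synthesize with another.

    This is only meaningful when both models share the same encoder and
    codebook, which is the compatibility property described above.
    """
    codes = encoder_model.encode_code(input_waveform=waveform)
    return codes, decoder_model.decode_code(codes)


def demo(in_path, out_path):
    """Encode a 16 kHz mono file with the baseline, decode with Anime-XCodec2."""
    import soundfile as sf
    import torch

    wav, sr = sf.read(in_path, dtype="float32")
    if sr != 16000 or wav.ndim != 1:
        raise ValueError("expected 16 kHz mono audio; resample/downmix first")

    # Loading both checkpoints doubles memory use; done here only to
    # demonstrate that baseline tokens decode with the fine-tuned decoder.
    baseline = load_codec("HKUSTAudio/xcodec2")
    anime = load_codec("NandemoGHS/Anime-XCodec2")

    waveform = torch.from_numpy(wav).unsqueeze(0)  # shape (1, T)
    with torch.no_grad():
        _, recon = encode_then_decode(baseline, anime, waveform)
    # Assumed output shape (1, 1, T'), as in the baseline usage example.
    sf.write(out_path, recon[0, 0].cpu().numpy(), 16000)
```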

---

## 3) Limitations & Trade‑offs

* **Language scope**: Optimized for **Japanese**. **Performance on other languages may degrade compared to the baseline XCodec2.**
* **Sampling rate**: 16 kHz only (resample inputs to 16 kHz before encoding; decode assumes 16 kHz).
* **Content domain**: Tuned toward **anime/game‑style** voices; out‑of‑domain speech may not benefit.
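
The resampling step mentioned above could look like the sketch below. It is one possible approach rather than the authors' preprocessing: it assumes SciPy is available and uses `scipy.signal.resample_poly` for polyphase resampling. The helper names `resample_ratio` and `resample_to_16k` are hypothetical. The integer up/down factors come from reducing the two sample rates by their greatest common divisor, so 48 kHz input gives factors (1, 3) and 44.1 kHz input gives (160, 441).

```python
from math import gcd

TARGET_SR = 16_000  # XCodec2 operates at 16 kHz


def resample_ratio(orig_sr, target_sr=TARGET_SR):
    """Return the reduced (up, down) integer factors for polyphase resampling."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sample rates must be positive")
    g = gcd(orig_sr, target_sr)
    return target_sr // g, orig_sr // g


def resample_to_16k(wav, orig_sr):
    """Resample a 1-D float array to 16 kHz (assumes SciPy is installed)."""
    if orig_sr == TARGET_SR:
        return wav  # already at the target rate
    from scipy.signal import resample_poly  # imported lazily

    up, down = resample_ratio(orig_sr)
    return resample_poly(wav, up, down)
```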

---

## 4) Data (High‑Level)

* \~**25,000 hours** of Japanese speech, with a focus on **anime/game‑style voices** (acting, character voices, etc.).
* Data preparation included resampling to **16 kHz** and standard loudness/peak checks where appropriate.
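
The card does not specify which loudness/peak checks were used, so the following is only an illustrative sketch of what such a check might look like. It measures peak and RMS level in dBFS for float samples in [-1.0, 1.0]; RMS is a crude stand-in for perceptual loudness, and the thresholds `max_peak_db` and `min_rms_db` are hypothetical values, not the ones used for this model.

```python
import math


def peak_dbfs(samples):
    """Peak level in dBFS for a sequence of float samples in [-1.0, 1.0]."""
    peak = max((abs(s) for s in samples), default=0.0)
    return -math.inf if peak == 0.0 else 20.0 * math.log10(peak)


def rms_dbfs(samples):
    """RMS level in dBFS; returns -inf for empty or all-zero input."""
    if len(samples) == 0:
        return -math.inf
    mean_sq = sum(s * s for s in samples) / len(samples)
    return -math.inf if mean_sq == 0.0 else 10.0 * math.log10(mean_sq)


def passes_level_check(samples, max_peak_db=-0.1, min_rms_db=-40.0):
    """Reject clips that peak too close to full scale or are nearly silent.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return peak_dbfs(samples) <= max_peak_db and rms_dbfs(samples) >= min_rms_db
```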

---

## 5) Training Procedure (High‑Level)

* **Updated (fine‑tuned)**: `generator.backbone`, `generator.head`, `fc_post_a`
* **Frozen**: all other components, including the encoder and codebook

Goal: preserve **token compatibility** with `HKUSTAudio/xcodec2` while improving reconstruction quality for **Japanese anime/game‑style speech**.
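
The update/freeze split above can be expressed as a name‑prefix filter over the model's parameters. This is a minimal sketch rather than the actual training code: it assumes a PyTorch‑style model exposing `named_parameters()` whose parameter names begin with the three module paths listed above. The helpers `is_trainable` and `freeze_except` are hypothetical. Matching on `prefix + "."` avoids accidentally unfreezing a sibling module whose name merely starts with the same characters; an optimizer would then be given only the parameters that still have `requires_grad` set.

```python
TRAINABLE_PREFIXES = ("generator.backbone", "generator.head", "fc_post_a")


def is_trainable(param_name, prefixes=TRAINABLE_PREFIXES):
    """True if the parameter belongs to one of the fine-tuned submodules."""
    return any(
        param_name == prefix or param_name.startswith(prefix + ".")
        for prefix in prefixes
    )


def freeze_except(model, prefixes=TRAINABLE_PREFIXES):
    """Set requires_grad on every parameter; return the names left trainable."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```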

---

## 6) Samples (A/B Listening)

| ID | Original (reference) | Baseline reconstruction (`HKUSTAudio/xcodec2`) | Anime‑XCodec2 reconstruction (this model) |
| -: | :--- | :--- | :--- |
| 1 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_animexcodec2.wav"></audio> |
| 2 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_animexcodec2.wav"></audio> |
| 3 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_animexcodec2.wav"></audio> |

*Note: The original audio is at 48 kHz or 44.1 kHz, while the reconstructed audio is at 16 kHz.*

*These samples come from [NandemoGHS/Japanese-Eroge-Voice](https://huggingface.co/datasets/NandemoGHS/Japanese-Eroge-Voice) and were not included in the training or validation data.*

---

## 7) License

* **CC‑BY‑NC 4.0** (same as the original XCodec2 license).
* See: [https://creativecommons.org/licenses/by-nc/4.0/](https://creativecommons.org/licenses/by-nc/4.0/)

---

## 8) Acknowledgements

* Original: **HKUSTAudio/xcodec2**
* Thanks to contributors and the community around Japanese speech resources.