---
license: cc-by-nc-4.0
datasets:
- OOPPEENN/56697375616C4E6F76656C5F44617461736574
- amphion/Emilia-Dataset
- OmniAICreator/ASMR-Archive-Processed
language:
- ja
base_model:
- HKUSTAudio/xcodec2
pipeline_tag: audio-to-audio
tags:
- audio-to-audio
- speech
- not-for-all-audiences
---
# Anime‑XCodec2: Japanese Fine‑Tuned Variant of XCodec2
[![License: CC BY‑NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
**TL;DR**: Anime‑XCodec2 is a fine‑tuned variant of **HKUSTAudio/xcodec2**, trained on \~25k hours of **Japanese anime/game‑style voices**.
Only the **decoder** was updated; the **encoder and codebook remain frozen**, so **speech tokens are identical to the original XCodec2**. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (*e.g., Llasa*).
---
## 🔗 Quick Links
* **Demo (Gradio / Hugging Face Spaces)**: [https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo](https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo)
* **Baseline model (pretrained)**: `HKUSTAudio/xcodec2`
* **This repository (fine‑tuned)**: `NandemoGHS/Anime-XCodec2`
* **Training Logs (Weights & Biases)**: [View Report](https://api.wandb.ai/links/aratako-lm/0pf7mfmj)
---
## 1) Model Summary
* **What it is**: A neural speech codec / speech tokenizer model based on **XCodec2** with a decoder fine‑tuned for Japanese speech, particularly **anime/game‑style voices**.
* **Training scope**: **Decoder‑only** fine‑tuning on \~**25,000 hours** of Japanese data; **encoder** and **codebook** are **frozen**.
* **Compatibility**: Because the encoder and codebook are unchanged, **speech tokens produced at encode time are identical** to those of the original XCodec2. Any downstream model expecting XCodec2 codes can use **Anime‑XCodec2** as a **drop‑in decoder** (*e.g., Llasa*).
* **Sampling rate**: **16 kHz** (XCodec2 operates at 16 kHz).
---
## 2) Intended Use
* **Decode XCodec2 speech tokens** (e.g., from Llasa or other AR token generators trained on XCodec2 codes) into **Japanese speech** with improved naturalness for **anime/game‑style** voices.
* **Reconstruction** of Japanese speech from XCodec2 tokens when analyzing or building Japanese‑focused speech pipelines.
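
The round trip described above can be sketched as follows. This is a minimal sketch, not an official usage snippet: it assumes the `xcodec2` Python package and the `XCodec2Model.from_pretrained`, `encode_code`, and `decode_code` calls shown on the upstream `HKUSTAudio/xcodec2` model card, and it assumes this repository's checkpoint loads through the same class. `soundfile` and the file paths are likewise illustrative choices.

```python
BASELINE_ID = "HKUSTAudio/xcodec2"
FINETUNED_ID = "NandemoGHS/Anime-XCodec2"
SAMPLE_RATE = 16000  # XCodec2 operates at 16 kHz


def reconstruct(in_wav, out_wav, decoder_id=FINETUNED_ID, device="cpu"):
    """Encode a 16 kHz file to XCodec2 tokens, then decode it with `decoder_id`."""
    # Heavy dependencies are imported lazily so this module imports without them.
    import soundfile as sf
    import torch
    # Assumption: import path and method names follow the upstream model card.
    from xcodec2.modeling_xcodec2 import XCodec2Model

    wav, sr = sf.read(in_wav, dtype="float32")
    if sr != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz input, got {sr} Hz; resample first")
    if wav.ndim > 1:
        wav = wav.mean(axis=1)  # downmix to mono (assumption: the model expects mono)

    model = XCodec2Model.from_pretrained(decoder_id).eval().to(device)
    wav_tensor = torch.from_numpy(wav).unsqueeze(0).to(device)  # shape (1, T)

    with torch.no_grad():
        codes = model.encode_code(input_waveform=wav_tensor)
        recon = model.decode_code(codes).cpu()  # shape (1, 1, T') per the upstream card

    sf.write(out_wav, recon[0, 0, :].numpy(), SAMPLE_RATE)
    return codes


if __name__ == "__main__":
    # Placeholder paths; replace with real files.
    reconstruct("input_16k.wav", "reconstructed.wav")
```

Passing `decoder_id=BASELINE_ID` should give the baseline reconstruction for an A/B comparison. Since the encoder and codebook are frozen, the returned codes are expected to match between the two checkpoints.
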
---
## 3) Limitations & Trade‑offs
* **Language scope**: Optimized for **Japanese**. **Performance on other languages may degrade compared to the baseline XCodec2.**
* **Sampling rate**: 16 kHz only (resample inputs to 16 kHz before encoding; decode assumes 16 kHz).
* **Content domain**: Tuned toward **anime/game‑style** voices; out‑of‑domain speech may not benefit.
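
Because of the 16 kHz constraint, it can help to check files before encoding. The sketch below uses only Python's standard `wave` module, so it reads PCM WAV files only. The function name `wav_problems` is hypothetical, and the mono check is an assumption based on the `(1, T)` input shape used in upstream examples.

```python
import wave

TARGET_RATE = 16000  # XCodec2 operates at 16 kHz


def wav_problems(path, target_rate=TARGET_RATE):
    """Return a list of problems found in a PCM WAV file; empty means ready to encode."""
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    if rate != target_rate:
        problems.append(
            f"sample rate is {rate} Hz, expected {target_rate} Hz; resample first"
        )
    if channels != 1:
        # Assumption: the codec expects mono input.
        problems.append(f"{channels} channels found; downmix to mono first")
    return problems
```

For the actual resampling, a band-limited resampler from an audio library is a safer choice than naive decimation, which aliases.
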
---
## 4) Data (High‑Level)
* \~**25,000 hours** of Japanese speech, with a focus on **anime/game‑style voices** (acting, character voices, etc.).
* Data preparation included resampling to **16 kHz** and standard loudness/peak checks where appropriate.
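
This card does not say which loudness/peak checks were run or with what thresholds. As an illustration only, a simple level check on float samples in `[-1.0, 1.0]` might look like the sketch below. The function names and both thresholds are hypothetical, and RMS level is only a crude proxy for perceived loudness.

```python
import math


def peak_dbfs(samples):
    """Peak level in dBFS for float samples in [-1.0, 1.0]."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20.0 * math.log10(peak) if peak > 0 else float("-inf")


def rms_dbfs(samples):
    """RMS level in dBFS; a rough stand-in for loudness, not a LUFS measurement."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")


def passes_level_check(samples, max_peak_db=-0.1, min_rms_db=-40.0):
    """Reject clips that hit full scale or are nearly silent (hypothetical thresholds)."""
    return peak_dbfs(samples) <= max_peak_db and rms_dbfs(samples) >= min_rms_db
```
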
---
## 5) Training Procedure (High‑Level)
* **Updated (fine‑tuned)**: `generator.backbone`, `generator.head`, `fc_post_a`
* **Frozen**: all other components, including the encoder and codebook

Goal: preserve **token compatibility** with `HKUSTAudio/xcodec2` while improving reconstruction quality for **Japanese anime/game‑style speech**.
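
The training code is not part of this card. One way to express this freeze/update split is to select parameters by name prefix, as sketched below. It assumes the listed module paths appear as prefixes in `model.named_parameters()`, as they would for a PyTorch `nn.Module`; the helper names are hypothetical.

```python
TRAINABLE_PREFIXES = ("generator.backbone", "generator.head", "fc_post_a")


def is_trainable(param_name, prefixes=TRAINABLE_PREFIXES):
    """True if the parameter belongs to one of the fine-tuned modules."""
    return any(
        param_name == p or param_name.startswith(p + ".") for p in prefixes
    )


def freeze_all_but(model, prefixes=TRAINABLE_PREFIXES):
    """Set requires_grad per parameter and return the names left trainable.

    `model` only needs a `named_parameters()` method yielding (name, param)
    pairs, which a torch.nn.Module provides.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

Matching on `prefix + "."` rather than a bare `startswith(prefix)` avoids accidentally unfreezing a sibling module whose name merely begins with the same characters.
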
---
## 6) Samples (A/B Listening)
| ID | Original (reference) | Baseline Reconstruct (`HKUSTAudio/xcodec2`) | Anime‑XCodec2 Reconstruct (this model) |
| -: | :----------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------- |
| 1 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_animexcodec2.wav"></audio> |
| 2 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_animexcodec2.wav"></audio> |
| 3 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_animexcodec2.wav"></audio> |
*Note: the original audio is 48 kHz or 44.1 kHz, while the reconstructed audio is 16 kHz.*
*These samples come from [NandemoGHS/Japanese-Eroge-Voice](https://huggingface.co/datasets/NandemoGHS/Japanese-Eroge-Voice) and were not included in the training or validation data.*
---
## 7) License
* **CC‑BY‑NC 4.0** (same as the original XCodec2 license).
* See: [https://creativecommons.org/licenses/by-nc/4.0/](https://creativecommons.org/licenses/by-nc/4.0/)
---
## 8) Acknowledgements
* Original: **HKUSTAudio/xcodec2**
* Thanks to contributors and the community around Japanese speech resources.