	---
	license: cc-by-nc-4.0
	datasets:
	- OOPPEENN/56697375616C4E6F76656C5F44617461736574
	- amphion/Emilia-Dataset
	- OmniAICreator/ASMR-Archive-Processed
	language:
	- ja
	base_model:
	- HKUSTAudio/xcodec2
	pipeline_tag: audio-to-audio
	tags:
	- audio-to-audio
	- speech
	- not-for-all-audiences
	---
	# Anime‑XCodec2: Japanese Fine‑Tuned Variant of XCodec2

	[![License: CC BY‑NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

	TL;DR: Anime‑XCodec2 is a fine‑tuned variant of HKUSTAudio/xcodec2, trained on ~25k hours of Japanese anime/game‑style voices.

	Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).

	---

	## 🔗 Quick Links

	* Demo (Gradio / Hugging Face Spaces): [https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo](https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo)
	* Baseline model (pretrained): `HKUSTAudio/xcodec2`
	* This repository (fine‑tuned): `NandemoGHS/Anime-XCodec2`
	* Training Logs (Weights & Biases): [View Report](https://api.wandb.ai/links/aratako-lm/0pf7mfmj)

	---

	## 1) Model Summary

	* What it is: A neural speech codec / speech tokenizer model based on XCodec2 with a decoder fine‑tuned for Japanese speech, particularly anime/game‑style voices.
	* Training scope: Decoder‑only fine‑tuning on ~25,000 hours of Japanese data; encoder and codebook are frozen.
	* Compatibility: Because the encoder and codebook are unchanged, the speech tokens produced at encode time are identical to those of the original XCodec2. Any downstream model expecting XCodec2 codes (e.g., Llasa) can use Anime‑XCodec2 as a drop‑in decoder.
	* Sampling rate: 16 kHz (XCodec2 operates at 16 kHz).
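
The compatibility point can be illustrated with a minimal sketch: encode with the baseline, then decode the same codes with this model. It assumes the `xcodec2` Python package and the `XCodec2Model.from_pretrained`, `encode_code`, and `decode_code` calls shown in the upstream `HKUSTAudio/xcodec2` model card. `load_codec` and `reconstruct` are hypothetical helper names, and it is an assumption that this repository loads through `from_pretrained` in the same way as the baseline.

```python
def load_codec(repo_id):
    """Load an XCodec2-style model (assumes the upstream `xcodec2` package)."""
    # Imported lazily so `reconstruct` can be used without the package installed.
    from xcodec2.modeling_xcodec2 import XCodec2Model

    return XCodec2Model.from_pretrained(repo_id).eval()


def reconstruct(encoder_model, decoder_model, wav_16k):
    """Encode with one model and decode the resulting codes with another.

    wav_16k: mono 16 kHz waveform tensor of shape (1, T).
    Returns whatever `decoder_model.decode_code` returns
    (shape (1, 1, T') in the upstream example).
    """
    codes = encoder_model.encode_code(input_waveform=wav_16k)
    return decoder_model.decode_code(codes)
```

A typical call would load `HKUSTAudio/xcodec2` and `NandemoGHS/Anime-XCodec2` with `load_codec` and run `reconstruct(baseline, anime, wav)` inside `torch.no_grad()`. Because the encoder is frozen, passing the fine-tuned model as both arguments should yield the same codes.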

	---

	## 2) Intended Use

	* Decode XCodec2 speech tokens (e.g., from Llasa or other AR token generators trained on XCodec2 codes) into Japanese speech with improved naturalness for anime/game‑style voices.
	* Reconstruct Japanese speech from XCodec2 tokens when analyzing or building Japanese‑focused speech pipelines.
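
Tokens from a Llasa-style generator typically arrive as text tokens rather than integer codes. The sketch below assumes the `<|s_N|>` token format used in the Llasa model card and the `(1, 1, T)` code shape that the upstream example passes to `decode_code`. `extract_speech_ids` mirrors the helper name in that card, while `to_code_shape` is a hypothetical name; treat both as illustrative.

```python
import re

# Assumed Llasa speech-token format: "<|s_12345|>", where the number is the code index.
_SPEECH_TOKEN = re.compile(r"^<\|s_(\d+)\|>$")


def extract_speech_ids(tokens):
    """Return integer code indices, skipping any non-speech tokens."""
    ids = []
    for tok in tokens:
        match = _SPEECH_TOKEN.match(tok)
        if match:
            ids.append(int(match.group(1)))
    return ids


def to_code_shape(ids):
    """Wrap a flat id list as nested lists of shape (1, 1, T)."""
    return [[list(ids)]]
```

With PyTorch available, `torch.tensor(to_code_shape(ids))` would give a tensor of the shape the decoder is assumed to expect.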

	---

	## 3) Limitations & Trade‑offs

	* Language scope: Optimized for Japanese. Performance on other languages may degrade compared to the baseline XCodec2.
	* Sampling rate: 16 kHz only (resample inputs to 16 kHz before encoding; decode assumes 16 kHz).
	* Content domain: Tuned toward anime/game‑style voices; out‑of‑domain speech may not benefit.
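
Since both encoding and decoding assume 16 kHz, audio at other rates needs resampling first. A minimal sketch using SciPy's polyphase resampler follows. This is one reasonable choice among several (the card does not say which resampler was used), and `resample_to_16k` is a hypothetical helper name.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # XCodec2 operates at 16 kHz


def resample_to_16k(wav, orig_sr):
    """Resample a mono float waveform (1-D array) to 16 kHz."""
    wav = np.asarray(wav, dtype=np.float32)
    if orig_sr == TARGET_SR:
        return wav
    g = gcd(orig_sr, TARGET_SR)
    # Rational ratio, e.g. 48000 Hz -> up=1, down=3; 44100 Hz -> up=160, down=441.
    return resample_poly(wav, TARGET_SR // g, orig_sr // g)
```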

	---

	## 4) Data (High‑Level)

	* ~25,000 hours of Japanese speech, with a focus on anime/game‑style voices (acting, character voices, etc.).
	* Data preparation included resampling to 16 kHz and standard loudness/peak checks where appropriate.
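
The card does not specify what the loudness/peak checks were. Purely as an illustration, a peak check could look like the sketch below. The -1 dBFS ceiling and -40 dBFS floor are hypothetical thresholds, not the values used for this model, and the function names are hypothetical as well.

```python
import numpy as np


def peak_dbfs(wav):
    """Peak level in dBFS for a float waveform scaled to [-1, 1]."""
    peak = float(np.max(np.abs(wav))) if len(wav) else 0.0
    return 20.0 * np.log10(peak) if peak > 0 else float("-inf")


def passes_peak_check(wav, ceiling_db=-1.0, floor_db=-40.0):
    """True if the peak is neither near clipping nor nearly silent.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    level = peak_dbfs(wav)
    return floor_db <= level <= ceiling_db
```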

	---

	## 5) Training Procedure (High‑Level)

	* Updated (fine‑tuned): `generator.backbone`, `generator.head`, `fc_post_a`
	* Frozen: all other components, including the encoder and codebook

	Goal: preserve token compatibility with `HKUSTAudio/xcodec2` while improving reconstruction quality for Japanese anime/game‑style speech.
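
The update/freeze split can be sketched as a name-prefix filter over parameters. This is a minimal illustration, not the author's training code. `mark_trainable` and `is_trainable` are hypothetical helpers, and the sketch assumes parameter names begin with the listed prefixes, as they would in the output of PyTorch's `model.named_parameters()`.

```python
TRAINABLE_PREFIXES = ("generator.backbone", "generator.head", "fc_post_a")


def is_trainable(name, prefixes=TRAINABLE_PREFIXES):
    """True if a parameter name equals a prefix or lives under it."""
    return any(name == p or name.startswith(p + ".") for p in prefixes)


def mark_trainable(named_params, prefixes=TRAINABLE_PREFIXES):
    """Set requires_grad on each (name, param) pair; return the trainable names."""
    trainable = []
    for name, param in named_params:
        param.requires_grad = is_trainable(name, prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a PyTorch module, `mark_trainable(model.named_parameters())` would leave only the listed decoder components trainable, so the encoder and codebook stay untouched and the tokens stay compatible.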

	---

	## 6) Samples (A/B Listening)

	| ID | Original (reference) | Baseline Reconstruct (`HKUSTAudio/xcodec2`) | Anime‑XCodec2 Reconstruct (this model) |
	| -: | :-- | :-- | :-- |
	| 1 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample1_animexcodec2.wav"></audio> |
	| 2 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample2_animexcodec2.wav"></audio> |
	| 3 | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_original.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_baseline.wav"></audio> | <audio controls src="https://huggingface.co/NandemoGHS/Anime-XCodec2/resolve/main/samples/sample3_animexcodec2.wav"></audio> |

	Note: The original audio is 48 kHz or 44.1 kHz, while the reconstructed audio is 16 kHz.

	These samples come from [NandemoGHS/Japanese-Eroge-Voice](https://huggingface.co/datasets/NandemoGHS/Japanese-Eroge-Voice) and were not included in the training or validation data.

	---

	## 7) License

	* CC‑BY‑NC 4.0 (same as the original XCodec2 license).
	* See: [https://creativecommons.org/licenses/by-nc/4.0/](https://creativecommons.org/licenses/by-nc/4.0/)

	---

	## 8) Acknowledgements

	* Original: HKUSTAudio/xcodec2
	* Thanks to contributors and the community around Japanese speech resources.