Improve model card for XY-Tokenizer

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +101 -3
README.md CHANGED
@@ -1,3 +1,101 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: audio-to-audio
4
+ library_name: pytorch
5
+ ---
6
+
7
+ # XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
8
+
9
+ This repository contains the model weights for **XY-Tokenizer**, a novel speech codec introduced in the paper [XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs](https://huggingface.co/papers/2506.23325).
10
+
11
+ [[Paper]](https://huggingface.co/papers/2506.23325) | [[GitHub]](https://github.com/gyt1145028706/XY-Tokenizer) | [[Hugging Face Model Page]](https://huggingface.co/fdugyt/XY_Tokenizer)
12
+
13
+ ## Overview 🔍
14
+
15
+ **XY-Tokenizer** is a novel speech codec designed to bridge the gap between speech signals and large language models by simultaneously **modeling both semantic and acoustic information**. It operates at a bitrate of **1 kbps** (1000 bps), using **8-layer Residual Vector Quantization (RVQ8)** at a **12.5 Hz** frame rate.
16
+
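The figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only the frame rate, layer count, and bitrate stated in the model card; the implied codebook size is an inference that assumes every RVQ layer uses the same codebook size, which the card does not state.

```python
# Values taken from the model card.
FRAME_RATE_HZ = 12.5   # frames per second
NUM_QUANTIZERS = 8     # RVQ layers (RVQ8)
BITRATE_BPS = 1000     # 1 kbps

# Tokens emitted per second across all RVQ layers.
tokens_per_second = FRAME_RATE_HZ * NUM_QUANTIZERS

# Bits each token must carry to reach the stated bitrate.
bits_per_token = BITRATE_BPS / tokens_per_second

# Assumption (not stated in the card): all layers share one codebook size,
# so the size is 2 ** bits_per_token.
implied_codebook_size = 2 ** int(bits_per_token)

print(tokens_per_second, bits_per_token, implied_codebook_size)
# prints: 100.0 10.0 1024
```

Under that assumption, each second of speech maps to 100 tokens of 10 bits each.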
17
+ At this ultra-low bitrate, **XY-Tokenizer** performs comparably to state-of-the-art speech codecs that target only one aspect (semantic or acoustic), while performing strongly on both. You can also find the model on [Hugging Face](https://huggingface.co/fdugyt/XY_Tokenizer).
18
+
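For readers unfamiliar with residual vector quantization, here is a toy sketch of the general technique in pure Python. It is not XY-Tokenizer's implementation: the codebooks are random, the sizes (16 entries, 4 dimensions) are arbitrary, and all function names are hypothetical. Each layer quantizes what the previous layers left over, and decoding sums the selected codewords.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared L2 distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct by summing the selected codeword from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_codebooks(num_layers, size, dim, seed=0):
    """Random toy codebooks whose scale halves at each layer (illustrative only)."""
    rng = random.Random(seed)
    books = []
    for layer in range(num_layers):
        scale = 0.5 ** layer
        # A zero codeword guarantees that a layer never increases the error.
        book = [[0.0] * dim]
        book += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        books.append(book)
    return books

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

codebooks = make_codebooks(num_layers=8, size=16, dim=4)
frame = [0.3, -0.7, 0.1, 0.5]           # one made-up feature frame
codes = rvq_encode(codebooks, frame)    # 8 integer codes, one per layer
err_1 = sq_error(frame, rvq_decode(codebooks[:1], codes[:1]))
err_8 = sq_error(frame, rvq_decode(codebooks, codes))
print(codes, err_1, err_8)
```

Decoding with only the first layer gives a coarse reconstruction; adding the remaining layers can only keep or reduce the error in this toy setup.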
19
+ ## Highlights ✨
20
+
21
+ - **Low frame rate, low bitrate with high fidelity and text alignment**: Achieves strong semantic alignment and acoustic quality at 12.5 Hz and 1 kbps.
22
+
23
+ - **Multilingual training on the full Emilia dataset**: Trained on a large-scale multilingual dataset, supporting robust performance across diverse languages.
24
+
25
+ - **Designed for Speech LLMs**: Can be used for zero-shot TTS, dialogue TTS (e.g., [MOSS-TTSD](https://github.com/OpenMOSS/MOSS-TTSD)), and speech large language models.
26
+
27
+ <div align="center">
28
+ <p>
29
+ <img src="https://huggingface.co/fdugyt/XY_Tokenizer/resolve/main/assets/XY-Tokenizer-Architecture.png" alt="XY-Tokenizer" width="1000">
30
+ </p>
31
+ </div>
32
+
33
+ ## News 📢
34
+
35
+ - **[2025-06-28]** We released the code and checkpoints of XY-Tokenizer. Check out our [paper](https://huggingface.co/papers/2506.23325)!
36
+
37
+ ## Installation 🛠️
38
+
39
+ To use XY-Tokenizer, install the required dependencies. The steps below create a conda environment and install the requirements with pip.
40
+
41
+ ### Using conda
42
+
43
+ ```bash
44
+ # Clone repository
45
+ git clone git@github.com:gyt1145028706/XY-Tokenizer.git && cd XY-Tokenizer
46
+
47
+ # Create and activate conda environment
48
+ conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer
49
+
50
+ # Install dependencies
51
+ pip install -r requirements.txt
52
+ ```
53
+
54
+ ## Available Models 🗂️
55
+
56
+ | Model Name | Hugging Face | Training Data |
57
+ |:----------:|:-------------:|:---------------:|
58
+ | XY-Tokenizer | [🤗](https://huggingface.co/fdugyt/XY_Tokenizer) | Emilia |
59
+ | XY-Tokenizer-TTSD-V0 (used in [MOSS-TTSD](https://github.com/OpenMOSS/MOSS-TTSD)) | [🤗](https://huggingface.co/fnlp/XY_Tokenizer_TTSD_V0/) | Emilia + Internal Data (containing general audio) |
60
+
61
+ ## Usage 🚀
62
+
63
+ ### Download XY-Tokenizer
64
+
65
+ Download the XY-Tokenizer model weights from the [XY_Tokenizer Hugging Face repository](https://huggingface.co/fdugyt/XY_Tokenizer):
66
+
67
+ ```bash
68
+ mkdir -p ./weights && huggingface-cli download fdugyt/XY_Tokenizer xy_tokenizer.ckpt --local-dir ./weights/
69
+ ```
70
+
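The same download can likely be done from Python with `huggingface_hub.hf_hub_download`. This is a minimal sketch rather than part of the repository: the helper name `download_checkpoint` is hypothetical, while the repo id, filename, and target directory are copied from the CLI command above.

```python
import os

REPO_ID = "fdugyt/XY_Tokenizer"   # from the model card
FILENAME = "xy_tokenizer.ckpt"    # from the CLI command above
LOCAL_DIR = "./weights"           # from the CLI command above

def download_checkpoint(local_dir: str = LOCAL_DIR) -> str:
    """Download the checkpoint into local_dir and return its local path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import hf_hub_download

    os.makedirs(local_dir, exist_ok=True)  # mirrors `mkdir -p ./weights`
    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
```

Calling `download_checkpoint()` requires network access and the `huggingface_hub` package.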
71
+ ### Local Inference
72
+
73
+ First, set the Python path to include this repository:
74
+ ```bash
75
+ export PYTHONPATH=$PYTHONPATH:./
76
+ ```
77
+
78
+ Then you can tokenize audio to speech tokens and generate reconstructed audio from these tokens by running:
79
+ ```bash
80
+ python inference.py
81
+ ```
82
+
83
+ The reconstructed audio files will be available in the `output_wavs/` directory.
84
+
85
+ ## License 📜
86
+
87
+ XY-Tokenizer is released under the Apache 2.0 license.
88
+
89
+ ## Citation 📚
90
+
91
+ ```bibtex
92
+ @misc{gong2025xytokenizermitigatingsemanticacousticconflict,
93
+ title={XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs},
94
+ author={Yitian Gong and Luozhijie Jin and Ruifan Deng and Dong Zhang and Xin Zhang and Qinyuan Cheng and Zhaoye Fei and Shimin Li and Xipeng Qiu},
95
+ year={2025},
96
+ eprint={2506.23325},
97
+ archivePrefix={arXiv},
98
+ primaryClass={cs.SD},
99
+ url={https://arxiv.org/abs/2506.23325},
100
+ }
101
+ ```