Improve model card for XY-Tokenizer
This PR significantly improves the model card for the XY-Tokenizer model by:
* **Adding comprehensive details**: The full content from the project's GitHub README has been integrated, including an overview, key highlights, installation instructions, and usage examples.
* **Enhancing discoverability**: The `pipeline_tag: audio-to-audio` has been added to the metadata, ensuring the model can be easily found under relevant tasks on the Hugging Face Hub (https://huggingface.co/models?pipeline_tag=audio-to-audio).
* **Specifying the library**: The `library_name: pytorch` tag has been added, indicating the primary framework used for the model.
* **Providing essential links**: Direct links to the paper ([XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs](https://huggingface.co/papers/2506.23325)), the official GitHub repository (https://github.com/gyt1145028706/XY-Tokenizer), and the Hugging Face model page are now prominently featured.
* **Cleaning up content**: Broken or unconfirmed links (e.g., for blog/demos) have been removed or updated, and the image path has been corrected for rendering on the Hub.
Please review and merge this PR to provide a more complete and helpful resource for the community.

---
license: apache-2.0
pipeline_tag: audio-to-audio
library_name: pytorch
---

# XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs

This repository contains the model weights for **XY-Tokenizer**, a novel speech codec introduced in the paper [XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs](https://huggingface.co/papers/2506.23325).

[[Paper]](https://huggingface.co/papers/2506.23325) | [[GitHub]](https://github.com/gyt1145028706/XY-Tokenizer) | [[Hugging Face Model Page]](https://huggingface.co/fdugyt/XY_Tokenizer)

## Overview 🔍

**XY-Tokenizer** is a speech codec designed to bridge the gap between speech signals and large language models by simultaneously **modeling both semantic and acoustic information**. It operates at a bitrate of **1 kbps** (1000 bps), using **8-layer Residual Vector Quantization (RVQ8)** at a **12.5 Hz** frame rate.

At this ultra-low bitrate, **XY-Tokenizer** achieves performance comparable to state-of-the-art speech codecs that focus on only one aspect, either semantic or acoustic, while performing strongly on both. The model weights are also available on [Hugging Face](https://huggingface.co/fdugyt/XY_Tokenizer).
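
To see how those figures fit together, here is a quick back-of-the-envelope check. The frame rate and the 8 RVQ layers come from the description above; the 1024-entry (10-bit) codebook size is an assumption used only to make the arithmetic land on 1 kbps, not a value stated in this card.

```python
import math

frame_rate_hz = 12.5    # frames per second (from the description above)
num_quantizers = 8      # RVQ8: eight residual codebooks per frame
codebook_size = 1024    # assumed codebook size -> 10 bits per codebook

bits_per_frame = num_quantizers * math.log2(codebook_size)  # 8 * 10 = 80 bits
bitrate_bps = frame_rate_hz * bits_per_frame                # 12.5 * 80 = 1000 bps
print(f"{bitrate_bps:.0f} bps = 1 kbps")
```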

## Highlights ✨

- **Low frame rate, low bitrate with high fidelity and text alignment**: Achieves strong semantic alignment and acoustic quality at 12.5 Hz and 1 kbps.

- **Multilingual training on the full Emilia dataset**: Trained on a large-scale multilingual dataset, supporting robust performance across diverse languages.

- **Designed for Speech LLMs**: Can be used for zero-shot TTS, dialogue TTS (e.g., [MOSS-TTSD](https://github.com/OpenMOSS/MOSS-TTSD)), and speech large language models.

<div align="center">
<p>
<img src="https://huggingface.co/fdugyt/XY_Tokenizer/resolve/main/assets/XY-Tokenizer-Architecture.png" alt="XY-Tokenizer architecture" width="1000">
</p>
</div>

## News 📢

- **[2025-06-28]** We released the code and checkpoints of XY-Tokenizer. Check out our [paper](https://huggingface.co/papers/2506.23325)!

## Installation 🛠️

To use XY-Tokenizer, install the required dependencies. You can set up the environment with either pip or conda; the conda workflow is shown below.

### Using conda

```bash
# Clone the repository
git clone git@github.com:gyt1145028706/XY-Tokenizer.git && cd XY-Tokenizer

# Create and activate a conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer

# Install dependencies
pip install -r requirements.txt
```
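
After installation, a quick import check can confirm that the core deep-learning dependencies resolved correctly. This assumes the requirements include `torch` and `torchaudio`, which is typical for a PyTorch speech codec but not spelled out in this card.

```python
import torch
import torchaudio

# Print versions and whether a CUDA device is visible; CPU-only is fine for a smoke test.
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```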

## Available Models 🗂️

| Model Name | Hugging Face | Training Data |
|:----------:|:------------:|:-------------:|
| XY-Tokenizer | [🤗](https://huggingface.co/fdugyt/XY_Tokenizer) | Emilia |
| XY-Tokenizer-TTSD-V0 (used in [MOSS-TTSD](https://github.com/OpenMOSS/MOSS-TTSD)) | [🤗](https://huggingface.co/fnlp/XY_Tokenizer_TTSD_V0/) | Emilia + Internal Data (containing general audio) |

## Usage 🚀

### Download XY-Tokenizer

Download the XY-Tokenizer model weights from the [XY_Tokenizer Hugging Face repository](https://huggingface.co/fdugyt/XY_Tokenizer):

```bash
mkdir -p ./weights && huggingface-cli download fdugyt/XY_Tokenizer xy_tokenizer.ckpt --local-dir ./weights/
```
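
Alternatively, the same checkpoint can be fetched from Python with `huggingface_hub` (a sketch equivalent to the CLI command above; the repo id and filename are taken from that command):

```python
from huggingface_hub import hf_hub_download

# Download xy_tokenizer.ckpt into ./weights/, mirroring the huggingface-cli command above.
ckpt_path = hf_hub_download(
    repo_id="fdugyt/XY_Tokenizer",
    filename="xy_tokenizer.ckpt",
    local_dir="./weights",
)
print("Checkpoint saved to:", ckpt_path)
```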

### Local Inference

First, add this repository to your Python path:

```bash
export PYTHONPATH=$PYTHONPATH:./
```

Then tokenize audio into speech tokens and reconstruct audio from those tokens by running:

```bash
python inference.py
```

The reconstructed audio files will be available in the `output_wavs/` directory.
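
For orientation, the round trip performed by the script looks roughly like the sketch below. The `load_model`, `encode`, and `decode` names are hypothetical placeholders for illustration, not the repository's confirmed API; `inference.py` in the GitHub repository is the authoritative entry point.

```python
import torchaudio

# Hypothetical sketch only: load_model / encode / decode are placeholder names,
# not the actual XY-Tokenizer interface. See inference.py for the real code path.
from xy_tokenizer import load_model  # assumed import

codec = load_model("./weights/xy_tokenizer.ckpt")     # assumed checkpoint loader
wav, sr = torchaudio.load("input.wav")                # load a waveform to tokenize
tokens = codec.encode(wav)                            # waveform -> RVQ8 speech tokens
recon = codec.decode(tokens)                          # speech tokens -> waveform
torchaudio.save("output_wavs/reconstruction.wav", recon, sr)
```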

## License 📜

XY-Tokenizer is released under the Apache 2.0 license.

## Citation 📚

```bibtex
@misc{gong2025xytokenizermitigatingsemanticacousticconflict,
  title={XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs},
  author={Yitian Gong and Luozhijie Jin and Ruifan Deng and Dong Zhang and Xin Zhang and Qinyuan Cheng and Zhaoye Fei and Shimin Li and Xipeng Qiu},
  year={2025},
  eprint={2506.23325},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2506.23325},
}
```