nielsr HF Staff committed on
Commit 5f86d02 · verified · 1 Parent(s): 03c207a

Add project page to model card


This PR adds a link to the project page, improving the model card.

Files changed (1)
  1. README.md +128 -139
README.md CHANGED
@@ -1,153 +1,142 @@
1
  ---
2
- license: apache-2.0
3
- pipeline_tag: text-generation
4
  library_name: transformers
5
  ---
6
- # AM‑Thinking‑v1: Advancing the Frontier of Reasoning at 32B Scale
7
- * 2025-05-10 · a-m-team
8
 
9
  <p align="center">
10
- 🤗 <a href="https://huggingface.co/a-m-team">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="https://arxiv.org/abs/2505.08311">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="https://a-m-team.github.io/am-thinking-v1/">Blog</a>
11
  </p>
12
 
13
- ## 🚀 Introduction
14
-
15
- We release **AM-Thinking‑v1**, a 32B dense language model focused on enhancing reasoning capabilities.
16
- Built on Qwen 2.5-32B-Base, AM-Thinking-v1 shows strong performance on reasoning benchmarks, comparable to much larger MoE models such as **DeepSeek-R1**, **Qwen3-235B-A22B**, and **Seed1.5-Thinking**, and to larger dense models such as **Nemotron-Ultra-253B-v1**.
17
-
18
- <div style="text-align: center;">
19
- <img src="assets/benchmark.png" alt="benchmark" style="width: 90%;">
20
- </div>
21
-
22
-
23
- ## 🧩 Why Another 32B Reasoning Model Matters
24
-
25
- Large Mixture-of-Experts (MoE) models such as **DeepSeek-R1** or **Qwen3-235B-A22B** dominate leaderboards, but they also demand clusters of high-end GPUs. Many teams just need *the best dense model that fits on a single card*.
26
- **AM‑Thinking‑v1** fills that gap **while remaining fully based on open-source components**:
27
-
28
- * **Outperforms DeepSeek-R1** on AIME'24/'25 & LiveCodeBench and **approaches Qwen3-235B-A22B** despite being 1/7th the parameter count.
29
- * **Built on the publicly available Qwen 2.5-32B-Base**, together with publicly available RL training queries.
30
- * Shows that with a **well-designed post-training pipeline** (SFT + dual-stage RL) you can squeeze flagship-level reasoning out of a 32B dense model.
31
- * **Deploys on a single 80 GB A100** with deterministic latency and no MoE routing overhead.
32
-
33
- <div style="text-align: center;">
34
- <img src="assets/param-aime2024.jpeg" alt="AIME 2024" style="width: 90%; margin-bottom: 20px;">
35
- <img src="assets/param-lcb.jpeg" alt="LiveCodeBench" style="width: 90%;">
36
- <div style="margin-top: 10px;">
37
- <em>AM-Thinking-v1 achieves strong reasoning performance with significantly fewer parameters.</em>
38
- </div>
39
- </div>
40
-
41
-
42
-
43
- ## 🛠️ Use Cases
44
-
45
- ### 1) Code Generation
46
- <pre style="font-family: 'Times New Roman', serif; font-size: 12px; border: 1px solid black; padding: 10px; font-style: italic;">
47
- PROMPT :
48
- write a python script for a bouncing red ball within a triangle, make sure to handle collision detection properly. make the triangle slowly rotate. implement it in python. make sure ball stays within the triangle
49
- </pre>
50
- <div style="text-align: center;">
51
- <img src="assets/ball.gif" alt="Bouncing Red Ball" width="50%">
52
- </div>
53
-
54
-
55
- ### 2) Logic
56
-
57
-
58
- <div style="text-align: center;">
59
- <img src="assets/diamond.png" alt="diamond" width="90%">
60
- </div>
61
-
62
-
63
- ### 3) Writing
64
- <div style="text-align: center;">
65
- <img src="assets/writing.png" alt="sushi" width="90%">
66
- </div>
67
-
68
-
69
-
70
- ## ⚡ Quick start
71
-
72
- ```python
73
- from transformers import AutoModelForCausalLM, AutoTokenizer
74
-
75
- model_name = "a-m-team/AM-Thinking-v1"
76
-
77
- tokenizer = AutoTokenizer.from_pretrained(model_name)
78
- model = AutoModelForCausalLM.from_pretrained(
79
- model_name,
80
- torch_dtype="auto",
81
- device_map="auto"
82
- )
83
-
84
- prompt = "How can I find inner peace?"
85
- messages = [
86
- {"role": "user", "content": prompt}
87
- ]
88
- text = tokenizer.apply_chat_template(
89
- messages,
90
- tokenize=False,
91
- add_generation_prompt=True
92
- )
93
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
94
-
95
- generated_ids = model.generate(
96
- **model_inputs,
97
- max_new_tokens=49152
98
- )
99
- output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
100
-
101
- response = tokenizer.decode(output_ids, skip_special_tokens=True)
102
- think_content = response.split("<think>")[1].split("</think>")[0]
103
- answer_content = response.split("<answer>")[1].split("</answer>")[0]
104
-
105
- print(f"user prompt: {prompt}")
106
- print(f"model thinking: {think_content}")
107
- print(f"model answer: {answer_content}")
108
  ```
109
- > Note: We have included the system prompt in the tokenizer configuration, as it was used during both the SFT and RL stages. To ensure consistent output quality, we recommend including the same system prompt during actual usage; otherwise, the model's responses may be significantly affected.
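One way to confirm what actually gets rendered (including any default system prompt the bundled chat template injects) is to print the templated string before tokenization. This is an illustrative sketch, not part of the original card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("a-m-team/AM-Thinking-v1")

# Render the chat template as plain text so the injected system prompt is visible.
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How can I find inner peace?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)
```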
110
-
111
- ### Quantized versions for compact devices
112
- A series of quantized versions of the AM-Thinking-v1 model,
113
- for use with [llama.cpp](https://github.com/ggml-org/llama.cpp) and [Ollama](https://github.com/ollama/ollama),
114
- is available at [AM-Thinking-v1-gguf](https://huggingface.co/a-m-team/AM-Thinking-v1-gguf).
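For a quick local test of the GGUF builds from Python, the `llama-cpp-python` bindings are one option. The snippet below is a minimal sketch; the quantization filename pattern is an assumption, so check which GGUF files are actually published in the AM-Thinking-v1-gguf repository:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The filename glob is illustrative; pick a quantization that exists in the repo.
llm = Llama.from_pretrained(
    repo_id="a-m-team/AM-Thinking-v1-gguf",
    filename="*Q4_K_M.gguf",
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How can I find inner peace?"}],
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```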
115
-
116
 
117
- ## 🔧 Post-training pipeline
118
-
119
- To achieve its strong reasoning ability, AM‑Thinking‑v1 goes through a carefully designed post-training pipeline.
120
- Below we describe the key stages involved in turning a base model into a high-performing reasoner:
121
-
122
-
123
- **Step 1 – Cold-start SFT.**
124
- We begin with the open-sourced **Qwen 2.5-32B-Base** and run a broad supervised fine-tune on a blended training dataset of math, code, and open-domain chat. This endows the model with a "think-then-answer" behavioural pattern and equips it with an initial capacity for reasoning.
125
-
126
- **Step 2 – Pass-rate-aware data curation.**
127
- Before any RL, the SFT model is evaluated on every math- and code-oriented training query. For each item we log a pass rate; only those with **0 < pass rate < 1** are kept. In effect, we discard problems the model already masters and those it utterly fails, concentrating learning on genuinely informative cases.
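The curation rule is easy to express in code. The sketch below is illustrative only (it is not the team's actual pipeline) and assumes you provide a `generate_and_check` callable that runs the SFT model on a query and verifies the answer:

```python
from typing import Callable, Dict, List

def curate_by_pass_rate(
    queries: List[Dict],
    generate_and_check: Callable[[Dict], bool],
    samples_per_query: int = 16,
) -> List[Dict]:
    """Keep only queries the SFT model sometimes, but not always, solves."""
    kept = []
    for query in queries:
        passes = sum(generate_and_check(query) for _ in range(samples_per_query))
        pass_rate = passes / samples_per_query
        if 0.0 < pass_rate < 1.0:  # drop fully-mastered and never-solved items
            kept.append({**query, "pass_rate": pass_rate})
    return kept
```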
128
-
129
- **Step 3 – Reinforcement learning.**
130
- We adopt a two-stage GRPO scheme: Stage 1 trains only on math and code queries. Once it converges, Stage 2 starts by removing every query the model answered 100% correctly in Stage 1 and adjusting key hyper-parameters such as maximum generation length and learning rate.
131
-
132
-
133
- ## ⚠️ Limitations
134
-
135
- While AM‑Thinking‑v1 excels at pure language reasoning and open‑domain chat, it has not yet been trained for structured function‑calling or tool‑use workflows, which restricts its usefulness in agent‑style applications that must act on external systems.
136
- Improving the model's ability to follow complex instructions is also an important direction for our future work.
137
- In addition, our safety alignment is still at an early stage, so more rigorous red-teaming is required to reduce potential harms.
138
 
139
- ## 📚 Citation
140
- The a-m-team is an internal team at Beike (Ke.com), dedicated to exploring AGI technology.
141
- If you find our work helpful, feel free to cite us.
142
 
143
  ```
144
- @misc{ji2025amthinkingv1advancingfrontierreasoning,
145
- title={AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale},
146
- author={Yunjie Ji and Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yiping Peng and Han Zhao and Xiangang Li},
147
- year={2025},
148
- eprint={2505.08311},
149
- archivePrefix={arXiv},
150
- primaryClass={cs.CL},
151
- url={https://arxiv.org/abs/2505.08311},
152
  }
153
  ```
 
1
  ---
2
+ datasets:
3
+ - maitrix-org/Voila-Benchmark
4
+ - maitrix-org/Voila-million-voice
5
+ language:
6
+ - en
7
+ - zh
8
+ - fr
9
+ - de
10
+ - ja
11
+ - ko
12
  library_name: transformers
13
+ license: mit
14
+ pipeline_tag: audio-text-to-text
15
  ---
16
 
17
  <p align="center">
18
+ <img src="https://voila.maitrix.org/static/images/logo.png" width="400"/><br/>
19
+ <b>Voila: <span style="color:#ca00f9">Voi</span>ce-<span style="color:#ca00f9">La</span>nguage Foundation Models</b><br/><br/>
20
+ 💜 <a href="https://voila.maitrix.org"><b>Project Page</b></a> &nbsp;&nbsp; | &nbsp;&nbsp; 🖥️ <a href="https://github.com/maitrix-org/Voila">GitHub</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/collections/maitrix-org/voila-67e0d96962c19f221fc73fa5">Hugging Face</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="http://arxiv.org/abs/2505.02707">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🌐 <a href="https://huggingface.co/spaces/maitrix-org/Voila-demo">Online Demo</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🏠 <a href="https://maitrix.org">Maitrix.org</a>
21
  </p>
22
 
23
+ Voila is a new family of large voice-language foundation models aiming to lift human-AI interaction experiences to the next level. Breaking away from the constraints of traditional voice AI systems (high latency, loss of vocal nuances, and mechanical responses), Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture. This approach enables real-time, autonomous, and rich voice interactions, with latency as low as 195 ms, surpassing average human response times. Combining advanced voice and language modeling, Voila offers customizable, persona-driven engagements and excels in a range of audio tasks, from ASR and TTS to speech translation across six languages. With the online [web demo](https://huggingface.co/spaces/maitrix-org/Voila-demo), Voila invites you to explore a transformative, natural dialogue experience between humans and AI.
24
+
25
+ # ✨ Highlights
26
+ - ⭐ High-fidelity, low-latency, real-time streaming audio processing
27
+ - ⭐ Effective integration of voice and language modeling capabilities
28
+ - ⭐ Millions of pre-built and custom voices, fast voice switching during conversation
29
+ - ⭐ Unified model for various audio tasks
30
+
31
+ # 🎥 Video Demo
32
+ [![Voila Demo](https://img.youtube.com/vi/J27M9-g5KL0/0.jpg)](https://www.youtube.com/watch?v=J27M9-g5KL0)
33
+
34
+ # 🔥 Latest News!!
35
+
36
+ * April 28, 2025: 👋 We've released the inference code and model weights of Voila.
37
+
38
+ # ⚙️ Foundation Models
39
+
40
+ | Model | Description | Download Link |
41
+ |--------|-----------|-----------------|
42
+ |Voila-base|Voila base model|https://huggingface.co/maitrix-org/Voila-base|
43
+ |Voila-Chat|End-to-end audio chat model|https://huggingface.co/maitrix-org/Voila-chat|
44
+ |Voila-Autonomous (preview)|Full-duplex audio chat model|https://huggingface.co/maitrix-org/Voila-autonomous-preview|
45
+ |Voila-Audio-alpha|Empowering LLM with raw audio input|https://huggingface.co/maitrix-org/Voila-audio-alpha|
46
+ |Voila-Tokenizer|Audio tokenizer|https://huggingface.co/maitrix-org/Voila-Tokenizer|
47
+
48
+ ## Usage
49
+ ### CLI demo
50
+ ```shell
51
+ for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
52
+ # Text chat
53
+ python infer.py \
54
+ --model-name ${model_name} \
55
+ --instruction "" \
56
+ --input-text "Hello" \
57
+ --task-type chat_tito
58
+ # Voice chat
59
+ python infer.py \
60
+ --model-name ${model_name} \
61
+ --instruction "" \
62
+ --input-audio "examples/test1.mp3" \
63
+ --task-type chat_aiao
64
+ done
65
+
66
+ # Autonomous mode
67
+ python infer.py \
68
+ --model-name "maitrix-org/Voila-autonomous-preview" \
69
+ --instruction "" \
70
+ --input-audio "examples/test_autonomous1.mp3" \
71
+ --task-type chat_aiao_auto
72
  ```
73
 
74
+ ### Gradio demo
75
+ ```shell
76
+ python gradio_demo.py
77
+ ```
78
 
79
+ For more information, please refer to the [code repository](https://github.com/maitrix-org/Voila).
80
+
81
+ # 📁 Datasets
82
+ We publish the following two datasets: Voila Benchmark and Voila Voice Library. Voila Benchmark is a novel speech evaluation benchmark, while Voila Voice Library provides millions of pre-built and customizable voices.
83
+
84
+ | Dataset | Description | Download Link |
85
+ |--------|-----------|-----------------|
86
+ |Voila Benchmark| Speech evaluation benchmark | https://huggingface.co/datasets/maitrix-org/Voila-Benchmark |
87
+ |Voila Voice Library| Millions of pre-built voices | https://huggingface.co/datasets/maitrix-org/Voila-million-voice |
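Both datasets can be pulled with the 🤗 `datasets` library. The snippet below is a minimal sketch; the split names are assumptions, so check each dataset card for the actual configuration:

```python
from datasets import load_dataset

# Repo IDs come from the table above; the "train" split name is an assumption.
voila_benchmark = load_dataset("maitrix-org/Voila-Benchmark", split="train")
million_voice = load_dataset("maitrix-org/Voila-million-voice", split="train")

print(voila_benchmark)
print(million_voice)
```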
88
+
89
+ # 📊 Benchmark
90
+ ## 1. Voila Benchmark
91
+ We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8k. We compare our results with SpeechGPT and Moshi.
92
+ | Model | Voila Benchmark |
93
+ |-------|----------------|
94
+ |SpeechGPT| 13.29|
95
+ |Moshi | 11.45 |
96
+ |**Voila** | **30.56** |
97
+
98
+ _(higher is better)_
99
+
100
+ For detailed scores of Voila Benchmark on each specific domain, please refer to our paper (Section 5.1 "Evaluation of Voila Benchmark").
101
+ ## 2. Evaluation of ASR
102
+ As Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate the performance of ASR and TTS.
103
+ For ASR, we assess performance on the LibriSpeech test-clean dataset, using Word Error Rate (WER) as our metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. When both models utilize LibriSpeech training data, Voila achieves an impressive WER of 2.7%.
104
+ | Model | LibriSpeech test-clean (WER) |
105
+ |-------|-----------------------|
106
+ |Whisper large v2|2.7|
107
+ |Whisper large v3|2.2|
108
+ |FastConformer|3.6|
109
+ |VoxtLM |2.7|
110
+ |Moshi |5.7|
111
+ |**Voila (w/o LibriSpeech train split)** |**4.8**|
112
+ |**Voila (with LibriSpeech train split)**|**2.7**|
113
+
114
+ _(lower is better)_
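For reference, the WER numbers above compare reference and hypothesis transcripts; a metric of this kind can be computed with the `jiwer` package, as in the illustrative sketch below (this is not the evaluation script used for the table):

```python
from jiwer import wer  # pip install jiwer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over a lazy dog"]

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {wer(references, hypotheses):.3f}")
```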
115
+
116
+ ## 3. Evaluation of TTS
117
+ For TTS, we follow the evaluation protocol proposed in Vall-E, which involves transcribing the generated audio using HuBERT-Large.
118
+ Voila once again leads with a WER of 3.2% (and 2.8% when using LibriSpeech training data).
119
+
120
+ | Model | LibriSpeech test-clean (WER) |
121
+ |-------|-----------------------|
122
+ |YourTTS |7.7|
123
+ |Vall-E|5.9|
124
+ |Moshi|4.7|
125
+ |**Voila (w/o LibriSpeech train split)** |**3.2**|
126
+ |**Voila (with LibriSpeech train split)** |**2.8**|
127
+
128
+ _(lower is better)_
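To illustrate this protocol, generated speech can be transcribed with a CTC-finetuned HuBERT-Large checkpoint from 🤗 Transformers and the transcript then scored with WER against the input text. The sketch below shows only the transcription step and is not the exact pipeline behind the table above:

```python
import numpy as np
import torch
from transformers import AutoProcessor, HubertForCTC

# Public CTC-finetuned HuBERT-Large ASR checkpoint, used here purely for illustration.
processor = AutoProcessor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

def transcribe(waveform: np.ndarray, sampling_rate: int = 16000) -> str:
    """Transcribe a mono 16 kHz waveform so it can be WER-scored against the reference text."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```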
129
+
130
+ # 📝 Citation
131
+ If you find our work helpful, please cite us.
132
 
133
  ```
134
+ @article{voila2025,
135
+ author = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
136
+ title = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
137
+ eprint={2505.02707},
138
+ archivePrefix={arXiv},
139
+ primaryClass={cs.CL},
140
+ year = {2025}
 
141
  }
142
  ```