---
license: apache-2.0
pipeline_tag: text-to-speech
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---
# Dia: Open-Weight Text-to-Speech Dialogue Model (1.6B)

**Dia** is a 1.6B-parameter open-weight text-to-speech model developed by Nari Labs.  
It generates highly realistic *dialogue* directly from transcripts, including **nonverbal** cues such as `(laughs)` and `(sighs)`, and can be **conditioned on audio** to control emotional tone or keep a consistent voice.

Currently, Dia supports **English** and is optimized for GPU inference. This model is designed for research and educational purposes only.

---

## 🔥 Try It Out

- 🖥️ [ZeroGPU demo on Spaces](https://huggingface.co/spaces/nari-labs/Dia-1.6B)
- 📊 [Comparison demos](https://yummy-fir-7a4.notion.site/dia) with ElevenLabs and Sesame CSM-1B
- 🎧 Try voice remixing and conversations with a larger version: [join the waitlist](https://tally.so/r/meokbo)
- 💬 [Join the community on Discord](https://discord.gg/pgdB5YRe)

---

## 🧠 Capabilities

- Multispeaker support using `[S1]`, `[S2]`, etc.
- Rich nonverbal cue synthesis: `(laughs)`, `(clears throat)`, `(gasps)`, etc.
- Voice conditioning via a transcript plus audio example (see the sketch after the Python example below)
- Outputs high-fidelity `.mp3` files directly from text

Example input:
```text
[S1] Dia is an open weights text-to-dialogue model. [S2] You get full control over scripts and voices. (laughs)
```

---

## 🚀 Quickstart

Install via pip:

```bash
pip install git+https://github.com/nari-labs/dia.git
```
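
A quick sanity check that the install worked, using the same import as the Python example further down this card:

```python
# Minimal post-install check: the import should succeed without errors.
from dia.model import Dia
print("Dia import OK")
```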

Launch the Gradio UI (requires [`uv`](https://github.com/astral-sh/uv)):
```bash
git clone https://github.com/nari-labs/dia.git
cd dia && uv run app.py
```

Or manually set up:

```bash
git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install -e .
python app.py
```

---

## 🐍 Python Example

```python
from dia.model import Dia

# Load the pretrained model; compute_dtype selects the inference precision
# (see the performance table below).
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# [S1]/[S2] mark speaker turns; parenthesized words are nonverbal cues.
text = "[S1] Hello! This is Dia. [S2] Nice to meet you. (laughs)"
output = model.generate(text, use_torch_compile=True, verbose=True)
model.save_audio("output.mp3", output)
```
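
Voice conditioning (listed under Capabilities) reuses the same API with a reference clip. The sketch below is illustrative only: the `audio_prompt` argument and the convention of prepending the reference clip's transcript are assumptions based on the repository's example scripts, and `reference.mp3` is a hypothetical file.

```python
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Assumption: generate() accepts an audio_prompt and expects the prompt's
# transcript to be prepended to the text, per the repo's example scripts.
prompt_transcript = "[S1] Transcript of the reference audio clip."
text = "[S2] New dialogue to generate in a matching voice. (laughs)"

output = model.generate(
    prompt_transcript + " " + text,
    audio_prompt="reference.mp3",  # hypothetical path to the reference clip
    use_torch_compile=True,
    verbose=True,
)
model.save_audio("conditioned.mp3", output)
```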

> Coming soon: PyPI package and CLI support

---

## 💻 Inference Performance (on RTX 4090)

| Precision | Realtime factor (with compile) | Realtime factor (without compile) | VRAM usage |
|-----------|--------------------------------|-----------------------------------|------------|
| bfloat16  | 2.1×                           | 1.5×                              | ~10 GB     |
| float16   | 2.2×                           | 1.3×                              | ~10 GB     |
| float32   | 1.0×                           | 0.9×                              | ~13 GB     |
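
A realtime factor above 1× means audio is generated faster than it plays back. The precision row is chosen via the `compute_dtype` argument shown in the Python example; a minimal sketch, assuming the remaining rows map to the analogous string names:

```python
from dia.model import Dia

# Assumption: "bfloat16" and "float32" are accepted the same way as the
# "float16" value used in the Python example above.
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="bfloat16")
```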

> CPU support and quantized version coming soon.

---

## ⚠️ Ethical Use

This model is for **research and educational use only**. Prohibited uses include:

- Impersonating individuals (e.g., cloning real voices without consent)
- Generating misleading or malicious content
- Illegal or harmful activities

Please use responsibly.

---

## 📄 License

Apache 2.0  
See the [LICENSE](https://github.com/nari-labs/dia/blob/main/LICENSE) for details.

---

## 🛠️ Roadmap

- 🔧 Inference speed optimization
- 💾 CPU & quantized model support
- 📦 PyPI + CLI tools