Update README.md
README.md
CHANGED
@@ -2,77 +2,191 @@
|
|
2 |
license: apache-2.0
|
3 |
---
|
4 |
|
|
|
5 |
|
6 |
-
# Magi-1: Autoregressive Video Generation Are Scalable World Models
|
7 |
|
8 |
-
|
9 |
-
|
10 |
-
<
|
11 |
-
|
12 |
</div>
|
13 |
|
14 |
-
-----
|
15 |
|
16 |
-
|
17 |
|
|
|
18 |
|
19 |
-
|
20 |
|
21 |
-
|
22 |
|
23 |
|
24 |
-
##
|
25 |
|
26 |
-
We provide the pre-trained weights for
|
27 |
|
28 |
| Model                         | Link                                                         | Recommended Machine             |
|
29 |
| ----------------------------- | ------------------------------------------------------------ | ------------------------------- |
|
30 |
-
|
|
31 |
-
|
|
32 |
-
|
|
33 |
-
|
|
34 |
-
|
|
35 |
-
|
36 |
|
|
|
37 |
|
38 |
-
|
39 |
|
40 |
-
###
|
41 |
|
42 |
-
|
43 |
|
44 |
-
|
45 |
|
46 |
```bash
|
47 |
-
docker pull
|
48 |
|
49 |
docker run -it --gpus all --privileged --shm-size=32g --name magi --net=host --ipc=host --ulimit memlock=-1 --ulimit stack=6710886 sandai/magi:latest /bin/bash
|
50 |
```
|
51 |
|
52 |
-
**Run with
|
53 |
|
54 |
```bash
|
55 |
# Create a new environment
|
56 |
conda create -n magi python==3.10.12
|
|
|
57 |
# Install pytorch
|
58 |
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
|
|
|
59 |
# Install other dependencies
|
60 |
pip install -r requirements.txt
|
61 |
-
|
62 |
-
|
63 |
```
|
64 |
|
65 |
-
###
|
66 |
|
67 |
```bash
|
68 |
-
|
|
|
69 |
bash example/24B/run.sh
|
70 |
|
71 |
-
# Run 4.5B
|
72 |
bash example/4.5B/run.sh
|
73 |
```
|
74 |
|
75 |
-
|
76 |
|
77 |
| Config | Help |
|
78 |
| -------------- | ------------------------------------------------------------ |
|
@@ -87,13 +201,23 @@ bash example/4.5B/run.sh
|
|
87 |
| vae_pretrained | Path to load pretrained VAE model |
|
88 |
|
89 |
|
90 |
-
##
|
91 |
|
92 |
-
|
93 |
|
94 |
-
|
95 |
|
96 |
```
|
97 |
-
```
|
98 |
|
99 |
-
|
|
|
|
|
|
2 |
license: apache-2.0
|
3 |
---
|
4 |
|
5 |
+

|
6 |
|
|
|
7 |
|
8 |
+
-----
|
9 |
+
|
10 |
+
<p align="center">
|
11 |
+
<a href="https://static.magi.world/static/files/MAGI_1.pdf"><img alt="paper" src="https://img.shields.io/badge/Paper-arXiv-B31B1B?logo=arxiv"></a>
|
12 |
+
<a href="https://sand.ai"><img alt="blog" src="https://img.shields.io/badge/Sand%20AI-Homepage-333333.svg?logo="></a>
|
13 |
+
<a href="https://magi.sand.ai"><img alt="product" src="https://img.shields.io/badge/Magi-Product-logo.svg?logo=&color=DCBE7E"></a>
|
14 |
+
<a href="https://huggingface.co/sand-ai"><img alt="Hugging Face"
|
15 |
+
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Sand%20AI-ffc107?color=ffc107&logoColor=white"/></a>
|
16 |
+
<a href="https://x.com/SandAI_HQ"><img alt="Twitter Follow"
|
17 |
+
src="https://img.shields.io/badge/Twitter-Sand%20AI-white?logo=x&logoColor=white"/></a>
|
18 |
+
<a href="https://discord.gg/hgaZ86D7Wv"><img alt="Discord"
|
19 |
+
src="https://img.shields.io/badge/Discord-Sand%20AI-7289da?logo=discord&logoColor=white&color=7289da"/></a>
|
20 |
+
<a href="https://github.com/SandAI-org/Magi/LICENSE"><img alt="license" src="https://img.shields.io/badge/License-Apache2.0-green?logo=Apache"></a>
|
21 |
+
</p>
|
22 |
+
|
23 |
+
# MAGI-1: Autoregressive Video Generation at Scale
|
24 |
+
|
25 |
+
This repository contains the pre-trained weights and inference code for the MAGI-1 model. You can find more information in our [technical report](https://static.magi.world/static/files/MAGI_1.pdf) or directly create magic with MAGI-1 [here](http://sand.ai). 🚀✨
|
26 |
+
|
27 |
+
|
28 |
+
## 🔥🔥🔥 Latest News
|
29 |
+
|
30 |
+
- Apr 21, 2025: MAGI-1 is here 🎉. We've released the model weights and inference code — check it out!
|
31 |
+
|
32 |
+
|
33 |
+
## 1. About
|
34 |
+
|
35 |
+
We present MAGI-1, a world model that generates videos by ***autoregressively*** predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 further supports controllable generation via chunk-wise prompting, enabling smooth scene transitions, long-horizon synthesis, and fine-grained text-driven control. We believe MAGI-1 offers a promising direction for unifying high-fidelity video generation with flexible instruction control and real-time deployment.
|
36 |
+
|
37 |
+
<div align="center">
|
38 |
+
<video src="https://github.com/user-attachments/assets/5cfa90e0-f6ed-476b-a194-71f1d309903a
|
39 |
+
" width="70%" poster=""> </video>
|
40 |
</div>
|
41 |
|
|
|
42 |
|
43 |
+
## 2. Model Summary
|
44 |
+
|
45 |
+
### Transformer-based VAE
|
46 |
+
|
47 |
+
- Variational autoencoder (VAE) with transformer-based architecture, 8x spatial and 4x temporal compression.
|
48 |
+
- Fastest average decoding time and highly competitive reconstruction quality
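
The compression factors above translate directly into latent sizes. The following is a minimal sketch of that arithmetic only: it assumes plain integer division along each axis, and the 720x1280 resolution and the `latent_shape` helper are illustrative choices, not part of the MAGI-1 code. The real VAE may treat boundary frames or padding differently.

```python
def latent_shape(frames: int, height: int, width: int,
                 spatial_factor: int = 8, temporal_factor: int = 4) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumes each axis is simply divided by its compression factor
    (8x spatial, 4x temporal, as stated in the README).
    """
    if frames % temporal_factor or height % spatial_factor or width % spatial_factor:
        raise ValueError("dimensions must be divisible by the compression factors")
    return (frames // temporal_factor,
            height // spatial_factor,
            width // spatial_factor)


# One 24-frame chunk at an assumed 720x1280 resolution.
print(latent_shape(24, 720, 1280))  # (6, 90, 160)
```

Under these assumptions a 24-frame chunk shrinks by a factor of 8 x 8 x 4 = 256 in the number of spatio-temporal positions.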
|
49 |
+
|
50 |
+
### Auto-Regressive Denoising Algorithm
|
51 |
+
|
52 |
+
MAGI-1 is an autoregressive denoising video generation model that generates videos chunk by chunk instead of as a whole. Each chunk (24 frames) is denoised holistically, and generation of the next chunk begins as soon as the current one reaches a certain level of denoising. This pipeline design enables concurrent processing of up to four chunks for efficient video generation.
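
To make the pipelining concrete, here is a toy schedule. It is a sketch under stated assumptions rather than the actual scheduler: `steps_per_chunk=64` and `stagger=16` are hypothetical values chosen so that at most four chunks overlap, and the rule "chunk `i` starts at step `i * stagger`" stands in for "the current chunk reaches a certain level of denoising".

```python
def active_chunks(step: int, num_chunks: int,
                  steps_per_chunk: int, stagger: int) -> list[int]:
    """Indices of the chunks being denoised at a given global step.

    Chunk i starts at step i * stagger and needs steps_per_chunk steps.
    """
    return [i for i in range(num_chunks)
            if i * stagger <= step < i * stagger + steps_per_chunk]


def max_concurrency(num_chunks: int, steps_per_chunk: int, stagger: int) -> int:
    """Largest number of chunks in flight at any step of the schedule."""
    last_step = (num_chunks - 1) * stagger + steps_per_chunk
    return max(len(active_chunks(s, num_chunks, steps_per_chunk, stagger))
               for s in range(last_step))


# Hypothetical numbers: 6 chunks, 64 denoising steps each, new chunk every 16 steps.
print(active_chunks(0, 6, 64, 16))    # [0]
print(active_chunks(48, 6, 64, 16))   # [0, 1, 2, 3]
print(active_chunks(64, 6, 64, 16))   # [1, 2, 3, 4]
print(max_concurrency(6, 64, 16))     # 4
```

Earlier chunks are always further along in denoising than later ones, which is what lets a finished chunk be streamed out while its successors are still being refined.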
|
53 |
|
54 |
+

|
55 |
|
56 |
+
### Diffusion Model Architecture
|
57 |
|
58 |
+
MAGI-1 is built upon the Diffusion Transformer, incorporating several key innovations to enhance training efficiency and stability at scale. These advancements include Block-Causal Attention, Parallel Attention Block, QK-Norm and GQA, Sandwich Normalization in FFN, SwiGLU, and Softcap Modulation. For more details, please refer to the [technical report](https://static.magi.world/static/files/MAGI_1.pdf).
|
59 |
+
<div align="center">
|
60 |
+
<img src="figures/dit_architecture.png" alt="diffusion model architecture" width="500" />
|
61 |
+
</div>
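
As an illustration of the block-causal idea named above, the sketch below builds a generic block-causal mask: tokens attend freely within their own chunk and to all earlier chunks, but never to later ones. This is the textbook pattern, not the mask construction used in the MAGI-1 code; `tokens_per_chunk` and the function name are hypothetical.

```python
def block_causal_mask(num_chunks: int, tokens_per_chunk: int) -> list[list[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    A token sees every token in its own chunk and in all previous chunks.
    """
    n = num_chunks * tokens_per_chunk
    return [[(k // tokens_per_chunk) <= (q // tokens_per_chunk)
             for k in range(n)]
            for q in range(n)]


# 3 chunks of 2 tokens each, printed as 1 (visible) / 0 (masked).
for row in block_causal_mask(3, 2):
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 1 0 0 0 0
# 1 1 0 0 0 0
# 1 1 1 1 0 0
# 1 1 1 1 0 0
# 1 1 1 1 1 1
# 1 1 1 1 1 1
```

Compared with a token-level causal mask, the block structure keeps attention bidirectional inside a chunk while preserving causality across chunks.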
|
62 |
+
|
63 |
+
### Distillation Algorithm
|
64 |
+
|
65 |
+
We adopt a shortcut distillation approach that trains a single velocity-based model to support variable inference budgets. By enforcing a self-consistency constraint that equates one large step with two smaller steps, the model learns to approximate flow-matching trajectories across multiple step sizes. During training, step sizes are cyclically sampled from {64, 32, 16, 8}, and classifier-free guidance distillation is incorporated to preserve conditional alignment. This enables efficient inference with minimal loss in fidelity.
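
The self-consistency constraint can be sketched on scalars. The code below is an illustrative toy, not the training code: it assumes the common shortcut-model formulation in which the target for one step of size `2d` is the average velocity of two consecutive steps of size `d`, it reads {64, 32, 16, 8} as step counts (so `d = 1 / steps`), and the two toy models are hypothetical. In real training the two-step target would be computed without gradients.

```python
from typing import Callable

# (x, t, d) -> predicted velocity for a step of size d starting at time t.
Velocity = Callable[[float, float, float], float]


def two_step_target(model: Velocity, x: float, t: float, d: float) -> float:
    """Average velocity over two consecutive steps of size d."""
    v1 = model(x, t, d)
    x_mid = x + d * v1               # state after the first small step
    v2 = model(x_mid, t + d, d)
    return 0.5 * (v1 + v2)


def self_consistency_loss(model: Velocity, x: float, t: float, d: float) -> float:
    """Squared gap between one step of size 2d and two steps of size d."""
    return (model(x, t, 2 * d) - two_step_target(model, x, t, d)) ** 2


def constant_field(x: float, t: float, d: float) -> float:
    """Toy model that is trivially self-consistent."""
    return 3.0


def step_blind(x: float, t: float, d: float) -> float:
    """Toy model that ignores the step size, so large steps are inconsistent."""
    return x


for steps in (64, 32, 16, 8):        # assumed to be step counts
    d = 1.0 / steps
    print(steps,
          self_consistency_loss(constant_field, 2.0, 0.0, d),
          self_consistency_loss(step_blind, 2.0, 0.0, d))
```

The step-blind model's loss grows as the step count shrinks, which is the error the constraint is meant to train away.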
|
66 |
|
67 |
|
68 |
+
## 3. Model Zoo
|
69 |
|
70 |
+
We provide the pre-trained weights for MAGI-1, including the 24B and 4.5B models, as well as the corresponding distilled and distilled+quantized variants. Links to the model weights are listed in the table below.
|
71 |
|
72 |
| Model                         | Link                                                         | Recommended Machine             |
|
73 |
| ----------------------------- | ------------------------------------------------------------ | ------------------------------- |
|
74 |
+
| T5 | [T5](https://huggingface.co/sand-ai/MAGI-1/tree/main/ckpt/t5) | - |
|
75 |
+
| MAGI-1-VAE | [MAGI-1-VAE](https://huggingface.co/sand-ai/MAGI-1/tree/main/ckpt/vae) | - |
|
76 |
+
| MAGI-1-24B | [MAGI-1-24B](https://huggingface.co/sand-ai/MAGI-1/tree/main/ckpt/magi/24B_base) | H100/H800 \* 8 |
|
77 |
+
| MAGI-1-24B-distill | [MAGI-1-24B-distill](https://huggingface.co/sand-ai/MAGI-1/tree/main/ckpt/magi/24B_distill) | H100/H800 \* 8 |
|
78 |
+
| MAGI-1-24B-distill+fp8_quant | [MAGI-1-24B-distill+quant](https://huggingface.co/sand-ai/MAGI-1/tree/main/ckpt/magi/24B_distill_quant) | H100/H800 \* 4 or RTX 4090 \* 8 |
|
79 |
+
| MAGI-1-4.5B | MAGI-1-4.5B | RTX 4090 \* 1 |
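
One possible way to fetch only the folders you need is `snapshot_download` from the `huggingface_hub` package. The sketch below is hedged: the folder paths are copied from the links in the table, but the short keys (`"24B-distill"` and so on), the helper names, and the assumption that every run needs the T5 and VAE folders alongside the chosen model are illustrative, not taken from the repository.

```python
# Folder paths inside the sand-ai/MAGI-1 repo, taken from the table links.
CHECKPOINT_DIRS = {
    "t5": "ckpt/t5",
    "vae": "ckpt/vae",
    "24B": "ckpt/magi/24B_base",
    "24B-distill": "ckpt/magi/24B_distill",
    "24B-distill-quant": "ckpt/magi/24B_distill_quant",
}


def allow_patterns(model_key: str) -> list[str]:
    """Glob patterns for the chosen model plus T5 and VAE (an assumption)."""
    return [f"{CHECKPOINT_DIRS[key]}/*" for key in ("t5", "vae", model_key)]


def download(model_key: str, local_dir: str) -> str:
    """Download the selected folders; requires `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose
    return snapshot_download(
        repo_id="sand-ai/MAGI-1",
        allow_patterns=allow_patterns(model_key),
        local_dir=local_dir,
    )


print(allow_patterns("24B-distill"))
# To actually download (large files, needs network access):
# download("24B-distill", "./ckpt_download")
```

Restricting the download with `allow_patterns` avoids pulling every variant when only one is needed.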
|
80 |
+
|
81 |
+
## 4. Evaluation
|
82 |
+
|
83 |
+
### In-house Human Evaluation
|
84 |
|
85 |
+
MAGI-1 achieves state-of-the-art performance among open-source models (surpassing Wan-2.1 and significantly outperforming Hailuo and HunyuanVideo), particularly excelling in instruction following and motion quality, positioning it as a strong potential competitor to closed-source commercial models such as Kling.
|
86 |
|
87 |
+

|
88 |
|
89 |
+
### Physical Evaluation
|
90 |
|
91 |
+
Thanks to the natural advantages of its autoregressive architecture, Magi predicts physical behavior through video continuation with far higher precision than the other models in the comparison below.
|
92 |
|
93 |
+
| Model | Phys. IQ Score ↑ | Spatial IoU ↑ | Spatio Temporal ↑ | Weighted Spatial IoU ↑ | MSE ↓ |
|
94 |
+
|----------------|------------------|---------------|-------------------|-------------------------|--------|
|
95 |
+
| **V2V Models** | | | | | |
|
96 |
+
| **Magi (V2V)** | **56.02** | **0.367** | **0.270** | **0.304** | **0.005** |
|
97 |
+
| VideoPoet (V2V)| 29.50 | 0.204 | 0.164 | 0.137 | 0.010 |
|
98 |
+
| **I2V Models** | | | | | |
|
99 |
+
| **Magi (I2V)** | **30.23** | **0.203** | **0.151** | **0.154** | **0.012** |
|
100 |
+
| Kling1.6 (I2V) | 23.64 | 0.197 | 0.086 | 0.144 | 0.025 |
|
101 |
+
| VideoPoet (I2V)| 20.30 | 0.141 | 0.126 | 0.087 | 0.012 |
|
102 |
+
| Gen 3 (I2V) | 22.80 | 0.201 | 0.115 | 0.116 | 0.015 |
|
103 |
+
| Wan2.1 (I2V) | 20.89 | 0.153 | 0.100 | 0.112 | 0.023 |
|
104 |
+
| Sora (I2V) | 10.00 | 0.138 | 0.047 | 0.063 | 0.030 |
|
105 |
+
| **GroundTruth**| **100.0** | **0.678** | **0.535** | **0.577** | **0.002** |
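
To put the Phys. IQ scores in relative terms, the short calculation below uses only numbers from the table. The `relative_gain` helper is an illustrative name, and the choice of the runner-up in each group as the baseline is ours.

```python
def relative_gain(score: float, baseline: float) -> float:
    """Percentage improvement of score over baseline."""
    return (score - baseline) / baseline * 100.0


# Phys. IQ scores copied from the table above.
i2v_scores = {
    "Magi (I2V)": 30.23,
    "Kling1.6 (I2V)": 23.64,
    "Gen 3 (I2V)": 22.80,
    "Wan2.1 (I2V)": 20.89,
    "VideoPoet (I2V)": 20.30,
    "Sora (I2V)": 10.00,
}

ranked = sorted(i2v_scores, key=i2v_scores.get, reverse=True)
print(ranked[:2])                                # ['Magi (I2V)', 'Kling1.6 (I2V)']
print(round(relative_gain(30.23, 23.64), 1))     # 27.9  (I2V: Magi vs Kling1.6)
print(round(relative_gain(56.02, 29.50), 1))     # 89.9  (V2V: Magi vs VideoPoet)
```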
|
106 |
+
|
107 |
+
|
108 |
+
## 5. How to run
|
109 |
+
|
110 |
+
### Environment Preparation
|
111 |
+
|
112 |
+
We provide two ways to run MAGI-1, with the Docker environment being the recommended option.
|
113 |
+
|
114 |
+
**Run with Docker Environment (Recommended)**
|
115 |
|
116 |
```bash
|
117 |
+
docker pull sandai/magi:latest
|
118 |
|
119 |
docker run -it --gpus all --privileged --shm-size=32g --name magi --net=host --ipc=host --ulimit memlock=-1 --ulimit stack=6710886 sandai/magi:latest /bin/bash
|
120 |
```
|
121 |
|
122 |
+
**Run with Source Code**
|
123 |
|
124 |
```bash
|
125 |
# Create a new environment
|
126 |
conda create -n magi python==3.10.12
|
127 |
+
|
128 |
# Install pytorch
|
129 |
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
|
130 |
+
|
131 |
# Install other dependencies
|
132 |
pip install -r requirements.txt
|
133 |
+
|
134 |
+
# Install ffmpeg
|
135 |
+
conda install -c conda-forge ffmpeg=4.4
|
136 |
+
|
137 |
+
# Install MagiAttention; for more information, see https://github.com/SandAI-org/MagiAttention
|
138 |
+
git clone git@github.com:SandAI-org/MagiAttention.git
|
139 |
+
cd MagiAttention
|
140 |
+
git submodule update --init --recursive
|
141 |
+
pip install --no-build-isolation .
|
142 |
```
|
143 |
|
144 |
+
### Inference Command
|
145 |
+
|
146 |
+
To run the `MagiPipeline`, you can control the input and output by modifying the parameters in the `example/24B/run.sh` or `example/4.5B/run.sh` script. Below is an explanation of the key parameters:
|
147 |
+
|
148 |
+
#### Parameter Descriptions
|
149 |
+
|
150 |
+
- `--config_file`: Specifies the path to the configuration file, which contains model configuration parameters, e.g., `example/24B/24B_config.json`.
|
151 |
+
- `--mode`: Specifies the mode of operation. Available options are:
|
152 |
+
- `t2v`: Text to Video
|
153 |
+
- `i2v`: Image to Video
|
154 |
+
- `v2v`: Video to Video
|
155 |
+
- `--prompt`: The text prompt used for video generation, e.g., `"Good Boy"`.
|
156 |
+
- `--image_path`: Path to the image file, used only in `i2v` mode.
|
157 |
+
- `--prefix_video_path`: Path to the prefix video file, used only in `v2v` mode.
|
158 |
+
- `--output_path`: Path where the generated video file will be saved.
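
The rules above (which path flag belongs to which mode) can be expressed as a small argument parser. This is a hypothetical wrapper built with the standard `argparse` module, not the repository's own entry point. Only the flag names, modes, and example paths come from this README; which flags are mandatory and the `output.mp4` file name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags documented above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Hypothetical MAGI-1 launcher")
    parser.add_argument("--config_file", required=True)
    parser.add_argument("--mode", required=True, choices=["t2v", "i2v", "v2v"])
    parser.add_argument("--prompt", default="")
    parser.add_argument("--image_path")          # used only in i2v mode
    parser.add_argument("--prefix_video_path")   # used only in v2v mode
    parser.add_argument("--output_path", required=True)
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse argv and enforce the mode-specific inputs."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.mode == "i2v" and not args.image_path:
        parser.error("--image_path is required in i2v mode")
    if args.mode == "v2v" and not args.prefix_video_path:
        parser.error("--prefix_video_path is required in v2v mode")
    return args


args = parse_args([
    "--config_file", "example/24B/24B_config.json",
    "--mode", "i2v",
    "--prompt", "Good Boy",
    "--image_path", "example/assets/image.jpeg",
    "--output_path", "output.mp4",   # hypothetical output location
])
print(args.mode, args.image_path)
```

Validating the mode-specific path up front fails fast, before any model weights are loaded.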
|
159 |
+
|
160 |
+
#### Bash Script
|
161 |
|
162 |
```bash
|
163 |
+
#!/bin/bash
|
164 |
+
# Run 24B MAGI-1 model
|
165 |
bash example/24B/run.sh
|
166 |
|
167 |
+
# Run 4.5B MAGI-1 model
|
168 |
bash example/4.5B/run.sh
|
169 |
```
|
170 |
|
171 |
+
#### Customizing Parameters
|
172 |
+
|
173 |
+
You can modify the parameters in `run.sh` as needed. For example:
|
174 |
+
|
175 |
+
- To use the Image to Video mode (`i2v`), set `--mode` to `i2v` and provide `--image_path`:
|
176 |
+
```bash
|
177 |
+
--mode i2v \
|
178 |
+
--image_path example/assets/image.jpeg \
|
179 |
+
```
|
180 |
+
|
181 |
+
- To use the Video to Video mode (`v2v`), set `--mode` to `v2v` and provide `--prefix_video_path`:
|
182 |
+
```bash
|
183 |
+
--mode v2v \
|
184 |
+
--prefix_video_path example/assets/prefix_video.mp4 \
|
185 |
+
```
|
186 |
+
|
187 |
+
By adjusting these parameters, you can flexibly control the input and output to meet different requirements.
|
188 |
+
|
189 |
+
### Some Useful Configs (for config.json)
|
190 |
|
191 |
| Config | Help |
|
192 |
| -------------- | ------------------------------------------------------------ |
|
|
|
201 |
| vae_pretrained | Path to load pretrained VAE model |
|
202 |
|
203 |
|
204 |
+
## 6. License
|
205 |
|
206 |
+
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
|
207 |
|
208 |
+
## 7. Citation
|
209 |
|
210 |
+
If you find our code or model useful in your research, please cite:
|
211 |
+
|
212 |
+
```bibtex
|
213 |
+
@misc{magi1,
|
214 |
+
title={MAGI-1: Autoregressive Video Generation at Scale},
|
215 |
+
author={Sand-AI},
|
216 |
+
year={2025},
|
217 |
+
url={https://static.magi.world/static/files/MAGI_1.pdf},
|
218 |
+
}
|
219 |
```
|
|
|
220 |
|
221 |
+
## 8. Contact
|
222 |
+
|
223 |
+
If you have any questions, please feel free to raise an issue or contact us at [[email protected]]([email protected]) .
|