---
license: apache-2.0
base_model:
- genmo/mochi-1-preview
base_model_relation: quantized
pipeline_tag: text-to-video
---
# Elastic model: mochi-1-preview. The fastest models for self-serving.
Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized models (a short loading sketch follows the note below):
* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
* __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
* __M__: Faster model, with accuracy degradation less than 1.5%.
* __S__: The fastest model, with accuracy degradation less than 2%.
__Goals of Elastic Models:__
* Provide the fastest models and service for self-hosting.
* Provide flexibility in cost vs quality selection for inference.
* Provide clear quality and latency benchmarks.
* Provide the interface of HF libraries (transformers and diffusers) with a single line of code change.
* Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
> Note that the actual quality degradation can vary from model to model. An S model, for instance, may show as little as 0.5% degradation.
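The tier is selected at load time through the `mode` argument of `from_pretrained`. The snippet below is a minimal sketch; the full inference example follows in the next section.
```python
# Minimal sketch: the Elastic tier is chosen via the `mode` argument.
# Swap "S" for "XL", "L" or "M" to trade speed for quality.
import torch
from elastic_models.diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "genmo/mochi-1-preview", torch_dtype=torch.bfloat16, mode="S"
)
```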
-----
Prompt: Timelapse of urban cityscape transitioning from day to night
Number of frames = 100
| S | XL | Original |
|:-:|:-:|:-:|
| <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/7D4jSJXgO0St8M34qPpTF.mp4"></video>| <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/ir7veWK4F6-n6vdMwEea5.mp4"></video>| <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/boWKOxsIFr8GHpC9sB96V.mp4"></video>|
## Inference
> Compiled versions are currently available only for 163-frame generations with height=480 and width=848. Other configurations are not yet supported. Stay tuned for updates!
To run inference with our models, simply replace the `diffusers` import with `elastic_models.diffusers`:
```python
import torch
from elastic_models.diffusers import DiffusionPipeline
from diffusers.video_processor import VideoProcessor
from diffusers.utils import export_to_video

model_name = "genmo/mochi-1-preview"
hf_token = ""
device = torch.device("cuda")
dtype = torch.bfloat16

# `mode` selects the Elastic tier: "XL", "L", "M" or "S".
pipe = DiffusionPipeline.from_pretrained(
    model_name, torch_dtype=dtype, token=hf_token, mode="S"
)
pipe.enable_vae_tiling()
pipe.to(device)

prompt = "Kitten eating a banana"

with torch.no_grad():
    torch.cuda.synchronize()

    # Encode the prompt once; the pipeline is then called with embeddings.
    (
        prompt_embeds,
        prompt_attention_mask,
        negative_prompt_embeds,
        negative_prompt_attention_mask,
    ) = pipe.encode_prompt(prompt=prompt)

    if prompt_attention_mask is not None and isinstance(
        prompt_attention_mask, torch.Tensor
    ):
        prompt_attention_mask = prompt_attention_mask.to(dtype)
    if negative_prompt_attention_mask is not None and isinstance(
        negative_prompt_attention_mask, torch.Tensor
    ):
        negative_prompt_attention_mask = negative_prompt_attention_mask.to(dtype)
    prompt_embeds = prompt_embeds.to(dtype)
    negative_prompt_embeds = negative_prompt_embeds.to(dtype)

    # Denoise in bfloat16 and return latents; the VAE decode happens below.
    with torch.autocast("cuda", torch.bfloat16, enabled=True):
        frames = pipe(
            prompt_embeds=prompt_embeds,
            prompt_attention_mask=prompt_attention_mask,
            negative_prompt_embeds=negative_prompt_embeds,
            negative_prompt_attention_mask=negative_prompt_attention_mask,
            guidance_scale=4.5,
            num_inference_steps=64,
            height=480,
            width=848,
            num_frames=163,
            generator=torch.Generator("cuda").manual_seed(0),
            output_type="latent",
            return_dict=False,
        )[0]

    # Un-normalize the latents before decoding.
    video_processor = VideoProcessor(vae_scale_factor=8)
    has_latents_mean = (
        hasattr(pipe.vae.config, "latents_mean")
        and pipe.vae.config.latents_mean is not None
    )
    has_latents_std = (
        hasattr(pipe.vae.config, "latents_std")
        and pipe.vae.config.latents_std is not None
    )
    if has_latents_mean and has_latents_std:
        latents_mean = (
            torch.tensor(pipe.vae.config.latents_mean)
            .view(1, 12, 1, 1, 1)
            .to(frames.device, frames.dtype)
        )
        latents_std = (
            torch.tensor(pipe.vae.config.latents_std)
            .view(1, 12, 1, 1, 1)
            .to(frames.device, frames.dtype)
        )
        frames = frames * latents_std / pipe.vae.config.scaling_factor + latents_mean
    else:
        frames = frames / pipe.vae.config.scaling_factor

    # Decode with autocast disabled so the VAE runs in its own dtype.
    with torch.autocast("cuda", torch.bfloat16, enabled=False):
        video = pipe.vae.decode(frames.to(pipe.vae.dtype), return_dict=False)[0]
    video = video_processor.postprocess_video(video)[0]

    torch.cuda.synchronize()

export_to_video(video, "mochi.mp4", fps=30)
```
### Installation
__System requirements:__
* GPUs: H100, B200
* CPU: AMD, Intel
* Python: 3.10-3.12
To work with our models, just run these lines in your terminal:
```shell
pip install thestage
pip install elastic_models[nvidia] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple

# or, for Blackwell support
pip install elastic_models[blackwell] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple

pip install -U --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -U --pre torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall -y apex
pip install tensorrt==10.11.0.33 opencv-python==4.11.0.86 imageio-ffmpeg==0.6.0
```
Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token on your profile page. Set up the API token as follows:
```shell
thestage config set --api-token <YOUR_API_TOKEN>
```
Congrats, now you can use accelerated models!
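As an optional sanity check (a sketch, not part of the official setup), you can verify that the package imports and a supported GPU is visible before running the full pipeline:
```python
# Optional sanity check (not part of the official setup): confirm that
# elastic_models imports and that a CUDA GPU is visible.
import torch
import elastic_models.diffusers  # noqa: F401

assert torch.cuda.is_available(), "A CUDA GPU (H100 or B200) is required."
print("elastic_models OK, GPU:", torch.cuda.get_device_name(0))
```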
----
## Benchmarks
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models accelerated with our algorithms.
### Latency benchmarks
Generation time in seconds. A minimal timing sketch follows the tables below.
### Number of frames: 100
| GPU | S | XL | Original |
|----------|-----|-----|----------|
| H100 | 144 | 163 | 311 |
| B200 | 77 | 87 | 241 |
### Number of frames: 163
| GPU | S | XL | Original |
|----------|-----|-----|----------|
| H100 | 328 | 361 | 675 |
| B200 | 173 | 189 | 545 |
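The exact benchmarking harness is not published here; the sketch below shows how a comparable wall-clock measurement could be taken, assuming `pipe` has been loaded as in the inference example above.
```python
# Hypothetical timing sketch: wall-clock time of one 163-frame generation,
# assuming `pipe` was created as in the inference example above.
import time
import torch

torch.cuda.synchronize()
start = time.perf_counter()
_ = pipe(
    prompt="Timelapse of urban cityscape transitioning from day to night",
    guidance_scale=4.5,
    num_inference_steps=64,
    height=480,
    width=848,
    num_frames=163,
)
torch.cuda.synchronize()
print(f"Generation took {time.perf_counter() - start:.1f} s")
```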
## Links
* __Platform__: [app.thestage.ai](https://app.thestage.ai)
<!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: [email protected] |