---
license: apache-2.0
base_model:
- genmo/mochi-1-preview
base_model_relation: quantized
pipeline_tag: text-to-video
---

# Elastic model: mochi-1-preview. Fastest self-serving models.

Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
* __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
* __M__: Faster model, with accuracy degradation of less than 1.5%.
* __S__: The fastest model, with accuracy degradation of less than 2%.

__Goals of Elastic Models:__

* Provide the fastest models and service for self-hosting.
* Provide flexibility in the cost-vs-quality trade-off for inference.
* Provide clear quality and latency benchmarks.
* Provide the interface of the HF libraries (transformers and diffusers) with a single line of code.
* Provide models supported on a wide range of hardware, pre-compiled and requiring no JIT.

> It's important to note that specific quality degradation can vary from model to model. For instance, an S model can show as little as 0.5% degradation.

-----

Prompt: Timelapse of urban cityscape transitioning from day to night

Number of frames = 100

| S | XL | Original |
|:-:|:-:|:-:|
| | | |

## Inference

> Compiled versions are currently available only for 163-frame generations with height=480 and width=848. Other configurations are not yet available. Stay tuned for updates!

To run inference with our models, you just need to replace the `diffusers` import with `elastic_models.diffusers`:

```python
import torch
from elastic_models.diffusers import DiffusionPipeline
from diffusers.video_processor import VideoProcessor
from diffusers.utils import export_to_video

model_name = "genmo/mochi-1-preview"
hf_token = ""
device = torch.device("cuda")
dtype = torch.bfloat16

pipe = DiffusionPipeline.from_pretrained(
    model_name,
    torch_dtype=dtype,
    token=hf_token,
    mode="S"
)
pipe.enable_vae_tiling()
pipe.to(device)

prompt = "Kitten eating a banana"

with torch.no_grad():
    torch.cuda.synchronize()

    (
        prompt_embeds,
        prompt_attention_mask,
        negative_prompt_embeds,
        negative_prompt_attention_mask,
    ) = pipe.encode_prompt(prompt=prompt)

    if prompt_attention_mask is not None and isinstance(
        prompt_attention_mask, torch.Tensor
    ):
        prompt_attention_mask = prompt_attention_mask.to(dtype)
    if negative_prompt_attention_mask is not None and isinstance(
        negative_prompt_attention_mask, torch.Tensor
    ):
        negative_prompt_attention_mask = negative_prompt_attention_mask.to(dtype)
    prompt_embeds = prompt_embeds.to(dtype)
    negative_prompt_embeds = negative_prompt_embeds.to(dtype)

    with torch.autocast("cuda", torch.bfloat16, enabled=True):
        frames = pipe(
            prompt_embeds=prompt_embeds,
            prompt_attention_mask=prompt_attention_mask,
            negative_prompt_embeds=negative_prompt_embeds,
            negative_prompt_attention_mask=negative_prompt_attention_mask,
            guidance_scale=4.5,
            num_inference_steps=64,
            height=480,
            width=848,
            num_frames=163,
            generator=torch.Generator("cuda").manual_seed(0),
            output_type="latent",
            return_dict=False,
        )[0]

    video_processor = VideoProcessor(vae_scale_factor=8)
    has_latents_mean = (
        hasattr(pipe.vae.config, "latents_mean")
        and pipe.vae.config.latents_mean is not None
    )
    has_latents_std = (
        hasattr(pipe.vae.config, "latents_std")
        and pipe.vae.config.latents_std is not None
    )
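    # The pipeline above returned normalized latents (output_type="latent").
    # If the VAE config provides per-channel statistics, de-normalize with
    # latents_mean / latents_std together with the scaling factor before
    # decoding; otherwise only the scaling factor is applied.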
    if has_latents_mean and has_latents_std:
        latents_mean = (
            torch.tensor(pipe.vae.config.latents_mean)
            .view(1, 12, 1, 1, 1)
            .to(frames.device, frames.dtype)
        )
        latents_std = (
            torch.tensor(pipe.vae.config.latents_std)
            .view(1, 12, 1, 1, 1)
            .to(frames.device, frames.dtype)
        )
        frames = frames * latents_std / pipe.vae.config.scaling_factor + latents_mean
    else:
        frames = frames / pipe.vae.config.scaling_factor

    with torch.autocast("cuda", torch.bfloat16, enabled=False):
        video = pipe.vae.decode(frames.to(pipe.vae.dtype), return_dict=False)[0]
    video = video_processor.postprocess_video(video)[0]
    torch.cuda.synchronize()

export_to_video(video, "mochi.mp4", fps=30)
```

### Installation

__System requirements:__

* GPUs: H100, B200
* CPU: AMD, Intel
* Python: 3.10-3.12

To work with our models, just run these commands in your terminal:

```shell
pip install thestage
pip install elastic_models[nvidia]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple

# or for Blackwell support
pip install elastic_models[blackwell]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple

pip install -U --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -U --pre torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
pip install tensorrt==10.11.0.33 opencv-python==4.11.0.86 imageio-ffmpeg==0.6.0
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms.

### Latency benchmarks

Generation time, in seconds.

#### Number of frames: 100

| GPU  | S   | XL  | Original |
|------|-----|-----|----------|
| H100 | 144 | 163 | 311      |
| B200 | 77  | 87  | 241      |

#### Number of frames: 163

| GPU  | S   | XL  | Original |
|------|-----|-----|----------|
| H100 | 328 | 361 | 675      |
| B200 | 173 | 189 | 545      |

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai