|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- genmo/mochi-1-preview |
|
base_model_relation: quantized |
|
pipeline_tag: text-to-video |
|
--- |
|
|
|
|
|
# Elastic model: mochi-1-preview. Fastest self-serving models.
|
|
|
Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized variants (selected as shown in the sketch after this list):
|
|
|
* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler. |
|
|
|
* __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
|
|
|
* __M__: Faster model, with accuracy degradation less than 1.5%. |
|
|
|
* __S__: The fastest model, with accuracy degradation less than 2%. |
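
For illustration, the variant is selected with the `mode` argument when loading a pipeline through `elastic_models` (the full recipe is in the Inference section below). This is a minimal sketch that assumes the same `from_pretrained` signature used there; the set of variants actually published can differ per model, and `<YOUR_HF_TOKEN>` is a placeholder:

```python
import torch
from elastic_models.diffusers import DiffusionPipeline

# Choose the speed/quality trade-off: "S", "M", "L" or "XL".
pipe = DiffusionPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    torch_dtype=torch.bfloat16,
    token="<YOUR_HF_TOKEN>",
    mode="XL",
)
```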
|
|
|
|
|
__Goals of Elastic Models:__ |
|
|
|
* Provide the fastest models and service for self-hosting. |
|
* Provide flexibility in cost vs quality selection for inference. |
|
* Provide clear quality and latency benchmarks. |
|
* Provide the interface of HF libraries (transformers and diffusers) with a single line of code change.

* Provide models that are pre-compiled for a wide range of hardware and require no JIT compilation.
|
|
|
> Note that the actual quality degradation varies from model to model. For instance, an S model may show as little as 0.5% degradation.
|
|
|
----- |
|
Prompt: Timelapse of urban cityscape transitioning from day to night |
|
|
|
Number of frames = 100 |
|
|
|
| S | XL | Original | |
|
|:-:|:-:|:-:| |
|
| <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/7D4jSJXgO0St8M34qPpTF.mp4"></video>| <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/ir7veWK4F6-n6vdMwEea5.mp4"></video>| <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/boWKOxsIFr8GHpC9sB96V.mp4"></video>| |
|
## Inference |
|
|
|
> Compiled versions are currently available only for 163-frame generations with height=480 and width=848. Other configurations are not yet available. Stay tuned for updates!
|
|
|
To run inference with our models, just replace the `diffusers` import with `elastic_models.diffusers`:
|
|
|
|
|
```python |
|
import torch
from elastic_models.diffusers import DiffusionPipeline
from diffusers.video_processor import VideoProcessor
from diffusers.utils import export_to_video

model_name = "genmo/mochi-1-preview"
hf_token = ""
device = torch.device("cuda")
dtype = torch.bfloat16

# mode selects the elastic variant: "S", "M", "L" or "XL".
pipe = DiffusionPipeline.from_pretrained(
    model_name, torch_dtype=dtype, token=hf_token, mode="S"
)
pipe.enable_vae_tiling()
pipe.to(device)

prompt = "Kitten eating a banana"

with torch.no_grad():
    torch.cuda.synchronize()

    # Encode the prompt once and reuse the embeddings in the pipeline call.
    (
        prompt_embeds,
        prompt_attention_mask,
        negative_prompt_embeds,
        negative_prompt_attention_mask,
    ) = pipe.encode_prompt(prompt=prompt)

    # Cast embeddings and attention masks to bf16 to match the transformer.
    if prompt_attention_mask is not None and isinstance(
        prompt_attention_mask, torch.Tensor
    ):
        prompt_attention_mask = prompt_attention_mask.to(dtype)
    if negative_prompt_attention_mask is not None and isinstance(
        negative_prompt_attention_mask, torch.Tensor
    ):
        negative_prompt_attention_mask = negative_prompt_attention_mask.to(dtype)
    prompt_embeds = prompt_embeds.to(dtype)
    negative_prompt_embeds = negative_prompt_embeds.to(dtype)

    # Run the denoising loop under bf16 autocast and return raw latents.
    with torch.autocast("cuda", torch.bfloat16, enabled=True):
        frames = pipe(
            prompt_embeds=prompt_embeds,
            prompt_attention_mask=prompt_attention_mask,
            negative_prompt_embeds=negative_prompt_embeds,
            negative_prompt_attention_mask=negative_prompt_attention_mask,
            guidance_scale=4.5,
            num_inference_steps=64,
            height=480,
            width=848,
            num_frames=163,
            generator=torch.Generator("cuda").manual_seed(0),
            output_type="latent",
            return_dict=False,
        )[0]

    # Un-normalize latents with the VAE statistics before decoding.
    video_processor = VideoProcessor(vae_scale_factor=8)
    has_latents_mean = (
        hasattr(pipe.vae.config, "latents_mean")
        and pipe.vae.config.latents_mean is not None
    )
    has_latents_std = (
        hasattr(pipe.vae.config, "latents_std")
        and pipe.vae.config.latents_std is not None
    )
    if has_latents_mean and has_latents_std:
        latents_mean = (
            torch.tensor(pipe.vae.config.latents_mean)
            .view(1, 12, 1, 1, 1)
            .to(frames.device, frames.dtype)
        )
        latents_std = (
            torch.tensor(pipe.vae.config.latents_std)
            .view(1, 12, 1, 1, 1)
            .to(frames.device, frames.dtype)
        )
        frames = frames * latents_std / pipe.vae.config.scaling_factor + latents_mean
    else:
        frames = frames / pipe.vae.config.scaling_factor

    # Decode latents to frames with autocast disabled (the VAE runs in its own dtype).
    with torch.autocast("cuda", torch.bfloat16, enabled=False):
        video = pipe.vae.decode(frames.to(pipe.vae.dtype), return_dict=False)[0]

    video = video_processor.postprocess_video(video)[0]
    torch.cuda.synchronize()

export_to_video(video, "mochi.mp4", fps=30)
|
``` |
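
The recipe above requests raw latents (`output_type="latent"`) so that the denoising loop runs under bf16 autocast while the VAE decode runs outside autocast in the VAE's own dtype. If you do not need that level of control, a shorter call that lets the pipeline decode for you should also work; this is a sketch, not the reference recipe, and it assumes the standard `diffusers` Mochi pipeline interface (an output with a `.frames` field) plus the `pipe` and `export_to_video` objects from the example above:

```python
frames = pipe(
    prompt="Kitten eating a banana",
    guidance_scale=4.5,
    num_inference_steps=64,
    height=480,
    width=848,
    num_frames=163,
    generator=torch.Generator("cuda").manual_seed(0),
).frames[0]

export_to_video(frames, "mochi_simple.mp4", fps=30)
```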
|
|
|
### Installation |
|
|
|
|
|
__System requirements:__ |
|
* GPUs: H100, B200 |
|
* CPU: AMD, Intel |
|
* Python: 3.10-3.12 |
|
|
|
|
|
To work with our models, just run these lines in your terminal:
|
|
|
```shell |
|
pip install thestage
pip install elastic_models[nvidia]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple

# or for blackwell support
pip install elastic_models[blackwell]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple
pip install -U --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -U --pre torchvision --index-url https://download.pytorch.org/whl/nightly/cu128

pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
pip install tensorrt==10.11.0.33 opencv-python==4.11.0.86 imageio-ffmpeg==0.6.0
|
``` |
|
|
|
Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token on your profile page. Set the API token as follows:
|
|
|
```shell |
|
thestage config set --api-token <YOUR_API_TOKEN> |
|
``` |
|
|
|
Congrats, now you can use accelerated models! |
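
As a quick sanity check of the environment, the following should run without errors; it only assumes that the `elastic_models` package installed above imports cleanly and that a supported GPU is visible:

```python
import torch
from elastic_models.diffusers import DiffusionPipeline  # import check only

assert torch.cuda.is_available(), "No CUDA device visible"
print(torch.__version__, torch.cuda.get_device_name(0))  # expect an H100 or B200
```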
|
|
|
---- |
|
|
|
## Benchmarks |
|
|
|
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models accelerated with our algorithms.
|
|
|
|
|
### Latency benchmarks |
|
|
|
Generation time, in seconds.
|
|
|
#### Number of frames: 100
|
|
|
|
|
| GPU | S | XL | Original | |
|
|----------|-----|-----|----------| |
|
| H100 | 144 | 163 | 311 | |
|
| B200 | 77 | 87 | 241 | |
|
|
|
#### Number of frames: 163
|
|
|
| GPU | S | XL | Original | |
|
|----------|-----|-----|----------| |
|
| H100 | 328 | 361 | 675 | |
|
| B200 | 173 | 189 | 545 | |
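
A sketch of how such a wall-clock measurement could be taken, assuming `pipe` from the inference example above is already loaded on the GPU and warmed up with one prior run. This times the denoising loop only (latent output), so it may not match the reported measurement protocol exactly:

```python
import time
import torch

torch.cuda.synchronize()
start = time.perf_counter()

_ = pipe(
    prompt="Timelapse of urban cityscape transitioning from day to night",
    guidance_scale=4.5,
    num_inference_steps=64,
    height=480,
    width=848,
    num_frames=163,
    output_type="latent",
    return_dict=False,
)

torch.cuda.synchronize()
print(f"Generation time: {time.perf_counter() - start:.1f} s")
```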
|
|
|
## Links |
|
|
|
* __Platform__: [app.thestage.ai](https://app.thestage.ai) |
|
|
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI) |
|
* __Contact email__: [email protected] |