linoyts (HF Staff) committed on
Commit 768ad92 · verified · 1 Parent(s): bdc0b06

Update README.md

add diffusers inference example

Files changed (1)
  1. README.md +210 -40
README.md CHANGED
@@ -10,7 +10,7 @@ pipeline_tag: text-to-video
10
  library_name: diffusers
11
  ---
12
 
13
- # LTX-Video Model Card
14
  This model card describes the LTX-Video model; the codebase is available [here](https://github.com/Lightricks/LTX-Video).
15
 
16
  LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content.
@@ -26,18 +26,15 @@ We provide a model for both text-to-video as well as image+text-to-video usecase
26
  | ![example9](./media/ltx-video_example_00009.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man with graying hair, a beard, and a gray shirt...</summary>A man with graying hair, a beard, and a gray shirt looks down and to his right, then turns his head to the left. The camera angle is a close-up, focused on the man's face. The lighting is dim, with a greenish tint. The scene appears to be real-life footage. Step</details> | ![example10](./media/ltx-video_example_00010.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A clear, turquoise river flows through a rocky canyon...</summary>A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom.The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pine trees, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility.</details> | ![example11](./media/ltx-video_example_00011.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man in a suit enters a room and speaks to two women...</summary>A man in a suit enters a room and speaks to two women sitting on a couch. The man, wearing a dark suit with a gold tie, enters the room from the left and walks towards the center of the frame. He has short gray hair, light skin, and a serious expression. He places his right hand on the back of a chair as he approaches the couch. Two women are seated on a light-colored couch in the background. The woman on the left wears a light blue sweater and has short blonde hair. The woman on the right wears a white sweater and has short blonde hair. The camera remains stationary, focusing on the man as he enters the room. The room is brightly lit, with warm tones reflecting off the walls and furniture. 
The scene appears to be from a film or television show.</details> | ![example12](./media/ltx-video_example_00012.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The waves crash against the jagged rocks of the shoreline...</summary>The waves crash against the jagged rocks of the shoreline, sending spray high into the air.The rocks are a dark gray color, with sharp edges and deep crevices. The water is a clear blue-green, with white foam where the waves break against the rocks. The sky is a light gray, with a few white clouds dotting the horizon.</details> |
27
  | ![example13](./media/ltx-video_example_00013.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The camera pans across a cityscape of tall buildings...</summary>The camera pans across a cityscape of tall buildings with a circular building in the center. The camera moves from left to right, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left, casting shadows from the buildings. The scene is computer-generated imagery.</details> | ![example14](./media/ltx-video_example_00014.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man walks towards a window, looks out, and then turns around...</summary>A man walks towards a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat over a red and gray scarf. He walks from left to right towards a window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is captured in real-life footage.</details> | ![example15](./media/ltx-video_example_00015.gif)<br><details style="max-width: 300px; margin: auto;"><summary>Two police officers in dark blue uniforms and matching hats...</summary>Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. 
Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show.</details> | ![example16](./media/ltx-video_example_00016.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with short brown hair, wearing a maroon sleeveless top...</summary>A woman with short brown hair, wearing a maroon sleeveless top and a silver necklace, walks through a room while talking, then a woman with pink hair and a white shirt appears in the doorway and yells. The first woman walks from left to right, her expression serious; she has light skin and her eyebrows are slightly furrowed. The second woman stands in the doorway, her mouth open in a yell; she has light skin and her eyes are wide. The room is dimly lit, with a bookshelf visible in the background. The camera follows the first woman as she walks, then cuts to a close-up of the second woman's face. The scene is captured in real-life footage.</details> |
28
 
29
- # Models & Workflows
30
 
31
- | Name | Notes | inference.py config | ComfyUI workflow (Recommended) |
32
- |----------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
33
- | ltxv-13b-0.9.7-dev | Highest quality, requires more VRAM | [ltxv-13b-0.9.7-dev.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-13b-0.9.7-dev.yaml) | [ltxv-13b-i2v-base.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/ltxv-13b-i2v-base.json) |
34
- | [ltxv-13b-0.9.7-mix](https://app.ltx.studio/motion-workspace?videoModel=ltxv-13b) | Mix ltxv-13b-dev and ltxv-13b-distilled in the same multi-scale rendering workflow for balanced speed-quality | N/A | [ltxv-13b-i2v-mix.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/ltxv13b-i2v-mixed-multiscale.json) |
35
- | [ltxv-13b-0.9.7-distilled](https://app.ltx.studio/motion-workspace?videoModel=ltxv) | Faster, less VRAM usage, slight quality reduction compared to 13b. Ideal for rapid iterations | [ltxv-13b-0.9.7-distilled.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-13b-0.9.7-dev.yaml) | [ltxv-13b-dist-i2v-base.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/13b-distilled/ltxv-13b-dist-i2v-base.json) |
36
- | [ltxv-13b-0.9.7-distilled-lora128](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-distilled-lora128.safetensors) | LoRA to make ltxv-13b-dev behave like the distilled model | N/A | N/A |
37
- | ltxv-13b-0.9.7-fp8 | Quantized version of ltxv-13b | Coming soon | [ltxv-13b-i2v-base-fp8.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/ltxv-13b-i2v-base-fp8.json) |
38
- | ltxv-13b-0.9.7-distilled-fp8 | Quantized version of ltxv-13b-distilled | Coming soon | [ltxv-13b-dist-fp8-i2v-base.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/13b-distilled/ltxv-13b-dist-fp8-i2v-base.json) |
39
- | ltxv-2b-0.9.6 | Good quality, lower VRAM requirement than ltxv-13b | [ltxv-2b-0.9.6-dev.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-2b-0.9.6-dev.yaml) | [ltxvideo-i2v.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/low_level/ltxvideo-i2v.json) |
40
- | ltxv-2b-0.9.6-distilled | 15× faster, real-time capable, fewer steps needed, no STG/CFG required | [ltxv-2b-0.9.6-distilled.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-2b-0.9.6-distilled.yaml) | [ltxvideo-i2v-distilled.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/low_level/ltxvideo-i2v-distilled.json) |
41
 
42
  ## Model Details
43
  - **Developed by:** Lightricks
@@ -56,9 +53,6 @@ You can use the model for purposes under the license:
56
  - 2B version 0.9.6-distilled [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-2b-0.9.6-distilled-04-25.license.txt)
57
  - 13B version 0.9.7-dev [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-dev.license.txt)
58
  - 13B version 0.9.7-dev-fp8 [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-dev-fp8.license.txt)
59
- - 13B version 0.9.7-distilled [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-distilled.license.txt)
60
- - 13B version 0.9.7-distilled-fp8 [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-distilled-fp8.license.txt)
61
- - 13B version 0.9.7-distilled-lora128 [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-distilled-lora128.license.txt)
62
  - Temporal upscaler version 0.9.7 [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-temporal-upscaler-0.9.7.license.txt)
63
  - Spatial upscaler version 0.9.7 [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-spatial-upscaler-0.9.7.license.txt)
64
 
@@ -70,8 +64,7 @@ You can use the model for purposes under the license:
70
 
71
  ### Online demo
72
  The model is accessible right away via the following links:
73
- - [LTX-Studio image-to-video (13B-mix)](https://app.ltx.studio/motion-workspace?videoModel=ltxv-13b)
74
- - [LTX-Studio image-to-video (13B distilled)](https://app.ltx.studio/motion-workspace?videoModel=ltxv)
75
  - [Fal.ai text-to-video](https://fal.ai/models/fal-ai/ltx-video)
76
  - [Fal.ai image-to-video](https://fal.ai/models/fal-ai/ltx-video/image-to-video)
77
  - [Replicate text-to-video and image-to-video](https://replicate.com/lightricks/ltx-video)
@@ -102,13 +95,13 @@ To use our model, please follow the inference code in [inference.py](https://git
102
  ##### For text-to-video generation:
103
 
104
  ```bash
105
- python inference.py --prompt "PROMPT" --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
106
  ```
107
 
108
  ##### For image-to-video generation:
109
 
110
  ```bash
111
- python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
112
  ```
113
 
114
  ### Diffusers 🧨
@@ -121,58 +114,235 @@ Make sure you install `diffusers` before trying out the examples below.
121
  pip install -U git+https://github.com/huggingface/diffusers
122
  ```
123
 
124
- Now, you can run the examples below:
125
 
 
126
  ```py
127
  import torch
128
- from diffusers import LTXPipeline
 
129
  from diffusers.utils import export_to_video
130
 
131
- pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
 
132
  pipe.to("cuda")
 
 
133
 
134
- prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
135
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
136
-
137
  video = pipe(
138
  prompt=prompt,
139
  negative_prompt=negative_prompt,
140
- width=704,
141
- height=480,
142
- num_frames=161,
143
- num_inference_steps=50,
144
  ).frames[0]
145
  export_to_video(video, "output.mp4", fps=24)
146
  ```
147
 
148
- For image-to-video:
149
 
150
  ```py
151
  import torch
152
- from diffusers import LTXImageToVideoPipeline
 
153
  from diffusers.utils import export_to_video, load_image
154
 
155
- pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
 
156
  pipe.to("cuda")
157
 
158
- image = load_image(
159
- "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
160
- )
161
- prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
162
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
163
 
164
  video = pipe(
165
- image=image,
166
  prompt=prompt,
167
  negative_prompt=negative_prompt,
168
- width=704,
169
- height=480,
170
- num_frames=161,
171
- num_inference_steps=50,
172
  ).frames[0]
173
  export_to_video(video, "output.mp4", fps=24)
 
174
  ```
175
 
 
176
  To learn more, check out the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
177
 
178
  Diffusers also supports directly loading from the original LTX checkpoints using the `from_single_file()` method. Check out [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video#loading-single-files) to learn more.
 
10
  library_name: diffusers
11
  ---
12
 
13
+ # LTX-Video 0.9.7 Distilled Model Card
14
  This model card describes the LTX-Video model; the codebase is available [here](https://github.com/Lightricks/LTX-Video).
15
 
16
  LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content.
 
26
  | ![example9](./media/ltx-video_example_00009.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man with graying hair, a beard, and a gray shirt...</summary>A man with graying hair, a beard, and a gray shirt looks down and to his right, then turns his head to the left. The camera angle is a close-up, focused on the man's face. The lighting is dim, with a greenish tint. The scene appears to be real-life footage. Step</details> | ![example10](./media/ltx-video_example_00010.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A clear, turquoise river flows through a rocky canyon...</summary>A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom.The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pine trees, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility.</details> | ![example11](./media/ltx-video_example_00011.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man in a suit enters a room and speaks to two women...</summary>A man in a suit enters a room and speaks to two women sitting on a couch. The man, wearing a dark suit with a gold tie, enters the room from the left and walks towards the center of the frame. He has short gray hair, light skin, and a serious expression. He places his right hand on the back of a chair as he approaches the couch. Two women are seated on a light-colored couch in the background. The woman on the left wears a light blue sweater and has short blonde hair. The woman on the right wears a white sweater and has short blonde hair. The camera remains stationary, focusing on the man as he enters the room. The room is brightly lit, with warm tones reflecting off the walls and furniture. 
The scene appears to be from a film or television show.</details> | ![example12](./media/ltx-video_example_00012.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The waves crash against the jagged rocks of the shoreline...</summary>The waves crash against the jagged rocks of the shoreline, sending spray high into the air.The rocks are a dark gray color, with sharp edges and deep crevices. The water is a clear blue-green, with white foam where the waves break against the rocks. The sky is a light gray, with a few white clouds dotting the horizon.</details> |
27
  | ![example13](./media/ltx-video_example_00013.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The camera pans across a cityscape of tall buildings...</summary>The camera pans across a cityscape of tall buildings with a circular building in the center. The camera moves from left to right, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left, casting shadows from the buildings. The scene is computer-generated imagery.</details> | ![example14](./media/ltx-video_example_00014.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man walks towards a window, looks out, and then turns around...</summary>A man walks towards a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat over a red and gray scarf. He walks from left to right towards a window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is captured in real-life footage.</details> | ![example15](./media/ltx-video_example_00015.gif)<br><details style="max-width: 300px; margin: auto;"><summary>Two police officers in dark blue uniforms and matching hats...</summary>Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. 
Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show.</details> | ![example16](./media/ltx-video_example_00016.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with short brown hair, wearing a maroon sleeveless top...</summary>A woman with short brown hair, wearing a maroon sleeveless top and a silver necklace, walks through a room while talking, then a woman with pink hair and a white shirt appears in the doorway and yells. The first woman walks from left to right, her expression serious; she has light skin and her eyebrows are slightly furrowed. The second woman stands in the doorway, her mouth open in a yell; she has light skin and her eyes are wide. The room is dimly lit, with a bookshelf visible in the background. The camera follows the first woman as she walks, then cuts to a close-up of the second woman's face. The scene is captured in real-life footage.</details> |
28
 
29
+ # Models
30
+
31
+ | Model | Version | Notes | inference.py config | ComfyUI workflow (Recommended) |
32
+ |--------------------|---------|---------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|------------------|
33
+ | ltxv-13b | 0.9.7 | Highest quality, requires more VRAM | [ltxv-13b-0.9.7-dev.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-13b-0.9.7-dev.yaml) | [ltxv-13b-i2v-base.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/ltxv-13b-i2v-base.json) |
34
+ | ltxv-13b-fp8 | 0.9.7 | Quantized model | Coming soon | [ltxv-13b-i2v-base-fp8.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/ltxv-13b-i2v-base-fp8.json) |
35
+ | ltxv-2b | 0.9.6 | Good quality, lower VRAM requirement than ltxv-13b | [ltxv-2b-0.9.6-dev.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-2b-0.9.6-dev.yaml) | [ltxvideo-i2v.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/low_level/ltxvideo-i2v.json) |
36
+ | ltxv-2b-distilled | 0.9.6 | 15× faster, real-time capable, fewer steps needed, no STG/CFG required | [ltxv-2b-0.9.6-distilled.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-2b-0.9.6-distilled.yaml) | [ltxvideo-i2v-distilled.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/low_level/ltxvideo-i2v-distilled.json) |
37
 
38
 
39
  ## Model Details
40
  - **Developed by:** Lightricks
 
53
  - 2B version 0.9.6-distilled [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-2b-0.9.6-distilled-04-25.license.txt)
54
  - 13B version 0.9.7-dev [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-dev.license.txt)
55
  - 13B version 0.9.7-dev-fp8 [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-dev-fp8.license.txt)
56
  - Temporal upscaler version 0.9.7 [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-temporal-upscaler-0.9.7.license.txt)
57
  - Spatial upscaler version 0.9.7 [license](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-spatial-upscaler-0.9.7.license.txt)
58
 
 
64
 
65
  ### Online demo
66
  The model is accessible right away via the following links:
67
+ - [LTX-Studio image-to-video](https://app.ltx.studio/ltx-video)
 
68
  - [Fal.ai text-to-video](https://fal.ai/models/fal-ai/ltx-video)
69
  - [Fal.ai image-to-video](https://fal.ai/models/fal-ai/ltx-video/image-to-video)
70
  - [Replicate text-to-video and image-to-video](https://replicate.com/lightricks/ltx-video)
 
95
  ##### For text-to-video generation:
96
 
97
  ```bash
98
+ python inference.py --prompt "PROMPT" --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config ltxv-13b-0.9.7-dev.yaml
99
  ```
100
 
101
  ##### For image-to-video generation:
102
 
103
  ```bash
104
+ python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config ltxv-13b-0.9.7-dev.yaml
105
  ```
106
 
107
  ### Diffusers 🧨
 
114
  pip install -U git+https://github.com/huggingface/diffusers
115
  ```
116
 
117
+ Now, you can run the examples below (note that the upsampling stage is optional but recommended):
118
 
119
+ ### For text-to-video:
120
  ```py
121
  import torch
122
+ from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
123
+ from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
124
  from diffusers.utils import export_to_video
125
 
126
+ pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
127
+ pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
128
  pipe.to("cuda")
129
+ pipe_upsample.to("cuda")
130
+ pipe.vae.enable_tiling()
131
 
132
+ prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
133
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
134
+ expected_height, expected_width = 704, 512
135
+ downscale_factor = 2 / 3
136
+ num_frames = 121
137
+
138
+ # Part 1. Generate video at smaller resolution
139
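+ downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+ # Round down to multiples of the VAE spatial compression ratio, as the other examples do
+ downscaled_height -= downscaled_height % pipe.vae_spatial_compression_ratio
+ downscaled_width -= downscaled_width % pipe.vae_spatial_compression_ratio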
140
+ latents = pipe(
141
+ conditions=None,
142
+ prompt=prompt,
143
+ negative_prompt=negative_prompt,
144
+ width=downscaled_width,
145
+ height=downscaled_height,
146
+ num_frames=num_frames,
147
+ num_inference_steps=7,
148
+ decode_timestep=0.05,
149
+ guidance_scale=1.0,
150
+ decode_noise_scale=0.025,
151
+ generator=torch.Generator().manual_seed(0),
152
+ output_type="latent",
153
+ ).frames
154
+
155
+ # Part 2. Upscale generated video using latent upsampler with fewer inference steps
156
+ # The available latent upsampler upscales the height/width by 2x
157
+ upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
158
+ upscaled_latents = pipe_upsample(
159
+ latents=latents,
160
+ output_type="latent"
161
+ ).frames
162
+
163
+ # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
164
  video = pipe(
165
  prompt=prompt,
166
  negative_prompt=negative_prompt,
167
+ width=upscaled_width,
168
+ height=upscaled_height,
169
+ num_frames=num_frames,
170
+ denoise_strength=0.3, # Re-runs only the tail of the 10-step schedule (about 30% of it)
171
+ num_inference_steps=10,
172
+ latents=upscaled_latents,
173
+ decode_timestep=0.05,
174
+ guidance_scale=1.0,
175
+ decode_noise_scale=0.025,
176
+ image_cond_noise_scale=0.025,
177
+ generator=torch.Generator().manual_seed(0),
178
+ output_type="pil",
179
  ).frames[0]
180
+
181
+ # Part 4. Downscale the video to the expected resolution
182
+ video = [frame.resize((expected_width, expected_height)) for frame in video]
183
+
184
  export_to_video(video, "output.mp4", fps=24)
185
  ```
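The resolution bookkeeping in the example above can be checked in isolation. The sketch below is plain Python with no `diffusers` dependency. The helper name `plan_resolutions` is hypothetical, and the default ratio of 32 is an assumption for illustration; in a loaded pipeline the value would come from the pipeline itself rather than being hard-coded.

```python
def plan_resolutions(expected_height, expected_width, downscale_factor=2 / 3, ratio=32):
    """Return (first-pass size, upsampled size) as (height, width) tuples.

    `ratio` stands in for the VAE spatial compression ratio; 32 is an
    assumed value for illustration, not read from a loaded pipeline.
    """
    # Part 1: shrink the target, then round down to a multiple of `ratio`.
    height = int(expected_height * downscale_factor)
    width = int(expected_width * downscale_factor)
    height -= height % ratio
    width -= width % ratio
    # Part 2: the latent upsampler doubles both height and width.
    return (height, width), (height * 2, width * 2)


if __name__ == "__main__":
    # Targets taken from the examples in this README: (height, width).
    for target in [(704, 512), (832, 480), (768, 1152)]:
        first_pass, upsampled = plan_resolutions(*target)
        print(target, "->", first_pass, "->", upsampled)
```

The upsampled size is generally larger than the requested one, which is why the examples finish by resizing the decoded frames back to `expected_width` by `expected_height`.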
186
 
187
+ ### For image-to-video:
188
 
189
  ```py
190
  import torch
191
+ from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
192
+ from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
193
  from diffusers.utils import export_to_video, load_image
194
 
195
+ pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
196
+ pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
197
  pipe.to("cuda")
198
+ pipe_upsample.to("cuda")
199
+ pipe.vae.enable_tiling()
200
+
201
+ image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
202
+ video = [image]
203
+ condition1 = LTXVideoCondition(video=video, frame_index=0)
204
 
205
+ prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
 
 
 
206
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
207
+ expected_height, expected_width = 832, 480
208
+ downscale_factor = 2 / 3
209
+ num_frames = 96
210
+
211
+ # Part 1. Generate video at smaller resolution
212
+ downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
213
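+ # Round down to multiples of the VAE spatial compression ratio (the helper used in the video-to-video example is not defined here)
+ downscaled_height -= downscaled_height % pipe.vae_spatial_compression_ratio
+ downscaled_width -= downscaled_width % pipe.vae_spatial_compression_ratio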
214
+ latents = pipe(
215
+ conditions=[condition1],
216
+ prompt=prompt,
217
+ negative_prompt=negative_prompt,
218
+ width=downscaled_width,
219
+ height=downscaled_height,
220
+ num_frames=num_frames,
221
+ num_inference_steps=7,
222
+ guidance_scale=1.0,
223
+ decode_timestep=0.05,
224
+ decode_noise_scale=0.025,
225
+ generator=torch.Generator().manual_seed(0),
226
+ output_type="latent",
227
+ ).frames
228
+
229
+ # Part 2. Upscale generated video using latent upsampler with fewer inference steps
230
+ # The available latent upsampler upscales the height/width by 2x
231
+ upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
232
+ upscaled_latents = pipe_upsample(
233
+ latents=latents,
234
+ output_type="latent"
235
+ ).frames
236
+
237
+ # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
238
+ video = pipe(
239
+ conditions=[condition1],
240
+ prompt=prompt,
241
+ negative_prompt=negative_prompt,
242
+ width=upscaled_width,
243
+ height=upscaled_height,
244
+ num_frames=num_frames,
245
+ denoise_strength=0.3, # Re-runs only the tail of the 10-step schedule (about 30% of it)
246
+ num_inference_steps=10,
247
+ guidance_scale=1.0,
248
+ latents=upscaled_latents,
249
+ decode_timestep=0.05,
250
+ decode_noise_scale=0.025,
251
+ image_cond_noise_scale=0.025,
252
+ generator=torch.Generator().manual_seed(0),
253
+ output_type="pil",
254
+ ).frames[0]
255
 
256
+ # Part 4. Downscale the video to the expected resolution
257
+ video = [frame.resize((expected_width, expected_height)) for frame in video]
258
+
259
+ export_to_video(video, "output.mp4", fps=24)
260
+
261
+ ```
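
The resolution bookkeeping in Parts 1, 2 and 4 above can be checked without loading any model. The following is a minimal sketch, not part of the pipeline API. It assumes a spatial compression ratio of 32 (the value `pipe.vae_spatial_compression_ratio` is expected to report for LTX-Video) and uses an illustrative 480×832 target that does not come from the example above. `round_down_to_multiple` and `plan_resolutions` are hypothetical helper names.

```python
def round_down_to_multiple(value: int, multiple: int) -> int:
    """Round value down to the nearest multiple (mirrors the VAE rounding helper)."""
    return value - (value % multiple)


def plan_resolutions(
    expected_height: int,
    expected_width: int,
    downscale_factor: float = 2 / 3,
    spatial_ratio: int = 32,   # assumption: LTX-Video VAE spatial compression ratio
    upsample_factor: int = 2,  # the latent upsampler doubles height and width
) -> dict:
    """Return the (height, width) used at each stage of the multi-scale flow."""
    # Part 1: shrink the target, then snap to a size the VAE can encode.
    down_h = round_down_to_multiple(int(expected_height * downscale_factor), spatial_ratio)
    down_w = round_down_to_multiple(int(expected_width * downscale_factor), spatial_ratio)
    # Parts 2 and 3: the upsampled latents decode to twice the downscaled size.
    up_h, up_w = down_h * upsample_factor, down_w * upsample_factor
    # Part 4: frames are resized back to the requested resolution.
    return {
        "downscaled": (down_h, down_w),
        "upscaled": (up_h, up_w),
        "final": (expected_height, expected_width),
    }


plan = plan_resolutions(480, 832)
print(plan)
```

Under these assumptions the upscaled size is larger than the target, which is why Part 4 resizes the frames down to the expected resolution.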
262
+
263
+ ### For video-to-video:
264
+
265
+ ```py
266
+ import torch
267
+ from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
268
+ from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
269
+ from diffusers.utils import export_to_video, load_video
270
+
271
+ pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
272
+ pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
273
+ pipe.to("cuda")
274
+ pipe_upsample.to("cuda")
275
+ pipe.vae.enable_tiling()
276
+
277
+ def round_to_nearest_resolution_acceptable_by_vae(height, width):
278
+ height = height - (height % pipe.vae_spatial_compression_ratio)
279
+ width = width - (width % pipe.vae_spatial_compression_ratio)
280
+ return height, width
281
+
282
+ video = load_video(
283
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
284
+ )[:21] # Use only the first 21 frames as conditioning
285
+ condition1 = LTXVideoCondition(video=video, frame_index=0)
286
+
287
+ prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
288
+ negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
289
+ expected_height, expected_width = 768, 1152
290
+ downscale_factor = 2 / 3
291
+ num_frames = 161
292
+
293
+ # Part 1. Generate video at smaller resolution
294
+ downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
295
+ downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
296
+ latents = pipe(
297
+ conditions=[condition1],
298
+ prompt=prompt,
299
+ negative_prompt=negative_prompt,
300
+ width=downscaled_width,
301
+ height=downscaled_height,
302
+ num_frames=num_frames,
303
+ num_inference_steps=7,
304
+ guidance_scale=1.0,
304
+ decode_timestep=0.05,
305
+ decode_noise_scale=0.025,
307
+ generator=torch.Generator().manual_seed(0),
308
+ output_type="latent",
309
+ ).frames
310
+
311
+ # Part 2. Upscale generated video using latent upsampler with fewer inference steps
312
+ # The available latent upsampler upscales the height/width by 2x
313
+ upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
314
+ upscaled_latents = pipe_upsample(
315
+ latents=latents,
316
+ output_type="latent"
317
+ ).frames
318
+
319
+ # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
320
  video = pipe(
321
+ conditions=[condition1],
322
  prompt=prompt,
323
  negative_prompt=negative_prompt,
324
+ width=upscaled_width,
325
+ height=upscaled_height,
326
+ num_frames=num_frames,
327
+ denoise_strength=0.3, # Effectively, about 3 inference steps out of 10
328
+ num_inference_steps=10,
329
+ guidance_scale=1.0,
330
+ latents=upscaled_latents,
331
+ decode_timestep=0.05,
332
+ decode_noise_scale=0.025,
333
+ image_cond_noise_scale=0.025,
334
+ generator=torch.Generator().manual_seed(0),
335
+ output_type="pil",
336
  ).frames[0]
337
+
338
+ # Part 4. Downscale the video to the expected resolution
339
+ video = [frame.resize((expected_width, expected_height)) for frame in video]
340
+
341
  export_to_video(video, "output.mp4", fps=24)
342
+
343
  ```
344
 
345
+
346
  To learn more, check out the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
347
 
348
  Diffusers also supports directly loading from the original LTX checkpoints using the `from_single_file()` method. Check out [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video#loading-single-files) to learn more.