linoyts HF Staff committed on
Commit
3b7dce7
·
verified ·
1 Parent(s): 3a04336

Update README.md

Files changed (1)
  1. README.md +42 -190
README.md CHANGED
@@ -1,8 +1,10 @@
1
  ---
2
  tags:
3
  - ltx-video
4
- - image-to-video
5
- pinned: true
 
 
6
  language:
7
  - en
8
  license: other
@@ -10,11 +12,15 @@ pipeline_tag: text-to-video
10
  library_name: diffusers
11
  ---
12
 
13
- # LTX-Video 0.9.7 Distilled Model Card
14
- This model card focuses on the model associated with the LTX-Video model, codebase available [here](https://github.com/Lightricks/LTX-Video).
 
 
15
 
16
  LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content.
17
- We provide a model for both text-to-video as well as image+text-to-video usecases
 
 
18
 
19
  <img src="./media/trailer.gif" alt="trailer" width="512">
20
 
@@ -26,24 +32,18 @@ We provide a model for both text-to-video as well as image+text-to-video usecase
26
  | ![example9](./media/ltx-video_example_00009.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man with graying hair, a beard, and a gray shirt...</summary>A man with graying hair, a beard, and a gray shirt looks down and to his right, then turns his head to the left. The camera angle is a close-up, focused on the man's face. The lighting is dim, with a greenish tint. The scene appears to be real-life footage. Step</details> | ![example10](./media/ltx-video_example_00010.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A clear, turquoise river flows through a rocky canyon...</summary>A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom.The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pine trees, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility.</details> | ![example11](./media/ltx-video_example_00011.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man in a suit enters a room and speaks to two women...</summary>A man in a suit enters a room and speaks to two women sitting on a couch. The man, wearing a dark suit with a gold tie, enters the room from the left and walks towards the center of the frame. He has short gray hair, light skin, and a serious expression. He places his right hand on the back of a chair as he approaches the couch. Two women are seated on a light-colored couch in the background. The woman on the left wears a light blue sweater and has short blonde hair. The woman on the right wears a white sweater and has short blonde hair. The camera remains stationary, focusing on the man as he enters the room. The room is brightly lit, with warm tones reflecting off the walls and furniture. 
The scene appears to be from a film or television show.</details> | ![example12](./media/ltx-video_example_00012.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The waves crash against the jagged rocks of the shoreline...</summary>The waves crash against the jagged rocks of the shoreline, sending spray high into the air.The rocks are a dark gray color, with sharp edges and deep crevices. The water is a clear blue-green, with white foam where the waves break against the rocks. The sky is a light gray, with a few white clouds dotting the horizon.</details> |
27
  | ![example13](./media/ltx-video_example_00013.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The camera pans across a cityscape of tall buildings...</summary>The camera pans across a cityscape of tall buildings with a circular building in the center. The camera moves from left to right, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left, casting shadows from the buildings. The scene is computer-generated imagery.</details> | ![example14](./media/ltx-video_example_00014.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man walks towards a window, looks out, and then turns around...</summary>A man walks towards a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat over a red and gray scarf. He walks from left to right towards a window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is captured in real-life footage.</details> | ![example15](./media/ltx-video_example_00015.gif)<br><details style="max-width: 300px; margin: auto;"><summary>Two police officers in dark blue uniforms and matching hats...</summary>Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. 
Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show.</details> | ![example16](./media/ltx-video_example_00016.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with short brown hair, wearing a maroon sleeveless top...</summary>A woman with short brown hair, wearing a maroon sleeveless top and a silver necklace, walks through a room while talking, then a woman with pink hair and a white shirt appears in the doorway and yells. The first woman walks from left to right, her expression serious; she has light skin and her eyebrows are slightly furrowed. The second woman stands in the doorway, her mouth open in a yell; she has light skin and her eyes are wide. The room is dimly lit, with a bookshelf visible in the background. The camera follows the first woman as she walks, then cuts to a close-up of the second woman's face. The scene is captured in real-life footage.</details> |
28
 
29
- # Models & Workflows
30
 
31
- | Name | Notes | inference.py config | ComfyUI workflow (Recommended) |
32
- |----------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
33
- | ltxv-13b-0.9.7-dev | Highest quality, requires more VRAM | [ltxv-13b-0.9.7-dev.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-13b-0.9.7-dev.yaml) | [ltxv-13b-i2v-base.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/ltxv-13b-i2v-base.json) |
34
- | [ltxv-13b-0.9.7-mix](https://app.ltx.studio/motion-workspace?videoModel=ltxv-13b) | Mix ltxv-13b-dev and ltxv-13b-distilled in the same multi-scale rendering workflow for balanced speed-quality | N/A | [ltxv-13b-i2v-mix.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/ltxv13b-i2v-mixed-multiscale.json) |
35
- | [ltxv-13b-0.9.7-distilled](https://app.ltx.studio/motion-workspace?videoModel=ltxv) | Faster, less VRAM usage, slight quality reduction compared to 13b. Ideal for rapid iterations | [ltxv-13b-0.9.7-distilled.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-13b-0.9.7-dev.yaml) | [ltxv-13b-dist-i2v-base.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/13b-distilled/ltxv-13b-dist-i2v-base.json) |
36
- | [ltxv-13b-0.9.7-distilled-lora128](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltxv-13b-0.9.7-distilled-lora128.safetensors) | LoRA to make ltxv-13b-dev behave like the distilled model | N/A | N/A |
37
- | ltxv-13b-0.9.7-fp8 | Quantized version of ltxv-13b | Coming soon | [ltxv-13b-i2v-base-fp8.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/ltxv-13b-i2v-base-fp8.json) |
38
- | ltxv-13b-0.9.7-distilled-fp8 | Quantized version of ltxv-13b-distilled | Coming soon | [ltxv-13b-dist-fp8-i2v-base.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/13b-distilled/ltxv-13b-dist-fp8-i2v-base.json) |
39
- | ltxv-2b-0.9.6 | Good quality, lower VRAM requirement than ltxv-13b | [ltxv-2b-0.9.6-dev.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-2b-0.9.6-dev.yaml) | [ltxvideo-i2v.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/low_level/ltxvideo-i2v.json) |
40
- | ltxv-2b-0.9.6-distilled | 15× faster, real-time capable, fewer steps needed, no STG/CFG required | [ltxv-2b-0.9.6-distilled.yaml](https://github.com/Lightricks/LTX-Video/blob/main/configs/ltxv-2b-0.9.6-distilled.yaml) | [ltxvideo-i2v-distilled.json](https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/low_level/ltxvideo-i2v-distilled.json) |
41
 
42
  ## Model Details
43
  - **Developed by:** Lightricks
44
- - **Model type:** Diffusion-based text-to-video and image-to-video generation model
45
- - **Language(s):** English
46
-
 
47
 
48
  ## Usage
49
 
@@ -70,8 +70,7 @@ You can use the model for purposes under the license:
70
 
71
  ### Online demo
72
  The model is accessible right away via the following links:
73
- - [LTX-Studio image-to-video (13B-mix)](https://app.ltx.studio/motion-workspace?videoModel=ltxv-13b)
74
- - [LTX-Studio image-to-video (13B distilled)](https://app.ltx.studio/motion-workspace?videoModel=ltxv)
75
  - [Fal.ai text-to-video](https://fal.ai/models/fal-ai/ltx-video)
76
  - [Fal.ai image-to-video](https://fal.ai/models/fal-ai/ltx-video/image-to-video)
77
  - [Replicate text-to-video and image-to-video](https://replicate.com/lightricks/ltx-video)
@@ -99,18 +98,6 @@ python -m pip install -e .\[inference-script\]
99
 
100
  To use our model, please follow the inference code in [inference.py](https://github.com/Lightricks/LTX-Video/blob/main/inference.py):
101
 
102
- ##### For text-to-video generation:
103
-
104
- ```bash
105
- python inference.py --prompt "PROMPT" --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
106
- ```
107
-
108
- ##### For image-to-video generation:
109
-
110
- ```bash
111
- python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
112
- ```
113
-
114
  ### Diffusers 🧨
115
 
116
  LTX Video is compatible with the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index). It supports both text-to-video and image-to-video generation.
@@ -121,77 +108,9 @@ Make sure you install `diffusers` before trying out the examples below.
121
  pip install -U git+https://github.com/huggingface/diffusers
122
  ```
123
 
124
- Now, you can run the examples below (note that the upsampling stage is optional but recommended):
125
-
126
- ### For text-to-video:
127
128
- ```py
129
- import torch
130
- from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
131
- from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
132
- from diffusers.utils import export_to_video
133
-
134
- pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
135
- pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
136
- pipe.to("cuda")
137
- pipe_upsample.to("cuda")
138
- pipe.vae.enable_tiling()
139
-
140
- prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
141
- negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
142
- expected_height, expected_width = 704, 512
143
- downscale_factor = 2 / 3
144
- num_frames = 121
145
-
146
- # Part 1. Generate video at smaller resolution
147
- downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
148
- latents = pipe(
149
- conditions=None,
150
- prompt=prompt,
151
- negative_prompt=negative_prompt,
152
- width=downscaled_width,
153
- height=downscaled_height,
154
- num_frames=num_frames,
155
- num_inference_steps=7,
156
- decode_timestep = 0.05,
157
- guidance_scale=1.0,
158
- decode_noise_scale = 0.025,
159
- generator=torch.Generator().manual_seed(0),
160
- output_type="latent",
161
- ).frames
162
-
163
- # Part 2. Upscale generated video using latent upsampler with fewer inference steps
164
- # The available latent upsampler upscales the height/width by 2x
165
- upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
166
- upscaled_latents = pipe_upsample(
167
- latents=latents,
168
- output_type="latent"
169
- ).frames
170
- # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
171
- video = pipe(
172
- prompt=prompt,
173
- negative_prompt=negative_prompt,
174
- width=upscaled_width,
175
- height=upscaled_height,
176
- num_frames=num_frames,
177
- denoise_strength=0.3, # Effectively, 3 inference steps out of 10
178
- num_inference_steps=10,
179
- latents=upscaled_latents,
180
- decode_timestep = 0.05,
181
- guidance_scale=1.0,
182
- decode_noise_scale = 0.025,
183
- image_cond_noise_scale=0.025,
184
- generator=torch.Generator().manual_seed(0),
185
- output_type="pil",
186
- ).frames[0]
187
-
188
- # Part 4. Downscale the video to the expected resolution
189
- video = [frame.resize((expected_width, expected_height)) for frame in video]
190
-
191
- export_to_video(video, "output.mp4", fps=24)
192
- ```
193
 
194
- ### For image-to-video:
195
 
196
  ```py
197
  import torch
@@ -199,91 +118,25 @@ from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
199
  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
200
  from diffusers.utils import export_to_video, load_image
201
 
202
- pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
203
- pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
 
 
 
 
 
 
 
 
 
204
  pipe.to("cuda")
205
  pipe_upsample.to("cuda")
206
- pipe.vae.enable_tiling()
207
-
208
- image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
209
- video = [image]
210
- condition1 = LTXVideoCondition(video=video, frame_index=0)
211
-
212
- prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
213
- negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
214
- expected_height, expected_width = 832, 480
215
- downscale_factor = 2 / 3
216
- num_frames = 96
217
-
218
- # Part 1. Generate video at smaller resolution
219
- downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
220
- downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
221
- latents = pipe(
222
- conditions=[condition1],
223
- prompt=prompt,
224
- negative_prompt=negative_prompt,
225
- width=downscaled_width,
226
- height=downscaled_height,
227
- num_frames=num_frames,
228
- num_inference_steps=7,
229
- guidance_scale=1.0,
230
- decode_timestep = 0.05,
231
- decode_noise_scale = 0.025,
232
- generator=torch.Generator().manual_seed(0),
233
- output_type="latent",
234
- ).frames
235
-
236
- # Part 2. Upscale generated video using latent upsampler with fewer inference steps
237
- # The available latent upsampler upscales the height/width by 2x
238
- upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
239
- upscaled_latents = pipe_upsample(
240
- latents=latents,
241
- output_type="latent"
242
- ).frames
243
- # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
244
- video = pipe(
245
- conditions=[condition1],
246
- prompt=prompt,
247
- negative_prompt=negative_prompt,
248
- width=upscaled_width,
249
- height=upscaled_height,
250
- num_frames=num_frames,
251
- denoise_strength=0.3, # Effectively, 3 inference steps out of 10
252
- num_inference_steps=10,
253
- guidance_scale=1.0,
254
- latents=upscaled_latents,
255
- decode_timestep = 0.05,
256
- decode_noise_scale = 0.025,
257
- image_cond_noise_scale=0.025,
258
- generator=torch.Generator().manual_seed(0),
259
- output_type="pil",
260
- ).frames[0]
261
-
262
- # Part 4. Downscale the video to the expected resolution
263
- video = [frame.resize((expected_width, expected_height)) for frame in video]
264
-
265
- export_to_video(video, "output.mp4", fps=24)
266
-
267
- ```
268
-
269
- ### For video-to-video:
270
-
271
- ```py
272
- import torch
273
- from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
274
- from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
275
- from diffusers.utils import export_to_video, load_video
276
-
277
- pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
278
- pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
279
- pipe.to("cuda")
280
- pipe_upsample.to("cuda")
281
- pipe.vae.enable_tiling()
282
 
283
  def round_to_nearest_resolution_acceptable_by_vae(height, width):
284
  height = height - (height % pipe.vae_spatial_compression_ratio)
285
  width = width - (width % pipe.vae_spatial_compression_ratio)
286
  return height, width
 
287
  video = load_video(
288
  "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
289
  )[:21] # Use only the first 21 frames as conditioning
@@ -298,6 +151,7 @@ num_frames = 161
298
  # Part 1. Generate video at smaller resolution
299
  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
300
  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
 
301
  latents = pipe(
302
  conditions=[condition1],
303
  prompt=prompt,
@@ -305,10 +159,7 @@ latents = pipe(
305
  width=downscaled_width,
306
  height=downscaled_height,
307
  num_frames=num_frames,
308
- num_inference_steps=7,
309
- guidance_scale=1.0,
310
- decode_timestep = 0.05,
311
- decode_noise_scale = 0.025,
312
  generator=torch.Generator().manual_seed(0),
313
  output_type="latent",
314
  ).frames
@@ -329,12 +180,10 @@ video = pipe(
329
  width=upscaled_width,
330
  height=upscaled_height,
331
  num_frames=num_frames,
332
- denoise_strength=0.3, # Effectively, 3 inference steps out of 10
333
  num_inference_steps=10,
334
- guidnace_scale=1.0,
335
  latents=upscaled_latents,
336
- decode_timestep = 0.05,
337
- decode_noise_scale = 0.025,
338
  image_cond_noise_scale=0.025,
339
  generator=torch.Generator().manual_seed(0),
340
  output_type="pil",
@@ -346,10 +195,13 @@ video = [frame.resize((expected_width, expected_height)) for frame in video]
346
  export_to_video(video, "output.mp4", fps=24)
347
 
348
  ```
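The `denoise_strength` comments in the examples above can be read as simple arithmetic: when refining existing latents, the pipeline runs roughly `denoise_strength × num_inference_steps` of the schedule. This is a simplification for intuition only; the scheduler's exact cut-off may differ by a step:

```python
def effective_refine_steps(denoise_strength, num_inference_steps):
    """Approximate number of denoising steps actually executed when
    refining existing latents (a simplification of scheduler behavior)."""
    return round(denoise_strength * num_inference_steps)

print(effective_refine_steps(0.3, 10))  # 3
print(effective_refine_steps(0.4, 10))  # 4
```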
349
- To learn more, check out the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
350
 
 
351
  Diffusers also supports directly loading from the original LTX checkpoints using the `from_single_file()` method. Check out [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video#loading-single-files) to learn more.
352
 
 
 
 
353
  ## Limitations
354
  - This model is not intended or able to provide factual information.
355
  - As a statistical model this checkpoint might amplify existing societal biases.
 
1
  ---
2
  tags:
3
  - ltx-video
4
+ - video-upscaling
5
+ - diffusers
6
+ - video-to-video
7
+ pinned: false
8
  language:
9
  - en
10
  license: other
 
12
  library_name: diffusers
13
  ---
14
 
15
+ # LTX Video Spatial Upscaler 0.9.7 Model Card
16
+
17
+ This model card focuses on the LTX Video Spatial Upscaler 0.9.7, a component model designed to work in conjunction with the LTX-Video generation models.
18
+ The main LTX-Video codebase is available [here](https://github.com/Lightricks/LTX-Video).
19
 
20
  LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content.
21
+ We provide models for both text-to-video and image+text-to-video use cases.
22
+
23
+ **The LTX Video Spatial Upscaler** is a diffusion-based model that enhances the spatial resolution of videos. It is specifically trained to upscale the latent representations of videos generated by LTX Video models.
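As a rough illustration of what upscaling in latent space means here, the sketch below works through the shape arithmetic. The 32× spatial / 8× temporal VAE compression figures and the causal first-frame handling are assumptions for this sketch, not values read from the checkpoint:

```python
def latent_shape(height, width, num_frames, spatial_ratio=32, temporal_ratio=8):
    """Pixel-space video size -> latent grid size, under the assumed
    compression ratios (32x spatial, 8x temporal, causal first frame)."""
    frames = (num_frames - 1) // temporal_ratio + 1
    return frames, height // spatial_ratio, width // spatial_ratio

f, h, w = latent_shape(512, 704, 121)    # base generation: 512x704, 121 frames
up_h, up_w = 2 * h, 2 * w                # the spatial upsampler doubles h and w
print((f, h, w), "->", (f, up_h, up_w))  # (16, 16, 22) -> (16, 32, 44)
```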
24
 
25
  <img src="./media/trailer.gif" alt="trailer" width="512">
26
 
 
32
  | ![example9](./media/ltx-video_example_00009.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man with graying hair, a beard, and a gray shirt...</summary>A man with graying hair, a beard, and a gray shirt looks down and to his right, then turns his head to the left. The camera angle is a close-up, focused on the man's face. The lighting is dim, with a greenish tint. The scene appears to be real-life footage. Step</details> | ![example10](./media/ltx-video_example_00010.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A clear, turquoise river flows through a rocky canyon...</summary>A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom.The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pine trees, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility.</details> | ![example11](./media/ltx-video_example_00011.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man in a suit enters a room and speaks to two women...</summary>A man in a suit enters a room and speaks to two women sitting on a couch. The man, wearing a dark suit with a gold tie, enters the room from the left and walks towards the center of the frame. He has short gray hair, light skin, and a serious expression. He places his right hand on the back of a chair as he approaches the couch. Two women are seated on a light-colored couch in the background. The woman on the left wears a light blue sweater and has short blonde hair. The woman on the right wears a white sweater and has short blonde hair. The camera remains stationary, focusing on the man as he enters the room. The room is brightly lit, with warm tones reflecting off the walls and furniture. 
The scene appears to be from a film or television show.</details> | ![example12](./media/ltx-video_example_00012.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The waves crash against the jagged rocks of the shoreline...</summary>The waves crash against the jagged rocks of the shoreline, sending spray high into the air.The rocks are a dark gray color, with sharp edges and deep crevices. The water is a clear blue-green, with white foam where the waves break against the rocks. The sky is a light gray, with a few white clouds dotting the horizon.</details> |
33
  | ![example13](./media/ltx-video_example_00013.gif)<br><details style="max-width: 300px; margin: auto;"><summary>The camera pans across a cityscape of tall buildings...</summary>The camera pans across a cityscape of tall buildings with a circular building in the center. The camera moves from left to right, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left, casting shadows from the buildings. The scene is computer-generated imagery.</details> | ![example14](./media/ltx-video_example_00014.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A man walks towards a window, looks out, and then turns around...</summary>A man walks towards a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat over a red and gray scarf. He walks from left to right towards a window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is captured in real-life footage.</details> | ![example15](./media/ltx-video_example_00015.gif)<br><details style="max-width: 300px; margin: auto;"><summary>Two police officers in dark blue uniforms and matching hats...</summary>Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. 
Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show.</details> | ![example16](./media/ltx-video_example_00016.gif)<br><details style="max-width: 300px; margin: auto;"><summary>A woman with short brown hair, wearing a maroon sleeveless top...</summary>A woman with short brown hair, wearing a maroon sleeveless top and a silver necklace, walks through a room while talking, then a woman with pink hair and a white shirt appears in the doorway and yells. The first woman walks from left to right, her expression serious; she has light skin and her eyebrows are slightly furrowed. The second woman stands in the doorway, her mouth open in a yell; she has light skin and her eyes are wide. The room is dimly lit, with a bookshelf visible in the background. The camera follows the first woman as she walks, then cuts to a close-up of the second woman's face. The scene is captured in real-life footage.</details> |
34
 
 
35
 
36
+ **This upscaler model is compatible with and can be used to improve the output quality of videos generated by both:**
37
+ * `Lightricks/LTX-Video-0.9.7-dev`
38
+ * `Lightricks/LTX-Video-0.9.7-distilled`
39
+
 
 
 
 
 
 
40
 
41
  ## Model Details
42
  - **Developed by:** Lightricks
43
+ - **Model type:** Latent Diffusion Video Spatial Upscaler
44
+ - **Input:** Latent video frames from an LTX Video model.
45
+ - **Output:** Higher-resolution latent video frames.
46
+ - **Compatibility:** can be used with `Lightricks/LTX-Video-0.9.7-dev` and `Lightricks/LTX-Video-0.9.7-distilled`.
47
 
48
  ## Usage
49
 
 
70
 
71
  ### Online demo
72
  The model is accessible right away via the following links:
73
+ - [LTX-Studio image-to-video](https://app.ltx.studio/ltx-video)
 
74
  - [Fal.ai text-to-video](https://fal.ai/models/fal-ai/ltx-video)
75
  - [Fal.ai image-to-video](https://fal.ai/models/fal-ai/ltx-video/image-to-video)
76
  - [Replicate text-to-video and image-to-video](https://replicate.com/lightricks/ltx-video)
 
98
 
99
  To use our model, please follow the inference code in [inference.py](https://github.com/Lightricks/LTX-Video/blob/main/inference.py):
100
 
 
 
 
 
 
 
 
 
 
 
 
 
101
  ### Diffusers 🧨
102
 
103
  LTX Video is compatible with the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index). It supports both text-to-video and image-to-video generation.
 
108
  pip install -U git+https://github.com/huggingface/diffusers
109
  ```
110
 
111
+ The LTX Video Spatial Upscaler is used via the `LTXLatentUpsamplePipeline` in the `diffusers` library. It is intended to be part of a multi-stage generation process.
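Concretely, the stages of that process operate at different resolutions. The sketch below is plain arithmetic, assuming the 2/3 downscale factor used in the example that follows and rounding to multiples of 32 as an assumed VAE constraint:

```python
def plan_resolutions(target_h, target_w, downscale_factor=2 / 3, multiple=32):
    """Resolutions for the multi-stage flow: low-res base generation,
    2x latent upsampling, then a final resize to the requested size.
    The rounding multiple (32) is an assumed VAE constraint."""
    gen_h = int(target_h * downscale_factor)
    gen_w = int(target_w * downscale_factor)
    gen_h, gen_w = gen_h - gen_h % multiple, gen_w - gen_w % multiple
    return (gen_h, gen_w), (gen_h * 2, gen_w * 2), (target_h, target_w)

low, upsampled, final = plan_resolutions(704, 512)
print(low, upsampled, final)  # (448, 320) (896, 640) (704, 512)
```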
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
112
 
113
+ Below is an example demonstrating how to use the spatial upsampler with a base LTX Video model (either the 'dev' or 'distilled' version).
114
 
115
  ```py
116
  import torch
 
118
  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
119
  from diffusers.utils import export_to_video, load_image
120
 
121
+ # Choose your base LTX Video model:
122
+ # base_model_id = "Lightricks/LTX-Video-0.9.7-dev"
123
+ base_model_id = "Lightricks/LTX-Video-0.9.7-distilled" # Using distilled for this example
124
+
125
+ # 0. Load base model and upsampler
126
+ pipe = LTXConditionPipeline.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
127
+ pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained(
128
+ "Lightricks/ltxv-spatial-upscaler-0.9.7",
129
+ vae=pipe.vae,
130
+ torch_dtype=torch.bfloat16
131
+ )
132
  pipe.to("cuda")
133
  pipe_upsample.to("cuda")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
134
 
135
  def round_to_nearest_resolution_acceptable_by_vae(height, width):
136
  height = height - (height % pipe.vae_spatial_compression_ratio)
137
  width = width - (width % pipe.vae_spatial_compression_ratio)
138
  return height, width
139
+
140
  video = load_video(
141
  "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
142
  )[:21] # Use only the first 21 frames as conditioning
 
151
  # Part 1. Generate video at smaller resolution
152
  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
153
  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
154
+
155
  latents = pipe(
156
  conditions=[condition1],
157
  prompt=prompt,
 
159
  width=downscaled_width,
160
  height=downscaled_height,
161
  num_frames=num_frames,
162
+ num_inference_steps=30,
 
 
 
163
  generator=torch.Generator().manual_seed(0),
164
  output_type="latent",
165
  ).frames
 
180
  width=upscaled_width,
181
  height=upscaled_height,
182
  num_frames=num_frames,
183
+ denoise_strength=0.4, # Effectively, 4 inference steps out of 10
184
  num_inference_steps=10,
 
185
  latents=upscaled_latents,
186
+ decode_timestep=0.05,
 
187
  image_cond_noise_scale=0.025,
188
  generator=torch.Generator().manual_seed(0),
189
  output_type="pil",
 
195
  export_to_video(video, "output.mp4", fps=24)
196
 
197
  ```
 
198
 
199
+ For more details and inference examples using 🧨 diffusers, check out the [diffusers documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
200
  Diffusers also supports directly loading from the original LTX checkpoints using the `from_single_file()` method. Check out [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video#loading-single-files) to learn more.
201
 
202
203
+
204
+
205
  ## Limitations
206
  - This model is not intended or able to provide factual information.
207
  - As a statistical model this checkpoint might amplify existing societal biases.