jacobitterman and linoyts (HF Staff) committed
Commit a4514b2 · verified · 1 Parent(s): bdc0b06

Update README.md (#2)


- Update README.md (768ad92c3c97985a11a82b2f7aefc38c604ee035)
- fix (3ede8adbf96270ae3bb93f8c0244408c1af518a0)


Co-authored-by: Linoy Tsaban <[email protected]>

Files changed (1):
1. README.md (+196, -23)
README.md CHANGED
@@ -10,7 +10,7 @@ pipeline_tag: text-to-video
 library_name: diffusers
 ---
 
-# LTX-Video Model Card
+# LTX-Video 0.9.7 Distilled Model Card
 This model card focuses on the model associated with the LTX-Video model, codebase available [here](https://github.com/Lightricks/LTX-Video).
 
 LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content.
@@ -121,58 +121,231 @@ Make sure you install `diffusers` before trying out the examples below.
 pip install -U git+https://github.com/huggingface/diffusers
 ```
 
-Now, you can run the examples below:
+Now, you can run the examples below (note that the upsampling stage is optional but recommended):
+
+### Text-to-video:
 
 ```py
 import torch
-from diffusers import LTXPipeline
+from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
 from diffusers.utils import export_to_video
 
-pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
+pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
+pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
 pipe.to("cuda")
+pipe_upsample.to("cuda")
+pipe.vae.enable_tiling()
+
+def round_to_nearest_resolution_acceptable_by_vae(height, width):
+    # Round the spatial dims down to multiples of the VAE's spatial compression ratio
+    height = height - (height % pipe.vae_spatial_compression_ratio)
+    width = width - (width % pipe.vae_spatial_compression_ratio)
+    return height, width
 
-prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
+prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
 negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
-
+expected_height, expected_width = 704, 512
+downscale_factor = 2 / 3
+num_frames = 121
+
+# Part 1. Generate video at smaller resolution
+downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+latents = pipe(
+    conditions=None,
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=downscaled_width,
+    height=downscaled_height,
+    num_frames=num_frames,
+    num_inference_steps=7,
+    guidance_scale=1.0,
+    decode_timestep=0.05,
+    decode_noise_scale=0.025,
+    generator=torch.Generator().manual_seed(0),
+    output_type="latent",
+).frames
+
+# Part 2. Upscale generated video using latent upsampler with fewer inference steps
+# The available latent upsampler upscales the height/width by 2x
+upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+upscaled_latents = pipe_upsample(
+    latents=latents,
+    output_type="latent"
+).frames
+
+# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
 video = pipe(
     prompt=prompt,
     negative_prompt=negative_prompt,
-    width=704,
-    height=480,
-    num_frames=161,
-    num_inference_steps=50,
+    width=upscaled_width,
+    height=upscaled_height,
+    num_frames=num_frames,
+    denoise_strength=0.3,  # Effectively, 4 inference steps out of 10
+    num_inference_steps=10,
+    latents=upscaled_latents,
+    guidance_scale=1.0,
+    decode_timestep=0.05,
+    decode_noise_scale=0.025,
+    image_cond_noise_scale=0.025,
+    generator=torch.Generator().manual_seed(0),
+    output_type="pil",
 ).frames[0]
+
+# Part 4. Downscale the video to the expected resolution
+video = [frame.resize((expected_width, expected_height)) for frame in video]
+
 export_to_video(video, "output.mp4", fps=24)
 ```
 
-For image-to-video:
+### Image-to-video:
 
 ```py
 import torch
-from diffusers import LTXImageToVideoPipeline
+from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
 from diffusers.utils import export_to_video, load_image
 
-pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
+pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
+pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
 pipe.to("cuda")
+pipe_upsample.to("cuda")
+pipe.vae.enable_tiling()
+
+def round_to_nearest_resolution_acceptable_by_vae(height, width):
+    height = height - (height % pipe.vae_spatial_compression_ratio)
+    width = width - (width % pipe.vae_spatial_compression_ratio)
+    return height, width
 
-image = load_image(
-    "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
-)
-prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
-negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
+video = [image]
+condition1 = LTXVideoCondition(video=video, frame_index=0)
+
+prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
+negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+expected_height, expected_width = 832, 480
+downscale_factor = 2 / 3
+num_frames = 96
+
+# Part 1. Generate video at smaller resolution
+downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+latents = pipe(
+    conditions=[condition1],
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=downscaled_width,
+    height=downscaled_height,
+    num_frames=num_frames,
+    num_inference_steps=7,
+    guidance_scale=1.0,
+    decode_timestep=0.05,
+    decode_noise_scale=0.025,
+    generator=torch.Generator().manual_seed(0),
+    output_type="latent",
+).frames
+
+# Part 2. Upscale generated video using latent upsampler with fewer inference steps
+# The available latent upsampler upscales the height/width by 2x
+upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+upscaled_latents = pipe_upsample(
+    latents=latents,
+    output_type="latent"
+).frames
+
+# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
 video = pipe(
+    conditions=[condition1],
     prompt=prompt,
     negative_prompt=negative_prompt,
-    image=image,
-    width=704,
-    height=480,
-    num_frames=161,
-    num_inference_steps=50,
+    width=upscaled_width,
+    height=upscaled_height,
+    num_frames=num_frames,
+    denoise_strength=0.3,  # Effectively, 4 inference steps out of 10
+    num_inference_steps=10,
+    guidance_scale=1.0,
+    latents=upscaled_latents,
+    decode_timestep=0.05,
+    decode_noise_scale=0.025,
+    image_cond_noise_scale=0.025,
+    generator=torch.Generator().manual_seed(0),
+    output_type="pil",
 ).frames[0]
+
+# Part 4. Downscale the video to the expected resolution
+video = [frame.resize((expected_width, expected_height)) for frame in video]
+
 export_to_video(video, "output.mp4", fps=24)
 ```
 
+### Video-to-video:
+
+```py
+import torch
+from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
+from diffusers.utils import export_to_video, load_video
+
+pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
+pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
+pipe.to("cuda")
+pipe_upsample.to("cuda")
+pipe.vae.enable_tiling()
+
+def round_to_nearest_resolution_acceptable_by_vae(height, width):
+    height = height - (height % pipe.vae_spatial_compression_ratio)
+    width = width - (width % pipe.vae_spatial_compression_ratio)
+    return height, width
+
+video = load_video(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
+)[:21]  # Use only the first 21 frames as conditioning
+condition1 = LTXVideoCondition(video=video, frame_index=0)
+
+prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
+negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+expected_height, expected_width = 768, 1152
+downscale_factor = 2 / 3
+num_frames = 161
+
+# Part 1. Generate video at smaller resolution
+downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+latents = pipe(
+    conditions=[condition1],
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=downscaled_width,
+    height=downscaled_height,
+    num_frames=num_frames,
+    num_inference_steps=7,
+    guidance_scale=1.0,
+    decode_timestep=0.05,
+    decode_noise_scale=0.025,
+    generator=torch.Generator().manual_seed(0),
+    output_type="latent",
+).frames
+
+# Part 2. Upscale generated video using latent upsampler with fewer inference steps
+# The available latent upsampler upscales the height/width by 2x
+upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+upscaled_latents = pipe_upsample(
+    latents=latents,
+    output_type="latent"
+).frames
+
+# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
+video = pipe(
+    conditions=[condition1],
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=upscaled_width,
+    height=upscaled_height,
+    num_frames=num_frames,
+    denoise_strength=0.3,  # Effectively, 4 inference steps out of 10
+    num_inference_steps=10,
+    guidance_scale=1.0,
+    latents=upscaled_latents,
+    decode_timestep=0.05,
+    decode_noise_scale=0.025,
+    image_cond_noise_scale=0.025,
+    generator=torch.Generator().manual_seed(0),
+    output_type="pil",
+).frames[0]
+
+# Part 4. Downscale the video to the expected resolution
+video = [frame.resize((expected_width, expected_height)) for frame in video]
+
+export_to_video(video, "output.mp4", fps=24)
+```
 
 To learn more, check out the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
 
 Diffusers also supports directly loading from the original LTX checkpoints using the `from_single_file()` method. Check out [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video#loading-single-files) to learn more.
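For reference, a minimal single-file loading sketch. The checkpoint path below is a placeholder, not a real filename: substitute the actual `.safetensors` file for your release, and check the linked section for which LTX pipeline classes expose `from_single_file()`.

```py
import torch
from diffusers import LTXConditionPipeline

# Placeholder path (hypothetical): point this at the actual single-file
# checkpoint from the release you are using.
single_file_ckpt = "./ltxv-distilled-checkpoint.safetensors"

pipe = LTXConditionPipeline.from_single_file(single_file_ckpt, torch_dtype=torch.bfloat16)
pipe.to("cuda")
```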
 
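Since the upsampling stage is optional (per the note above the examples), here is a minimal single-stage sketch of the text-to-video path, reusing the parameters from the example above with a width/height already divisible by 32 so no rounding helper is needed:

```py
import torch
from diffusers import LTXConditionPipeline
from diffusers.utils import export_to_video

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.vae.enable_tiling()

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

# Single-stage generation: decode directly to PIL frames,
# skipping the latent upsampling and refinement passes.
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=512,
    height=704,
    num_frames=121,
    num_inference_steps=7,
    guidance_scale=1.0,
    decode_timestep=0.05,
    decode_noise_scale=0.025,
    generator=torch.Generator().manual_seed(0),
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```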