jacobitterman and linoyts (HF Staff) committed
Commit ab59826 · verified · 1 parent: b0f7c01

Update README.md (#3)

- Update README.md (52f62088976a31e5c6e0d8d10c0389751a2eddd7)


Co-authored-by: Linoy Tsaban <[email protected]>

Files changed (1): README.md (+184 -22)
README.md CHANGED
@@ -10,7 +10,7 @@ pipeline_tag: text-to-video
  library_name: diffusers
  ---
 
- # LTX-Video Model Card
  This model card focuses on the model associated with the LTX-Video model, codebase available [here](https://github.com/Lightricks/LTX-Video).
 
  LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content.
@@ -114,58 +114,220 @@ Make sure you install `diffusers` before trying out the examples below.
  pip install -U git+https://github.com/huggingface/diffusers
  ```
 
- Now, you can run the examples below:
 
  ```py
  import torch
- from diffusers import LTXPipeline
  from diffusers.utils import export_to_video
 
- pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
  pipe.to("cuda")
 
- prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
-
  video = pipe(
      prompt=prompt,
      negative_prompt=negative_prompt,
-     width=704,
-     height=480,
-     num_frames=161,
-     num_inference_steps=50,
  ).frames[0]
  export_to_video(video, "output.mp4", fps=24)
  ```
 
- For image-to-video:
 
  ```py
  import torch
- from diffusers import LTXImageToVideoPipeline
  from diffusers.utils import export_to_video, load_image
 
- pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
  pipe.to("cuda")
 
- image = load_image(
-     "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
- )
- prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
 
  video = pipe(
-     image=image,
      prompt=prompt,
      negative_prompt=negative_prompt,
-     width=704,
-     height=480,
-     num_frames=161,
-     num_inference_steps=50,
  ).frames[0]
  export_to_video(video, "output.mp4", fps=24)
  ```
 
  To learn more, check out the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
 
  Diffusers also supports directly loading from the original LTX checkpoints using the `from_single_file()` method. Check out [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video#loading-single-files) to learn more.
 
  library_name: diffusers
  ---
 
+ # LTX-Video 0.9.7 Model Card
  This model card focuses on the model associated with the LTX-Video model, codebase available [here](https://github.com/Lightricks/LTX-Video).
 
  LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content.
 
  pip install -U git+https://github.com/huggingface/diffusers
  ```
 
+ Now, you can run the examples below (note that the upsampling stage is optional but recommended):
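
As a minimal sketch of the single-pass path (assuming the same `Lightricks/LTX-Video-0.9.7-dev` checkpoint used in the examples below, with dimensions kept divisible by 32 for the VAE), generation without the upsampler looks like:

```py
import torch
from diffusers import LTXConditionPipeline
from diffusers.utils import export_to_video

# Single-pass generation directly at the target resolution, skipping the
# optional latent-upsampling stage demonstrated in the full examples below.
pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")

video = pipe(
    prompt="A winding mountain road covered in snow, with a single vehicle traveling along it.",
    negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
    width=512,
    height=704,
    num_frames=121,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```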
 
+ ### Text-to-video:
  ```py
  import torch
+ from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+ from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
  from diffusers.utils import export_to_video
 
+ pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
+ pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
  pipe.to("cuda")
+ pipe_upsample.to("cuda")
+ pipe.vae.enable_tiling()
+
+ def round_to_nearest_resolution_acceptable_by_vae(height, width):
+     # Keep spatial dimensions divisible by the VAE's spatial compression ratio.
+     height = height - (height % pipe.vae_spatial_compression_ratio)
+     width = width - (width % pipe.vae_spatial_compression_ratio)
+     return height, width
 
+ prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+ expected_height, expected_width = 704, 512
+ downscale_factor = 2 / 3
+ num_frames = 121
+
+ # Part 1. Generate video at smaller resolution
+ downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+ downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+ latents = pipe(
+     conditions=None,
+     prompt=prompt,
+     negative_prompt=negative_prompt,
+     width=downscaled_width,
+     height=downscaled_height,
+     num_frames=num_frames,
+     num_inference_steps=30,
+     generator=torch.Generator().manual_seed(0),
+     output_type="latent",
+ ).frames
+
+ # Part 2. Upscale generated video using latent upsampler with fewer inference steps
+ # The available latent upsampler upscales the height/width by 2x
+ upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+ upscaled_latents = pipe_upsample(
+     latents=latents,
+     output_type="latent"
+ ).frames
+
+ # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
  video = pipe(
      prompt=prompt,
      negative_prompt=negative_prompt,
+     width=upscaled_width,
+     height=upscaled_height,
+     num_frames=num_frames,
+     denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
+     num_inference_steps=10,
+     latents=upscaled_latents,
+     decode_timestep=0.05,
+     image_cond_noise_scale=0.025,
+     generator=torch.Generator().manual_seed(0),
+     output_type="pil",
  ).frames[0]
+
+ # Part 4. Downscale the video to the expected resolution
+ video = [frame.resize((expected_width, expected_height)) for frame in video]
+
  export_to_video(video, "output.mp4", fps=24)
  ```
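
The example above keeps both pipelines resident on the GPU. If that does not fit in memory, a minimal sketch using diffusers' standard model offloading helper (as an alternative to the `.to("cuda")` calls above; an assumption about your setup, not part of this recipe):

```py
# Trade throughput for memory: submodules are moved onto the GPU only while
# they run. Use these instead of pipe.to("cuda") / pipe_upsample.to("cuda").
pipe.enable_model_cpu_offload()
pipe_upsample.enable_model_cpu_offload()
```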
 
+ ### Image-to-video:
 
  ```py
  import torch
+ from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+ from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
  from diffusers.utils import export_to_video, load_image
 
+ pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
+ pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
  pipe.to("cuda")
+ pipe_upsample.to("cuda")
+ pipe.vae.enable_tiling()
+
+ def round_to_nearest_resolution_acceptable_by_vae(height, width):
+     # Keep spatial dimensions divisible by the VAE's spatial compression ratio.
+     height = height - (height % pipe.vae_spatial_compression_ratio)
+     width = width - (width % pipe.vae_spatial_compression_ratio)
+     return height, width
+
+ image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
+ video = [image]
+ condition1 = LTXVideoCondition(video=video, frame_index=0)
 
+ prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+ expected_height, expected_width = 832, 480
+ downscale_factor = 2 / 3
+ num_frames = 96
+
+ # Part 1. Generate video at smaller resolution
+ downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+ downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+ latents = pipe(
+     conditions=[condition1],
+     prompt=prompt,
+     negative_prompt=negative_prompt,
+     width=downscaled_width,
+     height=downscaled_height,
+     num_frames=num_frames,
+     num_inference_steps=30,
+     generator=torch.Generator().manual_seed(0),
+     output_type="latent",
+ ).frames
+
+ # Part 2. Upscale generated video using latent upsampler with fewer inference steps
+ # The available latent upsampler upscales the height/width by 2x
+ upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+ upscaled_latents = pipe_upsample(
+     latents=latents,
+     output_type="latent"
+ ).frames
+
+ # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
+ video = pipe(
+     conditions=[condition1],
+     prompt=prompt,
+     negative_prompt=negative_prompt,
+     width=upscaled_width,
+     height=upscaled_height,
+     num_frames=num_frames,
+     denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
+     num_inference_steps=10,
+     latents=upscaled_latents,
+     decode_timestep=0.05,
+     image_cond_noise_scale=0.025,
+     generator=torch.Generator().manual_seed(0),
+     output_type="pil",
+ ).frames[0]
+
+ # Part 4. Downscale the video to the expected resolution
+ video = [frame.resize((expected_width, expected_height)) for frame in video]
+
+ export_to_video(video, "output.mp4", fps=24)
+ ```
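
Since `conditions` is a list, more than one anchor can in principle be supplied. A sketch under that assumption, where `first_frame` and `last_frame` are hypothetical PIL images loaded as above (conditioned frame indices may need to align with the VAE's temporal compression):

```py
# Hypothetical: pin one image at the start of the clip and another near the end.
condition_start = LTXVideoCondition(video=[first_frame], frame_index=0)
condition_end = LTXVideoCondition(video=[last_frame], frame_index=num_frames - 1)

latents = pipe(
    conditions=[condition_start, condition_end],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    output_type="latent",
).frames
```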
 
+ ### Video-to-video:
 
+ ```py
+ import torch
+ from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+ from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
+ from diffusers.utils import export_to_video, load_video
+
+ pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
+ pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
+ pipe.to("cuda")
+ pipe_upsample.to("cuda")
+ pipe.vae.enable_tiling()
+
+ def round_to_nearest_resolution_acceptable_by_vae(height, width):
+     # Keep spatial dimensions divisible by the VAE's spatial compression ratio.
+     height = height - (height % pipe.vae_spatial_compression_ratio)
+     width = width - (width % pipe.vae_spatial_compression_ratio)
+     return height, width
 
+ video = load_video(
+     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
+ )[:21]  # Use only the first 21 frames as conditioning
+ condition1 = LTXVideoCondition(video=video, frame_index=0)
+
+ prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
+ negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+ expected_height, expected_width = 768, 1152
+ downscale_factor = 2 / 3
+ num_frames = 161
+
+ # Part 1. Generate video at smaller resolution
+ downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+ downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+ latents = pipe(
+     conditions=[condition1],
+     prompt=prompt,
+     negative_prompt=negative_prompt,
+     width=downscaled_width,
+     height=downscaled_height,
+     num_frames=num_frames,
+     num_inference_steps=30,
+     generator=torch.Generator().manual_seed(0),
+     output_type="latent",
+ ).frames
+
+ # Part 2. Upscale generated video using latent upsampler with fewer inference steps
+ # The available latent upsampler upscales the height/width by 2x
+ upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+ upscaled_latents = pipe_upsample(
+     latents=latents,
+     output_type="latent"
+ ).frames
+
+ # Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
  video = pipe(
+     conditions=[condition1],
      prompt=prompt,
      negative_prompt=negative_prompt,
+     width=upscaled_width,
+     height=upscaled_height,
+     num_frames=num_frames,
+     denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
+     num_inference_steps=10,
+     latents=upscaled_latents,
+     decode_timestep=0.05,
+     image_cond_noise_scale=0.025,
+     generator=torch.Generator().manual_seed(0),
+     output_type="pil",
  ).frames[0]
+
+ # Part 4. Downscale the video to the expected resolution
+ video = [frame.resize((expected_width, expected_height)) for frame in video]
+
  export_to_video(video, "output.mp4", fps=24)
  ```
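
To make the resolution flow of the four parts concrete, a small bookkeeping sketch with the numbers from the video-to-video example above (pure arithmetic, no GPU needed):

```py
expected_height, expected_width = 768, 1152
downscale_factor = 2 / 3

# Part 1 renders at roughly 2/3 of the target size.
downscaled = (int(expected_height * downscale_factor), int(expected_width * downscale_factor))  # (512, 768)
# Both values are already divisible by 32, so VAE rounding leaves them unchanged.

# Part 2 doubles the spatial size in latent space.
upscaled = (downscaled[0] * 2, downscaled[1] * 2)  # (1024, 1536)

# Part 4 resizes the decoded frames back down to the expected resolution.
print(downscaled, upscaled, (expected_height, expected_width))
```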
 
  To learn more, check out the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
 
  Diffusers also supports directly loading from the original LTX checkpoints using the `from_single_file()` method. Check out [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video#loading-single-files) to learn more.
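
A minimal sketch of that loading path (the checkpoint location below is a placeholder, not a published filename; see the linked section for the exact usage):

```py
import torch
from diffusers import LTXPipeline

# Hypothetical local single-file checkpoint, for illustration only.
pipe = LTXPipeline.from_single_file(
    "path/to/ltx-video-checkpoint.safetensors",
    torch_dtype=torch.bfloat16,
)
```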