FramePack LoRA Experiment
Update: On further testing, I've come to realise retraining may not be necessary. Maybe for some LoRAs, but for others, as long as there is motion in them (as in video), they seem to do alright.
I've been experimenting with LoRA support for FramePack. Since it is based on Hunyuan Video, with a finetuned transformer (and some model changes), I decided to naively just replace the regular transformer and see what happened during training (I use finetrainers). The TL;DR is: training occurs, and the model improves.
That said, it's not perfect. Either it requires longer training, or the differences in the transformer call for a more tailored training script. But hopefully this will inspire others to experiment more.
As test subject, I used one of my more niche LoRAs, 1970s Martial Arts Movies, because it has a certain style and a couple of unique camera movements. (I could have just chosen one camera movement and no style and saved time. Shrug.)
I took the first image from one of the t2v generations for my Hunyuan Video LoRA. This becomes the baseline:
The fast pan to the right is what I wanted to reproduce.
Prompt: "a man in a traditional chinese martial arts suit. then the camera pans right to show another man in a grey martial arts suit."
Here is unmodified FramePack:
It picks up some things from the prompt, but rotates instead of pans. It seems it doesn't know the concept I want to teach it (good). (It did pan in some of the test generations, but slowly.)
And this is the LoRA at 600 steps, roughly 2.5 hours of training on my 3090, on a mix of images and video clips.
Quality-wise, it's not at all as good as the baseline, but on the other hand it's much longer, trained under sub-par conditions, and possibly undertrained.
The observant will notice that it's not the default resolution of the FramePack demo. I chose the one "native" to the original t2v generation, since I've noticed that deviating from the trained resolutions lowers quality. The same could apply to the "no LoRA" generation: its result may be affected by not using the resolutions preferred by FramePack.
I tried many generations with various starting images and prompts, and this was the one that best represented the LoRA. So yes, cherry-picked. But the point of this experiment is to show that FramePack can be finetuned, even in its current state.
For inference, I use my own fork, which has LoRA support hacked in:
https://github.com/neph1/FramePack
I use the default settings (except for resolution).
More examples
Here follow some more examples that are not as clear-cut.
A similar prompt to the one above, but I think it was only "The camera pans to the right to reveal ... "
Hand movements that are prevalent in several of the training clips. Better style adherence.
Nice movements, much better than my baseline Hunyuan LoRA. A bit... Muay Thai? (I'm not an expert.)
Again, I'm not a martial arts expert, but I think it's more like my training clips. More fluid movements? Also better visual style adherence.
Replication
This is the way I trained it. You should be fine with other trainers, diffusion-pipe for example. Sorry if this is a complicated way; it's how I do it.
Training
Download the Hunyuan Video model: https://huggingface.co/hunyuanvideo-community/HunyuanVideo (I used the t2v model)
Download the FramePack model: https://huggingface.co/lllyasviel/FramePackI2V_HY (this is only the transformer)
Replace the "transformer" folder in Hunyuan Video with the one from FramePack. (I symlinked it; see the sketch after these steps.)
Download finetrainers https://github.com/a-r-r-o-w/finetrainers (I use the v0.0.1 tag, 'git checkout v0.0.1')
Optional: Use https://github.com/neph1/finetrainers-ui if you want a UI (use v0.11.2 with finetrainers v0.0.1)
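If you prefer to script the model setup, here is a minimal sketch of the download and transformer swap, assuming huggingface_hub is installed; the local_dir paths are just placeholders.

# Minimal sketch of the download + transformer swap. The local_dir paths are
# placeholders; adjust them to your setup.
import os
from huggingface_hub import snapshot_download

hunyuan_dir = snapshot_download(
    "hunyuanvideo-community/HunyuanVideo",
    local_dir="models/HunyuanVideo",
)
framepack_dir = snapshot_download(
    "lllyasviel/FramePackI2V_HY",
    local_dir="models/FramePackI2V_HY",
)

# FramePackI2V_HY ships only the transformer, so the whole snapshot becomes the
# "transformer" folder of the Hunyuan Video model. Keep the original around.
transformer_dir = os.path.join(hunyuan_dir, "transformer")
if os.path.isdir(transformer_dir) and not os.path.islink(transformer_dir):
    os.rename(transformer_dir, transformer_dir + "_original")
os.symlink(os.path.abspath(framepack_dir), transformer_dir, target_is_directory=True)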
This is the config I used:
accelerate_config: uncompiled_1.yaml
allow_tf32: true
batch_size: 1
beta1: 0.9
beta2: 0.95
caption_column: caption
caption_dropout_p: 0.05
caption_dropout_technique: empty
checkpointing_limit: 10
checkpointing_steps: 250
data_root: 'path to your dataset'
dataloader_num_workers: 0
dataset_file: metadata.json
diffusion_options: ''
enable_model_cpu_offload: ''
enable_slicing: true
enable_tiling: true
epsilon: 1e-8
gpu_ids: '0'
gradient_accumulation_steps: 8
gradient_checkpointing: true
id_token: 70s_kungfu
image_resolution_buckets: 480x544 384x544 352x544 544x352 320x544 448x544 256x544 224x544 192x544
layerwise_upcasting_modules: transformer
layerwise_upcasting_skip_modules_pattern: patch_embed pos_embed x_embedder context_embedder ^proj_in$ ^proj_out$ norm
layerwise_upcasting_storage_dtype: float8_e4m3fn
lora_alpha: 64
lr: 0.0003
lr_num_cycles: 1
lr_scheduler: linear
lr_warmup_steps: 50
max_grad_norm: 1
model_name: hunyuan_video
nccl_timeout: 1800
num_validation_videos: 0
optimizer: adamw
output_dir: where to put the results
pin_memory: true
precompute_conditions: true
pretrained_model_name_or_path: 'path to your hunyuan video model'
rank: 64
report_to: none
resume_from_checkpoint: ''
seed: 425
target_modules: to_q to_k to_v to_out.0
text_encoder_2_dtype: bf16
text_encoder_3_dtype: bf16
text_encoder_dtype: bf16
tracker_name: finetrainers
train_steps: 600
training_type: lora
transformer_dtype: bf16
use_8bit_bnb: ''
vae_dtype: bf16
validation_epochs: 0
validation_prompt_separator: ':::'
validation_prompts: ''
validation_steps: 10000
video_column: file
video_resolution_buckets: 1x480x544 1x384x544 1x352x544 1x544x352 1x320x544 1x448x544 1x352x576 1x320x576 24x192x320 24x192x352 24x224x320 32x192x320 32x192x352 32x224x320
weight_decay: 0.001
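For reference, data_root, dataset_file, caption_column and video_column above point at a metadata file listing the clips/images and their captions. Below is a hypothetical sketch of that layout; the file names and captions are made up, and you should check the finetrainers dataset docs for the exact format your version expects. As far as I understand, the id_token (70s_kungfu) is prepended to each caption by finetrainers, so it doesn't need to be written into the captions themselves.

# Hypothetical sketch of the dataset layout; file names and captions are made up.
# The keys ("file", "caption") match video_column / caption_column in the config.
import json
from pathlib import Path

data_root = Path("path/to/your/dataset")  # same as data_root in the config
entries = [
    {"file": "videos/pan_right_01.mp4",
     "caption": "a man in a traditional chinese martial arts suit. then the camera "
                "pans right to show another man in a grey martial arts suit."},
    {"file": "images/still_01.png",
     "caption": "two men face off in a dusty courtyard, 1970s film grain."},
]
(data_root / "metadata.json").write_text(json.dumps(entries, indent=2))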
Inference
I used my own fork of FramePack: https://github.com/neph1/FramePack
There's a model_config.json where you can add an optional LoRA path:
"lora":
{ "path": "path to lora",
"name": "pytorch_lora_weights.safetensors" <- or some other name. must be safetensors
}
I've also created a PR: https://github.com/lllyasviel/FramePack/pull/157. With it, you don't need the .json; just pass '--lora path_to_the_lora'.
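For the curious, the loading itself doesn't take much code. This is not the exact code from my fork, just a rough sketch of what it boils down to, assuming a recent diffusers and that the packed FramePack transformer inherits diffusers' PeftAdapterMixin (as the upstream Hunyuan Video transformer does):

# Rough sketch, not the exact code from my fork. Assumes the FramePack transformer
# inherits diffusers' PeftAdapterMixin so it can load a LoRA adapter directly.
import torch
from diffusers_helper.models.hunyuan_video_packed import HunyuanVideoTransformer3DModelPacked

transformer = HunyuanVideoTransformer3DModelPacked.from_pretrained(
    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
)

# The path and weight name correspond to the model_config.json entries above.
transformer.load_lora_adapter(
    "path to lora",                                  # folder containing the LoRA
    weight_name="pytorch_lora_weights.safetensors",  # must be a .safetensors file
)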
The LoRA will not be compatible with ComfyUI by default. You can run this script to convert it: https://github.com/a-r-r-o-w/finetrainers/blob/main/examples/formats/hunyuan_video/convert_to_original_format.py
But as of now, I don't think the ComfyUI FramePack wrapper has LoRA support.