FramePack LoRA Experiment
Update: On further testing, I've come to realise retraining may not be necessary. Maybe for some LoRAs, but for others, as long as there is motion in them (as in video), they seem to do alright.
I've been experimenting with LoRA support for FramePack. Since it is based on Hunyuan Video, with a finetuned transformer (and some model changes), I decided to naively just replace the regular transformer and see what happened during training (I use finetrainers). The TL;DR is: training occurs, and the model improves.
That said, it's not perfect. Either it requires longer training, or the differences in the transformer call for a more tailored training script. But hopefully this will inspire others to experiment more.
As test subject, I used one of my more niche LoRAs, 1970s Martial Arts Movies, because it has a certain style and a couple of unique camera movements. (I could have just chosen one camera movement and no style and saved time. Shrug.)
I took the first image from one of the t2v generations for my Hunyuan Video LoRA. This becomes the baseline:
The fast pan to the right is what I wanted to reproduce.
Prompt: "a man in a traditional chinese martial arts suit. then the camera pans right to show another man in a grey martial arts suit."
Here is unmodified FramePack:
It picks up some things from the prompt, but rotates instead of pans. It seems it doesn't know the concept I want to teach it (good). (It did pan in some of the test generations, but slowly.)
And this is the LoRA at 600 steps, roughly 2.5 hours of training on my 3090, on a mix of images and video clips.
Quality-wise, it's not at all as good as the baseline, but on the other hand it's much longer, trained under sub-par conditions, and possibly undertrained.
The observant will notice that it's not the default resolution of the FramePack demo. I chose the one "native" to the original t2v generation, since I've noticed that deviating from the trained resolutions lowers quality. The same could apply to the "no LoRA" generation: its result may be affected by not using the resolutions preferred by FramePack.
I tried many generations with various starting images and prompts, and this was the one that best represented the LoRA. So yes, cherry-picked. But the point of this experiment is to show that FramePack can be finetuned, even in its current state.
For inference, I use my own fork, which has LoRA support hacked in:
https://github.com/neph1/FramePack
I use the default settings (except for resolution).
More examples
Here follow some more examples that are not as clear-cut.
A similar prompt to the one above, but I think it was only "The camera pans to the right to reveal ... "
Hand movements that are prevalent in several of the training clips. Better style adherence.
Nice movements, much better than my baseline Hunyuan LoRA. A bit... Muay Thai? (I'm not an expert.)
Again, I'm not a martial arts expert, but I think it's more like my training clips. More fluid movements? Also better visual style adherence.
Replication
This is the way I trained it. You should be fine with other trainers, diffusion-pipe for example. Sorry if this is a complicated way; it's how I do it.
Training
Download the Hunyuan Video model: https://huggingface.co/hunyuanvideo-community/HunyuanVideo (I used the t2v model)
Download the FramePack model: https://huggingface.co/lllyasviel/FramePackI2V_HY (this is only the transformer)
Replace the "transformer" folder in Hunyuan Video with the one from FramePack. (I symlinked it; see the sketch after these steps.)
Download finetrainers https://github.com/a-r-r-o-w/finetrainers (I use the v0.0.1 tag, 'git checkout v0.0.1')
Optional: Use https://github.com/neph1/finetrainers-ui if you want a UI (use v0.11.2 with finetrainers v0.0.1)
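If you prefer to script the model setup, here is a minimal sketch of the download and transformer swap, assuming huggingface_hub is installed; the local_dir paths are just placeholders.

# Minimal sketch of the download + transformer swap. The local_dir paths are
# placeholders; adjust them to your setup.
import os
from huggingface_hub import snapshot_download

hunyuan_dir = snapshot_download(
    "hunyuanvideo-community/HunyuanVideo",
    local_dir="models/HunyuanVideo",
)
framepack_dir = snapshot_download(
    "lllyasviel/FramePackI2V_HY",
    local_dir="models/FramePackI2V_HY",
)

# FramePackI2V_HY ships only the transformer, so the whole snapshot becomes the
# "transformer" folder of the Hunyuan Video model. Keep the original around.
transformer_dir = os.path.join(hunyuan_dir, "transformer")
if os.path.isdir(transformer_dir) and not os.path.islink(transformer_dir):
    os.rename(transformer_dir, transformer_dir + "_original")
os.symlink(os.path.abspath(framepack_dir), transformer_dir, target_is_directory=True)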
This is the config I used:
accelerate_config: uncompiled_1.yaml
allow_tf32: true
batch_size: 1
beta1: 0.9
beta2: 0.95
caption_column: caption
caption_dropout_p: 0.05
caption_dropout_technique: empty
checkpointing_limit: 10
checkpointing_steps: 250
data_root: 'path to your dataset'
dataloader_num_workers: 0
dataset_file: metadata.json
diffusion_options: ''
enable_model_cpu_offload: ''
enable_slicing: true
enable_tiling: true
epsilon: 1e-8
gpu_ids: '0'
gradient_accumulation_steps: 8
gradient_checkpointing: true
id_token: 70s_kungfu
image_resolution_buckets: 480x544 384x544 352x544 544x352 320x544 448x544 256x544 224x544 192x544
layerwise_upcasting_modules: transformer
layerwise_upcasting_skip_modules_pattern: patch_embed pos_embed x_embedder context_embedder ^proj_in$ ^proj_out$ norm
layerwise_upcasting_storage_dtype: float8_e4m3fn
lora_alpha: 64
lr: 0.0003
lr_num_cycles: 1
lr_scheduler: linear
lr_warmup_steps: 50
max_grad_norm: 1
model_name: hunyuan_video
nccl_timeout: 1800
num_validation_videos: 0
optimizer: adamw
output_dir: where to put the results
pin_memory: true
precompute_conditions: true
pretrained_model_name_or_path: 'path to your hunyuan video model'
rank: 64
report_to: none
resume_from_checkpoint: ''
seed: 425
target_modules: to_q to_k to_v to_out.0
text_encoder_2_dtype: bf16
text_encoder_3_dtype: bf16
text_encoder_dtype: bf16
tracker_name: finetrainers
train_steps: 600
training_type: lora
transformer_dtype: bf16
use_8bit_bnb: ''
vae_dtype: bf16
validation_epochs: 0
validation_prompt_separator: ':::'
validation_prompts: ''
validation_steps: 10000
video_column: file
video_resolution_buckets: 1x480x544 1x384x544 1x352x544 1x544x352 1x320x544 1x448x544 1x352x576 1x320x576 24x192x320 24x192x352 24x224x320 32x192x320 32x192x352 32x224x320
weight_decay: 0.001
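For reference, data_root, dataset_file, caption_column and video_column above point at a metadata file listing the clips/images and their captions. Below is a hypothetical sketch of that layout; the file names and captions are made up, and you should check the finetrainers dataset docs for the exact format your version expects. As far as I understand, the id_token (70s_kungfu) is prepended to each caption by finetrainers, so it doesn't need to be written into the captions themselves.

# Hypothetical sketch of the dataset layout; file names and captions are made up.
# The keys ("file", "caption") match video_column / caption_column in the config.
import json
from pathlib import Path

data_root = Path("path/to/your/dataset")  # same as data_root in the config
entries = [
    {"file": "videos/pan_right_01.mp4",
     "caption": "a man in a traditional chinese martial arts suit. then the camera "
                "pans right to show another man in a grey martial arts suit."},
    {"file": "images/still_01.png",
     "caption": "two men face off in a dusty courtyard, 1970s film grain."},
]
(data_root / "metadata.json").write_text(json.dumps(entries, indent=2))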
Inference
I used my own fork of FramePack: https://github.com/neph1/FramePack
There's a model_config.json where you can add an optional LoRA path:
"lora":
{ "path": "path to lora",
"name": "pytorch_lora_weights.safetensors" <- or some other name. must be safetensors
}
I've also created a PR: https://github.com/lllyasviel/FramePack/pull/157. With it, you don't need the .json; just pass '--lora path_to_the_lora'.
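For the curious, the loading itself doesn't take much code. This is not the exact code from my fork, just a rough sketch of what it boils down to, assuming a recent diffusers and that the packed FramePack transformer inherits diffusers' PeftAdapterMixin (as the upstream Hunyuan Video transformer does):

# Rough sketch, not the exact code from my fork. Assumes the FramePack transformer
# inherits diffusers' PeftAdapterMixin so it can load a LoRA adapter directly.
import torch
from diffusers_helper.models.hunyuan_video_packed import HunyuanVideoTransformer3DModelPacked

transformer = HunyuanVideoTransformer3DModelPacked.from_pretrained(
    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
)

# The path and weight name correspond to the model_config.json entries above.
transformer.load_lora_adapter(
    "path to lora",                                  # folder containing the LoRA
    weight_name="pytorch_lora_weights.safetensors",  # must be a .safetensors file
)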
The LoRA will not be compatible with ComfyUI by default. You can run this script to convert it: https://github.com/a-r-r-o-w/finetrainers/blob/main/examples/formats/hunyuan_video/convert_to_original_format.py
But as of now, I don't think the ComfyUI FramePack wrapper has LoRA support.