FramePack is hands down one of the best OS releases in video generation 🤯 - fully open sourced + amazing quality + reduced memory + improved speed, but even more: it's gonna facilitate *soooo* many downstream applications, like this version adapted for landscape rotation: https://huggingface.co/spaces/tori29umai/FramePack_rotate_landscape
The first open Stable Diffusion 3-like architecture model is JUST out - but it is not SD3!
It is Tencent-Hunyuan/HunyuanDiT by Tencent, a 1.5B parameter DiT (diffusion transformer) text-to-image model 🖼️✨, trained with multi-lingual CLIP + multi-lingual T5 text-encoders for English and Chinese understanding
3 text-encoders: 2 CLIPs, one T5-XXL; plug-and-play: removing the larger one maintains competitiveness
Dataset was deduplicated with SSCD, which helped reduce memorization (no more details about the dataset tho)
Variants:
- A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics
- An Instruct Edit 2B model was trained, and learned how to do text replacement
Results:
- State of the art in automated evals for composition and prompt understanding
- Best win rate in human preference evaluation for prompt understanding, aesthetics and typography (some details are missing on the number of participants and the design of the experiment)
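For anyone curious what the SSCD deduplication mentioned above could look like in practice, here is a minimal sketch. The actual pipeline is not described, so everything here is an assumption: SSCD (a self-supervised copy-detection descriptor) is assumed to give one embedding vector per image, the toy vectors below stand in for those embeddings, and the `deduplicate` helper and the 0.9 cosine threshold are hypothetical choices for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.9):
    """Greedy near-duplicate removal (hypothetical helper).

    Keeps an item only if its similarity to every already-kept item
    is below `threshold`. Returns the indices of the kept items.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy stand-ins for per-image descriptors (not real SSCD outputs).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of item 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # near-duplicate of item 2
    [0.0, 0.0, 1.0],
]

kept = deduplicate(embeddings, threshold=0.9)
print(kept)
```

This greedy pass compares every image against everything already kept, so it is quadratic in the worst case; at real dataset scale you would likely swap in an approximate nearest-neighbor index, but the keep-or-drop logic stays the same.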