SkyReels-A2: Compose Anything in Video Diffusion Transformers
Github · Playground · Discord
This repo contains Diffusers-style model weights for the SkyReels-A2 models. The inference code is available in the SkyReels-A2 repository.
Models
| Models | Download Link | Video Size |
|---|---|---|
| A2-Wan2.1-14B-Preview | Huggingface 🤗 | ~ 81 x 480 x 832 |
| A2-Wan2.1-14B | To be released | ~ 81 x 480 x 832 |
| A2-Wan2.1-14B-Infinity | To be released | ~ Inf x 720 x 1080 |
Overview of the SkyReels-A2 framework. Our approach begins by encoding all reference images along two distinct branches. The first, the spatial feature branch (shown in red, top row), uses a fine-grained VAE encoder to process per-composition images. The second, the semantic feature branch (bottom row), uses a CLIP vision encoder followed by an MLP projection to encode semantic references. The spatial features are then concatenated with the noised video tokens along the channel dimension before being passed through the diffusion transformer blocks, while the semantic features are injected into the diffusion transformer via additional cross-attention layers, so that the semantic context is integrated throughout the diffusion process.
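The two conditioning paths described above can be sketched in a few lines. This is a minimal illustration, not the actual model code: all shapes, the toy single-head attention, and the random "features" are assumptions for demonstration; the real model uses Wan2.1's VAE latent space and full transformer blocks.

```python
import numpy as np

# Hypothetical latent shapes, for illustration only.
B, C, T, H, W = 1, 16, 21, 60, 104
noised_latents = np.random.randn(B, C, T, H, W).astype(np.float32)
spatial_ref = np.random.randn(B, C, T, H, W).astype(np.float32)  # VAE-encoded references

# Spatial branch: concatenate reference latents with the noised video
# latents along the channel dimension before the DiT blocks.
dit_input = np.concatenate([noised_latents, spatial_ref], axis=1)  # (B, 2C, T, H, W)

# Semantic branch: CLIP features projected by an MLP are injected via
# extra cross-attention layers (sketched here as one attention step).
def cross_attention(q, kv):
    # q: (N, d) video tokens attend to kv: (M, d) semantic tokens.
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

d = 64
video_tokens = np.random.randn(8, d)      # flattened video tokens (toy size)
semantic_tokens = np.random.randn(3, d)   # CLIP + MLP outputs (assumed)
attended = cross_attention(video_tokens, semantic_tokens)  # (8, d)
```

The key design point is that spatial conditioning doubles the input channel count of the transformer, while semantic conditioning leaves the token stream untouched and enters only through the added cross-attention layers.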
Some generated results:
Citation
If you find SkyReels-A2 useful for your research, please cite our work using the following BibTeX:
@article{fei2025skyreels,
  title={SkyReels-A2: Compose Anything in Video Diffusion Transformers},
  author={SkyReels Team},
  journal={arXiv},
  year={2025}
}