Diffusers · Safetensors

SkyReels-A2: Compose Anything in Video Diffusion Transformers

🌐 Github · 👋 Playground · Discord

This repo contains Diffusers-style model weights for the SkyReels-A2 models. You can find the inference code in the SkyReels-A2 repository.

🪄 Models

| Models | Download Link | Video Size |
|---|---|---|
| A2-Wan2.1-14B-Preview | 🤗 Huggingface | ~ 81 x 480 x 832 |
| A2-Wan2.1-14B | To be released | ~ 81 x 480 x 832 |
| A2-Wan2.1-14B-Infinity | To be released | ~ Inf x 720 x 1080 |


Overview of the SkyReels-A2 framework. Our approach begins by encoding all reference images through two distinct branches. The first, the spatial feature branch (red, top row), uses a fine-grained VAE encoder to process per-composition images. The second, the semantic feature branch (red, bottom row), uses a CLIP vision encoder followed by an MLP projection to encode semantic references. The spatial features are then concatenated with the noised video tokens along the channel dimension before being passed through the diffusion transformer blocks, while the semantic features extracted from the reference images are injected into the diffusion transformers via supplementary cross-attention layers, ensuring that the semantic context is effectively integrated during diffusion.
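The two injection paths described above can be sketched with toy numpy shapes. This is a minimal illustration, not the model's real implementation: all dimensions, projections, and the single-head attention here are hypothetical stand-ins for the actual DiT, VAE, and CLIP components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, chosen only for illustration).
T, H, W = 4, 8, 8        # latent video: frames x height x width
C_vid, C_ref = 16, 16    # channels of noised video latents / VAE reference features
D = 32                   # transformer hidden size
N_sem = 3                # number of semantic (CLIP) reference tokens

# Spatial branch: fine-grained VAE features of the per-composition images.
ref_spatial = rng.standard_normal((C_ref, T, H, W))
noised_video = rng.standard_normal((C_vid, T, H, W))

# 1) Concatenate along the channel dimension before the DiT blocks.
dit_input = np.concatenate([noised_video, ref_spatial], axis=0)
assert dit_input.shape == (C_vid + C_ref, T, H, W)

# Flatten to a token sequence and project into the transformer width.
tokens = dit_input.reshape(C_vid + C_ref, -1).T          # (T*H*W, C_vid+C_ref)
W_in = rng.standard_normal((C_vid + C_ref, D)) / np.sqrt(C_vid + C_ref)
x = tokens @ W_in                                        # video tokens (T*H*W, D)

# Semantic branch: CLIP vision features passed through an MLP projection.
clip_feats = rng.standard_normal((N_sem, 512))
W_mlp = rng.standard_normal((512, D)) / np.sqrt(512)
sem = clip_feats @ W_mlp                                 # semantic tokens (N_sem, D)

# 2) Inject semantic tokens via cross-attention:
#    queries come from video tokens, keys/values from semantic tokens.
def cross_attention(q_tokens, kv_tokens):
    scores = q_tokens @ kv_tokens.T / np.sqrt(q_tokens.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv_tokens

x = x + cross_attention(x, sem)  # residual semantic injection
print(x.shape)                   # (256, 32)
```

The key design point the sketch mirrors is that spatial detail enters once, cheaply, via channel concatenation, while semantic identity is re-queried at every transformer block through the extra cross-attention layers.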



Citation

If you find SkyReels-A2 useful for your research, please cite our work using the following BibTeX:

```bibtex
@article{fei2025skyreels,
  title={SkyReels-A2: Compose Anything in Video Diffusion Transformers},
  author={Skyreels Team},
  journal={arXiv},
  year={2025}
}
```