metadata
license: apache-2.0
tags:
  - text-to-video
  - video-generation
  - baai-nova

NOVA (d48w1024-osp480) Model Card

Model Details

  • Developed by: BAAI
  • Model type: Non-quantized Autoregressive Text-to-Video Generation Model
  • Model size: 645M
  • Model precision: torch.float16 (FP16)
  • Model resolution: 768x480
  • Model Description: This model generates and modifies videos from text prompts. It is a Non-quantized Video Autoregressive (NOVA) diffusion model that uses a pretrained text encoder (Phi-2) and a VAE video tokenizer (OpenSoraPlanV1.2-VAE).
  • Model License: Apache 2.0 License
  • Resources for more information: GitHub Repository.
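
As a rough guide to hardware requirements, the memory needed for the weights can be estimated from the model size and precision listed above. This is a back-of-the-envelope sketch, not a measured figure: it assumes "645M" means 645 × 10^6 parameters, that every parameter is stored in FP16 (2 bytes), and it ignores the Phi-2 text encoder, the VAE, activations, and framework overhead, so actual GPU usage will be higher. The helper name `weight_memory_bytes` is illustrative only.

```python
def weight_memory_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Bytes needed to hold the weights alone (FP16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param


NOVA_PARAMS = 645_000_000  # "Model size: 645M", assumed to mean 645e6 parameters

nbytes = weight_memory_bytes(NOVA_PARAMS)
print(f"{nbytes / 1e9:.2f} GB ({nbytes / 2**30:.2f} GiB)")  # 1.29 GB (1.20 GiB)
```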

Examples

Use the 🤗 Diffusers library to run NOVA in a simple and efficient manner.

pip install diffusers transformers accelerate imageio[ffmpeg]
pip install git+ssh://git@github.com/baaivision/NOVA.git

Running the pipeline:

import torch
from diffnext.pipelines import NOVAPipeline
from diffnext.utils import export_to_image, export_to_video

model_id = "BAAI/nova-d48w1024-osp480"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to("cuda")

prompt = "Many spotted jellyfish pulsating under water."

image = pipe(prompt, max_latent_length=1).frames[0, 0]
export_to_image(image, "jellyfish.jpg")

video = pipe(prompt, max_latent_length=9).frames[0]
export_to_video(video, "jellyfish.mp4", fps=12)

# Increase AR and diffusion steps for better video quality.
video = pipe(
    prompt,
    max_latent_length=9,
    num_inference_steps=128,  # default: 64
    num_diffusion_steps=100,  # default: 25
).frames[0]
export_to_video(video, "jellyfish_v2.mp4", fps=12)
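
To compare quality against run time, the two step settings can be swept together. The sketch below only builds the keyword arguments, reusing the parameter names and defaults from the example above and assuming, as the comment there suggests, that `num_inference_steps` controls the AR steps. The helper name `step_sweep` and the chosen values are illustrative assumptions, not recommendations from the NOVA authors.

```python
from itertools import product


def step_sweep(ar_steps, diffusion_steps, max_latent_length=9):
    """Return one kwargs dict per (AR steps, diffusion steps) combination."""
    return [
        {
            "max_latent_length": max_latent_length,
            "num_inference_steps": ar,
            "num_diffusion_steps": diff,
        }
        for ar, diff in product(ar_steps, diffusion_steps)
    ]


# Defaults (64, 25) plus the higher settings used above (128, 100).
configs = step_sweep(ar_steps=[64, 128], diffusion_steps=[25, 100])
for cfg in configs:
    tag = f"ar{cfg['num_inference_steps']}_diff{cfg['num_diffusion_steps']}"
    print(tag, cfg)
    # With `pipe` and `prompt` defined as in the example above:
    # video = pipe(prompt, **cfg).frames[0]
    # export_to_video(video, f"jellyfish_{tag}.mp4", fps=12)
```

Each dict can be passed straight to the pipeline call, so the outputs differ only in the step counts being compared.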

Uses

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:

  • Research on generative models.
  • Applications in educational or creative tools.
  • Generation of artworks and use in design and other artistic processes.
  • Probing and understanding the limitations and biases of generative models.
  • Safe deployment of models which have the potential to generate harmful content.

Excluded uses are described below.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

  • Mis- and disinformation.
  • Representations of egregious violence and gore.
  • Impersonating individuals without their consent.
  • Sexual content without consent of the people who might see it.
  • Sharing of copyrighted or licensed material in violation of its terms of use.
  • Intentionally promoting or propagating discriminatory content or harmful stereotypes.
  • Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
  • Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.

Limitations and Bias

Limitations

  • The autoencoding part of the model is lossy.
  • The model cannot render complex legible text.
  • The model does not achieve perfect photorealism.
  • Fingers and similar fine details are often not generated properly.
  • The model was trained on a subset of the web datasets LAION-5B and COYO-700M, which contain adult, violent and sexual content.
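
The loss introduced by the autoencoding stage can be quantified with a standard reconstruction metric such as PSNR. The following is a minimal, generic sketch and not part of the NOVA codebase: it assumes the original frame and its reconstruction are available as equal-length flat sequences of 8-bit pixel values, and the function name `psnr` is illustrative.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs: nothing was lost
    return 10.0 * math.log10(max_value ** 2 / mse)


frame = [0, 64, 128, 255]
recon = [2, 62, 130, 253]  # every pixel is off by 2, so the MSE is 4
print(round(psnr(frame, recon), 2))  # about 42.11 dB
```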

Bias

While the capabilities of image and video generation models are impressive, they can also reinforce or exacerbate social biases.