- **Emu3** shows strong vision-language understanding capabilities: it can perceive the physical world and provide coherent text responses. Notably, this is achieved without relying on CLIP or a pretrained LLM.
- **Emu3** generates videos causally by simply predicting the next token in a video sequence, unlike video diffusion models such as Sora. Given a video as context, Emu3 can also naturally extend it and predict what will happen next.
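
The causal next-token formulation can be illustrated with a toy sketch. Everything below (the vocabulary size, the stand-in `embed`/`head` modules, `predict_next`) is hypothetical and only demonstrates the autoregressive loop; it is not Emu3's actual tokenizer or transformer.

```python
# Toy sketch of causal video generation: frames become discrete tokens,
# and each new token is predicted from all previous ones.
# NOTE: the "model" here is a random stand-in, not Emu3.
import torch

torch.manual_seed(0)

VOCAB = 16            # hypothetical visual-token vocabulary size
TOKENS_PER_FRAME = 4  # hypothetical tokens per video frame

# stand-in "model": embed the prefix, summarize it, score next tokens
embed = torch.nn.Embedding(VOCAB, 8)
head = torch.nn.Linear(8, VOCAB)

def predict_next(tokens: torch.Tensor) -> int:
    """Greedy next-token prediction from the token prefix."""
    with torch.no_grad():
        h = embed(tokens).mean(dim=0)  # summarize the prefix
        return int(head(h).argmax())

# context: tokens of one existing video frame
context = torch.randint(0, VOCAB, (TOKENS_PER_FRAME,))

# extend the video by generating one more frame, token by token
generated = context.tolist()
for _ in range(TOKENS_PER_FRAME):
    generated.append(predict_next(torch.tensor(generated)))

print(len(generated))  # two frames' worth of tokens
```

Because the same prefix-conditioned loop handles both generation from scratch and continuation of an existing video, "predict what happens next" falls out of the formulation for free.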

### Quickstart for Autoencoding

```python
import os
import os.path as osp

from PIL import Image
import torch
from transformers import AutoModel, AutoImageProcessor

MODEL_HUB = "BAAI/Emu3-VisionTokenizer"

model = AutoModel.from_pretrained(MODEL_HUB, trust_remote_code=True).eval().cuda()
processor = AutoImageProcessor.from_pretrained(MODEL_HUB, trust_remote_code=True)

# TODO: you need to modify the path here
VIDEO_FRAMES_PATH = "YOUR_VIDEO_FRAMES_PATH"

video = os.listdir(VIDEO_FRAMES_PATH)
video.sort()
video = [Image.open(osp.join(VIDEO_FRAMES_PATH, v)) for v in video]

images = processor(video, return_tensors="pt")["pixel_values"]
images = images.unsqueeze(0).cuda()

# image autoencode
image = images[:, 0]
print(image.shape)
with torch.no_grad():
    # encode
    codes = model.encode(image)
    # decode
    recon = model.decode(codes)

recon = recon.view(-1, *recon.shape[2:])
recon_image = processor.postprocess(recon)["pixel_values"][0]
recon_image.save("recon_image.png")

# video autoencode
images = images.view(
    -1,
    model.config.temporal_downsample_factor,
    *images.shape[2:],
)

print(images.shape)
with torch.no_grad():
    # encode
    codes = model.encode(images)
    # decode
    recon = model.decode(codes)

recon = recon.view(-1, *recon.shape[2:])
recon_images = processor.postprocess(recon)["pixel_values"]
for idx, im in enumerate(recon_images):
    im.save(f"recon_video_{idx}.png")
```