We introduce **Emu3**, a new suite of state-of-the-art multimodal models trained solely with next-token prediction.

- **Emu3** shows strong vision-language understanding capabilities: it sees the physical world and provides coherent text responses. Notably, this is achieved without depending on a CLIP encoder or a pretrained LLM.
- **Emu3** generates a video causally, simply by predicting the next token in a video sequence, unlike video diffusion models such as Sora (see the sketch below). Given a video as context, Emu3 can also naturally extend it and predict what will happen next.
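The causal formulation means video continuation reduces to ordinary autoregressive decoding over discrete vision tokens. Below is a minimal sketch under that assumption; `lm` and `extend_video` are hypothetical names for illustration, not the Emu3 API:

```python
import torch

@torch.no_grad()
def extend_video(lm, context_tokens: torch.LongTensor, n_new: int) -> torch.LongTensor:
    """Greedily append n_new vision tokens to a flattened video-token sequence.

    `lm` is any autoregressive model mapping token IDs to per-position logits
    (a hypothetical stand-in, not the Emu3 interface).
    """
    tokens = context_tokens
    for _ in range(n_new):
        logits = lm(tokens)                                 # (batch, seq, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)   # most likely next token
        tokens = torch.cat([tokens, next_tok], dim=1)       # grow the sequence causally
    return tokens
```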
### Quickstart for Autoencoding

The vision tokenizer encodes an image or a short clip into discrete codes and decodes them back into pixels:

```python
import os
import os.path as osp

from PIL import Image
import torch
from transformers import AutoModel, AutoImageProcessor

MODEL_HUB = "BAAI/Emu3-VisionTokenizer"

model = AutoModel.from_pretrained(MODEL_HUB, trust_remote_code=True).eval().cuda()
processor = AutoImageProcessor.from_pretrained(MODEL_HUB, trust_remote_code=True)

# TODO: you need to modify the path here
VIDEO_FRAMES_PATH = "YOUR_VIDEO_FRAMES_PATH"

# load the video frames in filename order
video = os.listdir(VIDEO_FRAMES_PATH)
video.sort()
video = [Image.open(osp.join(VIDEO_FRAMES_PATH, v)) for v in video]

# preprocess the frames and add a batch dimension
images = processor(video, return_tensors="pt")["pixel_values"]
images = images.unsqueeze(0).cuda()

# image autoencode: round-trip a single frame
image = images[:, 0]
print(image.shape)
with torch.no_grad():
    # encode
    codes = model.encode(image)
    # decode
    recon = model.decode(codes)

recon = recon.view(-1, *recon.shape[2:])
recon_image = processor.postprocess(recon)["pixel_values"][0]
recon_image.save("recon_image.png")

# video autoencode: regroup the frames into clips of
# `temporal_downsample_factor` frames each
images = images.view(
    -1,
    model.config.temporal_downsample_factor,
    *images.shape[2:],
)

print(images.shape)
with torch.no_grad():
    # encode
    codes = model.encode(images)
    # decode
    recon = model.decode(codes)

recon = recon.view(-1, *recon.shape[2:])
recon_images = processor.postprocess(recon)["pixel_values"]
for idx, im in enumerate(recon_images):
    im.save(f"recon_video_{idx}.png")
```
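A quick way to sanity-check the round trip is to compare a frame against its reconstruction. Here is a minimal, self-contained sketch; the `psnr` helper is our own illustration, not part of the Emu3 or transformers API:

```python
import torch

def psnr(a: torch.Tensor, b: torch.Tensor, peak: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between two same-shaped tensors (higher is better)."""
    mse = torch.mean((a.float() - b.float()) ** 2)  # mean squared error
    return 10 * torch.log10(peak ** 2 / mse)

# stand-ins for an original frame and its reconstruction
x = torch.rand(3, 256, 256)
y = x + 0.01 * torch.randn_like(x)
print(psnr(x, y))
```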