- **Emu3** shows strong vision-language understanding capabilities: it can perceive the physical world and provide coherent text responses. Notably, this is achieved without relying on CLIP or a pretrained LLM.
- **Emu3** generates videos causally by simply predicting the next token in a video sequence, unlike video diffusion models such as Sora. Given a video as context, Emu3 can also naturally extend it and predict what will happen next.
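
The causal next-token formulation can be illustrated with a toy sketch. Everything below (the vocabulary size, the stand-in `embed`/`head` modules, `predict_next`) is hypothetical and only demonstrates the autoregressive loop; it is not Emu3's actual tokenizer or transformer.

```python
# Toy sketch of causal video generation: frames become discrete tokens,
# and each new token is predicted from all previous ones.
# NOTE: the "model" here is a random stand-in, not Emu3.
import torch

torch.manual_seed(0)

VOCAB = 16            # hypothetical visual-token vocabulary size
TOKENS_PER_FRAME = 4  # hypothetical tokens per video frame

# stand-in "model": embed the prefix, summarize it, score next tokens
embed = torch.nn.Embedding(VOCAB, 8)
head = torch.nn.Linear(8, VOCAB)

def predict_next(tokens: torch.Tensor) -> int:
    """Greedy next-token prediction from the token prefix."""
    with torch.no_grad():
        h = embed(tokens).mean(dim=0)  # summarize the prefix
        return int(head(h).argmax())

# context: tokens of one existing video frame
context = torch.randint(0, VOCAB, (TOKENS_PER_FRAME,))

# extend the video by generating one more frame, token by token
generated = context.tolist()
for _ in range(TOKENS_PER_FRAME):
    generated.append(predict_next(torch.tensor(generated)))

print(len(generated))  # two frames' worth of tokens
```

Because the same prefix-conditioned loop handles both generation from scratch and continuation of an existing video, "predict what happens next" falls out of the formulation for free.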

### Quickstart for Autoencoding

```python
import os
import os.path as osp

from PIL import Image
import torch
from transformers import AutoModel, AutoImageProcessor

MODEL_HUB = "BAAI/Emu3-VisionTokenizer"

model = AutoModel.from_pretrained(MODEL_HUB, trust_remote_code=True).eval().cuda()
processor = AutoImageProcessor.from_pretrained(MODEL_HUB, trust_remote_code=True)

# TODO: you need to modify the path here
VIDEO_FRAMES_PATH = "YOUR_VIDEO_FRAMES_PATH"

video = os.listdir(VIDEO_FRAMES_PATH)
video.sort()
video = [Image.open(osp.join(VIDEO_FRAMES_PATH, v)) for v in video]

images = processor(video, return_tensors="pt")["pixel_values"]
images = images.unsqueeze(0).cuda()

# image autoencode
image = images[:, 0]
print(image.shape)
with torch.no_grad():
    # encode
    codes = model.encode(image)
    # decode
    recon = model.decode(codes)

recon = recon.view(-1, *recon.shape[2:])
recon_image = processor.postprocess(recon)["pixel_values"][0]
recon_image.save("recon_image.png")

# video autoencode
images = images.view(
    -1,
    model.config.temporal_downsample_factor,
    *images.shape[2:],
)

print(images.shape)
with torch.no_grad():
    # encode
    codes = model.encode(images)
    # decode
    recon = model.decode(codes)

recon = recon.view(-1, *recon.shape[2:])
recon_images = processor.postprocess(recon)["pixel_values"]
for idx, im in enumerate(recon_images):
    im.save(f"recon_video_{idx}.png")
```