leiwx52 committed
Commit 55a5d66 · verified · 1 Parent(s): fae965c

Upload README.md with huggingface_hub

Files changed (1): README.md (+126 -3)
README.md CHANGED
@@ -1,3 +1,126 @@
- ---
- license: apache-2.0
- ---

---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- SAIL
---

# SAIL

[\[📂 GitHub\]](https://github.com/bytedance/SAIL)
[\[📜 paper\]](https://arxiv.org/abs/2504.10462)
[\[🚀 Quick Start\]](#quick-start)

## Introduction

SAIL is a **S**ingle tr**A**nsformer model for v**I**sion and **L**anguage: a unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a single architecture. **Without relying on pre-trained vision encoders**, SAIL achieves competitive performance across a wide range of vision-language tasks and learns strong visual representations, rivaling state-of-the-art vision models on tasks such as semantic segmentation.

## Model

| Model Name | HF Link |
|:----------:|:------------------------------------------------------------------:|
| SAIL-7B | [🤗 link](https://huggingface.co/ByteDance-Seed/SAIL-7B) |

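If you prefer to fetch the checkpoint ahead of time and point the quick-start paths below at a local copy, here is a minimal sketch using `huggingface_hub`; the `local_dir` value is only an example:

```python
# Minimal sketch: download the SAIL-7B weights to a local directory.
# The local_dir path is illustrative; any writable location works.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="ByteDance-Seed/SAIL-7B",
    local_dir="./SAIL-7B",
)
print(f"Checkpoint downloaded to: {local_path}")
```
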
## Quick Start

We provide example code for running `SAIL`.

```python
import copy

import torch
from transformers import DynamicCache, GenerationConfig

# Helper functions (get_transformer_and_tokenizer, convert_image_base64_to_patches,
# load_image_to_base64, prepare_image_textual_seq_norowsep, create_single_prefix_mask,
# generate_mm_pos_ids_singleit) are expected to come from the example module in the
# GitHub repository; the explicit imports above may already be re-exported by it.
from example import *

NON_VISION_TOKEN_ID = -1
PATH_TO_MODEL = "path to model"
PATH_TO_TOKENIZER = "path to tokenizer"
IMAGE_PATH = "path to image"
PROMPT = "content of prompt"

# Load the model and tokenizer.
model, tokenizer = get_transformer_and_tokenizer(
    PATH_TO_MODEL,
    PATH_TO_TOKENIZER
)
model = model.cuda()

# Convert the image into flattened patches and build the multimodal prompt.
image_processor = lambda x: convert_image_base64_to_patches(
    load_image_to_base64(x), model.config.vision_patch_size, fix_res_size=None
)
prompt_inp = tokenizer.bos_token + '[INST] {} [/INST]'.format(PROMPT)
image_patches = image_processor(IMAGE_PATH)
nh, nw = image_patches.shape[:2]
image_tokens, image_tokens_len = prepare_image_textual_seq_norowsep(nh, nw, tokenizer, add_cls=False)

# Tokenize the image placeholder tokens followed by the text prompt.
input_tokens = image_tokens + prompt_inp
input_ids = tokenizer(input_tokens, add_special_tokens=False, return_tensors="pt").input_ids
vision_patch_indices = torch.full_like(input_ids, fill_value=NON_VISION_TOKEN_ID)
vision_patches = image_patches.view(nh * nw, -1)
assert (input_ids == tokenizer.vis_patch_tok_id).sum() == vision_patches.size(0)
assert (input_ids >= tokenizer.vis_beg_tok_id).sum() == image_tokens_len

# Map each vision-patch token to its patch, then build the prefix attention mask
# and the multimodal position ids.
vision_patch_indices[input_ids == tokenizer.vis_patch_tok_id] = torch.arange(vision_patches.size(0))
attention_mask = create_single_prefix_mask(image_tokens_len, input_ids.size(-1)).unsqueeze(0).unsqueeze(0)
position_ids = generate_mm_pos_ids_singleit(
    input_ids.squeeze(0).numpy().tolist(), tokenizer.vis_patch_tok_id, nh, nw
).unsqueeze(1)

# Move everything to the GPU.
input_ids = input_ids.long().cuda()
vision_patch_indices = vision_patch_indices.long().cuda()
vision_patches = vision_patches.to(torch.bfloat16).cuda()
position_ids = position_ids.long().cuda()
attention_mask = attention_mask.cuda()

padding_attention_mask = torch.ones_like(input_ids).cuda()

# Full inputs used for generation.
inputs = dict(
    input_ids=input_ids,
    position_ids=position_ids,
    attention_mask=padding_attention_mask,
    vision_patches=vision_patches,
    vision_patch_indices=vision_patch_indices,
    use_cache=True
)

# Image-prefix-only inputs used to pre-compute the KV cache.
cached_inputs = dict(
    input_ids=input_ids[:, :image_tokens_len],
    position_ids=position_ids[:, :, :image_tokens_len],
    attention_mask=attention_mask[:, :, :image_tokens_len, :image_tokens_len],
    vision_patches=vision_patches,
    vision_patch_indices=vision_patch_indices[:, :image_tokens_len],
    use_cache=True
)

# Encode the image prefix once and keep its KV cache.
prefix_cache = DynamicCache()
with torch.no_grad():
    prefix_cache = model.forward(**cached_inputs, past_key_values=prefix_cache).past_key_values

# Reuse the cached prefix during generation.
past_key_values = copy.deepcopy(prefix_cache)
generate_config = GenerationConfig(
    max_new_tokens=1024,
    return_dict_in_generate=True,
    output_attentions=False
)
generated = model.generate(
    **inputs,
    past_key_values=past_key_values,
    generation_config=generate_config
)

# Decode only the newly generated tokens.
generated_ids = generated['sequences'][:, input_ids.size(1):]
response = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"\nModel Response: ===\n{response}\n===")
```
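In the example above, the image prefix is encoded once under the prefix attention mask from `create_single_prefix_mask`, and the resulting `DynamicCache` is deep-copied into `past_key_values`, so `model.generate` only has to process the remaining text tokens before decoding new ones. The `GenerationConfig` above uses the default (greedy) decoding strategy unless the checkpoint's own generation config overrides it; if you want sampled outputs, a variant such as the following (parameter values are only illustrative) can be passed as `generation_config` instead:

```python
# Sampling variant of the generation settings above (values are illustrative).
generate_config = GenerationConfig(
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    return_dict_in_generate=True,
)
```
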
## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{lei2025sail,
  title={The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer},
  author={Lei, Weixian and Wang, Jiacong and Wang, Haochen and Li, Xiangtai and Liew, Jun Hao and Feng, Jiashi and Huang, Zilong},
  journal={arXiv preprint arXiv:2504.10462},
  year={2025}
}
```