|
--- |
|
license: apache-2.0 |
|
pipeline_tag: image-text-to-text |
|
library_name: transformers |
|
tags: |
|
- SAIL |
|
--- |
|
|
|
# SAIL |
|
|
|
[\[GitHub\]](https://github.com/bytedance/SAIL)

[\[Paper\]](https://arxiv.org/abs/2504.10462)

[\[Quick Start\]](#quick-start)
|
|
|
|
|
|
|
## Introduction |
|
|
|
SAIL is a **S**ingle tr**A**nsformer model for v**I**sion and **L**anguage: a unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a single architecture. **Without relying on any pre-trained vision encoder**, SAIL achieves competitive performance across a wide range of vision-language tasks and learns strong visual representations, rivaling state-of-the-art vision models on tasks such as semantic segmentation.
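To make the single-transformer idea concrete, the sketch below shows the general pattern (a hypothetical illustration, not SAIL's actual implementation; all layer names and sizes are invented): raw pixel patches are projected directly into the token stream and processed by the same transformer stack that decodes language.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a single-transformer VLM; attention masking and
# positional encodings are omitted for brevity.
class SingleTransformerVLM(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, patch_dim=3 * 16 * 16, num_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Raw pixels enter through a simple projection -- no pre-trained vision encoder.
        self.patch_embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patches, token_ids):
        # One shared stack attends over vision and text tokens jointly.
        seq = torch.cat([self.patch_embed(patches), self.text_embed(token_ids)], dim=1)
        return self.lm_head(self.backbone(seq))

# Example: 16 flattened 16x16 RGB patches followed by 8 text tokens.
model = SingleTransformerVLM()
logits = model(torch.randn(1, 16, 3 * 16 * 16), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```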
|
|
|
## Model |
|
|
|
| Model Name | HF Link | |
|
|:----------:|:------------------------------------------------------------------:| |
|
| SAIL-7B | [🤗 link](https://huggingface.co/ByteDance-Seed/SAIL-7B) |
|
|
|
|
|
|
|
## Quick Start |
|
|
|
We provide example code to run `SAIL`. The helper functions are imported from the `example` module in the [GitHub repository](https://github.com/bytedance/SAIL).
|
|
|
```python |
|
import copy

import torch
from transformers import DynamicCache, GenerationConfig

# Helper utilities shipped with the SAIL repository (example.py)
from example import (
    convert_image_base64_to_patches,
    create_single_prefix_mask,
    generate_mm_pos_ids_singleit,
    get_transformer_and_tokenizer,
    load_image_to_base64,
    prepare_image_textual_seq_norowsep,
)
|
|
|
NON_VISION_TOKEN_ID = -1 |
|
PATH_TO_MODEL = "path to model" |
|
PATH_TO_TOKENIZER = "path to tokenizer" |
|
IMAGE_PATH = "path to image" |
|
PROMPT = "content of prompt" |
|
|
|
model, tokenizer = get_transformer_and_tokenizer( |
|
PATH_TO_MODEL, |
|
PATH_TO_TOKENIZER |
|
) |
|
model = model.cuda() |
|
|
|
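# Patchify the image: load it, encode it as base64, then split it into an
# (nh, nw) grid of flattened pixel patches at the model's vision patch size.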
image_processor = lambda x: convert_image_base64_to_patches(load_image_to_base64(x), model.config.vision_patch_size, fix_res_size=None) |
|
prompt_inp = tokenizer.bos_token + '[INST] {} [/INST]'.format(PROMPT) |
|
image_path = IMAGE_PATH |
|
image_patches = image_processor(image_path) |
|
nh, nw = image_patches.shape[:2] |
|
image_tokens, image_tokens_len = prepare_image_textual_seq_norowsep(nh, nw, tokenizer, add_cls=False) |
|
|
|
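# Concatenate the vision placeholder tokens with the prompt, tokenize, and
# flatten the patch grid; the asserts check token/patch alignment.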
input_tokens = image_tokens + prompt_inp |
|
input_ids = tokenizer(input_tokens, add_special_tokens=False, return_tensors="pt").input_ids |
|
vision_patch_indices = torch.full_like(input_ids, fill_value=NON_VISION_TOKEN_ID) |
|
vision_patches = image_patches.view(nh * nw, -1) |
|
assert (input_ids == tokenizer.vis_patch_tok_id).sum() == vision_patches.size(0) |
|
assert (input_ids >= tokenizer.vis_beg_tok_id).sum() == image_tokens_len |
|
|
|
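# Point each vision placeholder token at its row in vision_patches, then build
# the prefix attention mask and the multimodal position ids.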
vision_patch_indices[input_ids == tokenizer.vis_patch_tok_id] = torch.arange(vision_patches.size(0))
|
attention_mask = create_single_prefix_mask(image_tokens_len, input_ids.size(-1)).unsqueeze(0).unsqueeze(0) |
|
position_ids = generate_mm_pos_ids_singleit(input_ids.squeeze(0).numpy().tolist(), tokenizer.vis_patch_tok_id, nh, nw).unsqueeze(1) |
|
|
|
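# Move tensors to the GPU; vision patches are cast to bfloat16.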
input_ids = input_ids.long().cuda() |
|
vision_patch_indices = vision_patch_indices.long().cuda() |
|
vision_patches = vision_patches.to(torch.bfloat16).cuda() |
|
position_ids = position_ids.long().cuda() |
|
attention_mask = attention_mask.cuda() |
|
|
|
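# All-ones padding mask over the full sequence, used for the generation call.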
padding_attention_mask = torch.ones_like(input_ids).cuda() |
|
|
|
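# Full-sequence inputs passed to generate().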
inputs = dict( |
|
input_ids = input_ids, |
|
position_ids = position_ids, |
|
attention_mask = padding_attention_mask, |
|
vision_patches = vision_patches, |
|
vision_patch_indices = vision_patch_indices, |
|
use_cache=True |
|
) |
|
|
|
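# Prefix-only inputs (the image tokens) used once to pre-fill the KV cache.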
cached_inputs = dict( |
|
input_ids = input_ids[:, :image_tokens_len], |
|
position_ids = position_ids[:, :, :image_tokens_len], |
|
    attention_mask = attention_mask[:, :, :image_tokens_len, :image_tokens_len],
|
vision_patches = vision_patches, |
|
vision_patch_indices = vision_patch_indices[:, :image_tokens_len], |
|
use_cache=True |
|
) |
|
|
|
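# Pre-fill the KV cache for the image prefix with a single forward pass.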
prefix_cache = DynamicCache() |
|
with torch.no_grad(): |
|
prefix_cache = model.forward(**cached_inputs, past_key_values=prefix_cache).past_key_values |
|
|
|
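# Decode from a copy of the prefix cache so the pre-filled cache stays reusable.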
past_key_values = copy.deepcopy(prefix_cache) |
|
generate_config = GenerationConfig( |
|
max_new_tokens=1024, |
|
return_dict_in_generate=True, |
|
output_attentions=False |
|
) |
|
generated = model.generate( |
|
**inputs, |
|
past_key_values=past_key_values, |
|
generation_config=generate_config |
|
) |
|
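# Drop the prompt portion and decode only the newly generated tokens.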
generated_ids = generated['sequences'][:, input_ids.size(1):] |
|
response = tokenizer.batch_decode( |
|
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
)[0] |
|
|
|
print(f"\nModel Response: ===\n{response}\n===") |
|
``` |
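The script pre-fills the KV cache over the image tokens once and reuses it during decoding, so the image prefix is not re-encoded at every generation step. For reference, the placeholders at the top might be filled in as follows (values are illustrative; the repo id comes from the table above, assuming the loader accepts either a repo id or a local path):

```python
PATH_TO_MODEL = "ByteDance-Seed/SAIL-7B"       # or a local checkpoint directory
PATH_TO_TOKENIZER = "ByteDance-Seed/SAIL-7B"   # the tokenizer ships with the model
IMAGE_PATH = "./demo.jpg"                      # hypothetical example image
PROMPT = "Describe this image in detail."      # hypothetical example prompt
```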
|
|
|
## Citation |
|
|
|
If you find this project useful in your research, please consider citing: |
|
|
|
```BibTeX |
|
@article{lei2025sail, |
|
title={The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer}, |
|
author={Lei, Weixian and Wang, Jiacong and Wang, Haochen and Li, Xiangtai and Liew, Jun Hao and Feng, Jiashi and Huang, Zilong}, |
|
journal={arXiv preprint arXiv:2504.10462}, |
|
year={2025} |
|
} |
|
``` |