---
library_name: transformers
tags: []
---

This repository contains the text-only LLM portion of `meta-llama/Llama-3.2-11B-Vision-Instruct`.

**How it was done**

```python
from collections import OrderedDict

import torch
from transformers import MllamaForConditionalGeneration, AutoModelForCausalLM
from transformers.models.mllama.modeling_mllama import MllamaCrossAttentionDecoderLayer

llama32_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
llama32 = MllamaForConditionalGeneration.from_pretrained(
    llama32_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)

new_layers = []
for idx, layer in enumerate(llama32.language_model.model.layers):
    if isinstance(layer, MllamaCrossAttentionDecoderLayer):
        # Cross-attention layers only take effect when an image is provided.
        # Skip them here since we want a text-only model.
        pass
    else:
        new_layers.append(layer)
llama32.language_model.model.cross_attention_layers = []
llama32.language_model.model.layers = torch.nn.ModuleList(new_layers)

# Now llama32.language_model is identical to Llama-3.1-8B-Instruct, except for the embedding size (+8).
# See: https://github.com/huggingface/transformers/blob/a22a4378d97d06b7a1d9abad6e0086d30fdea199/src/transformers/models/mllama/modeling_mllama.py#L1667C9-L1667C26
new_llama32_state_dict = OrderedDict()
for k, v in llama32.language_model.state_dict().items():
    if k == "model.embed_tokens.weight":
        # Trim the embedding matrix back to the Llama 3.1 vocabulary size.
        v = v[:128256, :]
    new_llama32_state_dict[k] = v

# Load a Llama 3.1 model for the architecture, then copy the weights in.
llama31_id = "meta-llama/Llama-3.1-8B-Instruct"
llama31 = AutoModelForCausalLM.from_pretrained(
    llama31_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:1",
)
llama31.load_state_dict(new_llama32_state_dict)
# llama31.save_pretrained("./my-cool-llama3.2")
```

**Note:** The original tokenizer's `tokenizer.chat_template` contains a `date_string` variable, which appends the current date when calling `tokenizer.apply_chat_template(messages)`. I removed this behavior in this repo, so please be aware of it when you use `AutoTokenizer.from_pretrained(this_repo)`.
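
For reference, here is a minimal usage sketch of the converted model. The repo id `this_repo` and the example message are placeholders, not part of the original conversion script; substitute the actual repository name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id for this repository; replace with the real repo name.
this_repo = "your-username/llama3.2-text-only"

model = AutoModelForCausalLM.from_pretrained(
    this_repo,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained(this_repo)

messages = [{"role": "user", "content": "Explain cross-attention in one sentence."}]
# The chat template shipped in this repo no longer injects the current date.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```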