toshi-456 committed
Commit 3dba822 · verified · 1 Parent(s): d5b9e97
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ sample.jpg filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 SB Intuitions
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,121 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - ja
+ - en
+ base_model:
+ - sbintuitions/sarashina2-7b
+ license: mit
+ tags:
+ - multimodal
+ - vision-language
+ - llama
+ - qwen2_vl
+ pipeline_tag: image-to-text
+ library_name: transformers
+ ---
+
+ # Sarashina2-Vision-8B
+ **Sarashina2-Vision-8B** is a Japanese Large Vision Language Model trained by [SB Intuitions](https://www.sbintuitions.co.jp/).
+
+ This model is based on [Sarashina2-7B](https://huggingface.co/sbintuitions/sarashina2-7b) and the image encoder of [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B).
+
+ It achieves top-level scores on 4 benchmarks (as of 2025/03/07) compared with other Japanese VLMs.
+
+ ## How to use
+ ### 1. Install dependencies
+
+ ```sh
+ pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate
+ ```
+
+ ### 2. Inference
+ The following script loads the model and runs inference on a sample image.
+ ```python
+ import requests
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ # Define model path
+ model_path = "sbintuitions/sarashina2-vision-8b"
+
+ # Load model and processor
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     device_map="cuda",
+     torch_dtype="auto",
+     trust_remote_code=True,
+ )
+
+ message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?"}]
+ text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
+ """text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
+
+ ### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?
+ ### Assistant:"""
+
+ sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-8b/resolve/main/sample.jpg"
+ image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
+ inputs = processor(
+     text=[text_prompt],
+     images=[image],
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to("cuda")
+ stopping_criteria = processor.get_stopping_criteria(["\n###"])
+
+ # Inference: Generation of the output
+ output_ids = model.generate(
+     **inputs,
+     max_new_tokens=128,
+     temperature=0.0,
+     do_sample=False,
+     stopping_criteria=stopping_criteria,
+ )
+ generated_ids = [
+     output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
+ )
+ print(output_text[0])
+ """この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京のランドマークであり、この写真では、ビル群の向こうに写っています。"""
+ ```
+
+ ### Example
+ <img src="https://huggingface.co/sbintuitions/sarashina2-vision-8b/resolve/main/sample.jpg" width="350">
+
+ |Prompt|Output|
+ |-|-|
+ |この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?|この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京のランドマークであり、この写真では、ビル群の向こうに写っています。|
+ |真ん中に映っている赤と白の物は何ですか?|真ん中に映っている赤と白のものはクレーンです。|
+
+ ## Training
+ **Sarashina2-Vision** is created through the following three-stage training process (a schematic sketch follows the list):
+
+ 1. We tune the parameters in the projector on caption datasets.
+ 2. We tune the parameters in the vision encoder and the projector on caption datasets.
+ 3. We tune the parameters in the projector and the LLM on visual instruction datasets.
+
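A schematic sketch of the stage-wise tuning described above; the `visual` and `llm` prefixes follow the parameter names in model.safetensors.index.json, while the `projector` name and the helper itself are illustrative (no training code ships with this repository):

```python
# Schematic only: mark which parameter groups are trainable in each stage.
# "visual" and "llm" match the checkpoint's parameter prefixes; "projector" is a placeholder.
def set_trainable(model, stage: int) -> None:
    groups = {
        1: ("projector",),            # stage 1: projector on caption data
        2: ("visual", "projector"),   # stage 2: vision encoder + projector on caption data
        3: ("projector", "llm"),      # stage 3: projector + LLM on visual instruction data
    }[stage]
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(groups)
```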
+ ## Evaluation Results
+ |Model|Model Size|JMMMU<sup>*1</sup>|Heron-Bench<sup>*2</sup>|JDocQA|
+ |-|-|-|-|-|
+ |[heron-chat-git-ja-stablelm-base-7b-v1](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v1)|7B|0.294|0.461|0.069|
+ |[llava-calm2-siglip](https://huggingface.co/cyberagent/llava-calm2-siglip)|7B|0.07|0.521|0.084|
+ |[Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2)|8B|0.389|0.509|0.103|
+ |[Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B)|14B|0.302|0.433|0.06|
+ |[llm-jp-3-vila-14b](https://huggingface.co/llm-jp/llm-jp-3-vila-14b)|14B|0.23|**0.665**|0.176|
+ |[EZO-InternVL2-26B](https://huggingface.co/AXCXEPT/EZO-InternVL2-26B)|26B|0.389|0.609|0.196|
+ |[Sarashina2-Vision-8B](https://huggingface.co/sbintuitions/sarashina2-vision-8b)|8B|0.393|0.648|0.229|
+ |[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14B|**0.433**|0.644|**0.245**|
+
+ 1. Evaluated on single-image samples only (1,286 samples). If answer extraction failed, we treated the answer as incorrect (score 0) instead of making a random choice, in order to eliminate stochasticity.
+ 2. GPT-4o (gpt-4o-2024-08-06) was used for LLM-as-a-Judge.
+
+
+ ## Ethical Considerations and Limitations
+ Sarashina2-Vision may generate meaningless sequences, inaccurate outputs, or biased/objectionable content. Before using Sarashina2-Vision, we ask developers to tune the model based on human preferences and safety considerations.
+
+ ## LICENSE
+ [MIT License](./LICENSE)
chat_template.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "chat_template": "{{ bos_token + '<|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human\\'s questions.\\n\\n' }}{% for message in messages %}{% if message['role'] == 'user' %}{{ '### Human: ' + message['content'] + '\\n' }}{% elif message['role'] == 'assistant' %}{{ 'Assistant: ' + message['content'] + '\\n' }}{% endif %}{% endfor %}{% if messages[-1]['role'] == 'user' %}{{ '### Assistant:' }}{% endif %}"
+ }
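For reference, the prompt format encoded above can be inspected without loading the model; a minimal sketch (assuming a local checkout of this repository and that `jinja2` is installed) that renders the template for a single user turn:

```python
# Minimal sketch: render chat_template.json by hand to inspect the prompt format.
# Assumes the file is present locally and that jinja2 is installed.
import json

from jinja2 import Template

with open("chat_template.json") as f:
    chat_template = json.load(f)["chat_template"]

messages = [{"role": "user", "content": "What is the most famous building in this photo?"}]
prompt = Template(chat_template).render(bos_token="<s>", messages=messages)
print(prompt)
# <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. ...
# ### Human: What is the most famous building in this photo?
# ### Assistant:
```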
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "architectures": [
+     "Sarashina2VisionForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_sarashina2_vision.Sarashina2VisionConfig",
+     "AutoModelForCausalLM": "modeling_sarashina2_vision.Sarashina2VisionForCausalLM"
+   },
+   "end_image_token_index": 102398,
+   "image_token_index": 14,
+   "model_type": "sarashina2_vision",
+   "start_image_token_index": 102397,
+   "text_config": {
+     "_name_or_path": "sbintuitions/sarashina2-7b",
+     "architectures": [
+       "LlamaForCausalLM"
+     ],
+     "max_position_embeddings": 4096,
+     "model_type": "llama",
+     "rms_norm_eps": 1e-05,
+     "torch_dtype": "bfloat16",
+     "vocab_size": 102400
+   },
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.47.0",
+   "vision_config": {
+     "hidden_size": 4096,
+     "in_chans": 3,
+     "model_type": "qwen2_vl",
+     "spatial_patch_size": 14
+   }
+ }
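The nested `text_config` and `vision_config` above are materialized as `LlamaConfig` and `Qwen2VLVisionConfig` objects by the configuration class that follows; a minimal sketch (same model id and `trust_remote_code` usage as in the README) of loading the composite config and reading a few fields:

```python
# Minimal sketch: load the composite config via the auto_map entries above.
# Assumes network access to the Hub; trust_remote_code is required for the custom class.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sbintuitions/sarashina2-vision-8b", trust_remote_code=True)
print(config.model_type)                        # sarashina2_vision
print(config.text_config.model_type)            # llama
print(config.vision_config.spatial_patch_size)  # 14
print(config.image_token_index, config.start_image_token_index, config.end_image_token_index)
```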
configuration_sarashina2_vision.py ADDED
@@ -0,0 +1,76 @@
+ # coding=utf-8
+ # Copyright 2025 the SB Intuitions.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """Sarashina2Vision model configuration"""
+
+ from typing import Any, Optional
+
+ from transformers import LlamaConfig, PretrainedConfig
+ from transformers.models.qwen2_vl.configuration_qwen2_vl import Qwen2VLVisionConfig
+ from transformers.utils import logging
+
+ logger = logging.get_logger(__name__)
+
+
+ class Sarashina2VisionConfig(PretrainedConfig):
+     """
+     This is the configuration class to store the configuration of a [`Sarashina2VisionModel`]. It is used to instantiate a
+     Sarashina2Vision model according to the specified arguments, defining the model architecture.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+     Args:
+         vision_config (`Dict`, *optional*):
+             The config for the visual encoder initialization.
+         text_config (`Dict`, *optional*):
+             The config for the text decoder initialization.
+         image_token_index (`int`):
+             image token id.
+         start_image_token_index (`int`):
+             start image token id.
+         end_image_token_index (`int`):
+             end image token id.
+     """
+
+     model_type = "sarashina2_vision"
+
+     def __init__(
+         self,
+         vision_config: Optional[dict[str, Any]] = None,
+         text_config: Optional[dict[str, Any]] = None,
+         image_token_index: int = 14,
+         start_image_token_index: int = 102397,
+         end_image_token_index: int = 102398,
+         **kwargs,
+     ):
+         if isinstance(text_config, dict):
+             self.text_config = LlamaConfig(**text_config)
+         elif isinstance(text_config, LlamaConfig):
+             self.text_config = text_config
+         elif text_config is None:
+             self.text_config = LlamaConfig()
+
+         if isinstance(vision_config, dict):
+             self.vision_config = Qwen2VLVisionConfig(**vision_config)
+         elif isinstance(vision_config, Qwen2VLVisionConfig):
+             self.vision_config = vision_config
+         elif vision_config is None:
+             self.vision_config = Qwen2VLVisionConfig()
+
+         self.image_token_index = image_token_index
+         self.start_image_token_index = start_image_token_index
+         self.end_image_token_index = end_image_token_index
+
+         super().__init__(**kwargs)
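As a usage note, the constructor accepts either config objects or plain dicts for the sub-configs and wraps dicts in `LlamaConfig` / `Qwen2VLVisionConfig`; a minimal sketch (assuming this file is importable from a local checkout) that mirrors the values in config.json:

```python
# Minimal sketch: build the composite config directly from nested dicts.
# Assumes configuration_sarashina2_vision.py is importable from a local checkout.
from configuration_sarashina2_vision import Sarashina2VisionConfig

config = Sarashina2VisionConfig(
    text_config={"model_type": "llama", "vocab_size": 102400, "max_position_embeddings": 4096},
    vision_config={"model_type": "qwen2_vl", "in_chans": 3, "spatial_patch_size": 14},
    image_token_index=14,
    start_image_token_index=102397,
    end_image_token_index=102398,
)
print(type(config.text_config).__name__)    # LlamaConfig
print(type(config.vision_config).__name__)  # Qwen2VLVisionConfig
```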
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "transformers_version": "4.47.0"
+ }
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f848457d50b90f6e8d036d72892be4ff384af97d966dfe853ea7fefbc81d9b61
+ size 9986932616
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c5dc1e07621a0f4d8bcc96ec76492aa0ccc517f7b1a0cacdf6ad6c052bf6ea56
+ size 6000187040
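The two safetensors entries above are Git LFS pointers rather than the tensors themselves; a minimal sketch (assuming `huggingface_hub` is installed and the Hub is reachable) of fetching one shard and checking it against the pointer's size field:

```python
# Minimal sketch: download one shard and compare its on-disk size with the LFS pointer.
# Assumes huggingface_hub is installed and network access to the Hub.
import os

from huggingface_hub import hf_hub_download

path = hf_hub_download("sbintuitions/sarashina2-vision-8b", "model-00001-of-00002.safetensors")
print(os.path.getsize(path))  # expected: 9986932616, the "size" recorded in the pointer above
```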
model.safetensors.index.json ADDED
@@ -0,0 +1,691 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 15987043328
4
+ },
5
+ "weight_map": {
6
+ "llm.lm_head.weight": "model-00002-of-00002.safetensors",
7
+ "llm.model.embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "llm.model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "llm.model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "llm.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "llm.model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "llm.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "llm.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
14
+ "llm.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
15
+ "llm.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
16
+ "llm.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
17
+ "llm.model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
18
+ "llm.model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
19
+ "llm.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
20
+ "llm.model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
21
+ "llm.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
22
+ "llm.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
23
+ "llm.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
24
+ "llm.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
25
+ "llm.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
26
+ "llm.model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
27
+ "llm.model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
28
+ "llm.model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
29
+ "llm.model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
30
+ "llm.model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
31
+ "llm.model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
32
+ "llm.model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
33
+ "llm.model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
34
+ "llm.model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
35
+ "llm.model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
36
+ "llm.model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
37
+ "llm.model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
38
+ "llm.model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
39
+ "llm.model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
40
+ "llm.model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
41
+ "llm.model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
42
+ "llm.model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
43
+ "llm.model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
44
+ "llm.model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
45
+ "llm.model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
46
+ "llm.model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
47
+ "llm.model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
48
+ "llm.model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
49
+ "llm.model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
50
+ "llm.model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
51
+ "llm.model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
52
+ "llm.model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
53
+ "llm.model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
54
+ "llm.model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
55
+ "llm.model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
56
+ "llm.model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
57
+ "llm.model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
58
+ "llm.model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
59
+ "llm.model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
60
+ "llm.model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
61
+ "llm.model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
62
+ "llm.model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
63
+ "llm.model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
64
+ "llm.model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
65
+ "llm.model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
66
+ "llm.model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
67
+ "llm.model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
68
+ "llm.model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
69
+ "llm.model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
70
+ "llm.model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
71
+ "llm.model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
72
+ "llm.model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
73
+ "llm.model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
74
+ "llm.model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
75
+ "llm.model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
76
+ "llm.model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
77
+ "llm.model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
78
+ "llm.model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
79
+ "llm.model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
80
+ "llm.model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
81
+ "llm.model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
82
+ "llm.model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
83
+ "llm.model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
84
+ "llm.model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
85
+ "llm.model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
86
+ "llm.model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
87
+ "llm.model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
88
+ "llm.model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
89
+ "llm.model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
90
+ "llm.model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
91
+ "llm.model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
92
+ "llm.model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
93
+ "llm.model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
94
+ "llm.model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
95
+ "llm.model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
96
+ "llm.model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
97
+ "llm.model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
98
+ "llm.model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
99
+ "llm.model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
100
+ "llm.model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
101
+ "llm.model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
102
+ "llm.model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
103
+ "llm.model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
104
+ "llm.model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
105
+ "llm.model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
106
+ "llm.model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
107
+ "llm.model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
108
+ "llm.model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
109
+ "llm.model.layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
110
+ "llm.model.layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
111
+ "llm.model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
112
+ "llm.model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
113
+ "llm.model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
114
+ "llm.model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
115
+ "llm.model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
116
+ "llm.model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
117
+ "llm.model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
118
+ "llm.model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
119
+ "llm.model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
120
+ "llm.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
121
+ "llm.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
122
+ "llm.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
123
+ "llm.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
124
+ "llm.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
125
+ "llm.model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
126
+ "llm.model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
127
+ "llm.model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
128
+ "llm.model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
129
+ "llm.model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
130
+ "llm.model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
131
+ "llm.model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
132
+ "llm.model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
133
+ "llm.model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
134
+ "llm.model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
135
+ "llm.model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
136
+ "llm.model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
137
+ "llm.model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
138
+ "llm.model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
139
+ "llm.model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
140
+ "llm.model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
141
+ "llm.model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
142
+ "llm.model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
143
+ "llm.model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
144
+ "llm.model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
145
+ "llm.model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
146
+ "llm.model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
147
+ "llm.model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
148
+ "llm.model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
149
+ "llm.model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
150
+ "llm.model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
151
+ "llm.model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
152
+ "llm.model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
153
+ "llm.model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
154
+ "llm.model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
155
+ "llm.model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
156
+ "llm.model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
157
+ "llm.model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
158
+ "llm.model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
159
+ "llm.model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
160
+ "llm.model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
161
+ "llm.model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
162
+ "llm.model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
163
+ "llm.model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
164
+ "llm.model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
165
+ "llm.model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
166
+ "llm.model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
167
+ "llm.model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
168
+ "llm.model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
169
+ "llm.model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
170
+ "llm.model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
171
+ "llm.model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
172
+ "llm.model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
173
+ "llm.model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
174
+ "llm.model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
175
+ "llm.model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
176
+ "llm.model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
177
+ "llm.model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
178
+ "llm.model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
179
+ "llm.model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
180
+ "llm.model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
181
+ "llm.model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
182
+ "llm.model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
183
+ "llm.model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
184
+ "llm.model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
185
+ "llm.model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
186
+ "llm.model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
187
+ "llm.model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
188
+ "llm.model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
189
+ "llm.model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
190
+ "llm.model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
191
+ "llm.model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
192
+ "llm.model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
193
+ "llm.model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
194
+ "llm.model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
195
+ "llm.model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
196
+ "llm.model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
197
+ "llm.model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
198
+ "llm.model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
199
+ "llm.model.layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
200
+ "llm.model.layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
201
+ "llm.model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
202
+ "llm.model.layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
203
+ "llm.model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
204
+ "llm.model.layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
205
+ "llm.model.layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
206
+ "llm.model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
207
+ "llm.model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
208
+ "llm.model.layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
209
+ "llm.model.layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
210
+ "llm.model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
211
+ "llm.model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
212
+ "llm.model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
213
+ "llm.model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
214
+ "llm.model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
215
+ "llm.model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
216
+ "llm.model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
217
+ "llm.model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
218
+ "llm.model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
219
+ "llm.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
220
+ "llm.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
221
+ "llm.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
222
+ "llm.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
223
+ "llm.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
224
+ "llm.model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
225
+ "llm.model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
226
+ "llm.model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
227
+ "llm.model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
228
+ "llm.model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
229
+ "llm.model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
230
+ "llm.model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
231
+ "llm.model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
232
+ "llm.model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
233
+ "llm.model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
234
+ "llm.model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
235
+ "llm.model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
236
+ "llm.model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
237
+ "llm.model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
238
+ "llm.model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
239
+ "llm.model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
240
+ "llm.model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
241
+ "llm.model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
242
+ "llm.model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
243
+ "llm.model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
244
+ "llm.model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
245
+ "llm.model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
246
+ "llm.model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
247
+ "llm.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
248
+ "llm.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
249
+ "llm.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
250
+ "llm.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
251
+ "llm.model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
252
+ "llm.model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
253
+ "llm.model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
254
+ "llm.model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
255
+ "llm.model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
256
+ "llm.model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
257
+ "llm.model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
258
+ "llm.model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
259
+ "llm.model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
260
+ "llm.model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
261
+ "llm.model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
262
+ "llm.model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
263
+ "llm.model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
264
+ "llm.model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
265
+ "llm.model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
266
+ "llm.model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
267
+ "llm.model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
268
+ "llm.model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
269
+ "llm.model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
270
+ "llm.model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
271
+ "llm.model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
272
+ "llm.model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
273
+ "llm.model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
274
+ "llm.model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
275
+ "llm.model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
276
+ "llm.model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
277
+ "llm.model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
278
+ "llm.model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
279
+ "llm.model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
280
+ "llm.model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
281
+ "llm.model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
282
+ "llm.model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
283
+ "llm.model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
284
+ "llm.model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
285
+ "llm.model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
286
+ "llm.model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
287
+ "llm.model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
288
+ "llm.model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
289
+ "llm.model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
290
+ "llm.model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
291
+ "llm.model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
292
+ "llm.model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
293
+ "llm.model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
294
+ "llm.model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
295
+ "llm.model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
296
+ "llm.model.norm.weight": "model-00002-of-00002.safetensors",
297
+ "norm.bias": "model-00001-of-00002.safetensors",
298
+ "norm.weight": "model-00001-of-00002.safetensors",
299
+ "visual.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
300
+ "visual.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
301
+ "visual.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
302
+ "visual.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
303
+ "visual.blocks.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
304
+ "visual.blocks.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
305
+ "visual.blocks.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
306
+ "visual.blocks.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
307
+ "visual.blocks.0.norm1.bias": "model-00001-of-00002.safetensors",
308
+ "visual.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
309
+ "visual.blocks.0.norm2.bias": "model-00001-of-00002.safetensors",
310
+ "visual.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
311
+ "visual.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
312
+ "visual.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
313
+ "visual.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
314
+ "visual.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
315
+ "visual.blocks.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
316
+ "visual.blocks.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
317
+ "visual.blocks.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
318
+ "visual.blocks.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
319
+ "visual.blocks.1.norm1.bias": "model-00001-of-00002.safetensors",
320
+ "visual.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
321
+ "visual.blocks.1.norm2.bias": "model-00001-of-00002.safetensors",
322
+ "visual.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
323
+ "visual.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
324
+ "visual.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
325
+ "visual.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
326
+ "visual.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
327
+ "visual.blocks.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
328
+ "visual.blocks.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
329
+ "visual.blocks.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
330
+ "visual.blocks.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
331
+ "visual.blocks.10.norm1.bias": "model-00001-of-00002.safetensors",
332
+ "visual.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
333
+ "visual.blocks.10.norm2.bias": "model-00001-of-00002.safetensors",
334
+ "visual.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
335
+ "visual.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
336
+ "visual.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
337
+ "visual.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
338
+ "visual.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
339
+ "visual.blocks.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
340
+ "visual.blocks.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
341
+ "visual.blocks.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
342
+ "visual.blocks.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
343
+ "visual.blocks.11.norm1.bias": "model-00001-of-00002.safetensors",
344
+ "visual.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
345
+ "visual.blocks.11.norm2.bias": "model-00001-of-00002.safetensors",
346
+ "visual.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
347
+ "visual.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
348
+ "visual.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
349
+ "visual.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
350
+ "visual.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
351
+ "visual.blocks.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
352
+ "visual.blocks.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
353
+ "visual.blocks.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
354
+ "visual.blocks.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
355
+ "visual.blocks.12.norm1.bias": "model-00001-of-00002.safetensors",
356
+ "visual.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
357
+ "visual.blocks.12.norm2.bias": "model-00001-of-00002.safetensors",
358
+ "visual.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
359
+ "visual.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
360
+ "visual.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
361
+ "visual.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
362
+ "visual.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
363
+ "visual.blocks.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
364
+ "visual.blocks.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
365
+ "visual.blocks.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
366
+ "visual.blocks.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
367
+ "visual.blocks.13.norm1.bias": "model-00001-of-00002.safetensors",
368
+ "visual.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
369
+ "visual.blocks.13.norm2.bias": "model-00001-of-00002.safetensors",
370
+ "visual.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
371
+ "visual.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
372
+ "visual.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
373
+ "visual.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
374
+ "visual.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
375
+ "visual.blocks.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
376
+ "visual.blocks.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
377
+ "visual.blocks.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
378
+ "visual.blocks.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
379
+ "visual.blocks.14.norm1.bias": "model-00001-of-00002.safetensors",
380
+ "visual.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
381
+ "visual.blocks.14.norm2.bias": "model-00001-of-00002.safetensors",
382
+ "visual.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
383
+ "visual.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
384
+ "visual.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
385
+ "visual.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
386
+ "visual.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
387
+ "visual.blocks.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
388
+ "visual.blocks.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
389
+ "visual.blocks.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
390
+ "visual.blocks.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
391
+ "visual.blocks.15.norm1.bias": "model-00001-of-00002.safetensors",
392
+ "visual.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
393
+ "visual.blocks.15.norm2.bias": "model-00001-of-00002.safetensors",
394
+ "visual.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
395
+ "visual.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
396
+ "visual.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
397
+ "visual.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
398
+ "visual.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
399
+ "visual.blocks.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
400
+ "visual.blocks.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
401
+ "visual.blocks.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
402
+ "visual.blocks.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
403
+ "visual.blocks.16.norm1.bias": "model-00001-of-00002.safetensors",
404
+ "visual.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
405
+ "visual.blocks.16.norm2.bias": "model-00001-of-00002.safetensors",
406
+ "visual.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
407
+ "visual.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
408
+ "visual.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
409
+ "visual.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
410
+ "visual.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
411
+ "visual.blocks.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
412
+ "visual.blocks.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
413
+ "visual.blocks.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
414
+ "visual.blocks.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
415
+ "visual.blocks.17.norm1.bias": "model-00001-of-00002.safetensors",
416
+ "visual.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
417
+ "visual.blocks.17.norm2.bias": "model-00001-of-00002.safetensors",
418
+ "visual.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
419
+ "visual.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
420
+ "visual.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
421
+ "visual.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
422
+ "visual.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
423
+ "visual.blocks.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
424
+ "visual.blocks.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
425
+ "visual.blocks.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
426
+ "visual.blocks.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
427
+ "visual.blocks.18.norm1.bias": "model-00001-of-00002.safetensors",
428
+ "visual.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
429
+ "visual.blocks.18.norm2.bias": "model-00001-of-00002.safetensors",
430
+ "visual.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
431
+ "visual.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
432
+ "visual.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
433
+ "visual.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
434
+ "visual.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
435
+ "visual.blocks.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
436
+ "visual.blocks.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
437
+ "visual.blocks.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
438
+ "visual.blocks.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
439
+ "visual.blocks.19.norm1.bias": "model-00001-of-00002.safetensors",
440
+ "visual.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
441
+ "visual.blocks.19.norm2.bias": "model-00001-of-00002.safetensors",
442
+ "visual.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
443
+ "visual.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
444
+ "visual.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
445
+ "visual.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
446
+ "visual.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
447
+ "visual.blocks.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
448
+ "visual.blocks.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
449
+ "visual.blocks.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
450
+ "visual.blocks.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
451
+ "visual.blocks.2.norm1.bias": "model-00001-of-00002.safetensors",
452
+ "visual.blocks.2.norm1.weight": "model-00001-of-00002.safetensors",
453
+ "visual.blocks.2.norm2.bias": "model-00001-of-00002.safetensors",
454
+ "visual.blocks.2.norm2.weight": "model-00001-of-00002.safetensors",
455
+ "visual.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
456
+ "visual.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
457
+ "visual.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
458
+ "visual.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
459
+ "visual.blocks.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
460
+ "visual.blocks.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
461
+ "visual.blocks.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
462
+ "visual.blocks.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
463
+ "visual.blocks.20.norm1.bias": "model-00001-of-00002.safetensors",
464
+ "visual.blocks.20.norm1.weight": "model-00001-of-00002.safetensors",
465
+ "visual.blocks.20.norm2.bias": "model-00001-of-00002.safetensors",
466
+ "visual.blocks.20.norm2.weight": "model-00001-of-00002.safetensors",
467
+ "visual.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
468
+ "visual.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
469
+ "visual.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
470
+ "visual.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
471
+ "visual.blocks.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
472
+ "visual.blocks.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
473
+ "visual.blocks.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
474
+ "visual.blocks.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
475
+ "visual.blocks.21.norm1.bias": "model-00001-of-00002.safetensors",
476
+ "visual.blocks.21.norm1.weight": "model-00001-of-00002.safetensors",
477
+ "visual.blocks.21.norm2.bias": "model-00001-of-00002.safetensors",
478
+ "visual.blocks.21.norm2.weight": "model-00001-of-00002.safetensors",
479
+ "visual.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
480
+ "visual.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
481
+ "visual.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
482
+ "visual.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
483
+ "visual.blocks.22.mlp.fc1.bias": "model-00001-of-00002.safetensors",
484
+ "visual.blocks.22.mlp.fc1.weight": "model-00001-of-00002.safetensors",
485
+ "visual.blocks.22.mlp.fc2.bias": "model-00001-of-00002.safetensors",
486
+ "visual.blocks.22.mlp.fc2.weight": "model-00001-of-00002.safetensors",
487
+ "visual.blocks.22.norm1.bias": "model-00001-of-00002.safetensors",
488
+ "visual.blocks.22.norm1.weight": "model-00001-of-00002.safetensors",
489
+ "visual.blocks.22.norm2.bias": "model-00001-of-00002.safetensors",
490
+ "visual.blocks.22.norm2.weight": "model-00001-of-00002.safetensors",
491
+ "visual.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
492
+ "visual.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
493
+ "visual.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
494
+ "visual.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
495
+ "visual.blocks.23.mlp.fc1.bias": "model-00001-of-00002.safetensors",
496
+ "visual.blocks.23.mlp.fc1.weight": "model-00001-of-00002.safetensors",
497
+ "visual.blocks.23.mlp.fc2.bias": "model-00001-of-00002.safetensors",
498
+ "visual.blocks.23.mlp.fc2.weight": "model-00001-of-00002.safetensors",
499
+ "visual.blocks.23.norm1.bias": "model-00001-of-00002.safetensors",
500
+ "visual.blocks.23.norm1.weight": "model-00001-of-00002.safetensors",
501
+ "visual.blocks.23.norm2.bias": "model-00001-of-00002.safetensors",
502
+ "visual.blocks.23.norm2.weight": "model-00001-of-00002.safetensors",
503
+ "visual.blocks.24.attn.proj.bias": "model-00001-of-00002.safetensors",
504
+ "visual.blocks.24.attn.proj.weight": "model-00001-of-00002.safetensors",
505
+ "visual.blocks.24.attn.qkv.bias": "model-00001-of-00002.safetensors",
506
+ "visual.blocks.24.attn.qkv.weight": "model-00001-of-00002.safetensors",
507
+ "visual.blocks.24.mlp.fc1.bias": "model-00001-of-00002.safetensors",
508
+ "visual.blocks.24.mlp.fc1.weight": "model-00001-of-00002.safetensors",
509
+ "visual.blocks.24.mlp.fc2.bias": "model-00001-of-00002.safetensors",
510
+ "visual.blocks.24.mlp.fc2.weight": "model-00001-of-00002.safetensors",
511
+ "visual.blocks.24.norm1.bias": "model-00001-of-00002.safetensors",
512
+ "visual.blocks.24.norm1.weight": "model-00001-of-00002.safetensors",
513
+ "visual.blocks.24.norm2.bias": "model-00001-of-00002.safetensors",
514
+ "visual.blocks.24.norm2.weight": "model-00001-of-00002.safetensors",
515
+ "visual.blocks.25.attn.proj.bias": "model-00001-of-00002.safetensors",
516
+ "visual.blocks.25.attn.proj.weight": "model-00001-of-00002.safetensors",
517
+ "visual.blocks.25.attn.qkv.bias": "model-00001-of-00002.safetensors",
518
+ "visual.blocks.25.attn.qkv.weight": "model-00001-of-00002.safetensors",
519
+ "visual.blocks.25.mlp.fc1.bias": "model-00001-of-00002.safetensors",
520
+ "visual.blocks.25.mlp.fc1.weight": "model-00001-of-00002.safetensors",
521
+ "visual.blocks.25.mlp.fc2.bias": "model-00001-of-00002.safetensors",
522
+ "visual.blocks.25.mlp.fc2.weight": "model-00001-of-00002.safetensors",
523
+ "visual.blocks.25.norm1.bias": "model-00001-of-00002.safetensors",
524
+ "visual.blocks.25.norm1.weight": "model-00001-of-00002.safetensors",
525
+ "visual.blocks.25.norm2.bias": "model-00001-of-00002.safetensors",
526
+ "visual.blocks.25.norm2.weight": "model-00001-of-00002.safetensors",
527
+ "visual.blocks.26.attn.proj.bias": "model-00001-of-00002.safetensors",
528
+ "visual.blocks.26.attn.proj.weight": "model-00001-of-00002.safetensors",
529
+ "visual.blocks.26.attn.qkv.bias": "model-00001-of-00002.safetensors",
530
+ "visual.blocks.26.attn.qkv.weight": "model-00001-of-00002.safetensors",
531
+ "visual.blocks.26.mlp.fc1.bias": "model-00001-of-00002.safetensors",
532
+ "visual.blocks.26.mlp.fc1.weight": "model-00001-of-00002.safetensors",
533
+ "visual.blocks.26.mlp.fc2.bias": "model-00001-of-00002.safetensors",
534
+ "visual.blocks.26.mlp.fc2.weight": "model-00001-of-00002.safetensors",
535
+ "visual.blocks.26.norm1.bias": "model-00001-of-00002.safetensors",
536
+ "visual.blocks.26.norm1.weight": "model-00001-of-00002.safetensors",
537
+ "visual.blocks.26.norm2.bias": "model-00001-of-00002.safetensors",
538
+ "visual.blocks.26.norm2.weight": "model-00001-of-00002.safetensors",
539
+ "visual.blocks.27.attn.proj.bias": "model-00001-of-00002.safetensors",
540
+ "visual.blocks.27.attn.proj.weight": "model-00001-of-00002.safetensors",
541
+ "visual.blocks.27.attn.qkv.bias": "model-00001-of-00002.safetensors",
542
+ "visual.blocks.27.attn.qkv.weight": "model-00001-of-00002.safetensors",
543
+ "visual.blocks.27.mlp.fc1.bias": "model-00001-of-00002.safetensors",
544
+ "visual.blocks.27.mlp.fc1.weight": "model-00001-of-00002.safetensors",
545
+ "visual.blocks.27.mlp.fc2.bias": "model-00001-of-00002.safetensors",
546
+ "visual.blocks.27.mlp.fc2.weight": "model-00001-of-00002.safetensors",
547
+ "visual.blocks.27.norm1.bias": "model-00001-of-00002.safetensors",
548
+ "visual.blocks.27.norm1.weight": "model-00001-of-00002.safetensors",
549
+ "visual.blocks.27.norm2.bias": "model-00001-of-00002.safetensors",
550
+ "visual.blocks.27.norm2.weight": "model-00001-of-00002.safetensors",
551
+ "visual.blocks.28.attn.proj.bias": "model-00001-of-00002.safetensors",
552
+ "visual.blocks.28.attn.proj.weight": "model-00001-of-00002.safetensors",
553
+ "visual.blocks.28.attn.qkv.bias": "model-00001-of-00002.safetensors",
554
+ "visual.blocks.28.attn.qkv.weight": "model-00001-of-00002.safetensors",
555
+ "visual.blocks.28.mlp.fc1.bias": "model-00001-of-00002.safetensors",
556
+ "visual.blocks.28.mlp.fc1.weight": "model-00001-of-00002.safetensors",
557
+ "visual.blocks.28.mlp.fc2.bias": "model-00001-of-00002.safetensors",
558
+ "visual.blocks.28.mlp.fc2.weight": "model-00001-of-00002.safetensors",
559
+ "visual.blocks.28.norm1.bias": "model-00001-of-00002.safetensors",
560
+ "visual.blocks.28.norm1.weight": "model-00001-of-00002.safetensors",
561
+ "visual.blocks.28.norm2.bias": "model-00001-of-00002.safetensors",
562
+ "visual.blocks.28.norm2.weight": "model-00001-of-00002.safetensors",
563
+ "visual.blocks.29.attn.proj.bias": "model-00001-of-00002.safetensors",
564
+ "visual.blocks.29.attn.proj.weight": "model-00001-of-00002.safetensors",
565
+ "visual.blocks.29.attn.qkv.bias": "model-00001-of-00002.safetensors",
566
+ "visual.blocks.29.attn.qkv.weight": "model-00001-of-00002.safetensors",
567
+ "visual.blocks.29.mlp.fc1.bias": "model-00001-of-00002.safetensors",
568
+ "visual.blocks.29.mlp.fc1.weight": "model-00001-of-00002.safetensors",
569
+ "visual.blocks.29.mlp.fc2.bias": "model-00001-of-00002.safetensors",
570
+ "visual.blocks.29.mlp.fc2.weight": "model-00001-of-00002.safetensors",
571
+ "visual.blocks.29.norm1.bias": "model-00001-of-00002.safetensors",
572
+ "visual.blocks.29.norm1.weight": "model-00001-of-00002.safetensors",
573
+ "visual.blocks.29.norm2.bias": "model-00001-of-00002.safetensors",
574
+ "visual.blocks.29.norm2.weight": "model-00001-of-00002.safetensors",
575
+ "visual.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
576
+ "visual.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
577
+ "visual.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
578
+ "visual.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
579
+ "visual.blocks.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
580
+ "visual.blocks.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
581
+ "visual.blocks.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
582
+ "visual.blocks.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
583
+ "visual.blocks.3.norm1.bias": "model-00001-of-00002.safetensors",
584
+ "visual.blocks.3.norm1.weight": "model-00001-of-00002.safetensors",
585
+ "visual.blocks.3.norm2.bias": "model-00001-of-00002.safetensors",
586
+ "visual.blocks.3.norm2.weight": "model-00001-of-00002.safetensors",
587
+ "visual.blocks.30.attn.proj.bias": "model-00001-of-00002.safetensors",
588
+ "visual.blocks.30.attn.proj.weight": "model-00001-of-00002.safetensors",
589
+ "visual.blocks.30.attn.qkv.bias": "model-00001-of-00002.safetensors",
590
+ "visual.blocks.30.attn.qkv.weight": "model-00001-of-00002.safetensors",
591
+ "visual.blocks.30.mlp.fc1.bias": "model-00001-of-00002.safetensors",
592
+ "visual.blocks.30.mlp.fc1.weight": "model-00001-of-00002.safetensors",
593
+ "visual.blocks.30.mlp.fc2.bias": "model-00001-of-00002.safetensors",
594
+ "visual.blocks.30.mlp.fc2.weight": "model-00001-of-00002.safetensors",
595
+ "visual.blocks.30.norm1.bias": "model-00001-of-00002.safetensors",
596
+ "visual.blocks.30.norm1.weight": "model-00001-of-00002.safetensors",
597
+ "visual.blocks.30.norm2.bias": "model-00001-of-00002.safetensors",
598
+ "visual.blocks.30.norm2.weight": "model-00001-of-00002.safetensors",
599
+ "visual.blocks.31.attn.proj.bias": "model-00001-of-00002.safetensors",
600
+ "visual.blocks.31.attn.proj.weight": "model-00001-of-00002.safetensors",
601
+ "visual.blocks.31.attn.qkv.bias": "model-00001-of-00002.safetensors",
602
+ "visual.blocks.31.attn.qkv.weight": "model-00001-of-00002.safetensors",
603
+ "visual.blocks.31.mlp.fc1.bias": "model-00001-of-00002.safetensors",
604
+ "visual.blocks.31.mlp.fc1.weight": "model-00001-of-00002.safetensors",
605
+ "visual.blocks.31.mlp.fc2.bias": "model-00001-of-00002.safetensors",
606
+ "visual.blocks.31.mlp.fc2.weight": "model-00001-of-00002.safetensors",
607
+ "visual.blocks.31.norm1.bias": "model-00001-of-00002.safetensors",
608
+ "visual.blocks.31.norm1.weight": "model-00001-of-00002.safetensors",
609
+ "visual.blocks.31.norm2.bias": "model-00001-of-00002.safetensors",
610
+ "visual.blocks.31.norm2.weight": "model-00001-of-00002.safetensors",
611
+ "visual.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
612
+ "visual.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
613
+ "visual.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
614
+ "visual.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
615
+ "visual.blocks.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
616
+ "visual.blocks.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
617
+ "visual.blocks.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
618
+ "visual.blocks.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
619
+ "visual.blocks.4.norm1.bias": "model-00001-of-00002.safetensors",
620
+ "visual.blocks.4.norm1.weight": "model-00001-of-00002.safetensors",
621
+ "visual.blocks.4.norm2.bias": "model-00001-of-00002.safetensors",
622
+ "visual.blocks.4.norm2.weight": "model-00001-of-00002.safetensors",
623
+ "visual.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
624
+ "visual.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
625
+ "visual.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
626
+ "visual.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
627
+ "visual.blocks.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
628
+ "visual.blocks.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
629
+ "visual.blocks.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
630
+ "visual.blocks.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
631
+ "visual.blocks.5.norm1.bias": "model-00001-of-00002.safetensors",
632
+ "visual.blocks.5.norm1.weight": "model-00001-of-00002.safetensors",
633
+ "visual.blocks.5.norm2.bias": "model-00001-of-00002.safetensors",
634
+ "visual.blocks.5.norm2.weight": "model-00001-of-00002.safetensors",
635
+ "visual.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
636
+ "visual.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
637
+ "visual.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
638
+ "visual.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
639
+ "visual.blocks.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
640
+ "visual.blocks.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
641
+ "visual.blocks.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
642
+ "visual.blocks.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
643
+ "visual.blocks.6.norm1.bias": "model-00001-of-00002.safetensors",
644
+ "visual.blocks.6.norm1.weight": "model-00001-of-00002.safetensors",
645
+ "visual.blocks.6.norm2.bias": "model-00001-of-00002.safetensors",
646
+ "visual.blocks.6.norm2.weight": "model-00001-of-00002.safetensors",
647
+ "visual.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
648
+ "visual.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
649
+ "visual.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
650
+ "visual.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
651
+ "visual.blocks.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
652
+ "visual.blocks.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
653
+ "visual.blocks.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
654
+ "visual.blocks.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
655
+ "visual.blocks.7.norm1.bias": "model-00001-of-00002.safetensors",
656
+ "visual.blocks.7.norm1.weight": "model-00001-of-00002.safetensors",
657
+ "visual.blocks.7.norm2.bias": "model-00001-of-00002.safetensors",
658
+ "visual.blocks.7.norm2.weight": "model-00001-of-00002.safetensors",
659
+ "visual.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
660
+ "visual.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
661
+ "visual.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
662
+ "visual.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
663
+ "visual.blocks.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
664
+ "visual.blocks.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
665
+ "visual.blocks.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
666
+ "visual.blocks.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
667
+ "visual.blocks.8.norm1.bias": "model-00001-of-00002.safetensors",
668
+ "visual.blocks.8.norm1.weight": "model-00001-of-00002.safetensors",
669
+ "visual.blocks.8.norm2.bias": "model-00001-of-00002.safetensors",
670
+ "visual.blocks.8.norm2.weight": "model-00001-of-00002.safetensors",
671
+ "visual.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
672
+ "visual.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
673
+ "visual.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
674
+ "visual.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
675
+ "visual.blocks.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
676
+ "visual.blocks.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
677
+ "visual.blocks.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
678
+ "visual.blocks.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
679
+ "visual.blocks.9.norm1.bias": "model-00001-of-00002.safetensors",
680
+ "visual.blocks.9.norm1.weight": "model-00001-of-00002.safetensors",
681
+ "visual.blocks.9.norm2.bias": "model-00001-of-00002.safetensors",
682
+ "visual.blocks.9.norm2.weight": "model-00001-of-00002.safetensors",
683
+ "visual.merger.ln_q.bias": "model-00001-of-00002.safetensors",
684
+ "visual.merger.ln_q.weight": "model-00001-of-00002.safetensors",
685
+ "visual.merger.mlp.0.bias": "model-00001-of-00002.safetensors",
686
+ "visual.merger.mlp.0.weight": "model-00001-of-00002.safetensors",
687
+ "visual.merger.mlp.2.bias": "model-00001-of-00002.safetensors",
688
+ "visual.merger.mlp.2.weight": "model-00001-of-00002.safetensors",
689
+ "visual.patch_embed.proj.weight": "model-00001-of-00002.safetensors"
690
+ }
691
+ }
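For orientation, the entries above are the tail of the checkpoint's `weight_map`, which assigns every vision-tower tensor to one of the two shards. A minimal sketch for checking that split locally (assuming the index file has been downloaded as `model.safetensors.index.json`, the standard safetensors index layout):

```python
import json
from collections import Counter

# Count how many tensors each shard holds, using the index file's "weight_map".
with open("model.safetensors.index.json") as f:
    index = json.load(f)

shard_counts = Counter(index["weight_map"].values())
for shard, n_tensors in shard_counts.most_common():
    print(shard, n_tensors)
```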
modeling_sarashina2_vision.py ADDED
@@ -0,0 +1,242 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 the SB Intuitions.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ from typing import List, Optional, Tuple, Union
16
+
17
+ import torch
18
+ import torch.nn as nn
19
+ import torch.nn.functional as F
20
+ from torch.nn import CrossEntropyLoss
21
+ from transformers import (
22
+ AutoConfig,
23
+ AutoModelForCausalLM,
24
+ GenerationMixin,
25
+ LlamaForCausalLM,
26
+ PreTrainedModel,
27
+ )
28
+ from transformers.modeling_outputs import CausalLMOutputWithPast
29
+ from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VisionTransformerPretrainedModel
30
+ from transformers.utils import logging, replace_return_docstrings
31
+
32
+ from .configuration_sarashina2_vision import Sarashina2VisionConfig
33
+
34
+ logger = logging.get_logger(__name__)
35
+
36
+ _CONFIG_FOR_DOC = "Sarashina2VisionConfig"
37
+
38
+
39
+ class Sarashina2VisionPreTrainedModel(PreTrainedModel):
40
+ config_class = Sarashina2VisionConfig
41
+ base_model_prefix = "model"
42
+ _supports_flash_attn_2 = True
43
+ _supports_sdpa = True
44
+ _supports_cache_class = True
45
+ _supports_static_cache = True
46
+
47
+ def _init_weights(self, module):
48
+ std = (
49
+ self.config.initializer_range
50
+ if hasattr(self.config, "initializer_range")
51
+ else self.config.text_config.initializer_range
52
+ )
53
+
54
+ if hasattr(module, "class_embedding"):
55
+ module.class_embedding.data.normal_(mean=0.0, std=std)
56
+
57
+ if isinstance(module, (nn.Linear, nn.Conv3d)):
58
+ module.weight.data.normal_(mean=0.0, std=std)
59
+ if module.bias is not None:
60
+ module.bias.data.zero_()
61
+ elif isinstance(module, nn.Embedding):
62
+ module.weight.data.normal_(mean=0.0, std=std)
63
+ if module.padding_idx is not None:
64
+ module.weight.data[module.padding_idx].zero_()
65
+
66
+
67
+ class Sarashina2VisionForCausalLM(Sarashina2VisionPreTrainedModel, GenerationMixin):
68
+ def __init__(self, config: Sarashina2VisionConfig):
69
+ super().__init__(config)
70
+ self.visual = Qwen2VisionTransformerPretrainedModel._from_config(config.vision_config)
71
+ self.norm = nn.LayerNorm(config.text_config.hidden_size)
72
+ self.llm = LlamaForCausalLM._from_config(config.text_config)
73
+ self._attn_implementation = config._attn_implementation
74
+
75
+ # Initialize weights and apply final processing
76
+ self.post_init()
77
+
78
+ def get_input_embeddings(self):
79
+ return self.llm.get_input_embeddings()
80
+
81
+ def get_image_embeds(
82
+ self,
83
+ hidden_states: torch.Tensor,
84
+ grid_thw: torch.Tensor,
85
+ ) -> torch.Tensor:
86
+ rotary_pos_emb = self.visual.rot_pos_emb(grid_thw)
87
+ hidden_states = self.visual.patch_embed(hidden_states)
88
+
89
+ cu_seqlens = torch.repeat_interleave(
90
+ grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]
91
+ ).cumsum(dim=0, dtype=torch.int32)
92
+ cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
93
+
94
+ for blk in self.visual.blocks:
95
+ hidden_states = blk(
96
+ hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb
97
+ )
98
+ return self.norm(self.visual.merger(hidden_states))
99
+
100
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
101
+ def forward(
102
+ self,
103
+ input_ids: torch.LongTensor = None,
104
+ attention_mask: Optional[torch.Tensor] = None,
105
+ position_ids: Optional[torch.LongTensor] = None,
106
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
107
+ inputs_embeds: Optional[torch.FloatTensor] = None,
108
+ labels: Optional[torch.LongTensor] = None,
109
+ use_cache: Optional[bool] = None,
110
+ output_attentions: Optional[bool] = None,
111
+ output_hidden_states: Optional[bool] = None,
112
+ return_dict: Optional[bool] = None,
113
+ pixel_values: torch.FloatTensor = None,
114
+ image_grid_thw: Optional[torch.LongTensor] = None,
115
+ cache_position: Optional[torch.LongTensor] = None,
116
+ **lm_kwargs,
117
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
118
+ """
119
+ Args:
120
+ input_ids (torch.LongTensor, optional): Indices of input sequence tokens in the vocabulary. Defaults to None.
121
+ attention_mask (Optional[torch.Tensor], optional): Mask to avoid performing attention on padding token indices. Defaults to None.
122
+ position_ids (Optional[torch.LongTensor], optional): Indices of positions of each input sequence tokens in the position embeddings. Defaults to None.
123
+ past_key_values (Optional[List[torch.FloatTensor]], optional): Pre-computed key and value hidden states that can be reused to speed up sequential decoding. Defaults to None.
124
+ inputs_embeds (Optional[torch.FloatTensor], optional): Instead of passing `input_ids` you can choose to directly pass an embedded representation. Defaults to None.
125
+ labels (Optional[torch.LongTensor], optional): Labels for computing the masked language modeling loss. Defaults to None.
126
+ use_cache (Optional[bool], optional): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding. Defaults to None.
127
+ output_attentions (Optional[bool], optional): Whether or not to return the attentions tensors of all attention layers. Defaults to None.
128
+ output_hidden_states (Optional[bool], optional): Whether or not to return the hidden states of all layers. Defaults to None.
129
+ return_dict (Optional[bool], optional): Whether or not to return a `CausalLMOutputWithPast` instead of a plain tuple. Defaults to None.
130
+ pixel_values (torch.FloatTensor, optional): The tensors corresponding to the input images. Defaults to None.
131
+ image_grid_thw (Optional[torch.LongTensor], optional): The temporal, height and width of feature shape of each image in LLM. Defaults to None.
132
+ cache_position (Optional[torch.LongTensor], optional): Indices depicting the position of the input sequence tokens in the sequence. Defaults to None.
133
+ Returns:
134
+ CausalLMOutputWithPast: The output of the model.
135
+ """
136
+ output_attentions = (
137
+ output_attentions if output_attentions is not None else self.config.output_attentions
138
+ )
139
+ output_hidden_states = (
140
+ output_hidden_states
141
+ if output_hidden_states is not None
142
+ else self.config.output_hidden_states
143
+ )
144
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
145
+
146
+ if inputs_embeds is None:
147
+ inputs_embeds = self.get_input_embeddings()(input_ids)
148
+ if pixel_values is not None:
149
+ pixel_values = pixel_values.type(self.visual.get_dtype())
150
+ image_embeds = self.get_image_embeds(pixel_values, image_grid_thw)
151
+ n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
152
+ n_image_features = image_embeds.shape[0]
153
+ if n_image_tokens != n_image_features:
154
+ raise ValueError(
155
+ f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
156
+ )
157
+ image_mask = (
158
+ (input_ids == self.config.image_token_index)
159
+ .unsqueeze(-1)
160
+ .expand_as(inputs_embeds)
161
+ .to(inputs_embeds.device)
162
+ )
163
+ image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
164
+ inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
165
+
166
+ outputs = self.llm(
167
+ attention_mask=attention_mask,
168
+ position_ids=position_ids,
169
+ past_key_values=past_key_values,
170
+ inputs_embeds=inputs_embeds,
171
+ use_cache=use_cache,
172
+ output_attentions=output_attentions,
173
+ output_hidden_states=output_hidden_states,
174
+ return_dict=return_dict,
175
+ cache_position=cache_position,
176
+ **lm_kwargs,
177
+ )
178
+
179
+ logits = outputs[0]
180
+
181
+ loss = None
182
+ if labels is not None:
183
+ # Upcast to float if we need to compute the loss to avoid potential precision issues
184
+ logits = logits.float()
185
+ # Shift so that tokens < n predict n
186
+ shift_logits = logits[..., :-1, :].contiguous()
187
+ shift_labels = labels[..., 1:].contiguous()
188
+ # Flatten the tokens
189
+ loss_fct = CrossEntropyLoss()
190
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
191
+ shift_labels = shift_labels.view(-1)
192
+ # Enable model parallelism
193
+ shift_labels = shift_labels.to(shift_logits.device)
194
+ loss = loss_fct(shift_logits, shift_labels)
195
+
196
+ if not return_dict:
197
+ output = (logits,) + outputs[1:]
198
+ return (loss,) + output if loss is not None else output
199
+
200
+ return CausalLMOutputWithPast(
201
+ loss=loss,
202
+ logits=logits,
203
+ past_key_values=outputs.past_key_values,
204
+ hidden_states=outputs.hidden_states,
205
+ attentions=outputs.attentions,
206
+ )
207
+
208
+ def prepare_inputs_for_generation(
209
+ self,
210
+ input_ids,
211
+ past_key_values=None,
212
+ inputs_embeds=None,
213
+ pixel_values=None,
214
+ attention_mask=None,
215
+ cache_position=None,
216
+ logits_to_keep=None,
217
+ image_grid_thw=None,
218
+ **kwargs,
219
+ ):
220
+ model_inputs = self.llm.prepare_inputs_for_generation(
221
+ input_ids,
222
+ past_key_values=past_key_values,
223
+ inputs_embeds=inputs_embeds,
224
+ attention_mask=attention_mask,
225
+ cache_position=cache_position,
226
+ logits_to_keep=logits_to_keep,
227
+ **kwargs,
228
+ )
229
+
230
+ if cache_position[0] == 0:
231
+ # Pixel values are only needed on the first (prefill) step; during cached decoding the
232
+ # input ids no longer contain the special image token, so pixel values stay None.
233
+ model_inputs["pixel_values"] = pixel_values
234
+ model_inputs["image_grid_thw"] = image_grid_thw
235
+
236
+ return model_inputs
237
+
238
+
239
+ AutoConfig.register("sarashina2_vision", Sarashina2VisionConfig)
240
+ AutoModelForCausalLM.register(Sarashina2VisionConfig, Sarashina2VisionForCausalLM)
241
+ Sarashina2VisionConfig.register_for_auto_class()
242
+ Sarashina2VisionForCausalLM.register_for_auto_class("AutoModelForCausalLM")
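`get_image_embeds` above feeds the Qwen2 vision blocks cumulative sequence lengths derived from `image_grid_thw`, which mark where each image's patches start and end in the flattened sequence. A standalone sketch of that bookkeeping, with made-up grid sizes:

```python
import torch
import torch.nn.functional as F

# Two hypothetical images with patch grids (t, h, w) = (1, 20, 28) and (1, 16, 16).
grid_thw = torch.tensor([[1, 20, 28], [1, 16, 16]])

# Each temporal slice contributes h * w patch tokens; the cumulative sums give the
# boundaries between images, and a leading zero is padded in front.
cu_seqlens = torch.repeat_interleave(
    grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]
).cumsum(dim=0, dtype=torch.int32)
cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)

print(cu_seqlens)  # tensor([0, 560, 816], dtype=torch.int32)
```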
preprocessor_config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_sarashina2_vision.Srashina2VisionProcessor"
4
+ },
5
+ "do_convert_rgb": true,
6
+ "do_normalize": true,
7
+ "do_rescale": true,
8
+ "do_resize": true,
9
+ "image_mean": [
10
+ 0.48145466,
11
+ 0.4578275,
12
+ 0.40821073
13
+ ],
14
+ "image_processor_type": "Sarashina2VisionImageProcessor",
15
+ "image_std": [
16
+ 0.26862954,
17
+ 0.26130258,
18
+ 0.27577711
19
+ ],
20
+ "max_pixels": 1016064,
21
+ "merge_size": 2,
22
+ "min_pixels": 3136,
23
+ "patch_size": 14,
24
+ "processor_class": "Srashina2VisionProcessor",
25
+ "resample": 3,
26
+ "rescale_factor": 0.00392156862745098,
27
+ "size": {
28
+ "max_pixels": 1016064,
29
+ "min_pixels": 3136
30
+ },
31
+ "temporal_patch_size": 2
32
+ }
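The values above (patch_size 14, merge_size 2, and the min/max pixel budget) determine how many image tokens an input occupies. A rough sketch using the same `smart_resize` helper the processor imports; the 1024x768 input size is only an example:

```python
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

patch_size, merge_size = 14, 2

# Snap a 768x1024 image to a multiple of patch_size * merge_size (28) while keeping
# the total pixel count within [min_pixels, max_pixels].
resized_h, resized_w = smart_resize(
    768, 1024,
    factor=patch_size * merge_size,
    min_pixels=3136,
    max_pixels=1016064,
)
grid_h, grid_w = resized_h // patch_size, resized_w // patch_size
n_image_tokens = (grid_h * grid_w) // (merge_size**2)
print(resized_h, resized_w, n_image_tokens)  # 756 1036 999
```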
processing_sarashina2_vision.py ADDED
@@ -0,0 +1,383 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 the SB Intuitions.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """
16
+ Processor class for Srashina2Vision.
17
+ """
18
+
19
+ from copy import deepcopy
20
+ from typing import List, Optional, Union
21
+
22
+ import numpy as np
23
+ import torch
24
+ import torch.nn.functional as F
25
+ from PIL import Image
26
+ from transformers import (
27
+ AutoImageProcessor,
28
+ PreTrainedTokenizer,
29
+ Qwen2VLImageProcessor,
30
+ StoppingCriteria,
31
+ StoppingCriteriaList,
32
+ )
33
+ from transformers.feature_extraction_utils import BatchFeature
34
+ from transformers.image_transforms import (
35
+ convert_to_rgb,
36
+ to_channel_dimension_format,
37
+ )
38
+ from transformers.image_utils import (
39
+ ChannelDimension,
40
+ ImageInput,
41
+ VideoInput,
42
+ get_image_size,
43
+ infer_channel_dimension_format,
44
+ is_scaled_image,
45
+ make_list_of_images,
46
+ to_numpy_array,
47
+ )
48
+ from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize
49
+ from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack
50
+ from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
51
+ from transformers.utils import logging
52
+
53
+ logger = logging.get_logger(__name__)
54
+
55
+
56
+ class GenerationStopper(StoppingCriteria):
57
+ def __init__(
58
+ self,
59
+ stop_str_list: list[str],
60
+ tokenizer: PreTrainedTokenizer,
61
+ decode_suffix_length: int = 5,
62
+ ):
63
+ self.stop_str_list = stop_str_list
64
+ self.tokenizer = deepcopy(tokenizer)
65
+ self.decode_suffix_length = decode_suffix_length
66
+ self.input_ids_end = None
67
+
68
+ def __repr__(self):
69
+ return f"Stopping words: {self.stop_str_list}"
70
+
71
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
72
+ if self.input_ids_end is None:
73
+ length = input_ids.shape[1]
74
+ self.input_ids_end = length - 1 if (length - 1) > 0 else 0
75
+ decode_ids = input_ids[0][self.input_ids_end :][-self.decode_suffix_length :]
76
+ if len(decode_ids) == 0:
77
+ decoded = ""
78
+ else:
79
+ decoded = self.tokenizer.decode(decode_ids)
80
+
81
+ for stop_str in self.stop_str_list:
82
+ if stop_str in decoded:
83
+ self.input_ids_end = None
84
+ return True
85
+ return False
86
+
87
+ @property
88
+ def criteria(self):
89
+ return StoppingCriteriaList([self])
90
+
91
+ def format(self, sentence: str):
92
+ for w in self.stop_str_list:
93
+ if w in sentence[-len(w) :]:
94
+ sentence = sentence[: -len(w)]
95
+ return sentence
96
+
97
+
98
+ class Sarashina2VisionImageProcessor(Qwen2VLImageProcessor):
99
+ def _preprocess(
100
+ self,
101
+ images: Union[ImageInput, VideoInput],
102
+ do_resize: bool = None,
103
+ resample: Image.Resampling = None,
104
+ do_rescale: bool = None,
105
+ rescale_factor: float = None,
106
+ do_normalize: bool = None,
107
+ image_mean: Optional[Union[float, List[float]]] = None,
108
+ image_std: Optional[Union[float, List[float]]] = None,
109
+ do_convert_rgb: bool = None,
110
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
111
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
112
+ ):
113
+ """
114
+ Preprocess an image or batch of images. Copy of the `preprocess` method from `Qwen2VLImageProcessor`.
115
+
116
+ Args:
117
+ images (`ImageInput`):
118
+ Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
119
+ vision_info (`List[Dict]`, *optional*):
120
+ Optional list of dictionaries containing additional information about vision inputs.
121
+ do_resize (`bool`, *optional*, defaults to `self.do_resize`):
122
+ Whether to resize the image.
123
+ resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
124
+ Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
125
+ do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
126
+ Whether to rescale the image.
127
+ rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
128
+ Scale factor to use if rescaling the image.
129
+ do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
130
+ Whether to normalize the image.
131
+ image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
132
+ Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
133
+ image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
134
+ Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
135
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
136
+ Whether to convert the image to RGB.
137
+ data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
138
+ The channel dimension format for the output image. Can be one of:
139
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
140
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
141
+ - Unset: Use the channel dimension format of the input image.
142
+ input_data_format (`ChannelDimension` or `str`, *optional*):
143
+ The channel dimension format for the input image. Can be one of:
144
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
145
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
146
+ - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
147
+ """
148
+ images = make_list_of_images(images)
149
+
150
+ if do_convert_rgb:
151
+ images = [convert_to_rgb(image) for image in images]
152
+
153
+ # All transformations expect numpy arrays.
154
+ images = [to_numpy_array(image) for image in images]
155
+
156
+ if do_rescale and is_scaled_image(images[0]):
157
+ logger.warning_once(
158
+ "It looks like you are trying to rescale already rescaled images. If the input"
159
+ " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
160
+ )
161
+ if input_data_format is None:
162
+ # We assume that all images have the same channel dimension format.
163
+ input_data_format = infer_channel_dimension_format(images[0])
164
+
165
+ height, width = get_image_size(images[0], channel_dim=input_data_format)
166
+ resized_height, resized_width = height, width
167
+ processed_images = []
168
+ for image in images:
169
+ if do_rescale:
170
+ image = self.rescale(
171
+ image, scale=rescale_factor, input_data_format=input_data_format
172
+ )
173
+
174
+ if do_normalize:
175
+ image = self.normalize(
176
+ image=image,
177
+ mean=image_mean,
178
+ std=image_std,
179
+ input_data_format=input_data_format,
180
+ )
181
+
182
+ image = to_channel_dimension_format(
183
+ image, data_format, input_channel_dim=input_data_format
184
+ )
185
+
186
+ if do_resize:
187
+ resized_height, resized_width = smart_resize(
188
+ height,
189
+ width,
190
+ factor=self.patch_size * self.merge_size,
191
+ min_pixels=self.min_pixels,
192
+ max_pixels=self.max_pixels,
193
+ )
194
+ image = (
195
+ F.interpolate(
196
+ torch.from_numpy(image).unsqueeze(0),
197
+ size=(resized_height, resized_width),
198
+ mode="bicubic",
199
+ )
200
+ .squeeze(0)
201
+ .numpy()
202
+ )
203
+
204
+ processed_images.append(image)
205
+
206
+ patches = np.array(processed_images)
207
+ if data_format == ChannelDimension.LAST:
208
+ patches = patches.transpose(0, 3, 1, 2)
209
+ if patches.shape[0] % self.temporal_patch_size != 0:
210
+ repeats = np.repeat(patches[-1][np.newaxis], self.temporal_patch_size - 1, axis=0)
211
+ patches = np.concatenate([patches, repeats], axis=0)
212
+ channel = patches.shape[1]
213
+ grid_t = patches.shape[0] // self.temporal_patch_size
214
+ grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size
215
+ patches = patches.reshape(
216
+ grid_t,
217
+ self.temporal_patch_size,
218
+ channel,
219
+ grid_h // self.merge_size,
220
+ self.merge_size,
221
+ self.patch_size,
222
+ grid_w // self.merge_size,
223
+ self.merge_size,
224
+ self.patch_size,
225
+ )
226
+ patches = patches.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
227
+ flatten_patches = patches.reshape(
228
+ grid_t * grid_h * grid_w,
229
+ channel * self.temporal_patch_size * self.patch_size * self.patch_size,
230
+ )
231
+
232
+ return flatten_patches, (grid_t, grid_h, grid_w)
233
+
234
+
235
+ class Srashina2VisionProcessorKwargs(ProcessingKwargs, total=False):
236
+ _defaults = {
237
+ "text_kwargs": {
238
+ "padding": False,
239
+ },
240
+ }
241
+
242
+
243
+ class Srashina2VisionProcessor(ProcessorMixin):
244
+ r"""
245
+ Constructs a Srashina2Vision processor which wraps a Sarashina2Vision image processor and a Llama tokenizer into a single processor.
246
+ [`Srashina2VisionProcessor`] offers all the functionalities of [`Sarashina2VisionImageProcessor`] and [`LlamaTokenizerFast`]. See the
247
+ [`~Srashina2VisionProcessor.__call__`] and [`~Srashina2VisionProcessor.decode`] for more information.
248
+ Args:
249
+ image_processor ([`Sarashina2VisionImageProcessor`], *optional*):
250
+ The image processor is a required input.
251
+ tokenizer ([`LlamaTokenizerFast`], *optional*):
252
+ The tokenizer is a required input.
253
+ chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
254
+ in a chat into a tokenizable string.
255
+ """
256
+
257
+ attributes = ["image_processor", "tokenizer"]
258
+ valid_kwargs = ["chat_template"]
259
+ image_processor_class = "AutoImageProcessor"
260
+ tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
261
+
262
+ def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
263
+ self.image_token = (
264
+ "<|file|>" if not hasattr(tokenizer, "image_token") else tokenizer.image_token
265
+ )
266
+ self.stop_symbol = "\n###"
267
+ super().__init__(image_processor, tokenizer, chat_template=chat_template)
268
+
269
+ def __call__(
270
+ self,
271
+ images: ImageInput = None,
272
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
273
+ **kwargs: Unpack[Srashina2VisionProcessorKwargs],
274
+ ) -> BatchFeature:
275
+ """
276
+ Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
277
+ and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
278
+ the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwrags` arguments to
279
+ Sarashina2VisionImageProcessor's [`~Sarashina2VisionImageProcessor.__call__`] if `vision_infos` is not `None`.
280
+
281
+ Args:
282
+ images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
283
+ The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
284
+ tensor. Both channels-first and channels-last formats are supported.
285
+ text (`str`, `List[str]`, `List[List[str]]`):
286
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
287
+ (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
288
+ `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
289
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
290
+ If set, will return tensors of a particular framework. Acceptable values are:
291
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
292
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
293
+ - `'np'`: Return NumPy `np.ndarray` objects.
294
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
295
+
296
+ Returns:
297
+ [`BatchFeature`]: A [`BatchFeature`] with the following fields:
298
+
299
+ - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
300
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
301
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
302
+ `None`).
303
+ - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
304
+ - **image_grid_thw** -- List of 3D image grids (temporal, height, width) fed to the LLM. Returned when `images` is not `None`.
305
+ """
306
+ output_kwargs = self._merge_kwargs(
307
+ Srashina2VisionProcessorKwargs,
308
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
309
+ **kwargs,
310
+ )
311
+ if images is not None:
312
+ image_inputs = self.image_processor(
313
+ images=images, videos=None, **output_kwargs["images_kwargs"]
314
+ )
315
+ image_grid_thw = image_inputs["image_grid_thw"]
316
+ else:
317
+ image_inputs = {}
318
+ image_grid_thw = None
319
+
320
+ if not isinstance(text, list):
321
+ text = [text]
322
+
323
+ if image_grid_thw is not None:
324
+ merge_length = self.image_processor.merge_size**2
325
+ index = 0
326
+ for i in range(len(text)):
327
+ while self.image_token in text[i]:
328
+ text[i] = text[i].replace(
329
+ self.image_token,
330
+ "<|placeholder|>" * (image_grid_thw[index].prod() // merge_length),
331
+ 1,
332
+ )
333
+ index += 1
334
+ text[i] = text[i].replace("<|placeholder|>", self.image_token)
335
+
336
+ text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
337
+
338
+ return BatchFeature(data={**text_inputs, **image_inputs})
339
+
340
+ def batch_decode(self, *args, **kwargs):
341
+ """
342
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
343
+ """
344
+ return [
345
+ output.replace(self.stop_symbol, "")
346
+ for output in self.tokenizer.batch_decode(*args, **kwargs)
347
+ ]
348
+
349
+ def decode(self, *args, **kwargs):
350
+ """
351
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`].
352
+ """
353
+ return self.tokenizer.decode(*args, **kwargs).replace(self.stop_symbol, "")
354
+
355
+ def post_process_image_text_to_text(self, generated_outputs):
356
+ """
357
+ Post-process the output of the model to decode the text.
358
+
359
+ Args:
360
+ generated_outputs (`torch.Tensor` or `np.ndarray`):
361
+ The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
362
+ or `(sequence_length,)`.
363
+
364
+ Returns:
365
+ `List[str]`: The decoded text.
366
+ """
367
+ return self.tokenizer.batch_decode(
368
+ generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
369
+ )
370
+
371
+ @property
372
+ def model_input_names(self):
373
+ tokenizer_input_names = self.tokenizer.model_input_names
374
+ image_processor_input_names = self.image_processor.model_input_names
375
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
376
+
377
+ def get_stopping_criteria(self, stop_symbols: List[str]):
378
+ stopping_criteria = GenerationStopper(stop_str_list=stop_symbols, tokenizer=self.tokenizer)
379
+ return stopping_criteria.criteria
380
+
381
+
382
+ Srashina2VisionProcessor.register_for_auto_class("AutoProcessor")
383
+ AutoImageProcessor.register("Sarashina2VisionImageProcessor", Sarashina2VisionImageProcessor)
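One detail worth noting in `__call__` above: each `<|file|>` tag in the prompt is expanded into one image token per merged patch before tokenization. A self-contained sketch of that arithmetic (the 54x74 grid corresponds to a roughly 1024x768 input under the preprocessor settings):

```python
# Expand a single <|file|> tag the way Srashina2VisionProcessor.__call__ does.
image_token = "<|file|>"
grid_t, grid_h, grid_w = 1, 54, 74  # example image_grid_thw entry
merge_length = 2**2                 # merge_size ** 2

n_image_tokens = (grid_t * grid_h * grid_w) // merge_length

text = "### Human: <|file|>この画像には何が写っていますか?\n### Assistant:"
text = text.replace(image_token, "<|placeholder|>" * n_image_tokens, 1)
text = text.replace("<|placeholder|>", image_token)

print(text.count(image_token))  # 999
```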
processor_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_sarashina2_vision.Srashina2VisionProcessor"
4
+ },
5
+ "processor_class": "Srashina2VisionProcessor"
6
+ }
sample.jpg ADDED

Git LFS Details

  • SHA256: fec4aaeb7320998e81ab2ae24e6568db9d0fd8d108a19daf2d4107c899e71d32
  • Pointer size: 132 Bytes
  • Size of remote file: 2.51 MB
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<cls>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "<sep>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:008293028e1a9d9a1038d9b63d989a2319797dfeaa03f171093a57b33a3a8277
3
+ size 1831879
tokenizer_config.json ADDED
@@ -0,0 +1,176 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_dummy_prefix_space": false,
4
+ "add_eos_token": false,
5
+ "add_prefix_space": false,
6
+ "added_tokens_decoder": {
7
+ "0": {
8
+ "content": "<unk>",
9
+ "lstrip": false,
10
+ "normalized": false,
11
+ "rstrip": false,
12
+ "single_word": false,
13
+ "special": true
14
+ },
15
+ "1": {
16
+ "content": "<s>",
17
+ "lstrip": false,
18
+ "normalized": false,
19
+ "rstrip": false,
20
+ "single_word": false,
21
+ "special": true
22
+ },
23
+ "2": {
24
+ "content": "</s>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false,
29
+ "special": true
30
+ },
31
+ "3": {
32
+ "content": "<pad>",
33
+ "lstrip": false,
34
+ "normalized": false,
35
+ "rstrip": false,
36
+ "single_word": false,
37
+ "special": true
38
+ },
39
+ "4": {
40
+ "content": "<sep>",
41
+ "lstrip": false,
42
+ "normalized": false,
43
+ "rstrip": false,
44
+ "single_word": false,
45
+ "special": true
46
+ },
47
+ "5": {
48
+ "content": "<mask>",
49
+ "lstrip": false,
50
+ "normalized": false,
51
+ "rstrip": false,
52
+ "single_word": false,
53
+ "special": true
54
+ },
55
+ "6": {
56
+ "content": "<cls>",
57
+ "lstrip": false,
58
+ "normalized": false,
59
+ "rstrip": false,
60
+ "single_word": false,
61
+ "special": true
62
+ },
63
+ "7": {
64
+ "content": "<|system|>",
65
+ "lstrip": false,
66
+ "normalized": false,
67
+ "rstrip": false,
68
+ "single_word": false,
69
+ "special": false
70
+ },
71
+ "8": {
72
+ "content": "<|assistant|>",
73
+ "lstrip": false,
74
+ "normalized": false,
75
+ "rstrip": false,
76
+ "single_word": false,
77
+ "special": false
78
+ },
79
+ "9": {
80
+ "content": "<|user|>",
81
+ "lstrip": false,
82
+ "normalized": false,
83
+ "rstrip": false,
84
+ "single_word": false,
85
+ "special": false
86
+ },
87
+ "10": {
88
+ "content": "<|available_tools|>",
89
+ "lstrip": false,
90
+ "normalized": false,
91
+ "rstrip": false,
92
+ "single_word": false,
93
+ "special": false
94
+ },
95
+ "11": {
96
+ "content": "<|tool_calls|>",
97
+ "lstrip": false,
98
+ "normalized": false,
99
+ "rstrip": false,
100
+ "single_word": false,
101
+ "special": false
102
+ },
103
+ "12": {
104
+ "content": "<|tool_results|>",
105
+ "lstrip": false,
106
+ "normalized": false,
107
+ "rstrip": false,
108
+ "single_word": false,
109
+ "special": false
110
+ },
111
+ "13": {
112
+ "content": "<|code|>",
113
+ "lstrip": false,
114
+ "normalized": false,
115
+ "rstrip": false,
116
+ "single_word": false,
117
+ "special": false
118
+ },
119
+ "14": {
120
+ "content": "<|file|>",
121
+ "lstrip": false,
122
+ "normalized": false,
123
+ "rstrip": false,
124
+ "single_word": false,
125
+ "special": false
126
+ },
127
+ "102397": {
128
+ "content": "<|prefix|>",
129
+ "lstrip": false,
130
+ "normalized": false,
131
+ "rstrip": false,
132
+ "single_word": false,
133
+ "special": false
134
+ },
135
+ "102398": {
136
+ "content": "<|suffix|>",
137
+ "lstrip": false,
138
+ "normalized": false,
139
+ "rstrip": false,
140
+ "single_word": false,
141
+ "special": false
142
+ },
143
+ "102399": {
144
+ "content": "<|middle|>",
145
+ "lstrip": false,
146
+ "normalized": false,
147
+ "rstrip": false,
148
+ "single_word": false,
149
+ "special": false
150
+ }
151
+ },
152
+ "auto_map": {
153
+ "AutoProcessor": "processing_sarashina2_vision.Srashina2VisionProcessor"
154
+ },
155
+ "bos_token": "<s>",
156
+ "chat_template": "{{ bos_token + '<|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human\\'s questions.\\n\\n' }}{% for message in messages %}{% if message['role'] == 'user' %}{{ '### Human: ' + message['content'] + '\\n' }}{% elif message['role'] == 'assistant' %}{{ 'Assistant: ' + message['content'] + '\\n' }}{% endif %}{% endfor %}{% if messages[-1]['role'] == 'user' %}{{ '### Assistant:' }}{% endif %}",
157
+ "clean_up_tokenization_spaces": false,
158
+ "cls_token": "<cls>",
159
+ "do_lower_case": false,
160
+ "eos_token": "</s>",
161
+ "extra_ids": 0,
162
+ "extra_special_tokens": {},
163
+ "keep_accents": true,
164
+ "legacy": false,
165
+ "mask_token": "<mask>",
166
+ "model_max_length": 4096,
167
+ "pad_token": "<pad>",
168
+ "padding_side": "left",
169
+ "processor_class": "Srashina2VisionProcessor",
170
+ "sep_token": "<sep>",
171
+ "sp_model_kwargs": {},
172
+ "spaces_between_special_tokens": false,
173
+ "tokenizer_class": "LlamaTokenizer",
174
+ "unk_token": "<unk>",
175
+ "use_default_system_prompt": false
176
+ }
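Since the `chat_template` above fully determines the prompt layout, it can be sanity-checked without loading the vision tower. A quick sketch, assuming the hub repo id from the model card and that only the tokenizer is needed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-vision-8b")

messages = [{"role": "user", "content": "この画像を説明してください。"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
# <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. ...
# ### Human: この画像を説明してください。
# ### Assistant:
```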