VictorChew committed (verified)
Commit b4c7107 · Parent: 3ea09f6

Upload folder using huggingface_hub
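The commit message says the files below were pushed with huggingface_hub. For reference, a minimal sketch of how such an upload is typically issued with the library's `upload_folder` API follows; the repo id comes from this page, while the local folder path is an assumption.

```python
# Hedged sketch of an upload like this commit; only repo_id is taken from this page.
from huggingface_hub import HfApi

api = HfApi()  # authenticates via `huggingface-cli login` or the HF_TOKEN environment variable
api.upload_folder(
    folder_path="./StructTable-InternVL2-1B",  # local folder name is an assumption
    repo_id="U4R/StructTable-InternVL2-1B",
    repo_type="model",
    commit_message="Upload folder using huggingface_hub",
)
```

`upload_folder` bundles every changed file into a single commit, which is why the README, configs, remote code, and weights all appear in this one diff.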
README.md CHANGED
@@ -13,7 +13,7 @@ license: apache-2.0
 
 [[ Github Repo ]](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) [[ Related Paper ]](https://arxiv.org/abs/2406.11633) [[ Website ]](https://unimodal4reasoning.github.io/DocGenome_page/)
 
-[[ Dataset🤗 ]](https://huggingface.co/datasets/U4R/DocGenome/tree/main) [[ Models🤗 ]](https://huggingface.co/U4R/StructTable-InternVL2-1B/tree/main)
+[[ Dataset🤗 ]](https://huggingface.co/datasets/U4R/DocGenome/tree/main) [[ Models🤗 ]](https://huggingface.co/U4R/StructTable-InternVL2-1B/tree/main) [[ Demo💬 ]](https://www.modelscope.cn/studios/HongbinZhou/StructEqTable-Demo/)
 
 
 </div>
@@ -24,7 +24,9 @@ Welcome to the official repository of StructEqTable-Deploy, a solution that conv
 Table is an effective way to represent structured data in scientific publications, financial statements, invoices, web pages, and many other scenarios. Extracting tabular data from a visual table image and performing the downstream reasoning tasks according to the extracted data is challenging, mainly due to that tables often present complicated column and row headers with spanning cell operation. To address these challenges, we present TableX, a large-scale multi-modal table benchmark extracted from [DocGenome benchmark](https://unimodal4reasoning.github.io/DocGenome_page/) for table pre-training, comprising more than 2 million high-quality Image-LaTeX pair data covering 156 disciplinary classes. Besides, benefiting from such large-scale data, we train an end-to-end model, StructEqTable, which provides the capability to precisely obtain the corresponding LaTeX description from a visual table image and perform multiple table-related reasoning tasks, including structural extraction and question answering, broadening its application scope and potential.
 
 ## Changelog
-- [2024/10/19] 🔥 We have released our **latest model [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B/tree/main)**!
+- [2024/12/12] 🔥 We have released latest model **[StructTable-InternVL2-1B v0.2](https://huggingface.co/U4R/StructTable-InternVL2-1B/tree/main)** with enhanced recognition stability for HTML and Markdown formats!
+
+- [2024/10/19] We have released our latest model StructTable-InternVL2-1B!
 
 Thanks to IntenrVL2 powerful foundational capabilities, and through fine-tuning on the synthetic tabular data and DocGenome dataset, StructTable can convert table image into various common table formats including LaTeX, HTML, and Markdown. Moreover, inference speed has been significantly improved compared to the v0.2 version.
 - [2024/8/22] We have released our StructTable-base-v0.2, fine-tuned on the DocGenome dataset. This version features improved inference speed and robustness, achieved through data augmentation and reduced image token num.
@@ -62,9 +64,10 @@ pip install struct-eqtable==0.3.0
 
 | Base Model | Model Size | Training Data | Data Augmentation | LMDeploy | TensorRT | HuggingFace |
 |---------------------|------------|------------------|-------------------|----------|----------|-------------------|
-| InternVL2-1B | ~1B | DocGenome and Synthetic Data | ✔ | ✔ | | [StructTable v0.3](https://huggingface.co/U4R/StructTable-InternVL2-1B/tree/main) |
-| Pix2Struct-base | ~300M | DocGenome | ✔ | | ✔ | [StructTable v0.2](https://huggingface.co/U4R/StructTable-base/tree/v0.2) |
-| Pix2Struct-base | ~300M | DocGenome | | | ✔ | [StructTable v0.1](https://huggingface.co/U4R/StructTable-base/tree/v0.1) |
+| InternVL2-1B | ~1B | DocGenome and Synthetic Data | ✔ | ✔ | | [StructTable-InternVL2-1B v0.2](https://huggingface.co/U4R/StructTable-InternVL2-1B/tree/main) |
+| InternVL2-1B | ~1B | DocGenome and Synthetic Data | ✔ | ✔ | | [StructTable-InternVL2-1B v0.1](https://huggingface.co/U4R/StructTable-InternVL2-1B/tree/v0.1) |
+| Pix2Struct-base | ~300M | DocGenome | | | ✔ | [StructTable-base v0.2](https://huggingface.co/U4R/StructTable-base/tree/v0.2) |
+| Pix2Struct-base | ~300M | DocGenome | | | ✔ | [StructTable-base v0.1](https://huggingface.co/U4R/StructTable-base/tree/v0.1) |
 
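The changelog and model-zoo table above describe converting a table image into LaTeX, HTML, or Markdown with the updated StructTable-InternVL2-1B weights. A minimal usage sketch follows; it assumes the repo's auto_map resolves `AutoModel` to the `InternVLChatModel` shipped in this commit (as is typical for InternVL-style repos) and calls the `chat()` method defined in `modeling_internvl_chat.py`. The single-tile 448x448 preprocessing and the prompt wording are assumptions, not the official pipeline; for the supported path, see the `struct-eqtable` package referenced in the README hunk above.

```python
# Hedged usage sketch (not the repo's official snippet).
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "U4R/StructTable-InternVL2-1B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

# Assumed preprocessing: one 448x448 tile with ImageNet mean/std normalization.
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = preprocess(Image.open("table.png").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(device=model.device, dtype=model.dtype)

# chat() comes from this commit's modeling_internvl_chat.py; prompt wording is an assumption.
question = "<image>\nConvert this table to LaTeX."
response = model.chat(tokenizer, pixel_values, question,
                      generation_config=dict(max_new_tokens=1024, do_sample=False))
print(response)
```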
config.json CHANGED
@@ -87,7 +87,7 @@
     "top_p": 1.0,
     "torch_dtype": "bfloat16",
     "torchscript": false,
-    "transformers_version": "4.44.0.dev0",
+    "transformers_version": "4.44.2",
     "typical_p": 1.0,
     "use_bfloat16": true,
     "use_cache": false,
@@ -185,7 +185,7 @@
     "top_p": 1.0,
     "torch_dtype": "bfloat16",
     "torchscript": false,
-    "transformers_version": "4.44.0.dev0",
+    "transformers_version": "4.44.2",
     "typical_p": 1.0,
     "use_bfloat16": true,
     "use_flash_attn": true
configuration_intern_vit.py CHANGED
@@ -3,6 +3,7 @@
 # Copyright (c) 2024 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
+
 import os
 from typing import Union
 
configuration_internvl_chat.py CHANGED
@@ -46,12 +46,12 @@ class InternVLChatConfig(PretrainedConfig):
             logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
 
         self.vision_config = InternVisionConfig(**vision_config)
-        if llm_config['architectures'][0] == 'LlamaForCausalLM':
+        if llm_config.get('architectures')[0] == 'LlamaForCausalLM':
             self.llm_config = LlamaConfig(**llm_config)
-        elif llm_config['architectures'][0] == 'Qwen2ForCausalLM':
+        elif llm_config.get('architectures')[0] == 'Qwen2ForCausalLM':
             self.llm_config = Qwen2Config(**llm_config)
         else:
-            raise ValueError('Unsupported architecture: {}'.format(llm_config['architectures'][0]))
+            raise ValueError('Unsupported architecture: {}'.format(llm_config.get('architectures')[0]))
         self.use_backbone_lora = use_backbone_lora
         self.use_llm_lora = use_llm_lora
         self.select_layer = select_layer
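The only change in this file swaps bracket indexing of `llm_config['architectures']` for `dict.get()`. A toy illustration of the difference under a hypothetical config dict is below; whether the surrounding code supplies a fallback when the key is missing is not shown in this hunk.

```python
# Toy illustration with a hypothetical llm_config that lacks the 'architectures' key.
llm_config = {"hidden_size": 896}

try:
    llm_config["architectures"][0]       # bracket indexing: raises KeyError immediately
except KeyError as err:
    print("indexing raised KeyError:", err)

value = llm_config.get("architectures")  # .get(): returns None instead of raising
print("get() returned:", value)
```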
conversation.py CHANGED
@@ -3,11 +3,13 @@ Conversation prompt templates.
 
 We kindly request that you import fastchat instead of copying this file if you wish to use it.
 If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
+
+Modified from https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
 """
 
 import dataclasses
 from enum import IntEnum, auto
-from typing import Any, Dict, List, Tuple, Union
+from typing import Dict, List, Tuple, Union
 
 
 class SeparatorStyle(IntEnum):
@@ -340,17 +342,10 @@ register_conv_template(
         system_template='<|im_start|>system\n{system_message}',
         # note: The new system prompt was not used here to avoid changes in benchmark performance.
         # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
-        # system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
-        system_message='You are a Table Image to LaTeX/Markdown/HMTL Code converter.',
+        system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
         roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
         sep_style=SeparatorStyle.MPT,
         sep='<|im_end|>',
-        stop_token_ids=[
-            2,
-            6,
-            7,
-            8,
-        ],
         stop_str='<|endoftext|>',
     )
 )
@@ -366,11 +361,6 @@ register_conv_template(
         roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
         sep_style=SeparatorStyle.MPT,
         sep='<|im_end|>',
-        stop_token_ids=[
-            2,
-            92543,
-            92542
-        ]
     )
 )
 
@@ -385,10 +375,17 @@ register_conv_template(
         roles=('<|user|>\n', '<|assistant|>\n'),
         sep_style=SeparatorStyle.MPT,
         sep='<|end|>',
-        stop_token_ids=[
-            2,
-            32000,
-            32007
-        ]
+    )
+)
+
+
+register_conv_template(
+    Conversation(
+        name='internvl2_5',
+        system_template='<|im_start|>system\n{system_message}',
+        system_message='你是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
+        roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
+        sep_style=SeparatorStyle.MPT,
+        sep='<|im_end|>\n',
     )
 )
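This hunk registers a new 'internvl2_5' conversation template and drops the hard-coded stop_token_ids from the existing ones. A sketch of how a prompt might be assembled with the new template follows; `get_conv_template()` is defined in this file, while `append_message()` and `get_prompt()` are assumed to behave as in the FastChat Conversation class this module states it was modified from.

```python
# Hedged sketch: building a prompt with the newly registered 'internvl2_5' template.
from conversation import get_conv_template

template = get_conv_template("internvl2_5")
template.append_message(template.roles[0], "<image>\nConvert this table to HTML.")
template.append_message(template.roles[1], None)  # leave the assistant turn open
prompt = template.get_prompt()
print(prompt)
# Expected shape (assumption):
# <|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n
```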
generation_config.json CHANGED
@@ -4,5 +4,5 @@
     151644,
     151645
   ],
-  "transformers_version": "4.44.0.dev0"
+  "transformers_version": "4.44.2"
 }
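Both config.json and generation_config.json now pin transformers_version to the 4.44.2 release instead of a dev build. A hedged sketch of a local pre-flight check is below, mirroring the `version_cmp()` helper that `modeling_internvl_chat.py` defines; treating 4.44.2 as a minimum rather than an exact match is an assumption.

```python
# Hedged sketch: verify the installed transformers matches the version pinned in this commit.
import operator

import transformers
from packaging import version  # packaging ships as a transformers dependency


def version_cmp(v1: str, v2: str, op: str = "eq") -> bool:
    """Compare two version strings using the named operator (eq, ge, le, ...)."""
    return getattr(operator, op)(version.parse(v1), version.parse(v2))


assert version_cmp(transformers.__version__, "4.44.2", "ge"), (
    f"transformers {transformers.__version__} is older than the 4.44.2 pinned in this commit"
)
```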
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e16caef36733e9f43421450f6d4d220017a40c4ac2e8a65f86407059a049c930
+oid sha256:59901e249fd22bef86f66f22005e091379c92a706f55340ac5fc481d930757fb
 size 1876395376
modeling_intern_vit.py CHANGED
@@ -3,6 +3,7 @@
 # Copyright (c) 2024 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
+
 from typing import Optional, Tuple, Union
 
 import torch
modeling_internvl_chat.py CHANGED
@@ -3,8 +3,9 @@
 # Copyright (c) 2024 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
+
 import warnings
-from typing import Any, List, Optional, Tuple, Union
+from typing import List, Optional, Tuple, Union
 
 import torch.utils.checkpoint
 import transformers
@@ -34,6 +35,7 @@ def version_cmp(v1, v2, op='eq'):
 class InternVLChatModel(PreTrainedModel):
     config_class = InternVLChatConfig
     main_input_name = 'pixel_values'
+    base_model_prefix = 'language_model'
     _supports_flash_attn_2 = True
     _no_split_modules = ['InternVisionModel', 'LlamaDecoderLayer', 'Qwen2DecoderLayer']
 
@@ -99,10 +101,11 @@ class InternVLChatModel(PreTrainedModel):
     ) -> Union[Tuple, CausalLMOutputWithPast]:
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 
+        # image_flags = image_flags.squeeze(-1)
         image_flags = image_flags.squeeze(0)
         pixel_values = pixel_values.squeeze(0)
 
-        input_embeds = self.language_model.get_input_embeddings()(input_ids)
+        input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()
 
         vit_embeds = self.extract_feature(pixel_values)
         vit_embeds = vit_embeds[image_flags == 1]
@@ -116,7 +119,6 @@ class InternVLChatModel(PreTrainedModel):
 
         input_ids = input_ids.reshape(B * N)
         selected = (input_ids == self.img_context_token_id)
-
         try:
             input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
         except Exception as e:
@@ -236,9 +238,9 @@ class InternVLChatModel(PreTrainedModel):
 
         tokenizer.padding_side = 'left'
         model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
-        input_ids = model_inputs['input_ids'].cuda()
-        attention_mask = model_inputs['attention_mask'].cuda()
-        eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
+        input_ids = model_inputs['input_ids'].to(self.device)
+        attention_mask = model_inputs['attention_mask'].to(self.device)
+        eos_token_id = tokenizer.convert_tokens_to_ids(template.sep.strip())
         generation_config['eos_token_id'] = eos_token_id
         generation_output = self.generate(
             pixel_values=pixel_values,
@@ -247,7 +249,7 @@ class InternVLChatModel(PreTrainedModel):
             **generation_config
         )
         responses = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
-        responses = [response.split(template.sep)[0].strip() for response in responses]
+        responses = [response.split(template.sep.strip())[0].strip() for response in responses]
         return responses
 
     def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
@@ -266,7 +268,7 @@ class InternVLChatModel(PreTrainedModel):
 
         template = get_conv_template(self.template)
         template.system_message = self.system_message
-        eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
+        eos_token_id = tokenizer.convert_tokens_to_ids(template.sep.strip())
 
         history = [] if history is None else history
         for (old_question, old_answer) in history:
@@ -285,10 +287,9 @@ class InternVLChatModel(PreTrainedModel):
         query = query.replace('<image>', image_tokens, 1)
 
         model_inputs = tokenizer(query, return_tensors='pt')
-        input_ids = model_inputs['input_ids'].cuda()
-        attention_mask = model_inputs['attention_mask'].cuda()
+        input_ids = model_inputs['input_ids'].to(self.device)
+        attention_mask = model_inputs['attention_mask'].to(self.device)
         generation_config['eos_token_id'] = eos_token_id
-
         generation_output = self.generate(
             pixel_values=pixel_values,
             input_ids=input_ids,
@@ -296,7 +297,7 @@ class InternVLChatModel(PreTrainedModel):
             **generation_config
         )
         response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
-        response = response.split(template.sep)[0].strip()
+        response = response.split(template.sep.strip())[0].strip()
         history.append((question, response))
         if return_history:
             return response, history
@@ -316,7 +317,6 @@ class InternVLChatModel(PreTrainedModel):
             visual_features: Optional[torch.FloatTensor] = None,
             generation_config: Optional[GenerationConfig] = None,
             output_hidden_states: Optional[bool] = None,
-            return_dict: Optional[bool] = None,
             img_context_token_id: Optional[bool] = None,
             **generate_kwargs,
     ) -> torch.LongTensor:
@@ -347,7 +347,6 @@ class InternVLChatModel(PreTrainedModel):
             attention_mask=attention_mask,
             generation_config=generation_config,
             output_hidden_states=output_hidden_states,
-            return_dict=return_dict,
             use_cache=True,
             **generate_kwargs,
         )
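Two recurring changes in this file are worth illustrating. Moving tensors with `.to(self.device)` instead of `.cuda()` lets `chat()` and `batch_chat()` run wherever the model was placed (CPU, a specific GPU, or a device map). Deriving `eos_token_id` from `template.sep.strip()` matters because the new 'internvl2_5' template sets `sep` to `'<|im_end|>\n'`, and only the stripped string is an actual vocabulary token. A small illustration follows; no specific token ids are asserted.

```python
# Hedged illustration of why template.sep.strip() is needed for eos_token_id lookup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("U4R/StructTable-InternVL2-1B", trust_remote_code=True)

sep = "<|im_end|>\n"  # value assigned by the internvl2_5 template in conversation.py
print(tokenizer.convert_tokens_to_ids(sep))          # not a vocabulary token: unk id or None
print(tokenizer.convert_tokens_to_ids(sep.strip()))  # the real <|im_end|> special-token id
```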