Poor performance with simple table extraction task
There is a lot of hype around multimodal models such as Qwen 2.5 VL or Omni, and I would like to know if others have had a similar experience in practice: while they can do impressive things, they still struggle with table extraction in cases that are straightforward for humans.
Attached is a simple example. All I need is a reconstruction of the table as a flat CSV, preserving all empty cells correctly. Which open-source model is able to do that?
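For context, here is a minimal sketch of the kind of setup I mean, roughly following the Qwen2.5-VL model card example via transformers. The model ID, prompt wording, and the `table.png` path are placeholders, not exactly what I ran:

```python
# Minimal sketch: prompting Qwen2.5-VL for flat-CSV table extraction.
# Assumes a recent transformers release with Qwen2.5-VL support and the
# qwen-vl-utils helper package; "table.png" stands in for the attached screenshot.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "table.png"},
            {
                "type": "text",
                "text": (
                    "Transcribe this table as flat CSV. Output one line per table row, "
                    "with the same number of columns in every row, and keep empty cells "
                    "as empty fields (consecutive commas). Output only the CSV."
                ),
            },
        ],
    }
]

# Build the chat prompt and preprocess the image as in the model card example.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Greedy decoding tends to be more faithful for transcription than sampling.
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Even with an explicit instruction about empty cells like the one above, the model tends to drop or merge them.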
I don't have much experience with visual models, but if they are anything like classic text-only LLMs, you may need to scale model size with task difficulty. In other words, if a certain model size isn't enough for a task, you may need a bigger one. In standard text generation tasks, 7B models are barely enough to understand the context, let alone generate an adequate response to the user's input; they are entry-level models. They can handle simple tasks, but for more complex tasks it's probably better to use something bigger.