VIS-Shepherd: Constructing Critic for LLM-based Data Visualization Generation

GitHub Repo

(Figure: VIS-Shepherd overview)

This repository is the official implementation of VIS-Shepherd: Constructing Critic for LLM-based Data Visualization Generation.

Requirements

Common Dependencies

Python Environment Setup

To install the requirements for the Python environment (we recommend Python 3.10):

pip install -r requirements.txt

We recommend installing the dependencies inside a virtual environment, e.g. conda or venv.

LLaMA-Factory

We use LLaMA-Factory for training and model inference. To reproduce our training experiments, please follow the installation instructions in that repository:

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
# More optional dependencies can be found at https://llamafactory.readthedocs.io/en/latest/getting_started/installation.html
pip install -e ".[torch,metrics,deepspeed]"

Training

The dataset for training is available at train/data/viscrafter_20250521.json, with the format as follows:

[
    {
        "input": "the input instruction",
        "output": "the output response",
        "images": [
            "the image path"
        ]
    }
]
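
As a sanity check before launching training, you can verify that every entry matches this schema. The snippet below is a minimal sketch (it is not part of the repository); only the dataset path shown above is taken from the repo.

import json
from pathlib import Path

# Minimal sanity check (not part of this repo): verify each entry has the
# "input"/"output"/"images" fields and that the referenced images exist.
def check_dataset(path="train/data/viscrafter_20250521.json"):
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    for i, entry in enumerate(entries):
        assert isinstance(entry.get("input"), str), f"entry {i}: bad 'input'"
        assert isinstance(entry.get("output"), str), f"entry {i}: bad 'output'"
        images = entry.get("images")
        assert isinstance(images, list) and images, f"entry {i}: bad 'images'"
        for img in images:
            assert Path(img).exists(), f"entry {i}: missing image {img}"
    print(f"{len(entries)} entries look well-formed")

if __name__ == "__main__":
    check_dataset()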

To train the model(s) in the paper, run this command from the root of the project:

llamafactory-cli train train/configs/train-sft-full-viscrafter-20250521.yml

We trained the model on 8 A800 GPUs (80 GB memory each) using DeepSpeed. See the LLaMA-Factory documentation for configuration options to adapt the training parameters to your environment.

Setup Local Inference Server

You can set up an inference server with the following command, which starts an OpenAI-API-compatible server that you can use to test your model.

llamafactory-cli api train/configs/infer-sft-full-viscrafter-20250521.yml
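
Once the server is up, you can send a quick test request with the OpenAI Python client. This is a minimal sketch: the base URL, port, and model name below are assumptions, so check the server log and your inference config for the actual values.

import base64
from openai import OpenAI

# Sketch only: point the OpenAI client at the local LLaMA-Factory server.
# The port and model name are assumptions; adjust them to your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

with open("path/to/visualization.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="VIS-Shepherd",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Critique this visualization and suggest improvements."},
        ],
    }],
    temperature=0.01,
)
print(response.choices[0].message.content)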

Evaluation

First change into the evaluation folder and fill in your API base, API key, and the list of model names to evaluate in evaluation/config/config.yaml. Note that we use Azure's API for GPT-4o, the local inference server for locally trained models, and OpenRouter for other models (e.g. llama-4-maverick).

cd evaluation
## config for openai key
OPENAI_API_BASE: "put your api base here"
OPENAI_API_KEY: "put your api key here"
OPENAI_API_MODEL_LIST: ["gpt-4o", "qwen/qwen-2.5-vl-7b-instruct", "qwen/qwen2.5-vl-72b-instruct", "meta-llama/llama-4-maverick"]
OPENAI_TEMPERATURE: 0.01
OPENAI_TOP_P: 0.1
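
For reference, the sketch below illustrates roughly how these keys map onto an OpenAI-compatible client; it is not the repository's actual loading code, and the evaluation scripts may consume the file differently.

import yaml
from openai import OpenAI

# Illustrative sketch (assumption): load config.yaml and build one client
# that is reused for every critic model listed in OPENAI_API_MODEL_LIST.
with open("config/config.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

client = OpenAI(base_url=cfg["OPENAI_API_BASE"], api_key=cfg["OPENAI_API_KEY"])
for model_name in cfg["OPENAI_API_MODEL_LIST"]:
    print(f"will evaluate with critic model: {model_name}")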

To run inference on the test set with a given model, execute the following command (set --model_used to the name of the model used as the critic); the inference results are saved automatically under the critic_outputs folder:

python run_parallel_autoCritic.py --input_base_path test_set --output_base_path critic_outputs --model_used "The name of the LLM used as critic"

To run automatic evaluation for all inference results under the critic_outputs folder, execute:

./run_all_autoEvaluate.sh

The evaluation results will be saved to evaluation/result.md.

Results

| Model | Mean Score | % Scores 3-5 |
|---|---|---|
| GPT-4o | 3.41 | 72.0% |
| VIS-Shepherd | 2.98 | 67.1% |
| Llama-4-Maverick | 2.94 | 52.8% |
| Qwen-2.5-VL-72B | 2.78 | 49.1% |
| qwen-2.5-VL-7B_1.2k | 2.5 | 52.2% |
| qwen-2.5-VL-7B_0.3k | 2.4 | 44.1% |
| qwen-2.5-VL-7B | 2.2 | 44.1% |