Idefics2-8B-SFT


Idefics2-8B-SFT is a supervised fine-tune (SFT) of HuggingFaceM4/idefics2-8b on the 35k TextVQA dataset. Training ran on RTX A5000 hardware for 10 hours. Wandb report:

[Figure: Wandb training report]

This fine-tuned model achieves a Levenshtein score of 82.29%.
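The card does not say how the Levenshtein score is computed. One common reading is a normalized similarity, 1 - distance / max(len(prediction), len(reference)), averaged over the evaluation set and reported as a percentage. The sketch below assumes that definition; the function names are illustrative and do not come from the author's evaluation code.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(prediction: str, reference: str) -> float:
    """Assumed metric: 1 - distance / length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(prediction, reference) / longest


def mean_similarity_percent(predictions, references) -> float:
    """Average similarity over paired answers, as a percentage."""
    scores = [levenshtein_similarity(p, r) for p, r in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

Under this assumed definition, an exact match scores 100% and a prediction that shares no characters with the reference scores 0%.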

Model Summary

💻 Usage

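import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# Placeholder paths (not from the original card): point these at your own images.
image1 = load_image("path/to/first_image.jpg")
image2 = load_image("path/to/second_image.jpg")

processor = AutoProcessor.from_pretrained("Syed-Hasan-8503/Idefics2-8B-SFT")
model = AutoModelForVision2Seq.from_pretrained("Syed-Hasan-8503/Idefics2-8B-SFT").to(DEVICE)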

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']
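The decoded text above contains the whole conversation, not only the new answer. A minimal sketch of one way to keep just the final reply follows. It assumes the "Assistant:" marker appears exactly as in the sample output, and `last_assistant_reply` is a hypothetical helper, not part of the transformers API.

```python
def last_assistant_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the final marker, or the whole string if absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


sample = (
    "User: What do we see in this image? \n"
    "Assistant: In this image, we can see the city of New York, "
    "and more specifically the Statue of Liberty. \n"
    "User: And how about this image? \n"
    "Assistant: In this image we can see buildings, trees, lights, water and sky."
)
reply = last_assistant_reply(sample)
print(reply)
```

String splitting is fragile if the answer itself contains the marker. Slicing the prompt tokens off `generated_ids` before decoding is a common alternative.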

πŸ† Evaluation

Coming Soon!

Model size: 8.4B params (Safetensors)
Tensor type: FP16