Thorscribe-Model-3: A Florence-2 Model for Thoracic Medical Image Captioning
This model, thorscribe/thorscribe-model-3, is a fine-tuned version of microsoft/Florence-2-large specialized for generating descriptive captions for medical images of the thorax (chest). It was developed as part of the THORSCRIBE research project, which aims to create AI-powered tools to assist radiologists and medical professionals in interpreting radiological images.
This model card corresponds to Model 3 from the original research paper, which was identified as the best-performing model based on validation loss and automated evaluation metrics.
Link to Demo on Hugging Face Spaces
Model Description
Thorscribe-Model-3 is an image-to-text, multi-modal model that takes a thoracic medical image (such as an X-ray or CT scan) as input and generates a concise, descriptive caption in English. The primary goal is to provide a preliminary, automated description of the visual findings in the image, which can serve as a supportive tool for medical education, documentation, or as a "second opinion" for practitioners.
The base model, Florence-2, is a powerful vision-language foundation model. This version has been fine-tuned on a specific dataset of thoracic images to adapt its language and visual understanding to the nuances of the medical domain.
Dataset
The model was fine-tuned on thorscribe/rcv2-qa-fixed, a manually curated and cleaned subset of the ROCOv2 (Radiology Objects in Context Version 2) dataset. The subset was created to contain only images and captions related to the thoracic region.
- Dataset: thorscribe/rcv2-qa-fixed
- Training Split: 1,796 image-caption pairs
- Validation Split: 449 image-caption pairs
The captions are descriptive sentences in English, often containing medical terminology relevant to findings in chest radiology.
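As a quick arithmetic check on the split sizes listed above, the following sketch computes each split's share of the subset. The constant and function names are illustrative and are not taken from the project's code.

```python
# Split sizes as stated in this model card.
TRAIN_PAIRS = 1796
VALIDATION_PAIRS = 449


def split_shares(train: int, validation: int) -> tuple[float, float]:
    """Return the (train, validation) fractions of the combined subset."""
    total = train + validation
    return train / total, validation / total


train_share, validation_share = split_shares(TRAIN_PAIRS, VALIDATION_PAIRS)
print(f"total pairs: {TRAIN_PAIRS + VALIDATION_PAIRS}")
print(f"train share: {train_share:.1%}, validation share: {validation_share:.1%}")
```

The figures correspond to an 80/20 split of 2,245 image-caption pairs.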
Training Procedure
Hyperparameters
This model was trained with the following hyperparameters, which yielded the lowest validation loss among nine experimental configurations:
Hyperparameter | Value |
---|---|
Base Model | microsoft/Florence-2-large |
Learning Rate | 1e-6 |
Weight Decay | 0.05 |
Optimizer | AdamW |
Training Epochs | 6 |
Batch Size | 1 |
Loss Function | Cross-Entropy |
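To give a sense of the training budget these values imply, here is a minimal sketch that records the table as a config object and estimates the number of optimizer steps. It assumes no gradient accumulation and that every training pair is seen once per epoch; neither is stated in this card. The `FineTuneConfig` class and `optimizer_steps` function are hypothetical names, not part of the project's code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters copied from the table above (class name is illustrative)."""
    base_model: str = "microsoft/Florence-2-large"
    learning_rate: float = 1e-6
    weight_decay: float = 0.05
    optimizer: str = "AdamW"
    epochs: int = 6
    batch_size: int = 1


def optimizer_steps(config: FineTuneConfig, num_train_pairs: int) -> int:
    """Estimate optimizer steps, assuming no gradient accumulation."""
    # A trailing partial batch still counts as one step.
    steps_per_epoch = math.ceil(num_train_pairs / config.batch_size)
    return steps_per_epoch * config.epochs


config = FineTuneConfig()
# 1,796 is the training split size given in the Dataset section.
total_steps = optimizer_steps(config, num_train_pairs=1796)
print(f"estimated optimizer steps: {total_steps}")
```

Under these assumptions the run amounts to 10,776 single-sample updates, which is consistent with the very small learning rate.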
Frameworks and Hardware
- Frameworks: PyTorch, Hugging Face transformers, datasets
- Hardware: The model was trained on a single NVIDIA RTX 4090 GPU.
How to Use
You can use this model with the transformers library. Make sure you have transformers, torch, and Pillow installed. The example script also imports requests to download a sample image.
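Before running the script, it can help to confirm that the packages are importable. The sketch below is one way to do that with the standard library; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, and `requests` is included only because the example script imports it.

```python
import importlib.util

# Maps the pip distribution name to the module name that gets imported.
# Pillow is installed as "Pillow" but imported as "PIL".
REQUIRED = {
    "transformers": "transformers",
    "torch": "torch",
    "Pillow": "PIL",
    "requests": "requests",
}


def missing_packages(required: dict) -> list:
    """Return the pip names whose modules cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```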
import os
import sys

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM


def main():
    """
    Main function to run the image captioning inference.
    """
    # --- 1. Setup and Configuration ---
    # Authenticate with Hugging Face Hub if a token is available.
    # This is good practice but not always required for public models.
    hf_token = os.environ.get("HF_TOKEN")
    if hf_token:
        from huggingface_hub import login
        login(token=hf_token)
        print("Hugging Face Hub login successful.")
    else:
        print("Warning: HF_TOKEN environment variable not set. Proceeding without login.")

    # Set the device to CUDA if available, otherwise use CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    # Set the model identifier.
    model_id = "thorscribe/thorscribe-model-3"
    print(f"Using model: {model_id}")

    # Determine the appropriate data type for the model based on hardware support.
    if device == "cuda":
        # Use bfloat16 if supported for better performance, otherwise float16.
        torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    else:
        # Use float32 for CPU.
        torch_dtype = torch.float32
    print(f"Using torch dtype: {torch_dtype}")

    # --- 2. Image Processing Functions ---
    def pad_to_square(image, background_color=(0, 0, 0)):
        """
        Pads a PIL image to make it square.

        Args:
            image (PIL.Image.Image): The input image.
            background_color (tuple): The RGB color for the padding.

        Returns:
            PIL.Image.Image: The padded, square image.
        """
        if image is None:
            return None
        width, height = image.size
        if width == height:
            return image
        # Create a new square image with the specified background color.
        new_size = max(width, height)
        new_image = Image.new('RGB', (new_size, new_size), background_color)
        # Paste the original image into the center of the new square image.
        paste_x = (new_size - width) // 2
        paste_y = (new_size - height) // 2
        new_image.paste(image, (paste_x, paste_y))
        return new_image

    def process_image(image, size=1024):
        """
        Processes an image by padding it to a square and resizing it.

        Args:
            image (PIL.Image.Image): The input image.
            size (int): The target size for the final image (width and height).

        Returns:
            PIL.Image.Image: The processed image.
        """
        if image is None:
            return None
        print("Processing image: Padding to square...")
        image = pad_to_square(image)
        print(f"Processing image: Resizing to {size}x{size}...")
        image = image.resize((size, size), Image.LANCZOS)
        print(f"Image processed successfully. Final size: {image.size}")
        return image

    # --- 3. Load Model and Processor ---
    try:
        print(f"Loading processor from {model_id}...")
        processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
        print("Processor loaded successfully.")
        print(f"Loading model from {model_id}...")
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch_dtype,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        model.to(device)
        print("Model loaded and moved to device successfully.")
    except Exception as e:
        print(f"Fatal error loading model or processor: {e}")
        sys.exit(1)

    # --- 4. Inference ---
    # Define the prompt and the URL for the example image.
    prompt = "<CAPTION>"
    image_url = "https://prod-images-static.radiopaedia.org/images/34894868/e0194426997a654457c739504b73e5_big_gallery.jpeg"
    print(f"Loading image from URL: {image_url}")
    try:
        # Load the image from the URL. raise_for_status() turns HTTP errors
        # into RequestException so the handler below reports them.
        response = requests.get(image_url, stream=True, timeout=30)
        response.raise_for_status()
        raw_image = Image.open(response.raw).convert("RGB")

        # Process the image to be suitable for the model.
        processed_image = process_image(raw_image, size=1024)

        # Prepare the inputs for the model.
        inputs = processor(text=prompt, images=processed_image, return_tensors="pt").to(device, dtype=torch_dtype)

        print("Generating caption...")
        # Generate the caption using the model.
        with torch.no_grad():
            generated_ids = model.generate(
                input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"],
                max_new_tokens=1024,
                num_beams=3,
                do_sample=False  # Use deterministic generation
            )

        # Decode the generated IDs to text.
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

        # Use the processor's post-processing to clean up the output.
        parsed_caption = processor.post_process_generation(
            generated_text,
            task=prompt,
            image_size=(raw_image.width, raw_image.height)
        )

        # --- 5. Display Result ---
        print("\n--- CAPTION ---")
        print(parsed_caption[prompt])
        print("---------------")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching image from URL: {e}")
    except Exception as e:
        print(f"An error occurred during inference: {e}")


if __name__ == "__main__":
    main()
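The preprocessing in the script pads the image to a square and then resizes it, so the aspect ratio of the original content is preserved. The sketch below reproduces only that arithmetic, without loading an image, to show where the original pixels land. The function name `square_padding_layout` and the example dimensions are hypothetical.

```python
def square_padding_layout(width: int, height: int, target_size: int = 1024) -> dict:
    """Mirror the arithmetic of pad_to_square followed by a square resize."""
    # The padded canvas uses the longer edge as its side length.
    side = max(width, height)
    # The original image is centered, so the padding is split evenly
    # (integer division puts any odd pixel on the far side).
    paste_x = (side - width) // 2
    paste_y = (side - height) // 2
    # Uniform scale applied when the square canvas is resized.
    scale = target_size / side
    return {"side": side, "paste_offset": (paste_x, paste_y), "scale": scale}


# A landscape 1024x768 image gets 128 px of padding above and below.
print(square_padding_layout(1024, 768))
# A portrait 600x800 image gets 100 px on each side, then is upscaled.
print(square_padding_layout(600, 800))
```

Because the padding is black by default, the added bands blend with the dark background common in radiographs.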
Limitations and Bias
- Domain Specificity: This model is specialized for thoracic images only. It will not perform reliably on medical images from other body parts (e.g., brain, abdomen).
- Not a Diagnostic Tool: The model's outputs are descriptive, not diagnostic. It should NOT be used for clinical decision-making. It is intended as a supplementary tool that requires verification by a qualified medical professional.
- Potential for Errors: The model may occasionally:
- Omit findings: It might miss secondary or subtle abnormalities.
- Simplify terminology: It may use general terms instead of precise clinical ones (e.g., "heart enlargement" instead of "cardiomegaly").
- Hallucinate: In some cases, it may generate findings that are not present in the image.
- Dataset Bias: The model is trained on captions from the ROCOv2 dataset, which are derived from scientific publications. These captions may differ in style and content from actual clinical radiology reports.
Citation
If you use this model or the THORSCRIBE research in your work, please cite the original paper:
@thesis{haq2025thorscribe,
title={THORSCRIBE: MODEL IMAGE CAPTIONING PADA CITRA MEDIS TORAKS UNTUK RADIOLOGI},
author={Haq, Dzakwan Amirul and Djobo, Carlo Nathanael and Haniputra, Marco Sajid Aristo},
school={Universitas Bina Nusantara},
year={2025}
}
Model Authors
Dzakwan Amirul Haq, Carlo Nathanael Djobo, Marco Sajid Aristo Haniputra.