Thorscribe-Model-3: A Florence-2 Model for Thoracic Medical Image Captioning

This model, thorscribe/thorscribe-model-3, is a fine-tuned version of microsoft/Florence-2-large specialized for generating descriptive captions for medical images of the thorax (chest). It was developed as part of the THORSCRIBE research project, which aims to create AI-powered tools to assist radiologists and medical professionals in interpreting radiological images.

This model card corresponds to Model 3 from the original research paper, which was identified as the best-performing model based on validation loss and automated evaluation metrics.

Link to Demo on Hugging Face Spaces

Model Description

Thorscribe-Model-3 is an image-to-text, multi-modal model that takes a thoracic medical image (such as an X-ray or CT scan) as input and generates a concise, descriptive caption in English. The primary goal is to provide a preliminary, automated description of the visual findings in the image, which can serve as a supportive tool for medical education, documentation, or as a "second opinion" for practitioners.

The base model, Florence-2, is a vision-language foundation model from Microsoft. This version has been fine-tuned on a dataset of thoracic images to adapt its visual and language understanding to chest radiology.

Dataset

The model was fine-tuned on thorscribe/rcv2-qa-fixed, a manually curated and cleaned subset of the ROCOv2 (Radiology Objects in Context Version 2) dataset. The subset was created to exclusively contain images and captions related to the thoracic region.

Dataset: thorscribe/rcv2-qa-fixed
Training Split: 1,796 image-caption pairs
Validation Split: 449 image-caption pairs

The captions are descriptive sentences in English, often containing medical terminology relevant to findings in chest radiology.
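
As a quick sanity check on the split sizes listed above, the following sketch computes the total number of image-caption pairs and the share held out for validation. The variable names are illustrative; only the two counts come from this model card.

```python
# Split sizes as reported in this model card.
train_pairs = 1796
val_pairs = 449

# Derived quantities: total size and the relative share of each split.
total_pairs = train_pairs + val_pairs
train_share = train_pairs / total_pairs
val_share = val_pairs / total_pairs

print(f"Total pairs: {total_pairs}")
print(f"Train share: {train_share:.1%}")
print(f"Validation share: {val_share:.1%}")
```

The counts work out to 2,245 pairs in total, split 80/20 between training and validation.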

Training Procedure

Hyperparameters

This model was trained with the following hyperparameters, which yielded the lowest validation loss among nine experimental configurations:

| Hyperparameter | Value |
|---|---|
| Base Model | `microsoft/Florence-2-large` |
| Learning Rate | `1e-6` |
| Weight Decay | `0.05` |
| Optimizer | AdamW |
| Training Epochs | 6 |
| Batch Size | 1 |
| Loss Function | Cross-Entropy |
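
The values in the table can be collected into a small configuration object, as in the minimal sketch below. The class name `FineTuneConfig` and its helper methods are hypothetical and are not taken from the THORSCRIBE training code. The step count also assumes no gradient accumulation and that incomplete final batches are kept, neither of which is stated in this card.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters copied from the table above (hypothetical container)."""

    base_model: str = "microsoft/Florence-2-large"
    learning_rate: float = 1e-6
    weight_decay: float = 0.05
    epochs: int = 6
    batch_size: int = 1

    def adamw_kwargs(self) -> dict:
        # Keyword arguments in the form accepted by torch.optim.AdamW(params, **kwargs).
        return {"lr": self.learning_rate, "weight_decay": self.weight_decay}

    def total_steps(self, num_train_examples: int, grad_accum_steps: int = 1) -> int:
        # Assumption: the final incomplete batch of each epoch is kept, not dropped.
        batches_per_epoch = math.ceil(num_train_examples / self.batch_size)
        steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
        return steps_per_epoch * self.epochs


config = FineTuneConfig()
print(config.adamw_kwargs())
print(config.total_steps(num_train_examples=1796))
```

Under these assumptions, 1,796 training pairs at batch size 1 for 6 epochs correspond to 10,776 optimizer steps.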

Frameworks and Hardware

Frameworks: PyTorch, Hugging Face transformers, datasets
Hardware: The model was trained on a single NVIDIA RTX 4090 GPU.

How to Use

You can use this model with the transformers library. Make sure you have transformers, torch, Pillow, and requests installed (the example script downloads a sample image with requests).

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
import os
import sys

def main():
    """
    Main function to run the image captioning inference.
    """
    # --- 1. Setup and Configuration ---

    # Authenticate with Hugging Face Hub if a token is available.
    # This is good practice but not always required for public models.
    hf_token = os.environ.get("HF_TOKEN")
    if hf_token:
        from huggingface_hub import login
        login(token=hf_token)
        print("Hugging Face Hub login successful.")
    else:
        print("Warning: HF_TOKEN environment variable not set. Proceeding without login.")

    # Set the device to CUDA if available, otherwise use CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    # Set the model identifier.
    model_id = "thorscribe/thorscribe-model-3"
    print(f"Using model: {model_id}")

    # Determine the appropriate data type for the model based on hardware support.
    if device == "cuda":
        # Use bfloat16 if supported for better performance, otherwise float16.
        torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    else:
        # Use float32 for CPU.
        torch_dtype = torch.float32
    print(f"Using torch dtype: {torch_dtype}")

    # --- 2. Image Processing Functions ---

    def pad_to_square(image, background_color=(0, 0, 0)):
        """
        Pads a PIL image to make it square.

        Args:
            image (PIL.Image.Image): The input image.
            background_color (tuple): The RGB color for the padding.

        Returns:
            PIL.Image.Image: The padded, square image.
        """
        if image is None:
            return None
            
        width, height = image.size
        if width == height:
            return image
        
        # Create a new square image with the specified background color.
        new_size = max(width, height)
        new_image = Image.new('RGB', (new_size, new_size), background_color)
        
        # Paste the original image into the center of the new square image.
        paste_x = (new_size - width) // 2
        paste_y = (new_size - height) // 2
        new_image.paste(image, (paste_x, paste_y))
        
        return new_image

    def process_image(image, size=1024):
        """
        Processes an image by padding it to a square and resizing it.

        Args:
            image (PIL.Image.Image): The input image.
            size (int): The target size for the final image (width and height).

        Returns:
            PIL.Image.Image: The processed image.
        """
        if image is None:
            return None
        
        print("Processing image: Padding to square...")
        image = pad_to_square(image)
        
        print(f"Processing image: Resizing to {size}x{size}...")
        image = image.resize((size, size), Image.LANCZOS)
        
        print(f"Image processed successfully. Final size: {image.size}")
        return image

    # --- 3. Load Model and Processor ---

    try:
        print(f"Loading processor from {model_id}...")
        processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
        print("Processor loaded successfully.")

        print(f"Loading model from {model_id}...")
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch_dtype,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        model.to(device)
        print("Model loaded and moved to device successfully.")

    except Exception as e:
        print(f"Fatal error loading model or processor: {e}")
        sys.exit(1)

    # --- 4. Inference ---

    # Define the prompt and the URL for the example image.
    prompt = "<CAPTION>"
    image_url = "https://prod-images-static.radiopaedia.org/images/34894868/e0194426997a654457c739504b73e5_big_gallery.jpeg"

    print(f"Loading image from URL: {image_url}")
    try:
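        # Load the image from the URL; fail early on HTTP errors or a stalled connection.
        response = requests.get(image_url, stream=True, timeout=30)
        response.raise_for_status()
        raw_image = Image.open(response.raw).convert("RGB")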
        
        # Process the image to be suitable for the model.
        processed_image = process_image(raw_image, size=1024)

        # Prepare the inputs for the model.
        inputs = processor(text=prompt, images=processed_image, return_tensors="pt").to(device, dtype=torch_dtype)

        print("Generating caption...")
        # Generate the caption using the model.
        with torch.no_grad():
            generated_ids = model.generate(
                input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"],
                max_new_tokens=1024,
                num_beams=3,
                do_sample=False  # Use deterministic generation
            )
        
        # Decode the generated IDs to text.
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        
        # Use the processor's post-processing to clean up the output.
        parsed_caption = processor.post_process_generation(
            generated_text, 
            task=prompt, 
            image_size=(raw_image.width, raw_image.height)
        )
        
        # --- 5. Display Result ---
        print("\n--- CAPTION ---")
        print(parsed_caption[prompt])
        print("---------------")

    except requests.exceptions.RequestException as e:
        print(f"Error fetching image from URL: {e}")
    except Exception as e:
        print(f"An error occurred during inference: {e}")

if __name__ == "__main__":
    main()
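
The preprocessing in the script pads the image to a square before resizing, so the anatomy keeps its aspect ratio instead of being stretched. The sketch below reproduces only the geometry of that step in plain Python. The function name `square_padding_layout` and the 2000x1500 example dimensions are hypothetical, chosen for illustration.

```python
def square_padding_layout(width: int, height: int, target: int = 1024):
    """Return (canvas_side, paste_x, paste_y, scale) for pad-then-resize.

    Mirrors the arithmetic of pad_to_square and process_image in the script:
    the canvas side is the longer image edge, the image is centered on it,
    and the whole canvas is then scaled to target x target pixels.
    """
    canvas_side = max(width, height)
    paste_x = (canvas_side - width) // 2
    paste_y = (canvas_side - height) // 2
    scale = target / canvas_side
    return canvas_side, paste_x, paste_y, scale


# Hypothetical 2000x1500 landscape radiograph.
side, x, y, scale = square_padding_layout(2000, 1500)
print(f"canvas: {side}x{side}, paste offset: ({x}, {y}), scale: {scale}")
print(f"content after resize: {round(2000 * scale)}x{round(1500 * scale)}")
```

For this example the image is pasted 250 pixels from the top of a 2000x2000 canvas, and after resizing it occupies 1024x768 pixels, with 128-pixel black bars above and below.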

Limitations and Bias

Domain Specificity: This model is specialized for thoracic images only. It will not perform reliably on medical images from other body parts (e.g., brain, abdomen).
Not a Diagnostic Tool: The model's outputs are descriptive, not diagnostic. It should NOT be used for clinical decision-making. It is intended as a supplementary tool that requires verification by a qualified medical professional.
Potential for Errors: The model may occasionally:
- Omit findings: It might miss secondary or subtle abnormalities.
- Simplify terminology: It may use general terms instead of precise clinical ones (e.g., "heart enlargement" instead of "cardiomegaly").
- Hallucinate: In some cases, it may generate findings that are not present in the image.
Dataset Bias: The model is trained on captions from the ROCOv2 dataset, which are derived from scientific publications. These captions may differ in style and content from actual clinical radiology reports.

Citation

If you use this model or the THORSCRIBE research in your work, please cite the original paper:

@thesis{haq2025thorscribe,
  title={THORSCRIBE: MODEL IMAGE CAPTIONING PADA CITRA MEDIS TORAKS UNTUK RADIOLOGI},
  author={Haq, Dzakwan Amirul and Djobo, Carlo Nathanael and Haniputra, Marco Sajid Aristo},
  school={Universitas Bina Nusantara},
  year={2025}
}

Model Authors

Dzakwan Amirul Haq, Carlo Nathanael Djobo, Marco Sajid Aristo Haniputra.
