Vit-Axavision-2-ChestX 🩺

This model is a fine-tuned version of nlpconnect/vit-gpt2-image-captioning, trained on a chest X-ray dataset. It was developed as part of Axamine AI's research into medical vision-language applications. The model takes a chest X-ray image as input and generates a descriptive caption, which may help with automated reporting, healthcare research, or AI-assisted diagnostics.


Model Details

  • Base model: nlpconnect/vit-gpt2-image-captioning
  • Architecture: VisionEncoderDecoderModel (ViT encoder + GPT2 decoder)
  • Fine-tuned on dataset: Shrey-1329/cxiu_hf_dataset
  • Model size: ~239M parameters (F32, Safetensors)
  • Developed by: Henilsinh Raj (Axamine AI)
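
The details above can be checked locally. The following is a minimal sketch, not part of the original card: the helper names count_parameters and load_and_report are hypothetical, and the model ID is the one used in the Usage section. It loads the checkpoint, prints the encoder and decoder class names, and counts parameters so the result can be compared with the size listed above.

```python
def count_parameters(model):
    """Return (total, trainable) parameter counts for a PyTorch-style module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


def load_and_report(model_id="Henil1/vit-axavision-2-ChestX"):
    """Load the checkpoint and print its architecture and size.

    Hypothetical helper for illustration; calling it downloads the weights.
    """
    # Imported lazily so the counting helper above has no hard dependency.
    from transformers import VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    total, trainable = count_parameters(model)
    print("encoder:", type(model.encoder).__name__)
    print("decoder:", type(model.decoder).__name__)
    print(f"parameters: {total / 1e6:.1f}M total, {trainable / 1e6:.1f}M trainable")
    return total
```

Calling load_and_report() requires the transformers and torch packages and network access to the Hugging Face Hub.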

Use Cases

Intended Use

  • Chest X-ray image captioning
  • Healthcare research
  • Medical AI experiments
  • Educational purposes

Limitations

  • This model does not provide a medical diagnosis.
  • Captions are descriptive only and may not be clinically accurate.

Usage

Here’s how you can use the model for inference:

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image
import torch

# Load model
model_id = "Henil1/vit-axavision-2-ChestX"
model = VisionEncoderDecoderModel.from_pretrained(model_id)
feature_extractor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Preprocess image
image = Image.open("your_image_path.jpg").convert("RGB")
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device)

# Generate caption
output_ids = model.generate(pixel_values, max_length=64, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Generated caption:", caption)
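
For more than one image, the same steps can be applied in batches. The sketch below is an illustration under assumptions rather than the author's method: the helpers batched and caption_images are hypothetical names, batch_size=8 is an arbitrary default, and the generation settings (max_length=64, num_beams=4) are copied from the snippet above. It expects the model, feature_extractor, tokenizer, and device objects created there.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def caption_images(paths, model, feature_extractor, tokenizer, device, batch_size=8):
    """Caption several image files, one batch at a time (hypothetical helper)."""
    from PIL import Image  # imported lazily; only needed when captioning

    captions = []
    for chunk in batched(paths, batch_size):
        images = []
        for path in chunk:
            # convert() returns a loaded copy, so the file can be closed afterwards.
            with Image.open(path) as img:
                images.append(img.convert("RGB"))
        # The image processor accepts a list of images and stacks them into one tensor.
        pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
        output_ids = model.generate(pixel_values.to(device), max_length=64, num_beams=4)
        decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        captions.extend(text.strip() for text in decoded)
    return captions
```

A call such as caption_images(["a.jpg", "b.jpg"], model, feature_extractor, tokenizer, device) returns one caption per path, in input order. A smaller batch_size may be needed on memory-constrained devices.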

Citation

If you use this model, please cite:

@misc{henil2025axavision,
  author = {Henilsinh Raj},
  title = {Vit-Axavision-2-ChestX: Vision-Language Model for Chest X-Ray Captioning},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Henil1/vit-axavision-2-ChestX}
}
