Instructions to use microsoft/maira-2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use microsoft/maira-2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/maira-2", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True, dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use microsoft/maira-2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/maira-2"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/maira-2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/microsoft/maira-2

SGLang

How to use microsoft/maira-2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/maira-2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/maira-2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/maira-2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/maira-2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use microsoft/maira-2 with Docker Model Runner:
```
docker model run hf.co/microsoft/maira-2
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Please confirm that you have read and agree to the following disclaimer.
The model(s) and/or software described in this repository are provided for research and development use only. The model(s) and/or software are not intended for use in clinical decision-making or for any other clinical use, and performance for clinical use has not been established. You bear sole responsibility for any use of these model(s) and/or software, including incorporation into any product intended for clinical use.

Model Card for MAIRA-2

MAIRA-2 is a multimodal transformer designed for the generation of grounded or non-grounded radiology reports from chest X-rays. It is described in more detail in MAIRA-2: Grounded Radiology Report Generation (S. Bannur, K. Bouzid et al., 2024). MAIRA-2 has been built for research purposes only and is being shared to facilitate comparison and further research.

Model Details

Model Description

MAIRA-2 is composed of the image encoder RAD-DINO-MAIRA-2 (used frozen), a projection layer (trained from scratch), and the language model vicuna-7b-v1.5 (fully fine-tuned).

Developed by: Microsoft Research Health Futures
Model type: Multimodal transformer
Language(s) (NLP): English
License: MSRLA
Finetuned from model [optional]: vicuna-7b-1.5, RAD-DINO-MAIRA-2

Uses

MAIRA-2 is shared for research purposes only. It is not meant to be used for clinical practice. MAIRA-2 was not extensively tested for its capabilities and properties, including its accuracy and reliability in application settings, fairness across different demographics and uses, and security and privacy.

Direct Use

As inputs, MAIRA-2 takes a frontal chest X-ray, and any of the following:

A lateral view from the current study
A frontal view from the prior study, with accompanying prior report
The indication for the current study
The technique and comparison sections for the current study

MAIRA-2 can generate the findings section of the current study, in one of two forms:

Narrative text, without any image annotations (this is the typical report generation scenario).
As a grounded report, wherein all described findings are accompanied by zero or more bounding boxes indicating their location on the current frontal image.

MAIRA-2 can also perform phrase grounding. In this case, it must also be provided with an input phrase. It will then repeat the phrase and generate a bounding box localising the finding described in the phrase.

These use-cases are illustrated with sample code below.

Out-of-Scope Use

MAIRA-2 was trained on chest X-rays from adults with English language reports only, and is not expected to work on any other imaging modality or anatomy. Variations in the input prompt (e.g. changing the instruction) are likely to degrade performance, as this model was not optimised for arbitrary user inputs.

As above, this is a research model which should not be used in any real clinical or production scenario.

Bias, Risks, and Limitations

Data biases

MAIRA-2 was trained on chest X-ray report datasets from Spain (translated from the original Spanish to English) and the USA, listed below. Reporting styles, patient demographics and disease prevalence, and image acquisition protocols can vary across health systems and regions. These factors will impact the generalisability of the model.

Model errors (fabrication, omission)

This model does not perform perfectly on its tasks, as outlined in more detail in the MAIRA-2 report. Hence, errors can be present in the generated (grounded) reports.

How to Get Started with the Model

We demonstrate below how to run inference with MAIRA-2 for its three capabilities: findings generation with and without grounding, and phrase grounding.

Setup

To run this sample code, you will need the following packages:

pillow
protobuf
sentencepiece
torch
transformers>=4.48.0,<4.52

Note: MAIRA-2 has last been tested with transformers v4.51.3.

First, initialise the model and put it in eval mode.

from transformers import AutoModelForCausalLM, AutoProcessor
from pathlib import Path
import torch

model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)

device = torch.device("cuda")
model = model.eval()
model = model.to(device)

We need to get some data to demonstrate the forward pass. For this example, we'll collect an example from the IU X-ray dataset, which has a permissive license.

import requests
from PIL import Image

def get_sample_data() -> dict[str, Image.Image | str]:
    """
    Download chest X-rays from IU-Xray, which we didn't train MAIRA-2 on. License is CC.
    We modified this function from the Rad-DINO repository on Huggingface.
    """
    frontal_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-1001.png"
    lateral_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-2001.png"

    def download_and_open(url: str) -> Image.Image:
        response = requests.get(url, headers={"User-Agent": "MAIRA-2"}, stream=True)
        return Image.open(response.raw)

    frontal_image = download_and_open(frontal_image_url)
    lateral_image = download_and_open(lateral_image_url)

    sample_data = {
        "frontal": frontal_image,
        "lateral": lateral_image,
        "indication": "Dyspnea.",
        "comparison": "None.",
        "technique": "PA and lateral views of the chest.",
        "phrase": "Pleural effusion."  # For the phrase grounding example. This patient has pleural effusion.
    }
    return sample_data

sample_data = get_sample_data()

Use-case 1 and 2: Findings generation with or without grounding

We can toggle whether MAIRA-2 generates a grounded report based on how we preprocess the inputs, as it uses a different prompt. Let's start without grounding (get_grounding=False). While generating, for non-grounded reporting use max_new_tokens=300, and for grounded reporting use max_new_tokens=450 to accommodate additional box and object tokens.

processed_inputs = processor.format_and_preprocess_reporting_input(
    current_frontal=sample_data["frontal"],
    current_lateral=sample_data["lateral"],
    prior_frontal=None,  # Our example has no prior
    indication=sample_data["indication"],
    technique=sample_data["technique"],
    comparison=sample_data["comparison"],
    prior_report=None,  # Our example has no prior
    return_tensors="pt",
    get_grounding=False,  # For this example we generate a non-grounded report
)

processed_inputs = processed_inputs.to(device)
with torch.no_grad():
    output_decoding = model.generate(
        **processed_inputs,
        max_new_tokens=300,  # Set to 450 for grounded reporting
        use_cache=True,
    )
prompt_length = processed_inputs["input_ids"].shape[-1]
decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
decoded_text = decoded_text.lstrip()  # Findings generation completions have a single leading space
prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text)
print("Parsed prediction:", prediction)

We get something that looks like this:

There is a large right pleural effusion with associated right basilar atelectasis. The left lung is clear. No pneumothorax is identified. The cardiomediastinal silhouette and hilar contours are normal. There is no free air under the diaphragm. Surgical clips are noted in the right upper quadrant of the abdomen.

If we had set get_grounding=True, MAIRA-2 would generate a grounded report. For this example, that looks like this:

('There is a large right pleural effusion.', [(0.055, 0.275, 0.445, 0.665)]),
('The left lung is clear.', None),
('No pneumothorax is identified.', None),
('The cardiomediastinal silhouette is within normal limits.', None),
('The visualized osseous structures are unremarkable.', None)

The generated bounding box coordinates are the (x, y) coordinates of the top left and bottom right corners of the box, e.g. (x_topleft, y_topleft, x_bottomright, y_bottomright). These are relative to the cropped image (that is, the image that MAIRA-2 ultimately got as input), so be careful while visualising. The processor provides a method adjust_box_for_original_image_size to get boxes relative to the original image shape.

Note that MAIRA-2 generates slightly different reports for grounded and non-grounded reporting scenarios, a side-effect of its grounded reporting training data coming from a different data distribution.

Use-case 3: Phrase Grounding

Here the input is different as we provide the model with a phrase to ground in the image. Recall (get_sample_data) that our phrase here is just "Pleural effusion", which we already know is present in this image.

processed_inputs = processor.format_and_preprocess_phrase_grounding_input(
    frontal_image=sample_data["frontal"],
    phrase=sample_data["phrase"],
    return_tensors="pt",
)

processed_inputs = processed_inputs.to(device)
with torch.no_grad():
    output_decoding = model.generate(
        **processed_inputs,
        max_new_tokens=150,
        use_cache=True,
    )
prompt_length = processed_inputs["input_ids"].shape[-1]
decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text)

print("Parsed prediction:", prediction)

This gives us something like this:

('Pleural effusion.', [(0.025, 0.345, 0.425, 0.575)])

Again, as for grounded reporting we must remember the bbox coordinates are relative to the cropped image seen by MAIRA-2, use processor.adjust_box_for_original_image_size to get boxes adjusted for the original image shape.

Training details

We did not originally train MAIRA-2 using the exact model class provided here, however we have checked that its behaviour is the same. We provide this class to facilitate research re-use and inference.

Training data

MAIRA-2 was trained on a mix of public and private chest X-ray datasets. Each example comprises one or more CXR images and associated report text, with or without grounding (spatial annotations). The model is trained to generate the findings section of the report, with or without grounding.

Dataset	Country	# examples (ungrounded)	# examples (grounded)
MIMIC-CXR	USA	55 218	595*
PadChest	Spain	52 828	3 122
USMix (Private)	USA	118 031	53 613

*We use the MS-CXR phrase grounding dataset to provide `grounding' examples from MIMIC-CXR.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: NVIDIA A100 GPUs
Hours used: 1432
Cloud Provider: Azure
Compute Region: West US 2
Carbon Emitted: 107.4 CO₂ eq (ostensibly offset by this provider)

Citation

BibTeX:

@article{Bannur2024MAIRA2GR,
  title={MAIRA-2: Grounded Radiology Report Generation},
  author={Shruthi Bannur and Kenza Bouzid and Daniel C. Castro and Anton Schwaighofer and Anja Thieme and Sam Bond-Taylor and Maximilian Ilse and Fernando P\'{e}rez-Garc\'{i}a and Valentina Salvatelli and Harshita Sharma and Felix Meissen and Mercy Prasanna Ranjit and Shaury Srivastav and Julia Gong and Noel C. F. Codella and Fabian Falck and Ozan Oktay and Matthew P. Lungren and Maria T. A. Wetscherek and Javier Alvarez-Valle and Stephanie L. Hyland},
  journal={arXiv},
  year={2024},
  volume={abs/2406.04449},
  url={https://arxiv.org/abs/2406.04449}
}

APA:

Bannur*, S., Bouzid*, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., Meissen, F., Ranjit, M.P., Srivastav, S., Gong, J., Codella, N.C.F., Falck, F., Oktay, O., Lungren, M.P., Wetscherek, M.T., Alvarez-Valle, J., & Hyland, S. L. (2024). MAIRA-2: Grounded Radiology Report Generation. arXiv preprint abs/2406.04449.