LLaVA-Next-Med-OLAB

Leveraging the LLaVA-Next backbone with LLaVA-Med's training curriculum, developed by OLAB at NYU Langone Health

We combined the backbone and pretraining of LLaVA-Next with the staged medical curriculum of the original LLaVA-Med as part of our work on Repurposing the scientific literature with vision-language models. This model served as an intermediate step in training CNS-Obsidian.

Model Details

Base Model: llava-hf/llava-v1.6-34b-hf

Model date: Trained in September 2024, arXiv'd in February 2025, model weights made public in July 2025.

Paper: https://arxiv.org/abs/2502.19546

License

This model may be subject to multiple licenses, including those of the models it builds on. The strictest license terms apply in all relevant cases.

Intended use

Primary Intended Use

The data, code, and model checkpoints are intended to be used solely for (i) future research on vision-language processing and (ii) reproducibility of the experimental results. The primary intended use is to support AI researchers in reproducing and building on top of this work, just as we built on LLaVA-Next and LLaVA-Med. LLaVA-Next-Med-OLAB and its associated models should be helpful for exploring various biomedical vision-language processing (VLP) and visual question answering (VQA) research questions.

Out-of-Scope Use

Any deployed use of the model, commercial or otherwise, is out of scope. The data, code, and model checkpoints are intended for research use only and are not intended for deployment in clinical care or for any clinical decision-making purposes.

Data

This model builds upon LLaVA-Med, which in turn builds upon the PMC-15M dataset. PMC-15M is a large-scale parallel image-text dataset for biomedical vision-language processing, containing 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. It covers a diverse range of biomedical image types, such as microscopy, radiography, histology, and more.

For LLaVA-Next-Med-OLAB, we obtained the training data using the data downloading script from the LLaVA-Med GitHub repository. Through this process, we recovered 467K biomedical image-text pairs for Stage 1 alignment and 56K instruction-following samples for Stage 2 fine-tuning (out of the originally reported 500K and 60K, respectively).
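
As a rough illustration, the snippet below shows one way to filter the downloaded annotation files down to the pairs whose images were actually recovered. The file names and directory layout are assumptions based on the LLaVA-Med data release, not part of this model card; adjust them to your local download.

import json
from pathlib import Path

def filter_recovered(annotation_json, image_root, output_json):
    """Keep only samples whose image file was successfully downloaded."""
    with open(annotation_json) as f:
        samples = json.load(f)
    recovered = [s for s in samples if (Path(image_root) / s["image"]).exists()]
    with open(output_json, "w") as f:
        json.dump(recovered, f)
    return len(recovered)

# Stage 1 alignment data (we recovered ~467K of the reported 500K)
n_align = filter_recovered("alignment/llava_med_alignment_500k.json", "images", "alignment_recovered.json")
# Stage 2 instruction-following data (we recovered ~56K of the reported 60K)
n_instruct = filter_recovered("instruct/llava_med_instruct_60k_inline_mention.json", "images", "instruct_recovered.json")
print(f"Stage 1: {n_align} pairs, Stage 2: {n_instruct} samples")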

How to Use

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

model_id = "NYU-OLAB/LLaVA-Next-Med-OLAB"

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Load epidural hematoma CT from Radiopaedia
url = "https://prod-images-static.radiopaedia.org/images/64765614/eb6541731e66f04fc1e3a544fe55a7935646d39f886bee9aae3da8320c29b165.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What abnormality is shown in this CT scan?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
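
With a base model of roughly 34B parameters, the checkpoint requires substantial GPU memory even in float16. If memory is constrained, the minimal sketch below loads the model in 4-bit precision with bitsandbytes; this assumes bitsandbytes is installed and is not an officially validated configuration for this model.

from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
import torch

# Quantize weights to 4-bit while keeping compute in float16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = LlavaNextForConditionalGeneration.from_pretrained(
    "NYU-OLAB/LLaVA-Next-Med-OLAB",
    quantization_config=quantization_config,
    device_map="auto",
)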

Limitations

This model was developed using English corpora and thus may be considered English-only. It is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. In particular, this model is likely to carry many of the limitations of the models from which it is derived: LLaVA-Next, as well as LLaVA and LLaVA-Med.

Disclosure

Our work was performed and arXiv'd in parallel with LLaVA-NeXT-Med: Medical Multimodal Large Language Model by Yunfei Guo and Wu Huang. It is NOT the same model, but it was trained in a very similar fashion. We added the clarifier -OLAB to our model's name to avoid confusion.

BibTeX entry and citation info

@misc{alyakin2025cnsobsidian,
      title={Repurposing the scientific literature with vision-language models}, 
      author={Anton Alyakin and Jaden Stryker and Daniel Alexander Alber and Karl L. Sangwon and Jin Vivian Lee and Brandon Duderstadt and Akshay Save and David Kurland and Spencer Frome and Shrutika Singh and Jeff Zhang and Eunice Yang and Ki Yun Park and Cordelia Orillac and Aly A. Valliani and Sean Neifert and Albert Liu and Aneek Patel and Christopher Livia and Darryl Lau and Ilya Laufer and Peter A. Rozman and Eveline Teresa Hidalgo and Howard Riina and Rui Feng and Todd Hollon and Yindalon Aphinyanaphongs and John G. Golfinos and Laura Snyder and Eric Leuthardt and Douglas Kondziolka and Eric Karl Oermann},
      year={2025},
      eprint={2502.19546},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.19546}, 
}