Aurea: Adaptive Multimodal Fusion for Vision-Language Models

Aurea Logo

Aurea is an open-source research framework centered on an adaptive spatial-range attention module that fuses spatial and semantic cues from encoder features, yielding richer, context-aware representations for downstream tasks.

Explore the full source code and technical documentation on GitHub

Key Features

  • Multiple Vision Encoders: Input images are encoded separately by DINOv2 and SigLIP2.

  • Multi-stage Fusion: The SpatialRangeBlock fuses these inputs through multiple layers of SpatialRangeAttention, which selectively aggregates features by jointly weighting spatial proximity and semantic similarity (see the sketch after this list). The operation is implemented as a highly optimized fused CUDA kernel.

  • Flexible Language Model Integration: While Phi-4 is the default language model, Aurea is designed for easy adaptation to other pretrained language models with minimal engineering effort.

  • Model Weights: Two model checkpoints are provided: (1) base pretrained weights (trained on a ~558k image subset of LAION) and (2) instruction-tuned weights (further fine-tuned on ~625k samples from LLaVA 1.5 datasets). All checkpoints can be downloaded directly from this repository.

  • Extensible and Modular: The code supports straightforward extension, experimentation, and integration with novel encoders or downstream tasks.
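
For intuition, the sketch below shows the spatial-range idea in plain PyTorch: each token attends to the others with a weight that combines semantic similarity (scaled dot product) and spatial proximity (a Gaussian kernel over grid distance). This is an illustration only, not the repository's fused CUDA kernel, and the function and parameter names (spatial_range_attention, sigma_spatial, temperature) are assumptions rather than Aurea's real API.

# Illustrative sketch only: the real SpatialRangeAttention is a fused CUDA
# kernel; names and signatures here are assumptions, not Aurea's API.
import torch
import torch.nn.functional as F

def spatial_range_attention(feats, coords, sigma_spatial=2.0, temperature=1.0):
    # feats:  (N, D) encoder features, one row per spatial token
    # coords: (N, 2) 2D grid positions of the tokens
    # Semantic affinity: scaled dot-product similarity between tokens
    sem = (feats @ feats.t()) / (feats.shape[-1] ** 0.5 * temperature)
    # Spatial affinity: Gaussian (log-space) kernel over pairwise distances
    spa = -torch.cdist(coords, coords).pow(2) / (2 * sigma_spatial ** 2)
    # Joint weights: softmax over the sum of both log-affinities, so a token
    # must be both semantically similar and spatially close to contribute
    attn = F.softmax(sem + spa, dim=-1)
    return attn @ feats

# Example: fuse an 8x8 grid of 256-d tokens
ys, xs = torch.meshgrid(torch.arange(8), torch.arange(8), indexing='ij')
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()
fused = spatial_range_attention(torch.randn(64, 256), coords)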

Installation

  1. Clone the source repository

git clone https://github.com/Dcas89/Aurea.git
cd Aurea

  2. Install Python dependencies

pip install -r requirements.txt

Usage

First, initialize the Aurea model:

from entry import Aurea

aurea = Aurea(root_dir='/path/to/Aurea')

Note: All required model checkpoints are downloaded automatically the first time the model is initialized.

Image + Text Generation (Basic)

Generate text based on an image and prompt:

# Basic image + text generation
response = aurea.generate(
    prompt="How many remote control devices are in this image?", 
    image_path='./assets/cats.png'  # Example image included in the repo
)
print(response)

Generation with Custom Parameters

Tune generation parameters for more control:

# Advanced generation with custom parameters
response = aurea.generate(
    prompt="Only one cat is wearing a collar in the image. Which cat is it? Answer Briefly: Left, Right, or Both", 
    image_path='./assets/cats.png',  # Example image included in the repo
    max_new_tokens=50,          # Maximum number of tokens to generate
    temperature=0.1,            # Lower values make output more deterministic
    repetition_penalty=1.1,     # Penalizes token repetition (>1.0)
    filter_kwargs={'thres': 0.90, 'top_k': 50},  # Parameters for filtering function
    use_dynamic_top_k=False,    # Whether to use dynamic top-k sampling
    min_top_k=50,               # Minimum top-k value if using dynamic top-k
    max_top_k=90,               # Maximum top-k value if using dynamic top-k
    filter_fn=None,             # Custom filtering function
    exclude_prompt=True         # Whether to exclude prompt from returned text
)
print(response)

Logit Filtering

Using a specific filtering function (e.g., top_p):

from generate import top_p

response = aurea.generate(
    prompt="Only one cat is wearing a collar in the image. What is the color of the collar? Answer Briefly: Blue, Light Green, Yellow", 
    image_path='./assets/cats.png',  # Example image included in the repo
    max_new_tokens=50,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.99, 'top_k': 50},
    filter_fn=top_p,            # Using top-p sampling
    exclude_prompt=True
)
print(response)
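
For reference, standard nucleus (top-p) filtering looks roughly like the sketch below. This is a minimal illustration, not the actual top_p implementation in generate.py; it assumes 1-D logits for a single decoding step, and the optional top_k cap (mirroring the top_k entry in filter_kwargs) is a hypothetical addition.

# Minimal sketch of nucleus filtering; not the repository's top_p.
import torch
import torch.nn.functional as F

def top_p_sketch(logits, thres=0.99, top_k=None):
    # Optionally cap the candidate pool first (hypothetical mirror of the
    # top_k entry in filter_kwargs; the real interplay may differ)
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float('-inf'))
    # Sort tokens by probability and locate the nucleus boundary
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = F.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > thres
    # Shift right so the token that crosses the threshold stays included
    remove[1:] = remove[:-1].clone()
    remove[0] = False
    # Mask everything outside the nucleus so sampling ignores it
    filtered = logits.clone()
    filtered[sorted_idx[remove]] = float('-inf')
    return filtered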

Dynamic Top-K Sampling

Example using dynamic top-k sampling (interpolating from max_top_k to min_top_k over generation):

response = aurea.generate(
    prompt="What does the logo say and what does it represent?", 
    image_path='./assets/mazure.png',
    max_new_tokens=100,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.99, 'top_k': 50},
    use_dynamic_top_k=True,     # Enable dynamic top-k sampling
    min_top_k=50,               # Lower bound for top-k
    max_top_k=90,               # Upper bound for top-k
    filter_fn=None,
    exclude_prompt=True
)

print(response)
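
Conceptually, dynamic top-k anneals the sampling pool from max_top_k down to min_top_k as generation proceeds. A minimal sketch of one plausible (linear) schedule is shown below; the actual schedule lives in the repository, and the step/total_steps names are assumptions.

# Illustrative linear anneal; the repository's actual schedule may differ.
def dynamic_top_k(step, total_steps, max_top_k=90, min_top_k=50):
    # Interpolate k from max_top_k at the first step to min_top_k at the last
    if total_steps <= 1:
        return min_top_k
    frac = step / (total_steps - 1)  # 0.0 at start, 1.0 at final step
    return round(max_top_k + (min_top_k - max_top_k) * frac)

# Over 5 steps: [90, 80, 70, 60, 50]
print([dynamic_top_k(t, 5) for t in range(5)])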

Text-Only Generation

Aurea can also be used for text-only tasks:

# Text-only generation (no image)
response = aurea.generate(
    prompt="What is CUDA programming?",
    max_new_tokens=200,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.9, 'top_k': 50},
    exclude_prompt=True
)
print(response)

License

This project is released under the Apache 2.0 License.

Acknowledgements

  • The CUDA spatial-range attention is inspired by and adapted from LLaVA-UHD.
  • Some components were adapted from lucidrains' repositories, which provide excellent implementations of various transformer and attention mechanisms.
  • Thanks to the open-source community for DINOv2, SigLIP2, LLaVA, LLaVA-UHD, and Phi-4.
  • Thanks to Hugging Face for their Transformers and Accelerate libraries.

This project incorporates code and models from:

  • Phi-4 Mini: Copyright (c) 2025 Microsoft Corporation
  • DINOv2: Copyright (c) 2024 Meta Platforms, Inc.
  • SigLIP2: Copyright (c) 2025 Google LLC