Borealis-5B-IT

Borealis is an audio-language model that combines a Whisper encoder with a Qwen3-4B LLM for speech understanding and instruction-following tasks.

Model Description

  • Audio Encoder: Whisper Large V3 (frozen)
  • Language Model: Qwen3-4B (fine-tuned)
  • Adapter: 2-layer MLP projecting audio embeddings to LLM space
  • Total Parameters: ~5B
  • Languages: Russian, English

Installation

pip install transformers torch torchaudio safetensors

Quick Start

import torch
import torchaudio
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    device="cuda"
)
model.eval()

# Load audio
audio, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio.mean(dim=0)  # downmix to a mono 1-D waveform

# Generate response
with torch.inference_mode():
    output_ids = model.generate(
        audio=audio,
        user_prompt="What is being said in this audio? <|start_of_audio|><|end_of_audio|>",
        system_prompt="You are a helpful voice assistant.",
        max_new_tokens=256,
        temperature=0.7,
    )

response = model.decode(output_ids[0])
print(response)
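
The load-and-resample steps above can be wrapped in a small helper. A minimal sketch (the function name is ours, not part of the repo) that also downmixes multi-channel files to mono:

import torch
import torchaudio

def load_audio_16k(path: str) -> torch.Tensor:
    """Load an audio file as a mono 16 kHz waveform (1-D tensor)."""
    audio, sr = torchaudio.load(path)
    if audio.shape[0] > 1:
        audio = audio.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != 16000:
        audio = torchaudio.functional.resample(audio, sr, 16000)
    return audio.squeeze(0)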

Prompt Examples

Audio Transcription

output = model.generate(
    audio=audio,
    user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a speech recognition assistant. Accurately transcribe audio to text."
)

Audio Summarization

output = model.generate(
    audio=audio,
    user_prompt="Summarize what is said in this recording: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a helpful voice assistant."
)

Audio Q&A (Russian)

output = model.generate(
    audio=audio,
    user_prompt="О Ρ‡Ρ‘ΠΌ говорится Π² этой аудиозаписи? <|start_of_audio|><|end_of_audio|>",
    system_prompt="Π’Ρ‹ ΠΏΠΎΠ»Π΅Π·Π½Ρ‹ΠΉ голосовой ассистСнт."
)

Content Description

output = model.generate(
    audio=audio,
    user_prompt="Describe in detail what you hear: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an attentive listener."
)

Emotion Analysis

output = model.generate(
    audio=audio,
    user_prompt="What emotions does the speaker express? <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an expert in audio analysis."
)

Training Data

The model was fine-tuned on a diverse mix of audio-instruction datasets:

Dataset                                   Description                                      Size
Vikhrmodels/Speech-Instructions           General speech instruction-following             70k
Vikhrmodels/Speech-Describe               Audio description tasks (speech & non-speech)    ~2M
Vikhrmodels/ToneBooks                     Russian audiobook excerpts                       -
Vikhrmodels/AudioBooksInstructGemini2.5   Instruction data generated with Gemini 2.5       -
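
Individual datasets can be pulled with the datasets library. A hedged sketch (the split name is an assumption; check each dataset card):

from datasets import load_dataset

ds = load_dataset("Vikhrmodels/Speech-Instructions", split="train")  # split name assumed
print(ds[0].keys())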

Model Architecture

Audio Input (16kHz)
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Whisper Large V3β”‚  (Frozen)
β”‚    Encoder      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ (1280-dim embeddings)
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Downsampler   β”‚  (4x temporal reduction)
β”‚   + Adapter     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ (2560-dim embeddings)
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Qwen3-4B      β”‚  (Fine-tuned)
β”‚      LLM        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
    Text Output
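
A minimal PyTorch sketch of the downsampler and adapter stages. The shapes follow the diagram above; the hidden width and GELU activation of the 2-layer MLP are assumptions, not the repo's actual implementation:

import torch
import torch.nn as nn

class AudioLanguageAdapter(nn.Module):
    """2-layer MLP: concatenated Whisper frames (5120-d) -> LLM space (2560-d)."""
    def __init__(self, in_dim: int = 1280 * 4, out_dim: int = 2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),                      # activation choice is an assumption
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def downsample_4x(x: torch.Tensor) -> torch.Tensor:
    # x: [T, 1280] encoder output; concat 4 adjacent frames -> [T // 4, 5120]
    t, d = x.shape
    x = x[: t - t % 4]
    return x.reshape(t // 4, 4 * d)

enc_out = torch.randn(1500, 1280)            # Whisper encoder output for 30 s
audio_embeds = AudioLanguageAdapter()(downsample_4x(enc_out))
print(audio_embeds.shape)                    # torch.Size([375, 2560])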

vLLM Support

Borealis has native vLLM support through a plugin system. This enables high-performance inference with full audio processing.

Install vLLM Plugin

pip install "vllm>=0.12.0"

# Clone plugin only (skip large model weights)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Vikhrmodels/Borealis-5b-it
cd Borealis-5b-it/vllm_borealis
pip install -e .

Basic Usage

import librosa
from vllm import LLM, SamplingParams

# Load model with vLLM
llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Simple prompt with audio placeholder
prompt = "<|AUDIO|>Transcribe this audio."

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)

print(outputs[0].outputs[0].text)
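
llm.generate also accepts a list of prompt dicts, so several clips can be processed in one batched call (audio_a and audio_b stand for two waveforms loaded as above):

outputs = llm.generate(
    [
        {"prompt": prompt, "multi_modal_data": {"audio": audio_a}},
        {"prompt": prompt, "multi_modal_data": {"audio": audio_b}},
    ],
    sampling_params=sampling_params,
)
for out in outputs:
    print(out.outputs[0].text)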

With Chat Format

import librosa
from vllm import LLM, SamplingParams

llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

audio, sr = librosa.load("audio.wav", sr=16000)

# Build prompt with Qwen3 chat format
prompt = """<|im_start|>system
You are a helpful voice assistant.<|im_end|>
<|im_start|>user
<|AUDIO|>What is being said in this audio?<|im_end|>
<|im_start|>assistant
"""

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)

print(outputs[0].outputs[0].text)

OpenAI-Compatible Server

Note: Install the vLLM plugin first (see above).

# Start vLLM server
vllm serve Vikhrmodels/Borealis-5b-it \
    --trust-remote-code \
    --dtype bfloat16 \
    --limit-mm-per-prompt audio=1
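
Once the server is up, it can be queried with the OpenAI SDK. A hedged client sketch: the audio_url content part follows vLLM's generic multimodal chat API, and whether the plugin's chat template accepts it is an assumption:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Vikhrmodels/Borealis-5b-it",
    messages=[
        {"role": "system", "content": "You are a helpful voice assistant."},
        {"role": "user", "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Transcribe this audio."},
        ]},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)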

How It Works

The vLLM plugin processes audio through the full Borealis pipeline:

Audio (numpy array, 16kHz)
    ↓ WhisperFeatureExtractor
Mel spectrogram [128, 3000]
    ↓ WhisperEncoder (frozen)
Encoder output [1500, 1280]
    ↓ Downsample 4x (concat adjacent frames)
[375, 5120]
    ↓ AudioLanguageAdapter (2-layer MLP)
Audio embeddings [375, 2560]
    ↓ Replace <|AUDIO|> tokens
    ↓ Qwen3-4B LLM (vLLM optimized)
Generated text

Each 30-second audio clip produces 375 audio tokens in the sequence.
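
The count follows from the Whisper frame rate; a quick arithmetic check:

# Whisper Large V3 emits 1500 encoder frames per 30 s window (50 frames/s);
# 4x downsampling leaves 1500 // 4 = 375 audio tokens.
frames = 50 * 30
audio_tokens = frames // 4
assert audio_tokens == 375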

Benchmark Results

Tested on NVIDIA A100 with 30s audio input, 128 max tokens:

Method                 Throughput   Speedup
HuggingFace (native)   44.9 tok/s   1.0x
vLLM (plugin)          95.9 tok/s   2.1x

The vLLM plugin provides roughly a 2x throughput speedup over the native HuggingFace pipeline while retaining full audio processing support.

ASR Benchmarks (WER / CER)

Split                 Borealis baseline   Borealis step-2898   Whisper-v3
Russian_LibriSpeech    6.63%               5.64%               11.68%
Common_Voice           8.88%              12.67%               12.23%
Tone_Webinars         56.87%              60.55%                7.77%
Tone_Books             6.03%               5.25%               11.95%
Tone_Speak             4.63%               6.49%                2.68%
Sova_RuDevices        17.28%              21.57%               19.87%

The Whisper-v3 column is Whisper Large V3 and serves as the reference baseline. Lower is better.
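
To reproduce WER/CER numbers on your own transcripts, the jiwer library (not part of this repo) is one option. A minimal sketch:

import jiwer

reference = "ΠΏΡ€ΠΈΠ²Π΅Ρ‚ ΠΌΠΈΡ€"
hypothesis = "ΠΏΡ€ΠΈΠ²Π΅Ρ‚ ΠΌΠΈΡ€Ρƒ"

print("WER:", jiwer.wer(reference, hypothesis))   # word error rate
print("CER:", jiwer.cer(reference, hypothesis))   # character error rate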

Limitations

  • Optimized for audio up to 30 seconds (longer clips can be chunked; see the sketch below)
  • Best performance on Russian and English
  • May not handle heavily noisy audio well
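
For inputs longer than 30 seconds, one workaround is to chunk the waveform and transcribe the pieces separately. A minimal sketch against the Quick Start API above (fixed-size chunks ignore word boundaries, so quality may suffer at the seams):

CHUNK = 30 * 16000                           # 30 s at 16 kHz

pieces = []
for start in range(0, audio.shape[-1], CHUNK):
    with torch.inference_mode():
        ids = model.generate(
            audio=audio[start : start + CHUNK],
            user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
            system_prompt="You are a speech recognition assistant.",
            max_new_tokens=256,
        )
    pieces.append(model.decode(ids[0]))

print(" ".join(pieces))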

Citation

@misc{borealis2025,
  title={Borealis: Audio-Language Model for Speech Understanding},
  author={VikhrModels},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Vikhrmodels/Borealis-5b-it}
}

License

Apache 2.0
