Borealis: open data, code, weights, and a recipe for training an audio LLM

Community Article Published May 25, 2026

Open 5B audio-language model for Russian and English. Open source, open data, full recipe to reproduce.

By Ilya · Ksenia · Nikolay · Konstantin · Alexander — VikhrModels

Hey. Borealis has been quietly cooking for about a year — our open take on Voxtral / Flamingo-audio. Today I want to share how we trained it from scratch, what worked, and what didn't.

Nothing especially new on the recipe: Whisper Large V3 as the audio encoder, Qwen3-4B as the LLM backbone, and an adapter glued in between.

Why audio-LLMs

Classical ASR (Whisper, Wav2Vec2) transcribes well but doesn't understand. Ask Whisper "what is this audio about?" and you get a transcript. Audio-LLMs close that gap — they hear and reason.

What we trained Borealis for:

Summarize long recordings
Answer questions about content
Reason about tone and emotion

Architecture

The recipe is well-trodden: a strong audio encoder, a strong LLM, and an adapter between them.

Audio @ 16 kHz                          [input]
     │  waveform → log-mel
     ▼
Whisper Large V3 encoder                [frozen]
     │  1280-dim · ~1500 tokens / 30 s · 635M params
     ▼
4× downsampler + MLP adapter            [trained]
     │  concat 4 frames → 5120 → 2560 · ~375 tokens / 30 s
     ▼
Qwen3-4B  ·  causal LLM                 [LoRA fine-tuned]
     │
     ▼
Text response                           [output]

Why this stack

Whisper Large V3 — best open speech encoder, especially for multilingual.
Frozen encoder — preserves ASR quality. Basically every VLM does it this way — and we're really just training a VLM, only for audio.
4× downsampling — 1500 → 375 tokens. Audio isn't a dense channel, so compressing pays off.
Qwen3-4B — went with what we had on hand.

~5B parameters total, of which ~500M are trained — LoRA on the LLM + the adapter.
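The token counts in the diagram follow from Whisper's front end. Here is a minimal sketch of that arithmetic, assuming Whisper's standard 10 ms mel hop (160 samples at 16 kHz) and the stride-2 convolution at the start of its encoder. The function name and its defaults are ours, for illustration only:

```python
def audio_tokens(seconds: float = 30.0, sample_rate: int = 16_000,
                 hop_length: int = 160, conv_stride: int = 2,
                 downsample_factor: int = 4) -> dict:
    """Count frames at each stage of the audio path for one window."""
    samples = int(seconds * sample_rate)
    mel_frames = samples // hop_length            # log-mel frames fed to Whisper
    encoder_frames = mel_frames // conv_stride    # Whisper encoder outputs
    # Ceil division: the last group of frames is zero-padded up to a full group.
    llm_tokens = -(-encoder_frames // downsample_factor)
    return {
        "mel_frames": mel_frames,
        "encoder_frames": encoder_frames,
        "llm_tokens": llm_tokens,
    }


budget = audio_tokens()
print(budget)
```

Whisper's feature extractor pads or trims every input to a 30 s window by default, so the 30 s numbers are the ones that matter in practice.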

Datasets

We assembled several data pools for the ablations.

All eight datasets in one place → Borealis training datasets collection

Note. AudioBooksInstructGemini2.5 was built by chunking audiobooks and generating instructions via Gemini 2.5 Pro — summarization, QA, analysis, structured output. The generation script is open.

Experimental setup

Our questions:

Do we need to unfreeze anything?
How much does training-data language matter — RU vs EN?
Does adding plain-text instructions help?
What ratios of languages and text data are optimal?

Config: base ckpt AlexWortega/Borealis5b_90k · 8× GPU · batch 1 / GPU · grad-accum 16 (effective 128) · LR 1e-5 · WER on 6 Russian benchmarks.
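Two bits of arithmetic sit behind that config line: the effective batch size and the WER metric. Below is a minimal sketch of both, assuming plain whitespace tokenization and no text normalization (benchmark scripts usually normalize case and punctuation first). The `wer` helper is illustrative, not the evaluation code used for these runs:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming over edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)


# 8 GPUs x batch 1 per GPU x 16 gradient-accumulation steps
effective_batch = 8 * 1 * 16

print(effective_batch, wer("the cat sat", "the dog sat"))
```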

Results

01 · Russian vs English

How much does training-data language matter?

EN-only hits 20.88% WER on Russian benchmarks — only 1.5 pp behind native-RU.

That points to strong cross-lingual transfer:

Whisper already knows Russian (multilingual pretrain).
Qwen3 also knows Russian.
The adapter only needs alignment in one language; the rest transfers.

Still — native data wins, and surprisingly mixing EN into RU makes things worse. Don't dilute target-language data "for diversity."

02 · Adding plain-text instructions

Does mixing in plain text help?

Non-linear:

10% text → small improvement (19.32 → 19.17).
25% text → degrades (→ 24.02).

At 25%, the model starts forgetting the audio task — the LLM drifts into text-to-text mode and stops fitting the audio embeddings properly.

💡 Takeaway. 10–15% text helps; 25% hurts. Clear sweet spot.
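To make the ratio concrete: if text should make up a fraction f of the final mix and there are N audio examples, the mix needs N·f/(1−f) text examples. The sketch below assumes both pools are plain Python lists and that the share is counted over examples (the post does not say whether the ratio is by examples or by tokens). `mix_pools` is a hypothetical helper:

```python
import random


def mix_pools(audio_examples, text_examples, text_fraction, seed=0):
    """Return a shuffled mix in which text is ~text_fraction of all examples."""
    if not 0 <= text_fraction < 1:
        raise ValueError("text_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_text / (n_audio + n_text) = text_fraction for n_text.
    n_text = round(len(audio_examples) * text_fraction / (1 - text_fraction))
    n_text = min(n_text, len(text_examples))  # cannot sample more than we have
    mixed = list(audio_examples) + rng.sample(list(text_examples), n_text)
    rng.shuffle(mixed)
    return mixed


audio = [("audio", i) for i in range(900)]
text = [("text", i) for i in range(500)]
mixed = mix_pools(audio, text, text_fraction=0.10)
text_share = sum(kind == "text" for kind, _ in mixed) / len(mixed)
print(len(mixed), text_share)
```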

03 · The webinar problem

One benchmark stays stubbornly bad.

All our runs sit around 60% WER on webinars while plain Whisper is at 7.77%.

Webinars = noise, echo, bad mics, niche jargon, multiple speakers and interruptions. The Whisper encoder handles all that, but the LLM "over-corrects" the transcript toward something more grammatical — and the result is just bad.

All runs

Under the hood — serving and integrating with transformers

Serving multimodal models isn't a new topic. Fast part (encoder) + slow part (LLM): run the encoder asynchronously, accumulate its output embeddings, hand them off to the LLM. We chunk audio for Whisper and serve as-is. The rest of this section is the boring story of patching vLLM.

A · Adapter — simple vs deep

borealis/modeling.py ships two adapters. Production uses the simple one — a 2-layer MLP, no biases:

class AudioLanguageAdapter(nn.Module):
    # ~31M params for Whisper-large × Qwen3-4B
    def __init__(self, hidden_size: int, dim: int):
        super().__init__()
        self.w_in  = nn.Linear(hidden_size, dim, bias=False)
        self.gelu  = nn.GELU()
        self.w_out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.w_out(self.gelu(self.w_in(x)))

Dimensions:

encoder.d_model = 1280 · downsample_factor = 4 → hidden_size = 5120
llm.config.hidden_size = 2560 = dim
Net: Linear(5120, 2560) → GELU → Linear(2560, 2560)
Params: 5120·2560 + 2560·2560 ≈ 19.7M for the matrices; with buffers ~31M
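The matrix count is quick to verify by hand. This sketch covers only the two bias-free weight matrices listed above; it makes no attempt to account for the larger ~31M figure:

```python
def linear_params(in_features: int, out_features: int, bias: bool = False) -> int:
    """Parameter count of a dense layer."""
    return in_features * out_features + (out_features if bias else 0)


encoder_dim = 1280        # Whisper Large V3 d_model
downsample_factor = 4     # frames concatenated per audio token
llm_dim = 2560            # Qwen3-4B hidden size

hidden_size = encoder_dim * downsample_factor   # adapter input width
w_in = linear_params(hidden_size, llm_dim)      # Linear(5120, 2560)
w_out = linear_params(llm_dim, llm_dim)         # Linear(2560, 2560)
total = w_in + w_out

print(f"w_in={w_in:,}  w_out={w_out:,}  total={total:,}")
```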

A heavier AudioLanguageAdapterDeep (~80M params) also lives in the repo — three transformer-like blocks with LayerNorm + GELU + residual + dropout. Didn't ship; the simple MLP was enough.

The 4× downsample is a plain view — four neighboring frames concatenated into one with 4× the channel dim:

def _downsample(self, seq):
    k = self.downsample_factor          # 4
    T, d = seq.shape                    # 1500 × 1280
    target = k * math.ceil(T / k)
    if target != T:
        seq = F.pad(seq, (0, 0, 0, target - T))
    return seq.contiguous().view(target // k, d * k)   # 375 × 5120

Token flow: 1500 × 1280 (Whisper output) → 375 × 5120 (downsample) → 375 × 2560 (adapter). Those 375 embeddings fill the slots of <|AUDIO|> placeholder tokens.
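To see what the padding does when the frame count is not a multiple of 4, here is a dependency-free sketch of the same frame stacking on nested lists. It illustrates the reshape only and is not the repo's code; `stack_frames` is a hypothetical name:

```python
def stack_frames(frames, k=4):
    """Concatenate every k neighboring frames; zero-pad the tail to a full group."""
    if not frames:
        return []
    d = len(frames[0])
    pad = (-len(frames)) % k                       # frames missing from the last group
    padded = frames + [[0.0] * d for _ in range(pad)]
    return [
        [x for frame in padded[i:i + k] for x in frame]
        for i in range(0, len(padded), k)
    ]


# 6 frames of width 2 -> 2 stacked frames of width 8, the second half-padded.
frames = [[float(t), float(t) + 0.5] for t in range(6)]
stacked = stack_frames(frames, k=4)
print(stacked)

# The shape used in the post: 1500 x 1280 -> 375 x 5120.
full = stack_frames([[0.0] * 1280] * 1500, k=4)
print(len(full), len(full[0]))
```

Row i of the result holds original frames 4i through 4i+3 back to back, which is the layout a `view` on a contiguous T × d tensor produces.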

The encoder is frozen hard: encoder.eval(), then for p in encoder.parameters(): p.requires_grad = False.

B · Audio augmentations

borealis/augmentations.py is a curriculum machine: an AugmentationPipeline with a dozen random effects, plus an AugmentationScheduler callback that activates different stages at different epochs. Start clean, get harsher over time.

What's in the pipeline (each gated by its own p):

Background noise mix — SNR 18–28 dB · cafe, street, A/C hum
IR convolution — room & hall reverb
EQ — ±6 dB · different mic curves
Random gain — ±3 dB
Band-pass — 150–350 / 3200–5200 Hz · cheap mics
Resample — 14–20 kHz · low-bandwidth channels
Telephony — 8–12 kHz · 180–4200 Hz · phone, call-centers
Codec — 96–160 kbps · MP3 / Opus compression
Clipping — 0.82–0.95 · overdriven signal
Pitch / Speed — ±4 st · 0.8–1.2×
SpecAugment — ≤2 freq masks (27 bins), ≤2 time masks (100 frames)
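For the background-noise mix, the SNR target fixes how much the noise is scaled before it is added. Below is a minimal sketch of that scaling on plain lists of samples, assuming SNR is defined on mean signal power. `mix_at_snr` is a hypothetical helper, not the function in borealis/augmentations.py:

```python
import math


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    p_noise = power(noise)
    if p_noise == 0:
        return list(speech)
    # Pick gain so that 10*log10(P_speech / (gain**2 * P_noise)) == snr_db.
    gain = math.sqrt(power(speech) / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for one second of speech and noise at 16 kHz.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [0.3 * math.sin(2 * math.pi * 3000 * t / 16000 + 1.0) for t in range(16000)]

mixed = mix_at_snr(speech, noise, snr_db=20)
added = [m - s for m, s in zip(mixed, speech)]
measured_snr = 10 * math.log10(power(speech) / power(added))
print(round(measured_snr, 3))
```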

AugmentationScheduler is a HF TrainerCallback; on on_epoch_begin it picks the current AugmentationStage by start_epoch. Curriculum: clean audio first, progressively harsher distortions later.
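The stage lookup itself is simple: take the latest stage whose start_epoch has already been reached. Here is a minimal sketch of that selection without the Trainer dependency. The `AugmentationStage` fields and the example probabilities are assumptions for illustration, not the values used in training:

```python
from dataclasses import dataclass, field


@dataclass
class AugmentationStage:
    name: str
    start_epoch: int
    # Hypothetical field: effect name -> probability p of applying it.
    probs: dict = field(default_factory=dict)


def pick_stage(stages, epoch):
    """Return the stage with the largest start_epoch that is <= epoch."""
    active = [s for s in stages if s.start_epoch <= epoch]
    if not active:
        raise ValueError(f"no stage starts at or before epoch {epoch}")
    return max(active, key=lambda s: s.start_epoch)


# Illustrative curriculum: clean first, harsher later.
CURRICULUM = [
    AugmentationStage("clean", 0),
    AugmentationStage("mild", 1, {"noise": 0.2, "gain": 0.3}),
    AugmentationStage("harsh", 3, {"noise": 0.5, "telephony": 0.2, "codec": 0.2}),
]

for epoch in range(5):
    print(epoch, pick_stage(CURRICULUM, epoch).name)
```

In a real callback this lookup would run in on_epoch_begin and hand the selected probabilities to the pipeline.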

Listen: before & after

The clean clip comes from ToneBooks; the noise is sampled from the Musan split of Vikhrmodels/Audio_Noise_Dataset. The mix is what the model sees at training time with noise augmentation enabled.

"Yet they were not pleasing at all — quite the opposite, they shocked and horrified."

Audio samples: ① clean sample (ToneBooks) · ② noise only (Musan) · ③ speech + noise (SNR ~10 dB) · ④ telephony (300–3400 Hz · 8 kHz)

The third sample is what the model trains on during the hard curriculum epochs: the text is almost at the edge of intelligibility, but that's exactly what stops the adapter from "sticking" to the clean Whisper signal. The fourth is a classic telephone band (300–3400 Hz via 8 kHz resample round-trip).

C · Patching vLLM

The interesting part. vLLM ships a closed set of multimodal architectures out of the box (Qwen2-Audio, LLaVA, Phi-4-MM, a couple more). Borealis isn't there — Whisper-encoder + custom adapter + Qwen3 + two extra vocab tokens. To get the speedup we wrote a vLLM plugin.

The plugin (vllm_borealis) sits next to the weights in the HF model repo. Two files:

__init__.py — entry point. Registers the model with vllm.ModelRegistry.
borealis.py — ~400 lines, four classes for the vLLM API.

def register():
    from vllm import ModelRegistry
    if "BorealisForConditionalGeneration" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "BorealisForConditionalGeneration",
            "vllm_borealis.borealis:BorealisForConditionalGeneration",
        )

vLLM picks up register() via entry_points in pyproject.toml (group vllm.general_plugins). From that moment, "BorealisForConditionalGeneration" in config.json is a first-class architecture name, as if it were native.

The four classes vLLM expects:

BorealisProcessingInfo — declares the modality. Key line: get_supported_mm_limits() == {"audio": 1} enforces one audio per prompt. Also exposes the WhisperFeatureExtractor for waveform → mel.
BorealisDummyInputsBuilder — synthesizes empty 30-second audio for warm-up and profiling so vLLM can size the KV-cache.
BorealisMultiModalProcessor — the magic class. When the user writes a prompt with one <|AUDIO|>, the processor expands it into <|start_of_audio|> + 375×<|AUDIO|> + <|start_of_audio|>, and marks each of those 375 tokens as "embedding will be supplied externally" via PromptUpdateDetails.select_token_id(..., embed_token_id=audio_token_id).
BorealisForConditionalGeneration — the model itself. Holds WhisperEncoder, our AudioLanguageAdapter, and — best part — init_vllm_registered_model(architectures=["Qwen3ForCausalLM"]) instead of a re-implemented LLM.

The key trick. We never re-implement Qwen3 for vLLM. We tell vLLM "give us your own optimized Qwen3 block" via init_vllm_registered_model and get paged-attention, continuous batching and fused kernels for free. The only thing we own is the audio input and adapter: Whisper → downsample → adapter.

from vllm.model_executor.models.utils import (
    init_vllm_registered_model, maybe_prefix,
)

llm_config = AutoConfig.from_pretrained("Qwen/Qwen3-4B")
llm_config.vocab_size = 151671      # base 151669 + 2 audio tokens

self.llm = init_vllm_registered_model(
    vllm_config=vllm_config,
    hf_config=llm_config,
    prefix=maybe_prefix(prefix, "llm"),
    architectures=["Qwen3ForCausalLM"],   # vLLM's own optimized impl
)

Token-level magic. The hard part of multi-modal inference in vLLM is splicing externally-computed embeddings (the adapter output) into specific token positions without losing any of the other optimizations. vLLM handles it via PromptReplacement + PromptUpdateDetails.select_token_id:

def get_replacement_borealis(item_idx):
    # 30 s audio → 3000 mel frames → 1500 encoder frames / 4 = 375 audio tokens
    num_features = audio_embeds[item_idx].shape[0]   # or 375 default
    audio_tokens = [audio_token_id] * num_features
    return PromptUpdateDetails.select_token_id(
        [audio_marker_id] + audio_tokens + [audio_marker_id],
        embed_token_id=audio_token_id,   # ← "these tokens carry external embeddings"
    )

return [PromptReplacement(
    modality="audio",
    target="<|AUDIO|>",                  # single placeholder in user prompt
    replacement=get_replacement_borealis,
)]

One <|AUDIO|> in the prompt inflates to 377 tokens (marker + 375 + marker). The 375 "real" audio tokens get adapter embeddings via embed_token_id; everything else flows through the normal LLM embedding table.
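The expansion is easy to check in isolation. This sketch runs outside vLLM and performs the same substitution on a list of token ids, recording which positions should receive adapter embeddings. The two ids follow the post's vocab layout (151669 for <|AUDIO|>, 151670 for the marker); the function name and the other token ids are made up for illustration:

```python
AUDIO_TOKEN_ID = 151669    # <|AUDIO|>
AUDIO_MARKER_ID = 151670   # <|start_of_audio|>, used on both sides as in the plugin


def expand_audio_placeholder(token_ids, num_features=375):
    """Replace each <|AUDIO|> with marker + num_features x <|AUDIO|> + marker.

    Returns the expanded ids and the positions that take external embeddings.
    """
    expanded, embed_positions = [], []
    for tok in token_ids:
        if tok != AUDIO_TOKEN_ID:
            expanded.append(tok)
            continue
        expanded.append(AUDIO_MARKER_ID)
        start = len(expanded)
        expanded.extend([AUDIO_TOKEN_ID] * num_features)
        embed_positions.extend(range(start, start + num_features))
        expanded.append(AUDIO_MARKER_ID)
    return expanded, embed_positions


prompt = [101, 102, AUDIO_TOKEN_ID, 103]   # arbitrary text ids around one placeholder
expanded, positions = expand_audio_placeholder(prompt)
print(len(prompt), "->", len(expanded), "tokens;", len(positions), "audio slots")
```

Only the positions in the second return value are overwritten with adapter outputs; the two markers and the text tokens keep their ordinary embedding-table lookups.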

A few small under-the-hood details:

Vocab resize. Qwen3 base = 151669. We add <|AUDIO|> (id 151669) and <|start_of_audio|> (id 151670) → vocab_size = 151671. The plugin falls back to those exact ids if config.json doesn't have them.
Stray batch dim. vLLM sometimes ships mel as [N, 1, 128, 3000] because it packs multimodal fields into its own table. Plugin guards with if input_features.dim() == 4 and shape[1] == 1: squeeze(1). Classic footgun.
merge_by_field_config = True — tells vLLM to auto-batch multimodal fields when merging requests. Without it you'd write a collator by hand.
Audio is computed once. Encoder + adapter run once per generate call; the keys and values for the resulting 375 embeddings then live in the KV-cache like those of normal tokens. Every subsequent next-token step only touches the LLM, so the audio-frontend cost amortizes across long generations.

Where the 2.1× comes from. The comparison is unfair in one direction: native transformers does eager attention with dynamic allocation, no continuous batching. vLLM adds:

PagedAttention: KV-cache lives in a page table, so no GPU memory is wasted on padding/fragmentation.
Continuous batching: variable-length requests don't wait for the slowest in the batch.
Fused Qwen3 kernels: optimized CUDA kernels for attention and MLP, especially in bf16.

Measured on NVIDIA A100, 30 s audio, max_tokens=128, bf16. Native transformers: 44.9 tok/s. vLLM plugin: 95.9 tok/s. With batch ≥4 the gap widens further.
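For a sense of scale, the same two numbers can be turned into per-request decode times. This is back-of-the-envelope arithmetic that assumes the tok/s figures describe steady decoding for a single request and ignores encoder and prefill time:

```python
native_tps = 44.9   # native transformers, tokens per second
vllm_tps = 95.9     # vLLM plugin, tokens per second
max_tokens = 128

speedup = vllm_tps / native_tps
native_seconds = max_tokens / native_tps
vllm_seconds = max_tokens / vllm_tps

print(f"speedup: {speedup:.2f}x")
print(f"128 tokens: {native_seconds:.2f} s native vs {vllm_seconds:.2f} s with vLLM")
```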

Full plugin source (~400 lines): Vikhrmodels/Borealis-5b-it/tree/main/vllm_borealis

Practical recommendations

Always start from a pretrain. Without one, the model won't converge in reasonable time. No checkpoint? Pretrain on plain ASR first.
Start native. Cross-lingual transfer works, but native data wins. For Russian — collect Russian audio.
Add text — but only a little. 10–15% of plain-text instructions helps. 25% regresses.
Don't mix audio languages. RU + EN audio didn't beat pure RU. Languages compete for capacity.
Plan a separate path for noisy audio. For meetings or call-centers — fine-tune separately or fall back to Whisper. One general checkpoint won't cover both.

Limitations

Audio longer than ~30 s — caller must chunk it.
Heavy noise — WER degrades.
Streaming — offline only for now.
Multi-audio prompts — limit = 1.
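For the first limitation, the simplest workaround is fixed 30 s windows on the caller's side. Below is a minimal sketch that computes sample ranges for such windows. `chunk_bounds` and its optional overlap parameter are hypothetical, and fixed windows can cut words in half, so splitting at silences is likely the better choice when a VAD is available:

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_seconds=30.0,
                 overlap_seconds=0.0):
    """Return (start, end) sample indices for consecutive windows of audio."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk must be positive and longer than the overlap")
    bounds = []
    start = 0
    while start < num_samples:
        bounds.append((start, min(start + chunk, num_samples)))
        if start + chunk >= num_samples:
            break   # this window already reaches the end of the clip
        start += step
    return bounds


# A 70 s clip at 16 kHz becomes two full windows and one 10 s tail.
bounds = chunk_bounds(num_samples=70 * 16_000)
print(bounds)
```

Each range can then be sliced out of the waveform and sent as its own request, with the per-chunk answers merged by the caller.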

How to try it

Minimal inference via transformers:

from transformers import AutoModel
import torchaudio

model = AutoModel.from_pretrained(
    "Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    device="cuda",
)

audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

output = model.generate(
    audio=audio.squeeze(),
    user_prompt="What is this audio about? <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a helpful voice assistant.",
    max_new_tokens=256,
)
print(model.decode(output[0]))

For production we recommend vLLM (2× faster):

pip install "vllm>=0.12.0"

vllm serve Vikhrmodels/Borealis-5b-it \
  --trust-remote-code \
  --dtype bfloat16

Training an audio-LLM for Russian is doable but nuanced. In one line: pretraining is critical, native data wins, a little text helps, and noise is its own problem. Borealis isn't perfect, but it's a solid open baseline for Russian audio-LLM work, and we hope this post saves someone a few hundred GPU-hours.

Citation

@misc{borealis2025,
  title  = {Borealis: Audio-Language Model for Speech Understanding},
  author = {VikhrModels},
  year   = {2025},
  url    = {https://huggingface.co/Vikhrmodels/Borealis-5b-it}
}
