Borealis — open data, code, weights recipe for training Audio LLM
Open 5B audio-language model for Russian and English. Open source, open data, full recipe to reproduce.
By Ilya · Ksenia · Nikolay · Konstantin · Alexander — VikhrModels
Hey. Borealis has been quietly cooking for about a year — our open take on Voxtral / Flamingo-audio. Today I want to share how we trained it from scratch, what worked, and what didn't.
Nothing especially new on the recipe: Whisper3-large, Qwen 4B as the LLM backbone, and an adapter glued in between.
Why audio-LLMs
Classical ASR (Whisper, Wav2Vec2) transcribes well but doesn't understand. Ask Whisper "what is this audio about?" and you get a transcript. Audio-LLMs close that gap — they hear and reason.
What we trained Borealis for:
- Summarize long recordings
- Answer questions about content
- Reason about tone and emotion
Architecture
The recipe is well-trodden: a strong audio encoder, a strong LLM, and an adapter between them.
Audio @ 16 kHz [input]
│ waveform → log-mel
▼
Whisper Large V3 encoder [frozen]
│ 1280-dim · ~1500 tokens / 30 s · 635M params
▼
4× downsampler + MLP adapter [trained]
│ concat 4 frames → 5120 → 2560 · ~375 tokens / 30 s
▼
Qwen3-4B · causal LLM [LoRA fine-tuned]
│
▼
Text response [output]
Why this stack
- Whisper Large V3 — best open speech encoder, especially for multilingual.
- Frozen encoder — preserves ASR quality. Basically every VLM does it this way — and we're really just training a VLM, only for audio.
- 4× downsampling — 1500 → 375 tokens. Audio isn't a dense channel, so compressing pays off.
- Qwen3-4B — went with what we had on hand.
~5B parameters total, of which ~500M are trained — LoRA on the LLM + the adapter.
Datasets
We assembled several data pools for the ablations.
All eight datasets in one place → Borealis training datasets collection
Note.
AudioBooksInstructGemini2.5was built by chunking audiobooks and generating instructions via Gemini 2.5 Pro — summarization, QA, analysis, structured output. The generation script is open.
Experimental setup
Our questions:
- Do we need to unfreeze anything?
- How much does training-data language matter — RU vs EN?
- Does adding plain-text instructions help?
- What ratios of languages and text data are optimal?
Config: base ckpt AlexWortega/Borealis5b_90k · 8× GPU · batch 1 / GPU · grad-accum 16 (effective 128) · LR 1e-5 · WER on 6 Russian benchmarks.
Results
01 · Russian vs English
How much does training-data language matter?
EN-only hits 20.88% WER on Russian benchmarks — only 1.5 pp behind native-RU.
That points to strong cross-lingual transfer:
- Whisper already knows Russian (multilingual pretrain).
- Qwen3 also knows Russian.
- The adapter only needs alignment in one language; the rest transfers.
Still — native data wins, and surprisingly mixing EN into RU makes things worse. Don't dilute target-language data "for diversity."
02 · Adding plain-text instructions
Does mixing in plain text help?
Non-linear:
- 10% text → small improvement (19.32 → 19.17).
- 25% text → degrades (→ 24.02).
At 25%, the model starts forgetting the audio task — the LLM drifts into text-to-text mode and stops fitting the audio embeddings properly.
💡 Takeaway. 10–15% text helps; 25% hurts. Clear sweet spot.
03 · The webinar problem
One benchmark stays stubbornly bad.
All our runs sit around 60% WER on webinars while plain Whisper is at 7.77%.
Webinars = noise, echo, bad mics, niche jargon, multiple speakers and interruptions. The Whisper encoder handles all that, but the LLM "over-corrects" the transcript toward something more grammatical — and the result is just bad.
All runs
Under the hood — serving and integrating with transformers
Serving multimodal models isn't a new topic. Fast part (encoder) + slow part (LLM): run the encoder asynchronously, accumulate the logits, hand them off to the LLM. We chunk audio for Whisper and serve as-is. The rest of this section is the boring story of patching vLLM.
A · Adapter — simple vs deep
borealis/modeling.py ships two adapters. Production uses the simple one — a 2-layer MLP, no biases:
class AudioLanguageAdapter(nn.Module):
# ~31M params for Whisper-large × Qwen3-4B
def __init__(self, hidden_size: int, dim: int):
super().__init__()
self.w_in = nn.Linear(hidden_size, dim, bias=False)
self.gelu = nn.GELU()
self.w_out = nn.Linear(dim, dim, bias=False)
def forward(self, x):
return self.w_out(self.gelu(self.w_in(x)))
Dimensions:
encoder.d_model = 1280 · downsample_factor = 4 → hidden_size = 5120llm.config.hidden_size = 2560 = dim- Net:
Linear(5120, 2560) → GELU → Linear(2560, 2560) - Params:
5120·2560 + 2560·2560 ≈ 19.7Mfor the matrices; with buffers ~31M
A heavier AudioLanguageAdapterDeep (~80M params) also lives in the repo — three transformer-like blocks with LayerNorm + GELU + residual + dropout. Didn't ship; the simple MLP was enough.
The 4× downsample is a plain view — four neighboring frames concatenated into one with 4× the channel dim:
def _downsample(self, seq):
k = self.downsample_factor # 4
T, d = seq.shape # 1500 × 1280
target = k * math.ceil(T / k)
if target != T:
seq = F.pad(seq, (0, 0, 0, target - T))
return seq.contiguous().view(target // k, d * k) # 375 × 5120
Token flow: 1500 × 1280 (Whisper output) → 375 × 5120 (downsample) → 375 × 2560 (adapter). Those 375 embeddings fill the slots of <|AUDIO|> placeholder tokens.
The encoder is frozen hard: encoder.eval(), then for p in encoder.parameters(): p.requires_grad = False.
B · Audio augmentations
borealis/augmentations.py is a curriculum machine: an AugmentationPipeline with a dozen random effects, plus an AugmentationScheduler callback that activates different stages at different epochs. Start clean, get harsher over time.
What's in the pipeline (each gated by its own p):
- Background noise mix — SNR 18–28 dB · cafe, street, A/C hum
- IR convolution — room & hall reverb
- EQ — ±6 dB · different mic curves
- Random gain — ±3 dB
- Band-pass — 150–350 / 3200–5200 Hz · cheap mics
- Resample — 14–20 kHz · low-bandwidth channels
- Telephony — 8–12 kHz · 180–4200 Hz · phone, call-centers
- Codec — 96–160 kbps · MP3 / Opus compression
- Clipping — 0.82–0.95 · overdriven signal
- Pitch / Speed — ±4 st · 0.8–1.2×
- SpecAugment — ≤2 freq masks (27 bins), ≤2 time masks (100 frames)
AugmentationScheduler is a HF TrainerCallback; on on_epoch_begin it picks the current AugmentationStage by start_epoch. Curriculum: clean audio first, progressively harsher distortions later.
Listen: before & after
The clean clip comes from ToneBooks; the noise is sampled from the Musan split of Vikhrmodels/Audio_Noise_Dataset. The mix is what the model sees at training time with noise augmentation enabled.
"Yet they were not pleasing at all — quite the opposite, they shocked and horrified."
| ① Clean sample ToneBooks |
② Noise only Musan |
③ Speech + noise SNR ~10 dB |
④ Telephony 300–3400 Hz · 8 kHz |
|---|---|---|---|
The third sample is what the model trains on during the hard curriculum epochs: the text is almost at the edge of intelligibility, but that's exactly what stops the adapter from "sticking" to the clean Whisper signal. The fourth is a classic telephone band (300–3400 Hz via 8 kHz resample round-trip).
C · Patching vLLM
The interesting part. vLLM ships a closed set of multimodal architectures out of the box (Qwen2-Audio, LLaVA, Phi-4-MM, a couple more). Borealis isn't there — Whisper-encoder + custom adapter + Qwen3 + two extra vocab tokens. To get the speedup we wrote a vLLM plugin.
The plugin (vllm_borealis) sits next to the weights in the HF model repo. Two files:
__init__.py— entry point. Registers the model withvllm.ModelRegistry.borealis.py— ~400 lines, four classes for the vLLM API.
def register():
from vllm import ModelRegistry
if "BorealisForConditionalGeneration" not in ModelRegistry.get_supported_archs():
ModelRegistry.register_model(
"BorealisForConditionalGeneration",
"vllm_borealis.borealis:BorealisForConditionalGeneration",
)
vLLM picks up register() via entry_points in pyproject.toml (group vllm.general_plugins). From that moment "BorealisForConditionalGeneration" in config.json is a first-class architecture name — as if it was native.
The four classes vLLM expects:
BorealisProcessingInfo— declares the modality. Key line:get_supported_mm_limits() == {"audio": 1}enforces one audio per prompt. Also exposes theWhisperFeatureExtractorfor waveform → mel.BorealisDummyInputsBuilder— synthesizes empty 30-second audio for warm-up and profiling so vLLM can size the KV-cache.BorealisMultiModalProcessor— the magic class. When the user writes a prompt with one<|AUDIO|>, the processor expands it into<|start_of_audio|>+ 375×<|AUDIO|>+<|start_of_audio|>, and marks each of those 375 tokens as "embedding will be supplied externally" viaPromptUpdateDetails.select_token_id(..., embed_token_id=audio_token_id).BorealisForConditionalGeneration— the model itself. HoldsWhisperEncoder, ourAudioLanguageAdapter, and — best part —init_vllm_registered_model(architectures=["Qwen3ForCausalLM"])instead of a re-implemented LLM.
The key trick. We never re-implement Qwen3 for vLLM. We tell vLLM "give us your own optimized Qwen3 block" via init_vllm_registered_model and get paged-attention, continuous batching and fused kernels for free. The only thing we own is the audio input and adapter: Whisper → downsample → adapter.
from vllm.model_executor.models.utils import (
init_vllm_registered_model, maybe_prefix,
)
llm_config = AutoConfig.from_pretrained("Qwen/Qwen3-4B")
llm_config.vocab_size = 151671 # base 151669 + 2 audio tokens
self.llm = init_vllm_registered_model(
vllm_config=vllm_config,
hf_config=llm_config,
prefix=maybe_prefix(prefix, "llm"),
architectures=["Qwen3ForCausalLM"], # vLLM's own optimized impl
)
Token-level magic. The hard part of multi-modal inference in vLLM is splicing externally-computed embeddings (the adapter output) into specific token positions without losing any of the other optimizations. vLLM handles it via PromptReplacement + PromptUpdateDetails.select_token_id:
def get_replacement_borealis(item_idx):
# 30s audio → 1500 mel frames / 4 = 375 audio tokens
num_features = audio_embeds[item_idx].shape[0] # or 375 default
audio_tokens = [audio_token_id] * num_features
return PromptUpdateDetails.select_token_id(
[audio_marker_id] + audio_tokens + [audio_marker_id],
embed_token_id=audio_token_id, # ← "these tokens carry external embeddings"
)
return [PromptReplacement(
modality="audio",
target="<|AUDIO|>", # single placeholder in user prompt
replacement=get_replacement_borealis,
)]
One <|AUDIO|> in the prompt inflates to 377 tokens (marker + 375 + marker). The 375 "real" audio tokens get adapter embeddings via embed_token_id; everything else flows through the normal LLM embedding table.
A few small under-the-hood details:
- Vocab resize. Qwen3 base = 151669. We add
<|AUDIO|>(id 151669) and<|start_of_audio|>(id 151670) →vocab_size = 151671. The plugin falls back to those exact ids ifconfig.jsondoesn't have them. - Stray batch dim. vLLM sometimes ships mel as
[N, 1, 128, 3000]because it packs multimodal fields into its own table. Plugin guards withif input_features.dim() == 4 and shape[1] == 1: squeeze(1). Classic footgun. merge_by_field_config = True— tells vLLM to auto-batch multimodal fields when merging requests. Without it you'd write a collator by hand.- Audio is computed once. Encoder + adapter run once per generate call; the resulting 375 embeddings live in the KV-cache like normal tokens. Every subsequent
next-token-steponly touches the LLM — so the audio-frontend cost amortizes across long generations.
Where the 2.1× comes from. The comparison is unfair in one direction: native transformers does eager attention with dynamic allocation, no continuous batching. vLLM adds:
- PagedAttention — KV-cache lives in a page table; no GPU minutes wasted on padding/fragmentation.
- Continuous batching — variable-length requests don't wait for the slowest in the batch.
- Fused Qwen3 kernels — optimized CUDA kernels for attention and MLP, especially in bf16.
Measured on NVIDIA A100, 30 s audio, max_tokens=128, bf16. Native transformers: 44.9 tok/s. vLLM plugin: 95.9 tok/s. With batch ≥4 the gap widens further.
Full plugin source (~400 lines): Vikhrmodels/Borealis-5b-it/tree/main/vllm_borealis
Practical recommendations
- Always start from a pretrain. Without one, the model won't converge in reasonable time. No checkpoint? Pretrain on plain ASR first.
- Start native. Cross-lingual transfer works, but native data wins. For Russian — collect Russian audio.
- Add text — but only a little. 10–15% of plain-text instructions helps. 25% regresses.
- Don't mix audio languages. RU + EN audio didn't beat pure RU. Languages compete for capacity.
- Plan a separate path for noisy audio. For meetings or call-centers — fine-tune separately or fall back to Whisper. One general checkpoint won't cover both.
Limitations
- Audio longer than ~30 s — caller must chunk it.
- Heavy noise — WER degrades.
- Streaming — offline only for now.
- Multi-audio prompts — limit = 1.
How to try it
Minimal inference via transformers:
from transformers import AutoModel
import torchaudio
model = AutoModel.from_pretrained(
"Vikhrmodels/Borealis-5b-it",
trust_remote_code=True,
device="cuda",
)
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
audio = torchaudio.functional.resample(audio, sr, 16000)
output = model.generate(
audio=audio.squeeze(),
user_prompt="What is this audio about? <|start_of_audio|><|end_of_audio|>",
system_prompt="You are a helpful voice assistant.",
max_new_tokens=256,
)
print(model.decode(output[0]))
For production we recommend vLLM (2× faster):
pip install vllm>=0.12.0
vllm serve Vikhrmodels/Borealis-5b-it \
--trust-remote-code \
--dtype bfloat16
Training an audio-LLM for Russian is doable but nuanced. In one line: pretrain critical, native data wins, a little text helps, and noise is its own problem. Borealis isn't perfect — but it's a solid open baseline for Russian audio-LLM work, and we hope this post saves someone a few hundred GPU-hours.
Links
- 🤗 Model — Vikhrmodels/Borealis-5b-it
- 💻 Code — github.com/VikhrModels/Borealis
- 🎙 Demo — Vikhrmodels/Borealis-inference
- 📊 Datasets collection — Borealis training datasets
- 📰 Full interactive post — AlexWortega/borealis-blog
- 𝕏 Author — @justALEXWORTEGA
Citation
@misc{borealis2025,
title = {Borealis: Audio-Language Model for Speech Understanding},
author = {VikhrModels},
year = {2025},
url = {https://huggingface.co/Vikhrmodels/Borealis-5b-it}
}
© 2026 VikhrModels · Apache 2.0








