# LoRA Fine-tuned Whisper Large v3 for Icelandic ASR
This repository contains a LoRA (Low-Rank Adaptation) adapter for the `openai/whisper-large-v3` model, fine-tuned for Automatic Speech Recognition (ASR) in Icelandic. The fine-tuning was performed on the "Raddrómur Icelandic Speech 22.09" corpus, and the adapter was evaluated on a subset of the "Samrómur Milljón" dataset.
## Model Description

- Base Model: `openai/whisper-large-v3`
- Fine-tuning Method: LoRA (parameter-efficient fine-tuning) using the `peft` library.
- Language: Icelandic (`is`)
- Task: Automatic Speech Recognition (transcription)
## Fine-tuning Data

- Dataset Name: Raddrómur Icelandic Speech 22.09
- Source: Language and Voice Laboratory (LVL) at Reykjavík University (RU)
- Description: Approximately 49 hours of Icelandic speech sourced from radio podcasts (primarily RÚV). The audio is 16 kHz mono FLAC, with automatically aligned transcriptions.
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
## Evaluation

The fine-tuned adapter was evaluated against the base `openai/whisper-large-v3` model on a 1000-sample subset of the `female_18to49_yrs` split of the `language-and-voice-lab/samromur_milljon` dataset.

Evaluation metrics (lower is better):

| Model | WER (%) | CER (%) |
|---|---|---|
| Base Model | 34.15 | 11.05 |
| Fine-tuned Adapter | 33.07 | 10.59 |

(Note: No stereo files were detected in the evaluation subset. Evaluation error flags were `False` for both models, indicating successful completion.)
Comparison plot: WER and CER for the base model vs. the fine-tuned adapter (image not reproduced here).
Interpretation: The fine-tuned LoRA adapter demonstrates a modest improvement over the base `whisper-large-v3` model on this specific Icelandic evaluation subset. The Word Error Rate (WER) was reduced by approximately 1.08 points (absolute), and the Character Error Rate (CER) by approximately 0.46 points (absolute). Further evaluation on larger or different test sets would provide more comprehensive insight.
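For reference, a minimal sketch of how WER and CER can be computed with the Hugging Face `evaluate` library. The exact evaluation script is not included in this repository, so the example sentences below are hypothetical:

```python
import evaluate

# Load the standard WER and CER metrics.
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Hypothetical example: model outputs vs. reference transcriptions.
predictions = ["halló heimur", "þetta er prófun"]
references = ["halló heimur", "þetta er próf"]

# Both metrics return a fraction; multiply by 100 to report percentages.
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}% | CER: {cer:.2f}%")
```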
## How to Use

This LoRA adapter is intended to be used with the base `openai/whisper-large-v3` model.
First, ensure you have the necessary libraries installed:
```bash
# Using pip
pip install transformers peft torch accelerate soundfile librosa

# Or using uv
uv pip install transformers peft torch accelerate soundfile librosa
```
Then, you can load the base model and apply the LoRA adapter from the Hugging Face Hub like this:
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import librosa  # Or your preferred audio loading library

# --- Configuration ---
BASE_MODEL_ID = "openai/whisper-large-v3"
# Replace with your actual Hugging Face Hub ID for the adapter,
# e.g. "jonasaise/whisper-large-v3-lora-is"
ADAPTER_HUB_ID = "jonasaise/your-repo-name"  # <--- CHANGE THIS
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Use the precision the model was trained/evaluated with; fall back to
# float32 on CPU, where half precision is poorly supported.
if torch.cuda.is_available():
    MODEL_PRECISION = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    MODEL_PRECISION = torch.float32
TARGET_LANGUAGE = "is"
TASK = "transcribe"

# --- 1. Load Processor ---
processor = WhisperProcessor.from_pretrained(BASE_MODEL_ID, language=TARGET_LANGUAGE, task=TASK)

# --- 2. Load Base Model ---
print(f"Loading base model: {BASE_MODEL_ID}...")
base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=MODEL_PRECISION,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa",  # Recommended for speed if supported; otherwise remove or use "eager"
)
print("Base model loaded.")

# --- 3. Load LoRA Adapter ---
print(f"Loading LoRA adapter from: {ADAPTER_HUB_ID}...")
# This loads the adapter weights and applies them on top of the base model.
model = PeftModel.from_pretrained(base_model, ADAPTER_HUB_ID)
model = model.to(DEVICE)
model.eval()  # Set to evaluation mode
print("LoRA adapter loaded and applied. Model is on device:", model.device)

# --- 4. Prepare Your Audio ---
# Replace with the actual path to your audio file.
AUDIO_FILE_PATH = "path/to/your/icelandic_audio.wav"  # <--- CHANGE THIS
try:
    # Load the audio and resample to 16 kHz mono, as expected by Whisper.
    speech_array, sampling_rate = librosa.load(AUDIO_FILE_PATH, sr=16000, mono=True)
    print(f"Audio loaded and resampled to 16 kHz mono. Duration: {len(speech_array) / sampling_rate:.2f}s")
except Exception as e:
    raise SystemExit(f"Error loading audio file {AUDIO_FILE_PATH}: {e}")

# Process the audio into log-Mel input features, then move them to the
# target device and cast to the model's precision.
input_features = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(DEVICE, dtype=MODEL_PRECISION)
print("Input features prepared.")

# --- 5. Generate Transcription ---
# Use the model's existing generation_config as a base and pin language/task.
generation_config = model.generation_config
generation_config.language = TARGET_LANGUAGE
generation_config.task = TASK
generation_config.forced_decoder_ids = None  # Let the language/task settings drive decoding
generation_config.suppress_tokens = []  # Clear any suppressed tokens

print("Generating transcription...")
with torch.inference_mode():  # Disable gradient tracking for inference
    predicted_ids = model.generate(input_features, generation_config=generation_config)

# --- 6. Decode Transcription ---
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("-" * 30)
print(f"Transcription: {transcription}")
print("-" * 30)
```
## Training Procedure

This section details the setup and hyperparameters used for fine-tuning the LoRA adapter.

### Data Preprocessing
The fine-tuning script (`finetune_whisper_ice_lora.py`) performs the following preprocessing steps on the Raddrómur dataset (a sketch follows the list):

- Loads audio file paths and transcriptions from the `metadata.tsv` file.
- Constructs full paths to the audio files, accounting for the nested directory structure (e.g., `<DATA_DIR>/speech/<podcast_name_dir>/<podcast_id_dir>/<filename.flac>`).
- Casts audio to 16 kHz mono (though Raddrómur is already in this format).
- Splits the dataset into training and test/validation sets (e.g., a 90/10 split).
- Uses the `WhisperProcessor` to:
  - convert audio arrays into log-Mel input features, and
  - tokenize the Icelandic transcriptions into label IDs.
- A `DataCollatorSpeechSeq2SeqWithPadding` is used to dynamically pad sequences within each batch.
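A minimal sketch of these steps, assuming a Hugging Face `datasets.Dataset` with `audio` and `sentence` columns (the column names may differ in the actual script; the collator follows the standard Whisper fine-tuning recipe):

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from datasets import Audio

def prepare_dataset(batch, processor):
    """Turn one example into log-Mel input features plus label IDs."""
    audio = batch["audio"]  # decoded lazily by datasets after the cast below
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

# dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))   # 16 kHz mono
# dataset = dataset.train_test_split(test_size=0.1)                    # 90/10 split
# dataset = dataset.map(prepare_dataset, fn_kwargs={"processor": processor})

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Pads input features and labels independently within each batch."""
    processor: Any

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pad the log-Mel features with the feature extractor.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the label IDs with the tokenizer, then mask padding with -100
        # so it is ignored by the loss.
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )
        # If the tokenizer already added a BOS token, cut it here; the model
        # prepends the decoder start token itself during training.
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch
```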
### Fine-tuning Hyperparameters & Setup

The model was fine-tuned using the following configuration (a configuration sketch follows the list):

- Base Model: `openai/whisper-large-v3`
- Fine-tuning Method: LoRA (Low-Rank Adaptation) using `peft`.
  - `r` (rank of the LoRA matrices): 32 (example, adjust if different)
  - `lora_alpha`: 64 (example, adjust if different)
  - `target_modules`: `["q_proj", "v_proj"]` (example, adjust if different)
  - `lora_dropout`: 0.05 (example, adjust if different)
- Precision: BFloat16 (`bf16=True` in `Seq2SeqTrainingArguments`).
- Optimizer: AdamW 8-bit (`optim="adamw_8bit"` in `Seq2SeqTrainingArguments`; requires `bitsandbytes`).
- Learning Rate: e.g., `1e-5` (adjust to your actual value).
- Batch Size (per device): e.g., 4 (adjust to your final successful value).
- Gradient Accumulation Steps: e.g., 8 (adjust to your final successful value).
  - Effective Batch Size: (per-device batch size) × (gradient accumulation steps) × (number of GPUs); with the example values above and 2 GPUs, 4 × 8 × 2 = 64.
- Number of Epochs: 3 (or `max_steps`, if that was used).
- Warmup Steps: e.g., 10% of total steps (adjust to your actual value).
- Attention Implementation: Scaled Dot Product Attention (`attn_implementation="sdpa"` during model loading).
- Gradient Checkpointing: enabled (`model.gradient_checkpointing_enable()`).
- Logging: Weights & Biases (`report_to=["wandb"]`).
- Evaluation Strategy during Training: evaluated every `eval_steps` (e.g., 36 steps; adjust to your final value).
- Language & Task: Icelandic (`is`), transcribe (`transcribe`).
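Putting these values together, a minimal sketch of the LoRA and trainer setup. The hyperparameter values are the examples listed above (not confirmed final values), the `dataset`, `processor`, and collator come from the preprocessing sketch, and argument names follow recent `transformers` releases:

```python
from peft import LoraConfig, get_peft_model
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# LoRA configuration with the example values listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # Only a small fraction of weights is trainable
model.gradient_checkpointing_enable()
model.config.use_cache = False  # KV cache is incompatible with gradient checkpointing

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-lora-is",  # illustrative path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=3,
    bf16=True,
    optim="adamw_8bit",           # 8-bit AdamW; requires bitsandbytes
    eval_strategy="steps",
    eval_steps=36,
    report_to=["wandb"],
    remove_unused_columns=False,  # Keep "input_features"/"labels" for the collator
    label_names=["labels"],       # Needed for PEFT-wrapped seq2seq models
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor=processor),
)
trainer.train()
```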
### Compute Infrastructure

- Hardware: NVIDIA DGX A100 (initially targeting 5 GPUs; the final successful training run used 2 GPUs, `6,7`).
- Software:
  - Python 3.10
  - PyTorch
  - `transformers`
  - `datasets`
  - `peft`
  - `accelerate` (via `torchrun`)
  - `uv` (for environment management)
## Intended Use

This fine-tuned LoRA adapter is intended to improve the performance of `openai/whisper-large-v3` for transcribing general Icelandic speech. It is particularly suited for:

- Transcribing Icelandic audio content similar in nature to radio podcasts (the primary source of the Raddrómur fine-tuning data).
- Use cases where improved accuracy on Icelandic-specific vocabulary, names, and nuances is desired over the base multilingual model.
- Applications requiring efficient fine-tuning and deployment, leveraging the small footprint of LoRA adapters (see the merging sketch below).
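If a standalone checkpoint is preferred for deployment, the adapter can be folded into the base weights with `peft`'s `merge_and_unload()`. A minimal sketch; the output path is illustrative:

```python
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

BASE_MODEL_ID = "openai/whisper-large-v3"
ADAPTER_HUB_ID = "jonasaise/your-repo-name"  # <--- CHANGE THIS

base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID, torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, ADAPTER_HUB_ID)

# Fold the LoRA weights into the base model; the result is a plain
# WhisperForConditionalGeneration with no peft dependency at inference time.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./whisper-large-v3-is-merged")
WhisperProcessor.from_pretrained(BASE_MODEL_ID).save_pretrained("./whisper-large-v3-is-merged")
```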
## Limitations and Bias

- Domain Specificity: The fine-tuning dataset (Raddrómur) primarily consists of relatively clean radio podcast speech. Performance on other domains of Icelandic speech (e.g., highly noisy environments, strong accents not represented in Raddrómur, spontaneous conversational speech, or children's speech) may vary.
- Base Model Biases: The base `openai/whisper-large-v3` model has its own inherent limitations and potential biases (e.g., demographic performance differences, sensitivity to certain audio characteristics). These may persist, and may be amplified or mitigated to some extent, after this fine-tuning.
- Evaluation Subset: The reported evaluation metrics are based on a 1000-sample subset of a specific demographic split (`female_18to49_yrs`) of the Samrómur Milljón dataset. Performance might differ on the full dataset, other splits, or other Icelandic evaluation benchmarks.
- LoRA Limitations: While parameter-efficient, LoRA fine-tunes only a small subset of the model's parameters. It might not capture all the nuances that full fine-tuning could, but it offers a significant reduction in computational cost.
## Recommendations
Users should be aware of the above limitations. It is recommended to:
- Test the model on a diverse set of Icelandic audio relevant to the specific application before deployment.
- Consider further fine-tuning or domain adaptation if performance on a specific out-of-domain task is critical.
- Be mindful of potential biases when using the model in sensitive applications.
## License

- This Adapter: [Your Chosen License for the Adapter - e.g., MIT, Apache 2.0]
- Base Model (`openai/whisper-large-v3`): The license of the original Whisper model applies to the base weights.
- Datasets Used:
  - Raddrómur Icelandic Speech 22.09: CC BY 4.0 (see above).
  - Samrómur Milljón: see the dataset card for its license terms.
## Acknowledgements

- The Language and Voice Laboratory (LVL) at Reykjavík University for creating the Raddrómur and Samrómur Milljón datasets.
- The Language Technology Programme for Icelandic 2019-2023, managed by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture, for funding the dataset creation.
- OpenAI for the Whisper model.
- Hugging Face for the `transformers`, `datasets`, `evaluate`, `peft`, and `accelerate` libraries.
- Weights & Biases for the experiment-tracking platform.
- Astral for the `uv` tool.
## Citations

If you use this adapter or build upon this work, please consider citing the original datasets and the base model:

Raddrómur Dataset: Mena, Carlos, et al. "Raddrómur Icelandic Speech 22.09". Web Download. Reykjavík University: Language and Voice Lab, 2022.

Samrómur Milljón Dataset:

```bibtex
@inproceedings{mena2024samromur,
  title={Samr{\'o}mur Millj{\'o}n: An ASR Corpus of One Million Verified Read Prompts in Icelandic},
  author={Mena, Carlos Daniel Hernandez and Gunnarsson, {\TH}orsteinn Da{\dh}i and Gu{\dh}nason, J{\'o}n},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={14305--14312},
  year={2024}
}
```

Whisper Model:

```bibtex
@inproceedings{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
  booktitle={International Conference on Machine Learning},
  pages={28492--28518},
  year={2023},
  organization={PMLR}
}
```