Model Card for alecccdd/transcription-fixer-gemma-3-12b
This model is a fine-tuned version of unsloth/gemma-3-12b-it-unsloth-bnb-4bit
designed to correct errors in audio transcriptions. It processes input transcriptions in English, German, or Spanish, identifies contextual errors, and outputs a "repaired" version.
Model Details
Model Description
This model takes potentially erroneous audio transcriptions as input, identifies the context, and "repairs" words or phrases that are likely transcription errors. It aims to improve the accuracy and readability of automated speech recognition (ASR) outputs. The model was fine-tuned on a dataset comprising 56.9k examples of transcription errors and their corrected versions across English, German, and Spanish, using a specific instructional prompt.
- Developed by: alecccdd
- Model type: Text-to-text generation, fine-tuned for error correction
- Language(s) (NLP): English (en), German (de), Spanish (es)
- License: apache-2.0
- Finetuned from model: unsloth/gemma-3-12b-it-unsloth-bnb-4bit
Uses
Direct Use
This model is intended for direct use in correcting transcription errors in English, German, and Spanish. Users provide an erroneous transcription, and the model outputs a corrected version.
Recommended Prompt Format:
You receive an audio transcription as input. These transcripts can contain errors. Identify the context of the translation and "repair" words that clearly don't make sense.
---
{INPUT_TRANSCRIPTION}
Where {INPUT_TRANSCRIPTION} is the text needing correction. The model will generate the corrected version of this input.
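Below is a minimal inference sketch using the Hugging Face transformers chat pipeline; the generation settings and the sample noisy sentence are illustrative assumptions, and loading the 4-bit checkpoint may additionally require bitsandbytes and a transformers release with Gemma 3 support.

```python
# Illustrative sketch (not from the original card): correcting a noisy transcript
# with the transformers text-generation pipeline and the recommended prompt format.
from transformers import pipeline

MODEL_ID = "alecccdd/transcription-fixer-gemma-3-12b"

PROMPT = (
    "You receive an audio transcription as input. These transcripts can contain errors. "
    'Identify the context of the translation and "repair" words that clearly don\'t make sense.\n'
    "---\n"
    "{transcription}"
)

generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

noisy = "I would like to book a fright to Berlin next Thursday."  # hypothetical ASR output
messages = [{"role": "user", "content": PROMPT.format(transcription=noisy)}]

result = generator(messages, max_new_tokens=256, do_sample=False)
# The pipeline returns the full chat; the last message holds the model's correction.
print(result[0]["generated_text"][-1]["content"])
```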
Example Use Cases:
- Post-processing outputs from Automated Speech Recognition (ASR) systems.
- Improving the quality of transcribed data for training other NLP models.
- Enhancing readability of raw transcripts for human review.
Downstream Use
The corrected transcriptions generated by this model can serve as improved input for various downstream NLP tasks, such as:
- Machine Translation
- Text Summarization
- Information Extraction
- Sentiment Analysis
Using cleaner, more accurate transcriptions can potentially lead to better performance in these subsequent tasks.
Out-of-Scope Use
The model is specifically fine-tuned for transcription error correction and is not intended for:
- General-purpose text generation or conversational AI.
- Tasks other than correcting errors in existing transcriptions.
- Reliable performance on languages not included in its training data (English, German, Spanish).
- Use on transcripts with extremely high error rates or very specialized domain jargon not encountered during training.
- Generating entirely new content or "hallucinating" information beyond plausible corrections of input text.
- Critical decision-making without human oversight, due to the possibility of incorrect corrections or missed errors.
Bias, Risks, and Limitations
- Bias: The model may inherit biases from its base model (gemma-3-12b-it) or the fine-tuning dataset. Performance might vary across different dialects, accents, or demographic groups if their specific speech patterns or common transcription error types were underrepresented in the training data. The training data distribution (55% English, 27.8% German, 17.2% Spanish) will also influence its proficiency across these languages.
- Risks:
- Over-correction: The model might alter correctly transcribed words, misinterpreting them as errors.
- Mis-correction: It could introduce new errors or change the intended meaning of the text.
- Failure to correct: Some errors, particularly subtle ones or those requiring deep, nuanced contextual understanding beyond its training, may be missed.
- Bias Amplification: If certain error types are more prevalent in transcriptions from specific groups, the model might disproportionately alter or "correct" text from those groups.
- Limitations:
- Performance is contingent on the quality of the input transcription and the similarity of its errors to those encountered during fine-tuning.
- The model was trained on data with a specific Word Error Rate (WER) distribution (see Training Data); performance on inputs with significantly different WER profiles may vary.
- Its understanding of "context" is constrained by its training data and the inherent capabilities of the underlying language model.
- It may struggle with highly ambiguous cases, complex sentence structures, or errors requiring real-world knowledge not encoded in its parameters.
Recommendations
- Users (both direct and downstream) should be thoroughly aware of the potential biases, risks, and limitations outlined above.
- Conduct comprehensive testing on a representative sample of your own data before deploying the model in any production or critical system.
- Implement a human review process for outputs, especially in applications where accuracy is paramount.
- Be mindful of the training data's language distribution when applying the model, as performance may differ between English, German, and Spanish.
- Consider the input WER; very noisy transcripts might lead to suboptimal results.
Training Details
Training Data
The model was fine-tuned for 1 epoch on a dataset of 56.9k rows. Each row consisted of an erroneous audio transcription paired with its correct transcription.
- Dataset Size: 56.9k training samples. An additional 14.2k samples were used for evaluation.
- Language Distribution (Training Data):
- English: ~55%
- German: ~27.8%
- Spanish: ~17.2%
- Word Error Rate (WER) Distribution in Training Data:
- 0.0 WER (no errors): 1.3%
- WER <= 0.09: 66.9%
- WER <= 0.18: 27.8%
- WER > 0.18: 4%
- Dataset Sample Snapshot: shown as an image in the card.
- Overall Data Distribution (language, WER, word count for the combined train+eval datasets): shown as an image in the card.
Training Procedure
The model was fine-tuned using the Unsloth library, leveraging LoRA for parameter-efficient fine-tuning. The training objective was to predict the corrected transcription given the erroneous one, guided by the instructional prompt.
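The exact LoRA configuration (rank, alpha, target modules) and trainer settings are not reported here, so the snippet below is only a rough sketch of what an Unsloth + TRL SFT setup of this kind typically looks like, with placeholder hyperparameters and a toy dataset.

```python
# Rough sketch of an Unsloth LoRA fine-tune; all hyperparameters are placeholders,
# not the values actually used to train this model.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank (placeholder)
    lora_alpha=16,   # placeholder
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Toy stand-in for the real 56.9k-row dataset of (erroneous, corrected) pairs.
PROMPT = ('You receive an audio transcription as input. These transcripts can contain '
          'errors. Identify the context of the translation and "repair" words that '
          "clearly don't make sense.\n---\n{src}")
pairs = [("I'd like a coffee, peas.", "I'd like a coffee, please.")]
rows = [{"text": tokenizer.apply_chat_template(
            [{"role": "user", "content": PROMPT.format(src=noisy)},
             {"role": "assistant", "content": fixed}],
            tokenize=False)}
        for noisy, fixed in pairs]
train_dataset = Dataset.from_list(rows)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="outputs",
        num_train_epochs=1,
        per_device_train_batch_size=4,   # placeholder
        bf16=True,
        dataset_text_field="text",
    ),
)
trainer.train()
```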
Training Hyperparameters
- Base Model: unsloth/gemma-3-12b-it-unsloth-bnb-4bit
- Epochs: 1
- Training regime: Fine-tuned using Unsloth with 4-bit quantization (via bitsandbytes, as per the base model) and mixed-precision training (e.g., bf16).
- Prompt format used (as described in the Uses section):

```text
You receive an audio transcription as input. These transcripts can contain errors. Identify the context of the translation and "repair" words that clearly don't make sense.
---
{INPUT}
```
Speeds, Sizes, Times
- Training Time: 130 minutes
- Hardware: 1x H100 GPU
Evaluation
Testing Data, Factors & Metrics
Testing Data
- An evaluation dataset of 14.2k rows was used. This dataset has a similar structure (erroneous vs. correct transcriptions) and language distribution as the training set.
Metrics
- Loss: The primary metric reported is eval_loss. This measures the cross-entropy loss on the evaluation dataset, indicating how well the model's predictions matched the target corrected transcriptions.
- Word Error Rate (WER): While eval_loss was reported, WER is the standard metric for this task. It would measure the percentage of words that are incorrectly predicted after correction (substitutions, deletions, insertions) compared to the ground truth. WER improvement (original WER vs. corrected WER) would be a key indicator (see the sketch after this list).
- BLEU/ROUGE scores: Could also be used as general text generation metrics, but WER is more specific and interpretable for transcription tasks.
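As an illustration of the WER comparison described above (not something reported in this card), before/after WER can be computed with an off-the-shelf package such as jiwer:

```python
# Illustrative WER check with the jiwer package (pip install jiwer).
import jiwer

reference    = "the weather tomorrow will be sunny"   # ground-truth transcript
asr_output   = "the whether tomorrow will be sunny"   # raw ASR hypothesis (hypothetical)
model_output = "the weather tomorrow will be sunny"   # after correction (hypothetical)

print("WER before correction:", jiwer.wer(reference, asr_output))
print("WER after correction: ", jiwer.wer(reference, model_output))
```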
Results
- Final Evaluation Loss: 0.038753 on the 14.2k-row evaluation dataset.
Summary
The model achieved a low evaluation loss of 0.038753, indicating effective learning on the task of correcting transcription errors based on the provided training and evaluation data. Qualitative examples (see Model Examination below) demonstrate its practical ability to identify and fix errors in text. Further evaluation using WER and analysis of performance across different factors would provide a more complete picture of its capabilities.
Model Examination
A qualitative examination of the model's output can be seen in the provided text-diff image. This image compares an original text, its erroneous transcription, and the transcription as corrected by this model, illustrating its error correction capabilities in a practical example.
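A comparable word-level diff can be reproduced locally, for example with Python's standard difflib module (a hypothetical illustration, not the tooling used to produce the image in the card):

```python
# Hypothetical sketch: show which words differ between the erroneous input
# and the corrected output, using difflib from the standard library.
import difflib

erroneous = "please send the involves to the customer by fried afternoon"
corrected = "please send the invoices to the customer by Friday afternoon"

for token in difflib.ndiff(erroneous.split(), corrected.split()):
    if token.startswith(("-", "+")):
        print(token)  # '-' = word from the erroneous input, '+' = word after correction
```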
Further examination could involve:
- Analysis of common error types the model successfully corrects.
- Identification of error types it struggles with.
- Comparison of performance on short vs. long utterances or simple vs. complex contexts.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 1x H100 GPU
- Hours used: 130 minutes (approximately 2.17 hours)
- Carbon Emitted: Not directly measured. It can be estimated as (power consumption of the H100 in kW * hours used) * grid carbon intensity (gCO2eq/kWh) * data-center PUE; an illustrative calculation with assumed values is sketched after this list.
- H100 TDP is up to 700W (0.7 kW).
- Energy consumed: ~0.7 kW * 2.17 h = ~1.519 kWh.
- Final carbon emissions estimate requires knowledge of the Cloud Provider, Compute Region (for PUE and grid carbon intensity).
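As a purely illustrative worked example, the assumed PUE and grid intensity below are placeholders, not reported values:

```python
# Illustrative only: assumed PUE and grid carbon intensity, not reported values.
energy_kwh = 0.7 * (130 / 60)   # H100 TDP (kW) * training time (h) ≈ 1.52 kWh
pue = 1.1                        # assumed data-center PUE
grid_intensity = 400             # assumed grid carbon intensity in gCO2eq/kWh
print(round(energy_kwh * pue * grid_intensity), "gCO2eq")  # ≈ 667 gCO2eq (~0.67 kg)
```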