Model Card for alecccdd/transcription-fixer-gemma-3-12b
This model is a fine-tuned version of unsloth/gemma-3-12b-it-unsloth-bnb-4bit
designed to correct errors in audio transcriptions. It processes input transcriptions in English, German, or Spanish, identifies contextual errors, and outputs a "repaired" version.
Model Details
Model Description
This model takes potentially erroneous audio transcriptions as input, identifies the context, and "repairs" words or phrases that are likely transcription errors. It aims to improve the accuracy and readability of automated speech recognition (ASR) outputs. The model was fine-tuned on a dataset comprising 56.9k examples of transcription errors and their corrected versions across English, German, and Spanish, using a specific instructional prompt.
- Developed by: alecccdd
- Model type: Text-to-text generation, fine-tuned for error correction
- Language(s) (NLP): English (en), German (de), Spanish (es)
- License: apache-2.0
- Finetuned from model: unsloth/gemma-3-12b-it-unsloth-bnb-4bit
Uses
Direct Use
This model is intended for direct use in correcting transcription errors in English, German, and Spanish. Users provide an erroneous transcription, and the model outputs a corrected version.
Recommended Prompt Format:
You receive an audio transcription as input. These transcripts can contain errors. Identify the context of the translation and "repair" words that clearly don't make sense.
---
{INPUT_TRANSCRIPTION}
Where {INPUT_TRANSCRIPTION} is the text needing correction. The model will generate the corrected version of this input.
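Below is a minimal inference sketch using the Hugging Face transformers chat pipeline; the generation settings and the sample noisy sentence are illustrative assumptions, and loading the 4-bit checkpoint may additionally require bitsandbytes and a transformers release with Gemma 3 support.

```python
# Illustrative sketch (not from the original card): correcting a noisy transcript
# with the transformers text-generation pipeline and the recommended prompt format.
from transformers import pipeline

MODEL_ID = "alecccdd/transcription-fixer-gemma-3-12b"

PROMPT = (
    "You receive an audio transcription as input. These transcripts can contain errors. "
    'Identify the context of the translation and "repair" words that clearly don\'t make sense.\n'
    "---\n"
    "{transcription}"
)

generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

noisy = "I would like to book a fright to Berlin next Thursday."  # hypothetical ASR output
messages = [{"role": "user", "content": PROMPT.format(transcription=noisy)}]

result = generator(messages, max_new_tokens=256, do_sample=False)
# The pipeline returns the full chat; the last message holds the model's correction.
print(result[0]["generated_text"][-1]["content"])
```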
Example Use Cases:
- Post-processing outputs from Automated Speech Recognition (ASR) systems.
- Improving the quality of transcribed data for training other NLP models.
- Enhancing readability of raw transcripts for human review.
Downstream Use
The corrected transcriptions generated by this model can serve as improved input for various downstream NLP tasks, such as:
- Machine Translation
- Text Summarization
- Information Extraction
- Sentiment Analysis
Using cleaner, more accurate transcriptions can potentially lead to better performance in these subsequent tasks.
Out-of-Scope Use
The model is specifically fine-tuned for transcription error correction and is not intended for:
- General-purpose text generation or conversational AI.
- Tasks other than correcting errors in existing transcriptions.
- Reliable performance on languages not included in its training data (English, German, Spanish).
- Use on transcripts with extremely high error rates or very specialized domain jargon not encountered during training.
- Generating entirely new content or "hallucinating" information beyond plausible corrections of input text.
- Critical decision-making without human oversight, due to the possibility of incorrect corrections or missed errors.
Bias, Risks, and Limitations
- Bias: The model may inherit biases from its base model (gemma-3-12b-it) or the fine-tuning dataset. Performance might vary across different dialects, accents, or demographic groups if their specific speech patterns or common transcription error types were underrepresented in the training data. The training data distribution (55% English, 27.8% German, 17.2% Spanish) will also influence its proficiency across these languages.
- Risks:
- Over-correction: The model might alter correctly transcribed words, misinterpreting them as errors.
- Mis-correction: It could introduce new errors or change the intended meaning of the text.
- Failure to correct: Some errors, particularly subtle ones or those requiring deep, nuanced contextual understanding beyond its training, may be missed.
- Bias Amplification: If certain error types are more prevalent in transcriptions from specific groups, the model might disproportionately alter or "correct" text from those groups.
- Limitations:
- Performance is contingent on the quality of the input transcription and the similarity of its errors to those encountered during fine-tuning.
- The model was trained on data with a specific Word Error Rate (WER) distribution (see Training Data); performance on inputs with significantly different WER profiles may vary.
- Its understanding of "context" is constrained by its training data and the inherent capabilities of the underlying language model.
- It may struggle with highly ambiguous cases, complex sentence structures, or errors requiring real-world knowledge not encoded in its parameters.
Recommendations
- Users (both direct and downstream) should be thoroughly aware of the potential biases, risks, and limitations outlined above.
- Conduct comprehensive testing on a representative sample of your own data before deploying the model in any production or critical system.
- Implement a human review process for outputs, especially in applications where accuracy is paramount.
- Be mindful of the training data's language distribution when applying the model, as performance may differ between English, German, and Spanish.
- Consider the input WER; very noisy transcripts might lead to suboptimal results.
Training Details
Training Data
The model was fine-tuned for 1 epoch on a dataset of 56.9k rows. Each row consisted of an erroneous audio transcription paired with its correct transcription.
- Dataset Size: 56.9k training samples. An additional 14.2k samples were used for evaluation.
- Language Distribution (Training Data):
- English: ~55%
- German: ~27.8%
- Spanish: ~17.2%
- Word Error Rate (WER) Distribution in Training Data:
- 0.0 WER (no errors): 1.3%
- WER <= 0.09: 66.9%
- WER <= 0.18: 27.8%
- WER > 0.18: 4%
- Dataset Sample Snapshot: shown as an image in the card.
- Overall Data Distribution (language, WER, word count for the combined train+eval datasets): shown as an image in the card.
Training Procedure
The model was fine-tuned using the Unsloth library, leveraging LoRA for parameter-efficient fine-tuning. The training objective was to predict the corrected transcription given the erroneous one, guided by the instructional prompt.
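The exact LoRA configuration (rank, alpha, target modules) and trainer settings are not reported here, so the snippet below is only a rough sketch of what an Unsloth + TRL SFT setup of this kind typically looks like, with placeholder hyperparameters and a toy dataset.

```python
# Rough sketch of an Unsloth LoRA fine-tune; all hyperparameters are placeholders,
# not the values actually used to train this model.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank (placeholder)
    lora_alpha=16,   # placeholder
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Toy stand-in for the real 56.9k-row dataset of (erroneous, corrected) pairs.
PROMPT = ('You receive an audio transcription as input. These transcripts can contain '
          'errors. Identify the context of the translation and "repair" words that '
          "clearly don't make sense.\n---\n{src}")
pairs = [("I'd like a coffee, peas.", "I'd like a coffee, please.")]
rows = [{"text": tokenizer.apply_chat_template(
            [{"role": "user", "content": PROMPT.format(src=noisy)},
             {"role": "assistant", "content": fixed}],
            tokenize=False)}
        for noisy, fixed in pairs]
train_dataset = Dataset.from_list(rows)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="outputs",
        num_train_epochs=1,
        per_device_train_batch_size=4,   # placeholder
        bf16=True,
        dataset_text_field="text",
    ),
)
trainer.train()
```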
Training Hyperparameters
- Base Model: unsloth/gemma-3-12b-it-unsloth-bnb-4bit
- Epochs: 1
- Training regime: Fine-tuned using Unsloth with 4-bit quantization (via bitsandbytes, as per the base model) and mixed-precision training (e.g., bf16).
- Prompt format used (as described in the Uses section):

```text
You receive an audio transcription as input. These transcripts can contain errors. Identify the context of the translation and "repair" words that clearly don't make sense.
---
{INPUT}
```
Speeds, Sizes, Times
- Training Time: 130 minutes
- Hardware: 1x H100 GPU
Evaluation
Testing Data, Factors & Metrics
Testing Data
- An evaluation dataset of 14.2k rows was used. This dataset has a similar structure (erroneous vs. correct transcriptions) and language distribution as the training set.
Metrics
- Loss: The primary metric reported is eval_loss. This measures the cross-entropy loss on the evaluation dataset, indicating how well the model's predictions matched the target corrected transcriptions.
- Word Error Rate (WER): While eval_loss was reported, WER is the standard metric for this task. It would measure the percentage of words that are incorrectly predicted after correction (substitutions, deletions, insertions) compared to the ground truth. WER improvement (original WER vs. corrected WER) would be a key indicator (see the sketch after this list).
- BLEU/ROUGE scores: Could also be used as general text generation metrics, but WER is more specific and interpretable for transcription tasks.
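As an illustration of the WER comparison described above (not something reported in this card), before/after WER can be computed with an off-the-shelf package such as jiwer:

```python
# Illustrative WER check with the jiwer package (pip install jiwer).
import jiwer

reference    = "the weather tomorrow will be sunny"   # ground-truth transcript
asr_output   = "the whether tomorrow will be sunny"   # raw ASR hypothesis (hypothetical)
model_output = "the weather tomorrow will be sunny"   # after correction (hypothetical)

print("WER before correction:", jiwer.wer(reference, asr_output))
print("WER after correction: ", jiwer.wer(reference, model_output))
```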
Results
- Final Evaluation Loss: 0.038753 on the 14.2k-row evaluation dataset.
Summary
The model achieved a low evaluation loss of 0.038753, indicating effective learning on the task of correcting transcription errors based on the provided training and evaluation data. Qualitative examples (see Model Examination below) demonstrate its practical ability to identify and fix errors in text. Further evaluation using WER and analysis of performance across different factors would provide a more complete picture of its capabilities.
Model Examination
A qualitative examination of the model's output can be seen in the provided text-diff image. This image compares an original text, its erroneous transcription, and the transcription as corrected by this model, illustrating its error correction capabilities in a practical example.
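A comparable word-level diff can be reproduced locally, for example with Python's standard difflib module (a hypothetical illustration, not the tooling used to produce the image in the card):

```python
# Hypothetical sketch: show which words differ between the erroneous input
# and the corrected output, using difflib from the standard library.
import difflib

erroneous = "please send the involves to the customer by fried afternoon"
corrected = "please send the invoices to the customer by Friday afternoon"

for token in difflib.ndiff(erroneous.split(), corrected.split()):
    if token.startswith(("-", "+")):
        print(token)  # '-' = word from the erroneous input, '+' = word after correction
```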
Further examination could involve:
- Analysis of common error types the model successfully corrects.
- Identification of error types it struggles with.
- Comparison of performance on short vs. long utterances or simple vs. complex contexts.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 1x H100 GPU
- Hours used: 130 minutes (approximately 2.17 hours)
- Carbon Emitted: Not directly measured. It can be estimated as (power consumption of the H100 in kW * hours used) * grid carbon intensity (gCO2eq/kWh) * data-center PUE; an illustrative calculation with assumed values is sketched after this list.
- H100 TDP is up to 700W (0.7 kW).
- Energy consumed: ~0.7 kW * 2.17 h = ~1.519 kWh.
- Final carbon emissions estimate requires knowledge of the Cloud Provider, Compute Region (for PUE and grid carbon intensity).
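As a purely illustrative worked example, the assumed PUE and grid intensity below are placeholders, not reported values:

```python
# Illustrative only: assumed PUE and grid carbon intensity, not reported values.
energy_kwh = 0.7 * (130 / 60)   # H100 TDP (kW) * training time (h) ≈ 1.52 kWh
pue = 1.1                        # assumed data-center PUE
grid_intensity = 400             # assumed grid carbon intensity in gCO2eq/kWh
print(round(energy_kwh * pue * grid_intensity), "gCO2eq")  # ≈ 667 gCO2eq (~0.67 kg)
```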