Further Finetuning
Dear Team,
Thanks for sharing the model; it looks good. Could you help us with a fine-tuning script for this model, or some pointers? We need to fine-tune it for our industry domain, real estate.
Pointers alone would also be helpful.
We have written the following script. Kindly advise whether it will train the model well.
import os
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
import datasets  # needed for datasets.Audio below
from datasets import load_dataset

# Load the model and processor
model_name = "Oriserve/Whisper-Hindi2Hinglish-Prime"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
def preprocess_function(batch):
    audio = batch["audio"]
    # Compute log-Mel input features; [0] drops the batch dimension so each example is stored unbatched
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenize the transcript directly (as_target_processor is deprecated)
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# Load dataset
data_files = {"train": "path_to_train.jsonl", "validation": "path_to_validation.jsonl"}
dataset = load_dataset("json", data_files=data_files)
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
dataset = dataset.map(preprocess_function, remove_columns=dataset["train"].column_names, num_proc=4)
# Training arguments
output_dir = "./whisper_finetuned"
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    num_train_epochs=5,
    weight_decay=0.01,
    save_total_limit=2,
    save_strategy="epoch",
    predict_with_generate=True,
    fp16=True,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=100,
    report_to="tensorboard",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    ddp_find_unused_parameters=False
)
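# Define trainer
# The processor has no data_collator attribute, so a minimal collator is defined here.
def data_collator(features):
    # Pad audio features and label ids separately, since they have different lengths
    inputs = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features], return_tensors="pt"
    )
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    # Replace label padding with -100 so it is ignored by the loss
    inputs["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
    return inputs

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=processor.feature_extractor,
    data_collator=data_collator
)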
# Fine-tune the model
trainer.train()

# Save final model
trainer.save_model(output_dir)

# Save processor
processor.save_pretrained(output_dir)
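# Optional: log training stats
import pandas as pd
# TensorBoard event files are binary and cannot be read with read_csv;
# trainer.state.log_history holds the same logged metrics as a list of dicts.
logs = pd.DataFrame(trainer.state.log_history)
print("Training and evaluation losses per epoch:")
# Assumes both "loss" and "eval_loss" were logged at least once during training
print(logs.groupby("epoch")[["loss", "eval_loss"]].mean())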
Hi @Rajeev1908, thanks for reaching out to us. Your code looks good and should work for fine-tuning the model. You can also follow the steps below for better results:
- Ensure that the training data represents your use case, i.e. it should resemble the audio that your model will encounter at inference time
- On each evaluation run, calculate metrics such as WER (word error rate) to get a better understanding of the model's performance
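As a minimal sketch of the WER metric mentioned above: WER is the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of the original script. In practice a library metric such as the `wer` metric in Hugging Face `evaluate` is commonly used instead, and text normalization before scoring is left out here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five reference words
score = word_error_rate("the flat has two bedrooms", "the flat has to bedrooms")
print(f"WER: {score:.2f}")
```

A function like this could be called from a `compute_metrics` callback after decoding predictions and labels back to text, so that WER is reported at every evaluation run alongside the loss.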
Additionally, we provide custom-curated, robust ASR model APIs, which are much cheaper than other players in the market such as Deepgram and Azure. To know more, you can reach out to us at [email protected]
Hi Team,
Thanks for the response. We will reach out to the team over email for more details about the API.
Thanks
Kunal
Hi Team,
What length should the audio chunks be for this? We created chunks of 10-15 s, and the script gives an error with them. Do we need to concatenate these chunks into longer ones? Will that work?
Kindly advise on this.
Thanks
Rajeev
@Rajeev1908 The Whisper model works with 30 s audio clips. If your clips are shorter or longer than 30 s, try padding or trimming them accordingly.
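The padding and trimming described above could look roughly like the sketch below. It assumes 16 kHz mono audio held as a plain sequence of float samples; the helper names `pad_or_trim` and `split_into_windows` are hypothetical. As far as I know, the Whisper feature extractor in `transformers` already pads and truncates its input to 30 s by default, so 10-15 s chunks should not need to be concatenated; this sketch only makes the operation explicit.

```python
SAMPLE_RATE = 16_000   # Whisper expects 16 kHz audio
WINDOW_SECONDS = 30    # fixed input window of the model


def pad_or_trim(samples, sample_rate=SAMPLE_RATE, seconds=WINDOW_SECONDS):
    """Return exactly sample_rate * seconds samples: zero-pad short clips, cut long ones."""
    target = sample_rate * seconds
    samples = list(samples)
    if len(samples) >= target:
        return samples[:target]
    # Silence (zeros) is appended at the end of the clip
    return samples + [0.0] * (target - len(samples))


def split_into_windows(samples, sample_rate=SAMPLE_RATE, seconds=WINDOW_SECONDS):
    """Split a long recording into consecutive windows; the last one is zero-padded."""
    target = sample_rate * seconds
    samples = list(samples)
    return [
        pad_or_trim(samples[start:start + target], sample_rate, seconds)
        for start in range(0, len(samples), target)
    ]


# Hypothetical 12 s clip of silence, padded up to 30 s
clip = [0.0] * (12 * SAMPLE_RATE)
padded = pad_or_trim(clip)
print(len(padded) / SAMPLE_RATE)
```

Note that cutting a long recording at fixed 30 s boundaries can split a word in half, and each window would need its own matching transcript, so segmenting at pauses is usually the safer choice for training data.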