LoRA Fine-Tuning Parameters and VQA-RAD Evaluation Results


Hi!

I'm currently fine-tuning a model using LoRA (Low-Rank Adaptation) and experimenting with different training parameters. I'm curious what values have been used for lora_rank, the learning rates (LoRA and mm_projector), the number of training epochs, etc.

Also, when evaluating the model on the VQA-RAD test dataset, I obtained the following results:
```json
{
  "exact_match_score": 25.37,
  "f1_score": 24.96,
  "precision": 25.37,
  "recall": 28.31,
  "bleu_score": 1.69e-77,
  "bleu_score_1": 21.93,
  "bleu_score_2": 4.91,
  "bleu_score_3": 1.26,
  "open_accuracy": 18.99,
  "yes_no_accuracy": 72.79,
  "recall_closed": 72.79
}
```

Is this result reasonable for a LoRA fine-tuned model on this downstream dataset, or could I have evaluated it incorrectly?
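
In case it helps pinpoint an evaluation issue, here is a rough sketch of how metrics like exact match and yes/no accuracy can be computed. It's simplified and only illustrative; the normalization and the yes/no matching here are my own assumptions, not the repo's official evaluation code:

```python
# Illustrative sketch only: simplified exact-match and yes/no accuracy,
# assuming predictions and references are plain answer strings.
def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def exact_match_score(predictions, references):
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def yes_no_accuracy(predictions, references):
    # Closed questions only: reference answers are "yes" or "no".
    closed = [(p, r) for p, r in zip(predictions, references)
              if normalize(r) in ("yes", "no")]
    hits = sum(normalize(r) in normalize(p) for p, r in closed)
    return 100.0 * hits / len(closed)
```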


Hey!
For this model, I used LoRA Rank = 8, LoRA Alpha = 9, and a learning rate of 2e-5. The batch size was 1 per device, with gradient accumulation steps = 1 and a warmup ratio of 0.03. I trained for 5 epochs using bf16 precision and paged_adamw_8bit as the optimizer.
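
In PEFT/Transformers terms, that setup corresponds roughly to the sketch below; the target modules and output directory are placeholders, not the exact training script I used:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Rough sketch of the hyperparameters above; target_modules is a placeholder
# and depends on the base model's layer names.
lora_config = LoraConfig(
    r=8,                                   # LoRA rank
    lora_alpha=9,                          # LoRA alpha
    target_modules=["q_proj", "v_proj"],   # placeholder, model-dependent
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./lora-vqa-rad",           # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    warmup_ratio=0.03,
    num_train_epochs=5,
    bf16=True,
    optim="paged_adamw_8bit",              # requires bitsandbytes
)
```

Since alpha is close to the rank, the LoRA updates are scaled by roughly alpha/r ≈ 1.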

Looking at the VQA-RAD evaluation results, the yes/no accuracy looks good, but other metrics, like exact match and F1 score, are quite low. The open-ended accuracy (~18.99%) suggests the model struggles with open-ended questions, likely because it was mostly trained on yes/no questions. The low BLEU scores also point to issues with text generation.

To improve performance, you could try increasing the LoRA rank (e.g., 16/32/64) and training for more epochs. Adjusting the learning rate might also help, but be careful not to overfit. Adding more open-ended questions to your dataset could also reduce bias toward yes/no answers.
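
If it helps, a small sweep along those lines could look like the sketch below. The values are just examples, and scaling alpha with the rank is a common heuristic rather than something I tuned for this model:

```python
from peft import LoraConfig

# Example sweep over higher LoRA ranks; alpha is scaled with the rank
# as a common heuristic (not tuned for this model).
for rank in (16, 32, 64):
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        target_modules=["q_proj", "v_proj"],  # placeholder, model-dependent
        task_type="CAUSAL_LM",
    )
    # ...build the PEFT model and Trainer with this config, then train and evaluate.
```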
Hope this helps!

Thanks for your response! Actually, the results pasted above are from running inference with the model you uploaded to this repo.
If you've had a chance to evaluate on VQA-RAD yourself, were my results similar to yours?
Whenever I fine-tune on VQA-RAD, I end up with a heavily overfitted model that reaches 100% on the train set but struggles on the test set.

Yes, I got similar results. Due to limited resources and time, I wasn't able to experiment much with hyperparameter tuning to improve performance. Are you fine-tuning LLaVA-Med or this specific model?

I tried to fine-tune LLaVA-Med-v1.5 but didn't get the expected results, so I was just wondering if anyone has figured out a way to make it work!
