LoRA Fine-Tuning Parameters and VQA-RAD Evaluation Results


Hi!

I'm currently fine-tuning a model using LoRA (Low-Rank Adaptation) and experimenting with different training parameters. I'm curious what values have been used for lora_rank, the learning rates (LoRA and mm_projector), the number of training epochs, etc.

Also, when evaluating the model on the VQA-RAD test dataset, I obtained the following results:
```json
{
  "exact_match_score": 25.37,
  "f1_score": 24.96,
  "precision": 25.37,
  "recall": 28.31,
  "bleu_score": 1.69e-77,
  "bleu_score_1": 21.93,
  "bleu_score_2": 4.91,
  "bleu_score_3": 1.26,
  "open_accuracy": 18.99,
  "yes_no_accuracy": 72.79,
  "recall_closed": 72.79
}
```

Is this result reasonable for a LoRA fine-tuned model on this downstream dataset, or could I have evaluated it incorrectly?
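
In case it helps pinpoint an evaluation issue, here is a rough sketch of how metrics like exact match and yes/no accuracy can be computed. It's simplified and only illustrative; the normalization and the yes/no matching here are my own assumptions, not the repo's official evaluation code:

```python
# Illustrative sketch only: simplified exact-match and yes/no accuracy,
# assuming predictions and references are plain answer strings.
def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def exact_match_score(predictions, references):
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def yes_no_accuracy(predictions, references):
    # Closed questions only: reference answers are "yes" or "no".
    closed = [(p, r) for p, r in zip(predictions, references)
              if normalize(r) in ("yes", "no")]
    hits = sum(normalize(r) in normalize(p) for p, r in closed)
    return 100.0 * hits / len(closed)
```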


Hey!
For this model, I used LoRA Rank = 8, LoRA Alpha = 9, and a learning rate of 2e-5. The batch size was 1 per device, with gradient accumulation steps = 1 and a warmup ratio of 0.03. I trained for 5 epochs using bf16 precision and paged_adamw_8bit as the optimizer.
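
In PEFT/Transformers terms, that setup corresponds roughly to the sketch below; the target modules and output directory are placeholders, not the exact training script I used:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Rough sketch of the hyperparameters above; target_modules is a placeholder
# and depends on the base model's layer names.
lora_config = LoraConfig(
    r=8,                                   # LoRA rank
    lora_alpha=9,                          # LoRA alpha
    target_modules=["q_proj", "v_proj"],   # placeholder, model-dependent
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./lora-vqa-rad",           # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    warmup_ratio=0.03,
    num_train_epochs=5,
    bf16=True,
    optim="paged_adamw_8bit",              # requires bitsandbytes
)
```

Since alpha is close to the rank, the LoRA updates are scaled by roughly alpha/r ≈ 1.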

Looking at the VQA-RAD evaluation results, the yes/no accuracy looks good, but other metrics, like exact match and F1 score, are quite low. The open-ended accuracy (~18.99%) suggests the model struggles with open-ended questions, likely because it was mostly trained on yes/no questions. The low BLEU scores also point to issues with text generation.

To improve performance, you could try increasing the LoRA rank (e.g., 16/32/64) and training for more epochs. Adjusting the learning rate might also help, but be careful not to overfit. Adding more open-ended questions to your dataset could also reduce bias toward yes/no answers.
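
If it helps, a small sweep along those lines could look like the sketch below. The values are just examples, and scaling alpha with the rank is a common heuristic rather than something I tuned for this model:

```python
from peft import LoraConfig

# Example sweep over higher LoRA ranks; alpha is scaled with the rank
# as a common heuristic (not tuned for this model).
for rank in (16, 32, 64):
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        target_modules=["q_proj", "v_proj"],  # placeholder, model-dependent
        task_type="CAUSAL_LM",
    )
    # ...build the PEFT model and Trainer with this config, then train and evaluate.
```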
Hope this helps!

Thanks for your response! Actually, the results pasted above are from running inference with the model you uploaded to this repo.
If you've had a chance to evaluate on VQA-RAD yourself, were my results similar to yours?
Whenever I fine-tune on VQA-RAD, I end up with a heavily overfitted model that reaches 100% on the train set but struggles on the test set.

Yes, I got similar results. Due to limited resources and time, I wasn't able to experiment much with hyperparameter tuning to improve performance. Are you fine-tuning LLaVA-Med or this specific model?

I tried to fine-tune LLaVA-Med-v1.5 but didn't get the expected results, so I was just wondering if anyone has figured out a way to make it work!
