lm_eval results are weird
I tried to evaluate the model on some benchmarks, but the scores are too low (essentially at chance level):
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|--------|------:|------|-----:|--------|-----:|---|-----:|
|arc_easy| 1|none | 0|acc |0.2647|± |0.0091|
| | |none | 0|acc_norm|0.2597|± |0.0090|

| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande| 1|none | 0|acc |0.5107|± | 0.014|
You should use few-shot evaluation.
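For example, here is a minimal sketch using the harness's Python API, assuming a recent lm-evaluation-harness (v0.4 or later) where `simple_evaluate` and the `hf` backend are available; the checkpoint path below is a placeholder:

```python
from lm_eval.evaluator import simple_evaluate

# Re-run the same tasks with in-context examples instead of zero-shot.
# "pretrained=./model" is a placeholder; point it at your checkpoint.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=./model",
    tasks=["arc_easy", "winogrande"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```

The few-shot count of 5 is just an example; MMLU is commonly reported 5-shot and ARC is often reported 25-shot.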
If few-shot is a must for arc_easy, I think the model is not trained well.
A simple script for testing the model:
```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the local checkpoint (current directory) with the slow tokenizer.
model_path = "."
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

text = "Question: Darryl learns that freezing temperatures may help cause weathering. Which statement explains how freezing temperatures most likely cause weathering?\nAnswer: by freezing the leaves on"

loss_func = CrossEntropyLoss(reduction="none")
inputs = tokenizer(text, return_tensors="pt")

# Shift by one: position t predicts token t+1.
labels = inputs["input_ids"][:, 1:]
with torch.no_grad():
    output = model(**inputs)
logits = output.logits[:, :-1]
print(logits.size())

# Per-token cross-entropy over the predicted positions.
loss = loss_func(logits.transpose(1, 2), labels)
# Note: this divides by the full input length, including the first token
# (which has no prediction), so it slightly underestimates the per-token loss.
num_tokens = inputs["input_ids"].size(1)
avg_loss = torch.sum(loss).item() / num_tokens
print(avg_loss)
```
The avg_loss value is 4.48, which is too high for a trained language model (it corresponds to a perplexity of roughly e^4.48 ≈ 88 on this simple English text).
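As a cross-check (a minimal sketch, not what lm_eval itself does), the same quantity can be read off the model's built-in loss by passing `labels`; Hugging Face causal LMs shift the labels internally and return the mean cross-entropy over the predicted tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same local checkpoint and prompt as the script above.
model_path = "."
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

text = "Question: Darryl learns that freezing temperatures may help cause weathering. Which statement explains how freezing temperatures most likely cause weathering?\nAnswer: by freezing the leaves on"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model compute the shifted cross-entropy itself.
    out = model(**inputs, labels=inputs["input_ids"])

print("mean loss:", out.loss.item())
print("perplexity:", torch.exp(out.loss).item())
```

If this built-in loss also lands around 4.5, the problem lies with the checkpoint rather than with the evaluation code.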
Fine, I also tested the model on MMLU, and its zero-shot and few-shot capabilities were almost non-existent: when generating an answer it only outputs the options 1, 2, 3, and 4. It's unclear how the posterior probabilities for options A, B, C, and D could have been trained to be so high.
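One way to probe this (a rough sketch, not lm_eval's exact MMLU scoring) is to compare the next-token log-probabilities the model assigns to the letter and number labels after an MMLU-style prompt; the question and formatting below are made up for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "."  # local checkpoint, as in the scripts above
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# Hypothetical MMLU-style prompt; the real harness formats questions differently.
prompt = (
    "Question: What is the chemical symbol for water?\n"
    "A. CO2\nB. H2O\nC. NaCl\nD. O2\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # distribution over the next token
log_probs = torch.log_softmax(logits, dim=-1)

for label in ["A", "B", "C", "D", "1", "2", "3", "4"]:
    # Score the first token of " X" so the leading space matches generation.
    tok = tokenizer(" " + label, add_special_tokens=False)["input_ids"][0]
    print(label, log_probs[tok].item())
```

If the number tokens consistently dominate the letter tokens here, the behaviour above reflects the model and its training data rather than a misconfiguration of the harness.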