lm_eval results are weird
I tried to evaluate the model on some benchmarks, but the scores are too low (essentially at chance level):
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|--------|------:|------|-----:|--------|-----:|---|-----:|
|arc_easy| 1|none | 0|acc |0.2647|± |0.0091|
| | |none | 0|acc_norm|0.2597|± |0.0090|

| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande| 1|none | 0|acc |0.5107|± | 0.014|
You should use few-shot evaluation.
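For example, here is a minimal sketch using the harness's Python API, assuming a recent lm-evaluation-harness (v0.4 or later) where `simple_evaluate` and the `hf` backend are available; the checkpoint path below is a placeholder:

```python
from lm_eval.evaluator import simple_evaluate

# Re-run the same tasks with in-context examples instead of zero-shot.
# "pretrained=./model" is a placeholder; point it at your checkpoint.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=./model",
    tasks=["arc_easy", "winogrande"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```

The few-shot count of 5 is just an example; MMLU is commonly reported 5-shot and ARC is often reported 25-shot.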
If few-shot is a must for arc_easy, I think the model is not trained well.
A simple script for testing the model:
```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the local checkpoint (current directory) with the slow tokenizer.
model_path = "."
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

text = "Question: Darryl learns that freezing temperatures may help cause weathering. Which statement explains how freezing temperatures most likely cause weathering?\nAnswer: by freezing the leaves on"

loss_func = CrossEntropyLoss(reduction="none")
inputs = tokenizer(text, return_tensors="pt")

# Shift by one: position t predicts token t+1.
labels = inputs["input_ids"][:, 1:]
with torch.no_grad():
    output = model(**inputs)
logits = output.logits[:, :-1]
print(logits.size())

# Per-token cross-entropy over the predicted positions.
loss = loss_func(logits.transpose(1, 2), labels)
# Note: this divides by the full input length, including the first token
# (which has no prediction), so it slightly underestimates the per-token loss.
num_tokens = inputs["input_ids"].size(1)
avg_loss = torch.sum(loss).item() / num_tokens
print(avg_loss)
```
The avg_loss value is 4.48, which is too high for a trained language model (it corresponds to a perplexity of roughly e^4.48 ≈ 88 on this simple English text).
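As a cross-check (a minimal sketch, not what lm_eval itself does), the same quantity can be read off the model's built-in loss by passing `labels`; Hugging Face causal LMs shift the labels internally and return the mean cross-entropy over the predicted tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same local checkpoint and prompt as the script above.
model_path = "."
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

text = "Question: Darryl learns that freezing temperatures may help cause weathering. Which statement explains how freezing temperatures most likely cause weathering?\nAnswer: by freezing the leaves on"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model compute the shifted cross-entropy itself.
    out = model(**inputs, labels=inputs["input_ids"])

print("mean loss:", out.loss.item())
print("perplexity:", torch.exp(out.loss).item())
```

If this built-in loss also lands around 4.5, the problem lies with the checkpoint rather than with the evaluation code.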
Fine, I also tested the model on MMLU, and its zero-shot and few-shot capabilities were almost non-existent: when generating an answer it only outputs the options 1, 2, 3, and 4. It's unclear how the posterior probabilities for options A, B, C, and D could have been trained to be so high.
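One way to probe this (a rough sketch, not lm_eval's exact MMLU scoring) is to compare the next-token log-probabilities the model assigns to the letter and number labels after an MMLU-style prompt; the question and formatting below are made up for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "."  # local checkpoint, as in the scripts above
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# Hypothetical MMLU-style prompt; the real harness formats questions differently.
prompt = (
    "Question: What is the chemical symbol for water?\n"
    "A. CO2\nB. H2O\nC. NaCl\nD. O2\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # distribution over the next token
log_probs = torch.log_softmax(logits, dim=-1)

for label in ["A", "B", "C", "D", "1", "2", "3", "4"]:
    # Score the first token of " X" so the leading space matches generation.
    tok = tokenizer(" " + label, add_special_tokens=False)["input_ids"][0]
    print(label, log_probs[tok].item())
```

If the number tokens consistently dominate the letter tokens here, the behaviour above reflects the model and its training data rather than a misconfiguration of the harness.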