Worries that I have with this implementation.
Some worries:
1. Am I testing things correctly in eval.py, following the template format? I only need confirmation for hellaswag and a refactor of winogrande as of now; I'm fairly certain everything else is correct.
2. Am I choosing the correct splits in detect-pretrain-code-contamination/src/run.py? The hierarchy I use is: test > val > train (as in: if a test split exists, I go with that, then validation, then train by default). There's a minimal sketch of this at the end of this post.
3. I decided to go with winogrande_debiased instead of winogrande_l arbitrarily (this is in detect-pretrain-code-contamination/src/eval.py). I'm not sure which one the Open LLM Leaderboard uses, or what the standard is.
4. I'm unsure why in detect-pretrain-code-contamination/src/eval.py we append the output at the end of the input.
5. Currently I'm using huggyllama/llama-7b as ref_model; should I switch to llama2-7B? Maybe Mistral-7B? (This is the same model used by the implementation in https://github.com/swj0419/detect-pretrain-code-contamination/tree/master.)
Upon receiving feedback, I opted to test fine-tuned models using their base models as ref, while base models are tested with llama2-7b.
I'd love it if I could receive some help on these issues!
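For point 2, here's a minimal sketch of the split-selection logic I mean (illustrative only; the real code is in detect-pretrain-code-contamination/src/run.py, and pick_split is just a name I'm using for this example):

```python
from datasets import load_dataset

def pick_split(dataset_name, config=None):
    """Pick the evaluation split following the hierarchy test > validation > train."""
    ds = load_dataset(dataset_name, config)
    for split in ("test", "validation", "train"):
        if split in ds:
            return ds[split]
    # Fall back to whatever split the dataset exposes first.
    return ds[list(ds.keys())[0]]

# e.g. data = pick_split("winogrande", "winogrande_debiased")
```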
Another one to add to the list:
6. I had to undersample MMLU and GSM8K due to memory constraints. The script normally generates 100 samples (num_z in run.py's sample generation function); for these two tests it generates 50, and max_new_tokens was also reduced by half.
Update: max_new_tokens now stays the same for every benchmark, and num_z stays at 50 for every benchmark. I ran some internal tests and found that the difference between num_z=50 and num_z=100 is ±0.04 at worst. I can run some models at higher accuracy upon request.
This evaluation is therefore less accurate on those tests, so scores there warrant a double-check on the user's end. If any accuracy issues arise, I'll tweak num_z again.
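For context, here's roughly where those two knobs live (an illustrative transformers-style sketch, not the actual sample generation code in run.py; generate_neighbors is just a name for this example):

```python
# Illustrative only: how num_z and max_new_tokens enter neighbor generation.
def generate_neighbors(model, tokenizer, prompt, num_z=50, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,               # sample diverse neighbors
        num_return_sequences=num_z,   # 100 originally; 50 here to fit in memory
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```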
Thank you! It bothered me that we could not tell whether the models occupying the top spots on the leaderboard were contaminated or not; when I became aware that we could test for that, I jumped on the opportunity.
Hi @Yeyito thanks for putting this together! Is it still under active development? Would like to know more about any issues that need help.
@czhu2 Nope, this is no longer under active development due to the compute costs that are necessary for this to run.
My goal with this was to test several models for contamination according to the paper cited. I wouldn't mind letting this run indefinitely, but the compute costs are prohibitively expensive for me.
Thank you for your concern though!
If you need help getting something similar set up, I wouldn't mind giving a hand.
Thanks @Yeyito - I was mainly curious about some things in the code and am wondering if you had any insight:
- "If #the result < 0.1# with a percentage greater than 0.85, it is highly likely that the dataset has been trained." I don't see any mention of these threshold/numbers in the paper, do you know how it's determined?
- It seems the quantity computed here is some other quantity, and not quite the min-k%-prob according to the formula given in the paper.
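For reference, my reading of the paper's min-k%-prob is roughly the following (a sketch based on the formula in the paper, not the Space's code):

```python
import numpy as np

def min_k_percent_prob(token_log_probs, k=0.2):
    """Average log-likelihood of the k% lowest-probability tokens (Min-K% Prob, Shi et al.)."""
    log_probs = np.asarray(token_log_probs, dtype=float)
    n = max(1, int(len(log_probs) * k))
    lowest = np.sort(log_probs)[:n]  # the k% least likely tokens under the model
    return float(lowest.mean())
```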
@czhu2
That threshold comes straight out of this repo, which is the implementation provided by Weijia Shi, one of the authors of the paper.
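For what it's worth, the way I read that sentence from the README is as a per-dataset decision rule along these lines (my interpretation only, not code from the repo; the names are made up for this sketch):

```python
def likely_contaminated(per_example_results, score_threshold=0.1, fraction_threshold=0.85):
    """Flag a dataset when more than 85% of per-example results fall below 0.1."""
    below = sum(r < score_threshold for r in per_example_results)
    return below / len(per_example_results) > fraction_threshold
```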
I didn't thoroughly check the code; I only tested it empirically on models known to be contaminated and on otherwise clean models, and it was accurate for TruthfulQA, MMLU, ARC and Hellaswag.
Regarding your second question: I refactored Weijia Shi's code to separate the ref and test models' computations, so that I can cache the ref model's values (which don't depend on the test model) and reuse them for all subsequent tests, halving the compute cost; there's a rough sketch of the caching below. Everything should be functionally equivalent to the code in the GitHub repo, so you should check that out instead.
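Roughly, the caching part of that refactor looks like this (a simplified sketch, not the actual code; compute_fn stands in for whatever function scores a dataset with a given model):

```python
import os, json

def ref_scores(ref_model_name, dataset_name, compute_fn, cache_dir="ref_cache"):
    """Compute the ref model's per-example values once per dataset, then reuse them."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{ref_model_name.replace('/', '_')}__{dataset_name}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # reuse cached ref-model values for every test model
    scores = compute_fn(ref_model_name, dataset_name)  # expensive: run the ref model once
    with open(path, "w") as f:
        json.dump(scores, f)
    return scores
```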
Hope this helps!