🚩 Report
Reflection 70B benchmarks are not real
The whole drama is described here:
https://x.com/shinboson/status/1832933753837982024
Matt Schumer is a fraud.
The claim is that requests sent to "Reflection 70B" through his hosted API were being routed to a model other than the one whose weights are hosted in this HF repo. The fact that, running the weights locally, you're unable to reproduce any of the responses other users saw from the hosted API is further evidence that what was benchmarked is not what is in this repo.
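For anyone who wants to check this themselves, here is a minimal sketch of the comparison being described: greedy-decode the same prompt against the repo weights and against the hosted endpoint, then diff the outputs. The endpoint URL and served model name below are placeholders, since the real API details were never published.

```python
# Sketch of the reproduction test: send the same greedy-decoded prompt to
# the open weights and to the hosted API, then compare the answers.
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "What is 9.11 minus 9.9?"

# 1) The open weights from this repo, run locally with greedy decoding.
tok = AutoTokenizer.from_pretrained("mattshumer/ref_70_e3")
model = AutoModelForCausalLM.from_pretrained(
    "mattshumer/ref_70_e3", torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tok.apply_chat_template(
    [{"role": "user", "content": PROMPT}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(inputs, max_new_tokens=256, do_sample=False)
local_answer = tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

# 2) The hosted "Reflection 70B" API (placeholder OpenAI-compatible endpoint).
api_answer = requests.post(
    "https://example-hosted-api/v1/chat/completions",  # placeholder URL
    json={
        "model": "reflection-70b",  # placeholder served-model name
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 0,
    },
).json()["choices"][0]["message"]["content"]

# At temperature 0 the two should be near-identical if the API really
# serves these weights; persistent divergence is what the thread alleges.
print("local:", local_answer)
print("api:  ", api_answer)
```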
@nisten You are missing the point here. The model uploaded here is not the same as the model behind the API that others accessed (e.g. Artificial Analysis). You will not be able to reproduce the issues with the open-weight model. Nothing is settled until Matt himself uploads the model, the evals, and the exact prompts used, so they can be tested against his open-weight model (not a private one).
Right, so TL;DR: Matt Schumer is a fraud using the media as pawns to promote his company. TikTok Matt.
Let them upload it to GitHub Models if it's so good. Let's keep this space clear of scams, please.
Why don't you reply to my posts first before calling out others as FRAUD ACCOUNTS here?
https://huggingface.co/mattshumer/ref_70_e3/discussions/10#66df65d3fba5a55441a421ba
https://huggingface.co/mattshumer/ref_70_e3/discussions/5#66defbe383b31d8cf891724b
https://huggingface.co/mattshumer/ref_70_e3/discussions/7#66dee423cccbad2a02574834
https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B/discussions/49#66de9369becd5c1c0c43a0cc
https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B/discussions/42#66dd83fe5af9428df3dd7dfd
@nisten, reply to this before you claim anything.
Can you read the titles of these posts? They're talking about the official "Reflection 70B" APIs. Did you test against those APIs?
I see you're posting results related to this Twitter thread.
Do you really understand what he is trying to prove? He's trying to prove that the LLM behind the "Reflection 70B" API uses the same tokenizer as Claude 3, GPT-4o, or whatever. The images he posted support that point.
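To make the tokenizer-fingerprinting argument concrete, here is an illustrative sketch; the probe string and the Llama base-model ID are assumptions, not the exact ones from the thread. Different tokenizers split the same text differently, so tasks that are sensitive to token boundaries leak which tokenizer, and therefore roughly which model family, is actually answering an API.

```python
# Different models split the same string into different tokens; behavior
# on token-boundary-sensitive tasks leaks which tokenizer an API uses.
import tiktoken
from transformers import AutoTokenizer

probe = "SolidGoldMagikarp davidjl"  # strings with odd GPT tokenizations

gpt4o = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

print("GPT-4o pieces:", [gpt4o.decode([t]) for t in gpt4o.encode(probe)])
print("Llama pieces: ", llama.tokenize(probe))
# If an API's quirks line up with one of these tokenizations and not the
# other, that is evidence about which model actually answers requests.
```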
What are you trying to prove here by posting this image? As far as I can tell, you're proving that what they uploaded here and what they host behind the API are totally different. You should explain in detail what you want to prove.
Also, I see you're using local models, so you're testing a different model from the one these posts make claims about. A natural question: can you reproduce the evaluation results @mattshumer provided? Why not post your independent evaluation results here to help everyone decide whether the claims are genuine or overblown?
This is a local model.
You are coming from r/LocalLLaMA to complain about a model which you're NOT running locally.
Please, RUN IT LOCALLY, then post screenshots of WHAT YOU LOCALLY RAN!
COMPRENDE, CAPISCI, KUPTON?
Can you fix the chat_template, HERE, not on Reddit, not on uncle Elon's Twitter, but HERE, and then run it BEFORE yapping?
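For what it's worth, "fix the chat_template" just means overriding the template on the tokenizer, checking the rendered prompt, and then committing the same string to tokenizer_config.json in this repo. A minimal sketch, assuming the stock Llama 3.1 header format (abbreviated here; not an official fix):

```python
# Override the chat template in code and inspect the rendered prompt
# before committing the fix to tokenizer_config.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mattshumer/ref_70_e3")
tok.chat_template = (
    "{{ bos_token }}"
    "{% for m in messages %}"
    "<|start_header_id|>{{ m['role'] }}<|end_header_id|>\n\n"
    "{{ m['content'] }}<|eot_id|>"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "{% endif %}"
)
# Render a prompt without tokenizing to eyeball the template's output.
print(tok.apply_chat_template(
    [{"role": "user", "content": "hello"}],
    tokenize=False,
    add_generation_prompt=True,
))
```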
Before you report your independent evaluation results, please disclose whether you and @mattshumer have a conflict of interest, in particular relationships like being friends, partners, or otherwise knowing each other.
No, I don't actually have one for real, but I think we all need a new open-source license that's Apache for everyone except Reddit users.
So go back to r/LocalLLaMA and tell them that Enigrand's yapping has inspired Nisten to make an open-source license that bans Reddit users.
"No I don't actually have for real but I think we all need a new opensource license that's apache for everyone except reddit users."
Quoting from your Twitter post:
I don't know what he's on about with torrents, he hasn't slept in 4 days.
Checkpoint 3 is working fine as far as I tested, albeit not great (it goes in loops), but IT WAS PASSING MOST OF THE TESTS y'all claimed it didn't.
OK, please just tell me what prompts to try.
EXPLAIN WHO HE IS or CHANGE YOUR DISCLOSURE. Also, please don't delete your Twitter posts.
Here are the evaluation results from Kristoph on Twitter.
These are the final notes from my work on the Reflection model. I tested the latest version of the model hosted by @hyperbolic_labs. I attempted a variety of strategies, including variations in temperature and system prompt; ultimately these had only a modest impact on the results. The final numbers I am presenting here use the prompt the Reflection team recommended. I did have to modify the question format somewhat to ensure Reflection properly generated the response (the instruction to output a letter choice was moved to the end of the prompt).
The TL;DR is that on virtually every benchmark, the Reflection model was on par with the Llama 3.1 70B it is based on.
I ultimately ran through the entire MMLU-Pro corpus for biology, chemistry, physics, engineering, health, law, philosophy, and math, all 0-shot. In all but one case, Reflection was within 1-2% of Llama 3.1 70B 0-shot and 1-3% below it 5-shot. In all cases Llama 3.1 70B was called with no system prompt.
The one area where Reflection performed better was math, where it scored 3% higher.
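For readers who want to replicate this, here is a sketch of the kind of 0-shot MMLU-Pro prompt described above, with the letter-choice instruction moved to the end. Kristoph's exact wording isn't public, so the instruction text here is an assumed reconstruction.

```python
# Build a 0-shot multiple-choice prompt with the output-format
# instruction at the end, as the write-up above says was needed.
from string import ascii_uppercase

def build_prompt(question: str, options: list[str]) -> str:
    lines = [question, ""]
    for letter, opt in zip(ascii_uppercase, options):
        lines.append(f"{letter}. {opt}")
    # Key detail: the letter-choice instruction goes last, so a
    # reflection-style model still ends its output with a clean letter.
    lines += ["", "Answer with only the letter of the correct option."]
    return "\n".join(lines)

print(build_prompt(
    "Which gas is most abundant in Earth's atmosphere?",
    ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
))
```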