How was r7b?
How did r7b perform compared with 35b and larger models in your practical use?
Assuming the available spaces are working correctly, the general knowledge of this 7b LLM isn't very good, and it's much worse than Cohere 35b, so as the model makers point out, RAG is a must when using this LLM.
It holds the top ~0.1% of popular world knowledge accurately, but beyond that (without RAG) it's a flood of hallucinations. For example, it accurately retrieves the primary cast of the wildly popular TV show The Big Bang Theory, correctly pairing characters with their actors, but returned the following for Corner Gas:
Dale "Hank" Hill (played by Mike Myers)
Peggy Hill (played by Janeane Garofalo)
Bobby Hill (played by Kyle Gass)
Bill Dauterive (played by Craig Robinson)
Dale's Mom (played by Jennifer Coolidge)
Dale's Dad (played by Rob Lowe)
There was a main character named Hank in Corner Gas, but Hank Hill is from King of the Hill (as are Peggy and Bobby Hill), Mike Myers is the guy from Austin Powers, Jennifer Coolidge is the ham actress from American Pie, Rob Lowe is the famous pretty boy... and none of them make any sense here. So again, RAG is essential when using this model.
Even when asked about the wildly popular movie Pulp Fiction, it said Jamie Foxx (rather than Samuel L. Jackson) played Jules Winnfield. Pulp Fiction and Samuel L. Jackson are far too popular for that. In my opinion there's no excuse for any AI model, regardless of size, to make such a mistake.
There's nothing wrong with this model as such: what it knows it retrieves accurately and coherently, apparently without grammatical or spelling errors, and it's good at following instructions. It just holds far less world knowledge than other models like Llama 3.1 8b and Gemma 2 9b, which are also good at the previously mentioned things.
Personally I don't like RAG: there are much larger and more powerful proprietary models available online, with free tiers, so why deal with the complexity of setting up RAG, the added latency, the dependence on internet availability (e.g. when camping), and the loss of organic flow it brings, especially when writing stories or having a casual back and forth on a topic? In many common AI use cases, having core and extremely popular human knowledge in the weights is vastly superior to retrieving said data from external sources.
I wouldn't expect a good level of knowledge from a 7B model; I don't think it's even possible with the current architecture for a 7B model to generalize as much as a 35B one. What I personally found this model suitable for are embedded-ish repetitive tasks, since it really does follow instructions very well. It works great for generating titles and tags in Open WebUI, and it seems that pulling Rick & Morty's "What is my purpose" on this model would generally yield a stable 99.9% effectiveness.
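For concreteness, here's a minimal sketch of that kind of embedded repetitive task: generating a short conversation title through an OpenAI-compatible endpoint. The endpoint URL, model tag, and prompt are assumptions, not specifics of this model or of Open WebUI's internals; adjust them to whatever your local setup (e.g. Ollama or Open WebUI) actually exposes.

```python
# Minimal sketch (assumptions): a local OpenAI-compatible server and a
# hypothetical model tag. The point is the pattern, not the exact values.
import requests

def generate_title(conversation_text: str) -> str:
    resp = requests.post(
        "http://localhost:11434/v1/chat/completions",  # assumed local endpoint
        json={
            "model": "command-r7b",  # hypothetical tag; use whatever your server exposes
            "messages": [
                {
                    "role": "system",
                    "content": "Write a concise 3-6 word title for the conversation. "
                               "Return only the title, with no quotes or extra text.",
                },
                {"role": "user", "content": conversation_text},
            ],
            "temperature": 0.1,  # keep the output stable for a repetitive task
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

if __name__ == "__main__":
    print(generate_title("User asked how to set up RAG for a small local model..."))
```

Small instruction-following models tend to be reliable in this role precisely because the task is narrow and the prompt is fixed; no broad world knowledge is needed.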
@Araki There's a reason why ~7b is so popular: it's the sweet spot. Knowledge and performance grow rapidly up to ~7b, then suddenly start leveling off. The best ~1b LLMs are completely incoherent, and the best ~3b LLMs barely follow directions or stay on topic, can't reliably define words found in a desktop dictionary, and so on.
So although you wouldn't normally expect a good level of knowledge from a 7B model, Llama 3.1 8b and Gemma 2 9b prove that a functional amount can be retained if you primarily focus on a single language and train long enough (Llama 3.1 8b), or distill properly from a larger model (Gemma 2 9b).
It's only when you start removing popular knowledge from the corpus, undertraining it, adding too many languages, making the model primarily for another language (e.g. Qwen2.5), or focusing too much on a specific skill (e.g. coding or math) that ~7b models can no longer retain broad knowledge and abilities.
Qwen2.5 7b is a perfect example of this. It's a good Chinese LLM and coder, but despite achieving higher English test scores than the best English LLMs (which shouldn't be possible), its general English knowledge and abilities are FAR LOWER than Llama 3.1's and Gemma 2's. They simply prioritized information like coding and virology to boost standardized test scores like the MMLU, and the open source community is so obsessed with coding that it rewards this behavior, which encourages other models to do the same: prioritize training for coding, test scores, and other select tasks at the cost of essential general knowledge and abilities.
And no, RAG has far too many limitations and drawbacks, especially for tasks like story writing, to make up for profoundly ignorant weights. Key information simply needs to be in the weights. So if you want to, for example, make an LLM proficient in all major languages, you can't make it ~7b.
@CHNtentes That's understandable, and it's very clear most people here primarily want a coding LLM. But why scramble the weights of general-purpose instruct LLMs by training on tons of synthetic coding or math data when there are already more powerful coding and math specialists, such as Qwen2.5 Coder and Math? This honestly baffles me.
Also, the internet isn't always available (e.g. when traveling), and when it is used (e.g. for RAG) there's added latency, a drastic loss of coherence, and so on. Humanity's core and very popular knowledge simply needs to be in the weights for an AI instruct/chat model to remain functional across all use cases, such as story writing and casually chatting about various things.
Plus it annoys me when AI models de-prioritize the popular pop culture knowledge most normies care about (it's called "pop" culture because it's popular), yet those same models (e.g. Qwen2.5 Instruct) still include information with little everyday function, like virology, in order to boost test scores like the MMLU. If they're going to rely on RAG, then specialty domains like virology, which rarely come up in story writing or casual chatting and therefore don't need to be recalled organically, are the perfect candidates for it.