---
base_model:
- cognitivecomputations/dolphin-2.9.1-llama-3-70b
- migtissera/Tess-3-Llama-3.1-70B
library_name: transformers
tags:
- mergekit
- merge
---
# about

A merge between an L3 70b model (Dolphin 2.9.1) and an L3.1 70b base model (Tess 3), inspired by https://huggingface.co/sophosympatheia/New-Dawn-Llama-3.1-70B-v1.1 and https://huggingface.co/jukofyork/Dusk-Miqu-70B .

- I made 10 different versions and retained 3 release candidates (formerly in my NexesMess repo: this one was Flipper 0.38; Flipper 0.36 will be Tess Dolphin 1.1; Flipper 0.32 will be Tess Dolphin 1.0).
- I (re?)added q_proj and k_proj (the rope base frequency is similar between L3 and L3.1), plus input_layernorm and post_attention_layernorm (why not?) to Sophos' mix, so all tensors are merged the same way (I guess I could just have gone layer-wise then, but whatever lol).
- This v1.2 follows a quasi-triangular shape for its merge gradient, from layer 1 to layer 78 (or 0 to 79, I'm not even sure). It has a relatively high perplexity (3.95). The most "Dolphin imprinted", for kicks & giggles.
- The v1.1 leaves the first 4 and last 4 layers of Tess untouched. It has an intermediate perplexity (3.75). A solid and balanced standalone version.
- The v1.0 leaves the first 12 and last 12 layers of Tess untouched. It has the lowest perplexity (3.55). Probably the most "mergeable" version with L3.1 and L3.3 models.
- -> I will actually need to make a new version with 16/16 untouched layers (it will be v1.3), because that's the recommended recipe.
- Beyond the maths I don't master (merge stuff, rope stuff, etc.), one reason for that perplexity bump might be that the closer you get to Dolphin (thus to L3 70b), the higher the perplexity you get, because Llama 3 70b itself had a perplexity above 4.5 (unless my quants were not fresh lol), even though Dolphin went way lower (less than 3.65).
- And why Tess 3.1 as a base?
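The layer-freezing scheme that distinguishes the three versions can be sketched as follows. This is a minimal illustration only, assuming an 80-layer model; `dolphin_weight_per_layer` is a hypothetical helper, not part of the actual mergekit recipe:

```python
# Hypothetical sketch (not the actual merge recipe): how the three versions
# differ by the number of Tess layers left untouched at each end. A weight
# of 0 for Dolphin on a layer leaves the Tess 3 base tensor as-is.

def dolphin_weight_per_layer(total_layers: int, frozen_each_end: int) -> list[float]:
    """Return 1.0 (merge Dolphin in) or 0.0 (keep Tess untouched) per layer."""
    weights = []
    for layer in range(total_layers):
        if layer < frozen_each_end or layer >= total_layers - frozen_each_end:
            weights.append(0.0)  # first/last layers: Tess 3 kept untouched
        else:
            weights.append(1.0)  # middle layers: Dolphin merged in
    return weights

v10 = dolphin_weight_per_layer(80, 12)  # v1.0: first/last 12 layers untouched
v11 = dolphin_weight_per_layer(80, 4)   # v1.1: first/last 4 layers untouched
v12 = dolphin_weight_per_layer(80, 1)   # v1.2: only the edge anchors at 0
```

In the real config below, mergekit interpolates a 21-point gradient list across the 80 layers rather than taking an explicit per-layer list, but the effect at the ends is the same.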
Simple: I wanted a low-perplexity base to compensate for the high perplexity of L3 models, and Tess 3 70b is the best L3.1 instruct-capable model at this game (2.90 PPL 512 Wikitext Eng).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6451b24dc5d273f95482bfa4/L6wJK3wDcAJBj5MvZPWO1.png)

- Blue is v1.2 (1+1 untouched layers, if any), Purple is v1.1 (4+4 untouched layers), Orange is v1.0 (12+12 untouched layers).
- I initially thought that v1.0 would be green and v1.1 would be red, but I messed up I guess. The results are still usable though.
- You can multiply the values on the graphic by 2, because that's the epsilon I used, in the way demonstrated by the aforementioned guys.

Caution: this model is VASTLY uncensored.

And good news: the grammar problems mentioned by mergers (notably between L2/Miqu and L3 models) when some of the first 16 layers (not to speak of the first 8) are merged seem to be extremely limited on the 3 versions.

- The hiccups I could observe are within the margin of error of what I see on many L3.1/L3.3 finetunes.
- As for the intelligence of the models, well, I might have seen better with pure L3.1/L3.3 models & merges, but it's not debilitated.
- As for the creativity, well, those L3/L3.1 merges might not be the smartest ducks of the pond, but they are colorful alright!

---
# benchs

- PPL 512 Wikitext Eng: 3.95 (passable), v1.1 does 3.75, v1.0 does 3.55.
- ARC-C: 58.85 (average ++), v1.1 does 61.85, v1.0 does 61.20.
- ARC-E: 76.15 (average), v1.1 does 80, v1.0 does 78.75.

---
# tests

Let's see if it can hold long context (on testing):

- At 10k, it holds coherence.
- At 20k, it holds coherence.
- At 28k, it holds coherence.
- That's good enough to validate this release, because the merged material is an 8k-context L3, not even a Miqu (32k context).

---
# merge

This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).
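For readers unfamiliar with the metric quoted above, perplexity is just the exponential of the mean per-token negative log-likelihood. A minimal illustration (the probabilities below are made up; the real scores come from running a perplexity tool over 512-token Wikitext chunks):

```python
import math

# Minimal illustration of the perplexity metric: PPL = exp(mean NLL per token).
# Token probabilities here are invented for the example, not model outputs.

def perplexity(token_probs: list[float]) -> float:
    """Perplexity from the probabilities a model assigned to the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model assigning a uniform 0.26 probability to each true token scores 1/0.26:
print(round(perplexity([0.26] * 512), 2))  # → 3.85
```

Lower is better: the 2.90 of Tess 3 means it is, on average, "less surprised" by Wikitext than the 3.95 of this v1.2.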
## Merge Details
### Merge Method

This model was merged using the [Linear DELLA](https://arxiv.org/abs/2406.11617) merge method using [migtissera/Tess-3-Llama-3.1-70B](https://huggingface.co/migtissera/Tess-3-Llama-3.1-70B) as a base.

### Models Merged

The following models were included in the merge:
* [cognitivecomputations/dolphin-2.9.1-llama-3-70b](https://huggingface.co/cognitivecomputations/dolphin-2.9.1-llama-3-70b)

### Configuration

The following YAML configuration was used to produce this model:

```yaml
merge_method: della_linear
base_model: migtissera/Tess-3-Llama-3.1-70B
models:
  - model: cognitivecomputations/dolphin-2.9.1-llama-3-70b
    parameters:
      weight:
        - filter: q_proj
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - filter: k_proj
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - filter: v_proj
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - filter: o_proj
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - filter: input_layernorm
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - filter: up_proj
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - filter: gate_proj
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - filter: down_proj
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - filter: post_attention_layernorm
          value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        - value: 0
      density: 0.5
      epsilon: 0.1
      lambda: 1.0
  - model: migtissera/Tess-3-Llama-3.1-70B
    parameters:
      weight: 1.0
      density:
        - filter: q_proj
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - filter: k_proj
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - filter: v_proj
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - filter: o_proj
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - filter: input_layernorm
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - filter: up_proj
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - filter: gate_proj
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - filter: down_proj
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - filter: post_attention_layernorm
          value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
        - value: 0.5
      epsilon:
        - filter: q_proj
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - filter: k_proj
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - filter: v_proj
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - filter: o_proj
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - filter: input_layernorm
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - filter: up_proj
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - filter: gate_proj
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - filter: down_proj
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - filter: post_attention_layernorm
          value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
        - value: 0.1
      lambda: 1.0
dtype: bfloat16
out_dtype: bfloat16
parameters:
  int8_mask: true
  normalize: true
  rescale: true
  filter_wise: false
chat_template: auto
tokenizer:
  source: union
```
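Since the same 21-point gradients repeat for every filter above, they are easy to generate programmatically rather than typing by hand. A hypothetical sketch (the `triangular` and `plateau` helpers are illustrative, not part of mergekit):

```python
# Hypothetical helpers to generate the 21-point gradient lists used in the
# config above. Mergekit interpolates such lists across the layer stack.

def triangular(peak: float, steps_up: int = 10) -> list[float]:
    """0 -> peak -> 0 ramp, e.g. the epsilon gradient peaking at 0.1."""
    up = [round(peak * i / steps_up, 2) for i in range(steps_up + 1)]
    return up + up[-2::-1]  # mirror back down, without repeating the peak

def plateau(edge: float, middle: float, points: int = 21) -> list[float]:
    """Edge-anchored plateau, e.g. the density gradient [1, 0.5, ..., 0.5, 1]."""
    return [edge] + [middle] * (points - 2) + [edge]

epsilon_gradient = triangular(0.1)      # [0.0, 0.01, ..., 0.1, ..., 0.01, 0.0]
density_gradient = plateau(1, 0.5)      # [1, 0.5, 0.5, ..., 0.5, 1]
weight_gradient = plateau(0, 1)         # [0, 1, 1, ..., 1, 0]
```

Dumping such lists with a YAML library into the `models[].parameters` blocks would avoid typos like a stray comma in a decimal, which makes YAML parse the value as a string.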