about

A merge between an L3 70b model (Dolphin 2.9.1) and an L3.1 70b base model (Tess 3), inspired by https://huggingface.co/sophosympatheia/New-Dawn-Llama-3.1-70B-v1.1 and https://huggingface.co/jukofyork/Dusk-Miqu-70B.

  • I called it "0.2" right away, to leave room for an eventual "0.1" that would follow Sophos' method to the letter if my edits didn't pan out.
  • I (re?)added q_proj and k_proj (the RoPE base frequency is similar between L3 and L3.1), plus input_layernorm and post_attention_layernorm (why not?), to Sophos' mix, so all tensors are merged the same way (I guess I should just have gone layer-wide then, but whatever lol). See the sketch after this list for how I understand these per-layer gradients to be applied.
  • First test is with rescale false; the next will be with rescale true.
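For intuition, here is a rough Python/NumPy sketch of how I understand mergekit to handle these gradient lists: the anchor values are interpolated linearly across the model's layers (80 for a Llama 3.x 70b). This is only an illustration of the idea, not mergekit's actual code.

import numpy as np

# Illustration only: expand an 11-point gradient into per-layer weights for an
# 80-layer Llama 3.x 70b. Assumes mergekit interpolates the anchors linearly
# across the layer range (my understanding; the real implementation may differ).
anchors = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
num_layers = 80

layer_pos = np.linspace(0.0, 1.0, num_layers)      # each layer's position in [0, 1]
anchor_pos = np.linspace(0.0, 1.0, len(anchors))   # each anchor's position in [0, 1]
per_layer_weight = np.interp(layer_pos, anchor_pos, anchors)

# The first and last ~8 layers stay pure Tess (Dolphin weight 0); the middle blends Dolphin in.
print(np.round(per_layer_weight, 2))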

benchmarks

  • PPL-512 Wikitext (English): 2.93 (excellent)
  • ARC-C : 62.55 (very good)
  • ARC-E : 82.80 (very good)
  • Changes are minimal compared to Tess, to the point of being within noise (probably because rescale is off). A generic way to rerun these benchmarks is sketched below.
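For reference, one generic way to run comparable ARC evaluations is EleutherAI's lm-evaluation-harness. This is only a sketch; the harness version, few-shot count and precision are assumptions and may not match the setup used for the numbers above.

import lm_eval

# Generic ARC-C / ARC-E evaluation sketch (0-shot and bf16 are assumptions, not
# necessarily the settings behind the scores reported above).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=NexesMess/Llama_3.x_70b_Flipper_0.2,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy"],
    num_fewshot=0,
    batch_size=1,
)
print(results["results"]["arc_challenge"])
print(results["results"]["arc_easy"])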

tests

Let's see if it can hold long context (testing in progress):

  • At 10k, it holds coherence.
  • At 20k, it holds coherence.
  • At 28k, it holds coherence.
  • That's good enough to validate this step for now, because the merged-in material is an 8k-context L3, not even a Miqu (32k context). A rough way to run this kind of check is sketched below.
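For anyone who wants to run a quick long-context check of their own, here is a minimal needle-in-a-haystack-style sketch using transformers. It is a generic illustration (the passphrase and padding length are placeholders, and it obviously needs enough VRAM for a 70b in bf16), not the exact protocol behind the observations above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "NexesMess/Llama_3.x_70b_Flipper_0.2"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

# Plant a fact at the start of a roughly 28k-token prompt, then ask for it back.
needle = "The secret passphrase is 'blue heron'."
filler = "The quick brown fox jumps over the lazy dog. " * 2800
prompt = f"{needle}\n\n{filler}\n\nQuestion: What is the secret passphrase?\nAnswer:"

inputs = tok(prompt, return_tensors="pt").to(model.device)
print("prompt tokens:", inputs.input_ids.shape[1])
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))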

credits

Credits go to Jukofyork and Sophosympatheia, as well as to the Arcee/Mergekit folks and the model authors, of course.


merge

This is a merge of pre-trained language models created using mergekit.

Merge Details

Merge Method

This model was merged using the Linear DELLA (della_linear) merge method, with migtissera/Tess-3-Llama-3.1-70B as the base. A rough sketch of the DELLA-style pruning idea follows.
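As a mental model of what della_linear does to the Dolphin deltas, here is a simplified, single-tensor Python/NumPy sketch of DELLA-style magnitude-based stochastic pruning: drop probabilities are centred on 1 - density and spread by epsilon according to magnitude rank, survivors are rescaled DARE-style, and lambda scales the result. This is an approximation of the idea, not mergekit's implementation, and the exact epsilon semantics may differ.

import numpy as np

def della_linear_delta(delta, density=0.25, epsilon=0.05, lam=1.0, seed=0):
    # Simplified sketch: prune a task delta (finetune minus base) DELLA-style.
    rng = np.random.default_rng(seed)
    flat = delta.ravel()
    n = flat.size

    base_drop = 1.0 - density                 # mean drop probability
    order = np.argsort(np.abs(flat))          # indices sorted by ascending magnitude
    ranks = np.empty(n)
    ranks[order] = np.arange(n)

    # Smaller-magnitude deltas get a higher drop probability; epsilon sets the spread.
    p_drop = np.clip(base_drop + epsilon * (0.5 - ranks / max(n - 1, 1)), 0.0, 1.0)

    keep = rng.random(n) >= p_drop
    kept = np.zeros_like(flat)
    # DARE-style rescaling keeps the expected contribution of the delta unchanged.
    kept[keep] = flat[keep] / (1.0 - p_drop[keep])
    return lam * kept.reshape(delta.shape)

# The pruned delta would then be added back onto the base weights with the
# per-layer weight from the gradient lists (no sign-election step in the linear variant).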

Models Merged

The following models were included in the merge:

  • cognitivecomputations/dolphin-2.9.1-llama-3-70b

Configuration

The following YAML configuration was used to produce this model:

merge_method: della_linear
base_model: migtissera/Tess-3-Llama-3.1-70B
models:
  - model: cognitivecomputations/dolphin-2.9.1-llama-3-70b
    parameters:
      weight:
        - filter: q_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: k_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: v_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: o_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: input_layernorm
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: up_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: gate_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: down_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: post_attention_layernorm
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - value: 0
      density: 0.25
      epsilon: 0.05
      lambda: 1.0
  - model: migtissera/Tess-3-Llama-3.1-70B
    parameters:
        weight: 1.0
        density:
          - filter: q_proj
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - filter: k_proj
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - filter: v_proj
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - filter: o_proj
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - filter: input_layernorm
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - filter: up_proj
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - filter: gate_proj
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - filter: down_proj
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - filter: post_attention_layernorm
            value: [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1]
          - value: 0.5
        epsilon:
          - filter: q_proj
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - filter: k_proj
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - filter: v_proj
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - filter: o_proj
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - filter: input_layernorm
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - filter: up_proj
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - filter: gate_proj
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - filter: down_proj
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - filter: post_attention_layernorm
            value: [0, 0, 0.05, 0.05, 0.07, 0.1, 0.07, 0.05, 0.05, 0, 0]
          - value: 0.1
        lambda: 1.0
dtype: bfloat16
out_dtype: bfloat16
parameters:
  int8_mask: true
  normalize: true
  rescale: false
chat_template: auto
tokenizer:
  source: union
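To reproduce the merge, the config above can be saved to a YAML file and fed to mergekit, either through its CLI (mergekit-yaml flipper_0.2.yml ./output) or through its Python API. The snippet below is adapted from mergekit's documented Python usage; the file paths are placeholders and the option values are just sensible defaults.

import torch
import yaml

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

CONFIG_YML = "./flipper_0.2.yml"              # the YAML above, saved to disk (placeholder path)
OUTPUT_PATH = "./Llama_3.x_70b_Flipper_0.2"   # where to write the merged model

with open(CONFIG_YML, "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path=OUTPUT_PATH,
    options=MergeOptions(
        cuda=torch.cuda.is_available(),   # run the merge on GPU if one is available
        copy_tokenizer=True,              # copy a tokenizer into the output directory
        lazy_unpickle=False,
        low_cpu_memory=False,
    ),
)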