This recent paper points to an explanation for the unreasonable effectiveness of frankenmerges: "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach" (2502.05171)

Specifically, the duplicated layers in a frankenmerge serve a purpose similar to the repeated block in the paper's recurrent-depth architecture. Frankenmerges that work without additional fine-tuning are able to recover, or "heal", from the damage caused by abrupt transitions between layer blocks. Replicated layer blocks that remain functional can then provide benefits grounded in latent reasoning. Frankenmerges can also produce hybrid reasoning by splicing together the latent reasoning of different models.

Back in April 2024, I was able to duplicate a few layers in the Llama 3 8B model, turning it into a 9B model, without significantly harming benchmark scores, despite any transition damage.
grimjim/llama-3-experiment-v1-9B
My informal experimentation suggested that latent reasoning circuits could occupy contiguous stacks of 2-4 layers, though the result was highly sensitive to where the transition between layer blocks was placed.
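To make the layer duplication concrete, here is a minimal sketch of how the layer order of such a self-merge could be planned. The helper names `duplicate_block` and `transition_points` are hypothetical, and the block position (start 12, length 4) is an arbitrary illustration, not the layers used in the experiment above. The only assumption taken from the model itself is the 32-layer depth of Llama 3 8B.

```python
def duplicate_block(num_layers: int, start: int, length: int) -> list[int]:
    """Return source-layer indices for a stack in which the contiguous
    block [start, start + length) appears twice in a row."""
    if length < 1 or start < 0 or start + length > num_layers:
        raise ValueError("block must lie inside the original stack")
    end = start + length
    # Layers 0..end-1, then the block again, then the remaining layers.
    return list(range(end)) + list(range(start, end)) + list(range(end, num_layers))


def transition_points(order: list[int]) -> list[int]:
    """Positions in the new stack where the source layer does not directly
    follow its predecessor in the original model (an abrupt transition)."""
    return [i for i in range(1, len(order)) if order[i] != order[i - 1] + 1]


# Hypothetical example: a 32-layer stack with one 4-layer block repeated.
order = duplicate_block(num_layers=32, start=12, length=4)
print(len(order))                 # 36
print(order[10:22])               # [10, 11, 12, 13, 14, 15, 12, 13, 14, 15, 16, 17]
print(transition_points(order))   # [16]
```

In this layout there is a single abrupt transition, where the stack jumps back from layer 15 to layer 12. The second copy of the block flows into layer 16 exactly as the original did, so the choice of `start` and `length` decides where that one seam falls.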


upvoted a paper 4 months ago

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Paper • 2502.06703 • Published Feb 10 • 154

updated a Space 4 months ago

Mwsamanaga

🏆

Some other CTF thingy, dw abt it

liked a model 5 months ago

jxm/sentence-transformers_all-MiniLM-L6-v2msmarco128

Updated Oct 31, 2023 • 11 • 2

spuun is trying


spuun's activity

Dense Multimodal Llama


Wlewlewle

Cocokin Kpppppl


Llada 8b Kcv


Nagaluv

Owonaga
