Wur doomed!

#14
by jukofyork - opened

Continuation of THE THREAD OF DOOM.


What do you and the others think of the distilled R1 models for writing?

The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.

R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.

R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.

CoT traces won't let the model do anything out of distribution, so they're not very useful if the base model doesn't have a lot of it in its training data.

Yeah, I have tried the same two and felt the same way.

I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...

@gghfez @BigHuggyD that has been my experience as well, which is a shame as I had a go of R1 on OpenRouter and I was blown away.

What model that's anywhere close is usable on a 24GB VRAM machine with 32GB of RAM, in your experience?

There's nothing like it for now. I'm running R1 slowly on my ThreadRipper:

prompt eval time =   14026.61 ms /   918 tokens (   15.28 ms per token,    65.45 tokens per second)
       eval time =  398806.12 ms /  1807 tokens (  220.70 ms per token,     4.53 tokens per second)
      total time =  412832.73 ms /  2725 tokens

I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags eg:

I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat—their redemption is fragile, and the future is uncertain.

Let me outline the next few chapters:

Chapter 5: Nightmares and Trust
...

But it doesn't backtrack like R1 does. Just kind of agrees with itself and ends up writing how it usually would:

“I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.

lol

Ahhh that's a shame :-(

"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."

Oh god!

I'll have to keep an eye on this thread.

I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0

But my tastes are probably not as refined as others on this thread ;-)

@BigHuggyD This guy's trying to do the opposite of the Elara challenge!

https://github.com/envy-ai/elarablate

https://huggingface.co/e-n-v-y/L3.3-Electra-R1-70b-Elarablated-v0.1

This is a clearly misguided individual. 🤣

I figured out what was causing the looping for my qwq fine-tune:

  • If you multiply out the (multiplicative / down_proj only) LoRA C = A^T B, take the SVD C = U S V^T, and examine the distribution of singular values (in S) for each layer, it was obvious that the first 4 and last 4 layers had just created a couple of huge vector pairs that were doing something a bit like the opposite of abliteration (ie: magnifying certain directions massively) - see the sketch after this list.
  • I was already excluding the last 2 layers for this reason, but it looks like it wasn't enough :/
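
If anyone wants to reproduce the check, it's something like this (a minimal sketch with illustrative shapes and names, not the actual fine-tune code):

```python
import torch

# Illustrative shapes only: A and B are the two factors of the down_proj-only
# LoRA for one layer, so the merged update is C = A^T @ B.
rank, hidden = 64, 5120
A = torch.randn(rank, hidden)
B = torch.randn(rank, hidden)

C = A.T @ B                          # (hidden, hidden) update matrix
S = torch.linalg.svdvals(C.float())  # singular values, largest first

# A "healthy" layer has a fairly smooth spectrum; the bad first/last layers
# show one or two singular values holding almost all of the mass.
print((S[:4] / S.sum()).tolist())
```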

I also tried just not merging the offending layers, but this shows another problem:

  • It seems to be learning very tightly-coupled sets of vectors, and removing only some of them leaves the partner vectors in a bad state: the model started adding spaces to the end of every line :/
  • The solution to this is to use a really large lora_dropout = 0.5, as found in "LoRA Dropout as a Sparsity Regularizer for Overfitting Control" (thanks to kallewoof for linking me this paper [I won't suck him into the Thread of Doom by pinging him!]) - see the config sketch after this list.
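
For reference, if you were doing this with vanilla peft rather than my setup, the change is literally just the one number (a sketch, not my actual training config):

```python
from peft import LoraConfig

# Sketch only: the point is the unusually large lora_dropout suggested by the
# paper above, everything else is a plausible placeholder.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.5,               # vs the usual 0.05-0.1
    target_modules=["down_proj"],   # matches the down_proj-only LoRA
    bias="none",
    task_type="CAUSAL_LM",
)
```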

So now I've had to restart the fine-tuning with these two changes in mind, but I have also found that I can push the Focal Loss Star's gamma parameter even more (to 1.25 instead of 1.1 for the last run):

[image: training curves for the gamma = 1.25 run]

and so long as this isn't broken, it should be much more like a base model!

(this is also probably the very limit of what can be done, as indicated by the negative log-likelihood initially rising but then starting to [and hopefully continuing to] drop rather than just diverging off...)

It's interesting to see how it quickly "cheats" the loss by down-scaling the hidden state over the first ~400M tokens:

[image: hidden-state norm over the first ~400M tokens]

but then slowly starts to rotate the vector directions to allow for a hidden state norm closer to the original model over the remaining ~1.2B tokens.
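
If anyone wants to watch for the same "cheating", a forward pre-hook on the final norm is enough (rough sketch, assuming a Llama/Qwen-style architecture already loaded as model):

```python
import torch

hidden_norms = []

def log_hidden_norm(module, inputs):
    # inputs[0] is the hidden state entering the final RMSNorm: (batch, seq, hidden)
    h = inputs[0].detach().float()
    hidden_norms.append(h.pow(2).mean(-1).sqrt().mean().item())

# `model.model.norm` is the final norm in Llama/Qwen-style models (assumption).
handle = model.model.norm.register_forward_pre_hook(log_hidden_norm)

# ... run some batches, then plot hidden_norms over steps ...
# handle.remove()
```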

Fingers crossed this will work, and I will test at step 400 in a couple of hours.

@BigHuggyD This guy's trying to do the opposite of the Elara challenge!

https://github.com/envy-ai/elarablate

https://huggingface.co/e-n-v-y/L3.3-Electra-R1-70b-Elarablated-v0.1

I wonder if some other name would come up now?

I still want to try fine-tuning on a bunch of Occult book torrents I downloaded a while back, but the biggest hurdle is that they are mostly really badly formatted PDF files, and it would take a huge amount of effort to parse them all properly :( I think this would go a long way towards increasing the name variability, but unless they are cleaned properly, the strongest signal will end up being stupid stuff like page headers, page numbers, mis-broken paragraphs, and so on...

I wonder if Qwen3-30B-A3B might be smart enough to fix this mess?
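
If I ever get round to it, the first pass would probably just be dumb regex clean-up before handing anything to an LLM, something like this (a sketch using pypdf with a couple of illustrative patterns, nothing battle-tested):

```python
import re
from pypdf import PdfReader  # assumes `pip install pypdf`

def rough_clean(path: str) -> str:
    pages = [p.extract_text() or "" for p in PdfReader(path).pages]
    lines = []
    for page in pages:
        for line in page.splitlines():
            line = line.strip()
            if re.fullmatch(r"\d{1,4}", line):             # bare page numbers
                continue
            if re.fullmatch(r"[A-Z][A-Z .'-]{3,}", line):  # SHOUTY running headers
                continue
            lines.append(line)
    text = "\n".join(lines)
    # re-join paragraphs the PDF layout broke mid-sentence
    text = re.sub(r"(?<![.!?:])\n(?=[a-z])", " ", text)
    return text
```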

These are with the new model, but with the extra bit from this post added:

Extra Directions to Avoid Common AI Writing Issues:

  • Avoid generic phrasing or filler sentences.
  • Use fresh, specific language instead of clichés or idioms.
  • Keep internal monologue voice-consistent and emotionally grounded.
  • Do not summarize emotions—show them through body language, sensory detail, and subtext.
  • Let characters interrupt, pause, or misread each other. Real dialogue over exposition.
  • Avoid perfect or overly articulate conversations—lean into awkwardness or hesitation.
  • Limit adjectives and adverbs—prioritize strong nouns and verbs.
  • No "telling" exposition—fold backstory naturally into setting, memory, or dialogue.
  • Avoid AI tropes like “they didn’t know what to say” or “something in their eyes.” Be precise.
  • Ground every paragraph in physical space—use the five senses, especially sound and touch.
  • Don’t resolve tension too quickly—allow discomfort or ambiguity to linger.
  • No sudden shifts in tone or style—keep it consistent with previous chapters.
  • Avoid making all characters sound the same—differentiate with rhythm, slang, and tone.
  • Minimize redundant restating of emotions already shown.
  • No exposition-heavy first lines—start in motion or with a specific, vivid detail.

Also, gonna start using pastebin.com, as it highlights the syntax better and hopefully stops this thread lagging the browser so much:

https://pastebin.com/pcTxSSJ6
https://pastebin.com/LpQ5c7J0

I've scaled lm_head now so temperature = 1 cancels the gamma parameter, and these two stories were generated with just a tiny bit of min-p:

        --temp 1.0 \
        --min-p 0.01 \
        --top-k 0 \
        --top-p 1.0

I wonder if some other name would come up now?

Elara still comes up sometimes ❤️

I didn't run any proper tests other than using it, but so far I haven't noticed a new name-slop.

https://pastebin.com/pcTxSSJ6

This is really good, kind of what I was hoping for when I first prompted an LLM with that!

Edit: IMO it's worth keeping that 400-step checkpoint

It's pretty easy to use "Focal Loss Star" in any training setup:

[image: the FL* ("Focal Loss Star") definition from the paper's appendix]

https://arxiv.org/abs/1708.02002

by either up-scaling lm_head by gamma (say 1.1 or 1.25 rather than the suggested 2.0 in the paper), or by up-scaling logit_scale for the cohere models:

https://huggingface.co/CohereLabs/c4ai-command-a-03-2025/blob/main/config.json

with their tied embeddings.

You just have to be careful not to allow it to train lm_head (and, from experimentation, the last few layers), or it will try to "cheat".
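
In transformers terms the whole trick is just a couple of lines before training starts (a sketch, assuming an untied lm_head and a trainer that respects requires_grad; the model name and the "last 4 layers" are just examples):

```python
import torch
from transformers import AutoModelForCausalLM

gamma = 1.1  # 1.1-1.25 seems to be the usable range, not the paper's 2.0

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B", torch_dtype=torch.bfloat16
)

# "Focal Loss Star" via the logits: up-scale lm_head once, then train with
# plain cross-entropy as usual.
with torch.no_grad():
    model.lm_head.weight.mul_(gamma)

# Keep lm_head (and the last few layers) frozen so the model can't just undo
# the scaling and "cheat".
model.lm_head.weight.requires_grad_(False)
for layer in model.model.layers[-4:]:
    for p in layer.parameters():
        p.requires_grad_(False)

# (for the Cohere-style tied-embedding models you'd bump logit_scale in the
#  config instead of touching lm_head)
```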

All this is doing is basically:

  1. Changing the "training time temperature" so that the model's outputs appear (counter-intuitively) MORE peaked during training.
  2. Then, when trained with cross-entropy loss, the model will again try to create well-calibrated outputs based on these new, excessively peaked outputs.
  3. BUT: because it can't directly change lm_head or the last couple of down_proj matrices, nor get around the final layer_norm (which stops the blatant cheat of down-scaling the norm of the hidden state), the model is forced to "de-peak" the outputs by rotating the hidden-state vectors:

[image: hidden-state norm during training]

so that when these vectors are dot-producted with the lm_head vectors, more of the output tokens get significant probability mass, similar to what happens when you increase the inference-time temperature.

(as you can see here, there is a little bit of "cheating" going on, as it has managed to down-scale the hidden state somewhat... It seems that somewhere between gamma = 1.1 and gamma = 1.25 is the most you can get away with currently before this happens, but there is no reason a regularisation term couldn't be added to the loss, or the LoRA reparametrised to a more orthogonal structure, to stop this in the future...).
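
A quick numeric illustration of steps 1 and 3 (made-up logits, not from the actual model):

```python
import torch

z = torch.tensor([4.0, 2.0, 1.0, 0.0])  # made-up logits for one position
gamma = 1.25

p_train = torch.softmax(gamma * z, dim=-1)  # what the loss "sees": more peaked
p_plain = torch.softmax(z, dim=-1)          # what you get at temperature = 1

print(p_train.tolist())  # the top token takes even more of the mass
print(p_plain.tolist())

# To stay well-calibrated under the gamma-scaled loss, the model has to emit
# flatter logits z than it otherwise would; with lm_head frozen and the hidden
# state norm pinned by the final layer_norm, the only way left is to rotate
# the hidden-state vectors.
```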


The only difference between just scaling lm_head (or logit_scale) and what I did by adding "Focal Loss Star" to qlorapipe is that I can track the actual negative log-loss, top-1 accuracy, entropy, etc. of the unscaled version:

[image: unscaled negative log-loss / top-1 / entropy curves]

but since I'm now rescaling the lm_head tensor to reset the baseline back to temperature = 1.0 (to avoid having to explain why temperature = 0.9090909091 would otherwise be the correct setting), there is effectively no difference!
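
Concretely, the post-hoc fix-up is just this (a sketch: the checkpoint paths are made up, and gamma = 1.1 matches the 1/1.1 = 0.909... above):

```python
import torch
from transformers import AutoModelForCausalLM

gamma = 1.1  # the gamma used inside the loss during training

# Hypothetical checkpoint path from the loss-scaled run:
model = AutoModelForCausalLM.from_pretrained("./qwq-focal-star-checkpoint")

# Fold gamma into lm_head so that temperature = 1.0 at inference matches what
# the gamma-scaled loss optimised for, instead of needing temperature = 1/gamma.
with torch.no_grad():
    model.lm_head.weight.mul_(gamma)

model.save_pretrained("./qwq-focal-star-rescaled")
```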


Hopefully this explanation makes sense. I really think this method has a good chance of undoing a lot of the recent super-peaked Elara-type shit we are seeing from current models, and the idea is actually super simple. It's just a pity the appendix of the Focal Loss paper explained it so badly and used a '*' character in the name (making it almost impossible to search for).

I also think that qwq has the most potential to get fixed here: there clearly is a good writer living inside of qwq and it has the best long-context ability of the current open models:

https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87

it just needs a little help to undo the crazy-peakedness causing Elaraslop!
