Wur doomed!
What do you and the others think of the distilled R1 models for writing?
The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.
R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.
R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.
CoT traces won't let the model do anything out of distribution, so they're not very useful if the base model doesn't have much of it in its training data.
Yeah, I have tried the same two and felt the same way.
I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...
@gghfez @BigHuggyD that has been my experience as well, which is a shame as I had a go of R1 on Openrouter and I was blown away.
What model is anywhere close that is usable on a 24gb vram machine with 32gb of ram in your experience?
There's nothing like it for now. I'm running R1 slowly on my ThreadRipper:
prompt eval time = 14026.61 ms / 918 tokens ( 15.28 ms per token, 65.45 tokens per second)
eval time = 398806.12 ms / 1807 tokens ( 220.70 ms per token, 4.53 tokens per second)
total time = 412832.73 ms / 2725 tokens
I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags, e.g.:
I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat—their redemption is fragile, and the future is uncertain.
Let me outline the next few chapters:
Chapter 5: Nightmares and Trust
...
But it doesn't backtrack like R1 does. It just kind of agrees with itself and ends up writing how it usually would:
“I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.
lol
Ahhh that's a shame :-(
"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."
Oh god!
I'll have to keep an eye on this thread.
I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0
But my tastes are probably not as refined as others on this thread ;-)
@BigHuggyD This guy's trying to do the opposite of the Elara challenge!
https://github.com/envy-ai/elarablate
https://huggingface.co/e-n-v-y/L3.3-Electra-R1-70b-Elarablated-v0.1
This is a clearly misguided individual. 🤣
I figured out what was causing the looping for my `qwq` fine-tune:
- If you multiply out the (multiplicative / `down_proj`-only) LoRA `C = A^T B`, take the SVD `C = U^T S V`, and examine the distribution of singular values (in `S`) for each layer, it was obvious that the first 4 and last 4 layers had just created a couple of huge vector pairs that were doing something a bit like the opposite of abliteration (i.e. magnifying certain directions massively; see the sketch after this list).
- I was already excluding the last 2 layers for this reason, but it looks like it wasn't enough :/
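For reference, a minimal sketch of that check, assuming the two LoRA factors for each layer's `down_proj` are available as plain `torch` tensors (function and variable names here are illustrative only, not from my actual code):

```python
import torch

def top_singular_values(lora_A: torch.Tensor, lora_B: torch.Tensor, k: int = 8) -> torch.Tensor:
    # lora_A: (rank, in_features), lora_B: (out_features, rank), so the effective
    # per-layer update is C = B @ A (the A^T B above, up to transposition convention).
    C = lora_B.float() @ lora_A.float()
    return torch.linalg.svdvals(C)[:k]  # singular values come back sorted descending

def spectral_concentration(lora_A: torch.Tensor, lora_B: torch.Tensor) -> float:
    # Fraction of the total spectrum captured by the top 2 singular values:
    # values near 1.0 mean "a couple of huge vector pairs" are dominating the layer.
    S = torch.linalg.svdvals(lora_B.float() @ lora_A.float())
    return (S[:2].sum() / S.sum()).item()
```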
I also tried just not merging the offending layers, but this shows another problem:
- It seems to be learning very tightly-coupled sets of vectors, and removing just some leaves the partner vectors in a bad state; the model started adding spaces to the end of every line :/
- The solution to this is to use a really large `lora_dropout = 0.5`, as found in "LoRA Dropout as a Sparsity Regularizer for Overfitting Control" (thanks to kallewoof for linking me this paper [I won't suck him into the Thread of Doom by pinging him!]).
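For illustration, here's roughly what that dropout setting looks like with a standard PEFT `LoraConfig` (note: PEFT's LoRA is the usual additive kind rather than the multiplicative `down_proj`-only variant above, so treat the rank and targets here as placeholder assumptions):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                          # placeholder rank, not the actual recipe
    lora_alpha=64,
    lora_dropout=0.5,              # the "really large" dropout from the paper above
    target_modules=["down_proj"],  # restrict the adapter to the down_proj matrices
    bias="none",
    task_type="CAUSAL_LM",
)
```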
So now I've had to restart the fine-tuning with these two changes in mind, but I have also found that I can push the Focal Loss Star's `gamma` parameter even more (to `1.25` instead of `1.1` for the last run), and so long as this isn't broken, it should be much more like a base model!
(This is also probably the very limit of what can be done, as indicated by the negative log likelihood initially rising but then starting to [and hopefully continuing to] drop, rather than just diverging off...)
It's interesting to see how it quickly "cheats" the loss by down-scaling the hidden state over the first ~400M tokens, but then slowly starts to rotate the vector directions to allow for a hidden-state norm closer to the original model's over the remaining ~1.2B tokens.
Fingers crossed this will work, and I will test at step 400 in a couple of hours.
@BigHuggyD This guy's trying to do the opposite of the Elara challenge!
https://github.com/envy-ai/elarablate
https://huggingface.co/e-n-v-y/L3.3-Electra-R1-70b-Elarablated-v0.1
I wonder if some other name would come up now?
I still want to try fine-tuning on a bunch of Occult book torrents I downloaded a while back, but the biggest hurdle is that they are mostly in really badly formatted PDF files and it would take a huge amount of effort to parse them all properly :( I think this would go a long way towards helping increase the name variability, but unless they are cleaned properly, the strongest signal will end up being stupid stuff like page headers, page numbers, mis-broken paragraphs, and so on...
I wonder if `Qwen3-30B-A3B` might be smart enough to fix this mess?
These are with the new model, but with the extra bit from this post added:
Extra Directions to Avoid Common AI Writing Issues:
- Avoid generic phrasing or filler sentences.
- Use fresh, specific language instead of clichés or idioms.
- Keep internal monologue voice-consistent and emotionally grounded.
- Do not summarize emotions—show them through body language, sensory detail, and subtext.
- Let characters interrupt, pause, or misread each other. Real dialogue over exposition.
- Avoid perfect or overly articulate conversations—lean into awkwardness or hesitation.
- Limit adjectives and adverbs—prioritize strong nouns and verbs.
- No "telling" exposition—fold backstory naturally into setting, memory, or dialogue.
- Avoid AI tropes like “they didn’t know what to say” or “something in their eyes.” Be precise.
- Ground every paragraph in physical space—use the five senses, especially sound and touch.
- Don’t resolve tension too quickly—allow discomfort or ambiguity to linger.
- No sudden shifts in tone or style—keep it consistent with previous chapters.
- Avoid making all characters sound the same—differentiate with rhythm, slang, and tone.
- Minimize redundant restating of emotions already shown.
- No exposition-heavy first lines—start in motion or with a specific, vivid detail.
Also, gonna start using pastebin.com, as it highlights the syntax better and hopefully stops this thread lagging the browser so much:
https://pastebin.com/pcTxSSJ6
https://pastebin.com/LpQ5c7J0
I've scaled `lm_head` now so `temperature = 1` cancels the `gamma` parameter, and these two stories were generated with just a tiny bit of `min-p`:
--temp 1.0 \
--min-p 0.01 \
--top-k 0 \
--top-p 1.0
I wonder if some other name would come up now?
Elara still comes up sometimes ❤️
I didn't run any proper tests other than using it, but so far I haven't noticed a new name-slop.
This is really good, kind of what I was hoping for when I first prompted an LLM with that!
Edit: IMO it's worth keeping that 400-step checkpoint
It's pretty easy to use "Focal Loss Star" in any training setup:
https://arxiv.org/abs/1708.02002
by either up-scaling `lm_head` by `gamma` (say `1.1` or `1.25` rather than the suggested `2.0` in the paper), or by up-scaling `logit_scale` for the `cohere` models:
https://huggingface.co/CohereLabs/c4ai-command-a-03-2025/blob/main/config.json
with their tied embeddings.
You just have to be careful not to allow it to train `lm_head` and, from experimentation, the last few layers, or it will try to "cheat".
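As a sketch of the "up-scale `lm_head` and keep it frozen" version (assuming a standard Hugging Face LLaMA-style layout with `model.lm_head` and `model.model.layers`; this isn't the exact code I used):

```python
import torch

def apply_focal_loss_star(model, gamma: float = 1.1, freeze_last_n: int = 4):
    # Up-scale the output head by gamma so the training-time logits look "more peaked",
    # then freeze it so the optimiser can't simply scale it back down again.
    with torch.no_grad():
        model.lm_head.weight.mul_(gamma)
    model.lm_head.weight.requires_grad_(False)

    # Also freeze the last few transformer blocks, since they can "cheat" in the same way.
    for layer in model.model.layers[-freeze_last_n:]:
        for p in layer.parameters():
            p.requires_grad_(False)
```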
All this is doing is basically:
- Changing the "training time temperature" so that the model's outputs appear (counter-intuitively) MORE peaked during training.
- Then, when trained with cross-entropy loss, the model will try to again create well-calibrated outputs based on these new excessively peaked outputs.
- BUT: because it can't directly change `lm_head` or the last couple of `down_proj` matrices, nor get round the final `layer_norm` stopping blatant cheating by down-scaling the norm of the hidden state, the model is forced to "de-peak" the outputs by rotating the hidden-state vectors, so that when these vectors are dot-producted with the `lm_head` vectors, more of the output tokens get significant probability mass, in a similar way to what happens when you increase the inference-time temperature (there's a tiny numerical illustration of this below).
(As you can see here, there is a little bit of "cheating" going on, as it has managed to down-scale the hidden state somewhat... It seems that between `gamma = 1.1` and `gamma = 1.25` is the most you can get away with currently before this happens, but there is no reason a regularisation term can't be added to the loss, or a reparametrisation of the LoRA to a more orthogonal structure, to stop this in the future...).
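Here's the numerical illustration mentioned above (toy numbers only, nothing from the actual runs): if the loss sees the logits multiplied by `gamma`, then the logits the model has to emit to stay well-calibrated correspond to a flatter distribution when decoded normally at temperature = 1.

```python
import torch

gamma = 1.25
target = torch.tensor([0.6, 0.3, 0.1])   # some "well-calibrated" next-token distribution

# Logits the model must emit so that softmax(gamma * logits) reproduces the target:
logits_needed = torch.log(target) / gamma

print(torch.softmax(gamma * logits_needed, dim=-1))  # -> [0.6, 0.3, 0.1], what training sees
print(torch.softmax(logits_needed, dim=-1))          # -> ~[0.55, 0.32, 0.13], flatter at temp 1
```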
The only difference between just scaling `lm_head` (or `logit_scale`) and what I did by adding "Focal Loss Star" to `qlorapipe` is that I can track the actual negative log-loss, top-1, entropy, etc of the unscaled version; but since I'm now rescaling the `lm_head` tensor to reset back to the baseline `temperature = 1.0` (to avoid having to explain why `temperature = 0.9090909091` is the correct setting), there is effectively no difference!
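The rescaling step itself is trivial; a sketch, again assuming the usual `lm_head` naming and that `gamma` was only applied inside the loss during training:

```python
import torch

def bake_in_gamma(model, gamma: float = 1.1):
    # Fold gamma into the output head after training, so that temperature = 1.0 at
    # inference gives the same distribution that would otherwise need temperature = 1 / gamma
    # (i.e. 1 / 1.1 = 0.9090909091 with the unscaled head).
    with torch.no_grad():
        model.lm_head.weight.mul_(gamma)
```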
Hopefully this explanation makes sense, as I really think this method has a good chance of undoing a lot of the recent super-peaked Elara-type shit we are seeing from current models. The idea is actually super-simple; it's just a pity the appendix of the Focal Loss paper explained it so badly and used a '*' character in the name (making it almost impossible to search for).
I also think that `qwq` has the most potential to get fixed here: there clearly is a good writer living inside of `qwq`, and it has the best long-context ability of the current open models:
https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87
It just needs a little help to undo the crazy-peakedness causing the Elara-slop!