Any plans for future Qwen3 8B models?
If you plan on doing more finetunes for Qwen3 8B, I would recommend the Qwen3 DeepSeek R1 distill; it performed much better in my writing tests. However, I did discover the DeepSeek tokenizer isn't as good as the Qwen tokenizer. I ran a series of experiments and found the Qwen tokenizer to be more efficient and to produce more accurate results (where benchmarks are concerned), and a SLERP merge between the two to perform even better (this worked well since they are essentially the same architecture). The two tokenizers were also easy to test against each other since they have almost 100% vocab overlap. Here's the SLERP merge if you're interested: https://huggingface.co/lemon07r/Qwen3-R1-SLERP-Q3T-8B
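In case it's useful, here's roughly what a mergekit SLERP config plus invocation looks like for a merge like that. The model IDs, layer count, interpolation factor, and tokenizer_source below are my assumptions for illustration, not the exact settings behind the linked merge:

```python
# Sketch of a mergekit SLERP merge between Qwen3-8B and its R1 distill,
# keeping the Qwen tokenizer. Model IDs, layer_range, and t are assumptions,
# not the exact recipe behind the linked Qwen3-R1-SLERP-Q3T-8B.
import subprocess
from pathlib import Path

config = """\
slices:
  - sources:
      - model: Qwen/Qwen3-8B                          # assumed base model
        layer_range: [0, 36]                          # Qwen3-8B has 36 layers
      - model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B  # assumed distill checkpoint
        layer_range: [0, 36]
merge_method: slerp
base_model: Qwen/Qwen3-8B
parameters:
  t: 0.5                   # interpolation factor; 0.5 = halfway between the two
dtype: bfloat16
tokenizer_source: base     # keep the Qwen (base) tokenizer, not the DeepSeek one
"""

Path("slerp-config.yaml").write_text(config)
# mergekit-yaml <config> <output-dir> writes the merged model to ./merged
subprocess.run(["mergekit-yaml", "slerp-config.yaml", "./merged"], check=True)
```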
I'm also curious which datasets have been working well for you. I think a mix of:
- your gutenberg2 dataset
- jondurbin's original gutenberg dataset with your fixes
- and sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo (which was used in Darkest Muse, possibly the best finetune I've tested in its time)
plus a few datasets for mild uncensoring without damaging the model's quality or performance, like:
- TheDrummer/AmoralQA-v2 (this seems to be a fan favorite among popular uncensored models)
- your firewall dataset for obvious reasons, qwen/deepseek being chinese models
...could be good (a rough sketch of mixing these is below). I haven't messed around with finetuned Qwen models too much, since the few I tried had severely degraded performance compared to the original Qwen3 models.
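If it helps picture it, here's a rough sketch of mixing those DPO sets with the datasets library. The repo IDs for the gutenberg2 and jondurbin sets are my guesses, the firewall dataset is left as a placeholder, and I'm assuming everything is already in prompt/chosen/rejected format (AmoralQA in particular would need checking/converting):

```python
# Rough sketch: concatenate the candidate DPO datasets into one training mix.
# Assumes each repo has a "train" split in prompt/chosen/rejected format;
# each one should really be inspected and remapped first.
from datasets import load_dataset, concatenate_datasets

repos = [
    "nbeerbower/gutenberg2-dpo",    # assumed repo id for the gutenberg2 set
    "jondurbin/gutenberg-dpo-v0.1", # assumed repo id for the original set
    "sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo",
    "TheDrummer/AmoralQA-v2",       # may need converting to DPO format first
    # plus the firewall dataset (repo id omitted here)
]

parts = []
for repo in repos:
    ds = load_dataset(repo, split="train")
    # keep only the columns DPO training needs, dropping anything extra
    parts.append(ds.select_columns(["prompt", "chosen", "rejected"]))

mixed = concatenate_datasets(parts).shuffle(seed=42)
print(mixed)
```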
Sure, I can try out a tune on the 8B R1 distill. I'm also interested in trying out Sam's dataset.
Thanks for the ideas!
Consider the SLERP merge using the Qwen tokenizer that I linked as well, since the DeepSeek tokenizer in the distill doesn't seem to be as good! I look forward to trying whatever you end up cooking.
Here's a crack at it: https://huggingface.co/nbeerbower/Wenyan-Qwen3-8B
I'll take a look soon when I'm at a PC. Any reason for the HelpSteer dataset? I took a look at a couple of the first entries and it looks awful, tbh. Most of the chosen responses vs. the rejected ones are just the same response but with chatgptisms introduced. A lot of the chosen responses also introduce censorship; there's a lot of "sorry, I'm an AI, I can't help with that" in the chosen responses, whereas the rejected responses were more willing to oblige. A note on the overcooking (haven't checked whether it is yet): if we're including more Gutenberg datasets than we've used in the past, we probably don't need as many epochs anymore.
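For reference, this is roughly the kind of spot check I did on the chosen vs rejected responses; the repo id is a placeholder (swap in whichever HelpSteer DPO variant was actually used) and the column names and phrase list are just illustrative:

```python
# Quick spot check: how often do refusal-style phrases appear in the "chosen"
# column vs the "rejected" column of a DPO-style dataset?
# REPO_ID is a placeholder; prompt/chosen/rejected columns are assumed.
from datasets import load_dataset

REPO_ID = "your-helpsteer-dpo-variant"  # placeholder, not an actual repo id

REFUSAL_PHRASES = [
    "i'm sorry",
    "as an ai",
    "i cannot help",
    "i can't help",
    "i'm not able to",
]

def count_refusals(texts):
    """Count responses containing at least one refusal-style phrase."""
    return sum(
        any(phrase in text.lower() for phrase in REFUSAL_PHRASES)
        for text in texts
    )

ds = load_dataset(REPO_ID, split="train")
print("chosen refusals:  ", count_refusals(ds["chosen"]))
print("rejected refusals:", count_refusals(ds["rejected"]))
```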
Yeah, good callout. I'll try again with 1 epoch and more of a focus on writing datasets.
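For reference, a single-epoch DPO pass with TRL would look roughly like this; the model id, dataset, and hyperparameters are placeholders rather than my actual recipe, and the tokenizer argument name varies between TRL versions:

```python
# Minimal sketch of a one-epoch DPO run with TRL on a writing-focused dataset.
# Model id, dataset, and every hyperparameter below are placeholder assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

MODEL_ID = "Qwen/Qwen3-8B"  # could equally be the R1 distill or the SLERP merge

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
train_dataset = load_dataset("nbeerbower/gutenberg2-dpo", split="train")  # assumed repo id

args = DPOConfig(
    output_dir="qwen3-8b-writing-dpo",
    num_train_epochs=1,             # the single epoch suggested above
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    beta=0.1,                       # DPO temperature; placeholder value
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # `tokenizer=` in older TRL versions
)
trainer.train()
```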
Would be a good experiment. Might end up obsolete soon if we get another 2507 Qwen3 model in the form of an 8B, though.
Here ya go: https://huggingface.co/nbeerbower/Eloisa-Qwen3-8B
Just got done testing it. Feels pretty solid; I couldn't find any real quirks or weaknesses. Seems like a good all-around model without any strange bias or issues. The writing and output read pretty naturally too.
https://www.reddit.com/r/LocalLLaMA/comments/1mibd4n/flown_under_the_radar_eloisaqwen38b/
Shared it on reddit. Hopefully others try it too :)
Awesome, I'm working on a tool to make my finetuning process easier and more reproducible, so maybe I'll try this set of data on a few other models.