This looks great

#1 by DazzlingXeno - opened

It really does but Command R is a bit beefy for me. Are you thinking of doing this with anything smaller or is that not possible?

Sadly, the "Instruct-Storywriter" method of just concatenating chunks of text doesn't seem to work on smaller models:

Why no 8B?

I tried multiple times to train this on Llama 3 8B Instruct, using a variety of hyperparameters. It never worked well. The model took a huge hit to intelligence every time, to the point of being unusable. 70B fared much better. I don't know why, maybe 8B is just too small for this type of technique, and loses too much of the instruction-tuned smarts.

I've run this on mistral:7b, wizard-lm-2:7b, mistral-nemo:12b and mistral-small:22b when tuning the hyper-parameters, and can confirm that all showed a (very) large drop in their cross-validation loss (compared to larger models), and ultimately ended up losing most of their instruction-following ability.
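
For anyone wondering what "just concatenating chunks of text" looks like in practice, here is a rough sketch of the data preparation, assuming a plain-text corpus and a Hugging Face tokenizer (the tokenizer name is only a placeholder; the real pipeline may differ):

```python
# Sketch only: tokenize raw stories, join them into one long token stream,
# and cut the stream into fixed-length causal-LM blocks (labels == inputs).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

def make_blocks(texts, block_len=8192):
    stream = []
    for text in texts:
        stream.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
        stream.append(tokenizer.eos_token_id)  # mark the document boundary
    return [
        {"input_ids": stream[i:i + block_len], "labels": stream[i:i + block_len]}
        for i in range(0, len(stream) - block_len + 1, block_len)
    ]
```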

I will likely rerun this on the newer command-r:32b, which has GQA, if that is any help at all, but anything smaller is likely to not be much use I'm afraid :/

It's also pretty important to use instruct models that haven't been trained on "filtered" data, or else they are completely clueless about writing and/or authors' styles... Sadly, this excludes some of the recent "smart" long-context models like llama-3:70b and qwen-2.5:72b, and really only leaves:

  • command-r:35b.
  • command-r:32b (has more GPTisms, but at least uses GQA so not as much VRAM is needed).
  • command-r-plus:104b (but the newer version seems to have more GPTisms and might not be great).
  • mistral-large-2:123b.

There is some possibility of using miqu-1:70b and wizard-lm-2:8x22b too, but I am going to concentrate on the command-r, command-r-plus and mistral-large-2 models first...

I hope to perfect this method before moving on from command-r:35b though, and have creative-writer-v0.2-35b training now using a 60% larger dataset and hopefully better / more entropy-inducing hyper-parameters (it will be ready in around 6 days from now).

Oh, the new Command-R would be awesome for me (selfishly); it's in the sweet spot for those of us with 24GB VRAM. Thank you for all the work you do, mate!

Thanks! It will likely be sometime in mid-November, depending on how well the v0.2 experiment(s) go: I'm trying to push the method used in creative-writer-v0.1-bravo-35b higher and higher until it breaks the model, before moving on to new models.

Why not Qwen 32B base, or a finetune from the base model like EVA?

https://huggingface.co/EVA-UNIT-01/EVA-Qwen2.5-32B-v0.0

The base model feels much better at raw continuation than the instruct one to me, and is also better above 32K.

And +1 for trying this on Command-R-2024.

The aggressive GQA helps a ton at long context. And it should make training faster or allow longer context too, right?
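
A rough back-of-the-envelope comparison of the KV-cache sizes makes the point; the layer/head counts below are my best guess at the two Command R configs and may be slightly off:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# command-r:35b (MHA: every attention head keeps its own K/V)
print(kv_cache_gib(n_layers=40, n_kv_heads=64, head_dim=128, seq_len=32_768))  # ~40 GiB at fp16
# command-r:32b (GQA: 8 shared K/V head groups)
print(kv_cache_gib(n_layers=40, n_kv_heads=8, head_dim=128, seq_len=32_768))   # ~5 GiB at fp16
```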

Yeah, I will look at trying the same method on base models later too, but for now I'm really trying to perfect the "Instruct-Storywriter" method on the down_proj matrices only, as a complement to the control-vector stuff.
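
In case "the down_proj matrices only" isn't clear, here is a hypothetical illustration of restricting training to those modules using a peft-style LoRA config (the actual creative-writer runs may train the full down_proj weights directly rather than adapters):

```python
# Sketch only: freeze everything except LoRA adapters on the MLP down_proj matrices.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01")

config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["down_proj"],  # only the MLP down-projection matrices
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports that only the down_proj adapters are trainable
```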
