Great work

#3
by maldv - opened

This is a really fine Nemotron tune. Great job. It is a slight bit dumber, but I can see how it kept most of the Nemotron flow and language, and doesn't suffer as much from Nemotron's overuse of headers and bullet lists.

What context length did you train at?

Thanks! 8K.

It is a slight bit dumber,

It's pretty stupid if I am honest. Fails at basic logic right out of the gate. (Example: a character suggests making lunch, ignores the grocery bags and walks into the kitchen, then immediately returns with lunch ready.)

My usage is probably not comparable (as it seems to be quite non-standard), but in my case, all issues (surprisingly dumb, repetitions, confusion) went away once I stopped using the llama3 prompt format and switched to Alpaca or something I simply made up - pure text completion. It probably won't work well as a chat model anymore, but maybe it's worth a try when you want to get a story out of it.
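
For illustration, the difference is roughly this (a minimal sketch; the template strings are from memory, so double-check them against the model's tokenizer config before relying on them):

```python
# Llama 3 instruct format - what I stopped using (special tokens from memory):
llama3_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Continue the story.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Alpaca-style format - one alternative that worked better for me:
alpaca_prompt = (
    "### Instruction:\nContinue the story.\n\n"
    "### Response:\n"
)

# Or pure text completion: no template at all, the model just continues
# whatever raw text it is given.
plain_prompt = "The rain had not let up since morning. She"
```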

I am going to try that out.

The reason I say it might be my usage is that I see this effect with other models too, just not as pronounced as with this one. Anyways, I'd be happy to hear if it worked for you or not.

Tried it with Alpaca, it's still extremely stupid. Anything else you can suggest?

No - normally I would say: bad model, or really bad sampler settings. Though - it's the best performing model for me at the moment. Maybe it's just a lucky coincidence that this model works so well for my prompt style.

I'd say it might not be the model for you then, move on, find something that works better for you :)

I am using the sampler settings provided in the model card. Would you mind sharing your prompt? (If it isn't too personal. Perhaps redact it and share its structure.)
The base model Nemotron is infinitely smarter and I am sure I must be doing something wrong.
Could @crestf411 's GGUF upload be faulty? There are two different ones with the same size and slightly different names in his GGUF upload.

I am always on the hunt for models for roleplaying. So far I am still "stuck" with Hanami, occasionally trying Sao10k's newest ones.

Would you mind sharing your prompt?

It's very different for every session, but it basically goes something like this every time, with lots of variation and additions, depending on the subject:

I need you, the narrator, to narrate a story, according to my instructions. The scenario is entirely fictional. Story development is slow, total length will be a small novella. Other than the setting, this is not a fantasy story, and people should act and react in a realistic way. 

Instruction syntax example: "this is direct speech", this is story text [and this is an instruction to the narrator].

I want you to take my instructions and repeat them in your own words, more refined, but not cheap or poetic. Understand that the instructions are not part of the story directly.

Then I have some basic world description, some setting, some style instructions etc.

Then I use prompts like \ninstructions:\n and \nstory:\n or narrator, or sometimes in uppercase etc.

instructions might look like this:

"Let's go in!"
He enters the house.
He enters the house. [what might he see inside, given the setting?]
He enters the house because of ...?
He enters the house. [Blabla greets him, and they discuss their common acquaintances]
He enters the house. "Who have we got here?" [make it funny, remember a previous encounter and refer to it]
[He enters the house, which is dark and foreboding, but actually empty. Then make him leave in frustration. Make sure the reader understands why he is frustrated.] "Why did I go here?"
[ooc: explain his state of mind. how did the events shape his view of the house before he enters?]

This style works astonishingly badly with most models - very often, the model gets confused when there are more than two characters around, or the instructions and narration get mixed up with each other.
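
To make that concrete, here is a rough sketch of how one such turn could be assembled before being sent to a plain text-completion endpoint (the wording and variable names are made up for the example):

```python
# Illustrative only - headers and wording vary from session to session.
preamble = (
    "I need you, the narrator, to narrate a story, according to my instructions. "
    "The scenario is entirely fictional.\n\n"
)

story_so_far = "He parked the car and looked up at the old house.\n"

instruction = (
    '"Let\'s go in!"\n'
    "He enters the house. [what might he see inside, given the setting?]\n"
)

# Pure text completion: everything is concatenated into one raw prompt,
# and the model is asked to continue after the final "story:" header.
prompt = (
    preamble
    + "story:\n" + story_so_far
    + "\ninstructions:\n" + instruction
    + "\nstory:\n"
)
```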

The base model Nemotron is infinitely smarter and I am sure I must be doing something wrong.

It might be smarter, but also too inflexible, imho.

Could @crestf411 's GGUF upload be faulty?

Can't seem to find them - where did you get those? But it's generally quite unlikely, as it is pretty hard to fuck up the process. The only differences would be if and what imatrix training set was used, which is less of an issue for Q4 and up.

I am always on the hunt for models for roleplaying.

Well, I don't think I do a lot of (any?) roleplaying, at least not in the traditional sense.

It's very different for every session, but it basically goes something like this every time, with lots of variation and additions, depending on the subject: [...]

Thanks for that.

It might be smarter, but also too inflexible, imho.

Agreed, it seems very rigid.

Can't seem to find them - where did you get those? But it's generally quite unlikely, as it is pretty hard to fuck up the process. The only differences would be if and what imatrix training set was used, which is less of an issue for Q4 and up.

https://huggingface.co/crestf411/L3.1-nemotron-sunfall-v0.7.0-q4_k_m This is it. His own uploads.

Since these are bfloat16, if you convert to float16 there used to be an issue with precision causing truncation. I don't know if it is still an issue, but I always quant through f32 instead of an f16 intermediary.
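
What "quanting through f32" amounts to, roughly (just a sketch; the repo id is a placeholder, and the actual GGUF conversion and quantization still happen afterwards with llama.cpp's tools):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder repo id - substitute the actual full-precision source model.
model = AutoModelForCausalLM.from_pretrained(
    "some-user/some-bf16-model",
    torch_dtype=torch.float32,  # upcast the bf16 weights to f32 on load
)
model.save_pretrained("model-f32")  # then convert/quantize this f32 copy
```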

Actually, f16 has higher precision than bf16 - the problem is that some models have out-of-range values or overflow (bf16 has far lower precision, but higher range), which is a model problem. Precision/truncation isn't really the issue, otherwise Q8_0 would be worse than it is.

It's the conversion process. If the model weights are stored in BF16 and you convert to F16 as an intermediary, it loses the extra range of BF16, leading to nil dropouts.

Yes, as I said, bf16 has lower precision than f16 - it's not the precision, but overflows that are a problem. These will not happen with normal models and are a defect in the model. It has nothing to do with the conversion process, but is an inherent property of f16/bf16 - bf16 trades higher range (good for training) for lower precision.
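
A quick way to see the range-vs-precision point (PyTorch just for illustration):

```python
import torch

# f16 tops out around 65504, bf16 around 3.4e38.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38

# A bf16 value outside the f16 range overflows to inf when downcast...
big = torch.tensor([1e38], dtype=torch.bfloat16)
print(big.to(torch.float16))  # tensor([inf], dtype=torch.float16)

# ...while an ordinary weight in [-1, 1] survives the same cast without issue
# (only bf16's own coarse rounding applies).
w = torch.tensor([0.3137], dtype=torch.bfloat16)
print(w.to(torch.float16))
```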

In any case, I tend to use whatever heuristic llama uses, which for this model, would convert to f16 first. Current llama checks for out of range values, so converting to f32 would be entirely unnecessary - all you get is no erroring out on bad models.

The relevant point is that whatever problem would affect crestf411's quants would be present in mine as well. Since his quants are non-imatrix quants, they should be identical to my static ones (modulo metadata, but I haven't checked that).

When you say 'bad models', what do you mean? If a model is stored in "torch_dtype": "bfloat16", it is because that is the datatype the parameters are stored in. By converting to F16, all that can occur is the loss of the larger exponent range, since there is no additional precision to be had; but I have had many issues when experimenting with my earliest tensor merging (and quanting) where I would run into NaN values due to overflow. These models weren't bad per se, but they were generally PEFT.

Since then I have always used F32 as my intermediary.

Weights should generally be in the range of -1..+1, and f16 can represent that well. If a value doesn't fit in f16 range, it means it's larger in magnitude than ~60000. Surely that is a bad model? It just has broken values that might cause it to overflow, and it also forces calculations to be done in f32 (or bf16), so backends that internally work with half floats will have trouble with those models. Clearly, this is a model defect. There is no reason to have such weights in the first place.

bf16 itself has such low precision that the act of using it already quantizes the model to far less precision than f16, probably close to the Q8 range. (In its defense, it has better resolution around very low magnitude weights.)
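
For the precision side, the machine epsilons give the rough numbers (again PyTorch, as an illustration):

```python
import torch

# Relative step size at 1.0: bf16 stores 7 mantissa bits, f16 stores 10.
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125    (~2-3 significant decimal digits)
print(torch.finfo(torch.float16).eps)   # 0.0009765625 (~3-4 significant decimal digits)
```

That per-value step for bf16 is in the same ballpark as an 8-bit quant's, which is what I mean by "close to the Q8 range".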

Sure, to save a model and have fun with it, all sorts of things can be done, and a "bad" (mathematically) model can be "good" (and worthy) for roleplaying, no doubt. But it's best not to make such models in the first place if it can be avoided. For quality reasons, not for purity reasons.

experimenting with my earliest tensor merging (and quanting) where I would run in to NaN values due to overflow.

I generally never force conversion to f32 (but I force mmq on during imatrix generation), and I practically never run into any NaN issues when quantizing (all the ones I ran into over the last months, which were maybe two I can remember out of many thousands, were due to NaNs present in the source weights). So the problem seems to be far overblown. And certainly not worth the loss of precision due to bf16 usage.

Personally, I don't have an issue with it because I don't really see the quality difference between a Q8 quant and an f16 quant - it seems to be close to the noise level, so don't interpret this as saying bf16 should be abolished, or results in bad models :)

But it has always seemed to me that this "I only use intermediates in f32" issue is more of a cargo cult than anything rooted in reality. More specifically, llama.cpp has been pretty broken for most of the last year, and the vast majority of issues are due to llama bugs and not any mythical overflow issues.

To your point, the ones with crazy exponent ranges are usually PEFT, probably with too high of a learning rate. FFTs barely ever have these kinds of issues.
