Please Provide Chat Format and Temp Settings
This is a fantastic model, thanks for doing this!
Unfortunately, after literally tinkering with it for hours, I'm unable to make it stop generating. For the first couple of messages it might work, but then it just keeps going until it hits the 2048 token limit.
Please, if you can, let me know the proper chat format and temp settings. I'm using the Llama 3 Instruct presets for both the Instruct Template and Context Template in SillyTavern, but it's not working. The model I've downloaded is the Q4_K_M quant, run through KoboldCPP with Q8 KV cache. Temp is 0.7, Top K 40. I'm offloading 25 layers to my VRAM (16 GB) and the rest to my system RAM (64 GB).
Hi thanks!
Unfortunately this was done before I realized I could add the chat template to my config file. It is set up as a text completion model, which might explain the issues you are getting. Getting it to work for chat would mean playing with the configs, which you can't manipulate on the GGUF itself.
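For anyone curious, adding the template to the unquantized model before re-converting to GGUF would look roughly like this (just a sketch using transformers; the model paths are placeholders):

```python
# Sketch: copy the Llama 3 Instruct chat template into a merged model's
# tokenizer config before converting to GGUF (model paths are placeholders).
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
merged = AutoTokenizer.from_pretrained("./my-merged-model")

merged.chat_template = base.chat_template     # reuse the Instruct chat template
merged.save_pretrained("./my-merged-model")   # writes it into tokenizer_config.json
```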
A workaround may be to add 'assistant' to your stop sequence, but that is all I can think of. Unfortunately I only use text completion, not chat, so I am not 100% sure.
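If you ever drive KoboldCPP's API directly instead of going through SillyTavern, the same workaround looks roughly like this (a rough, untested sketch; the endpoint and field names follow KoboldCPP's KoboldAI-compatible API, and the prompt is a placeholder):

```python
# Rough sketch: pass "assistant" (and the Llama 3 end-of-turn token) as stop
# sequences when calling KoboldCPP's KoboldAI-compatible generate endpoint.
import requests

payload = {
    "prompt": "...your formatted chat prompt...",
    "max_length": 512,
    "temperature": 0.7,
    "stop_sequence": ["assistant", "<|eot_id|>"],  # generation halts at these strings
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```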
As for the settings, I use the following (with KoboldCPP and SillyTavern; not sure if chat mode uses different samplers):
[screenshot of sampler settings]
Thanks a lot for your detailed reply! I will use your suggestions; the screenshot helps a ton! I am using it as a text completion model, haha. I do have Llam@ception 1.5, so I will download the version you're using to see if it's better. I will report back soon. :D
Here are my settings: at temp 1 it's talking kind of flowery and medieval-like, and at temp 0.7 it's really getting defensive while calling me defensive and yapping a lot, hahah. Please let me know if I'm doing this right. Llam@ception 1.5 seems to be working; it's not generating nonstop until the token limit anymore.
Awesome! Personally the only samplers I ever mess around with are Temp (between 0.7 and 1.1) and Min P (between 0.01 and 0.05), depending on the model of course. The rest are usually safe to leave as is. As for the meme samplers like Smoothing and XTC, I only use DRY as you have it there.
As for what they do: lowering Temp makes the model's replies more deterministic, while raising it makes the model more creative when it's choosing its next token. Min P, on the other hand, gets more creative the lower you set it, since it keeps more low-probability tokens in the running.
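If it helps to see it concretely, here is a toy sketch (not any backend's actual code) of how Temp and Min P reshape the next-token distribution: temperature rescales the logits, and Min P drops every token whose probability is below min_p times the top token's probability:

```python
# Toy illustration of Temp + Min P sampling (not any backend's real code):
# temperature flattens or sharpens the distribution, and Min P discards tokens
# whose probability falls below min_p * (probability of the top token).
import numpy as np

def sample_step(logits, temperature=0.9, min_p=0.05, rng=np.random.default_rng()):
    scaled = np.asarray(logits, dtype=float) / temperature  # higher temp -> flatter distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()        # lower min_p keeps more candidates in play
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# With min_p=0.05, only tokens at least 5% as likely as the best one can be picked.
print(sample_step([2.0, 1.5, 0.2, -1.0]))
```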
Bonus: If you like this model you should try some of my mainline models, I'll link some below:
Tarek07/Legion-V2.1-LLaMa-70B
Tarek07/Dungeonmaster-V2.2-Expanded-LLaMa-70B
Tarek07/Dungeonmaster-V2.4-Expanded-LLaMa-70B
Thanks for reaching out and sharing! Have fun!
Thanks for getting back again! Makes sense, I'll play with Min P more now! I'll check your other models out as well, thanks for listing them. I actually came to your model because it's currently ranked on the UGI leaderboard (https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard) as the highest "Willingness to answer" (9.5) model with over 50% total score. Anyways, feel free to close this thread if you want, as this is practically resolved now; the model is able to stop when it wants to. Thanks again for making this model!
Yeah, the GGUF quants of this model appear to have an issue with printing the stop sequence flag. It also weighs the character card very heavily and will often ignore the system prompt if the character card is very detailed. For some odd reason it even tries to play out the entire character card in one sitting, unless the card itself tells it to give the user a chance to react at each stage.
I know this model is acting wack when it keeps telling me it wants to end the damn post but can't bring itself to do it, so it throws up its hands and just carries on before eventually going in circles. The model is amazing otherwise, though.
The model does have issues stopping without some workarounds (especially in chat format). As for the system prompt, I suspect it's a side effect of the R1 model; the R1 models want the prompt to come from {user}, not {system}. That's why it seems more skewed towards the character card over the system prompt. I have heard others have had success moving their system prompt under user, but I have not tried it myself.
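If anyone wants to try that, the idea is roughly this (an illustrative sketch, not tied to any specific front-end; the prompt text is made up):

```python
# Illustrative sketch of the "system prompt under user" trick: drop the system
# role entirely and prepend the instructions to the first user message.
# The prompt text here is made up; adapt it to your own card/front-end.
system_prompt = "You are {{char}}. Stay in character and keep replies concise."
first_user_message = "Hello!"

messages = [
    # no {"role": "system", ...} entry at all
    {"role": "user", "content": system_prompt + "\n\n" + first_user_message},
    {"role": "assistant", "content": "..."},  # subsequent turns continue as normal
]
```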