Triangle104/GLM4-9B-Neon-v2-Q8_0-GGUF
This model was converted to GGUF format from allura-org/GLM4-9B-Neon-v2
using llama.cpp via ggml.ai's GGUF-my-repo space.
Refer to the original model card for more details on the model.
RP finetune of GLM-4-9B-0414. Feels nice, lots of personality, if a bit quirky sometimes. Nice prose, not too Claude-ish or Gemini-ish. Doesn't seem to like overly long system prompts or character cards, though. Seems to like JSON-formatted system prompts.
Model was trained by Auri.
Training notes
Model was trained on a dataset consisting of 77M tokens of synthetic RP and short story gen data for one epoch. Training took around 11 hours on a 2x RTX 3090 workstation, generously provided by OwenArli. Went with sane defaults for the training config: QLoRA plus CCE for a nice chunk of memory-usage optimization; a 16k context fit on 48GB nicely with some room to spare. I seem to have a problem with Eval/Loss being broken (not sure why); otherwise it trained smoothly.
Huge thanks to ArliAI for providing compute and collaborating on this run!
Format
Model responds to GLM4 instruct formatting, exactly like its base model. Backends struggle to add the BOS token automatically, so you'll need to add it yourself. The Jinja template should work for chat completions.
[gMASK]<|system|>
{system_prompt}<|user|>
{prompt}<|assistant|>
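For chat completions against a llama.cpp server, the server applies the chat template itself, so you only send plain messages. A minimal sketch, assuming llama-server (started as shown further down) is listening on the default port 8080 and that the GGUF carries the GLM4 chat template:

# query the OpenAI-compatible chat endpoint; the GLM4 template is applied server-side
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a roleplay partner."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 1.0
  }'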
Recommended Samplers
Nothing special, just classics.
Temperature - 1
Min-P - 0.1
Repetition Penalty - 1.03
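As a sketch, these settings map onto llama.cpp's standard sampling flags like this (--temp, --min-p and --repeat-penalty; other backends expose the same samplers in their own UIs):

# Q8_0 GGUF from this repo with the recommended samplers
llama-cli --hf-repo Triangle104/GLM4-9B-Neon-v2-Q8_0-GGUF --hf-file glm4-9b-neon-v2-q8_0.gguf \
  --temp 1.0 --min-p 0.1 --repeat-penalty 1.03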
Example master import for SillyTavern (using Shingane-v1 system prompt by Steelskull)
Running on KoboldCPP and other backends
To run the GGUFs correctly, you need the most recent version of KoboldCPP, and to pass --overridekv glm4.rope.dimension_count=int:64 to the CLI command or put glm4.rope.dimension_count=int:64 into the overridekv box in the GUI (under the Tokens tab at the very bottom).
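For example, a hypothetical local launch (assuming you start KoboldCPP from its Python script and have the Q8_0 GGUF downloaded next to it; adjust the path and add your usual GPU/context flags):

# pass the metadata override so the RoPE dimensions are read correctly
python koboldcpp.py --model glm4-9b-neon-v2-q8_0.gguf --overridekv glm4.rope.dimension_count=int:64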
Thanks to DaringDuck and tofumagnate for the info on how to apply this fix.
To run this model on vLLM, you'll need to build it from source from the git repo, since full GLM4 support hasn't reached a release yet.
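A build-from-source sketch for vLLM (standard editable install from the repo; exact steps may change once full GLM4 support lands in a release):

# clone vLLM and install it from source (compiles kernels, takes a while)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .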
ExLLaMAv2- and v3-based backends, such as TabbyAPI, should support the model out of the box.
The latest versions of the llama.cpp server should also run these GGUFs out of the box.
Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)
brew install llama.cpp
Invoke the llama.cpp server or the CLI.
CLI:
llama-cli --hf-repo Triangle104/GLM4-9B-Neon-v2-Q8_0-GGUF --hf-file glm4-9b-neon-v2-q8_0.gguf -p "The meaning to life and the universe is"
Server:
llama-server --hf-repo Triangle104/GLM4-9B-Neon-v2-Q8_0-GGUF --hf-file glm4-9b-neon-v2-q8_0.gguf -c 2048
Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.
Step 1: Clone llama.cpp from GitHub.
git clone https://github.com/ggerganov/llama.cpp
Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with other hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).
cd llama.cpp && LLAMA_CURL=1 make
Step 3: Run inference through the main binary.
./llama-cli --hf-repo Triangle104/GLM4-9B-Neon-v2-Q8_0-GGUF --hf-file glm4-9b-neon-v2-q8_0.gguf -p "The meaning to life and the universe is"
or
./llama-server --hf-repo Triangle104/GLM4-9B-Neon-v2-Q8_0-GGUF --hf-file glm4-9b-neon-v2-q8_0.gguf -c 2048