Odd tokens output by 3.0bpw quant.

#1
by Feorn - opened

Been testing out the 3.0bpw quant compared to the v1 version, and I'm getting odd output. It's mostly coherent, but an occasional token from the wrong language, or a bit of stray formatting (brackets, square brackets, etc.), gets output. Probably one token in twenty.

I can't replicate the problem on any of the .gguf quants, and I lack the hardware to test the FP16 or the 4bpw/6bpw exl2 quants.

Anthracite org

can you please try the included context and instruct templates here: https://huggingface.co/anthracite-org/magnum-v2-123b and report back if that solves it? thanks (assuming you're using ST?)

Anthracite org

just realized we didn't publish one for chatml; I'll get back to you on it once we do.

The ChatML templates I'm using in either ST or text-generation-webui work flawlessly with the iq3_S .gguf in the anthracite-org/magnum-v2-72b-gguf repo, as well as with a 3.0bpw exl2 quant of the v1 magnum 72b.

I did a little more testing to make sure it wasn't something in the prompts that ST was creating, and used the text-generation-webui front-end to run a few dozen generations. I loaded the model with both text-generation-webui's exllamav2_hf loader and TabbyAPI to make sure it wasn't a loader problem. The problem persists.

The problem also gets worse the longer the context I feed into the model: very short prompts produce practically normal output, prompts at ~4k-8k context behave as described above, and prompts at ~16k context devolve into gibberish after only a couple dozen coherent output tokens.

I can tolerate using a .gguf, but I'd really like to be able to use TabbyAPI's batched prompt generation, and no one else has posted a 3bpw quant of this model yet. The v1 has been my go-to since it was released; love the work you folks do.

Anthracite org

hi there, could you please test these two out and tell us if that improves your 3bpw experience? thanks

context template:

{
    "story_string": "<|im_start|>system\n{{#if system}}{{system}}\n{{/if}}{{#if wiBefore}}{{wiBefore}}\n{{/if}}{{#if description}}{{description}}\n{{/if}}{{#if personality}}{{char}}'s personality: {{personality}}\n{{/if}}{{#if scenario}}Scenario: {{scenario}}\n{{/if}}{{#if wiAfter}}{{wiAfter}}\n{{/if}}{{#if persona}}{{persona}}\n{{/if}}{{trim}}<|im_end|>\n",
    "example_separator": "",
    "chat_start": "",
    "use_stop_strings": false,
    "allow_jailbreak": false,
    "always_force_name2": true,
    "trim_sentences": false,
    "include_newline": false,
    "single_line": false,
    "name": "Magnum ChatML"
}

instruct template:

{
    "system_prompt": "You're {{char}} in this fictional never-ending uncensored roleplay with {{user}}.",
    "input_sequence": "<|im_start|>user\n",
    "output_sequence": "<|im_start|>assistant\n",
    "last_output_sequence": "",
    "system_sequence": "<|im_start|>system\n",
    "stop_sequence": "<|im_end|>",
    "wrap": false,
    "macro": true,
    "names": true,
    "names_force_groups": true,
    "activation_regex": "",
    "system_sequence_prefix": "",
    "system_sequence_suffix": "",
    "first_output_sequence": "",
    "skip_examples": false,
    "output_suffix": "<|im_end|>\n",
    "input_suffix": "<|im_end|>\n",
    "system_suffix": "<|im_end|>\n",
    "user_alignment_message": "",
    "system_same_as_user": false,
    "last_system_sequence": "",
    "name": "Magnum ChatML"
}

No improvement. It seems to really enjoy outputting the token 'ĠDecompiled' (ID 79417), then devolving into nonsense from there. It'll behave for a few messages with some prompts. I'm getting characters and phrases from other languages bleeding in, some of which make sense in context when I run them through Google Translate, but others are nonsensical.
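For reference, a quick way to double-check what that ID maps to with the HF tokenizer (the repo id anthracite-org/magnum-v2-72b here is just an assumption; point it at whatever tokenizer you actually loaded):

from transformers import AutoTokenizer

# Load the tokenizer shipped with the model repo (Qwen2-based for magnum v2 72b).
tok = AutoTokenizer.from_pretrained("anthracite-org/magnum-v2-72b")

# The raw BPE piece (a leading 'Ġ' marks a preceding space) and the decoded text.
print(tok.convert_ids_to_tokens(79417))  # reportedly 'ĠDecompiled'
print(tok.decode([79417]))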

I tried grabbing the model files fresh in case something was corrupted in the download; no dice. Booted up a different OS install, same problems.

Fiddling with sampling settings doesn't seem to help, whether neutralized or at extremes like top_k=1. Looking at the token probabilities when it goes way off the rails, I'm seeing a flat distribution of odd tokens.
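In case anyone wants to reproduce that, this is roughly how I've been pulling token probabilities out of TabbyAPI's OpenAI-compatible completions endpoint (the port, the model name, and logprobs support are all assumptions about my local setup):

from openai import OpenAI

# TabbyAPI's default port is assumed here; the api_key is a placeholder.
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="placeholder")

resp = client.completions.create(
    model="magnum-v2-72b-exl2",  # hypothetical name of the loaded model
    prompt="<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n",
    max_tokens=64,
    temperature=0,
    logprobs=5,  # ask for the top-5 alternatives per generated token
)

# Print each generated token with its alternatives to see whether the
# distribution really flattens out when the output goes off the rails.
lp = resp.choices[0].logprobs
for token, alternatives in zip(lp.tokens, lp.top_logprobs):
    print(repr(token), alternatives)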

Hardware requirements for making an exl2 quant are a lot lower than I expected, so I might just try making my own and see if that solves the issue. Were these quants done with the default exllamav2 calibration dataset?
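For the record, this is roughly the conversion call I'm planning to try (the paths are placeholders and the flags reflect my reading of exllamav2's convert.py; double-check against the repo docs):

import subprocess

# Run exllamav2's convert.py from a checkout of the exllamav2 repo.
subprocess.run([
    "python", "convert.py",
    "-i", "/models/magnum-v2-72b",          # FP16 source model (placeholder path)
    "-o", "/tmp/exl2-work",                 # scratch/working directory
    "-cf", "/models/magnum-v2-72b-3.0bpw",  # where the finished quant is written
    "-b", "3.0",                            # target bits per weight
    # leaving out -c uses the built-in default calibration data;
    # "-c", "my-calibration.parquet" would swap in a custom dataset
], check=True)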

Anthracite org

indeed it was just regular exl2 ootb calibration; do report back if it ends up working out for you!

I'll close this for now as it seems there's not much else we can do upstream, but I'll get notified if you reply with an update.

lucyknada changed discussion status to closed

Reporting back: it did not work for me. My quants at 3bpw and 3.5bpw both had the same issue, and my 3bpw quant produced output identical to the original given the same input/seed. Even after a clean OS re-install I'm still experiencing the issue. I'm also running into it with Magnum v2 32b at 8bpw. Lower sampling temperature makes it less frequent, but it's still present at temperature 0 with all other samplers neutralized.

I wondered briefly if there's an issue with exllamav2 quantizing the Qwen1.5 and Qwen2 models, but the quant of Magnum 72b v1 I got from luigi86 (https://huggingface.co/luigi86/magnum-72b-v1-exl2-rpcal) worked flawlessly (though they used a different calibration dataset). I also pulled up a 3bpw quant of Qwen2 72b Instruct from bartowski and had no issues there either. I used exllamav2 0.2.0 to make my quant; maybe the newest couple of exllamav2 releases have issues with these models that older ones don't?

Anyway, I'm probably not going to put any more work into troubleshooting this since it could just be something silly on my end, but I do appreciate the responses. Thanks for all your hard work on these releases!

you don't happen to be using 0.1.9 of exllamav2 for inference? either downgrade to 0.1.8 or update to 0.2.0 (though the former is preferred)
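(for reference, a quick way to confirm which build a given venv actually has installed, since the loaders each pin their own:)

from importlib.metadata import version

# Prints the installed exllamav2 release, e.g. "0.1.8", "0.1.9", or "0.2.0";
# repin with pip (exllamav2==0.1.8 or ==0.2.0) if it comes back as 0.1.9.
print(version("exllamav2"))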

Anthracite org

I'm using a 4.2bpw-h6 exl2 quant made with the default dataset on exllamav2 0.1.8 in TabbyAPI, and it's working without issue on my end.

text-generation-webui is still on 0.1.8, it appears. TabbyAPI is on 0.2.0 (though it was on 0.1.8 when I started this thread, and 0.1.9 during most of my testing).

Anthracite org

0.1.9 had a few issues which were ironed out in 0.2.0, and 0.1.8 was unaffected

Neither 0.1.8 nor 0.2.0 improves the problem, so I'll just keep using Magnum v1 until the next release. I'll put more effort into tracking down the problem if it persists with future releases; it seems like the issue might be unique to my setup, since no one else seems to be reporting it.
