πŸ™‹πŸ»β€β™‚οΈ gemlite hqq quant version "unlearns" tool calling (?)

#2
by Tonic - opened

https://huggingface.co/spaces/Tonic/gemlite-llama3.1

my demo of this model; am I doing something wrong? It seems like it doesn't do "tool calling"

"coding" works though

Mobius Labs GmbH org

There's no training in hqq, it's a one-shot quantization, so this shouldn't happen. Do you have an example to reproduce your issue?
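
(Side note, not part of the original reply: "one-shot" here means the weights are quantized directly at load time, with no training or calibration pass. A rough sketch of that path using transformers' HqqConfig, with illustrative nbits/group_size values chosen to match the a16w4_gs_128 naming of the prequantized checkpoint:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# Illustrative one-shot HQQ quantization of the base model: weights are quantized
# layer by layer at load time, no training or calibration data involved.
quant_config = HqqConfig(nbits=4, group_size=128)

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base checkpoint, for illustration only
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)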

Mobius Labs GmbH org
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig 

model_id = "mobiuslabsgmbh/Llama-3.1-8B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

# Load the prequantized GemLite/HQQ 4-bit checkpoint (quantization already applied one-shot)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    device_map='cuda', 
    attn_implementation="sdpa",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model.config.use_cache = True
model.generation_config.cache_implementation = "static"  # static KV cache for generation

# Minimal manual tool-calling check: the system prompt tells the model to answer
# time questions with a fixed JSON object, which is then looked for in the output.
def ask_me(user_prompt, max_new_tokens=128):

    def get_current_time():
        from datetime import datetime
        return datetime.now().isoformat()

    tool_json = '{"tool_name": "get_current_time", "parameters": {}}'
    chat = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant that can call tools."
                "If you're asked about what time it is right now, respond with a JSON object like:\n"
                '{"tool_name": "get_current_time", "parameters": {}}'+"\n"
                "Otherwise, process the prompt as if there is no tool calling."
            ),
        },
        {
            "role": "user",
            "content": user_prompt,
        },
    ]

    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

    out = model.generate(
        **tokenizer([prompt], return_tensors="pt").to(model.device),
        max_new_tokens=max_new_tokens,
        cache_implementation="static",
        pad_token_id=tokenizer.pad_token_id,
    )

    #print(tokenizer.decode(out[0], skip_special_tokens=True))

    # Split on the tool-call JSON: an empty chunk means the model emitted exactly that JSON,
    # so execute the tool; otherwise print the model's text as-is.
    out = tokenizer.decode(out[0], skip_special_tokens=True).split('assistant')[-1].split(tool_json)
    for chunk in out:
        if(len(chunk) == 0):
            print(get_current_time())
        else:
            print(chunk)


ask_me("What time it is right now?") #2025-07-25T15:19:34.061615
ask_me("Write an essaye about artificial intelligence.") #Artificial intelligence (AI) has become a ubiquitous...

Llama 3.1 8B is not good with tool calling anyway, so you might need to prompt it a bit. Better to use another model if you want to do tool calling, thread here
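
(For reference, one way to "prompt it a bit" is to lean on the tokenizer's built-in tool support in recent transformers versions: Llama 3.1's chat template can render a tool schema from a plain Python function. A rough sketch, reusing the model/tokenizer loaded above:)

from datetime import datetime

def get_current_time():
    """Get the current time as an ISO-8601 string."""
    return datetime.now().isoformat()

chat = [{"role": "user", "content": "What time is it right now?"}]

# The `tools=` argument builds a tool schema from the function signature and docstring
# and injects it using Llama 3.1's native tool-calling prompt format.
prompt = tokenizer.apply_chat_template(
    chat,
    tools=[get_current_time],
    tokenize=False,
    add_generation_prompt=True,
)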

With all my thanks for the information; I took note and will do more research on the "context" next time.

Loving reading your work btw, super cool stuff!

Tonic changed discussion status to closed
