πŸ™‹πŸ»β€β™‚οΈ gemlite hqq quant version "unlearns" tool calling (?)

#2
by Tonic - opened

https://huggingface.co/spaces/Tonic/gemlite-llama3.1

my demo of this model; am I doing something wrong? It seems like it doesn't do "tool calling"

"coding" works though

Mobius Labs GmbH org

There's no training in hqq, it's a one-shot quantization, so this shouldn't happen. Do you have an example to reproduce your issue?
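
(Side note, not part of the original reply: "one-shot" here means the weights are quantized directly at load time, with no training or calibration pass. A rough sketch of that path using transformers' HqqConfig, with illustrative nbits/group_size values chosen to match the a16w4_gs_128 naming of the prequantized checkpoint:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# Illustrative one-shot HQQ quantization of the base model: weights are quantized
# layer by layer at load time, no training or calibration data involved.
quant_config = HqqConfig(nbits=4, group_size=128)

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base checkpoint, for illustration only
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)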

Mobius Labs GmbH org
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig 

model_id = "mobiuslabsgmbh/Llama-3.1-8B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

# Load the prequantized GemLite/HQQ 4-bit checkpoint (quantization already applied one-shot)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    device_map='cuda', 
    attn_implementation="sdpa",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model.config.use_cache = True
model.generation_config.cache_implementation = "static"  # static KV cache for generation

# Minimal manual tool-calling check: the system prompt tells the model to answer
# time questions with a fixed JSON object, which is then looked for in the output.
def ask_me(user_prompt, max_new_tokens=128):

    def get_current_time():
        from datetime import datetime
        return datetime.now().isoformat()

    tool_json = '{"tool_name": "get_current_time", "parameters": {}}'
    chat = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant that can call tools."
                "If you're asked about what time it is right now, respond with a JSON object like:\n"
                '{"tool_name": "get_current_time", "parameters": {}}'+"\n"
                "Otherwise, process the prompt as if there is no tool calling."
            ),
        },
        {
            "role": "user",
            "content": user_prompt,
        },
    ]

    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

    out = model.generate(
        **tokenizer([prompt], return_tensors="pt").to(model.device),
        max_new_tokens=max_new_tokens,
        cache_implementation="static",
        pad_token_id=tokenizer.pad_token_id,
    )

    #print(tokenizer.decode(out[0], skip_special_tokens=True))

    # Split on the tool-call JSON: an empty chunk means the model emitted exactly that JSON,
    # so execute the tool; otherwise print the model's text as-is.
    out = tokenizer.decode(out[0], skip_special_tokens=True).split('assistant')[-1].split(tool_json)
    for chunk in out:
        if(len(chunk) == 0):
            print(get_current_time())
        else:
            print(chunk)


ask_me("What time it is right now?") #2025-07-25T15:19:34.061615
ask_me("Write an essaye about artificial intelligence.") #Artificial intelligence (AI) has become a ubiquitous...

Llama 3.1 8B is not good with tool calling anyway, so you might need to prompt it a bit. Better to use another model if you want to do tool calling, thread here
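
(For reference, one way to "prompt it a bit" is to lean on the tokenizer's built-in tool support in recent transformers versions: Llama 3.1's chat template can render a tool schema from a plain Python function. A rough sketch, reusing the model/tokenizer loaded above:)

from datetime import datetime

def get_current_time():
    """Get the current time as an ISO-8601 string."""
    return datetime.now().isoformat()

chat = [{"role": "user", "content": "What time is it right now?"}]

# The `tools=` argument builds a tool schema from the function signature and docstring
# and injects it using Llama 3.1's native tool-calling prompt format.
prompt = tokenizer.apply_chat_template(
    chat,
    tools=[get_current_time],
    tokenize=False,
    add_generation_prompt=True,
)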

With all my thanks for the information; I took note and will do more research on the "context" next time.

Loving reading your work btw, super cool stuff!

Tonic changed discussion status to closed
