🙋🏻‍♂️ gemlite hqq quant version "unlearns" tool calling (?)
#2
by Tonic · opened
https://huggingface.co/spaces/Tonic/gemlite-llama3.1
This is my demo of this model. Am I doing something wrong? It seems like it doesn't do "tool calling".
"Coding" works, though.
There's no training in HQQ; it's a one-shot quantization, so this shouldn't happen. Do you have an example to reproduce your issue?
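For context, HQQ is applied in one shot at load time; a minimal sketch of what that looks like in transformers (the base checkpoint and the 4-bit / group-size-128 settings here are assumptions inferred from this repo's `a16w4_gs_128` naming):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# HQQ is one-shot: weights are quantized directly at load time,
# with no training or calibration pass that could change model behavior.
quant_config = HqqConfig(nbits=4, group_size=128)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # assumed base checkpoint
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
```

Here's a quick test with this checkpoint; the tool call does trigger on my end: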
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "mobiuslabsgmbh/Llama-3.1-8B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    attn_implementation="sdpa",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use a static KV cache for generation.
model.config.use_cache = True
model.generation_config.cache_implementation = "static"

def ask_me(user_prompt, max_new_tokens=128):
    def get_current_time():
        from datetime import datetime
        return datetime.now().isoformat()

    tool_json = '{"tool_name": "get_current_time", "parameters": {}}'

    chat = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant that can call tools. "
                "If you're asked about what time it is right now, respond with a JSON object like:\n"
                '{"tool_name": "get_current_time", "parameters": {}}' + "\n"
                "Otherwise, process the prompt as if there is no tool calling."
            ),
        },
        {
            "role": "user",
            "content": user_prompt,
        },
    ]

    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

    out = model.generate(
        **tokenizer([prompt], return_tensors="pt").to(model.device),
        max_new_tokens=max_new_tokens,
        cache_implementation="static",
        pad_token_id=tokenizer.pad_token_id,
    )

    # Take everything after the assistant header, then split on the tool-call JSON:
    # an empty chunk means the model emitted the tool call, so run the tool instead.
    out = tokenizer.decode(out[0], skip_special_tokens=True).split("assistant")[-1].split(tool_json)
    for chunk in out:
        if len(chunk) == 0:
            print(get_current_time())
        else:
            print(chunk)

ask_me("What time is it right now?")  # 2025-07-25T15:19:34.061615
ask_me("Write an essay about artificial intelligence.")  # Artificial intelligence (AI) has become a ubiquitous...
```
Llama 3.1 8B is not good at tool calling anyway; you might need to prompt it a bit. Better to use another model if you want to do tool calling: thread here
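As a side note, transformers can render Llama 3.1's native tool-calling prompt format for you instead of a hand-written system prompt; a minimal sketch reusing the `tokenizer` from above (the function and the user prompt are illustrative):

```python
def get_current_time():
    """Get the current time as an ISO-8601 string."""
    from datetime import datetime
    return datetime.now().isoformat()

chat = [{"role": "user", "content": "What time is it right now?"}]

# Passing tools= renders the function's JSON schema into the model's
# native tool-calling template; the function needs a docstring (and type
# hints for any arguments) so its schema can be extracted.
prompt = tokenizer.apply_chat_template(
    chat,
    tools=[get_current_time],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```

Since Llama 3.1 was trained with a specific tool-calling format, prompting it through the template this way tends to work better than ad-hoc JSON instructions in the system prompt.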
With all my thanks for the information; I took note and will do more research on the "context" next time.
Loving reading your work btw, super cool stuff!
Tonic changed discussion status to closed