---
license: cc-by-4.0
datasets:
- Salesforce/xlam-function-calling-60k
base_model: Qwen/Qwen2-7B-Instruct
---

# Hammer-7b Function Calling Model

## Introduction

Hammer-7b is a cutting-edge Large Language Model (LLM) crafted to boost the critical capability of AI agents: function calling. Unlike existing models that focus on refining the training data, Hammer-7b optimizes performance primarily through advanced training techniques.

## Model Details

Hammer-7b is a finetuned model built upon [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct). It is trained on the [APIGen Function Calling Datasets](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) containing 60,000 samples, supplemented by 7,500 irrelevance-detection samples we generated. Employing innovative training techniques such as function masking, function shuffling, and prompt optimization, Hammer-7b achieves exceptional performance across numerous benchmarks, including the [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html), [API-Bank](https://arxiv.org/abs/2304.08244), [Tool-Alpaca](https://arxiv.org/abs/2306.05301), [Nexus Raven](https://github.com/nexusflowai/NexusRaven-V2), and [Seal-Tools](https://arxiv.org/abs/2405.08355).
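This model card does not ship reference code for these training techniques, but the function shuffling and function masking ideas can be illustrated with a small data-augmentation sketch. Everything below is an assumption for illustration only: the sample field names (`tools`, `answers`) mirror the xlam-function-calling-60k layout, and the helper functions are hypothetical, not our actual training code.

~~~python
import random

def shuffle_candidate_tools(sample: dict, rng: random.Random) -> dict:
    # Function shuffling (illustrative): permute the candidate tool list so
    # the model cannot exploit a fixed tool position during training.
    augmented = dict(sample)
    tools = list(sample["tools"])
    rng.shuffle(tools)
    augmented["tools"] = tools
    return augmented

def mask_function_names(sample: dict) -> dict:
    # Function masking (illustrative): replace real tool names with opaque
    # placeholders in both the candidates and the labels, forcing the model
    # to rely on tool descriptions rather than memorized names.
    mapping = {t["name"]: f"func_{i}" for i, t in enumerate(sample["tools"])}
    augmented = dict(sample)
    augmented["tools"] = [dict(t, name=mapping[t["name"]]) for t in sample["tools"]]
    augmented["answers"] = [dict(a, name=mapping.get(a["name"], a["name"]))
                            for a in sample["answers"]]
    return augmented

# Toy training sample in an xlam-style layout (field names are assumptions)
sample = {
    "query": "What's the weather like in Paris?",
    "tools": [
        {"name": "get_current_weather", "description": "Get the current weather"},
        {"name": "get_stock_price", "description": "Get the latest stock price"},
    ],
    "answers": [{"name": "get_current_weather", "arguments": {"location": "Paris"}}],
}
print(mask_function_names(shuffle_candidate_tools(sample, random.Random(42))))
~~~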
## Evaluation

1. First, we evaluate our model on the Berkeley Function-Calling Leaderboard (BFCL), where its performance is as follows:

| Rank | Overall Acc | Model | AST Summary | Exec Summary | Irrelevance | Relevance | Organization | License |
|------|-------------|-------|-------------|--------------|-------------|-----------|--------------|---------|
| 1 | 85.79 | GPT-4-0125-Preview (Prompt) | 85.5 | 89.25 | 61.35 | 97.56 | OpenAI | Proprietary |
| 2 | 85 | GPT-4-1106-Preview (Prompt) | 86.31 | 87.38 | 64.98 | 90.24 | OpenAI | Proprietary |
| 3 | 84.74 | GPT-4-0613 (Prompt) | 84.66 | 87.57 | 75.57 | 82.93 | OpenAI | Proprietary |
| 4 | 83.92 | **Hammer-7b** | 78.7 | 89.71 | 72.87 | 92.68 | MadeAgents | cc-by-nc-4.0 |
| 5 | 83.89 | GPT-4-turbo-2024-04-09 (Prompt) | 85.41 | 88.12 | 61.82 | 82.93 | OpenAI | Proprietary |
| 6 | 83.35 | GPT-4o-mini-2024-07-18 (Prompt) | 80.51 | 87.95 | 79.2 | 80.49 | OpenAI | Proprietary |
| 7 | 83.13 | GPT-4o-2024-05-13 (Prompt) | 83.83 | 85.12 | 77.44 | 78.05 | OpenAI | Proprietary |
| 8 | 82.55 | Functionary-Medium-v3.1 (FC) | 81.06 | 89.32 | 73.23 | 70.73 | MeetKai | MIT |
| 9 | 81.78 | GPT-4-1106-Preview (FC) | 77.95 | 87.61 | 72.7 | 82.93 | OpenAI | Proprietary |
| 10 | 81.59 | Meta-Llama-3-70B-Instruct (Prompt) | 80.15 | 88.04 | 50.47 | 92.68 | Meta | Meta Llama 3 Community |
| 11 | 80.88 | Claude-3-Opus-20240229 (Prompt) | 79.42 | 87.39 | 56.15 | 85.37 | Anthropic | Proprietary |
| 12 | 80.87 | GPT-4-0125-Preview (FC) | 77.02 | 85.3 | 74.03 | 85.37 | OpenAI | Proprietary |
| 13 | 80.23 | Nemotron-4-340b-instruct (Prompt) | 76.67 | 83.38 | 84.1 | 78.05 | NVIDIA | nvidia-open-model-license |
| 14 | 80.21 | Functionary-Small-v3.1 (FC) | 78.64 | 83.45 | 68.36 | 85.37 | MeetKai | MIT |
| 15 | 79.66 | mistral-large-2407 (FC Any) | 85.61 | 88.45 | 0.34 | 100 | Mistral AI | Proprietary |
*Note: The rankings are based on the performance metrics provided.*

2. Next, we assessed the function-calling capabilities of various models, including our own models fine-tuned with both the masked and the non-masked approach. Below are the results across several benchmarks, all obtained from zero-shot evaluations; our model, **Hammer-7b**, outperforms the other models. The table replicates and extends the format of Table 6 (Function Calling Academic Benchmarks) in the ["Granite-Function Calling Model"](https://arxiv.org/abs/2407.00121) paper.
| Model | Size | API-Bank L-1 Func-Name F1 | API-Bank L-1 Args F1 | API-Bank L-2 Func-Name F1 | API-Bank L-2 Args F1 | Tool-Alpaca Func-Name F1 | Tool-Alpaca Args F1 | Nexus Raven Func-Name F1 | Nexus Raven Args F1 | Average Func-Name F1 | Average Args F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Functionary-small-v2.4 | 7B | 78.00% | 70.00% | 54.00% | 45.00% | 88.00% | 47.00% | 82.00% | 64.00% | 75.50% | 56.50% |
| Gorilla-openfunctions-v2 | 7B | 43.00% | 41.00% | 12.00% | 12.00% | 69.00% | 39.00% | 81.00% | 65.00% | 51.20% | 39.30% |
| Hermes-2-Pro-Mistral | 7B | 93.00% | 77.00% | 54.00% | 25.00% | 80.00% | 26.00% | 90.00% | 63.00% | 79.30% | 47.80% |
| Mistral-Instruct-v0.3 | 7B | 79.00% | 69.00% | 69.00% | 46.00% | 33.00% | 33.00% | 71.00% | 54.00% | 63.00% | 50.50% |
| CodeGemma-Instruct | 7B | 77.00% | 57.00% | 59.00% | 38.00% | 59.00% | 31.00% | 84.00% | 68.00% | 69.80% | 48.50% |
| Nexusflow-Raven-v2 | 13B | 51.00% | 42.00% | 28.00% | 22.00% | 85.00% | 37.00% | 92.00% | 75.00% | 64.00% | 44.00% |
| C4AI-Command-R-v01 | 35B | 93.00% | 76.00% | 77.00% | 54.00% | 90.00% | 42.00% | 93.00% | 71.00% | 88.30% | 60.80% |
| Meta-Llama-3-70B-Instruct | 70B | 85.00% | 67.00% | 69.00% | 52.00% | 78.00% | 43.00% | 70.00% | 52.00% | 75.50% | 53.50% |
| GRANITE-20B-FUNCTIONCALLING | 20B | 91.00% | 71.00% | 83.00% | 60.00% | 89.00% | 44.00% | 92.00% | 72.00% | 88.80% | 61.80% |
| xlam-7b-fc-r | 7B | 90.00% | 80.70% | 68.90% | 60.70% | 67.30% | 59.00% | 54.10% | 57.50% | 70.10% | 64.50% |
| **Hammer-7b** | 7B | 93.80% | 85.90% | 79.20% | 64.40% | 82.30% | 59.90% | 92.50% | 77.40% | 86.90% | 71.90% |
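For reference, the Func-Name and Args F1 scores above can be computed roughly as sketched below. This is only an illustration of the matching logic: the exact matching rules differ per benchmark, and the helper names and the exact-match criterion for arguments are our assumptions.

~~~python
from collections import Counter

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def call_f1(predicted: list, gold: list):
    # Func-Name F1: multiset overlap of predicted vs. gold function names.
    pred_names = Counter(c["name"] for c in predicted)
    gold_names = Counter(c["name"] for c in gold)
    name_tp = sum((pred_names & gold_names).values())
    name_f1 = f1(name_tp, sum(pred_names.values()) - name_tp,
                 sum(gold_names.values()) - name_tp)

    # Args F1 (assumed criterion): a call counts as correct only if both the
    # name and the full argument dictionary match a gold call exactly.
    pred_calls = Counter((c["name"], tuple(sorted(c["arguments"].items()))) for c in predicted)
    gold_calls = Counter((c["name"], tuple(sorted(c["arguments"].items()))) for c in gold)
    args_tp = sum((pred_calls & gold_calls).values())
    args_f1 = f1(args_tp, sum(pred_calls.values()) - args_tp,
                 sum(gold_calls.values()) - args_tp)
    return name_f1, args_f1

pred = [{"name": "get_current_weather", "arguments": {"location": "New York, US"}}]
gold = [{"name": "get_current_weather", "arguments": {"location": "New York, US"}},
        {"name": "live_giveaways_by_type", "arguments": {"type": "beta"}}]
print(call_f1(pred, gold))  # ≈ (0.67, 0.67): perfect precision, half recall
~~~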
3. Finally, we evaluate our model on the [Seal-Tools](https://arxiv.org/abs/2405.08355) dataset, where it likewise achieves strong performance:
| Model | Size | Seal-Tools (Single-Tool) Func-Name F1 | Seal-Tools (Single-Tool) Args F1 |
|---|---|---|---|
| Gorilla-openfunctions-v2 | 7B | 93.20% | 91.10% |
| GRANITE-20B-FUNCTIONCALLING | 20B | 94.90% | 92.70% |
| xlam-7b-fc-r | 7B | 79.00% | 76.90% |
| **Hammer-7b** | 7B | 97.40% | 91.70% |
## Requirements

The code for Hammer-7b is supported in the latest Hugging Face `transformers`, and we advise you to install `transformers>=4.37.0`.

## How to Use

This is a simple example of how to use our model.

~~~python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MadeAgents/Hammer-7b"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Please use our provided instruction prompt for best performance
TASK_INSTRUCTION = """You are a tool calling assistant. In order to complete the user's request, you need to select one or more appropriate tools from the following tools and fill in the correct values for the tool parameters. Your specific tasks are:
1. Make one or more function/tool calls to meet the request based on the question.
2. If none of the function can be used, point it out and refuse to answer.
3. If the given question lacks the parameters required by the function, also point it out.
"""

FORMAT_INSTRUCTION = """
The output MUST strictly adhere to the following JSON format, and NO other text MUST be included.
The example format is as follows. Please make sure the parameter type is correct. If no function call is needed, please directly output an empty list '[]'
```
[
    {"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},
    ... (more tool calls as required)
]
```
"""

# Define the input query and available tools
query = "Where can I find live giveaways for beta access and games? And what's the weather like in New York, US?"

live_giveaways_by_type = {
    "name": "live_giveaways_by_type",
    "description": "Retrieve live giveaways from the GamerPower API based on the specified type.",
    "parameters": {
        "type": "object",
        "properties": {
            "type": {
                "type": "string",
                "description": "The type of giveaways to retrieve (e.g., game, loot, beta).",
                "default": "game"
            }
        },
        "required": ["type"]
    }
}
get_current_weather = {
    "name": "get_current_weather",
    "description": "Get the current weather",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
            }
        },
        "required": ["location"]
    }
}
get_stock_price = {
    "name": "get_stock_price",
    "description": "Retrieves the current stock price for a given ticker symbol. The ticker symbol must be a valid symbol for a publicly traded company on a major US stock exchange like NYSE or NASDAQ. The tool will return the latest trade price in USD. It should be used when the user asks about the current or most recent price of a specific stock. It will not provide any other information about the stock or company.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticker": {
                "type": "string",
                "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
            }
        },
        "required": ["ticker"]
    }
}

def convert_to_format_tool(tools):
    """Convert OpenAI-style tool schemas into the flattened format the model expects."""
    if isinstance(tools, dict):
        format_tools = {
            "name": tools["name"],
            "description": tools["description"],
            "parameters": tools["parameters"].get("properties", {}),
        }
        required = tools["parameters"].get("required", [])
        # Mark required parameters and fold any default value into the description
        for param in required:
            format_tools["parameters"][param]["required"] = True
        for param in format_tools["parameters"].keys():
            if "default" in format_tools["parameters"][param]:
                default = format_tools["parameters"][param]["default"]
                format_tools["parameters"][param]["description"] += f" default is '{default}'"
        return format_tools
    elif isinstance(tools, list):
        return [convert_to_format_tool(tool) for tool in tools]
    else:
        return tools

# Helper function to build the input prompt for our model
def build_prompt(task_instruction: str, format_instruction: str, tools: list, query: str):
    prompt = f"[BEGIN OF TASK INSTRUCTION]\n{task_instruction}\n[END OF TASK INSTRUCTION]\n\n"
    prompt += f"[BEGIN OF AVAILABLE TOOLS]\n{json.dumps(tools)}\n[END OF AVAILABLE TOOLS]\n\n"
    prompt += f"[BEGIN OF FORMAT INSTRUCTION]\n{format_instruction}\n[END OF FORMAT INSTRUCTION]\n\n"
    prompt += f"[BEGIN OF QUERY]\n{query}\n[END OF QUERY]\n\n"
    return prompt

# Build the input and start the inference
openai_format_tools = [live_giveaways_by_type, get_current_weather, get_stock_price]
format_tools = convert_to_format_tool(openai_format_tools)
content = build_prompt(TASK_INSTRUCTION, FORMAT_INSTRUCTION, format_tools, query)

messages = [
    {"role": "user", "content": content}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Stop generation at the tokenizer's end-of-sequence token
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
~~~
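Because `FORMAT_INSTRUCTION` constrains the reply to a bare JSON list, the generated tool calls can be parsed directly. Continuing from the snippet above, here is a minimal parsing sketch; the fallback on malformed output is our assumption, not part of the model's contract:

~~~python
# Continuing from the example above: parse the generated tool calls.
raw = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
try:
    tool_calls = json.loads(raw)  # expected: a list of {"name", "arguments"} dicts
except json.JSONDecodeError:
    tool_calls = []               # the model deviated from the format; handle as needed
for call in tool_calls:
    print(call["name"], call["arguments"])
# Illustrative output for the query above:
#   live_giveaways_by_type {'type': 'beta'}
#   get_current_weather {'location': 'New York, US'}
~~~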
} }, "required": ["ticker"] } } def convert_to_format_tool(tools): '''''' if isinstance(tools, dict): format_tools = { "name": tools["name"], "description": tools["description"], "parameters": tools["parameters"].get("properties", {}), } required = tools["parameters"].get("required", []) for param in required: format_tools["parameters"][param]["required"] = True for param in format_tools["parameters"].keys(): if "default" in format_tools["parameters"][param]: default = format_tools["parameters"][param]["default"] format_tools["parameters"][param]["description"]+=f"default is \'{default}\'" return format_tools elif isinstance(tools, list): return [convert_to_format_tool(tool) for tool in tools] else: return tools # Helper function to build the input prompt for our model def build_prompt(task_instruction: str, format_instruction: str, tools: list, query: str): prompt = f"[BEGIN OF TASK INSTRUCTION]\n{task_instruction}\n[END OF TASK INSTRUCTION]\n\n" prompt += f"[BEGIN OF AVAILABLE TOOLS]\n{json.dumps(tools)}\n[END OF AVAILABLE TOOLS]\n\n" prompt += f"[BEGIN OF FORMAT INSTRUCTION]\n{format_instruction}\n[END OF FORMAT INSTRUCTION]\n\n" prompt += f"[BEGIN OF QUERY]\n{query}\n[END OF QUERY]\n\n" return prompt # Build the input and start the inference openai_format_tools = [live_giveaways_by_type, get_current_weather,get_stock_price] format_tools = convert_to_format_tool(openai_format_tools) content = build_prompt(TASK_INSTRUCTION, FORMAT_INSTRUCTION, format_tools, query) messages=[ { 'role': 'user', 'content': content} ] inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device) # tokenizer.eos_token_id is the id of <|EOT|> token outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id) print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)) ~~~ ## References - 1.Yan F, Mao H, Ji C C-J, et al. Berkeley Function Calling Leaderboard. - 2. Abdelaziz I, Basu K, Agarwal M, et al. Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks[J]. arXiv preprint arXiv:2407.00121, 2024. - 3. Wu M, Zhu T, Han H, et al. Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark[J]. arXiv preprint arXiv:2405.08355, 2024. Feel free to reach out for further clarifications or contributions!