LLaDA-8B-Tools

This repository contains a variant of the GSAI-ML/LLaDA-8B-Instruct model, fine-tuned by Proximile LLC to enhance its tool calling capabilities. Proximile specializes in secure, on-premise AI solutions for small and medium-sized businesses.

Update Timeline

  • May 14 2025 – Initial public release. Training examples were missing the pad tokens that fill out the remainder of the generation window.
  • May 17 2025 – Patched the training script to use correct padding; updated model weights have been pushed to this repository.

About LLaDA

LLaDA (Large Language Diffusion with mAsking) is a novel language model architecture that uses discrete diffusion for text generation. Unlike traditional autoregressive models, LLaDA generates text through an iterative denoising process, progressively replacing mask tokens with predicted tokens based on confidence scores.

Model Description

This merged LoRA model was trained to improve LLaDA's ability to handle tool calling tasks, including:

  • Generating proper JSON for tool invocation
  • Processing tool response data
  • Providing helpful answers based on tool outputs
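
For example, a tool invocation in the trained format is a bare JSON list of function calls (the same shape the example script below produces):

[{"name": "get_weather", "parameters": {"location": "New York", "unit": "fahrenheit"}}]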

Training Details

  • Base Model: GSAI-ML/LLaDA-8B-Instruct
  • Training Method: Supervised Fine-Tuning (SFT) with LoRA
  • LoRA Configuration:
    • Rank (r): 128
    • Alpha: 256
    • Target Modules: q_proj, k_proj, v_proj, gate_proj
  • Training Data: A modified subset of the ToolACE dataset.
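
For reference, here is a minimal sketch of how that LoRA configuration could be expressed with peft (values not listed above, such as dropout and task type, are illustrative assumptions rather than reported training settings):

from peft import LoraConfig

# Sketch of the reported LoRA setup; dropout and task type are
# assumptions, not reported training settings.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "gate_proj"],
    lora_dropout=0.05,      # assumed
    task_type="CAUSAL_LM",  # assumed
)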

Installation

pip install transformers peft torch bitsandbytes

Usage

To use this model:

from transformers import AutoTokenizer, AutoModel

# Load the model and tokenizer (the LoRA weights are already merged into this checkpoint)
model_name = "Proximile/LLaDA-8B-Tools"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
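
The released weights are full precision (F32), so loading them as-is is memory-hungry. A hedged alternative, mirroring common LLaDA usage (an assumption, not an official recommendation for this checkpoint), is to load in bfloat16:

import torch
from transformers import AutoModel

# Assumption: bfloat16 roughly halves memory versus full precision and
# mirrors the base LLaDA examples; not an official recommendation here.
model = AutoModel.from_pretrained(
    "Proximile/LLaDA-8B-Tools",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)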

Example Chat Completion Script

Here's a complete example of using the model for chat completion with tool calling:

import torch
import json
from transformers import AutoTokenizer, AutoModel

# Constants
MASK_TOKEN_ID = 126336

def add_gumbel_noise(logits, temperature):
    '''
    The Gumbel max is a method for sampling categorical distributions.
    For diffusion models, low-precision Gumbel Max affects generation quality.
    '''
    if temperature <= 0:
        return logits
        
    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    gumbel_noise = (- torch.log(noise)) ** temperature
    return logits.exp() / gumbel_noise
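
# Why this works: argmax(exp(logits) / (-log u)**T) equals
# argmax(logits + T * G) with G = -log(-log u), i.e. standard Gumbel-max
# sampling when T = 1. The temperature scales the noise rather than the
# logits, so T = 0 reduces to greedy decoding (handled by the early return),
# and the float64 cast guards against the precision issues the docstring
# warns about.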

def get_num_transfer_tokens(mask_index, steps):
    '''
    In the reverse process, we precompute the number of tokens to transition at each step.
    '''
    mask_num = mask_index.sum(dim=1, keepdim=True)
    
    # Ensure we have at least one step
    if steps == 0:
        steps = 1
        
    base = mask_num // steps
    remainder = mask_num % steps
    
    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base
    
    for i in range(mask_num.size(0)):
        if remainder[i] > 0:
            num_transfer_tokens[i, :remainder[i]] += 1
            
    return num_transfer_tokens
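
# For intuition: with 10 masked tokens and 4 steps, base = 2 and
# remainder = 2, so the schedule front-loads the extra tokens, e.g.:
#   get_num_transfer_tokens(torch.ones(1, 10, dtype=torch.bool), steps=4)
#   -> tensor([[3, 3, 2, 2]])  (counts always sum to the number of masks)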

def generate(model, prompt, steps=128, gen_length=128, block_length=32, temperature=0.,
             remasking='low_confidence', mask_id=MASK_TOKEN_ID):
    '''
    Generate text using LLaDA's diffusion-based generation process.
    '''
    device = next(model.parameters()).device
    prompt = prompt.to(device)
    
    x = torch.full((1, prompt.shape[1] + gen_length), mask_id, dtype=torch.long).to(device)
    x[:, :prompt.shape[1]] = prompt.clone()
    
    prompt_index = (x != mask_id)
    
    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length
    
    assert steps % num_blocks == 0
    steps_per_block = steps // num_blocks
    
    for num_block in range(num_blocks):
        block_mask_index = (x[:, prompt.shape[1] + num_block * block_length: prompt.shape[1] + (num_block + 1) * block_length] == mask_id)
        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps_per_block)
        
        for i in range(steps_per_block):
            mask_index = (x == mask_id)
            if not mask_index.any():
                break
                
            outputs = model(x)
            logits = outputs.logits
            
            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
            x0 = torch.argmax(logits_with_noise, dim=-1)  # b, l
            
            if remasking == 'low_confidence':
                p = torch.nn.functional.softmax(logits.to(torch.float64), dim=-1)
                x0_p = torch.squeeze(
                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)  # b, l
            elif remasking == 'random':
                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
            else:
                raise NotImplementedError(remasking)
            
            # Do not allow positions beyond the current block to be unmasked yet
            x0_p[:, prompt.shape[1] + (num_block + 1) * block_length:] = -float('inf')
            
            x0 = torch.where(mask_index, x0, x)
            confidence = torch.where(mask_index, x0_p, -float('inf'))
            
            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
            for j in range(confidence.shape[0]):
                _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j, i])
                transfer_index[j, select_index] = True
            x[transfer_index] = x0[transfer_index]
    
    return x
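
# With the defaults (gen_length=128, block_length=32, steps=128), generation
# is semi-autoregressive: 4 blocks of 32 tokens are denoised left to right
# with 32 steps per block, and each step commits only the highest-confidence
# predictions, leaving the rest masked for later steps.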

def chat_completion(model, tokenizer, messages, temperature=0.1, gen_length=128, steps=128):
    """
    Generate a chat completion.
    
    Args:
        model: The LLaDA tool calling model
        tokenizer: The tokenizer
        messages: List of message dictionaries with 'role' and 'content' keys
        temperature: Temperature for generation (0 for greedy)
        gen_length: Maximum length of generated text
        steps: Number of denoising steps
        
    Returns:
        The generated response text
    """
    # Format input for the model
    formatted_input = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    # Tokenize input
    input_ids = tokenizer(formatted_input, return_tensors="pt")["input_ids"]
    
    # Generate response
    with torch.no_grad():
        output_ids = generate(
            model, 
            input_ids, 
            steps=steps,
            gen_length=gen_length,
            block_length=32,
            temperature=temperature,
            remasking='low_confidence'
        )
    
    # Decode the generated output, trimming at the first special-token marker
    generated_text = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=False).split("<|")[0]
    return generated_text

# Example usage
if __name__ == "__main__":
    # Load the model and tokenizer (LoRA weights are already merged)
    model_name = "Proximile/LLaDA-8B-Tools"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
    
    # Define tool calling function schema
    tool_schema = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit of temperature"
                        }
                    },
                    "required": ["location", "unit"]
                }
            }
        }
    ]
    
    # Create conversation with system prompt including tool description
    system_prompt = """You are a helpful assistant with tool calling capabilities. When you receive a tool call response, use the output to format an answer to the orginal user question.

If you choose to use one or more of the following tool functions, respond with a list of JSON function calls, each with the proper arguments that best answers the given prompt.

Each tool request within the list should be in the exact format {"name": function name, "parameters": {dictionary of argument names and values}}. Do not use variables. Just a list of two-key dictionaries, each starting with the function name, followed by a dictionary of parameters.

Here are the tool functions available to you:

""" + json.dumps(tool_schema, indent=4) + """

After receiving the results back from a function call, you have to formulate your response to the user. If the information needed is not found in the returned data, either attempt a new function call, or inform the user that you cannot answer based on your available knowledge. The user cannot see the function results. You have to interpret the data and provide a response based on it.

If the user request does not necessitate a function call, simply respond to the user's query directly."""
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What's the weather like in New York?"}
    ]
    
    # Generate assistant response (expecting tool call)
    assistant_response = chat_completion(model, tokenizer, messages)
    print(f"Assistant: {assistant_response}")
    
    # Mock tool response
    tool_response = json.dumps({
        "location": "New York, NY",
        "temperature": 72,
        "unit": "fahrenheit",
        "condition": "Partly Cloudy",
        "humidity": 65,
        "wind_speed": 8,
        "wind_direction": "NE"
    })
    
    # Add assistant and tool responses to the conversation
    messages.append({"role": "assistant", "content": assistant_response})
    messages.append({"role": "ipython", "content": tool_response})
    
    # Generate final assistant response
    final_response = chat_completion(model, tokenizer, messages)
    print(f"Assistant (with tool data): {final_response}")

# Assistant: [{"name": "get_weather", "parameters": {"location": "New York", "unit": "fahrenheit"}}]
# Assistant (with tool data): The current weather in New York is as follows:
# - Temperature: 72°F
# - Weather Condition: Partly Cloudy
# - Humidity: 65%
# - Wind Speed: 8 miles per hour
# - Wind Direction: Northeast

Limitations

  • LLaDA's diffusion-based generation differs from standard autoregressive LLMs and may behave unexpectedly in some contexts
  • The model may still hallucinate or generate incorrect tool call formats
  • The format of the tool call must precisely match the one shown in the example (a modified version of the official Llama 3.1 format)

Citation

If you use this model in your research, please cite the original LLaDA paper as well as this adapter:

@misc{llada-8b-tools,
  author = {Proximile LLC},
  title = {LLaDA-8B-Tools},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Proximile/LLaDA-8B-Tools}}
}

About Proximile LLC

Proximile LLC provides secure, cost-effective, and private AI solutions tailored to small and medium-sized businesses. We specialize in:

  • On-premise AI inference solutions that ensure unparalleled privacy
  • Cost-effective hardware configurations including the Jetson Orin Nano Super
  • Secure Local AI applications including chatbots, RAG systems, and custom AI tools
  • Specialized services for compliance & governance, knowledge management, and IT automation

Visit proximile.llc to learn more about our secure, local AI solutions for your business.

License

This adapter is released under the same license as the base LLaDA model.
