---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- tinyllama
- json
- intent-detection
- qlora
- gptq
---
# TinyLlama-JSON-Intent (GPTQ 4-bit)
This is a fine-tuned version of [`TinyLlama/TinyLlama-1.1B-Chat-v1.0`](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0), trained specifically for e-commerce intent detection. Given a catalog of products and a user's request, it outputs a structured JSON object containing the requested `action` (`add` or `remove`), the `product` name, and the `quantity`.
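For example (an illustrative pair; the `product` string mirrors the catalog entry, as in the usage example below), a request to add two jars of peanut butter would produce:
```json
{"action": "add", "product": "Peanut Butter (340g jar)", "quantity": 2}
```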
This version of the model is **quantized to 4-bit using GPTQ**, making it highly efficient for inference in terms of memory usage and speed.
The QLoRA adapter was merged into the base model before quantization, so no separate adapter loading is required.
- **Adapter Version:** [jtlicardo/tinyllama-ecommerce-intent-adapter](https://huggingface.co/jtlicardo/tinyllama-ecommerce-intent-adapter)
## Model Description
The base model, TinyLlama-Chat, was fine-tuned using the QLoRA method on a synthetic dataset of 100 examples. The training objective was to teach the model to ignore conversational pleasantries and strictly output a JSON object that can be directly parsed by a backend system for managing a shopping cart.
## Intended Use & Limitations
This model is designed for a specific task: parsing user requests in an e-commerce context. It should not be used as a general-purpose chatbot.
- **Primary Use:** Backend service for intent detection from user text.
- **Out-of-Scope:** General conversation, answering questions, or any task not related to adding/removing items from a list.
## How to Use
The model expects prompts that follow the TinyLlama-Chat template, and you must provide both the `Catalog` and the `User` request.
**Important:** You need to install `optimum` and `auto-gptq` to run this 4-bit GPTQ model.
```bash
pip install -q optimum auto-gptq transformers
```
Here's how to run inference in Python:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Model repository on the Hugging Face Hub
model_id = "jtlicardo/tinyllama-ecommerce-intent-gptq"
# Load the tokenizer and the 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,  # Recommended for inference
)
# --- Define the prompt ---
catalog = """Catalog:
Shampoo (400ml bottle)
Hand Soap (250ml dispenser)
Peanut Butter (340g jar)
Headphones
Green Tea (25 tea bags)"""
user_query = "Could you please take off 4 pairs of headphones from my cart?"
# --- Format the prompt using the model's chat template ---
# The model was trained to see this structure.
prompt = f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"
# --- Generate the output ---
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = pipe(
    prompt,
    max_new_tokens=50,       # Max length of the JSON output
    do_sample=False,         # Use deterministic (greedy) decoding
    temperature=None,        # Not needed when do_sample=False
    top_p=None,              # Not needed when do_sample=False
    return_full_text=False,  # Only return the generated part
)
# The output will be a clean JSON string
generated_json = outputs[0]['generated_text'].strip()
print(generated_json)
# Expected output:
# {"action": "remove", "product": "Headphones", "quantity": 4}
```
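Since the output is meant to be consumed by a backend system, you will typically parse and validate it before updating the cart. A minimal sketch, continuing from the snippet above (the validation rules here are illustrative assumptions, not part of the model):
```python
import json

def parse_intent(raw: str) -> dict:
    """Parse the model output and check the expected schema."""
    intent = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if intent.get("action") not in {"add", "remove"}:
        raise ValueError(f"unexpected action: {intent.get('action')!r}")
    if not isinstance(intent.get("quantity"), int):
        raise ValueError("quantity must be an integer")
    return intent

cart_update = parse_intent(generated_json)
print(cart_update["action"], cart_update["product"], cart_update["quantity"])
```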
## Training Procedure
This model was fine-tuned using the `trl` library's `SFTTrainer`.
- **Method:** QLoRA (4-bit quantization with LoRA adapters)
- **Dataset:** A custom JSONL file with 100 `prompt`/`completion` pairs.
- **Configuration:** `completion_only_loss=True` was used so that loss is computed only on the assistant's JSON completion, not on the prompt (a sketch of the full setup is shown below).
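For reference, here is a minimal sketch of what this setup can look like with recent versions of `trl`, `peft`, and `bitsandbytes`. The LoRA hyperparameters, output directory, and `train.jsonl` file name are illustrative assumptions, not the exact values used for this model:
```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# QLoRA: load the frozen base model in 4-bit (NF4) and train LoRA adapters on top
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Illustrative LoRA hyperparameters
peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32)

# JSONL file with {"prompt": ..., "completion": ...} records ("train.jsonl" is a placeholder)
dataset = load_dataset("json", data_files="train.jsonl", split="train")

training_args = SFTConfig(
    output_dir="tinyllama-intent-qlora",
    completion_only_loss=True,  # compute loss only on the completion tokens
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```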