UAE Rulebook Q&A Assistant - Finetuned LFM2 Model
Model ID: rajeshthangaraj1/uae_rule_book_QA_assistant
Model Overview
This model is a fine-tuned version of LFM2 (1.2B), optimized as a conversational assistant specifically for answering questions based on the UAE Central Bank Rulebook (Banking Regulations). It specializes in navigating regulatory sections such as Capital Adequacy, Licensing, Corporate Governance, and Risk Management.
The model is quantized to 4-bit precision using bitsandbytes, balancing performance with memory efficiency for practical deployment.
Use Cases
Legal and regulatory Q&A: Ask precise questions like:
- "What does Article (1) of the Capital Adequacy section define?"
- "What are the minimum capital ratios specified in Article (2)?"
Educational Tool: Great for students or professionals seeking quick, accurate answers to banking regulation questions.
Limitations
- Hallucination Risk: Without explicit context or document retrieval, the model may hallucinate or generate plausible but incorrect answers.
- Domain-specific: Tailored exclusively to the UAE Central Bank Rulebook’s banking sections.
- Precision: May occasionally misstate percentages, thresholds, or the contents of articles not covered in the training set.
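One way to reduce the hallucination risk noted above is to retrieve the relevant rulebook chunk first and embed it in the prompt, so the model answers from text it can see. A minimal sketch; the helper name and prompt wording are illustrative, not part of the model's training format:

```python
def build_grounded_prompt(question: str, retrieved_chunk: str) -> str:
    # Embed a retrieved rulebook excerpt so the answer stays grounded in it
    return (
        "Answer using only the rulebook excerpt below. "
        "If the excerpt does not contain the answer, say so.\n\n"
        f"Excerpt:\n{retrieved_chunk}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "What are the minimum capital ratios specified in Article (2)?",
    "Article (2): Quantitative Requirements ...",
)
```

The resulting string would be sent as the user message in the chat template shown below under Example Usage.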
📊 Dataset Creation
Source Data: The dataset was built using publicly available content from the official UAE Central Bank Rulebook, accessible at rulebook.centralbank.ae. The rulebook outlines the legal and compliance frameworks governing financial institutions in the UAE, with a focus on banking regulations such as Capital Adequacy, Licensing, Governance, and Risk Management.
Preprocessing:
- The scraped content was cleaned and segmented into approximately 7,000 text chunks.
- Each chunk contains ~500 characters, preserving semantic boundaries such as article titles, clauses, and legal definitions.
- These chunks were used as context for generating question-answer pairs.

The resulting dataset follows the structure:
- "context": rulebook chunk
- "question": generated question
- "answer": answer grounded in the context
This dataset was then used to fine-tune the model for domain-specific legal QA behavior.
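The pipeline above can be sketched as follows. The chunker is a simplified illustration (the actual cleaning and boundary-preservation logic is not published), and the record contents are hypothetical:

```python
def chunk_text(text: str, target: int = 500) -> list[str]:
    # Naive sketch: pack paragraphs into ~500-character chunks,
    # splitting only on paragraph boundaries to keep clauses intact
    chunks, current = [], ""
    for para in text.split("\n"):
        if current and len(current) + len(para) > target:
            chunks.append(current.strip())
            current = ""
        current += para + "\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Hypothetical record showing the (context, question, answer) schema
record = {
    "context": "Article (1): Definitions. In this Regulation ...",
    "question": "What does Article (1) of the Capital Adequacy section cover?",
    "answer": "Article (1) sets out the definitions used in the Regulation.",
}
```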
Example Usage
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rajeshthangaraj1/uae_rule_book_QA_assistant")
model = AutoModelForCausalLM.from_pretrained("rajeshthangaraj1/uae_rule_book_QA_assistant")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in the UAE Central Bank Rulebook, specifically in the banking regulations section, including Capital Adequacy, Licensing, Corporate Governance, and Risk Management. Your job is to answer questions based strictly on the contents of the rulebook. If the answer is not available in the rulebook or the article being referenced, clearly state that the information is not available. Your tone should be professional, clear, and informative. Do not invent or assume information. Base your response only on actual rules, definitions, and articles from the UAE Central Bank Rulebook. Always prefer referencing article numbers when possible. If the user's question mentions a specific article, respond with what that article says. If the question is general, provide a relevant and accurate explanation from the rulebook. Avoid any general or global banking answers unless they are also stated in the UAE Rulebook."},
    {"role": "user", "content": "According to the UAE Central Bank Rulebook – Capital Adequacy Section, what does Article (2): Quantitative Requirements specify about the minimum capital ratios banks must maintain?"},
]

# Prepare input
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Some tokenizers emit token_type_ids, which generate() does not accept
if "token_type_ids" in inputs:
    inputs.pop("token_type_ids")

# Generate response (raise max_new_tokens for longer answers)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated part
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```