---
license: apache-2.0
base_model: openlm-research/open_llama_3b_v2
library_name: transformers
pipeline_tag: text-generation
model_type: peft
adapter_type: lora
language:
- en
- ms
tags:
- peft
- lora
- qlora
- address-normalization
- address-correction
- malaysia
---
# Malaysian Address Corrector LoRA

This is a LoRA adapter for openlm-research/open_llama_3b_v2, fine-tuned to normalize and standardize Malaysian postal addresses. It expands common abbreviations, enforces consistent comma-separated formatting, and outputs uppercase standardized addresses.

> ⚠️ **Important:** This repo contains adapters only; you must load them on top of the base model. The Hosted Inference widget will not run adapters directly.
## Model Description

This model is a LoRA-fine-tuned adapter built on top of OpenLLaMA 3B v2, specialized for Malaysian address correction. It:

- Expands common local abbreviations (e.g., JLN → JALAN, TMN → TAMAN, WPKL → KUALA LUMPUR)
- Normalizes spacing and adds commas, outputting addresses in a consistent, one-line, uppercase format
- Formats addresses as `[Address/Unit], [Street], [Locality/Area], [City], [Postcode], [State]`
- Runs efficiently on modest GPUs thanks to 4-bit quantization + LoRA, and supports easy batch or interactive usage

It is ideal for developers needing clean, standardized Malaysian postal addresses for shipping labels, geocoding, or databases.
## Model Details

- **Base model:** openlm-research/open_llama_3b_v2 (Apache-2.0)
- **Technique:** QLoRA-style PEFT (LoRA on a 4-bit quantized base)
- **Intended users:** Developers standardizing Malaysian postal addresses
## Uses

- Correct and standardize Malaysian addresses in free-form text
- Expand common abbreviations (e.g., JLN, TMN, LRG, WPKL)
- Produce a single uppercase line suitable for label printing or geocoding prep

### Out-of-Scope Use

- Non-Malaysian address formats
- Entity verification/validation against authoritative sources
- Geocoding / latitude-longitude lookup
## Bias, Risks & Limitations

- **Formatting assumptions:** The model favors Malaysian conventions and may incorrectly reorder non-MY addresses.
- **Ambiguity:** Abbreviations like HSN may map to multiple names; defaults are rule-based and may not match all cases.
- **Hallucination:** The model can invent a locality or state if the input is severely incomplete; keep a human in the loop for critical mailings.

### Recommendations

- Keep a deterministic rule layer (abbreviation expansion + uppercasing + simple postcode/state checks), as sketched below.
- If you have authoritative reference lists (states, cities, postcodes), validate the final line before use.
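
A minimal sketch of the postcode/state check in such a rule layer. The `POSTCODE_PREFIXES` table below is a hypothetical, deliberately incomplete illustration (its three entries come from the evaluation examples in this card), not a shipped lookup:

```python
import re

# Hypothetical, incomplete prefix table for illustration only.
POSTCODE_PREFIXES = {
    "41": "SELANGOR",       # e.g., 41200 Klang
    "47": "SELANGOR",       # e.g., 47800 Petaling Jaya
    "59": "KUALA LUMPUR",   # e.g., 59200 Mid Valley City
}

def check_postcode_state(line: str) -> bool:
    """Cheap consistency check: does the 5-digit postcode's prefix
    agree with the state named at the end of the line?"""
    m = re.search(r"\b(\d{5})\b", line)
    if not m:
        return False
    expected = POSTCODE_PREFIXES.get(m.group(1)[:2])
    # Unknown prefix: don't block, just pass it through.
    return expected is None or line.rstrip().endswith(expected)

assert check_postcode_state(
    "8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR"
)
```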
## Training Details

- **Base model:** openlm-research/open_llama_3b_v2
- **Method:** LoRA fine-tuning with QLoRA (4-bit quantization)
- **Dataset:** Synthetic + manually curated Malaysian address pairs (JSONL format: `instruction`, `input`, `output`; see the sample record below)
- **Task:** Causal LM, few-shot prompting with output delimiters ...
- **Epochs:** 2
- **Batch size:** 2 (gradient accumulation 8)
- **LR:** 2e-4 (cosine schedule, 5% warmup)
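
For reference, a record in the dataset format described above might look like this (the `instruction` text is illustrative; the input/output pair is taken from the evaluation table below):

```json
{"instruction": "Normalize this Malaysian address.", "input": "8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor", "output": "8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR"}
```

And a minimal sketch of the QLoRA setup using the hyperparameters listed above; the LoRA rank, alpha, dropout, and target modules are assumptions, not values published with this adapter:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
base = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b_v2", quantization_config=bnb, device_map="auto"
)
base = prepare_model_for_kbit_training(base)

# Assumed LoRA hyperparameters (rank/alpha/targets are not stated in this card).
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

# Hyperparameters as listed above.
args = TrainingArguments(
    output_dir="out",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    fp16=True,
)
```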
## How to use (LoRA adapter)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch, re

BASE = "openlm-research/open_llama_3b_v2"
ADAPTER = "ramshafirdous/malaysian-address-corrector-lora"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained(BASE, use_fast=False)
if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base, ADAPTER).eval()

def tidy_commas_upper(s):
    # Rule-based post-processor: normalize separators, collapse spaces, uppercase.
    s = re.sub(r"[\t|]+", ", ", s)
    s = re.sub(r"\s*,\s*", ", ", s)
    s = re.sub(r"\s{2,}", " ", s).strip()
    return s.upper()

# NOTE: the exact delimiter strings used at training time were lost from this
# card's text; the placeholders below are assumptions. Any distinctive marker
# pair works here, since the code falls back to the first line if the closing
# marker never appears.
OUT_S, OUT_E = "<out>", "</out>"
FEWSHOT = (
    "MALAYSIAN ADDRESS NORMALIZER.\n"
    "EXPAND ABBREVIATIONS. ONE LINE. ALL CAPS.\n"
    "FORMAT: [ADDRESS], [STREET], [LOCALITY], [CITY], [POSTCODE], [STATE]\n\n"
    "Input: 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor\n"
    f"Output: {OUT_S}8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR{OUT_E}\n"
)

def correct_address(raw, max_new_tokens=128):
    prompt = f"{FEWSHOT}\nInput: {raw}\nOutput: {OUT_S}"
    enc = tok(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            repetition_penalty=1.05,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.pad_token_id,
        )
    txt = tok.decode(out[0], skip_special_tokens=True)
    # Keep only the model's answer: between the markers, or up to the first newline.
    seg = txt.split(OUT_S, 1)[-1]
    seg = seg.split(OUT_E, 1)[0] if OUT_E in seg else seg.split("\n", 1)[0]
    return tidy_commas_upper(seg)

print(correct_address("11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor"))
```
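
Since `correct_address` is a plain function, the batch usage mentioned above is just a loop; a minimal sketch (`addresses.txt` is a hypothetical input file, one raw address per line):

```python
with open("addresses.txt", encoding="utf-8") as f:
    for raw in f:
        raw = raw.strip()
        if raw:
            print(correct_address(raw))
```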
## Evaluation

Qualitative validation on held-out messy inputs:

| Input | Output |
|---|---|
| 11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor | 11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR |
| LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur | LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR |
| 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor | 8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR |
## Abbreviation coverage

| Abbreviation | Expansion |
|---|---|
| JLN | JALAN |
| TMN | TAMAN |
| LRG | LORONG |
| BDR | BANDAR |
| PJS | PETALING JAYA SELATAN |
| WPKL | KUALA LUMPUR |
| KPG | KAMPUNG |
| PLG | PULAU |
| BLK | BLOK |
| LEBUH RAYA / HWY / HWAY | LEBUH RAYA |
| ... | ... |
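
A deterministic expander for this table (the rule-layer component recommended earlier) is only a few lines of Python. A minimal sketch covering just the rows shown above; whole-word matching keeps it from rewriting substrings of longer tokens:

```python
import re

EXPANSIONS = {
    "JLN": "JALAN", "TMN": "TAMAN", "LRG": "LORONG", "BDR": "BANDAR",
    "PJS": "PETALING JAYA SELATAN", "WPKL": "KUALA LUMPUR",
    "KPG": "KAMPUNG", "PLG": "PULAU", "BLK": "BLOK",
    "HWY": "LEBUH RAYA", "HWAY": "LEBUH RAYA",
}

def expand_abbreviations(line: str) -> str:
    # Whole-word, case-insensitive replacement: "JLN" matches, "JLNX" does not.
    pattern = re.compile(r"\b(" + "|".join(EXPANSIONS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: EXPANSIONS[m.group(0).upper()], line)

print(expand_abbreviations("8 LRG ZAINAL ABIDIN 13 KPG PENDAMAR KLANG"))
# -> 8 LORONG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG
```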
## Known Limitations

- The model relies on prompt patterns; inconsistent prompting may reduce accuracy.
- Does not validate postcode vs. state matches.
- May occasionally insert or omit commas if input spacing is irregular (use a rule-based post-processor like `tidy_commas_upper`).
- Trained for Malaysian addresses only.
- Not for parsing addresses into structured fields.
- Not a geocoder: it does not verify that a location exists.
## Model Card Authors

- **Author:** Ramsha Firdous