---
license: apache-2.0
base_model: openlm-research/open_llama_3b_v2
library_name: transformers
pipeline_tag: text-generation
model_type: peft
adapter_type: lora
language:
- en
- ms
tags:
- peft
- lora
- qlora
- address-normalization
- address-correction
- malaysia
---
# Malaysian Address Corrector LoRA
This is a **LoRA adapter** for [`openlm-research/open_llama_3b_v2`](https://huggingface.co/openlm-research/open_llama_3b_v2) fine-tuned to **normalize and standardize Malaysian postal addresses**.
It expands common abbreviations, enforces consistent comma-separated formatting, and outputs **uppercase** standardized addresses.
⚠️ **Important:** This repo contains **adapters only** — you must load them on top of the base model. The Hosted Inference widget will not run adapters directly.
---
## Overview
This model is a LoRA-fine-tuned adapter built on top of OpenLLaMA 3B v2, specialized for Malaysian address correction. It:
- Expands common local abbreviations (e.g., JLN → JALAN, TMN → TAMAN, WPKL → KUALA LUMPUR)
- Normalizes spacing and adds commas, outputting addresses in a consistent one-line, uppercase format
- Formats addresses as `[Address/Unit], [Street], [Locality/Area], [City], [Postcode], [State]`
- Runs efficiently on modest GPUs thanks to 4-bit quantization + LoRA, and supports batch or interactive usage

Ideal for developers who need clean, standardized Malaysian postal addresses for shipping labels, geocoding, or databases.
## Model Details
- **Base model:** `openlm-research/open_llama_3b_v2` (Apache-2.0)
- **Technique:** QLoRA-style PEFT (LoRA on a 4-bit quantized base)
- **Intended users:** developers standardizing Malaysian postal addresses
## Uses
- Correct and standardize Malaysian addresses in free-form text
- Expand common abbreviations (e.g., JLN, TMN, LRG, WPKL)
- Produce a single uppercase line suitable for label printing or geocoding prep
## Out-of-Scope Use
- Non-Malaysian address formats
- Entity verification/validation against authoritative sources
- Geocoding / latitude-longitude lookup
## Bias, Risks & Limitations
- **Formatting assumptions:** the model favors Malaysian conventions and may incorrectly reorder non-MY addresses.
- **Ambiguity:** abbreviations such as HSN may map to multiple names; defaults are rule-based and may not match all cases.
- **Hallucination:** the model can invent a locality or state if the input is severely incomplete; keep a human in the loop for critical mailings.
## Recommendations
- Keep a deterministic rule layer (abbreviation expansion + uppercasing + simple postcode/state checks).
- If you have authoritative reference lists (states, cities, postcodes), validate the final line before use.
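For illustration, a minimal deterministic rule layer might look like the sketch below. The abbreviation map and the five-digit postcode check are illustrative assumptions, not part of the trained model; extend the map with entries from the abbreviation coverage table further down.

```python
import re

# Hypothetical abbreviation map; extend with entries from the coverage table.
ABBREV = {
    "JLN": "JALAN", "TMN": "TAMAN", "LRG": "LORONG",
    "BDR": "BANDAR", "KPG": "KAMPUNG", "WPKL": "KUALA LUMPUR",
}

def rule_normalize(addr: str) -> str:
    """Uppercase the address and expand known abbreviations token by token."""
    out = []
    for tok in addr.upper().split():
        key = tok.strip(",")                      # look up without trailing comma
        exp = ABBREV.get(key, key)
        out.append(exp + ("," if tok.endswith(",") else ""))
    return " ".join(out)

def has_valid_postcode(addr: str) -> bool:
    """Malaysian postcodes are five digits; check that one is present."""
    return re.search(r"\b\d{5}\b", addr) is not None
```

A layer like this runs before and after the model: expansion upfront reduces ambiguity in the prompt, and the postcode check flags outputs that need human review.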
## Training Details
- **Base model:** `openlm-research/open_llama_3b_v2`
- **Method:** LoRA fine-tuning with QLoRA (4-bit quantization)
- **Dataset:** synthetic + manually curated Malaysian address pairs (JSONL format: `instruction`, `input`, `output`)
- **Task:** causal LM with few-shot prompting and `<OUT>...</OUT>` output delimiters
- **Epochs:** 2
- **Batch size:** 2 (gradient accumulation 8)
- **Learning rate:** 2e-4 (cosine schedule, 5% warmup)
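As a rough sketch, the training setup above maps onto `peft`/`transformers` configuration like this. The LoRA hyperparameters (`r`, `lora_alpha`, `lora_dropout`, `target_modules`) and the `fp16` flag are illustrative assumptions, since the card does not state them; the `TrainingArguments` values come directly from the details above.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA hyperparameters here are illustrative assumptions, not from the card.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none", task_type="CAUSAL_LM",
)

# These values match the training details listed above.
train_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    fp16=True,  # assumption: fp16 compute to match the 4-bit base setup
)
```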
## How to use (LoRA adapter)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch, re

BASE = "openlm-research/open_llama_3b_v2"
ADAPTER = "ramshafirdous/malaysian-address-corrector-lora"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained(BASE, use_fast=False)
if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base, ADAPTER).eval()

def tidy_commas_upper(s):
    # Normalize separators and whitespace, then uppercase.
    s = re.sub(r"[\t|]+", ", ", s)
    s = re.sub(r"\s*,\s*", ", ", s)
    s = re.sub(r"\s{2,}", " ", s).strip()
    return s.upper()

OUT_S, OUT_E = "<OUT>", "</OUT>"

FEWSHOT = (
    "MALAYSIAN ADDRESS NORMALIZER.\n"
    "EXPAND ABBREVIATIONS. ONE LINE. ALL CAPS.\n"
    "FORMAT: [ADDRESS], [STREET], [LOCALITY], [CITY], [POSTCODE], [STATE]\n\n"
    "Input: 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor\n"
    f"Output: {OUT_S}8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR{OUT_E}\n"
)

def correct_address(raw, max_new_tokens=128):
    prompt = f"{FEWSHOT}\nInput: {raw}\nOutput: {OUT_S}"
    enc = tok(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            repetition_penalty=1.05,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.pad_token_id,
        )
    txt = tok.decode(out[0], skip_special_tokens=True)
    # Keep only the text between the <OUT>...</OUT> delimiters.
    seg = txt.split(OUT_S, 1)[-1]
    seg = seg.split(OUT_E, 1)[0] if OUT_E in seg else seg.split("\n", 1)[0]
    return tidy_commas_upper(seg)

print(correct_address("11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor"))
```
## Evaluation
Qualitative validation on held-out messy inputs:
| Input | Output |
| ------------------------------------------------------------------------ | ------------------------------------------------------------------------- |
| `11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor` | `11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR` |
| `LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur` | `LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR` |
| `8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor` | `8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR` |
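The pairs in the table above can double as a small regression harness. A minimal sketch: the harness takes the normalization function as a parameter, so it works with `correct_address` from the usage snippet or with any rule-based baseline.

```python
def exact_match_rate(pairs, normalize_fn):
    """Fraction of (raw, expected) pairs where normalize_fn(raw) == expected."""
    hits = sum(1 for raw, expected in pairs if normalize_fn(raw) == expected)
    return hits / len(pairs)

# Held-out pair from the evaluation table above.
PAIRS = [
    ("8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor",
     "8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR"),
]

# Usage (requires the loaded model from the snippet above):
# print(exact_match_rate(PAIRS, correct_address))
```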
## Abbreviation coverage
| Abbreviation | Expansion |
| ----------------------- | --------------------- |
| JLN | JALAN |
| TMN | TAMAN |
| LRG | LORONG |
| BDR | BANDAR |
| PJS | PETALING JAYA SELATAN |
| WPKL | KUALA LUMPUR |
| KPG | KAMPUNG |
| PLG | PULAU |
| BLK | BLOK |
| LEBUH RAYA / HWY / HWAY | LEBUH RAYA |
| ... | ... |
## Known Limitations
- Relies on prompt patterns; inconsistent prompting may reduce accuracy.
- Does not validate that the postcode matches the state.
- May occasionally insert or omit commas if input spacing is irregular (use a rule-based post-processor such as `tidy_commas_upper`).
- Trained for Malaysian addresses only.
- Does not parse addresses into structured fields.
- Not a geocoder; it does not verify that a location exists.
## Model Card Authors
Author: Ramsha Firdous