---
license: apache-2.0
base_model: openlm-research/open_llama_3b_v2
library_name: transformers
pipeline_tag: text-generation
model_type: peft
adapter_type: lora
language:
  - en
  - ms
tags:
  - peft
  - lora
  - qlora
  - address-normalization
  - address-correction
  - malaysia
---

# Malaysian Address Corrector LoRA

This is a **LoRA adapter** for [`openlm-research/open_llama_3b_v2`](https://huggingface.co/openlm-research/open_llama_3b_v2) fine-tuned to **normalize and standardize Malaysian postal addresses**.  

It expands common abbreviations, enforces consistent comma-separated formatting, and outputs **uppercase** standardized addresses.

⚠️ **Important:** This repo contains **adapters only** — you must load them on top of the base model. The Hosted Inference widget will not run adapters directly.

---

# Model Card for Malaysian Address Corrector LoRA

This model is a LoRA-fine-tuned adapter built on top of OpenLLaMA 3B v2, specialized for Malaysian address correction. It:

- Expands common local abbreviations (e.g., JLN → JALAN, TMN → TAMAN, WPKL → KUALA LUMPUR)
- Normalizes spacing and adds commas, outputting addresses in a consistent, one-line, uppercase format
- Formats addresses as `[Address/Unit], [Street], [Locality/Area], [City], [Postcode], [State]`
- Runs efficiently on modest GPUs thanks to 4-bit quantization + LoRA, and supports easy batch or interactive usage

Ideal for developers needing clean, standardized Malaysian postal addresses for shipping labels, geocoding, or databases.


## Model Details

Base model: openlm-research/open_llama_3b_v2 (Apache-2.0). 

Technique: QLoRA-style PEFT (LoRA on 4-bit base)

Intended users: Developers standardizing Malaysian postal addresses

## Uses

- Correct and standardize Malaysian addresses in free-form text
- Expand common abbreviations (e.g., JLN, TMN, LRG, WPKL)
- Produce a single uppercase line suitable for label printing or geocoding prep

## Out-of-Scope Use

- Non-Malaysian address formats
- Entity verification/validation against authoritative sources
- Geocoding / latitude-longitude lookup

## Bias, Risks & Limitations

Formatting assumptions: The model favors Malaysian conventions and may incorrectly reorder non-MY addresses.

Ambiguity: Abbreviations like HSN may map to multiple names; defaults are rule-based and may not match all cases.

Hallucination: The model can invent locality/state if the input is severely incomplete; keep a human in the loop for critical mailings.
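
To help a reviewer spot invented localities or states, one option is to flag output words that cannot be traced back to the input. The following is a minimal sketch of that idea, not part of the adapter: `unsupported_words` and its `expansions` argument are hypothetical names, and the word-level comparison is an assumption that will miss some cases (for example, reordered but genuine words always pass).

```python
import re

def _words(text):
    # Uppercase tokens made of letters, digits and "/" (so "11/14" stays whole).
    return re.findall(r"[A-Z0-9/]+", text.upper())

def unsupported_words(raw, corrected, expansions=None):
    """Return words in `corrected` that do not appear in `raw`.

    `expansions` maps abbreviations to their full forms; a full form is
    treated as supported when its abbreviation occurs in the input.
    """
    allowed = set(_words(raw))
    for abbr, full in (expansions or {}).items():
        if abbr.upper() in allowed:
            allowed.update(_words(full))
    return [w for w in _words(corrected) if w not in allowed]

# A truncated input where the output contains a city, postcode and state
# that were never given.
flagged = unsupported_words(
    "8 LRG ZAINAL ABIDIN 13",
    "8, LORONG ZAINAL ABIDIN 13, KLANG, 41200, SELANGOR",
    expansions={"LRG": "LORONG"},
)
```

A non-empty result does not prove the output is wrong; it only marks the line for manual review.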

## Recommendations

Keep a deterministic rule layer (abbrev expansion + uppercasing + simple postcode/state checks).

If you have authoritative reference lists (states, cities, postcodes), validate the final line before use.
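
A simple postcode/state check could look like the sketch below. It assumes the normalized line ends with `[POSTCODE], [STATE]` as described in this card. The function name `check_postcode_state` is hypothetical, and the three-entry lookup table is built only from the example addresses in this card; a real deployment would load an authoritative postcode list instead.

```python
import re

# Illustrative reference table taken from the examples in this card.
# Replace with an authoritative postcode-to-state list in practice.
POSTCODE_TO_STATE = {
    "41200": "SELANGOR",
    "47800": "SELANGOR",
    "59200": "KUALA LUMPUR",
}

def check_postcode_state(line, postcode_to_state=POSTCODE_TO_STATE):
    """Return a list of problems found in a normalized one-line address."""
    parts = [p.strip() for p in line.upper().split(",")]
    if len(parts) < 2:
        return ["expected the line to end with '[POSTCODE], [STATE]'"]
    postcode, state = parts[-2], parts[-1]
    if not re.fullmatch(r"\d{5}", postcode):
        return [f"'{postcode}' is not a 5-digit postcode"]
    expected = postcode_to_state.get(postcode)
    if expected is None:
        return [f"postcode {postcode} is not in the reference table"]
    if state != expected:
        return [f"postcode {postcode} belongs to {expected}, not {state}"]
    return []

problems = check_postcode_state(
    "8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, JOHOR"
)
```

Returning a list of messages rather than a boolean makes it easy to route flagged lines to a reviewer together with the reason.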


## Training Details

Base model: openlm-research/open_llama_3b_v2

Method: LoRA fine-tuning with QLoRA (4-bit quantization)

Dataset: Synthetic + manually curated Malaysian address pairs (JSONL format with `instruction`, `input`, and `output` fields)

Task: Causal LM, few-shot prompting with output delimiters `<OUT>...</OUT>`

Epochs: 2

Batch size: 2 (gradient accumulation 8)

LR: 2e-4 (cosine schedule, warmup 5%)
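
For illustration, a JSONL row with the three fields above could be turned into a training string as sketched below. The exact template used during training is not stated in this card, so the layout (which mirrors the inference prompt), the instruction wording, and the helper name `to_training_text` are all assumptions.

```python
import json

OUT_S, OUT_E = "<OUT>", "</OUT>"

# One hypothetical JSONL row; the instruction text is an assumption.
row = json.dumps({
    "instruction": "Normalize this Malaysian address.",
    "input": "8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor",
    "output": "8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR",
})

def to_training_text(jsonl_line):
    """Format one JSONL record as a single causal-LM training example."""
    rec = json.loads(jsonl_line)
    return (
        f"{rec['instruction']}\n"
        f"Input: {rec['input']}\n"
        f"Output: {OUT_S}{rec['output']}{OUT_E}"
    )

text = to_training_text(row)
```

Wrapping the target in explicit delimiters lets the inference code cut the answer at `</OUT>` even if the model keeps generating. With a batch size of 2 and gradient accumulation of 8, each optimizer step covers 16 examples per device.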


## How to use (LoRA adapter)

```python
import re

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE = "openlm-research/open_llama_3b_v2"
ADAPTER = "ramshafirdous/malaysian-address-corrector-lora"

# Load the base model in 4-bit NF4 so it fits on modest GPUs.
bnb = BitsAndBytesConfig(
  load_in_4bit=True, bnb_4bit_quant_type="nf4",
  bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained(BASE, use_fast=False)
if tok.pad_token_id is None:
  tok.pad_token = tok.eos_token
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto", trust_remote_code=True)
# Attach the LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(base, ADAPTER).eval()

def tidy_commas_upper(s):
  """Rule-based cleanup: normalize separators and spacing, then uppercase."""
  s = re.sub(r"[\t|]+", ", ", s)    # tabs and pipes become commas
  s = re.sub(r"\s*,\s*", ", ", s)   # exactly one space after each comma
  s = re.sub(r"\s{2,}", " ", s).strip()
  return s.upper()

OUT_S, OUT_E = "<OUT>", "</OUT>"
FEWSHOT = (
  "MALAYSIAN ADDRESS NORMALIZER.\n"
  "EXPAND ABBREVIATIONS. ONE LINE. ALL CAPS.\n"
  "FORMAT: [ADDRESS], [STREET], [LOCALITY], [CITY], [POSTCODE], [STATE]\n\n"
  "Input: 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor\n"
  f"Output: {OUT_S}8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR{OUT_E}\n"
)

def correct_address(raw, max_new_tokens=128):
  prompt = f"{FEWSHOT}\nInput: {raw}\nOutput: {OUT_S}"
  enc = tok(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
  with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False,
                         repetition_penalty=1.05, eos_token_id=tok.eos_token_id,
                         pad_token_id=tok.pad_token_id)
  # Decode only the newly generated tokens. The prompt itself contains
  # <OUT>...</OUT> from the few-shot example, so splitting the full text on
  # the first delimiter would return that example instead of the answer.
  new_tokens = out[0][enc["input_ids"].shape[1]:]
  txt = tok.decode(new_tokens, skip_special_tokens=True).strip()
  # Cut at the closing delimiter; fall back to the first line if it is missing.
  seg = txt.split(OUT_E, 1)[0] if OUT_E in txt else txt.split("\n", 1)[0]
  return tidy_commas_upper(seg)

print(correct_address("11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor"))
```
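
For batch usage, a thin wrapper around `correct_address` may be enough. The sketch below is an assumption about how one might do it, not something shipped with the adapter: `correct_batch` is a hypothetical helper that takes the correction function as an argument and caches results so repeated addresses are only sent to the model once.

```python
def correct_batch(addresses, correct_fn):
    """Apply `correct_fn` to each address, reusing results for repeated inputs."""
    cache = {}
    results = []
    for raw in addresses:
        # Collapse whitespace so trivially different duplicates share one entry.
        key = " ".join(raw.split())
        if key not in cache:
            cache[key] = correct_fn(key)
        results.append(cache[key])
    return results

# Stand-in for the model call so the sketch runs without a GPU.
demo = correct_batch(["jln  ampang", "jln ampang", "tmn desa"], str.upper)
```

With the model loaded, the call would be `correct_batch(rows, correct_address)`. Results come back in input order.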

## Evaluation

Qualitative validation on held-out messy inputs:

| Input                                                                    | Output                                                                    |
| ------------------------------------------------------------------------ | ------------------------------------------------------------------------- |
| `11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor`         | `11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR`       |
| `LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur` | `LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR` |
| `8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor`           | `8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR`    |


## Abbreviation coverage

| Abbreviation            | Expansion             |
| ----------------------- | --------------------- |
| JLN                     | JALAN                 |
| TMN                     | TAMAN                 |
| LRG                     | LORONG                |
| BDR                     | BANDAR                |
| PJS                     | PETALING JAYA SELATAN |
| WPKL                    | KUALA LUMPUR          |
| KPG                     | KAMPUNG               |
| PLG                     | PULAU                 |
| BLK                     | BLOK                  |
| LEBUH RAYA / HWY / HWAY | LEBUH RAYA            |
| ...                     | ...                   |
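
A deterministic expansion pass over a subset of the table above might be sketched as follows. The dictionary covers only some rows, and `expand_abbreviations` is a hypothetical helper name; the whole-token matching and the optional trailing period are assumptions about how abbreviations appear in raw input.

```python
import re

# Subset of the abbreviation table in this card; extend as needed.
ABBREVIATIONS = {
    "JLN": "JALAN",
    "TMN": "TAMAN",
    "LRG": "LORONG",
    "BDR": "BANDAR",
    "KPG": "KAMPUNG",
    "BLK": "BLOK",
    "WPKL": "KUALA LUMPUR",
}

# Match an abbreviation only as a whole token, optionally followed by ".",
# so that e.g. "JLNX" is left untouched.
_PATTERN = re.compile(
    r"(?<![A-Z0-9])(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\.?(?![A-Z0-9])"
)

def expand_abbreviations(text):
    """Uppercase `text` and replace known abbreviations with full forms."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text.upper())

expanded = expand_abbreviations("8 lrg zainal abidin, tmn. sri")
```

Running such a pass before or after the model keeps the most common expansions predictable even when the model output varies.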


## Known Limitations

The model relies on prompt patterns — inconsistent prompting may reduce accuracy.

Does not validate postcode vs. state matches.

May occasionally insert or omit commas if input spacing is irregular (use a rule-based post-processor like `tidy_commas_upper`).

Trained for Malaysian addresses only.

Not for parsing addresses into structured fields.

Not a geocoder — it does not verify location existence.

## Model Card Authors

Author: Ramsha Firdous