---
license: apache-2.0
base_model: openlm-research/open_llama_3b_v2
library_name: transformers
pipeline_tag: text-generation
model_type: peft
adapter_type: lora
language: [en, ms]
tags: [peft, lora, qlora, address-normalization, address-correction, malaysia]
---

# Malaysian Address Corrector LoRA

This is a **LoRA adapter** for [`openlm-research/open_llama_3b_v2`](https://huggingface.co/openlm-research/open_llama_3b_v2) fine-tuned to **normalize and standardize Malaysian postal addresses**. It expands common abbreviations, enforces consistent comma-separated formatting, and outputs **uppercase** standardized addresses.

⚠️ **Important:** This repo contains **adapters only**; you must load them on top of the base model. The Hosted Inference widget will not run adapters directly.

---

## Model Description

This model is a LoRA-fine-tuned adapter built on top of OpenLLaMA 3B v2, specialized for Malaysian address correction. It:

- Expands common local abbreviations (e.g., JLN → JALAN, TMN → TAMAN, WPKL → KUALA LUMPUR)
- Normalizes spacing and adds commas, outputting addresses in a consistent, one-line, uppercase format
- Formats addresses as `[Address/Unit], [Street], [Locality/Area], [City], [Postcode], [State]`
- Runs efficiently on modest GPUs thanks to 4-bit quantization + LoRA, and supports easy batch or interactive usage

It is ideal for developers who need clean, standardized Malaysian postal addresses for shipping labels, geocoding, or databases.

## Model Details

- Base model: `openlm-research/open_llama_3b_v2` (Apache-2.0)
- Technique: QLoRA-style PEFT (LoRA on a 4-bit quantized base)
- Intended users: Developers standardizing Malaysian postal addresses

## Uses

- Correct and standardize Malaysian addresses in free-form text
- Expand common abbreviations (e.g., JLN, TMN, LRG, WPKL)
- Produce a single uppercase line suitable for label printing or geocoding prep

## Out-of-Scope Use

- Non-Malaysian address formats
- Entity verification/validation against authoritative sources
- Geocoding / latitude-longitude lookup

## Bias, Risks & Limitations

- **Formatting assumptions:** The model favors Malaysian conventions and may incorrectly reorder non-MY addresses.
- **Ambiguity:** Abbreviations like HSN may map to multiple names; defaults are rule-based and may not match all cases.
- **Hallucination:** The model can invent a locality or state if the input is severely incomplete; keep a human in the loop for critical mailings.

## Recommendations

- Keep a deterministic rule layer (abbreviation expansion + uppercasing + simple postcode/state checks); a minimal validation sketch appears at the end of this card.
- If you have authoritative reference lists (states, cities, postcodes), validate the final line before use.

## Training Details

- Base model: `openlm-research/open_llama_3b_v2`
- Method: LoRA fine-tuning with QLoRA (4-bit quantization)
- Dataset: Synthetic + manually curated Malaysian address pairs (JSONL format: `instruction`, `input`, `output`; an illustrative record is shown below)
- Task: Causal LM, few-shot prompting with output delimiters ...
- Epochs: 2
- Batch size: 2 (gradient accumulation 8)
- LR: 2e-4 (cosine schedule, warmup 5%)
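
For illustration, a single training record in this JSONL format might look like the sketch below. The instruction wording is an assumption, not the exact prompt used during fine-tuning; the address pair is taken from the evaluation examples further down this card.

```python
# Hypothetical training record following the `instruction` / `input` / `output`
# schema described above. The instruction text is illustrative only.
record = {
    "instruction": "Normalize this Malaysian address. Expand abbreviations and return one uppercase, comma-separated line.",
    "input": "8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor",
    "output": "8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR",
}
```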
## How to use (LoRA adapter)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch, re

BASE = "openlm-research/open_llama_3b_v2"
ADAPTER = "ramshafirdous/malaysian-address-corrector-lora"

# 4-bit NF4 quantization so the 3B base fits on modest GPUs
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained(BASE, use_fast=False)
if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base, ADAPTER).eval()

def tidy_commas_upper(s):
    # Rule-based cleanup: normalize separators, collapse whitespace, uppercase.
    s = re.sub(r"[\t|]+", ", ", s)
    s = re.sub(r"\s*,\s*", ", ", s)
    s = re.sub(r"\s{2,}", " ", s).strip()
    return s.upper()

# Optional delimiter tokens wrapped around the target address in the prompt;
# set them to whatever tokens your prompts use (the parsing below also handles
# the empty-string case).
OUT_S, OUT_E = "", ""

FEWSHOT = (
    "MALAYSIAN ADDRESS NORMALIZER.\n"
    "EXPAND ABBREVIATIONS. ONE LINE. ALL CAPS.\n"
    "FORMAT: [ADDRESS], [STREET], [LOCALITY], [CITY], [POSTCODE], [STATE]\n\n"
    f"Input: 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor\n"
    f"Output: {OUT_S}8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR{OUT_E}\n"
)

def correct_address(raw, max_new_tokens=128):
    prompt = f"{FEWSHOT}\nInput: {raw}\nOutput: {OUT_S}"
    enc = tok(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            repetition_penalty=1.05,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.pad_token_id,
        )
    # Decode only the newly generated tokens (everything after the prompt).
    gen = tok.decode(out[0][enc["input_ids"].shape[1]:], skip_special_tokens=True)
    # Keep the text up to the end delimiter (if set) or the first newline.
    seg = gen.split(OUT_E, 1)[0] if OUT_E and OUT_E in gen else gen.split("\n", 1)[0]
    return tidy_commas_upper(seg)

print(correct_address("11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor"))
```

## Evaluation

Qualitative validation on held-out messy inputs:

| Input | Output |
| --- | --- |
| `11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor` | `11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR` |
| `LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur` | `LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR` |
| `8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor` | `8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR` |

## Abbreviation coverage

| Abbreviation | Expansion |
| --- | --- |
| JLN | JALAN |
| TMN | TAMAN |
| LRG | LORONG |
| BDR | BANDAR |
| PJS | PETALING JAYA SELATAN |
| WPKL | KUALA LUMPUR |
| KPG | KAMPUNG |
| PLG | PULAU |
| BLK | BLOK |
| LEBUH RAYA / HWY / HWAY | LEBUH RAYA |
| ... | ... |

## Known Limitations

- The model relies on prompt patterns; inconsistent prompting may reduce accuracy.
- Does not validate postcode vs. state matches (see the consistency-check sketch at the end of this card).
- May occasionally insert or omit commas if input spacing is irregular (use a rule-based post-processor such as `tidy_commas_upper`).
- Trained for Malaysian addresses only.
- Not for parsing addresses into structured fields.
- Not a geocoder; it does not verify that a location exists.

## Model Card Authors

- Author: Ramsha Firdous
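
## Appendix: Postcode/state consistency check (illustrative)

The Recommendations and Known Limitations sections suggest pairing the model with a deterministic check that the postcode is plausible for the state, which the model itself does not perform. Below is a minimal sketch under stated assumptions: the prefix ranges cover only the two states that appear in the evaluation examples and are illustrative, not an authoritative mapping, so substitute an official postcode/state reference list in production.

```python
import re

# Illustrative two-digit postcode prefix ranges (assumed, NOT authoritative;
# several states have exceptions). Replace with an official reference list.
STATE_POSTCODE_PREFIXES = {
    "SELANGOR": range(40, 49),
    "KUALA LUMPUR": range(50, 61),
}

def check_postcode_state(line: str) -> bool:
    """Check that the trailing 'POSTCODE, STATE' of a normalized line are consistent."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) < 2:
        return False
    postcode, state = parts[-2], parts[-1]
    if not re.fullmatch(r"\d{5}", postcode):
        return False
    prefixes = STATE_POSTCODE_PREFIXES.get(state)
    if prefixes is None:
        return True  # state not covered by the illustrative table: skip the check
    return int(postcode[:2]) in prefixes

print(check_postcode_state("8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR"))  # True
print(check_postcode_state("11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 99999, SELANGOR"))     # False
```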