---
license: apache-2.0
base_model: openlm-research/open_llama_3b_v2
library_name: transformers
pipeline_tag: text-generation
model_type: peft
adapter_type: lora
language:
  - en
  - ms
tags:
  - peft
  - lora
  - qlora
  - address-normalization
  - address-correction
  - malaysia
---

# Malaysian Address Corrector LoRA

This is a LoRA adapter for openlm-research/open_llama_3b_v2 fine-tuned to normalize and standardize Malaysian postal addresses.

It expands common abbreviations, enforces consistent comma-separated formatting, and outputs uppercase standardized addresses.

⚠️ Important: This repo contains adapters only — you must load them on top of the base model. The Hosted Inference widget will not run adapters directly.


## Model Description

This model is a LoRA-fine-tuned adapter built on top of OpenLLaMA 3B v2, specialized for Malaysian address correction. It:

- Expands common local abbreviations (e.g., JLN → JALAN, TMN → TAMAN, WPKL → KUALA LUMPUR)
- Normalizes spacing and adds commas, outputting addresses in a consistent, one-line, uppercase format
- Formats addresses as `[Address/Unit], [Street], [Locality/Area], [City], [Postcode], [State]`
- Runs efficiently on modest GPUs thanks to 4-bit quantization + LoRA, and supports easy batch or interactive usage

It is ideal for developers who need clean, standardized Malaysian postal addresses for shipping labels, geocoding, or databases.
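As an illustration of the target format above, here is a minimal sketch of how structured components map to the one-line, uppercase output (the field names and helper are hypothetical; the model itself operates on free-form text):

```python
def format_address(unit, street, locality, city, postcode, state):
    """Join address components into the one-line, uppercase target format:
    [Address/Unit], [Street], [Locality/Area], [City], [Postcode], [State]."""
    parts = [unit, street, locality, city, postcode, state]
    return ", ".join(p.strip().upper() for p in parts if p)

print(format_address("11A", "Jalan BU 11/14", "Bandar Utama",
                     "Petaling Jaya", "47800", "Selangor"))
# 11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR
```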

## Model Details

- **Base model:** openlm-research/open_llama_3b_v2 (Apache-2.0)
- **Technique:** QLoRA-style PEFT (LoRA on a 4-bit base)
- **Intended users:** Developers standardizing Malaysian postal addresses

## Uses

- Correct and standardize Malaysian addresses in free-form text
- Expand common abbreviations (e.g., JLN, TMN, LRG, WPKL)
- Produce a single uppercase line suitable for label printing or geocoding prep

## Out-of-Scope Use

- Non-Malaysian address formats
- Entity verification/validation against authoritative sources
- Geocoding / latitude-longitude lookup

## Bias, Risks & Limitations

- **Formatting assumptions:** The model favors Malaysian conventions and may incorrectly reorder non-MY addresses.
- **Ambiguity:** Abbreviations like HSN may map to multiple names; defaults are rule-based and may not match all cases.
- **Hallucination:** The model can invent a locality or state if the input is severely incomplete; keep a human in the loop for critical mailings.

## Recommendations

- Keep a deterministic rule layer (abbreviation expansion + uppercasing + simple postcode/state checks).
- If you have authoritative reference lists (states, cities, postcodes), validate the final line before use.
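A minimal sketch of such a deterministic check, assuming a 5-digit postcode pattern and the standard list of Malaysian states and federal territories (the function name and rules are illustrative, not part of this repo):

```python
import re

# Malaysian states plus federal territories, used as an authoritative list.
MY_STATES = {
    "JOHOR", "KEDAH", "KELANTAN", "MELAKA", "NEGERI SEMBILAN", "PAHANG",
    "PERAK", "PERLIS", "PULAU PINANG", "SABAH", "SARAWAK", "SELANGOR",
    "TERENGGANU", "KUALA LUMPUR", "LABUAN", "PUTRAJAYA",
}

def validate_line(line):
    """Check a model output line for a 5-digit postcode and a known final state."""
    parts = [p.strip() for p in line.upper().split(",")]
    has_postcode = any(re.fullmatch(r"\d{5}", p) for p in parts)
    has_state = bool(parts) and parts[-1] in MY_STATES
    return has_postcode and has_state

print(validate_line("8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR"))
# True
```

A check like this catches hallucinated or truncated outputs before they reach a shipping label.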

## Training Details

- **Base model:** openlm-research/open_llama_3b_v2
- **Method:** LoRA fine-tuning with QLoRA (4-bit quantization)
- **Dataset:** Synthetic + manually curated Malaysian address pairs (JSONL format: instruction, input, output)
- **Task:** Causal LM, few-shot prompting with output delimiters ...
- **Epochs:** 2
- **Batch size:** 2 (gradient accumulation 8)
- **Learning rate:** 2e-4 (cosine schedule, 5% warmup)
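A sketch of the JSONL record shape described above, one JSON object per line with instruction/input/output keys (the exact instruction wording here is an assumption):

```python
import json

record = {
    "instruction": "Normalize the Malaysian address. Expand abbreviations. One line, all caps.",
    "input": "8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor",
    "output": "8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR",
}

# Each training example is one line in the .jsonl file.
line = json.dumps(record)
assert set(json.loads(line)) == {"instruction", "input", "output"}
```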

## How to use (LoRA adapter)

```python
import re

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE = "openlm-research/open_llama_3b_v2"
ADAPTER = "ramshafirdous/malaysian-address-corrector-lora"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained(BASE, use_fast=False)
if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token
base = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base, ADAPTER).eval()

def tidy_commas_upper(s):
    """Rule-based cleanup: normalize separators and whitespace, then uppercase."""
    s = re.sub(r"[\t|]+", ", ", s)
    s = re.sub(r"\s*,\s*", ", ", s)
    s = re.sub(r"\s{2,}", " ", s).strip()
    return s.upper()

# NOTE: the original delimiter strings were lost when this card was rendered
# (likely HTML-like tags); fill in the tokens used during training.
OUT_S, OUT_E = "", ""

FEWSHOT = (
    "MALAYSIAN ADDRESS NORMALIZER.\n"
    "EXPAND ABBREVIATIONS. ONE LINE. ALL CAPS.\n"
    "FORMAT: [ADDRESS], [STREET], [LOCALITY], [CITY], [POSTCODE], [STATE]\n\n"
    "Input: 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor\n"
    f"Output: {OUT_S}8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR{OUT_E}\n"
)

def correct_address(raw, max_new_tokens=128):
    prompt = f"{FEWSHOT}\nInput: {raw}\nOutput: {OUT_S}"
    enc = tok(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            repetition_penalty=1.05,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.pad_token_id,
        )
    txt = tok.decode(out[0], skip_special_tokens=True)
    # Fall back to the last "Output:" marker if no start delimiter is set
    # (str.split raises on an empty separator).
    seg = txt.split(OUT_S, 1)[-1] if OUT_S else txt.rsplit("Output:", 1)[-1]
    if OUT_E and OUT_E in seg:
        seg = seg.split(OUT_E, 1)[0]
    else:
        seg = seg.split("\n", 1)[0]
    return tidy_commas_upper(seg)

print(correct_address("11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor"))
```

## Evaluation

Qualitative validation on held-out messy inputs:

| Input | Output |
|---|---|
| 11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor | 11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR |
| LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur | LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR |
| 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor | 8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR |

## Abbreviation coverage

| Abbreviation | Expansion |
|---|---|
| JLN | JALAN |
| TMN | TAMAN |
| LRG | LORONG |
| BDR | BANDAR |
| PJS | PETALING JAYA SELATAN |
| WPKL | KUALA LUMPUR |
| KPG | KAMPUNG |
| PLG | PULAU |
| BLK | BLOK |
| LEBUH RAYA / HWY / HWAY | LEBUH RAYA |
| ... | ... |
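The deterministic rule layer recommended earlier could cover this table with a simple word-boundary substitution pass; the dictionary below is a sketch built from the table, not the repo's own code:

```python
import re

# Expansions from the coverage table above (subset; extend as needed).
ABBREV = {
    "JLN": "JALAN", "TMN": "TAMAN", "LRG": "LORONG", "BDR": "BANDAR",
    "PJS": "PETALING JAYA SELATAN", "WPKL": "KUALA LUMPUR",
    "KPG": "KAMPUNG", "PLG": "PULAU", "BLK": "BLOK",
    "HWY": "LEBUH RAYA", "HWAY": "LEBUH RAYA",
}
_PAT = re.compile(r"\b(" + "|".join(ABBREV) + r")\b")

def expand_abbrevs(text):
    """Uppercase the input, then expand known abbreviations on word boundaries."""
    return _PAT.sub(lambda m: ABBREV[m.group(1)], text.upper())

print(expand_abbrevs("8 lrg Zainal Abidin 13 kpg Pendamar Klang"))
# 8 LORONG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG
```

Word boundaries matter here: without `\b`, an abbreviation like PLG would also fire inside longer tokens.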

## Known Limitations

- The model relies on prompt patterns; inconsistent prompting may reduce accuracy.
- Does not validate postcode vs. state matches.
- May occasionally insert or omit commas if input spacing is irregular (use a rule-based post-processor like `tidy_commas_upper`).
- Trained for Malaysian addresses only.
- Not for parsing addresses into structured fields.
- Not a geocoder; it does not verify location existence.

## Model Card Authors

Author: Ramsha Firdous