ramshafirdous committed on
Commit
e5c1dd5
·
verified ·
1 Parent(s): cf08666

Update README.md

Files changed (1)
  1. README.md +106 -24
README.md CHANGED
@@ -1,5 +1,13 @@
  ---
  library_name: transformers
  tags:
  - peft
  - lora
@@ -7,13 +15,16 @@ tags:
  - address-normalization
  - address-correction
  - malaysia
- license: apache-2.0
- base_model:
- - openlm-research/open_llama_3b_v2
- pipeline_tag: text-classification
- language:
- - en
- - ms
  ---

  # Model Card for Model ID
@@ -68,39 +79,110 @@ If you have authoritative reference lists (states, cities, postcodes), validate

  ## Training Details

- ### Training Data
-
- Source: Private/local dataset created from real-world Malaysian address fragments (tab/CSV), plus pseudo-labels generated by deterministic expansion rules and tidy/uppercase standardization.
-
- Augmentation: Synthetic “messy” inputs created by replacing full forms with common abbreviations (e.g., JALAN → JLN) so the model learns to normalize them.
-
- Schema: JSON/JSONL with fields instruction, input, output.
-
- ### Training Procedure
-
- PEFT: r=8, lora_alpha=16, lora_dropout=0.1, target modules q_proj, k_proj, v_proj, o_proj
-
- Optimizer/Schedule: AdamW, lr=2e-4, cosine decay, warmup 5%
-
- Batching: per_device_train_batch_size=2, gradient_accumulation_steps=8 (effective batch ~16)
-
- Epochs: 2–4 (depending on dataset size)
-
- Precision: 4-bit NF4 base, fp16 compute
-
- Framework: transformers==4.55.x, peft, datasets, accelerate, bitsandbytes

  ## Evaluation

  Qualitative validation on held-out messy inputs:

- Input (shortened)
- 11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor
- LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur
-
- Expected Model Output
- 11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR
- LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR

  ## Model Card Authors
 
 
  ---
+ license: apache-2.0
+ base_model: openlm-research/open_llama_3b_v2
  library_name: transformers
+ pipeline_tag: text-generation
+ model_type: peft
+ adapter_type: lora
+ language:
+ - en
+ - ms
  tags:
  - peft
  - lora
  - address-normalization
  - address-correction
  - malaysia
+ ---
+
+ # Malaysian Address Corrector LoRA
+
+ This is a **LoRA adapter** for [`openlm-research/open_llama_3b_v2`](https://huggingface.co/openlm-research/open_llama_3b_v2) fine-tuned to **normalize and standardize Malaysian postal addresses**.
+
+ It expands common abbreviations, enforces consistent comma-separated formatting, and outputs **uppercase** standardized addresses.
+
+ ⚠️ **Important:** This repo contains **adapters only**; you must load them on top of the base model. The Hosted Inference widget will not run adapters directly.
+
  ---

  # Model Card for Model ID
 

  ## Training Details

+ Base model: openlm-research/open_llama_3b_v2
+
+ Method: LoRA fine-tuning with QLoRA (4-bit NF4 quantization)
+
+ Dataset: Synthetic and manually curated Malaysian address pairs (JSONL fields: instruction, input, output)
+
+ Task: Causal LM with few-shot prompting and output delimiters <OUT>...</OUT>
+
+ Epochs: 2
+
+ Batch size: 2 (gradient accumulation 8, effective ~16)
+
+ LR: 2e-4 (cosine schedule, warmup 5%)
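For reference, the "Training Procedure" notes removed earlier in this diff quote the adapter hyperparameters (r=8, lora_alpha=16, lora_dropout=0.1, attention projections). As a non-authoritative sketch, they map onto a PEFT config like:

```python
from peft import LoraConfig

# Sketch of the adapter configuration implied by the quoted hyperparameters;
# the exact config shipped with the released adapter may differ.
lora_config = LoraConfig(
    r=8,                     # LoRA rank
    lora_alpha=16,           # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    bias="none",
    task_type="CAUSAL_LM",
)
```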

+ ## How to use (LoRA adapter)
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import PeftModel
+ import torch, re
+
+ BASE = "openlm-research/open_llama_3b_v2"
+ ADAPTER = "ramshafirdous/malaysian-address-corrector-lora"
+
+ bnb = BitsAndBytesConfig(
+     load_in_4bit=True, bnb_4bit_quant_type="nf4",
+     bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16,
+ )
+
+ tok = AutoTokenizer.from_pretrained(BASE, use_fast=False)
+ if tok.pad_token_id is None:
+     tok.pad_token = tok.eos_token
+ base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto", trust_remote_code=True)
+ model = PeftModel.from_pretrained(base, ADAPTER).eval()
+
+ def tidy_commas_upper(s):
+     s = re.sub(r"[\t|]+", ", ", s)
+     s = re.sub(r"\s*,\s*", ", ", s)
+     s = re.sub(r"\s{2,}", " ", s).strip()
+     return s.upper()
+
+ OUT_S, OUT_E = "<OUT>", "</OUT>"
+ FEWSHOT = (
+     "MALAYSIAN ADDRESS NORMALIZER.\n"
+     "EXPAND ABBREVIATIONS. ONE LINE. ALL CAPS.\n"
+     "FORMAT: [ADDRESS], [STREET], [LOCALITY], [CITY], [POSTCODE], [STATE]\n\n"
+     "Input: 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor\n"
+     f"Output: {OUT_S}8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR{OUT_E}\n"
+ )
+
+ def correct_address(raw, max_new_tokens=128):
+     prompt = f"{FEWSHOT}\nInput: {raw}\nOutput: {OUT_S}"
+     enc = tok(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
+     with torch.no_grad():
+         out = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False,
+                              repetition_penalty=1.05, eos_token_id=tok.eos_token_id,
+                              pad_token_id=tok.pad_token_id)
+     txt = tok.decode(out[0], skip_special_tokens=True)
+     seg = txt.split(OUT_S, 1)[-1]
+     seg = seg.split(OUT_E, 1)[0] if OUT_E in seg else seg.split("\n", 1)[0]
+     return tidy_commas_upper(seg)
+
+ print(correct_address("11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor"))
+ ```
 
  ## Evaluation

  Qualitative validation on held-out messy inputs:

+ | Input | Output |
+ | --- | --- |
+ | `11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor` | `11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR` |
+ | `LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur` | `LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR` |
+ | `8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor` | `8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR` |
+
+ ## Abbreviation coverage
+
+ | Abbreviation | Expansion |
+ | --- | --- |
+ | JLN | JALAN |
+ | TMN | TAMAN |
+ | LRG | LORONG |
+ | BDR | BANDAR |
+ | PJS | PETALING JAYA SELATAN |
+ | WPKL | KUALA LUMPUR |
+ | KPG | KAMPUNG |
+ | PLG | PULAU |
+ | BLK | BLOK |
+ | HWY / HWAY | LEBUH RAYA |
+ | ... | ... |
+
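The table above is essentially a token-replacement dictionary; a minimal rule-based sketch (using a hypothetical subset of the entries, with word-boundary regexes so e.g. `JLN` expands but `JLNX` does not) looks like:

```python
import re

# Illustrative subset of the abbreviation table; extend with the full list.
ABBREVIATIONS = {
    "JLN": "JALAN",
    "TMN": "TAMAN",
    "LRG": "LORONG",
    "BDR": "BANDAR",
    "KPG": "KAMPUNG",
    "BLK": "BLOK",
}

def expand_abbreviations(address: str) -> str:
    """Uppercase the address and expand whole-word abbreviations."""
    out = address.upper()
    for abbr, full in ABBREVIATIONS.items():
        out = re.sub(rf"\b{abbr}\b", full, out)
    return out

print(expand_abbreviations("8 LRG ZAINAL ABIDIN 13 KPG PENDAMAR KLANG"))
# 8 LORONG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG
```

A pass like this can serve as a cheap pre- or post-processing step alongside the model.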
+
+ ## Known Limitations
+
+ - The model relies on prompt patterns; inconsistent prompting may reduce accuracy.
+ - It does not validate that a postcode matches its state.
+ - It may occasionally insert or omit commas when input spacing is irregular (apply a rule-based post-processor such as `tidy_commas_upper`).
+ - It is trained on Malaysian addresses only.
+ - It does not parse addresses into structured fields.
+ - It is not a geocoder and does not verify that a location exists.
+
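Since the model does not check postcode/state consistency, a downstream validator can fill that gap. The sketch below uses illustrative placeholder ranges, not an authoritative reference list (which the card recommends sourcing separately):

```python
import re

# Illustrative, partial postcode ranges; replace with an authoritative list.
POSTCODE_RANGES = {
    "SELANGOR": (40000, 48999),
    "KUALA LUMPUR": (50000, 60000),
}

def postcode_matches_state(address: str) -> bool:
    """Check the 5-digit postcode against the claimed state's range, assuming
    the normalized format ..., [POSTCODE], [STATE]. Unknown states pass."""
    parts = [p.strip() for p in address.split(",")]
    if len(parts) < 2:
        return True
    state, postcode = parts[-1], parts[-2]
    if not re.fullmatch(r"\d{5}", postcode):
        return False
    rng = POSTCODE_RANGES.get(state.upper())
    if rng is None:
        return True  # no rule for this state
    return rng[0] <= int(postcode) <= rng[1]

print(postcode_matches_state(
    "11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR"))  # True
```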
  ## Model Card Authors