ramshafirdous committed on
Commit
e5c1dd5
·
verified ·
1 Parent(s): cf08666

Update README.md

Files changed (1)
  1. README.md +106 -24
README.md CHANGED
@@ -1,5 +1,13 @@
  ---
  library_name: transformers
  tags:
  - peft
  - lora
@@ -7,13 +15,16 @@ tags:
  - address-normalization
  - address-correction
  - malaysia
- license: apache-2.0
- base_model:
- - openlm-research/open_llama_3b_v2
- pipeline_tag: text-classification
- language:
- - en
- - ms
  ---

  # Model Card for Model ID
@@ -68,39 +79,110 @@ If you have authoritative reference lists (states, cities, postcodes), validate

  ## Training Details

- ### Training Data
-
- Source: Private/local dataset created from real-world Malaysian address fragments (tab/CSV), plus pseudo-labels generated by deterministic expansion rules and tidy/uppercase standardization.
-
- Augmentation: Synthetic “messy” inputs created by replacing full forms with common abbreviations (e.g., JALAN → JLN) so the model learns to normalize them.
-
- Schema: JSON/JSONL with fields instruction, input, output.
-
- ### Training Procedure
-
- PEFT: r=8, lora_alpha=16, lora_dropout=0.1, target modules q_proj, k_proj, v_proj, o_proj
-
- Optimizer/Schedule: AdamW, lr=2e-4, cosine decay, warmup 5%
-
- Batching: per_device_train_batch_size=2, gradient_accumulation_steps=8 (effective batch ~16)
-
- Epochs: 2–4 (depending on dataset size)
-
- Precision: 4-bit NF4 base, fp16 compute
-
- Framework: transformers==4.55.x, peft, datasets, accelerate, bitsandbytes

  ## Evaluation

  Qualitative validation on held-out messy inputs:

- Input (shortened)
- 11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor
- LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur
-
- Expected Model Output
- 11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR
- LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR

  ## Model Card Authors
 
 
  ---
+ license: apache-2.0
+ base_model: openlm-research/open_llama_3b_v2
  library_name: transformers
+ pipeline_tag: text-generation
+ model_type: peft
+ adapter_type: lora
+ language:
+ - en
+ - ms
  tags:
  - peft
  - lora
  - address-normalization
  - address-correction
  - malaysia
+ ---
+
+ # Malaysian Address Corrector LoRA
+
+ This is a **LoRA adapter** for [`openlm-research/open_llama_3b_v2`](https://huggingface.co/openlm-research/open_llama_3b_v2) fine-tuned to **normalize and standardize Malaysian postal addresses**.
+
+ It expands common abbreviations, enforces consistent comma-separated formatting, and outputs **uppercase** standardized addresses.
+
+ ⚠️ **Important:** This repo contains **adapters only**; you must load them on top of the base model. The Hosted Inference widget will not run adapters directly.
+
  ---

  # Model Card for Model ID
 

  ## Training Details

+ Base model: openlm-research/open_llama_3b_v2
+
+ Method: LoRA fine-tuning with QLoRA (4-bit NF4 quantization)
+
+ Dataset: Synthetic and manually curated Malaysian address pairs (JSONL fields: instruction, input, output)
+
+ Task: Causal LM with few-shot prompting and output delimiters <OUT>...</OUT>
+
+ Epochs: 2
+
+ Batch size: 2 (gradient accumulation 8, effective ~16)
+
+ LR: 2e-4 (cosine schedule, warmup 5%)
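For reference, the "Training Procedure" notes removed earlier in this diff quote the adapter hyperparameters (r=8, lora_alpha=16, lora_dropout=0.1, attention projections). As a non-authoritative sketch, they map onto a PEFT config like:

```python
from peft import LoraConfig

# Sketch of the adapter configuration implied by the quoted hyperparameters;
# the exact config shipped with the released adapter may differ.
lora_config = LoraConfig(
    r=8,                     # LoRA rank
    lora_alpha=16,           # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    bias="none",
    task_type="CAUSAL_LM",
)
```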

+ ## How to use (LoRA adapter)
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import PeftModel
+ import torch, re
+
+ BASE = "openlm-research/open_llama_3b_v2"
+ ADAPTER = "ramshafirdous/malaysian-address-corrector-lora"
+
+ bnb = BitsAndBytesConfig(
+     load_in_4bit=True, bnb_4bit_quant_type="nf4",
+     bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16,
+ )
+
+ tok = AutoTokenizer.from_pretrained(BASE, use_fast=False)
+ if tok.pad_token_id is None:
+     tok.pad_token = tok.eos_token
+ base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto", trust_remote_code=True)
+ model = PeftModel.from_pretrained(base, ADAPTER).eval()
+
+ def tidy_commas_upper(s):
+     s = re.sub(r"[\t|]+", ", ", s)
+     s = re.sub(r"\s*,\s*", ", ", s)
+     s = re.sub(r"\s{2,}", " ", s).strip()
+     return s.upper()
+
+ OUT_S, OUT_E = "<OUT>", "</OUT>"
+ FEWSHOT = (
+     "MALAYSIAN ADDRESS NORMALIZER.\n"
+     "EXPAND ABBREVIATIONS. ONE LINE. ALL CAPS.\n"
+     "FORMAT: [ADDRESS], [STREET], [LOCALITY], [CITY], [POSTCODE], [STATE]\n\n"
+     "Input: 8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor\n"
+     f"Output: {OUT_S}8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR{OUT_E}\n"
+ )
+
+ def correct_address(raw, max_new_tokens=128):
+     prompt = f"{FEWSHOT}\nInput: {raw}\nOutput: {OUT_S}"
+     enc = tok(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
+     with torch.no_grad():
+         out = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False,
+                              repetition_penalty=1.05, eos_token_id=tok.eos_token_id,
+                              pad_token_id=tok.pad_token_id)
+     txt = tok.decode(out[0], skip_special_tokens=True)
+     seg = txt.split(OUT_S, 1)[-1]
+     seg = seg.split(OUT_E, 1)[0] if OUT_E in seg else seg.split("\n", 1)[0]
+     return tidy_commas_upper(seg)
+
+ print(correct_address("11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor"))
+ ```
 
  ## Evaluation

  Qualitative validation on held-out messy inputs:

+ | Input | Output |
+ | --- | --- |
+ | `11A, JALAN BU 11/14, BANDAR UTAMA PETALING JAYA 47800 Selangor` | `11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR` |
+ | `LEVEL 30 THE GARDENS NORTH TOWER MID VALLEY CITY 59200 WP Kuala Lumpur` | `LEVEL 30, THE GARDENS NORTH TOWER, MID VALLEY CITY, 59200, KUALA LUMPUR` |
+ | `8 LRG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG 41200 Selangor` | `8, LORONG ZAINAL ABIDIN 13, KAMPUNG PENDAMAR, KLANG, 41200, SELANGOR` |
+
+ ## Abbreviation coverage
+
+ | Abbreviation | Expansion |
+ | --- | --- |
+ | JLN | JALAN |
+ | TMN | TAMAN |
+ | LRG | LORONG |
+ | BDR | BANDAR |
+ | PJS | PETALING JAYA SELATAN |
+ | WPKL | KUALA LUMPUR |
+ | KPG | KAMPUNG |
+ | PLG | PULAU |
+ | BLK | BLOK |
+ | HWY / HWAY | LEBUH RAYA |
+ | ... | ... |
+
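The table above is essentially a token-replacement dictionary; a minimal rule-based sketch (using a hypothetical subset of the entries, with word-boundary regexes so e.g. `JLN` expands but `JLNX` does not) looks like:

```python
import re

# Illustrative subset of the abbreviation table; extend with the full list.
ABBREVIATIONS = {
    "JLN": "JALAN",
    "TMN": "TAMAN",
    "LRG": "LORONG",
    "BDR": "BANDAR",
    "KPG": "KAMPUNG",
    "BLK": "BLOK",
}

def expand_abbreviations(address: str) -> str:
    """Uppercase the address and expand whole-word abbreviations."""
    out = address.upper()
    for abbr, full in ABBREVIATIONS.items():
        out = re.sub(rf"\b{abbr}\b", full, out)
    return out

print(expand_abbreviations("8 LRG ZAINAL ABIDIN 13 KPG PENDAMAR KLANG"))
# 8 LORONG ZAINAL ABIDIN 13 KAMPUNG PENDAMAR KLANG
```

A pass like this can serve as a cheap pre- or post-processing step alongside the model.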
+
+ ## Known Limitations
+
+ - The model relies on prompt patterns; inconsistent prompting may reduce accuracy.
+ - It does not validate that a postcode matches its state.
+ - It may occasionally insert or omit commas when input spacing is irregular (apply a rule-based post-processor such as `tidy_commas_upper`).
+ - It is trained on Malaysian addresses only.
+ - It does not parse addresses into structured fields.
+ - It is not a geocoder and does not verify that a location exists.
+
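Since the model does not check postcode/state consistency, a downstream validator can fill that gap. The sketch below uses illustrative placeholder ranges, not an authoritative reference list (which the card recommends sourcing separately):

```python
import re

# Illustrative, partial postcode ranges; replace with an authoritative list.
POSTCODE_RANGES = {
    "SELANGOR": (40000, 48999),
    "KUALA LUMPUR": (50000, 60000),
}

def postcode_matches_state(address: str) -> bool:
    """Check the 5-digit postcode against the claimed state's range, assuming
    the normalized format ..., [POSTCODE], [STATE]. Unknown states pass."""
    parts = [p.strip() for p in address.split(",")]
    if len(parts) < 2:
        return True
    state, postcode = parts[-1], parts[-2]
    if not re.fullmatch(r"\d{5}", postcode):
        return False
    rng = POSTCODE_RANGES.get(state.upper())
    if rng is None:
        return True  # no rule for this state
    return rng[0] <= int(postcode) <= rng[1]

print(postcode_matches_state(
    "11A, JALAN BU 11/14, BANDAR UTAMA, PETALING JAYA, 47800, SELANGOR"))  # True
```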
  ## Model Card Authors