yhavinga committed on
Commit 003e9be · verified · 1 Parent(s): 7335bd5

Update README.md

Files changed (1)
  1. README.md +262 -29
README.md CHANGED
@@ -11,53 +11,286 @@ tags:
  - unsloth
  licence: license
  pipeline_tag: text-generation
  ---
 
- # Model Card for n5_label_addition_model
 
- This model is a fine-tuned version of [unsloth/Phi-4-mini-instruct-bnb-4bit](https://huggingface.co/unsloth/Phi-4-mini-instruct-bnb-4bit).
- It has been trained using [TRL](https://github.com/huggingface/trl).
 
- ## Quick start
 
  ```python
- from transformers import pipeline
 
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="None", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
  ```
 
- ## Training procedure
 
- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/yepster/n5-label-addition-unsloth/runs/xz3qtjtl)
 
- This model was trained with SFT.
 
- ### Framework versions
 
- - PEFT 0.16.0
- - TRL: 0.19.1
- - Transformers: 4.53.1
- - Pytorch: 2.7.1
- - Datasets: 4.0.0
- - Tokenizers: 0.21.2
 
- ## Citations
 
- Cite TRL as:
-
  ```bibtex
- @misc{vonwerra2022trl,
-     title = {{TRL: Transformer Reinforcement Learning}},
-     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-     year = 2020,
-     journal = {GitHub repository},
-     publisher = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
  }
  ```
 
  - unsloth
  licence: license
  pipeline_tag: text-generation
+ license: apache-2.0
+ datasets:
+ - UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps
+ language:
+ - nl
  ---
 
+ # Phi-4-mini N5 Label Addition Fine-tune
+
+ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct), optimized for adding human-readable labels (rdfs:label) to JSON-LD structures. It was trained as part of the WIM (Text-to-Knowledge Graph) pipeline on the signaalberichten dataset.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** UWV InnovatieHub
+ - **Model type:** Causal Language Model with LoRA fine-tuning
+ - **Language(s):** Dutch (nl)
+ - **License:** MIT
+ - **Finetuned from:** microsoft/Phi-4-mini-instruct (3.82B parameters)
+ - **Training Framework:** Unsloth (optimized for efficient training)
+
+ ### Training Details
+
+ - **Dataset:** [UWV/wim_instruct_signaalberichten_to_jsonld_agent_steps](https://huggingface.co/datasets/UWV/wim_instruct_signaalberichten_to_jsonld_agent_steps)
+ - **Dataset Size:** 4,525 N5-specific examples (label addition tasks)
+ - **Training Duration:** 1 hour 44 minutes
+ - **Hardware:** NVIDIA A100 80GB
+ - **Epochs:** 3.1
+ - **Steps:** 1,735
+ - **Training Metrics:**
+   - Final Training Loss: 0.7864
+   - Training samples/second: 2.209
+   - Learning rate (final): 6.26e-10
+
+ ### LoRA Configuration
+
+ ```python
+ {
+     "r": 512,                # Large rank for quality
+     "lora_alpha": 1024,      # Alpha (2:1 ratio)
+     "lora_dropout": 0.1,     # Higher dropout for small dataset
+     "bias": "none",
+     "task_type": "CAUSAL_LM",
+     "target_modules": [
+         "q_proj", "k_proj", "v_proj", "o_proj"  # Attention layers only
+     ]
+ }
+ ```
+
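+ For reference, the adapter settings above roughly correspond to the following PEFT configuration (a minimal sketch for illustration, not the exact training script):
+
+ ```python
+ from peft import LoraConfig
+
+ # Reconstruction of the adapter configuration listed above (illustrative only)
+ lora_config = LoraConfig(
+     r=512,
+     lora_alpha=1024,
+     lora_dropout=0.1,
+     bias="none",
+     task_type="CAUSAL_LM",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+ )
+ ```
+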
+ ### Training Configuration
+
+ ```python
+ {
+     "model": "phi4-mini",
+     "max_seq_length": 4096,
+     "batch_size": 8,
+     "gradient_accumulation_steps": 1,
+     "effective_batch_size": 8,
+     "learning_rate": 2e-5,
+     "warmup_steps": 50,
+     "max_grad_norm": 1.0,
+     "lr_scheduler": "cosine",
+     "optimizer": "paged_adamw_8bit",
+     "bf16": True,
+     "seed": 42
+ }
+ ```
+
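+ These values map roughly onto standard Hugging Face training arguments; the sketch below is an approximation for orientation (the argument names are the generic `TrainingArguments` fields, `output_dir` is a placeholder, and the original Unsloth training script may differ):
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Approximate equivalent of the training configuration above
+ training_args = TrainingArguments(
+     output_dir="outputs",                  # placeholder
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=1,         # effective batch size 8
+     learning_rate=2e-5,
+     warmup_steps=50,
+     max_grad_norm=1.0,
+     lr_scheduler_type="cosine",
+     optim="paged_adamw_8bit",
+     bf16=True,
+     seed=42,
+     num_train_epochs=3,                    # ~3.1 epochs reported above
+ )
+ ```
+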
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+
+ - **Label Addition**: Add human-readable Dutch labels (rdfs:label) to JSON-LD structures
+ - **Knowledge Graph Enhancement**: Fifth step (N5) in the WIM pipeline
+ - **Government Services**: Optimized for citizen complaints and government service descriptions
+ - **JSON-LD Enrichment**: Make knowledge graphs more accessible with descriptive labels
+
+ ### Limitations
+
+ - Trained on signaalberichten dataset (different domain than N1-N3)
+ - Best performance on government/municipal service contexts
+ - Requires well-formed JSON-LD as input
+ - Limited to 4K token context (sufficient for label addition)
+ - Small training dataset (4,525 examples)
+
+ ## How to Use
+
+ ### Option 1: Using the Merged Model (Recommended)
+
104
+ ```python
105
+ from transformers import AutoModelForCausalLM, AutoTokenizer
106
+ import torch
107
+ import json
108
+
109
+ # Load the merged model (ready to use)
110
+ model = AutoModelForCausalLM.from_pretrained(
111
+ "UWV/wim-n5-phi4-mini-merged",
112
+ torch_dtype=torch.bfloat16,
113
+ device_map="auto",
114
+ trust_remote_code=True
115
+ )
116
+ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")
117
+
118
+ # Prepare input - JSON-LD without labels (citizen complaint)
119
+ json_ld = {
120
+ "@context": "https://schema.org",
121
+ "@type": "Report",
122
+ "about": {
123
+ "@type": "CivicStructure",
124
+ "name": "Speeltuin Vondelpark"
125
+ },
126
+ "reportedBy": {
127
+ "@type": "Person",
128
+ "address": {
129
+ "@type": "PostalAddress",
130
+ "addressLocality": "Amsterdam"
131
+ }
132
+ }
133
+ }
134
+
135
+ messages = [
136
+ {
137
+ "role": "system",
138
+ "content": "Je bent een expert in het toevoegen van Nederlandse labels aan JSON-LD."
139
+ },
140
+ {
141
+ "role": "user",
142
+ "content": f"""Voeg rdfs:label toe aan de volgende JSON-LD:
143
+
144
+ {json.dumps(json_ld, ensure_ascii=False, indent=2)}
145
+
146
+ Geef de complete JSON-LD terug met labels."""
147
+ }
148
+ ]
149
+
150
+ # Apply chat template and generate
151
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
152
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
153
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
154
+
155
+ with torch.no_grad():
156
+ outputs = model.generate(
157
+ **inputs,
158
+ max_new_tokens=1000,
159
+ temperature=0.1, # Low temperature for consistent labeling
160
+ do_sample=True,
161
+ top_p=0.95,
162
+ pad_token_id=tokenizer.pad_token_id,
163
+ eos_token_id=tokenizer.eos_token_id,
164
+ )
165
+
166
+ # Decode response
167
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
168
+ if "assistant:" in response:
169
+ response = response.split("assistant:")[-1].strip()
170
+
171
+ print(response)
172
+ ```
173
+
+ ### Option 2: Using the LoRA Adapter
 
  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+ import torch
 
+ # Load base model
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "microsoft/Phi-4-mini-instruct",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+
+ # Load adapter
+ model = PeftModel.from_pretrained(
+     base_model,
+     "UWV/wim-n5-phi4-mini-adapter"
+ )
+ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-adapter")
+
+ # Use same inference code as above...
  ```
 
+ ## Expected Output Format
+
+ The model adds `rdfs:label` properties to make JSON-LD more human-readable:
+
+ ```json
+ {
+   "@context": "https://schema.org",
+   "@type": "Report",
+   "rdfs:label": "Melding",
+   "about": {
+     "@type": "CivicStructure",
+     "rdfs:label": "Speeltuin Vondelpark",
+     "name": "Speeltuin Vondelpark"
+   },
+   "reportedBy": {
+     "@type": "Person",
+     "rdfs:label": "Melder",
+     "address": {
+       "@type": "PostalAddress",
+       "rdfs:label": "Adres in Amsterdam",
+       "addressLocality": "Amsterdam"
+     }
+   }
+ }
+ ```
+
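+ The generated text may contain extra tokens around the JSON-LD, so a small amount of post-processing helps. The helper below is a sketch (it assumes the response contains exactly one top-level JSON object; `response` comes from the inference example above):
+
+ ```python
+ import json
+
+ def extract_json_ld(response: str) -> dict:
+     """Extract and parse the first top-level JSON object in the model output."""
+     start = response.find("{")
+     end = response.rfind("}")
+     if start == -1 or end <= start:
+         raise ValueError("No JSON object found in model output")
+     return json.loads(response[start:end + 1])
+
+ labeled = extract_json_ld(response)
+ print(labeled.get("rdfs:label"))
+ ```
+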
+ ## Dataset Information
+
+ The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:
+
+ - **Source**: Signaalberichten (citizen complaints to municipalities)
+ - **Domain**: Government services and municipal operations
+ - **N5 Examples**: 4,525 label addition tasks
+ - **Average Token Length**: 1,636 tokens
+ - **Max Token Length**: 2,332 tokens
+ - **Format**: ChatML-formatted instruction-following examples
+ - **Task**: Add Dutch rdfs:label properties to JSON-LD
+
+ **Important**: This dataset is different from the Wikipedia-based dataset used for N1-N3 models.
+
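+ To inspect the data, the dataset can be loaded with the `datasets` library. This is a minimal sketch: the split name and column layout are assumptions, so check the dataset card for the actual schema:
+
+ ```python
+ from datasets import load_dataset
+
+ # Split name "train" is an assumption; see the dataset card for the real splits
+ ds = load_dataset("UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps", split="train")
+ print(ds)      # prints the actual column names and row count
+ print(ds[0])   # inspect one ChatML-formatted example
+ ```
+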
+ ## Training Results
+
+ The model completed 3.1 epochs through the dataset:
+
+ - **Final Training Loss**: 0.7864
+ - **Training Efficiency**: 2.209 samples/second
+
+ ### Loss Progression
+
+ - Started at ~1.13 loss
+ - Rapid improvement in the first epoch
+ - Stable convergence throughout training
+ - Final learning rate: 6.26e-10 (cosine decay)
+ - Gradient norms: stable around 0.6-0.7
+
+ ## Model Versions
+
+ - **Merged Model**: `UWV/wim-n5-phi4-mini-merged`
+   - Note: the merge failed due to a known Phi-4 issue
+   - Adapter weights were saved instead
+   - The model works fine for inference
+
+ - **LoRA Adapter**: `UWV/wim-n5-phi4-mini-adapter` (~2.29 GB)
+   - Requires the base Phi-4-mini-instruct model
+   - Large adapter due to r=512
+   - Includes all training configurations
+
+ ## Pipeline Context
+
+ This model is part of the WIM (Text-to-Knowledge Graph) pipeline:
+
+ 1. **N1**: Entity Extraction
+ 2. **N2**: Schema.org Type Selection
+ 3. **N3**: Transform to JSON-LD
+ 4. **N4**: Validation
+ 5. **N5 (This Model)**: Add Human-Readable Labels
+
+ N5 is trained on a different dataset (signaalberichten) than N1-N3, focusing on government services and citizen interactions rather than encyclopedic content.
+
+ ## Performance Characteristics
+
+ - **Sequence Length**: Average 1,636 tokens (moderate length)
+ - **Batch Processing**: Can handle batch size 8 with 4K context
+ - **Inference Speed**: Fast label addition to existing JSON-LD
+ - **Memory Usage**: ~10GB VRAM with 4K context
+ - **Domain**: Specialized for Dutch government/municipal contexts
+
+ ## Citation
+
+ If you use this model, please cite:
+
  ```bibtex
+ @misc{wim-n5-phi4-mini,
+     author = {UWV InnovatieHub},
+     title = {Phi-4-mini N5 Label Addition Model},
+     year = {2025},
+     publisher = {HuggingFace},
+     url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
  }
  ```