yhavinga committed
Commit b749117 · verified · 1 Parent(s): 212c8e5

Update README.md

Files changed (1)
  1. README.md +246 -29
README.md CHANGED
@@ -11,53 +11,270 @@ tags:
  - unsloth
  licence: license
  pipeline_tag: text-generation
  ---

- # Model Card for n2_schema_retrieval_model

- This model is a fine-tuned version of [unsloth/Phi-4-mini-instruct-bnb-4bit](https://huggingface.co/unsloth/Phi-4-mini-instruct-bnb-4bit).
- It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start

  ```python
- from transformers import pipeline

- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="None", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
  ```

- ## Training procedure

- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/yepster/n2-schema-retrieval-unsloth/runs/5pwqqkrw)

- This model was trained with SFT.

- ### Framework versions

- - PEFT 0.16.0
- - TRL: 0.19.1
- - Transformers: 4.53.1
- - Pytorch: 2.7.1
- - Datasets: 4.0.0
- - Tokenizers: 0.21.2

- ## Citations

- Cite TRL as:
-
  ```bibtex
- @misc{vonwerra2022trl,
-     title = {{TRL: Transformer Reinforcement Learning}},
-     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-     year = 2020,
-     journal = {GitHub repository},
-     publisher = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
  }
  ```
 
  - unsloth
  licence: license
  pipeline_tag: text-generation
+ license: apache-2.0
+ datasets:
+ - UWV/wim-instruct-wiki-to-jsonld-agent-steps
+ language:
+ - nl
  ---

+ # Phi-4-mini N2 Schema.org Retrieval Fine-tune
+
+ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct), optimized for Schema.org type selection from entity descriptions and trained as part of the WIM (Wikipedia to Knowledge Graph) pipeline.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** UWV InnovatieHub
+ - **Model type:** Causal Language Model with LoRA fine-tuning
+ - **Language(s):** Dutch (nl)
+ - **License:** MIT
+ - **Fine-tuned from:** microsoft/Phi-4-mini-instruct (3.82B parameters)
+ - **Training Framework:** Unsloth (memory-efficient, accelerated fine-tuning)
+
+ ### Training Details
+
+ - **Dataset:** [UWV/wim-instruct-wiki-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-wiki-to-jsonld-agent-steps)
+ - **Dataset Size:** 104,684 N2-specific examples (schema retrieval tasks)
+ - **Training Duration:** 16 hours 33 minutes
+ - **Hardware:** NVIDIA A100 80GB
+ - **Epochs:** 1.56
+ - **Steps:** 5,000
+ - **Training Metrics:**
+   - Final Training Loss: 0.9303
+   - Final Eval Loss: 0.7903
+   - Training samples/second: 2.684
+   - Gradient norm (final): ~0.57
+
+ ### LoRA Configuration

  ```python
+ {
+     "r": 512,              # Rank (same as N1 for consistency)
+     "lora_alpha": 1024,    # Alpha (2:1 ratio)
+     "lora_dropout": 0.05,  # Dropout for regularization
+     "bias": "none",
+     "task_type": "CAUSAL_LM",
+     "target_modules": [
+         "q_proj", "k_proj", "v_proj", "o_proj"  # Attention layers only
+     ]
+ }
+ ```
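+
+ For reference, a minimal sketch of the same settings expressed as a `peft.LoraConfig` (the original training script is not included in this repo, so this is illustrative rather than the exact code used):
+
+ ```python
+ from peft import LoraConfig
+
+ # Sketch only: mirrors the dictionary above.
+ lora_config = LoraConfig(
+     r=512,
+     lora_alpha=1024,
+     lora_dropout=0.05,
+     bias="none",
+     task_type="CAUSAL_LM",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+ )
+ ```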
+
+ ### Training Configuration
+
+ ```python
+ {
+     "model": "phi4-mini",
+     "max_seq_length": 8192,
+     "batch_size": 32,
+     "gradient_accumulation_steps": 1,
+     "effective_batch_size": 32,
+     "learning_rate": 2e-5,
+     "warmup_steps": 100,
+     "max_grad_norm": 1.0,
+     "lr_scheduler": "cosine",
+     "optimizer": "paged_adamw_8bit",
+     "bf16": True,
+     "seed": 42
+ }
  ```
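+
+ The training script itself is not shipped with this card; as a rough sketch, the hyperparameters above map onto TRL's `SFTTrainer` (which Unsloth wraps) roughly as follows. The `output_dir`, `model`, and `train_dataset` objects are placeholders, and sequence-length/packing arguments differ between TRL versions:
+
+ ```python
+ from trl import SFTConfig, SFTTrainer
+
+ # Illustrative only: mirrors the configuration above, not the original WIM training script.
+ training_args = SFTConfig(
+     output_dir="outputs/wim-n2-phi4-mini",   # placeholder path
+     per_device_train_batch_size=32,
+     gradient_accumulation_steps=1,
+     learning_rate=2e-5,
+     warmup_steps=100,
+     max_grad_norm=1.0,
+     lr_scheduler_type="cosine",
+     optim="paged_adamw_8bit",
+     bf16=True,
+     seed=42,
+     max_steps=5000,
+     # sequence-length handling (max_seq_length/max_length, packing) varies across TRL versions
+ )
+
+ # `model`, `tokenizer`, and `train_dataset` would be prepared elsewhere (e.g. via Unsloth and the dataset below).
+ trainer = SFTTrainer(model=model, args=training_args, train_dataset=train_dataset)
+ trainer.train()
+ ```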
+
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+
+ - **Schema.org Type Selection**: Select appropriate Schema.org types for entities
+ - **Knowledge Graph Construction**: Second step (N2) in the WIM pipeline
+ - **Entity Classification**: Map entity descriptions to standardized Schema.org vocabulary
+ - **High-throughput Processing**: Optimized for batch processing with short sequences
+
+ ### Limitations
+
+ - Optimized for Schema.org vocabulary only
+ - Best performance on entity descriptions from encyclopedic content
+ - Requires entity descriptions from N1 output
+ - Limited to 8K token context (sufficient for all N2 examples)
+
+ ## How to Use
+
+ ### Option 1: Using the Merged Model (Recommended)
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+ import json
+
+ # Load the merged model (ready to use)
+ model = AutoModelForCausalLM.from_pretrained(
+     "UWV/wim-n2-phi4-mini-merged",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n2-phi4-mini-merged")
+
+ # Prepare input (example from Dutch Wikipedia)
+ entities = [
+     {
+         "name": "Pedro Nunesplein",
+         "description": "Een plein in Amsterdam genoemd naar Pedro Nunes"
+     },
+     {
+         "name": "Amsterdam",
+         "description": "Hoofdstad van Nederland"
+     }
+ ]
+
+ # Dutch prompt (translation): system = "You are an expert in schema.org vocabulary and semantic
+ # mapping."; user = "Select the most fitting Schema.org type for each entity: ...
+ # Return a JSON array with each entity and the Schema.org type."
+ messages = [
+     {
+         "role": "system",
+         "content": "Je bent een expert in schema.org vocabulaire en semantische mapping."
+     },
+     {
+         "role": "user",
+         "content": f"""Selecteer voor elke entiteit het meest passende Schema.org type:
+
+ {json.dumps(entities, ensure_ascii=False, indent=2)}
+
+ Geef een JSON array met elke entiteit en het Schema.org type."""
+     }
+ ]
+
+ # Apply chat template and generate
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=8192)
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=500,
+         temperature=0.1,  # Low temperature for consistent classification
+         do_sample=True,
+         top_p=0.95,
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id,
+     )
+
+ # Decode only the newly generated tokens (skip the echoed prompt)
+ response = tokenizer.decode(
+     outputs[0][inputs["input_ids"].shape[1]:],
+     skip_special_tokens=True,
+ ).strip()
+
+ print(response)
+ ```
+
+ ### Option 2: Using the LoRA Adapter
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+ import torch
+
+ # Load base model
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "microsoft/Phi-4-mini-instruct",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+
+ # Load adapter
+ model = PeftModel.from_pretrained(
+     base_model,
+     "UWV/wim-n2-phi4-mini-adapter"
+ )
+ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n2-phi4-mini-adapter")
+
+ # Use the same inference code as above...
+ ```
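+
+ If you prefer a standalone checkpoint instead of loading the adapter at runtime, the LoRA weights can be folded into the base model with PEFT's `merge_and_unload` (a minimal sketch continuing from the snippet above; the local output path is illustrative):
+
+ ```python
+ # Fold the LoRA weights into the base model and save a standalone copy.
+ merged_model = model.merge_and_unload()                           # `model` is the PeftModel from above
+ merged_model.save_pretrained("./wim-n2-phi4-mini-merged-local")   # illustrative local path
+ tokenizer.save_pretrained("./wim-n2-phi4-mini-merged-local")
+ ```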
+
+ ## Expected Output Format
+
+ The model outputs JSON with Schema.org type selections:
+
+ ```json
+ [
+   {
+     "name": "Pedro Nunesplein",
+     "schema_type": "Place",
+     "schema_url": "https://schema.org/Place"
+   },
+   {
+     "name": "Amsterdam",
+     "schema_type": "City",
+     "schema_url": "https://schema.org/City"
+   }
+ ]
+ ```
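+
+ Depending on sampling, the model may occasionally wrap this array in extra text; a small helper like the one below (illustrative, not part of the WIM pipeline) makes downstream parsing more robust:
+
+ ```python
+ import json
+ import re
+
+ def extract_schema_types(response: str) -> list:
+     """Return the first JSON array found in a model response, parsed into Python objects."""
+     match = re.search(r"\[.*\]", response, flags=re.DOTALL)
+     if match is None:
+         raise ValueError("No JSON array found in model output")
+     return json.loads(match.group(0))
+
+ # Example: extract_schema_types(response) -> [{"name": "Amsterdam", "schema_type": "City", ...}]
+ ```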
+
+ ## Dataset Information
+
+ The model was trained on the [UWV/wim-instruct-wiki-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-wiki-to-jsonld-agent-steps) dataset, which contains:
+
+ - **Source**: Entity descriptions from N1 processing of Dutch Wikipedia
+ - **Processing**: Multi-agent pipeline converting text to JSON-LD
+ - **N2 Examples**: 104,684 schema selection tasks (largest subset)
+ - **Average Token Length**: 663 tokens (very short sequences)
+ - **Max Token Length**: 7,488 tokens
+ - **Format**: ChatML-formatted instruction-following examples
+ - **Task**: Select appropriate Schema.org types for entities
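+
+ A quick way to inspect the data with the `datasets` library (the split name below is an assumption; check the dataset card for the actual splits and columns):
+
+ ```python
+ from datasets import load_dataset
+
+ # Assumes a "train" split; adjust to whatever splits the dataset card lists.
+ ds = load_dataset("UWV/wim-instruct-wiki-to-jsonld-agent-steps", split="train")
+ print(ds)      # features and number of rows
+ print(ds[0])   # one ChatML-formatted example
+ ```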
+
+ ## Training Results
+
+ The model completed 1.56 epochs through the large dataset:
+ - **Final Training Loss**: 0.9303
+ - **Training Efficiency**: 2.684 samples/second
+
+ ### Loss Progression
+ - Started at ~0.77 loss
+ - Stable training with gradual improvement
+ - Learning rate: Cosine decay to 2e-12
+ - Gradient norms: Stable around 0.5-0.7
+
+ ## Model Versions
+
+ - **Merged Model**: `UWV/wim-n2-phi4-mini-merged` (7.17 GB)
+   - Ready to use without adapter loading
+   - Recommended for production inference
+   - Successfully merged (no Phi-4 issues)
+
+ - **LoRA Adapter**: `UWV/wim-n2-phi4-mini-adapter` (~1.14 GB)
+   - Requires base Phi-4-mini-instruct model
+   - Useful for further fine-tuning or experiments
+   - Large adapter due to r=512 (same as N1)
 
+ ## Pipeline Context

+ This model is part of the WIM (Wikipedia to Knowledge Graph) pipeline:

+ 1. **N1**: Entity Extraction
+ 2. **N2 (This Model)**: Schema.org Type Selection
+ 3. **N3**: Transform to JSON-LD
+ 4. **N4**: Validation
+ 5. **N5**: Add Human-Readable Labels

+ N2 processes the largest number of examples (104K) but with the shortest sequences, making it highly efficient for batch processing. Despite using a larger LoRA configuration (r=512) than typically needed for this simpler task, the model trained efficiently and merged successfully.

+ ## Performance Characteristics

+ - **Sequence Length**: Average 663 tokens (10x shorter than N1, 60x shorter than N3)
+ - **Batch Processing**: Handles batch sizes of 32+ thanks to the short sequences (see the sketch below)
+ - **Inference Speed**: Very fast due to the short context requirements
+ - **Memory Usage**: ~11 GB VRAM with 8K context
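+
+ A rough batched-inference sketch (batch size, prompts, and generation settings here are illustrative; real N2 inputs should use the full prompt format shown under "How to Use"):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "UWV/wim-n2-phi4-mini-merged"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ tokenizer.padding_side = "left"                     # left-pad for batched decoder-only generation
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+
+ # Abbreviated Dutch prompts ("Select the most fitting Schema.org type for: ...") for illustration only.
+ prompts = [
+     tokenizer.apply_chat_template(
+         [{"role": "user", "content": f"Selecteer het meest passende Schema.org type voor: {name}"}],
+         tokenize=False,
+         add_generation_prompt=True,
+     )
+     for name in ["Amsterdam", "Pedro Nunesplein", "UWV"]
+ ]
+
+ batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
+ with torch.no_grad():
+     out = model.generate(**batch, max_new_tokens=200, do_sample=False)
+ # Strip the prompt tokens before decoding
+ print(tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```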
 
+ ## Citation

+ If you use this model, please cite:

  ```bibtex
+ @misc{wim-n2-phi4-mini,
+     author = {UWV InnovatieHub},
+     title = {Phi-4-mini N2 Schema.org Retrieval Model},
+     year = {2025},
+     publisher = {HuggingFace},
+     url = {https://huggingface.co/UWV/wim-n2-phi4-mini-merged}
  }
  ```