yhavinga committed on
Commit efe6adf · verified · 1 Parent(s): 003e9be

Update README.md

  1. README.md +58 -63
README.md CHANGED
@@ -18,9 +18,10 @@ language:
  - nl
  ---

- # Phi-4-mini N5 Label Addition Fine-tune

- This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) optimized for adding human-readable labels (rdfs:label) to JSON-LD structures, trained as part of the WIM (Text-to-Knowledge Graph) pipeline on the signaalberichten dataset.

  ## Model Details

@@ -43,6 +44,7 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
  - **Steps:** 1,735
  - **Training Metrics:**
    - Final Training Loss: 0.7864
    - Training samples/second: 2.209
    - Learning rate (final): 6.26e-10

@@ -84,18 +86,18 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu

  ### Intended Uses

- - **Label Addition**: Add human-readable Dutch labels (rdfs:label) to JSON-LD structures
- - **Knowledge Graph Enhancement**: Fifth step (N5) in the WIM pipeline
- - **Government Services**: Optimized for citizen complaints and government service descriptions
- - **JSON-LD Enrichment**: Make knowledge graphs more accessible with descriptive labels

  ### Limitations

- - Trained on signaalberichten dataset (different domain than N1-N3)
- - Best performance on government/municipal service contexts
- - Requires well-formed JSON-LD as input
- - Limited to 4K token context (sufficient for label addition)
- - Small training dataset (4,525 examples)

  ## How to Use

@@ -115,35 +117,32 @@ model = AutoModelForCausalLM.from_pretrained(
  )
  tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")

- # Prepare input - JSON-LD without labels (citizen complaint)
- json_ld = {
-     "@context": "https://schema.org",
-     "@type": "Report",
-     "about": {
-         "@type": "CivicStructure",
-         "name": "Speeltuin Vondelpark"
-     },
-     "reportedBy": {
-         "@type": "Person",
-         "address": {
-             "@type": "PostalAddress",
-             "addressLocality": "Amsterdam"
-         }
-     }
- }

  messages = [
      {
          "role": "system",
-         "content": "Je bent een expert in het toevoegen van Nederlandse labels aan JSON-LD."
      },
      {
          "role": "user",
-         "content": f"""Voeg rdfs:label toe aan de volgende JSON-LD:

- {json.dumps(json_ld, ensure_ascii=False, indent=2)}

- Geef de complete JSON-LD terug met labels."""
      }
  ]

@@ -198,27 +197,20 @@ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-adapter")

  ## Expected Output Format

- The model adds `rdfs:label` properties to make JSON-LD more human-readable:

  ```json
  {
-   "@context": "https://schema.org",
-   "@type": "Report",
-   "rdfs:label": "Melding",
-   "about": {
-     "@type": "CivicStructure",
-     "rdfs:label": "Speeltuin Vondelpark",
-     "name": "Speeltuin Vondelpark"
-   },
-   "reportedBy": {
-     "@type": "Person",
-     "rdfs:label": "Melder",
-     "address": {
-       "@type": "PostalAddress",
-       "rdfs:label": "Adres in Amsterdam",
-       "addressLocality": "Amsterdam"
-     }
-   }
  }
  ```

@@ -227,14 +219,14 @@ The model adds `rdfs:label` properties to make JSON-LD more human-readable:
  The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:

  - **Source**: Signaalberichten (citizen complaints to municipalities)
- - **Domain**: Government services and municipal operations
- - **N5 Examples**: 4,525 label addition tasks
  - **Average Token Length**: 1,636 tokens
  - **Max Token Length**: 2,332 tokens
  - **Format**: ChatML-formatted instruction-following examples
- - **Task**: Add Dutch rdfs:label properties to JSON-LD

- **Important**: This dataset is different from the Wikipedia-based dataset used for N1-N3 models.

  ## Training Results

@@ -261,17 +253,20 @@ The model completed 3.1 epochs through the dataset:
  - Large adapter due to r=512
  - Includes all training configurations

- ## Pipeline Context

- This model is part of the WIM (Text-to-Knowledge Graph) pipeline:

- 1. **N1**: Entity Extraction
- 2. **N2**: Schema.org Type Selection
- 3. **N3**: Transform to JSON-LD
- 4. **N4**: Validation
- 5. **N5 (This Model)**: Add Human-Readable Labels

- N5 is trained on a different dataset (signaalberichten) than N1-N3, focusing on government services and citizen interactions rather than encyclopedic content.

  ## Performance Characteristics

@@ -288,9 +283,9 @@ If you use this model, please cite:
  ```bibtex
  @misc{wim-n5-phi4-mini,
    author = {UWV InnovatieHub},
-   title = {Phi-4-mini N5 Label Addition Model},
    year = {2025},
    publisher = {HuggingFace},
    url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
  }
- ```
 
  - nl
  ---


+ # Phi-4-mini N5 Complaint Categorization Fine-tune
+
+ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) optimized for categorizing citizen complaints into predefined topic and experience labels, trained on the signaalberichten dataset.

  ## Model Details

 
  - **Steps:** 1,735
  - **Training Metrics:**
    - Final Training Loss: 0.7864
+   - Final Eval Loss: 0.7796
    - Training samples/second: 2.209
    - Learning rate (final): 6.26e-10

 

  ### Intended Uses

+ - **Complaint Categorization**: Classify citizen complaints into topic and experience categories
+ - **Municipal Service Analysis**: Analyze phone transcripts and written complaints
+ - **Topic Detection**: Identify what the complaint is about (e.g., waste, parking, permits)
+ - **Experience Analysis**: Determine how citizens experience the service (e.g., communication, speed, clarity)

  ### Limitations

+ - Trained on signaalberichten dataset (Dutch municipal complaints)
+ - Fixed label vocabulary (cannot create new labels)
+ - Best performance on complaint/service interaction texts
+ - Limited to 4K token context (sufficient for most complaints)
+ - Specific to Dutch government/municipal contexts

  ## How to Use

 
  )
  tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")

+ # Prepare input - complaint text for categorization
+ complaint_text = """
+ Burger: Nou, waar ik dus over wil klagen is het afval in de buurt.
+ Het is echt niet normaal meer, met al die vuilniszakken die op straat worden gegooid.
+ De containers zijn vaak vol en er komen ook ratten.
+ Ik had al eens gebeld maar er wordt niks aan gedaan!
+ """

  messages = [
      {
          "role": "system",
+         "content": "Jij bent een expert in het toewijzen van labels aan een tekst."
      },
      {
          "role": "user",
+         "content": f"""Analyseer de onderstaande tekst en bepaal welke labels van toepassing zijn.
+
+ **Onderwerp labels** (selecteer wat van toepassing is):
+ Vuil/ongedierte overlast, Bruikbaarheid/beschikbaarheid afvalcontainers,
+ Parkeeroverlast, Vergunningen, etc.

+ **Beleving labels** (selecteer wat van toepassing is):
+ Communicatie, Op de hoogte houden, Statusinformatie, Snelheid van afhandeling, etc.

+ **Tekst om te analyseren**:
+ {complaint_text}"""
      }
  ]

 

  ## Expected Output Format

+ The model outputs a JSON response with categorization results:

  ```json
  {
+   "reasoning": "Omdat de burger klaagt over afval dat op straat wordt gegooid, volle containers en rattenoverlast, zijn de onderwerpen 'Vuil/ongedierte overlast' en 'Bruikbaarheid/beschikbaarheid afvalcontainers' het meest van toepassing. De beleving is negatief: de burger ervaart frustratie over het uitblijven van actie en het gebrek aan terugkoppeling.",
+   "onderwerp_labels": [
+     "Vuil/ongedierte overlast",
+     "Bruikbaarheid/beschikbaarheid afvalcontainers"
+   ],
+   "beleving_labels": [
+     "Op de hoogte houden",
+     "Statusinformatie",
+     "Communicatie"
+   ]
  }
  ```
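
The response shape above can be checked before it is used downstream. The following is a minimal sketch, assuming the model replies with a single JSON object carrying the three keys shown (`reasoning`, `onderwerp_labels`, `beleving_labels`). The helper name `parse_categorization` is hypothetical, and the two allowed-label sets contain only the labels quoted in this README, not the full training vocabulary.

```python
import json

# Hypothetical allowed-label sets: only the labels quoted in this README.
# The full label vocabulary used in training is not listed here.
ONDERWERP_LABELS = {
    "Vuil/ongedierte overlast",
    "Bruikbaarheid/beschikbaarheid afvalcontainers",
    "Parkeeroverlast",
    "Vergunningen",
}
BELEVING_LABELS = {
    "Communicatie",
    "Op de hoogte houden",
    "Statusinformatie",
    "Snelheid van afhandeling",
}


def parse_categorization(raw: str) -> dict:
    """Parse the model's JSON reply and drop labels outside the known sets."""
    # Assumption: the reply may carry extra text around the JSON object,
    # so keep only the span from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])

    def keep_known(key: str, allowed: set) -> list:
        values = data.get(key, [])
        if not isinstance(values, list):
            return []
        # Keep only string labels that belong to the fixed vocabulary.
        return [v for v in values if isinstance(v, str) and v in allowed]

    return {
        "reasoning": str(data.get("reasoning", "")),
        "onderwerp_labels": keep_known("onderwerp_labels", ONDERWERP_LABELS),
        "beleving_labels": keep_known("beleving_labels", BELEVING_LABELS),
    }


# Illustrative reply with one label that is not in the vocabulary.
example = '''Here is the result:
{"reasoning": "Volle containers en ratten.",
 "onderwerp_labels": ["Vuil/ongedierte overlast", "Onbekend label"],
 "beleving_labels": ["Communicatie"]}'''
result = parse_categorization(example)
print(result["onderwerp_labels"])
```

Filtering against a fixed set mirrors the "Fixed label vocabulary" limitation: a label the model was never trained to emit is dropped rather than passed on.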
 
  The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:

  - **Source**: Signaalberichten (citizen complaints to municipalities)
+ - **Domain**: Phone transcripts and written complaints about municipal services
+ - **N5 Examples**: 4,525 complaint categorization tasks
  - **Average Token Length**: 1,636 tokens
  - **Max Token Length**: 2,332 tokens
  - **Format**: ChatML-formatted instruction-following examples
+ - **Task**: Categorize complaints into predefined topic and experience labels

+ **Important**: This is a different task and dataset from the WIM pipeline (N1-N4) which focuses on Wikipedia to JSON-LD conversion.

  ## Training Results

 
  - Large adapter due to r=512
  - Includes all training configurations

+ ## Model Context

+ **Note**: Despite the "n5" naming, this model is NOT part of the WIM (Wikipedia to Knowledge Graph) pipeline that includes N1-N4. This is a separate task focused on complaint categorization.

+ ### WIM Pipeline (Wikipedia to JSON-LD):
+ 1. **N1**: Entity Extraction from Wikipedia text
+ 2. **N2**: Schema.org Type Selection for entities
+ 3. **N3**: Transform to JSON-LD format
+ 4. **N4**: Validation of JSON-LD

+ ### This Model (N5 - Complaint Categorization):
+ - **Task**: Categorize citizen complaints into topic and experience labels
+ - **Dataset**: Signaalberichten (municipal complaints)
+ - **Domain**: Government services and citizen interactions

  ## Performance Characteristics

 
  ```bibtex
  @misc{wim-n5-phi4-mini,
    author = {UWV InnovatieHub},
+   title = {Phi-4-mini N5 Complaint Categorization Model},
    year = {2025},
    publisher = {HuggingFace},
    url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
  }
+ ```