---
language: gsw
license: cc
---

# Swiss German Part-of-Speech Tagging Model

The **swiss_german_pos_model** is a part-of-speech tagging model for Swiss German that predicts [Universal POS tags (upos)](https://universaldependencies.org/u/pos/).

### Training procedure and data sets

1) Base model: the German LM [dbmdz/bert-base-german-cased](https://huggingface.co/dbmdz/bert-base-german-cased)
2) Continued LM training with [swisscrawl data](https://icosys.ch/swisscrawl)
3) Task fine-tuning on the [UD_German-HDT](https://github.com/UniversalDependencies/UD_German-HDT/tree/master) data set with [character-level noise](https://aclanthology.org/2022.findings-acl.321/) (sketched below)
4) Task fine-tuning on the Swiss German [NOAH-Corpus](https://noe-eva.github.io/NOAH-Corpus/) (train + dev split)

- Accuracy on Swiss German NOAH test split: 0.9587
- Accuracy on German UD_German-HDT test set after GSW fine-tuning: 0.9553 (vs. 0.9814 at step 3 before GSW fine-tuning)
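
Step 3 injects random character-level perturbations into the German fine-tuning data so the model becomes more robust to Swiss German spelling variation (see the citation below). A minimal sketch of what such noise injection can look like; the specific edit operations and noise rate here are illustrative assumptions, not the exact training configuration:

```python
import random

def add_char_noise(words, noise_prob=0.15, seed=0):
    """Apply one random character edit (insert, delete, or substitute)
    to a fraction of the input words. Illustrative sketch only."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyzäöü"
    noisy = []
    for word in words:
        if len(word) > 1 and rng.random() < noise_prob:
            i = rng.randrange(len(word))
            op = rng.choice(["insert", "delete", "substitute"])
            if op == "insert":
                word = word[:i] + rng.choice(alphabet) + word[i:]
            elif op == "delete":
                word = word[:i] + word[i + 1:]
            else:  # substitute
                word = word[:i] + rng.choice(alphabet) + word[i + 1:]
        noisy.append(word)
    return noisy

print(add_char_noise("das ist ein einfaches beispiel".split()))
```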

### Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("noeminaepli/swiss_german_pos_model")
tokenizer = AutoTokenizer.from_pretrained("noeminaepli/swiss_german_pos_model")

# The model is a token classifier, so the 'ner' pipeline applies;
# aggregation_strategy="simple" merges subword pieces back into whole words.
pos_tagger = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
tokens = pos_tagger("Worum söu mes ned chönne?")  # roughly: "Why shouldn't one be able to?"
```

Output:

```python
[{'entity_group': 'ADV',
  'score': 0.9627313,
  'word': 'Worum',
  'start': 0,
  'end': 5},
 {'entity_group': 'VERB',
  'score': 0.98772717,
  'word': 'söu',
  'start': 6,
  'end': 9},
 {'entity_group': 'PRON',
  'score': 0.99970305,
  'word': 'mes',
  'start': 10,
  'end': 13},
 {'entity_group': 'PART',
  'score': 0.9999368,
  'word': 'ned',
  'start': 14,
  'end': 17},
 {'entity_group': 'VERB',
  'score': 0.99841064,
  'word': 'chönne',
  'start': 18,
  'end': 24},
 {'entity_group': 'PUNCT',
  'score': 0.9999957,
  'word': '?',
  'start': 24,
  'end': 25}]
```
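
The pipeline returns NER-style dictionaries, so the UPOS tag of each word lives in the `entity_group` field. A small post-processing step yields plain `(word, tag)` pairs:

```python
# Reduce the pipeline output to (word, UPOS tag) pairs.
tags = [(t["word"], t["entity_group"]) for t in tokens]
# [('Worum', 'ADV'), ('söu', 'VERB'), ('mes', 'PRON'),
#  ('ned', 'PART'), ('chönne', 'VERB'), ('?', 'PUNCT')]
```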

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 1
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5.0
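
For reference, these settings correspond to a `transformers` `TrainingArguments` configuration along the following lines (a reconstruction from the list above; the original training script is not part of this repository, and the output path is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="swiss_german_pos_model",  # placeholder path, not from the original setup
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    num_train_epochs=5.0,
)
```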

### Framework versions

- Transformers 4.25.0.dev0
- Pytorch 1.13.1
- Datasets 2.8.0
- Tokenizers 0.13.2


### Citation

```bibtex
@inproceedings{aepli-sennrich-2022-improving,
    title = "Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise",
    author = {Aepli, No{\"e}mi  and
      Sennrich, Rico},
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-acl.321",
    doi = "10.18653/v1/2022.findings-acl.321",
    pages = "4074--4083",
    abstract = "Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should be facilitated by their similarity. However, current approaches that operate in the embedding space do not take surface similarity into account. This work presents a simple yet effective strategy to improve cross-lingual transfer between closely related varieties. We propose to augment the data of the high-resource source language with character-level noise to make the model more robust towards spelling variations. Our strategy shows consistent improvements over several languages and tasks: Zero-shot transfer of POS tagging and topic identification between language varieties from the Finnic, West and North Germanic, and Western Romance language branches. Our work provides evidence for the usefulness of simple surface-level noise in improving transfer between language varieties.",
}
```