docs: mention tokenizer fixes
README.md
CHANGED
@@ -18,6 +18,17 @@ pipeline_tag: fill-mask
This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

## Tokenizer Fix

This repository is a fork of the original [Ettin 400M](jhu-clsp/ettin-encoder-400m).

Due to suboptimal performance on token classification tasks - as reported [here](https://github.com/AnswerDotAI/ModernBERT/issues/149) - the following tokenizer fixes are applied:

* `add_prefix_space` is set to `True`
* `tokenizer_class` is set to `RobertaTokenizerFast`

More experiments on token classification tasks can be found in my [ModernBERT NER repo](https://github.com/stefan-it/modern-bert-ner).

## Table of Contents

- [Performance Highlights](#performance-highlights)
- [Quick Start](#quick-start)
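The two fixes amount to a small patch of the tokenizer configuration. Below is a minimal sketch of applying them to a `tokenizer_config.json`-style dict; the starting values are illustrative stand-ins, not copied from this repository:

```python
import json

# Illustrative starting config; a real tokenizer_config.json has many more fields.
config = {
    "tokenizer_class": "PreTrainedTokenizerFast",
    "model_max_length": 8192,
}

# Apply the two fixes described above.
config["add_prefix_space"] = True
config["tokenizer_class"] = "RobertaTokenizerFast"

print(json.dumps(config, indent=2, sort_keys=True))
```

With `add_prefix_space=True`, the BPE tokenizer encodes a word at the start of a sequence the same way as one in the middle of a sentence, which keeps sub-token boundaries consistent for pre-tokenized token classification input.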