stefan-it committed · Commit 941e3f7 · verified · 1 Parent(s): 82d97b3

docs: mention tokenizer fixes

Files changed (1): README.md +11 -0
README.md CHANGED
@@ -18,6 +18,17 @@ pipeline_tag: fill-mask
  This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

+ ## Tokenizer Fix
+
+ This repository is a fork of the original [Ettin 400M](jhu-clsp/ettin-encoder-400m).
+
+ Due to suboptimal performance on token classification tasks - as reported [here](https://github.com/AnswerDotAI/ModernBERT/issues/149) - the following tokenizer fixes are applied:
+
+ * `add_prefix_space` is set to `True`
+ * `tokenizer_class` is set to `RobertaTokenizerFast`
+
+ More experiments on token classification tasks can be found in my [ModernBERT NER repo](https://github.com/stefan-it/modern-bert-ner).
+
  ## Table of Contents
  - [Performance Highlights](#performance-highlights)
  - [Quick Start](#quick-start)
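The `add_prefix_space` setting matters because byte-level BPE tokenizers encode a word differently depending on whether it is preceded by a space. A minimal sketch of the effect using the standalone `tokenizers` library (illustrative only, not taken from this repository; assumes `tokenizers` is installed):

```python
from tokenizers.pre_tokenizers import ByteLevel

# Byte-level pre-tokenization marks a leading space with "Ġ". Without a
# prefix space, the first word of an input yields a different token than
# the same word appearing mid-sentence, which hurts token classification
# on pre-split words.
with_prefix = ByteLevel(add_prefix_space=True).pre_tokenize_str("Hello world")
without_prefix = ByteLevel(add_prefix_space=False).pre_tokenize_str("Hello world")

print(with_prefix[0][0])     # first word, with the space marker
print(without_prefix[0][0])  # first word, without it
```

With `add_prefix_space=True` the first word is tokenized consistently with words in the middle of a sentence, which is the behavior the `RobertaTokenizerFast` class expects for pre-tokenized inputs.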