stefan-it committed · Commit 941e3f7 · verified · 1 Parent(s): 82d97b3

docs: mention tokenizer fixes

Files changed (1): README.md +11 -0
README.md CHANGED
@@ -18,6 +18,17 @@ pipeline_tag: fill-mask
  This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

+ ## Tokenizer Fix
+
+ This repository is a fork of the original [Ettin 400M](jhu-clsp/ettin-encoder-400m).
+
+ Due to suboptimal performance on token classification tasks - as reported [here](https://github.com/AnswerDotAI/ModernBERT/issues/149) - the following tokenizer fixes are applied:
+
+ * `add_prefix_space` is set to `True`
+ * `tokenizer_class` is set to `RobertaTokenizerFast`
+
+ More experiments on token classification tasks can be found in my [ModernBERT NER repo](https://github.com/stefan-it/modern-bert-ner).
+
  ## Table of Contents
  - [Performance Highlights](#performance-highlights)
  - [Quick Start](#quick-start)
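The `add_prefix_space` setting matters because byte-level BPE tokenizers encode a word differently depending on whether it is preceded by a space. A minimal sketch of the effect using the standalone `tokenizers` library (illustrative only, not taken from this repository; assumes `tokenizers` is installed):

```python
from tokenizers.pre_tokenizers import ByteLevel

# Byte-level pre-tokenization marks a leading space with "Ġ". Without a
# prefix space, the first word of an input yields a different token than
# the same word appearing mid-sentence, which hurts token classification
# on pre-split words.
with_prefix = ByteLevel(add_prefix_space=True).pre_tokenize_str("Hello world")
without_prefix = ByteLevel(add_prefix_space=False).pre_tokenize_str("Hello world")

print(with_prefix[0][0])     # first word, with the space marker
print(without_prefix[0][0])  # first word, without it
```

With `add_prefix_space=True` the first word is tokenized consistently with words in the middle of a sentence, which is the behavior the `RobertaTokenizerFast` class expects for pre-tokenized inputs.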