Update metadata and improve model card for Ettin decoder model (#2)
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED
---
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
datasets:
- jhu-clsp/ettin-pretraining-data
- jhu-clsp/ettin-extension-data
- jhu-clsp/ettin-decay-data
tags:
- ettin
- decoder
---

# Ettin: an Open Suite of Paired Encoders and Decoders

[License: MIT](https://opensource.org/licenses/MIT)
[Paper](https://arxiv.org/abs/2507.11412)
[Models](https://huggingface.co/jhu-clsp)
[Data](https://huggingface.co/datasets/jhu-clsp)
[GitHub](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

> 🎯 **TL;DR**: State-of-the-art paired encoder and decoder models (17M-1B params) trained identically for fair comparison with open data. Encoders beat ModernBERT. Decoders beat Llama 3.2/SmolLM2.

📄 [Paper](https://arxiv.org/abs/2507.11412) | 🤗 [Model Collection](https://huggingface.co/jhu-clsp) | 📊 [Training Data](https://huggingface.co/datasets/jhu-clsp)

This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

## Table of Contents
- [📊 Performance Highlights](#-performance-highlights)
- [🚀 Quick Start](#-quick-start)
- [Model Description](#model-description)
- [Training Data](#training-data)
- [🤗 Model Family](#-model-family)
  - [Encoder Models](#encoder-models)
  - [Decoder Models](#decoder-models)
  - [Cross-Objective Models](#cross-objective-models)
  - [Accessing Training Checkpoints](#accessing-training-checkpoints)
- [🔬 Research Applications](#-research-applications)
- [Training Details](#training-details)
- [Model Architecture](#model-architecture)
- [Usage Examples](#usage-examples)
- [Fine-tuning Examples](#fine-tuning-examples)
- [📚 Training and Evaluation](#-training-and-evaluation)
- [❓ FAQ](#-faq)
- [Citation](#citation)
- [License](#license)

## 📊 Performance Highlights

[...]
Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

1. **Identical training data** - Same high-quality mixture across all models
2. **Open training data** - All data is released, including the batch-level training order for each of the 250+ checkpoints
3. **Matched architectures** - Differing only in attention patterns (bidirectional vs. causal) and training objectives (MLM vs. CLM)
4. **Consistent training recipe** - Three-phase training with 2T tokens
5. **Multiple scales** - From 17M to 1B parameters

This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
## Training Data

The training data is publicly available and split across different phases:

- **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
- **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
- **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
- **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: `input_ids`, `step`)
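Each of these splits can be streamed with the Hugging Face `datasets` library rather than downloaded in full. The snippet below is a minimal sketch, assuming a standard `train` split; inspect the returned columns rather than relying on any particular schema:

```python
from datasets import load_dataset

# Stream a few records from the decay-phase mixture (assumes a "train" split exists)
ds = load_dataset("jhu-clsp/ettin-decay-data", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example.keys())  # inspect the actual columns instead of assuming them
    if i >= 2:
        break
```
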
## 🤗 Model Family

### Encoder Models

[...]
#### HuggingFace Format Checkpoints
Each model repository contains multiple tagged versions representing different training stages:

- **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# ...
```

[...]
Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

- **Identical Training Data**: Same 2T token mixture across all models
- **Matched Architectures**: Only attention patterns and objectives differ
- **Open Everything**: Training data, model weights, and batch-level training order
- **Multiple Scales**: Fair comparison from 17M to 1B parameters
- **250+ Checkpoints**: Complete training trajectory analysis

### Use Cases for Researchers

- **Architecture Studies**: Compare encoder vs. decoder capabilities fairly
- **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering (see the loading sketch below)
- **Scaling Laws**: Study how architectural advantages change with scale
- **Transfer Learning**: Investigate cross-objective training effectiveness
- **Replication Studies**: First open replication of the ModernBERT training recipe
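For training-dynamics work, checkpoints can be pulled by their revision tags (the `step{number}`, `ext{number}`, and `decay{number}` tags described above). This is a minimal sketch; the tags listed are illustrative, and each revision is downloaded separately:

```python
from transformers import AutoModelForCausalLM

# Illustrative revision tags; see the tagged versions on the model repo for the full list
for revision in ["step500000", "ext1000", "decay100"]:
    model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m", revision=revision)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{revision}: {n_params / 1e6:.0f}M parameters loaded")  # replace with your own analysis
```
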
### Reproducibility

[...]
**Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

**Training Phases:**
- **Pre-training**: 1.7T tokens with a diverse data mixture
- **Mid-training**: 250B tokens of higher-quality filtered data, with context extension to 8K
- **Decay phase**: 100B tokens from premium data sources

**Key Features:**
- Context length: up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles
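These limits can be sanity-checked for any checkpoint directly from its configuration; a minimal sketch (attribute names can vary by architecture class, hence the `getattr` fallback):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/ettin-decoder-150m")
print("vocab size:", config.vocab_size)
print("max positions:", getattr(config, "max_position_embeddings", "n/a"))
```
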
## Model Architecture

[...]
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |

## Usage Examples

### Encoder: Masked Language Modeling

[...]
```python
eval_dataset = dataset_dict["test"]

# 3. Define a loss function
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)  # Increase mini_batch_size if you have enough VRAM

run_name = f"{model_shortname}-DPR-{lr}"
# 4. (Optional) Specify training arguments
# ...
    per_device_train_batch_size=512,
    per_device_eval_batch_size=512,
    warmup_ratio=0.05,
    fp16=False,  # Set to False if GPU can't handle FP16
    bf16=True,  # Set to True if GPU supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # (Cached)MultipleNegativesRankingLoss benefits from no duplicates
    learning_rate=lr,
    # Optional tracking/debugging parameters:
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    logging_steps=500,
    run_name=run_name,  # Used in `wandb`, `tensorboard`, `neptune`, etc. if installed
)
# ...

if __name__ == "__main__":
    main()
```

</details>

[...]
```python
# ...
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    run_name=run_name,
    logging_steps=10,
    learning_rate=lr,
# ...
```

[...]
```python
# ...
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=1000,
    # ...
    save_steps=1000,
    save_total_limit=2,
    logging_steps=200,
    run_name=run_name,  # Will be used in W&B if `wandb` is installed
)

# 6. (Optional) Create an evaluator & evaluate the base model
# ...
```

[...]
```python
train_batch_size = 64
num_epochs = 1
num_hard_negatives = 5  # How many hard negatives should be mined for each question-answer pair

# 1a. Load a model to finetune with 1b. (Optional) model card data
model = CrossEncoder(
# ...
hard_train_dataset = mine_hard_negatives(
    train_dataset,
    embedding_model,
    num_negatives=num_hard_negatives,  # How many negatives per question-answer pair
    margin=0,  # Similarity between query and negative samples should be x lower than query-positive similarity
    range_min=0,  # Skip the x most similar samples
    range_max=100,  # Consider only the x most similar samples
    sampling_strategy="top",  # Sample the top negatives from the range
    batch_size=4096,  # Use a batch size of 4096 for the embedding model
    output_format="labeled-pair",  # The output format is (query, passage, label), as required by BinaryCrossEntropyLoss
    use_faiss=True,
)
logging.info(hard_train_dataset)
# ...
hard_eval_dataset = mine_hard_negatives(
    eval_dataset,
    embedding_model,
    corpus=full_dataset["answer"],  # Use the full dataset as the corpus
    num_negatives=30,  # How many documents to rerank
    batch_size=4096,
    include_positives=True,
    output_format="n-tuple",
# ...
    per_device_eval_batch_size=train_batch_size,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    metric_for_best_model="eval_gooaq-dev_ndcg@10",
# ...
    save_total_limit=2,
    logging_steps=200,
    logging_first_step=True,
    run_name=run_name,  # Will be used in W&B if `wandb` is installed
    seed=12,
)
# ...
    model.push_to_hub(run_name)
except Exception:
    logging.error(
        f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
        f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
        f"and saving it using `model.push_to_hub('{run_name}')`."
    )
```

[...]
```python
# ...
if config.architectures and any(arch in valid_image_text_architectures for arch in config.architectures):
    from transformers import AutoModelForImageTextToText

    model_kwargs.pop("use_cache", None)  # Image models do not support cache
    model = AutoModelForImageTextToText.from_pretrained(model_args.model_name_or_path, **model_kwargs)
else:
    model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)
# ...
```

</details>

## 📚 Training and Evaluation

### Pre-training
For details on model pre-training, data preparation, and training recipes:
- **📋 [Pre-training Guide](pretraining/README.md)** - Complete training setup, data mixture, and ModernBERT recipe adaptation

### Evaluation

#### Encoder Evaluation
- **📈 [Encoder on Generative Tasks](docs/encoder-generative-eval.md)** - Evaluating encoders on language modeling tasks using our lm-evaluation-harness fork
- **🔍 [Encoder Retrieval Training](docs/retrieval.md)** - Fine-tuning on MS MARCO and evaluation on MTEB v2 English
- **🎯 [GLUE Evaluation](glue_evaluation/README.md)** - Comprehensive GLUE benchmark evaluation with fine-tuning scripts

#### Decoder Evaluation
- **🎯 [Decoder on Generative Tasks](docs/decoder-eval.md)** - Using the EleutherAI evaluation harness (commit `867413f8677f00f6a817262727cbb041bf36192a`) for comprehensive generative task evaluation

#### Bias Evaluation
- **⚖️ [Gender Bias Evaluation](bias_eval/README.md)** - Comprehensive gender bias testing using Winogender dataset "gotcha" examples. Tests how well models handle counter-stereotypical pronouns in occupational contexts. Supports both encoder (MLM) and decoder (perplexity) evaluation methods.

### Quick Decoder Evaluation Example

```bash
# Clone the specific commit of lm-evaluation-harness
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 867413f8677f00f6a817262727cbb041bf36192a
pip install -e .

# Run evaluation on Ettin decoder
lm_eval --model hf \
    --model_args pretrained=jhu-clsp/ettin-decoder-150m \
    --tasks hellaswag,arc_easy,arc_challenge,winogrande \
    --device cuda:0 \
    --batch_size 8
```
## ❓ FAQ

### Model Loading Issues

**Q: I'm getting an error that ModernBERT-decoder isn't found.**
**A:** Make sure you have the latest version of transformers installed:
```bash
# for the latest version until the official pypi release:
pip install git+https://github.com/huggingface/transformers.git
```

**Q: Which model should I choose for my task?**
**A:**
- **Classification/Retrieval/Understanding**: Use encoder models
- **Text Generation/Chat/Completion**: Use decoder models
- **Research on cross-training**: Use cross-objective models
- **Size selection**: Start with 150M for experimentation, scale up to 400M or 1B for production
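For example, the encoder and decoder variants load through the usual `Auto*` classes; this is a minimal sketch, with the 150M size chosen purely for illustration:

```python
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

# Encoder for classification/retrieval/understanding-style work
enc_tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
encoder = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

# Decoder for text generation
dec_tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
decoder = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")
```
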
**Q: How do I access training checkpoints?**
**A:** Each model has multiple git tags for different training stages. Use the `revision` parameter:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m", revision="step500000")
```

**Q: Can I continue training these models?**
**A:** Yes! We provide raw checkpoints in the [jhu-clsp/ettin-checkpoints](https://huggingface.co/datasets/jhu-clsp/ettin-checkpoints) dataset that can be loaded into training frameworks, and the HuggingFace-format weights can be trained further with `transformers` directly, as sketched below.
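The following is a minimal continued-pretraining sketch, not the original training setup; `my_corpus.txt` and every hyperparameter shown are placeholders to adapt:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "jhu-clsp/ettin-decoder-150m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus: swap in your own text data
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ettin-decoder-150m-continued",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=5e-5,
        bf16=True,  # assumes a BF16-capable GPU; drop otherwise
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```
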
**Q: What's the difference between cross-objective models and regular models?**
**A:** Cross-objective models started as one architecture (e.g., decoder) and were continued with a different objective (e.g., MLM). They demonstrate the limitations of cross-training and generally underperform native models.

**Q: How do I reproduce the paper results?**
**A:** See our evaluation guides:
- [Encoder Generative Eval](docs/encoder-generative-eval.md)
- [Retrieval Eval](docs/retrieval.md)
- [GLUE Eval](glue_evaluation/README.md)
- [Decoder Eval](docs/decoder-eval.md)
- [Pre-training](pretraining/README.md)
## Citation

If you use Ettin models in your research, please cite our work:

```
...
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.11412},
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](https://github.com/jhu-clsp/ettin-encoder-vs-decoder/blob/main/LICENSE) file for details.

---

**Contact**: For questions about the models or research, please open an issue or contact the authors.