Text Generation
Transformers
PyTorch
English
modernbert-decoder
ettin
decoder
orionweller and nielsr (HF Staff) committed
Commit ac53c21 · verified · 1 Parent(s): 1072c4a

Update metadata and improve model card for Ettin decoder model (#2)


- Update metadata and improve model card for Ettin decoder model (cacef4824d9d72fecac984c396b457d1a50e31f9)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +160 -66
README.md CHANGED
@@ -1,39 +1,51 @@
1
  ---
2
- license: mit
3
  language:
4
  - en
5
- pipeline_tag: fill-mask
6
  ---
 
7
  # Ettin: an Open Suite of Paired Encoders and Decoders
8
 
9
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
10
  [![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2507.11412)
11
- [![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-12%20Models-blue)](https://huggingface.co/jhu-clsp)
12
  [![Data](https://img.shields.io/badge/🤗%20Training%20Data-2T%20Tokens-green)](https://huggingface.co/datasets/jhu-clsp)
13
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)
14
 
15
  > 🎯 **TL;DR**: State-of-the-art paired encoder and decoder models (17M-1B params) trained identically for fair comparison with open data. Encoders beat ModernBERT. Decoders beat Llama 3.2/SmolLM2.
16
 
17
- 📄 [Paper](https://arxiv.org/abs/2507.11412) | 🚀 [GitHub Repository](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)
18
 
19
  This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.
20
 
21
  ## Table of Contents
22
- - [Performance Highlights](#performance-highlights)
23
- - [Quick Start](#quick-start)
24
  - [Model Description](#model-description)
25
  - [Training Data](#training-data)
26
- - [Model Family](#model-family)
27
  - [Encoder Models](#encoder-models)
28
  - [Decoder Models](#decoder-models)
29
  - [Cross-Objective Models](#cross-objective-models)
30
  - [Accessing Training Checkpoints](#accessing-training-checkpoints)
31
- - [Research Applications](#research-applications)
32
  - [Training Details](#training-details)
33
  - [Model Architecture](#model-architecture)
34
  - [Usage Examples](#usage-examples)
35
  - [Fine-tuning Examples](#fine-tuning-examples)
 
 
36
  - [Citation](#citation)
 
37
 
38
  ## 📊 Performance Highlights
39
 
@@ -82,11 +94,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")
82
 
83
  Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:
84
 
85
- 1. **Identical training data** - Same high-quality mixture across all models
86
- 2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
87
- 3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
88
- 4. **Consistent training recipe** - Three-phase training with 2T tokens
89
- 5. **Multiple scales** - From 17M to 1B parameters
90
 
91
  This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
92
 
@@ -94,12 +106,12 @@ This approach allows for true apples-to-apples comparisons between encoder and d
94
 
95
  The training data is publicly available and split across different phases:
96
 
97
- - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
98
- - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
99
- - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
100
- - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)
101
 
102
- ## Model Family
103
 
104
  ### Encoder Models
105
 
@@ -174,9 +186,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
174
  #### HuggingFace Format Checkpoints
175
  Each model repository contains multiple tagged versions representing different training stages:
176
 
177
- - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
178
- - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
179
- - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
180
 
181
  ```python
182
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -209,19 +221,19 @@ This checkpoint availability enables detailed analysis of training dynamics, los
209
 
210
  Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:
211
 
212
- - **Identical Training Data**: Same 2T token mixture across all models
213
- - **Matched Architectures**: Only attention patterns and objectives differ
214
- - **Open Everything**: Training data, model weights, and batch-level training order
215
- - **Multiple Scales**: Fair comparison from 17M to 1B parameters
216
- - **250+ Checkpoints**: Complete training trajectory analysis
217
 
218
  ### Use Cases for Researchers
219
 
220
- - **Architecture Studies**: Compare encoder vs decoder capabilities fairly
221
- - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
222
- - **Scaling Laws**: Study how architectural advantages change with scale
223
- - **Transfer Learning**: Investigate cross-objective training effectiveness
224
- - **Replication Studies**: First open replication of ModernBERT training recipe
225
 
226
  ### Reproducibility
227
 
@@ -238,14 +250,14 @@ All training artifacts are publicly available:
238
  **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers
239
 
240
  **Training Phases:**
241
- - **Pre-training**: 1.7T tokens with diverse data mixture
242
- - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
243
- - **Decay phase**: 100B tokens with premium data sources
244
 
245
  **Key Features:**
246
- - Context length: Up to 8K tokens
247
- - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
248
- - Deep but efficient architectures following MobileLLM principles
249
 
250
  ## Model Architecture
251
 
@@ -256,8 +268,6 @@ All training artifacts are publicly available:
256
  | Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
257
  | Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |
258
 
259
-
260
-
261
  ## Usage Examples
262
 
263
  ### Encoder: Masked Language Modeling
@@ -376,7 +386,7 @@ def main():
376
  eval_dataset = dataset_dict["test"]
377
 
378
  # 3. Define a loss function
379
- loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16) # Increase mini_batch_size if you have enough VRAM
380
 
381
  run_name = f"{model_shortname}-DPR-{lr}"
382
  # 4. (Optional) Specify training arguments
@@ -388,16 +398,16 @@ def main():
388
  per_device_train_batch_size=512,
389
  per_device_eval_batch_size=512,
390
  warmup_ratio=0.05,
391
- fp16=False, # Set to False if GPU can't handle FP16
392
- bf16=True, # Set to True if GPU supports BF16
393
- batch_sampler=BatchSamplers.NO_DUPLICATES, # (Cached)MultipleNegativesRankingLoss benefits from no duplicates
394
  learning_rate=lr,
395
  # Optional tracking/debugging parameters:
396
  save_strategy="steps",
397
  save_steps=500,
398
  save_total_limit=2,
399
  logging_steps=500,
400
- run_name=run_name, # Used in `wandb`, `tensorboard`, `neptune`, etc. if installed
401
  )
402
 
403
  # 5. (Optional) Create an evaluator & evaluate the base model
@@ -432,6 +442,7 @@ def main():
432
  if __name__ == "__main__":
433
  main()
434
  ```
 
435
  </details>
436
 
437
 
@@ -487,8 +498,8 @@ def main():
487
  output_dir=output_dir,
488
  num_train_epochs=num_train_epochs,
489
  per_device_train_batch_size=batch_size,
490
- fp16=False, # Set to False if you get an error that your GPU can't run on FP16
491
- bf16=True, # Set to True if you have a GPU that supports BF16
492
  run_name=run_name,
493
  logging_steps=10,
494
  learning_rate=lr,
@@ -572,9 +583,9 @@ args = SparseEncoderTrainingArguments(
572
  per_device_eval_batch_size=16,
573
  learning_rate=2e-5,
574
  warmup_ratio=0.1,
575
- fp16=True, # Set to False if you get an error that your GPU can't run on FP16
576
- bf16=False, # Set to True if you have a GPU that supports BF16
577
- batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
578
  # Optional tracking/debugging parameters:
579
  eval_strategy="steps",
580
  eval_steps=1000,
@@ -582,7 +593,7 @@ args = SparseEncoderTrainingArguments(
582
  save_steps=1000,
583
  save_total_limit=2,
584
  logging_steps=200,
585
- run_name=run_name, # Will be used in W&B if `wandb` is installed
586
  )
587
 
588
  # 6. (Optional) Create an evaluator & evaluate the base model
@@ -644,7 +655,7 @@ def main():
644
 
645
  train_batch_size = 64
646
  num_epochs = 1
647
- num_hard_negatives = 5 # How many hard negatives should be mined for each question-answer pair
648
 
649
  # 1a. Load a model to finetune with 1b. (Optional) model card data
650
  model = CrossEncoder(
@@ -671,13 +682,13 @@ def main():
671
  hard_train_dataset = mine_hard_negatives(
672
  train_dataset,
673
  embedding_model,
674
- num_negatives=num_hard_negatives, # How many negatives per question-answer pair
675
- margin=0, # Similarity between query and negative samples should be x lower than query-positive similarity
676
- range_min=0, # Skip the x most similar samples
677
- range_max=100, # Consider only the x most similar samples
678
- sampling_strategy="top", # Sample the top negatives from the range
679
- batch_size=4096, # Use a batch size of 4096 for the embedding model
680
- output_format="labeled-pair", # The output format is (query, passage, label), as required by BinaryCrossEntropyLoss
681
  use_faiss=True,
682
  )
683
  logging.info(hard_train_dataset)
@@ -703,8 +714,8 @@ def main():
703
  hard_eval_dataset = mine_hard_negatives(
704
  eval_dataset,
705
  embedding_model,
706
- corpus=full_dataset["answer"], # Use the full dataset as the corpus
707
- num_negatives=30, # How many documents to rerank
708
  batch_size=4096,
709
  include_positives=True,
710
  output_format="n-tuple",
@@ -743,8 +754,8 @@ def main():
743
  per_device_eval_batch_size=train_batch_size,
744
  learning_rate=2e-5,
745
  warmup_ratio=0.1,
746
- fp16=False, # Set to False if you get an error that your GPU can't run on FP16
747
- bf16=True, # Set to True if you have a GPU that supports BF16
748
  dataloader_num_workers=4,
749
  load_best_model_at_end=True,
750
  metric_for_best_model="eval_gooaq-dev_ndcg@10",
@@ -756,7 +767,7 @@ def main():
756
  save_total_limit=2,
757
  logging_steps=200,
758
  logging_first_step=True,
759
- run_name=run_name, # Will be used in W&B if `wandb` is installed
760
  seed=12,
761
  )
762
 
@@ -783,7 +794,8 @@ def main():
783
  model.push_to_hub(run_name)
784
  except Exception:
785
  logging.error(
786
- f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
 
787
  f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
788
  f"and saving it using `model.push_to_hub('{run_name}')`."
789
  )
@@ -882,7 +894,7 @@ def main(script_args, training_args, model_args):
882
  if config.architectures and any(arch in valid_image_text_architectures for arch in config.architectures):
883
  from transformers import AutoModelForImageTextToText
884
 
885
- model_kwargs.pop("use_cache", None) # Image models do not support cache
886
  model = AutoModelForImageTextToText.from_pretrained(model_args.model_name_or_path, **model_kwargs)
887
  else:
888
  model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)
@@ -942,6 +954,80 @@ if __name__ == "__main__":
942
  ```
943
  </details>
944
 
945
  ## Citation
946
 
947
  If you use Ettin models in your research, please cite our work:
@@ -956,4 +1042,12 @@ If you use Ettin models in your research, please cite our work:
956
  primaryClass={cs.CL},
957
  url={https://arxiv.org/abs/2507.11412},
958
  }
959
- ```
1
  ---
 
2
  language:
3
  - en
4
+ license: mit
5
+ pipeline_tag: text-generation
6
+ library_name: transformers
7
+ datasets:
8
+ - jhu-clsp/ettin-pretraining-data
9
+ - jhu-clsp/ettin-extension-data
10
+ - jhu-clsp/ettin-decay-data
11
+ tags:
12
+ - ettin
13
+ - decoder
14
  ---
15
+
16
  # Ettin: an Open Suite of Paired Encoders and Decoders
17
 
18
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
19
  [![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2507.11412)
20
+ [![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-24%20Models-blue)](https://huggingface.co/jhu-clsp)
21
  [![Data](https://img.shields.io/badge/🤗%20Training%20Data-2T%20Tokens-green)](https://huggingface.co/datasets/jhu-clsp)
22
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)
23
 
24
  > 🎯 **TL;DR**: State-of-the-art paired encoder and decoder models (17M-1B params) trained identically for fair comparison with open data. Encoders beat ModernBERT. Decoders beat Llama 3.2/SmolLM2.
25
 
26
+ 📄 [Paper](https://arxiv.org/abs/2507.11412) | 🤗 [Model Collection](https://huggingface.co/jhu-clsp) | 📊 [Training Data](https://huggingface.co/datasets/jhu-clsp)
27
 
28
  This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.
29
 
30
  ## Table of Contents
31
+ - [📊 Performance Highlights](#-performance-highlights)
32
+ - [🚀 Quick Start](#-quick-start)
33
  - [Model Description](#model-description)
34
  - [Training Data](#training-data)
35
+ - [🤖 Model Family](#-model-family)
36
  - [Encoder Models](#encoder-models)
37
  - [Decoder Models](#decoder-models)
38
  - [Cross-Objective Models](#cross-objective-models)
39
  - [Accessing Training Checkpoints](#accessing-training-checkpoints)
40
+ - [🔬 Research Applications](#-research-applications)
41
  - [Training Details](#training-details)
42
  - [Model Architecture](#model-architecture)
43
  - [Usage Examples](#usage-examples)
44
  - [Fine-tuning Examples](#fine-tuning-examples)
45
+ - [📋 Training and Evaluation](#-training-and-evaluation)
46
+ - [❓ FAQ](#-faq)
47
  - [Citation](#citation)
48
+ - [License](#license)
49
 
50
  ## 📊 Performance Highlights
51
 
 
94
 
95
  Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:
96
 
97
+ 1. **Identical training data** - Same high-quality mixture across all models
98
+ 2. **Open training data** - The full mixture is publicly released, including the batch-level training order for each of the 250+ checkpoints
99
+ 3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
100
+ 4. **Consistent training recipe** - Three-phase training with 2T tokens
101
+ 5. **Multiple scales** - From 17M to 1B parameters
102
 
103
  This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
104
 
 
106
 
107
  The training data is publicly available and split across different phases:
108
 
109
+ - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
110
+ - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
111
+ - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
112
+ - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)
113
 
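The phase datasets can be inspected directly with the `datasets` library. The sketch below is a minimal example and makes a few assumptions (a default `train` split and streaming access); only the `input_ids`/`step` columns of the data-order set are documented above.

```python
# Minimal inspection sketch -- split names are assumptions, not verified here.
from datasets import load_dataset

# Stream the decay-phase data rather than downloading the full dump.
decay = load_dataset("jhu-clsp/ettin-decay-data", split="train", streaming=True)
print(next(iter(decay)))

# The data-order set maps tokenized batches (input_ids) to training steps.
order = load_dataset("jhu-clsp/ettin-data-order", split="train", streaming=True)
row = next(iter(order))
print(row["step"], len(row["input_ids"]))
```
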
114
+ ## 🤖 Model Family
115
 
116
  ### Encoder Models
117
 
 
186
  #### HuggingFace Format Checkpoints
187
  Each model repository contains multiple tagged versions representing different training stages:
188
 
189
+ - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
190
+ - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
191
+ - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)
192
 
193
  ```python
194
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
221
 
222
  Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:
223
 
224
+ - **Identical Training Data**: Same 2T token mixture across all models
225
+ - **Matched Architectures**: Only attention patterns and objectives differ
226
+ - **Open Everything**: Training data, model weights, and batch-level training order
227
+ - **Multiple Scales**: Fair comparison from 17M to 1B parameters
228
+ - **250+ Checkpoints**: Complete training trajectory analysis
229
 
230
  ### Use Cases for Researchers
231
 
232
+ - **Architecture Studies**: Compare encoder vs decoder capabilities fairly
233
+ - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
234
+ - **Scaling Laws**: Study how architectural advantages change with scale
235
+ - **Transfer Learning**: Investigate cross-objective training effectiveness
236
+ - **Replication Studies**: First open replication of ModernBERT training recipe
237
 
238
  ### Reproducibility
239
 
 
250
  **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers
251
 
252
  **Training Phases:**
253
+ - **Pre-training**: 1.7T tokens with diverse data mixture
254
+ - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
255
+ - **Decay phase**: 100B tokens with premium data sources
256
 
257
  **Key Features:**
258
+ - Context length: Up to 8K tokens
259
+ - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
260
+ - Deep but efficient architectures following MobileLLM principles
261
 
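These values can be sanity-checked locally. The snippet below is a quick sketch and assumes the decoder config exposes the standard `max_position_embeddings` field.

```python
from transformers import AutoConfig, AutoTokenizer

repo = "jhu-clsp/ettin-decoder-150m"
config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Context length after the mid-training extension (expected: 8K).
print("max positions:", config.max_position_embeddings)
# ModernBERT tokenizer vocabulary (expected: 50,368).
print("vocab size:", tokenizer.vocab_size)
```
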
262
  ## Model Architecture
263
 
 
268
  | Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
269
  | Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |
270
 
 
 
271
  ## Usage Examples
272
 
273
  ### Encoder: Masked Language Modeling
 
386
  eval_dataset = dataset_dict["test"]
387
 
388
  # 3. Define a loss function
389
+ loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16) # Increase mini_batch_size if you have enough VRAM
390
 
391
  run_name = f"{model_shortname}-DPR-{lr}"
392
  # 4. (Optional) Specify training arguments
 
398
  per_device_train_batch_size=512,
399
  per_device_eval_batch_size=512,
400
  warmup_ratio=0.05,
401
+ fp16=False, # Set to False if GPU can't handle FP16
402
+ bf16=True, # Set to True if GPU supports BF16
403
+ batch_sampler=BatchSamplers.NO_DUPLICATES, # (Cached)MultipleNegativesRankingLoss benefits from no duplicates
404
  learning_rate=lr,
405
  # Optional tracking/debugging parameters:
406
  save_strategy="steps",
407
  save_steps=500,
408
  save_total_limit=2,
409
  logging_steps=500,
410
+ run_name=run_name, # Used in `wandb`, `tensorboard`, `neptune`, etc. if installed
411
  )
412
 
413
  # 5. (Optional) Create an evaluator & evaluate the base model
 
442
  if __name__ == "__main__":
443
  main()
444
  ```
445
+
446
  </details>
447
 
448
 
 
498
  output_dir=output_dir,
499
  num_train_epochs=num_train_epochs,
500
  per_device_train_batch_size=batch_size,
501
+ fp16=False, # Set to False if you get an error that your GPU can't run on FP16
502
+ bf16=True, # Set to True if you have a GPU that supports BF16
503
  run_name=run_name,
504
  logging_steps=10,
505
  learning_rate=lr,
 
583
  per_device_eval_batch_size=16,
584
  learning_rate=2e-5,
585
  warmup_ratio=0.1,
586
+ fp16=True, # Set to False if you get an error that your GPU can't run on FP16
587
+ bf16=False, # Set to True if you have a GPU that supports BF16
588
+ batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
589
  # Optional tracking/debugging parameters:
590
  eval_strategy="steps",
591
  eval_steps=1000,
 
593
  save_steps=1000,
594
  save_total_limit=2,
595
  logging_steps=200,
596
+ run_name=run_name, # Will be used in W&B if `wandb` is installed
597
  )
598
 
599
  # 6. (Optional) Create an evaluator & evaluate the base model
 
655
 
656
  train_batch_size = 64
657
  num_epochs = 1
658
+ num_hard_negatives = 5 # How many hard negatives should be mined for each question-answer pair
659
 
660
  # 1a. Load a model to finetune with 1b. (Optional) model card data
661
  model = CrossEncoder(
 
682
  hard_train_dataset = mine_hard_negatives(
683
  train_dataset,
684
  embedding_model,
685
+ num_negatives=num_hard_negatives, # How many negatives per question-answer pair
686
+ margin=0, # Similarity between query and negative samples should be x lower than query-positive similarity
687
+ range_min=0, # Skip the x most similar samples
688
+ range_max=100, # Consider only the x most similar samples
689
+ sampling_strategy="top", # Sample the top negatives from the range
690
+ batch_size=4096, # Use a batch size of 4096 for the embedding model
691
+ output_format="labeled-pair", # The output format is (query, passage, label), as required by BinaryCrossEntropyLoss
692
  use_faiss=True,
693
  )
694
  logging.info(hard_train_dataset)
 
714
  hard_eval_dataset = mine_hard_negatives(
715
  eval_dataset,
716
  embedding_model,
717
+ corpus=full_dataset["answer"], # Use the full dataset as the corpus
718
+ num_negatives=30, # How many documents to rerank
719
  batch_size=4096,
720
  include_positives=True,
721
  output_format="n-tuple",
 
754
  per_device_eval_batch_size=train_batch_size,
755
  learning_rate=2e-5,
756
  warmup_ratio=0.1,
757
+ fp16=False, # Set to False if you get an error that your GPU can't run on FP16
758
+ bf16=True, # Set to True if you have a GPU that supports BF16
759
  dataloader_num_workers=4,
760
  load_best_model_at_end=True,
761
  metric_for_best_model="eval_gooaq-dev_ndcg@10",
 
767
  save_total_limit=2,
768
  logging_steps=200,
769
  logging_first_step=True,
770
+ run_name=run_name, # Will be used in W&B if `wandb` is installed
771
  seed=12,
772
  )
773
 
 
794
  model.push_to_hub(run_name)
795
  except Exception:
796
  logging.error(
797
+ f"Error uploading model to the Hugging Face Hub:
798
+ {traceback.format_exc()}To upload it manually, you can run "
799
  f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
800
  f"and saving it using `model.push_to_hub('{run_name}')`."
801
  )
 
894
  if config.architectures and any(arch in valid_image_text_architectures for arch in config.architectures):
895
  from transformers import AutoModelForImageTextToText
896
 
897
+ model_kwargs.pop("use_cache", None) # Image models do not support cache
898
  model = AutoModelForImageTextToText.from_pretrained(model_args.model_name_or_path, **model_kwargs)
899
  else:
900
  model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)
 
954
  ```
955
  </details>
956
 
957
+ ## 📋 Training and Evaluation
958
+
959
+ ### Pre-training
960
+ For details on model pre-training, data preparation, and training recipes:
961
+ - **📖 [Pre-training Guide](pretraining/README.md)** - Complete training setup, data mixture, and ModernBERT recipe adaptation
962
+
963
+ ### Evaluation
964
+
965
+ #### Encoder Evaluation
966
+ - **📊 [Encoder on Generative Tasks](docs/encoder-generative-eval.md)** - Evaluating encoders on language modeling tasks using our lm-evaluation-harness fork
967
+ - **🔍 [Encoder Retrieval Training](docs/retrieval.md)** - Fine-tuning on MS MARCO and evaluation on MTEB v2 English
968
+ - **🎯 [GLUE Evaluation](glue_evaluation/README.md)** - Comprehensive GLUE benchmark evaluation with fine-tuning scripts
969
+
970
+ #### Decoder Evaluation
971
+ - **🎯 [Decoder on Generative Tasks](docs/decoder-eval.md)** - Using the EleutherAI evaluation harness (commit `867413f8677f00f6a817262727cbb041bf36192a`) for comprehensive generative task evaluation
972
+
973
+ #### Bias Evaluation
974
+ - **⚖️ [Gender Bias Evaluation](bias_eval/README.md)** - Comprehensive gender bias testing using Winogender dataset gotcha examples. Tests how well models handle counter-stereotypical pronouns in occupational contexts. Supports both encoder (MLM) and decoder (perplexity) evaluation methods.
975
+
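As a rough illustration of the decoder-side (perplexity) method described above, the sketch below compares the model's average negative log-likelihood on a stereotypical versus a counter-stereotypical pronoun. It is not the `bias_eval/` script, and the example sentences are made up for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "jhu-clsp/ettin-decoder-150m"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)
model.eval()

def mean_nll(text: str) -> float:
    # Average per-token negative log-likelihood; lower means more plausible to the model.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

# Hypothetical Winogender-style pair (illustrative only, not from the dataset).
stereotypical = "The nurse said that she would check on the patient soon."
counter_stereotypical = "The nurse said that he would check on the patient soon."
print("stereotypical:", mean_nll(stereotypical))
print("counter-stereotypical:", mean_nll(counter_stereotypical))
```
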
976
+ ### Quick Decoder Evaluation Example
977
+
978
+ ```bash
979
+ # Clone the specific commit of lm-evaluation-harness
980
+ git clone https://github.com/EleutherAI/lm-evaluation-harness.git
981
+ cd lm-evaluation-harness
982
+ git checkout 867413f8677f00f6a817262727cbb041bf36192a
983
+ pip install -e .
984
+
985
+ # Run evaluation on Ettin decoder
986
+ lm_eval --model hf \
987
+ --model_args pretrained=jhu-clsp/ettin-decoder-150m \
988
+ --tasks hellaswag,arc_easy,arc_challenge,winogrande \
989
+ --device cuda:0 \
990
+ --batch_size 8
991
+ ```
992
+
993
+ ## ❓ FAQ
994
+
995
+ ### Model Loading Issues
996
+
997
+ **Q: I'm getting an error that ModernBERT-decoder isn't found.**
998
+ **A:** Make sure you have the latest version of transformers installed:
999
+ ```bash
1000
+ # for the latest version until the official pypi release:
1001
+ pip install git+https://github.com/huggingface/transformers.git
1002
+ ```
1003
+
1004
+ **Q: Which model should I choose for my task?**
1005
+ **A:**
1006
+ - **Classification/Retrieval/Understanding**: Use encoder models
1007
+ - **Text Generation/Chat/Completion**: Use decoder models
1008
+ - **Research on cross-training**: Use cross-objective models
1009
+ - **Size selection**: Start with 150M for experimentation, scale up to 400M or 1B for production
1010
+
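If the task calls for generation, a minimal decoder sketch follows (the prompt and sampling settings are arbitrary, and the 150M size is just the experimentation default suggested above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

inputs = tokenizer("The key difference between encoder and decoder models is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
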
1011
+ **Q: How do I access training checkpoints?**
1012
+ **A:** Each model has multiple git tags for different training stages. Use the `revision` parameter:
1013
+ ```python
1014
+ model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m", revision="step500000")
1015
+ ```
1016
+
1017
+ **Q: Can I continue training these models?**
1018
+ **A:** Yes! We provide raw checkpoints in the [jhu-clsp/ettin-checkpoints](https://huggingface.co/datasets/jhu-clsp/ettin-checkpoints) dataset that can be loaded into training frameworks.
1019
+
1020
+ **Q: What's the difference between cross-objective models and regular models?**
1021
+ **A:** Cross-objective models started as one architecture (e.g., decoder) and were continued with a different objective (e.g., MLM). They demonstrate the limitations of cross-training and generally underperform native models.
1022
+
1023
+ **Q: How do I reproduce the paper results?**
1024
+ **A:** See our evaluation guides:
1025
+ - [Encoder Generative Eval](docs/encoder-generative-eval.md)
1026
+ - [Retrieval Eval](docs/retrieval.md)
1027
+ - [GLUE Eval](glue_evaluation/README.md)
1028
+ - [Decoder Eval](docs/decoder-eval.md)
1029
+ - [Pre-training](pretraining/README.md)
1030
+
1031
  ## Citation
1032
 
1033
  If you use Ettin models in your research, please cite our work:
 
1042
  primaryClass={cs.CL},
1043
  url={https://arxiv.org/abs/2507.11412},
1044
  }
1045
+ ```
1046
+
1047
+ ## License
1048
+
1049
+ This project is licensed under the MIT License - see the [LICENSE](https://github.com/jhu-clsp/ettin-encoder-vs-decoder/blob/main/LICENSE) file for details.
1050
+
1051
+ ---
1052
+
1053
+ **Contact**: For questions about the models or research, please open an issue or contact the authors.