---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- jhu-clsp/ettin-encoder-17m
pipeline_tag: text-classification
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
- stsbenchmark-sts
model-index:
- name: CrossEncoder based on jhu-clsp/ettin-encoder-17m
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts test
      type: sts-test
    metrics:
    - type: pearson_cosine
      value: 0.8413715686076841
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8310895302151975
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev
      type: sts-dev
    metrics:
    - type: pearson_cosine
      value: 0.8815197312565873
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8786002071426082
      name: Spearman Cosine
---

	# EttinX Cross-Encoder: Semantic Similarity (STS)

Cross encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1.
I've found the `cross-encoder/stsb-roberta-large` model very useful for building evaluators for LLM outputs.
Cross encoders are simple to use, fast, and very accurate.

The Ettin series followed up with new encoders built on the ModernBERT architecture in a range of sizes, starting at 17M parameters.
The small parameter count and the computationally efficient interleaved local/global attention layers make this a very fast model:
it can easily process a few hundred sentence pairs per second on CPU, and a few thousand per second on my A6000.
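Throughput depends heavily on hardware, batch size, and sentence length, so it is worth measuring on your own setup. The following is a minimal sketch of such a measurement, not a benchmark used for this card: `pairs_per_second` and its `predict_fn` parameter are hypothetical names, and the default batch size of 32 is an assumption.

```python
import time
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def pairs_per_second(
    predict_fn: Callable[[Sequence[Pair]], Sequence[float]],
    pairs: List[Pair],
    batch_size: int = 32,  # assumed default; tune for your hardware
) -> float:
    """Time predict_fn over all pairs, batch by batch, and return pairs scored per second."""
    start = time.perf_counter()
    scored = 0
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i : i + batch_size]
        scored += len(predict_fn(batch))
    elapsed = time.perf_counter() - start
    return scored / elapsed if elapsed > 0 else float("inf")


# With the real model (not executed here), pass the model's predict method:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("dleemiller/EttinX-sts-xxs")
#   rate = pairs_per_second(model.predict, pairs)
```

Running one untimed warm-up call first usually gives a more representative figure, since the first batch often includes one-off setup cost.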

	---

	## Features
- High performing: Achieves a Pearson correlation of 0.8414 and a Spearman correlation of 0.8311 on the STS-Benchmark test set.
- Efficient architecture: Based on the Ettin-encoder design (17M parameters), offering very fast inference speeds.
- Extended context length: Processes sequences of up to 8192 tokens, which suits evaluation of LLM outputs.
	- Diversified training: Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

	---

	## Performance


| Model                     | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed     |
|---------------------------|--------------------|---------------------|----------------|------------|-----------|
| `ModernCE-large-sts`      | 0.9256             | 0.9215              | 8192           | 395M       | Medium    |
| `ModernCE-base-sts`       | 0.9162             | 0.9122              | 8192           | 149M       | Fast      |
| `stsb-roberta-large`      | 0.9147             | -                   | 512            | 355M       | Slow      |
| `EttinX-sts-m`            | 0.9143             | 0.9102              | 8192           | 149M       | Fast      |
| `EttinX-sts-s`            | 0.9004             | 0.8926              | 8192           | 68M        | Very Fast |
| `stsb-distilroberta-base` | 0.8792             | -                   | 512            | 82M        | Fast      |
| `EttinX-sts-xs`           | 0.8763             | 0.8689              | 8192           | 32M        | Very Fast |
| `EttinX-sts-xxs`          | 0.8414             | 0.8311              | 8192           | 17M        | Very Fast |


	---

	## Usage

	To use EttinX for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:

```python
from sentence_transformers import CrossEncoder

# Load the EttinX model from the Hugging Face Hub
model = CrossEncoder("dleemiller/EttinX-sts-xxs")

# Each item is a (text_a, text_b) pair to compare
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]

# predict() returns one similarity score per pair
scores = model.predict(sentence_pairs)

print(scores)  # Prints a float32 NumPy array, e.g. [0.9184 0.0123]
```

	### Output
	The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.
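Because scores fall in `[0, 1]`, one way to build a pass/fail evaluator for LLM outputs is to score each output against a reference answer and apply a cutoff. The sketch below is illustrative only: `grade_outputs`, its `predict_fn` parameter, and the 0.7 threshold are assumptions, not part of this model or of `sentence-transformers`.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def grade_outputs(
    predict_fn: Callable[[Sequence[Pair]], Sequence[float]],
    outputs: List[str],
    references: List[str],
    threshold: float = 0.7,  # assumed cutoff; tune it on your own labelled data
) -> List[Tuple[float, bool]]:
    """Score each (output, reference) pair and flag scores at or above threshold."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    pairs = list(zip(outputs, references))
    scores = predict_fn(pairs)
    return [(float(s), float(s) >= threshold) for s in scores]


# With the real model (not executed here):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("dleemiller/EttinX-sts-xxs")
#   results = grade_outputs(model.predict, llm_outputs, reference_answers)
```

A fixed threshold is a design choice rather than a property of the model. Checking a handful of known-good and known-bad pairs first helps pick a sensible value.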

	---

	## Training Details

	### Pretraining
	The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
- Classifier Dropout: a relatively large classifier dropout of 0.3, to reduce over-reliance on teacher scores.
- Objective: match the STS-B similarity scores produced by the teacher model `cross-encoder/stsb-roberta-large`.
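To illustrate what training against teacher scores can look like, here is a minimal sketch of a soft-label loss. The actual loss used for this model is not stated in this card, so binary cross-entropy on a single logit is an assumption, and `soft_label_bce` is a hypothetical helper name. Classifier dropout is not modelled here.

```python
import math
from typing import Sequence


def soft_label_bce(logits: Sequence[float], teacher_scores: Sequence[float]) -> float:
    """Mean binary cross-entropy between student logits and teacher scores in [0, 1]."""
    if len(logits) != len(teacher_scores) or len(logits) == 0:
        raise ValueError("logits and teacher_scores must be non-empty and equal in length")
    total = 0.0
    for z, t in zip(logits, teacher_scores):
        # Numerically stable form of
        # -(t * log(sigmoid(z)) + (1 - t) * log(1 - sigmoid(z))).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

With soft targets the loss is smallest when the student's sigmoid output equals the teacher score, so the student is pulled toward the teacher's full score distribution rather than a hard 0/1 label.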

	### Fine-Tuning
	Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.

### Evaluation Results
	The model achieved the following test set performance after fine-tuning:
	- Pearson Correlation: 0.8414
	- Spearman Correlation: 0.8311
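For readers who want to reproduce these metrics on their own predictions, the following is a dependency-free sketch of both correlations. It is not the evaluation code used for this card. The function names are illustrative, and `scipy.stats.pearsonr` and `scipy.stats.spearmanr` are the usual choice in practice.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson measures how linearly the predicted scores track the gold scores, while Spearman only checks that they are ordered the same way, which is why the two numbers can differ.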

	---

	## Model Card

	- Architecture: Ettin-encoder-17m
	- Tokenizer: Custom tokenizer trained with modern techniques for long-context handling.
	- Pretraining Data: `dleemiller/wiki-sim (pair-score-sampled)`
	- Fine-Tuning Data: `sentence-transformers/stsb`

	---

	## Thank You

Thanks to the Johns Hopkins team for providing the Ettin models, and the Sentence Transformers team for their leadership in transformer encoder models.

	---

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{ettinxstsb2025,
	author = {Miller, D. Lee},
	title = {EttinX STS: An STS cross encoder model},
	year = {2025},
	publisher = {Hugging Face Hub},
	url = {https://huggingface.co/dleemiller/EttinX-sts-xxs},
	}
	```

	---

	## License

	This model is licensed under the [MIT License](LICENSE).