Sentence Transformers

university

https://www.sbert.net

AI & ML interests

This organization hosts models tuned for generating sentence and text embeddings. They can be used with the sentence-transformers package.

Recent Activity

tomaarsen new activity 7 days ago

sentence-transformers/static-similarity-mrl-multilingual-v1:dim is 1024？

tomaarsen new activity 19 days ago

sentence-transformers/all-MiniLM-L6-v2:Updated feature-extraction API URL

tomaarsen new activity 19 days ago

sentence-transformers/all-MiniLM-L6-v2:Why pipeline/feature-extraction API does not work?


sentence-transformers's activity

tomaarsen

in sentence-transformers/static-similarity-mrl-multilingual-v1 7 days ago

dim is 1024？

#5 opened 7 days ago by

chaochaoli

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 19 days ago

Updated feature-extraction API URL

🤗 👍 3

#116 opened 19 days ago by

tomaarsen

Why pipeline/feature-extraction API does not work?

➕ 1

#115 opened 19 days ago by

gkwan-guides-3

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 about 1 month ago

Error 422

#111 opened about 1 month ago by

SparkFounders

tomaarsen

posted an update about 2 months ago

Post

3417

I just released Sentence Transformers v4.1, featuring ONNX and OpenVINO backends for rerankers that offer 2-3x speedups, and improved hard negatives mining, which helps prepare stronger training datasets. Details:

🏎️ ONNX, OpenVINO, Optimization, Quantization
- I've added ONNX and OpenVINO support with just one extra argument: "backend" when loading the CrossEncoder reranker, e.g.: CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
- The export_optimized_onnx_model, export_dynamic_quantized_onnx_model, and export_static_quantized_openvino_model functions now work with CrossEncoder rerankers, allowing you to optimize (e.g. fusions, gelu approximations, etc.) or quantize (int8 weights) rerankers.
- I've uploaded ~340 ONNX & OpenVINO models for all existing models under the cross-encoder Hugging Face organization. You can use these without having to export when loading.
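
As a rough sketch of the backend argument described above, the snippet below scores query-document pairs with an ONNX-backed reranker. The helper names `build_pairs` and `score_with_backend` are made up for illustration, and the sketch assumes the package is installed with its ONNX extras and that the model can be downloaded.

```python
def build_pairs(query, documents):
    """Pair the query with each candidate document (hypothetical helper)."""
    return [(query, doc) for doc in documents]


def score_with_backend(query, documents, backend="onnx"):
    """Score (query, document) pairs with a CrossEncoder on the given backend.

    Assumption: sentence-transformers v4.1+ is installed together with the
    extras needed for the chosen backend.
    """
    # Imported lazily so this file can be loaded without the package installed.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend=backend)
    # predict() returns one relevance score per (query, document) pair.
    return model.predict(build_pairs(query, documents))


# Example call (downloads the model on first use):
# scores = score_with_backend(
#     "How many people live in Berlin?",
#     ["Berlin has about 3.7 million inhabitants.", "Berlin is known for its museums."],
# )
```

Swapping `backend="onnx"` for `backend="openvino"` should follow the same pattern, provided the matching extras are installed.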

⛏ Improved Hard Negatives Mining
- Added 'absolute_margin' and 'relative_margin' arguments to mine_hard_negatives.
- absolute_margin ensures that sim(query, negative) < sim(query, positive) - absolute_margin, i.e. an absolute margin between the negative & positive similarities.
- relative_margin ensures that sim(query, negative) < sim(query, positive) * (1 - relative_margin), i.e. a relative margin between the negative & positive similarities.
- Inspired by the excellent NV-Retriever paper from NVIDIA.
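
The two margin conditions above can be illustrated in plain Python. This is only a sketch of the inequalities, not the library's implementation; `passes_margins` and `filter_negatives` are hypothetical helper names, and the similarity values are made up.

```python
def passes_margins(pos_sim, neg_sim, absolute_margin=None, relative_margin=None):
    """Return True if a candidate negative satisfies the configured margins."""
    if absolute_margin is not None and not neg_sim < pos_sim - absolute_margin:
        return False
    if relative_margin is not None and not neg_sim < pos_sim * (1 - relative_margin):
        return False
    return True


def filter_negatives(pos_sim, candidates, absolute_margin=None, relative_margin=None):
    """Keep only the (text, similarity) candidates that pass the margins."""
    return [
        (text, sim)
        for text, sim in candidates
        if passes_margins(pos_sim, sim, absolute_margin, relative_margin)
    ]


# Made-up similarities: the positive scores 0.8 against the query.
candidates = [("near duplicate", 0.79), ("related", 0.70), ("unrelated", 0.50)]

# Absolute margin 0.05: negatives must score below 0.8 - 0.05 = 0.75.
kept_absolute = filter_negatives(0.8, candidates, absolute_margin=0.05)

# Relative margin 0.3: negatives must score below 0.8 * (1 - 0.3) = 0.56.
kept_relative = filter_negatives(0.8, candidates, relative_margin=0.3)
```

Both margins discard candidates that score too close to the positive, since those are more likely to be unlabeled positives than true negatives.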

And several other small improvements. Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v4.1.0

With this release, I introduce near-feature parity between the SentenceTransformer embedding & CrossEncoder reranker models, which I've wanted to do for quite some time! With rerankers very strongly supported now, it's time to look forward to other useful architectures!

tomaarsen

in sentence-transformers/stsb about 2 months ago

Unable to download the Dataset

#1 opened about 2 months ago by

ajaykrishna2222

pcuenq

authored a paper about 2 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 186

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 about 2 months ago

Error using model in python: NameError: name 'init_empty_weights' is not defined

👀 1

#108 opened about 2 months ago by

Shira1234

tomaarsen

updated a collection about 2 months ago

Embedding Model Datasets

Collection

A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 70 items • Updated Apr 7 • 127

tomaarsen

updated a dataset 2 months ago

sentence-transformers/msmarco

Viewer • Updated Mar 31 • 527M • 537 • 4

tomaarsen

posted an update 2 months ago

Post

2590

‼️Sentence Transformers v4.0 is out! You can now train and finetune reranker models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I also prove that finetuning on your domain helps much more than you might think.

1️⃣ Reranker Training Refactor
Reranker models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!

Read my detailed blogpost to learn about the components that make up this new training approach: https://huggingface.co/blog/train-reranker
Notably, the release is fully backwards compatible: all deprecations are soft, meaning that they still work but emit a warning informing you how to upgrade.

2️⃣ New Reranker Losses
- 11 new losses:
  - 2 traditional losses: BinaryCrossEntropy and CrossEntropy
  - 2 distillation losses: MSE and MarginMSE
  - 2 in-batch negatives losses: MNRL (a.k.a. InfoNCE) and CMNRL
  - 5 learning to rank losses: Lambda, p-ListMLE, ListNet, RankNet, ListMLE
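
To give a feel for one of the distillation losses listed above, here is a plain-Python sketch of the general MarginMSE idea: the student is trained so that its score gap between a positive and a negative matches the teacher's gap. The function name `margin_mse` and the numbers are illustrative assumptions; the library's loss works on model outputs and tensors rather than on lists.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins."""
    errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(errors) / len(errors)


# Two made-up (query, positive, negative) triplets.
# Student margins: 1.0 and 1.0; teacher margins: 2.0 and 1.0.
loss = margin_mse(
    student_pos=[2.0, 1.0],
    student_neg=[1.0, 0.0],
    teacher_pos=[3.0, 1.0],
    teacher_neg=[1.0, 0.0],
)
```

Only the gap between positive and negative scores matters here, so the student is free to use a different score scale than the teacher.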

3️⃣ New Reranker Documentation
- New Training Overview, Loss Overview, API Reference docs
- 5 new, 1 refactored training examples docs pages
- 13 new, 6 refactored training scripts
- Migration guides (2.x -> 3.x, 3.x -> 4.x)

4️⃣ Blogpost
Alongside the release, I've written a blogpost where I finetune ModernBERT on a generic question-answer dataset. My finetunes easily outperform all general-purpose reranker models, even models 4x as big. Finetuning on your domain is definitely worth it: https://huggingface.co/blog/train-reranker

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v4.0.1

tomaarsen

in sentence-transformers/distiluse-base-multilingual-cased-v2 3 months ago

Update README.md

👍 1

#11 opened 3 months ago by

FischerWells

tomaarsen

posted an update 3 months ago

Post

6772

An assembly of 18 European companies, labs, and universities has banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder for 15 languages (European and widely spoken global languages), designed to be finetuned for retrieval, classification, etc.

🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper covering the full setup, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.
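
The difference between causal and bi-directional attention mentioned above comes down to which positions each token may attend to. The sketch below builds both masks in plain Python; it illustrates the general concept only, not EuroBERT's actual implementation, and ignores padding.

```python
def causal_mask(n):
    """Decoder-style mask: token i may attend only to tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every other token."""
    return [[1] * n for _ in range(n)]


for row in causal_mask(3):
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]

for row in bidirectional_mask(3):
    print(row)
# [1, 1, 1]
# [1, 1, 1]
# [1, 1, 1]
```

With the full mask, each token's representation can draw on context from both sides, which is what retrieval and classification models typically want.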

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* EuroBERT/EuroBERT-210m
* EuroBERT/EuroBERT-610m
* EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!

1 reply

tomaarsen

updated a dataset 3 months ago

sentence-transformers/msmarco-msmarco-MiniLM-L6-v3

Viewer • Updated Mar 6 • 80.6M • 588 • 2

tomaarsen

updated 5 models 3 months ago

Team members 5
