
Pulkit Mehta

1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero:
- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co, @opensearch-project, @NAVER LABS Europe, @qdrant, @IBM, etc.
- Decode interpretable embeddings to understand token importance (see the sketch after this list)
- Hybrid search integration to get the best of both worlds
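For context, here is a minimal sketch of encoding and decoding with the new SparseEncoder class, assuming the v5 API; the model name, input text, and top_k value are illustrative:

```python
from sentence_transformers import SparseEncoder

# Load a SPLADE-style sparse embedding model (model name is illustrative)
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Embeddings are vocabulary-sized (30,000+ dims) with <1% non-zero entries
embeddings = model.encode(["The weather is lovely today."])

# Decode the non-zero dimensions into (token, weight) pairs to see
# which tokens the model considers important
decoded = model.decode(embeddings, top_k=10)
print(decoded[0])  # e.g. [("weather", 2.6), ("lovely", 2.1), ...] (illustrative output)
```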
2️⃣ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods that automatically use predefined prompts (see the sketch after this list)
- No more manual pool management: just pass a device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach
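A short sketch of the new encode methods, again assuming the v5 API; the model name, texts, and device list are placeholders:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# encode_query / encode_document apply the model's predefined prompts automatically
query_embeddings = model.encode_query(["How do sparse encoders work?"])
document_embeddings = model.encode_document(["SPLADE models expand text into sparse vocabulary weights."])

# Multi-GPU / multi-process encoding without manual pool management:
# pass the device list straight to the encode call
corpus = ["document one", "document two"]  # placeholder corpus
embeddings = model.encode_document(corpus, device=["cuda:0", "cuda:1"])
```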
3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents (sketched below)
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures
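As an illustration, a minimal sketch of an inference-free SPLADE two-tower model built with the Router module; the module names and arguments follow the v5 release notes as I recall them, so treat the exact signatures as assumptions:

```python
from sentence_transformers import SparseEncoder, models
from sentence_transformers.sparse_encoder.models import (
    MLMTransformer,
    SparseStaticEmbedding,
    SpladePooling,
)

# Document tower: a masked-language-model backbone with SPLADE pooling
doc_encoder = MLMTransformer("google-bert/bert-base-uncased")

# The Router sends queries and documents down different processing paths:
# queries use a cheap static embedding (inference-free), documents use the full model
router = models.Router.for_query_document(
    query_modules=[SparseStaticEmbedding(tokenizer=doc_encoder.tokenizer)],
    document_modules=[doc_encoder, SpladePooling("max")],
)

model = SparseEncoder(modules=[router], similarity_fn_name="dot")
```

For the custom learning rates, the v5 training arguments accept a mapping from parameter-name patterns to learning rates (learning_rate_mapping, if I recall the name correctly), which pairs naturally with Router-based two-tower models.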
4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models
Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder
See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0
What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!

Great work. The best part is the interpretability and speed.
@tomaarsen
I am planning to fine-tune a model for text-to-code retrieval with the setup below. Please advise whether this looks fine for a start, or whether there is anything I can tune to do better. The idea is to do decently on text-to-code and to evaluate on CoIR (https://github.com/CoIR-team/coir).
Training dataset: claudios/code_search_net, filtered to Python code, where the query is the docstring and the passage is the code. Loss: SparseMultipleNegativesRankingLoss. I am not able to think of a decent dev evaluation; should I use SparseTripletEvaluator? Also, is just query and positive passage fine? I believe the negatives will be all the other passages in the batch, or do we have to explicitly prepare data (mine negatives)? Please guide.
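If it helps, here is a hedged sketch of that setup; the column names, base model, and batching details are assumptions on my side, not a verified recipe:

```python
from datasets import load_dataset
from sentence_transformers import SparseEncoder, SparseEncoderTrainer
from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss

# (docstring -> code) pairs from CodeSearchNet, Python subset;
# column names are from memory and may differ
ds = load_dataset("claudios/code_search_net", "python", split="train")
train_ds = ds.select_columns(["func_documentation_string", "func_code_string"]).rename_columns(
    {"func_documentation_string": "query", "func_code_string": "passage"}
)

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")  # placeholder base model

# With only (query, positive) pairs, the other passages in the batch act as
# in-batch negatives, so explicit negative mining is optional to start
loss = SparseMultipleNegativesRankingLoss(model)

trainer = SparseEncoderTrainer(model=model, train_dataset=train_ds, loss=loss)
trainer.train()
```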

Training and Finetuning Sparse Embedding Models with Sentence Transformers v5
Great product. Can we drag an entire row instead of one column at a time?

Tiny Agents in Python: a MCP-powered agent in ~70 lines of code
Tiny Agents: a MCP-powered agent in 50 lines of code
