
Pulkit Mehta

1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero:
- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co, @opensearch-project, @NAVER LABS Europe, @qdrant, @IBM, etc.
- Decode interpretable embeddings to understand token importance (see the sketch after this list)
- Hybrid search integration to get the best of both worlds
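For context, here is a minimal sketch of encoding and decoding with the new SparseEncoder class, assuming the v5 API; the model name, input text, and top_k value are illustrative:

```python
from sentence_transformers import SparseEncoder

# Load a SPLADE-style sparse embedding model (model name is illustrative)
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Embeddings are vocabulary-sized (30,000+ dims) with <1% non-zero entries
embeddings = model.encode(["The weather is lovely today."])

# Decode the non-zero dimensions into (token, weight) pairs to see
# which tokens the model considers important
decoded = model.decode(embeddings, top_k=10)
print(decoded[0])  # e.g. [("weather", 2.6), ("lovely", 2.1), ...] (illustrative output)
```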
2️⃣ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods that automatically use predefined prompts (see the sketch after this list)
- No more manual pool management: just pass a device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach
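A short sketch of the new encode methods, again assuming the v5 API; the model name, texts, and device list are placeholders:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# encode_query / encode_document apply the model's predefined prompts automatically
query_embeddings = model.encode_query(["How do sparse encoders work?"])
document_embeddings = model.encode_document(["SPLADE models expand text into sparse vocabulary weights."])

# Multi-GPU / multi-process encoding without manual pool management:
# pass the device list straight to the encode call
corpus = ["document one", "document two"]  # placeholder corpus
embeddings = model.encode_document(corpus, device=["cuda:0", "cuda:1"])
```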
3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents (sketched below)
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures
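As an illustration, a minimal sketch of an inference-free SPLADE two-tower model built with the Router module; the module names and arguments follow the v5 release notes as I recall them, so treat the exact signatures as assumptions:

```python
from sentence_transformers import SparseEncoder, models
from sentence_transformers.sparse_encoder.models import (
    MLMTransformer,
    SparseStaticEmbedding,
    SpladePooling,
)

# Document tower: a masked-language-model backbone with SPLADE pooling
doc_encoder = MLMTransformer("google-bert/bert-base-uncased")

# The Router sends queries and documents down different processing paths:
# queries use a cheap static embedding (inference-free), documents use the full model
router = models.Router.for_query_document(
    query_modules=[SparseStaticEmbedding(tokenizer=doc_encoder.tokenizer)],
    document_modules=[doc_encoder, SpladePooling("max")],
)

model = SparseEncoder(modules=[router], similarity_fn_name="dot")
```

For the custom learning rates, the v5 training arguments accept a mapping from parameter-name patterns to learning rates (learning_rate_mapping, if I recall the name correctly), which pairs naturally with Router-based two-tower models.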
4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models
Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder
See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0
What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!

Great work. The best part is the interpretability and speed.
@tomaarsen
I am planning to fine-tune a model for text-to-code retrieval with the setup below. Please advise whether this looks fine for a start, or whether there is anything I can tune to do better. The idea is to do decently on text-to-code and to evaluate on CoIR (https://github.com/CoIR-team/coir).
Training dataset: claudios/code_search_net, filtered to Python code, where the query is the docstring and the passage is the code. Loss: SparseMultipleNegativesRankingLoss. I am not able to think of a decent dev evaluation; should I use SparseTripletEvaluator? Also, is just query and positive passage fine? I believe the negatives will be all the other passages in the batch, or do we have to explicitly prepare data (mine negatives)? Please guide.
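If it helps, here is a hedged sketch of that setup; the column names, base model, and batching details are assumptions on my side, not a verified recipe:

```python
from datasets import load_dataset
from sentence_transformers import SparseEncoder, SparseEncoderTrainer
from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss

# (docstring -> code) pairs from CodeSearchNet, Python subset;
# column names are from memory and may differ
ds = load_dataset("claudios/code_search_net", "python", split="train")
train_ds = ds.select_columns(["func_documentation_string", "func_code_string"]).rename_columns(
    {"func_documentation_string": "query", "func_code_string": "passage"}
)

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")  # placeholder base model

# With only (query, positive) pairs, the other passages in the batch act as
# in-batch negatives, so explicit negative mining is optional to start
loss = SparseMultipleNegativesRankingLoss(model)

trainer = SparseEncoderTrainer(model=model, train_dataset=train_ds, loss=loss)
trainer.train()
```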

Training and Finetuning Sparse Embedding Models with Sentence Transformers v5
Great product. Can we drag an entire row instead of one column at a time?

Tiny Agents in Python: a MCP-powered agent in ~70 lines of code
Tiny Agents: a MCP-powered agent in 50 lines of code
