---
language: de
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- german
- text-embedding
model-index:
- name: smollm3-3b-embed-de
results: []
---
# SmolLM3-3B German Embeddings
Experimental German text embedding model based on [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), trained using the LLM2Vec approach to transform a decoder-only LLM into a powerful text encoder.
## Model Description
This model represents German text as dense vectors suitable for semantic search, clustering, and similarity tasks. It was created by adapting SmolLM3-3B through a two-stage training process that enables bidirectional attention and teaches the model to generate meaningful text representations.
### Key Features
- **Architecture**: SmolLM3-3B with bidirectional attention
- **Embedding Dimension**: 2048
- **Max Sequence Length**: 512 tokens
- **Language**: German (primary), may have some cross-lingual capabilities
- **Training Method**: LLM2Vec (MNTP + Supervised Contrastive Learning)
## Training Process
### Stage 1: Bidirectional Conversion & MNTP (Masked Next Token Prediction)
1. **Model Transformation**: Modified SmolLM3-3B architecture to enable bidirectional attention by:
- Removing causal attention masks
   - Computing attention over all positions instead of only preceding tokens
- Preserving the original model weights
2. **MNTP Training**:
- **Dataset**: 50,000 samples from German Wikipedia
   - **Task**: Predicting masked tokens using bidirectional context (the objective is sketched after this list)
- **Training Steps**: 1,000
- **Batch Size**: 512 (64 per device × 8 gradient accumulation)
- **LoRA Configuration**: rank=16, alpha=32
- **Learning Rate**: 1e-4 with warmup
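At its core, MNTP masks a fraction of tokens and scores each masked token with the logits produced at the *preceding* position, since a decoder's LM head is a next-token predictor. The sketch below illustrates this idea only: `mntp_loss`, `mask_token_id`, and `mask_prob` are illustrative names, `model` is assumed to be SmolLM3-3B with the causal mask already disabled, and the actual training used the LLM2Vec scripts with the LoRA setup listed above.
```python
import torch
import torch.nn.functional as F

def mntp_loss(model, input_ids, mask_token_id, mask_prob=0.15):
    """Simplified masked-next-token-prediction loss (illustrative only)."""
    labels = input_ids.clone()
    # BERT-style masking: replace a random subset of tokens with a mask token.
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    masked_ids = input_ids.masked_fill(mask, mask_token_id)
    labels[~mask] = -100  # only masked positions contribute to the loss
    # With bidirectional attention, the hidden state at position i-1 sees the whole
    # sequence, and the next-token head at i-1 predicts token i, so labels shift by one.
    logits = model(input_ids=masked_ids).logits[:, :-1]
    targets = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```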
### Stage 2: Supervised Contrastive Learning
3. **Supervised Fine-tuning**:
- **Dataset**: German text pairs with positive/negative examples
- **Training Format**: Contrastive learning using (query, positive, negative) triplets
   - **Training Steps**: 500
- **Batch Size**: 32 (16 per device × 2 gradient accumulation)
- **Learning Rate**: 2e-4 with warmup
   - **Loss**: Contrastive loss to maximize similarity between semantically related texts (sketched after this list)
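The following is a minimal sketch of such a triplet-based contrastive objective (an InfoNCE-style loss with in-batch negatives); `contrastive_loss` and the `temperature` value are illustrative and not necessarily the exact settings used for this model.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE-style loss over (query, positive, negative) triplets (illustrative)."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)
    pos_sim = (q * p).sum(dim=-1, keepdim=True)  # (batch, 1) query-positive similarity
    neg_sim = q @ n.T                            # (batch, batch) negatives from the batch
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The correct "class" for every query is its own positive in column 0.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```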
### Training Infrastructure
- **Hardware**: NVIDIA RTX A6000 (48GB VRAM)
- **Precision**: bfloat16
- **Framework**: Transformers + PEFT + LLM2Vec
## Usage
### Using with LLM2Vec Library
```python
from llm2vec import LLM2Vec
import torch
# Load model
model = LLM2Vec.from_pretrained(
    "mayflowergmbh/smollm3-3b-embed-de",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Encode German texts
texts = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Die deutsche Hauptstadt ist Berlin.",
    "München ist eine Stadt in Bayern.",
]
embeddings = model.encode(texts)
# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)
```
### Using with Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
# Note: Requires adapter for sentence-transformers compatibility
model = SentenceTransformer('path/to/smollm3-3b-embed-de')
embeddings = model.encode(["Berlin ist die Hauptstadt von Deutschland."])
```
## Intended Uses
### Primary Use Cases
- **Semantic Search**: Find relevant documents in German text corpora
- **Text Classification**: Use embeddings as features for downstream classifiers
- **Clustering**: Group similar German texts together
- **Duplicate Detection**: Identify semantically similar content
- **Question Answering**: Match questions with relevant answers
### Example: Semantic Search
```python
from sklearn.metrics.pairwise import cosine_similarity

# Create document embeddings
documents = [
    "Die Katze sitzt auf dem Sofa.",
    "Der Hund spielt im Garten.",
    "Python ist eine Programmiersprache.",
    "Machine Learning revolutioniert die Technologie.",
]
doc_embeddings = model.encode(documents)
# Search with a query
query = "Haustiere und ihre Aktivitäten"
query_embedding = model.encode([query])
# Find the most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-3:][::-1]
for idx in top_indices:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")
```
## Performance Characteristics
### Strengths
- Strong German language understanding inherited from the SmolLM3-3B base model
- Trained specifically for semantic similarity via contrastive fine-tuning
- Efficient inference despite the comparatively large model size
- Benefits from SmolLM3's strong foundation
### Limitations
- Larger than typical embedding models (3B parameters)
- Requires GPU for optimal performance
- Limited to 512 token sequences
- Primarily optimized for German (cross-lingual performance not evaluated)
## Model Architecture Details
```
Base Model: SmolLM3-3B
- Hidden Size: 2048
- Intermediate Size: 11008
- Number of Layers: 36
- Number of Attention Heads: 16
- Vocabulary Size: 128256
- Position Embeddings: 65536 (RoPE)
```
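These values can be cross-checked against the published configuration; a small sketch, assuming the repository ships a standard `config.json` with the usual `transformers` field names:
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mayflowergmbh/smollm3-3b-embed-de")
print(config.hidden_size)          # embedding dimension (2048)
print(config.num_hidden_layers)    # number of layers (36)
print(config.num_attention_heads)  # attention heads (16)
print(config.vocab_size)           # vocabulary size (128256)
```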
## Training Hyperparameters
**MNTP Stage** (see the configuration sketch after this list):
- Learning Rate: 1e-4
- Batch Size: 512
- Max Sequence Length: 512
- Gradient Accumulation: 8
- LoRA r: 16
- LoRA alpha: 32
- Warmup Steps: 100
- Total Steps: 1000
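As a rough sketch, the MNTP values above correspond to Hugging Face `TrainingArguments` along the following lines; `output_dir` and anything not listed above are assumptions, not the exact configuration used:
```python
from transformers import TrainingArguments

mntp_args = TrainingArguments(
    output_dir="smollm3-3b-mntp-de",   # illustrative output path
    learning_rate=1e-4,
    per_device_train_batch_size=64,    # effective batch size 512 with accumulation
    gradient_accumulation_steps=8,
    warmup_steps=100,
    max_steps=1000,
    bf16=True,                         # training ran in bfloat16
)
```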
**Supervised Stage:**
- Learning Rate: 2e-4
- Batch Size: 32
- Max Sequence Length: 256
- Training Epochs: 3
- Warmup Steps: 100
- Weight Decay: 0.01
## Ethical Considerations
- **Bias**: Model may reflect biases present in German Wikipedia and training data
- **Use Cases**: Should not be used for making decisions about individuals
- **Privacy**: Do not use with personally identifiable information
## Citation
If you use this model, please cite:
```bibtex
@misc{smollm3-embed-de,
  title={SmolLM3-3B German Embeddings},
  author={Johann-Peter Hartmann},
  year={2025},
  publisher={Mayflower GmbH},
  url={https://huggingface.co/mayflowergmbh/smollm3-3b-embed-de}
}

@article{llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and others},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```
## Acknowledgments
- Base model: [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- Training methodology: [McGill-NLP/LLM2Vec](https://github.com/McGill-NLP/llm2vec)
- Training data: German Wikipedia
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/johannhartmann/german-llm-embed).